How to Estimate AI API Costs: A Practical Guide for ChatGPT, Claude and Gemini
Most AI projects ship over budget. Not because the models are overpriced, but because nobody modeled the bill before turning the API on. Here is how to do it properly - with the formulas, the gotchas, and the numbers from real workloads.
The cost formula every AI builder should memorize
Every modern LLM API charges per token, with separate prices for input (your prompt) and output (its response). The cost of a single API call is:
cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price
For an entire workload, multiply by the number of calls per day, then by 30 for monthly cost. That is it. Most spreadsheets used in production are just this formula in disguise.
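The formula translates directly into a few lines of Python. A minimal sketch - the function and argument names here are ours for illustration, not any provider's SDK:

```python
def monthly_cost(input_tokens, output_tokens, calls_per_day,
                 input_price_per_m, output_price_per_m):
    """Estimate monthly API spend from average per-call token counts."""
    per_call = ((input_tokens / 1_000_000) * input_price_per_m
                + (output_tokens / 1_000_000) * output_price_per_m)
    return per_call * calls_per_day * 30

# 800 input + 200 output tokens per call at GPT-4o rates from the table below:
print(monthly_cost(800, 200, 50_000, 2.50, 10.00))  # 6000.0
```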
The token-to-character ratio
A token is a sub-word unit, not a character or word. As a working approximation:
- 1 token ≈ 4 characters in English text
- 1 token ≈ 0.75 of a word
- 100 tokens ≈ 75 words ≈ 5 average sentences
Code, JSON, and non-English text consume more tokens per character. Chinese, Japanese, and Korean often run 2-3× more tokens than English for the same conceptual content. If your app is multilingual, model the worst case.
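If you want exact counts rather than the four-characters rule of thumb, OpenAI's open-source tiktoken library tokenizes text the same way GPT-4o does. Anthropic and Google use different tokenizers, so treat the result as an approximation for their models:

```python
import tiktoken  # pip install tiktoken (a recent release that knows gpt-4o)

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding

text = "Model the bill before you turn the API on."
tokens = enc.encode(text)
print(len(tokens))               # token count
print(len(text) / len(tokens))   # characters per token, ~4 for English prose
```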
What everything actually costs (May 2026)
| Model | Input / 1M | Output / 1M |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $4.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
Note: output tokens cost 4-5× more than input across every provider in the table. This is not a price-gouging quirk - generating tokens requires a separate forward pass through the model for each one, while input tokens can be processed in parallel.
Worked example 1: a basic chatbot
You're building a customer-support chatbot on top of GPT-4o. Average prompt is the system message (300 tokens) plus the last 5 messages of context (500 tokens). The model responds with around 200 tokens.
- Input per call: 800 tokens. Output per call: 200 tokens.
- Cost per call: (800 ÷ 1M) × $2.50 + (200 ÷ 1M) × $10 = $0.002 + $0.002 = $0.004
- 10,000 conversations per day, 5 turns each = 50,000 calls / day
- Daily cost: $200. Monthly: $6,000.
Switch to GPT-4o mini and the same workload costs $360 / month. Switch to Claude Haiku 4.5 and it's $2,160 / month. The model choice alone can be a 16× multiplier on your cloud bill.
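The comparison is easy to automate. A sketch that sweeps the same chatbot workload across the pricing table above:

```python
# (input, output) prices in $ per 1M tokens, from the table above
PRICES = {
    "GPT-4o":           (2.50, 10.00),
    "GPT-4o mini":      (0.15, 0.60),
    "Claude Haiku 4.5": (0.80, 4.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}

CALLS_PER_DAY = 50_000  # 10,000 conversations x 5 turns

for model, (p_in, p_out) in PRICES.items():
    per_call = (800 * p_in + 200 * p_out) / 1_000_000
    print(f"{model:<17} ${per_call * CALLS_PER_DAY * 30:>8,.0f} / month")
```

Run it and the spread - $180 to $6,000 / month for identical traffic - is the whole argument for benchmarking the cheap models first.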
Worked example 2: RAG pipeline
RAG (Retrieval Augmented Generation) appends retrieved documents to every prompt. This explodes the input token count. A typical RAG call retrieves 5-10 chunks of 500 tokens each, on top of the user's query and system prompt.
- System: 200 tokens. Query: 50 tokens. Retrieved chunks: 8 × 500 = 4,000 tokens. Input total: 4,250 tokens.
- Output: 400 tokens (a thoughtful answer with citations).
- On Claude Sonnet 4: (4,250 / 1M) × $3 + (400 / 1M) × $15 = $0.0128 + $0.006 = $0.019 per call.
- 5,000 RAG queries / day = $95 / day = $2,850 / month.
The fix is prompt caching. Anthropic discounts cached input tokens by 90%, OpenAI by 50%. If your system prompt and document context are stable across calls, cache them - this can drop RAG costs by 60-80%.
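The caching arithmetic is worth making concrete. A sketch at Claude Sonnet 4 rates, assuming the system prompt and retrieved chunks are cache hits - which only holds when the same documents are reused across calls - and ignoring Anthropic's one-time cache-write premium:

```python
IN_PRICE, OUT_PRICE = 3.00, 15.00   # Claude Sonnet 4, $ per 1M tokens
CACHE_READ_FACTOR = 0.10            # cached input billed at ~10% of base rate

system_toks, query_toks, chunk_toks, out_toks = 200, 50, 4_000, 400

def rag_call_cost(cached_toks=0):
    fresh_toks = system_toks + query_toks + chunk_toks - cached_toks
    return (fresh_toks * IN_PRICE
            + cached_toks * IN_PRICE * CACHE_READ_FACTOR
            + out_toks * OUT_PRICE) / 1_000_000

print(rag_call_cost())                                      # ~$0.0188, no cache
print(rag_call_cost(cached_toks=system_toks + chunk_toks))  # ~$0.0074
```

That is a 60% reduction per call, the bottom of the 60-80% range quoted above.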
Worked example 3: agent workflows
Agents are by far the most expensive AI workload. They make multiple model calls per user request - thinking, calling tools, reading tool output, thinking again. A typical agent run is 8-15 LLM calls, each with growing context.
- Average per run: 12 calls, averaging 3,000 input tokens and 800 output tokens per call.
- On GPT-4o: 12 × ((3000 / 1M) × $2.50 + (800 / 1M) × $10) = 12 × $0.0155 = $0.186 per agent run.
- 1,000 agent runs / day = $186 / day = $5,580 / month.
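Because the transcript is re-sent on every call, agent input tokens grow roughly linearly over a run. A sketch with illustrative growth numbers, chosen so the per-call average lands on the 3,000-token figure above:

```python
IN_PRICE, OUT_PRICE = 2.50, 10.00   # GPT-4o, $ per 1M tokens

def agent_run_cost(calls=12, base_context=1_350,
                   growth_per_call=300, output_per_call=800):
    cost = 0.0
    for i in range(calls):
        input_toks = base_context + i * growth_per_call  # transcript grows each step
        cost += (input_toks * IN_PRICE + output_per_call * OUT_PRICE) / 1_000_000
    return cost

print(agent_run_cost())  # $0.186 per run, matching the average above
```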
Agents look expensive for a reason - in this example a single run costs nearly 50× a single chat call - but they actually do more work. Budget accordingly, or build them on Haiku/Mini/Flash variants.
Six ways to cut your bill in half
- Use a cheaper model for routine work. Most workloads do not need the flagship. Test the mini/Haiku/Flash variants - per the table above, they are 15-20× cheaper than their flagship siblings.
- Cache system prompts. Anthropic discounts cached input tokens by 90%, OpenAI by 50%, via prompt caching.
- Cap output tokens. Set `max_tokens` on every call (see the sketch after this list). Without it, models can ramble for thousands of (expensive) output tokens.
- Trim your prompts. Remove example-of-the-day blocks, redundant instructions, and low-signal context. A 30% prompt reduction is achievable on most production prompts.
- Use streaming for UX, not throughput. Streaming costs the same but feels faster. Use it.
- Batch when latency does not matter. OpenAI and Anthropic offer 50% discounts on batch inference (24-hour SLA). Great for nightly summaries, classification jobs, or bulk processing.
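On the `max_tokens` point: here is what the cap looks like with the OpenAI Python SDK. The model and prompt are placeholders; the point is the explicit cap. Anthropic's equivalent parameter (also called `max_tokens`) is required on every call, and Gemini's is `max_output_tokens`:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    max_tokens=200,  # hard cap on (expensive) output tokens
)
print(resp.choices[0].message.content)
```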
The simplest way to estimate
Plug a representative prompt into our AI Token Counter and Cost Estimator, pick the model you intend to ship on, and multiply by your expected daily volume. That is the number to put in your budget review. The actual bill will land within 10-15% if you account for retries, system overhead, and tokenizer differences.
If you are running anything north of $1,000 / month in API spend, a careful cost model usually pays back the engineering time to build it within a week.
Related
- AI Token Counter & Cost Estimator - the tool referenced throughout
- AI & LLM VRAM Calculator - for self-hosted alternatives
- All Tech & AI tools