Compare pricing across GPT-4o, Claude Sonnet, Gemini, Llama, Mistral, DeepSeek, and more. Choose a use-case preset or enter custom token counts to see costs side by side.
Concrete inputs and the resulting cost from this calculator at GPT-4o standard rates. Numbers verified Apr 2026; rerun the tool against your provider's current rate card for production budgeting.
| Scenario | Inputs | Calculator output (monthly) |
|---|---|---|
| Minimal — small chatbot | 500 input tokens, 200 output tokens per request, 1,000 requests/day, GPT-4o | ~$112/month at standard rates; ~$56/month with prompt caching enabled on the system prompt |
| Typical — production RAG assistant | 3,000 input tokens (retrieved context), 400 output tokens, 5,000 requests/day, GPT-4o | ~$1,575/month on GPT-4o; ~$155/month if routed to GPT-4o-mini for queries that do not need full reasoning |
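To make the table reproducible, here is a minimal sketch of the arithmetic the calculator performs. The per-million-token rates are illustrative placeholders, not the rates used above, so swap in the current rate card before comparing against the output column:

```python
# Minimal sketch of the arithmetic behind the table. The rates below are
# illustrative placeholders, not the rates the calculator used.

IN_RATE = 2.50    # assumed USD per 1M input tokens
OUT_RATE = 10.00  # assumed USD per 1M output tokens

def monthly_cost(in_tok: int, out_tok: int, req_per_day: int, days: int = 30) -> float:
    """Monthly USD spend for a fixed per-request token profile."""
    req = req_per_day * days
    return (req * in_tok * IN_RATE + req * out_tok * OUT_RATE) / 1_000_000

# "Minimal" scenario: 500 in / 200 out per request, 1,000 requests/day
print(f"${monthly_cost(500, 200, 1_000):,.2f}/month")
```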
Large language model providers charge per token — a token is roughly ¾ of a word. Most APIs charge separately for input tokens (your prompt) and output tokens (the model's response), with output tokens typically costing 2–5× more than input tokens.
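A quick sketch of the word-to-token rule of thumb; real counts vary by tokenizer and language, so prefer a tokenizer-based count for serious budgeting:

```python
# Rule-of-thumb conversion from the paragraph above; approximate only.

def words_to_tokens(words: int) -> int:
    return round(words / 0.75)  # 1 token ≈ 3/4 of an English word

print(words_to_tokens(600))  # a ~600-word page ≈ 800 tokens
```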
The most effective strategy is model routing — use a fast, cheap model for simple tasks and route complex queries to a premium model. Combined with prompt caching and response streaming, most teams can reduce LLM costs by 60–80% without sacrificing quality.
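A routing layer can be as simple as a heuristic in front of two API calls. The model names and the length-based heuristic below are illustrative assumptions, not a prescribed design; production routers often use a small classifier instead:

```python
# Illustrative routing sketch: a cheap heuristic decides which model each
# query goes to. Model names and thresholds are assumptions.

CHEAP_MODEL = "gpt-4o-mini"   # assumed budget tier
PREMIUM_MODEL = "gpt-4o"      # assumed premium tier

def route(query: str) -> str:
    """Send long or explicitly analytical queries to the premium model."""
    hard_markers = ("analyze", "compare", "step by step", "prove")
    is_hard = len(query.split()) > 150 or any(m in query.lower() for m in hard_markers)
    return PREMIUM_MODEL if is_hard else CHEAP_MODEL

print(route("What are your opening hours?"))                     # -> gpt-4o-mini
print(route("Compare these two contracts clause by clause ..."))  # -> gpt-4o
```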
Cost is one dimension — latency, accuracy, and context window size matter too. Our AI engineering team evaluates models against your specific use case and builds cost-optimized pipelines. Get a free architecture review.
Output tokens require autoregressive generation — the model predicts one token at a time, running a full forward pass for each. Input tokens are processed in parallel during a single pass. This computational asymmetry is why providers charge 2–5× more for output. Use our AI token counter to estimate your input/output ratio before calculating costs.
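If you use OpenAI models, the open-source tiktoken package gives exact counts; other providers use different tokenizers, so treat this as an OpenAI-specific sketch and the counts as approximate when comparing across vendors:

```python
# OpenAI-specific sketch using the open-source tiktoken package.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding

prompt = "Summarize the attached contract in three bullet points."
reply = "- Term: 24 months\n- Fee: $4,000/month\n- Auto-renews unless cancelled"

in_tok = len(enc.encode(prompt))
out_tok = len(enc.encode(reply))
print(f"input: {in_tok}, output: {out_tok}, ratio: {in_tok / out_tok:.2f}")
```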
Pricing is sourced from each provider's published rate card and updated regularly. Actual costs may vary with volume discounts, prompt caching, batch processing, or committed-use agreements. The estimates give a reliable baseline for planning.
Model routing — sending simple queries to a cheap model and complex ones to a premium model — typically reduces costs by 60–80% without sacrificing quality. Our AI development team builds intelligent routing pipelines tailored to your use case. Check the RAG cost estimator if your workload includes retrieval-augmented generation.
Common gaps: hidden system prompts and tool-call JSON that balloon input tokens; chain-of-thought and multi-step agents that fan out multiple calls per user turn; retries on errors that double-charge; and verbose output formats that inflate output tokens. Measure actual tokens per end-to-end task, not per API call.
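One way to close those gaps is to accumulate the provider-reported usage of every call in a task, retries and tool calls included. The `prompt_tokens`/`completion_tokens` field names below follow the common chat-completions response shape and are an assumption for your provider:

```python
# Sketch: meter usage across every call in one end-to-end task,
# rather than trusting a per-call average.
from dataclasses import dataclass

@dataclass
class TaskMeter:
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    def record(self, usage: dict) -> None:
        self.input_tokens += usage.get("prompt_tokens", 0)
        self.output_tokens += usage.get("completion_tokens", 0)
        self.calls += 1

meter = TaskMeter()
meter.record({"prompt_tokens": 1_850, "completion_tokens": 240})  # first attempt
meter.record({"prompt_tokens": 1_850, "completion_tokens": 310})  # retry after error
print(meter)  # tokens per *task*, not per call
```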
Anthropic's prompt caching discounts cache reads by 90% and adds a small write premium; OpenAI discounts cached reads by 50% with no write fee. Enter your steady-state input tokens assuming cache hits for a fair comparison. Caching only pays off when the same prefix is reused 3+ times.
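The break-even arithmetic, in relative units where 1.0 is the normal price of sending the prefix once; the 25% write premium is an assumed figure, so verify both discounts against current provider docs:

```python
# Break-even sketch for prompt caching. The 25% write premium is assumed.

def cached_cost(reuses: int, read_discount: float, write_premium: float) -> float:
    """One cache write plus N discounted reads of the same prefix."""
    return (1 + write_premium) + reuses * (1 - read_discount)

def uncached_cost(reuses: int) -> float:
    return 1 + reuses  # every call pays the full input price

# Anthropic-style: 90% read discount, assumed 25% write premium
print(cached_cost(3, 0.90, 0.25), "vs", uncached_cost(3))  # 1.55 vs 4
# OpenAI-style: 50% read discount, no write fee
print(cached_cost(3, 0.50, 0.00), "vs", uncached_cost(3))  # 2.5 vs 4
```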
For RAG (input-heavy), Gemini Flash, GPT-4o Mini, and Claude Haiku are typically cheapest. For agentic workflows with long tool loops, factor in total output tokens: a model with cheaper output can win overall even if its input rate is slightly higher. DeepSeek is strong on both.
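A toy comparison showing how the output rate can dominate in agentic loops; both rate cards are made up for illustration only:

```python
# Model A has the cheaper input rate, model B the cheaper output rate.

def task_cost(in_tok: int, out_tok: int, in_rate: float, out_rate: float) -> float:
    """USD for one task; rates are USD per 1M tokens."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

# Agentic task: 40k input and 25k output tokens across the tool loop
print(task_cost(40_000, 25_000, in_rate=0.15, out_rate=0.60))  # model A: $0.0210
print(task_cost(40_000, 25_000, in_rate=0.25, out_rate=0.30))  # model B: $0.0175, wins
```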
Fine-tuning pays off at roughly 1M+ inferences/month on a narrow task where a smaller fine-tuned model can replace a larger base model. Below that, well-written prompts plus retrieval are cheaper. Factor training cost (typically hundreds to low thousands of dollars) into the break-even math.
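A hedged break-even sketch; every number here is an assumption to replace with your own quotes and measured token counts:

```python
# Break-even sketch for fine-tuning; all figures are assumptions.

TRAINING_COST = 1_000.00      # assumed one-off training cost, USD
BASE_PER_M = 1.00             # assumed blended $/M tokens, larger base model
TUNED_PER_M = 0.33            # assumed blended $/M tokens, fine-tuned small model
TOKENS_PER_INFERENCE = 1_500  # assumed tokens per request (input + output)

saving = TOKENS_PER_INFERENCE / 1_000_000 * (BASE_PER_M - TUNED_PER_M)
print(f"break-even ≈ {TRAINING_COST / saving:,.0f} inferences")  # ≈ 1M here
```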
Production token usage per request is spiky: individual requests often run 2–3× above the average. Budget for the p90 token count, not the mean, and add a 20–30% buffer for retries, fallback models, and prompt iteration during the first quarter of deployment.
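A short sketch of tail-based budgeting, assuming you have a log of per-request token counts; the sample values below are illustrative, not real telemetry:

```python
# Budget from the tail, not the mean. `token_log` stands in for your
# own usage telemetry.
import numpy as np

token_log = np.array([620, 700, 810, 950, 1_100, 1_400, 2_050, 2_300])
p90 = np.percentile(token_log, 90)
budget = p90 * 1.25  # 25% buffer, mid-range of the 20-30% suggested above
print(f"mean={token_log.mean():.0f}, p90={p90:.0f}, budget/request={budget:.0f}")
```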