LLM API Pricing Comparison 2026: GPT-4o vs Claude vs Gemini vs Open-Source
Author: ZTABS Team
Choosing the right LLM for your application is no longer just about capability—it's about cost efficiency at scale. A model that costs $0.01 per request doesn't sound expensive until you're handling 100,000 requests per day. At that point, the difference between GPT-4o and GPT-4o-mini is $25,000 per month.
This guide gives you the complete pricing picture for every major LLM API in 2026, along with practical strategies to cut costs by 50–90% without sacrificing quality.
Quick Pricing Overview
All prices are per 1 million tokens. Prices current as of February 2026.
Frontier Models (Highest Capability)
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
|-------|----------|----------------------|------------------------|----------------|----------|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | General-purpose, complex reasoning |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200K | Long-context, code, safety-critical |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 2M | Massive context, multimodal |
| GPT-4o (batch) | OpenAI | $1.25 | $5.00 | 128K | Non-real-time bulk processing |
| Claude 3.5 Sonnet (batch) | Anthropic | $1.50 | $7.50 | 200K | Non-real-time bulk processing |
Mid-Tier Models (Best Value)
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
|-------|----------|----------------------|------------------------|----------------|----------|
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128K | High-volume, cost-sensitive apps |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200K | Fast, capable, long-context |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | 1M | Cheapest commercial option |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M | Latest Flash with better reasoning |
Open-Source Models (Self-Hosted or API)
| Model | Parameters | API Provider | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|-------|-----------|--------------|----------------------|------------------------|----------------|
| Llama 3.1 405B | 405B | Together AI | $3.50 | $3.50 | 128K |
| Llama 3.1 70B | 70B | Together AI | $0.88 | $0.88 | 128K |
| Llama 3.1 8B | 8B | Together AI | $0.18 | $0.18 | 128K |
| Mixtral 8x22B | 176B (MoE) | Together AI | $1.20 | $1.20 | 64K |
| Mistral Large 2 | 123B | Mistral API | $2.00 | $6.00 | 128K |
| DeepSeek V3 | 671B (MoE) | DeepSeek API | $0.27 | $1.10 | 128K |
| Qwen 2.5 72B | 72B | Together AI | $0.90 | $0.90 | 128K |
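To make per-1M-token prices concrete, it helps to compute the cost of a single request. A minimal sketch using a few rates from the tables above (the model keys and token counts are illustrative):

```python
# Prices per 1M tokens (input, output), taken from the tables above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-1.5-flash": (0.075, 0.30),
    "deepseek-v3": (0.27, 1.10),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1M-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

A 1,000-token-in / 500-token-out request costs $0.0075 on GPT-4o versus $0.00045 on GPT-4o-mini — roughly a 16x gap that compounds quickly at scale.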
Detailed Provider Breakdown
OpenAI
OpenAI remains the default choice for most production applications, with the broadest model range and best developer tooling.
Pricing tiers:
| Model | Input | Output | Cached Input | Batch Input | Batch Output |
|-------|-------|--------|--------------|-------------|--------------|
| GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 | $5.00 |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | $0.075 | $0.30 |
| o1 (reasoning) | $15.00 | $60.00 | $7.50 | $7.50 | $30.00 |
| o1-mini | $3.00 | $12.00 | $1.50 | $1.50 | $6.00 |
| o3-mini | $1.10 | $4.40 | $0.55 | $0.55 | $2.20 |
Free tier: $5 credit for new accounts, expires after 3 months.
Rate limits (Tier 1): 500 RPM, 30,000 TPM for GPT-4o. Increases with usage to Tier 5 (10,000 RPM, 30M TPM).
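Rate limits surface as HTTP 429 errors, so production code should retry with exponential backoff. A generic sketch (the retryable exception type depends on your SDK — with the OpenAI Python client it would be `openai.RateLimitError`):

```python
import random
import time

def call_with_backoff(fn, retryable=(Exception,), max_retries=5, base_delay=0.5):
    """Call fn(), retrying with exponential backoff plus jitter on retryable errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Note that the official OpenAI and Anthropic clients also ship with configurable built-in retries, which may be simpler than rolling your own.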
Key advantage: Prompt caching gives 50% discount on repeated context. Batch API gives 50% off for non-urgent processing.
Anthropic (Claude)
Claude excels at long-form content, code generation, and tasks requiring careful reasoning. The 200K context window is the largest among frontier models (excluding Gemini).
Pricing tiers:
| Model | Input | Output | Cached Input | Batch Input | Batch Output |
|-------|-------|--------|--------------|-------------|--------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.30 | $1.50 | $7.50 |
| Claude 3.5 Haiku | $0.80 | $4.00 | $0.08 | $0.40 | $2.00 |
| Claude 3 Opus | $15.00 | $75.00 | $1.50 | $7.50 | $37.50 |
Free tier: Limited free usage through claude.ai. API requires prepayment.
Rate limits: 50 RPM, 40,000 TPM on free tier. 4,000 RPM, 400,000 TPM on Scale tier.
Key advantage: Prompt caching gives a 90% discount on cached tokens—the most aggressive caching discount of any provider. System prompts, examples, and tool definitions cached automatically.
Google (Gemini)
Gemini offers the cheapest frontier-class models with the largest context windows. The 2M token context on Gemini 1.5 Pro means you can process entire codebases or book-length documents in a single call.
Pricing tiers:
| Model | Input | Output | Context Caching | Free Tier |
|-------|-------|--------|-----------------|-----------|
| Gemini 1.5 Pro | $1.25 | $5.00 | $0.3125 (75% off) | 50 RPD |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.01875 (75% off) | 1,500 RPD |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 (75% off) | 1,500 RPD |
Free tier: Gemini 1.5 Flash and 2.0 Flash are free for up to 1,500 requests per day. This is the most generous free tier available.
Rate limits: Free tier has 15 RPM. Paid tier scales to 2,000 RPM.
Key advantage: Cheapest per-token cost among commercial models. Free tier is substantial enough for prototyping and low-traffic production use.
DeepSeek
DeepSeek has disrupted pricing with models that compete with GPT-4o at a fraction of the cost. DeepSeek V3 uses a Mixture-of-Experts architecture to deliver high quality with lower inference cost.
Pricing:
| Model | Input | Output | Cached Input | Context Window |
|-------|-------|--------|--------------|----------------|
| DeepSeek V3 | $0.27 | $1.10 | $0.07 | 128K |
| DeepSeek R1 (reasoning) | $0.55 | $2.19 | $0.14 | 128K |
Key advantage: 90% cheaper than GPT-4o with competitive quality on many benchmarks. R1 reasoning model is dramatically cheaper than o1.
Key risk: Based in China; may not meet data residency requirements for some enterprises.
Open-Source Self-Hosted
Running models yourself eliminates per-token fees but adds infrastructure and operational costs.
Estimated self-hosting costs (GPU rental):
| Model | GPU Required | Monthly Cost (cloud) | Effective Per-1M-Token Cost | Breakeven vs API |
|-------|--------------|----------------------|-----------------------------|------------------|
| Llama 3.1 8B | 1x A100 40GB | $1,500/mo | ~$0.02 | ~8B tokens/mo |
| Llama 3.1 70B | 4x A100 80GB | $8,000/mo | ~$0.10 | ~9B tokens/mo |
| Llama 3.1 405B | 8x A100 80GB | $20,000/mo | ~$0.25 | ~6B tokens/mo |
| Mistral 7B | 1x A10G | $500/mo | ~$0.01 | ~3B tokens/mo |
Self-hosting makes financial sense when you consistently process several billion tokens per month and can handle the operational overhead.
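The breakeven volume is simple arithmetic: divide the fixed monthly GPU cost by the API's per-1M-token price. A sketch:

```python
def breakeven_tokens(monthly_gpu_cost: float, api_price_per_1m: float) -> float:
    """Monthly token volume at which self-hosting spend equals API spend."""
    return monthly_gpu_cost / api_price_per_1m * 1_000_000

# Llama 3.1 8B: $1,500/mo for an A100 vs $0.18 per 1M tokens via Together AI
# -> breakeven around 8.3 billion tokens per month
```

Below that volume, the API is cheaper even before counting the engineering time needed to run inference infrastructure.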
Latency Benchmarks
Cost isn't the only factor. Response time directly impacts user experience.
| Model | Time to First Token (P50) | Tokens per Second | P95 Latency (500 tokens) |
|-------|---------------------------|-------------------|--------------------------|
| GPT-4o | 300ms | 80–100 | 6.5s |
| GPT-4o-mini | 200ms | 120–150 | 3.8s |
| Claude 3.5 Sonnet | 350ms | 70–90 | 7.2s |
| Claude 3.5 Haiku | 250ms | 100–130 | 4.5s |
| Gemini 1.5 Flash | 150ms | 150–200 | 2.8s |
| Gemini 2.0 Flash | 180ms | 140–180 | 3.2s |
| DeepSeek V3 | 400ms | 60–80 | 8.0s |
| Llama 3.1 70B (Together) | 350ms | 70–90 | 7.0s |
For latency-critical applications (chatbots, autocomplete), Gemini Flash and GPT-4o-mini are the clear winners.
Real-World Cost Scenarios
Let's calculate actual monthly costs for common use cases.
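All of these scenarios reduce to the same arithmetic: tokens per month, split into input and output, times the per-1M rates. A naive sketch you can adapt (the scenario tables below bake in extra assumptions — such as conversation history being re-sent on every turn — so their figures run higher than this bare formula):

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float, days: int = 30) -> float:
    """Monthly spend in dollars at the given per-1M-token rates."""
    monthly_in = requests_per_day * days * input_tokens
    monthly_out = requests_per_day * days * output_tokens
    return (monthly_in * price_in + monthly_out * price_out) / 1_000_000
```

For example, 2,000 requests/day at 3,000 tokens in / 500 out on GPT-4o-mini works out to (180M × $0.15 + 30M × $0.60) / 1M = $45/month before any overhead.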
Customer Support Chatbot
Assumptions: 10,000 conversations/day, average 8 turns per conversation, 500 tokens per turn (input + output combined), 60% input / 40% output split.
| Model | Monthly Cost | Quality |
|-------|--------------|---------|
| GPT-4o | $18,600 | Excellent |
| GPT-4o-mini | $1,116 | Very good |
| Claude 3.5 Sonnet | $25,200 | Excellent |
| Claude 3.5 Haiku | $6,720 | Good |
| Gemini 1.5 Flash | $558 | Good |
| Gemini 2.0 Flash | $744 | Very good |
| DeepSeek V3 | $1,860 | Good |
Recommendation: Use GPT-4o-mini or Gemini 2.0 Flash for the main flow. Route complex queries to GPT-4o or Claude 3.5 Sonnet.
RAG-Powered Internal Knowledge Base
Assumptions: 2,000 queries/day, average 3,000 tokens input (query + retrieved context), 500 tokens output.
| Model | Monthly Cost | Quality |
|-------|--------------|---------|
| GPT-4o | $1,950 | Excellent |
| GPT-4o-mini | $117 | Very good |
| Claude 3.5 Sonnet (with caching) | $315 | Excellent |
| Gemini 1.5 Flash | $52 | Good |
Recommendation: Claude 3.5 Sonnet with prompt caching is the best value for RAG—the 90% cache discount dramatically reduces cost when your system prompt and tool definitions are repeated across queries.
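To see why the cache discount matters so much for RAG, compare per-query cost with and without a cached prompt prefix. A sketch at Claude 3.5 Sonnet rates (for simplicity this ignores Anthropic's one-time cache-write surcharge of 1.25x on the first request):

```python
def query_cost(cached_in: int, fresh_in: int, out: int,
               price_in: float = 3.00, price_out: float = 15.00,
               cache_discount: float = 0.90) -> float:
    """Per-query dollar cost when a prompt prefix is served from cache."""
    cached = cached_in * price_in * (1 - cache_discount)  # billed at 10% of input rate
    return (cached + fresh_in * price_in + out * price_out) / 1_000_000

# 2,000-token cached system prompt + 1,000 fresh input tokens + 500 output tokens:
# with caching:    ~$0.0111 per query
# without caching: ~$0.0165 per query (pass cache_discount=0.0)
```

The larger the static prefix (system prompt, tool definitions, few-shot examples) relative to the per-query content, the bigger the savings.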
Content Generation Pipeline
Assumptions: 500 articles/day, 2,000 tokens input (prompt + examples), 3,000 tokens output per article.
| Model | Monthly Cost | Quality |
|-------|--------------|---------|
| GPT-4o | $5,625 | Excellent |
| Claude 3.5 Sonnet | $7,875 | Excellent |
| GPT-4o-mini | $337 | Good |
| DeepSeek V3 | $907 | Good |
Recommendation: GPT-4o for highest quality output. GPT-4o-mini for drafts that humans will edit.
Cost Optimization Strategies
These strategies can reduce your LLM costs by 50–90%.
1. Prompt Caching
If your system prompt, tool definitions, or few-shot examples stay the same across requests, caching avoids re-processing those tokens.
| Provider | Cache Discount | How to Enable |
|----------|---------------|--------------|
| OpenAI | 50% on cached tokens | Automatic for prompts > 1,024 tokens |
| Anthropic | 90% on cached tokens | Add cache_control to message blocks |
| Google | 75% on cached tokens | Use context caching API |
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_system_prompt,
        # Mark the static prefix as cacheable; later requests that reuse
        # it read from the cache at 10% of the normal input rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": user_message}],
)
```
2. Model Routing
Use cheaper models for simple tasks and expensive models only when needed.
```python
def route_query(query: str) -> str:
    """Return the cheapest model that can handle the query."""
    complexity = classify_complexity(query)  # your own heuristic or small classifier
    if complexity == "simple":
        return "gpt-4o-mini"
    elif complexity == "moderate":
        return "gpt-4o"
    else:
        return "o1-mini"  # reserve reasoning models for the hardest queries
```
3. Batch Processing
For non-real-time tasks, use batch APIs at 50% discount.
```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of requests, then submit it as a batch job.
file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours at 50% of the standard price
)
```
4. Response Caching
Cache LLM responses for identical or near-identical queries.
```python
import hashlib
import json

import redis
from openai import OpenAI

cache = redis.Redis()
client = OpenAI()

def get_llm_response(messages, model="gpt-4o-mini"):
    # Hash the serialized message list so identical queries share a cache key.
    cache_key = hashlib.sha256(json.dumps(messages).encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    response = client.chat.completions.create(model=model, messages=messages)
    content = response.choices[0].message.content
    cache.setex(cache_key, 3600, json.dumps(content))  # expire after one hour
    return content
```
5. Prompt Optimization
Shorter prompts = lower costs. Techniques include:

- Remove redundant instructions
- Use structured output (JSON mode) to reduce output tokens
- Set `max_tokens` to prevent unnecessarily long responses
- Use abbreviations in system prompts (LLMs understand them fine)
6. Token-Aware Input Processing
Trim retrieved context in RAG systems to only include the most relevant chunks, rather than filling the entire context window.
```python
def trim_context(chunks, max_tokens=3000):
    """Keep the top-ranked chunks that fit within the token budget."""
    total = 0
    selected = []
    for chunk in chunks:  # assumes chunks are already sorted by relevance
        tokens = count_tokens(chunk.page_content)  # e.g. a tiktoken-based counter
        if total + tokens > max_tokens:
            break
        selected.append(chunk)
        total += tokens
    return selected
```
When to Use Which Model
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Customer-facing chatbot | GPT-4o-mini | Best quality/cost ratio, fast |
| Complex reasoning tasks | o3-mini or GPT-4o | Strongest reasoning capabilities |
| Code generation | Claude 3.5 Sonnet | Best coding performance |
| Long document analysis | Gemini 1.5 Pro | 2M context window, good price |
| High-volume classification | Gemini 1.5 Flash | Cheapest, fast enough |
| RAG with caching | Claude 3.5 Sonnet | 90% cache discount |
| Budget-constrained projects | DeepSeek V3 | 90% cheaper than GPT-4o |
| Data-sensitive applications | Llama 3.1 (self-hosted) | Full data control |
| Real-time autocomplete | GPT-4o-mini or Gemini Flash | Lowest latency |
| Batch content processing | GPT-4o (batch API) | 50% off, highest quality |
Free Tiers and Getting Started
| Provider | Free Credits/Tier | Expiration | Best For |
|----------|-------------------|------------|----------|
| OpenAI | $5 credit | 3 months | Evaluating GPT models |
| Anthropic | Limited (claude.ai only) | Ongoing | Testing Claude capabilities |
| Google | 1,500 RPD free (Flash) | Ongoing | Prototyping, low-traffic apps |
| Together AI | $5 credit | 30 days | Testing open-source models |
| Groq | Free tier (rate-limited) | Ongoing | Ultra-fast inference testing |
| DeepSeek | $5 credit | None | Budget-conscious evaluation |
Calculating Your Costs
The right model depends on your specific usage patterns. Use our LLM Cost Calculator to model costs across providers based on your actual query volume, token usage, and required features.
Key inputs to estimate:
- Average queries per day
- Average input tokens per query
- Average output tokens per query
- Percentage of queries that can use a cheaper model
- Percentage that benefit from caching
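Those inputs combine into a single blended estimate. A sketch that routes a fraction of queries to a cheap model and applies a cache discount to input tokens (all parameter values are yours to supply; the prices in the example are from the tables above):

```python
def blended_monthly_cost(queries_per_day: int, in_tokens: int, out_tokens: int,
                         cheap_frac: float, cheap_prices: tuple, frontier_prices: tuple,
                         cache_frac: float = 0.0, cache_discount: float = 0.5,
                         days: int = 30) -> float:
    """Estimated monthly spend with model routing and prompt caching."""
    def per_query(price_in, price_out):
        # Cached input tokens are billed at a discount.
        effective_in = in_tokens * (1 - cache_frac * cache_discount)
        return (effective_in * price_in + out_tokens * price_out) / 1_000_000

    queries = queries_per_day * days
    return queries * (cheap_frac * per_query(*cheap_prices)
                      + (1 - cheap_frac) * per_query(*frontier_prices))
```

For example, 1,000 queries/day at 2,000 tokens in / 500 out, with 80% routed to GPT-4o-mini and 20% to GPT-4o and no caching, comes to about $74/month.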
Building Cost-Efficient LLM Applications
Selecting the cheapest model is easy. Building a system that optimizes cost across the entire stack—model routing, caching, prompt engineering, batch processing—requires experience.
If you need help integrating LLMs into your product or optimizing an existing deployment, ZTABS offers GPT integration services and LLM fine-tuning to help you get the best results at the lowest cost.
The LLM pricing landscape changes every quarter. Bookmark this page—we update it as providers adjust their pricing.
Need Help Building Your Project?
From web apps and mobile apps to AI solutions and SaaS platforms — we ship production software for 300+ clients.
Related Articles
AI Agent Orchestration: How to Coordinate Agents in Production
AI agent orchestration is how you coordinate multiple agents, tools, and workflows into reliable production systems. This guide covers orchestration patterns, frameworks, state management, error handling, and the protocols (MCP, A2A) that make it work.
10 min read
AI Agent Testing and Evaluation: How to Measure Quality Before and After Launch
You cannot ship an AI agent to production without a testing strategy. This guide covers evaluation datasets, accuracy metrics, regression testing, production monitoring, and the tools and frameworks for testing AI agents systematically.
10 min read
AI Agents for Accounting & Finance: Bookkeeping, AP/AR, and Reporting
AI agents automate accounting tasks — invoice processing, expense management, reconciliation, and financial reporting — reducing manual work by 60–80% while improving accuracy. This guide covers use cases, ROI, compliance, and implementation.