AI Cost Optimization at Scale: How We Cut LLM Bills 60% Without Quality Loss
TL;DR: Running 10 in-house AI products and 100+ client AI deployments, we have a playbook for cutting LLM bills without losing quality. Model routing, prompt caching, output minimization, structured outputs, and the cost gotchas teams find at $20K-$200K/month.
Running 10 in-house AI products and 100+ client AI deployments, we have a working playbook for cutting LLM bills without losing quality. ZTABS clients in the $20K-$200K/month LLM-spend range typically see 40-70% reductions when we apply the levers below systematically. This is the operator's guide.
TL;DR — the 6 levers ranked by impact
- Output token discipline (highest impact) — force structured outputs, cap
max_tokens, eliminate "explain your reasoning" when not needed - Prompt caching — Anthropic and OpenAI both offer prompt caching; Anthropic's cached input reads price at ~10% of normal input (a ~90% discount on the cached portion)
- Model routing — cheap model first; escalate only when needed
- Context trimming — every token you load into context is paid for; aggressively trim irrelevant context
- Batch API where latency allows — Anthropic and OpenAI Batch APIs are both 50% off list price (within a 24h SLA)
- Self-host (last resort) — break-even ~$35K-$55K/month, then it's a real operational decision
| Lever | Typical savings | Engineering effort | Best for |
|---|---|---|---|
| Output token discipline | 30-50% | Low (config + prompt tweak) | Verbose-by-default workloads |
| Prompt caching | 20-50% on RAG | Low-medium | Stable knowledge base + repeat queries |
| Model routing | 30-60% | Medium (eval infra + router) | Mixed-complexity workloads |
| Context trimming | 15-30% | Medium (retrieval tuning) | Long-context RAG |
| Batch API | 50% on those calls | Low (where applicable) | Async / non-interactive workloads |
| Self-hosted | 50-80% net | High | $50K+/month + ops capacity |
We deploy levers 1-4 on every production engagement. Lever 5 (Batch) and 6 (self-host) are situation-dependent.
Lever 1 — Output token discipline
Output tokens cost 3-5x input tokens at frontier vendors. This makes output-minimization the single biggest lever in most workloads.
Three sub-tactics:
(a) Force structured output. Use the vendor's JSON-mode or structured-output feature. Eliminates verbose prose responses where you only needed three fields.
Before: "The customer's intent appears to be requesting a refund. Their order is from 3 weeks ago, which is within the refund window. The product was marked as faulty. I would recommend processing the refund and offering a 10% discount on their next order to retain the customer."
After: {"intent": "refund", "eligible": true, "recommended_action": "process_refund", "retention_offer": "10pct_discount"}
Token count drops 5-10x. Quality stays equal or improves.
(b) Set explicit max_tokens caps. Every call. The model will pad if you let it. We default to max_tokens: 200 for classification, 400 for extraction, 800 for short generation, 2000+ only when long-form output is genuinely required.
(c) Strip "explain your reasoning" unless you need it. Many prompts include "explain step by step" by default — copied from prompt-engineering tutorials. If you're not using the reasoning trace (you're just consuming the final answer), strip it. Save 2-5x on output tokens.
Output discipline alone has cut bills 30-50% in deployments we've audited.
Lever 2 — Prompt caching
Anthropic and OpenAI both offer prompt caching: cache the long prefix of a prompt (e.g., system instructions + RAG context) and pay roughly 10% of the normal input-token price on cached portions (a ~90% discount on the cached prefix). Latency also drops 2-5x. Note Anthropic charges a small write premium (1.25x base input price for a 5-minute cache, 2x for a 1-hour cache), so caching pays off after just a handful of reads.
Where it pays off most:
- RAG with a stable knowledge-base context (the docs don't change; the user query does)
- Multi-turn agentic loops (system prompt is stable across turns)
- Batch processing with shared instructions
- A/B testing prompt variants where most of the prompt is the same
Where it doesn't help:
- One-shot queries where the prefix changes per call
- Workloads where calls are spaced beyond the cache TTL (Anthropic offers 5-minute and 1-hour cache windows; longer gaps lose the discount)
- Workloads where the prefix is shorter than the cache minimum (~1024 tokens)
Implementation pattern:
[CACHED PREFIX]
System instructions (5K tokens)
Knowledge-base context (15K tokens)
Few-shot examples (3K tokens)
[CACHED PREFIX END]
[UNCACHED]
User query: "What's the refund policy for digital goods?"
The first call pays a small premium to write the 23K-token prefix to cache. Subsequent calls inside the cache TTL pay ~10% on the prefix and full price only on the query.
On a 1M-query/month RAG workload with 80% cache hit rate, savings land at 35-50%.
Lever 3 — Model routing
The classic two-tier pattern: cheap model handles the easy 80%; strong model handles the hard 20%.
Pattern A — Confidence-based routing:
- Send query to cheap model (Claude Haiku 4.5, GPT-5.4 mini, Gemini 3.5 Flash)
- Model returns answer + self-reported confidence
- If confidence > threshold → return cheap model's answer
- Else → escalate to strong model (Claude Sonnet 4.6, GPT-5.4/5.5, Gemini 3.1 Pro)
- Return strong model's answer
Pattern B — Pre-classification routing:
- Cheap model classifies query into "easy/hard/expert"
- Easy → cheap model handles directly
- Hard → strong model handles
- Expert → strong model + tool calls + human review
Pattern C — A/B-tested static routing:
- Run evals on representative query set
- Identify which query patterns the cheap model handles reliably
- Route deterministically by query pattern (e.g., "summarize" → cheap; "explain why" → strong)
- Periodically re-evaluate
Pattern A is most common. Pattern C produces cleaner billing math but more brittle to query distribution shift.
Common gotcha: model self-reported confidence is unreliable. Don't trust the model's confidence score — use an external eval or judge model to score outputs and gate the route. See our agent testing + observability guide for the eval infrastructure.
Lever 4 — Context trimming
Every token in your prompt is paid for. Most production RAG deployments load 3-5x more context than they actually need.
Tactics:
- Re-rank, then chunk-prune. Retrieve top 50 chunks; re-rank with a smaller model; keep only top 5-10 that actually answer the query.
- Use embedding-based deduplication. Many retrieved chunks contain near-duplicate information. Dedupe before adding to context.
- Strip headers/footers/boilerplate from documents during indexing, not at query time. Every doc indexed clean saves on every future retrieval.
- Use compressed summaries for long historical context. A 50K-token conversation history doesn't need to be loaded verbatim; load a 2K-token summary instead with the most recent 5K of full context.
Context trimming wins 15-30% in RAG-heavy workloads.
Lever 5 — Batch API
Anthropic and OpenAI both offer Batch APIs at 50% off list price on both input and output tokens. The trade-off: results may take up to 24 hours, no streaming, no real-time. (Anthropic's Batch discount also stacks with prompt caching.)
Where it works:
- Overnight processing pipelines (embedding generation, document analysis, summary refresh)
- Backfilling AI features over historical data
- Async classification of incoming data (emails, support tickets) that don't need instant response
- Large-scale eval runs (offline benchmarking)
Where it doesn't:
- User-facing real-time interactions
- Anything customer-waiting
For workloads with significant async batch component, this is free money.
Lever 6 — Self-host (last resort)
When closed-source API spend exceeds ~$35K-$55K/month, self-hosted Llama 4, Mistral Medium 3.5, DeepSeek V4, or Qwen 3.5 can be cheaper net — but the operational burden is real.
Total cost includes:
- GPU infrastructure ($8K-$25K/month for 8xH100 inference cluster)
- DevOps engineer time (1-2 FTE for serious production)
- Model evaluation overhead (you own quality regressions when you swap model versions)
- On-call burden (cloud APIs have their own SREs; you're now an SRE)
Where it makes sense:
- HIPAA / Schrems II / GDPR workloads where data can't transit closed-source APIs
- $50K+/month spend where the GPU break-even works
- Workloads where you've invested in proprietary fine-tuning (the cost of hosting includes serving the fine-tune)
- Companies with existing GPU infrastructure capacity (already operating ML pipelines)
Where it doesn't:
- Sub-$50K/month spend
- Teams without on-call capacity
- Workloads with bursty volume that overwhelms fixed GPU capacity
See our self-hosted LLM guide for the full architecture.
The observability layer — instrument before you optimize
Don't optimize blind. Before applying any lever, instrument:
- Per-request cost tagged by feature, tenant, user, model
- Token counts for input + output split
- Latency per call + per feature
- Quality scores from your eval setup (without quality measurement you can't know if optimization broke something)
Tools: Langfuse, Braintrust, Helicone, or custom OpenTelemetry. See our agent testing + observability guide for picking between them.
Common discovery pattern: teams come to us assuming Feature X is their expensive workload. Instrument the bill and discover Feature Y (a background classifier nobody thought about) is 60% of cost. Optimization moves where the money actually goes, not where you assume it goes.
Real-world before/after we've shipped
Three anonymized engagements:
Engagement 1 — SaaS customer support AI ($45K/mo → $18K/mo, 60% reduction)
- Output discipline: removed verbose prose responses, forced JSON for routing decisions (-35%)
- Prompt caching on the knowledge-base context (-20%)
- Model routing: Haiku for classification, Sonnet only for response generation (-15% net after edge-case escalations)
Engagement 2 — Document processing pipeline ($120K/mo → $35K/mo, 70% reduction)
- Switched the 80%-volume classification step from Sonnet to a fine-tuned smaller model (Haiku-class, -50%)
- Batched the async document analysis through Batch API (-25% on those calls)
- Context trimming reduced average input tokens 40% (-15%)
Engagement 3 — AI agent workload ($28K/mo → $12K/mo, 57% reduction)
- Hard
max_tokenscaps on every tool-call response (-30%) - Removed redundant "explain your reasoning" steps in agentic loops (-15%)
- Prompt caching on system prompt + tool definitions (~12K tokens cached, -25%)
Patterns differ; the cumulative impact is consistent.
What ZTABS builds for cost optimization
We ship LLM cost-optimization engagements typically:
- AI cost audit + optimization roadmap — 2 weeks, includes per-feature cost attribution, lever-by-lever savings estimate, recommended priority
- Model routing layer + eval infra — 4-8 weeks, includes router service, eval framework, monitoring dashboard
- Prompt caching + context optimization — 2-4 weeks, includes prompt audit, RAG retrieval tuning, cache strategy
- Self-hosted migration for clients above break-even — 8-16 weeks, includes infra setup, model selection + evaluation, migration plan, ongoing ops handoff
Reach out via /services/ai-consulting or /contact.
Related reading
- Claude vs GPT vs Gemini 2026 — picking the model for routing
- AI agent testing and evaluation — eval infra for cost-aware routing
- Self-hosted LLM guide — when to bring inference in-house
- AI agent development cost — broader cost model for AI agent builds
- LLM Cost Calculator — compare per-call cost across 11 models
- AI Agent ROI Calculator — break-even math for replacing tasks with agents
LLM vendor pricing, cache discount rates, and model price tiers change quarterly. All specific numbers tagged for editorial fact-check before publish.
Frequently Asked Questions
How much can I really cut my LLM bill in 2026?
Typical ranges we've seen across deployments: meaningful cost reduction (often 40-70%) without measurable quality loss is common, and deeper cuts are possible with aggressive model routing + caching at the cost of some tail-case quality. The lever availability depends on workload shape — RAG-heavy workloads gain most from caching; conversational workloads gain most from model routing; structured-output workloads gain most from output minimization.
What's the single biggest cost lever?
For most production deployments: output tokens. Cloud frontier LLMs charge 3-5x more for output than input. A prompt asking the model to "explain your reasoning step by step" or "provide detailed examples" can 10x the cost vs "respond in JSON only." Force structured outputs, set hard max_tokens limits, and audit what your model is actually outputting before optimizing elsewhere.
When does prompt caching pay off?
When the same prompt prefix is reused multiple times within ~5 minutes. RAG with a stable knowledge-base context, multi-turn agentic loops, batch processing with shared instructions — these are the wins. Anthropic and OpenAI both offer prompt caching with ~90% discount on cached portions. Latency drops 2-5x in addition to the price drop.
Is model routing worth the engineering effort?
Yes above ~$10K/month of LLM spend. Below that, the engineering overhead doesn't pay back. The pattern that works: cheap model handles 80%+ of queries; an eval scores confidence; low-confidence cases escalate to a stronger model. Net effect at $50K/month spend is typically 40-60% savings.
What about open-source / self-hosted to cut costs?
Break-even on self-hosted Llama 4, Mistral Medium 3.5, or DeepSeek V4 lands around $35K-$55K/month of closed-source API spend. Below that, closed-source APIs are cheaper once you factor engineering, infrastructure ops, and reliability burden. Above that, self-hosting math works — but the operational cost (on-call, model evaluation, infrastructure) is real.
How do I measure AI costs per feature in production?
Tag every LLM call with feature_id, user_id, tenant_id, and model_id at the SDK / proxy layer. Pipe to your observability stack (Langfuse, Braintrust, Helicone, or custom). Aggregate monthly per dimension. Without this instrumentation you're flying blind — most teams discover their "expensive" feature is actually 5% of cost and the cheap-looking infra feature is 60%.
Explore Related Solutions
Need Help Building Your Project?
From web apps and mobile apps to AI solutions and SaaS platforms — we ship production software for 300+ clients.
Related Articles
AI Browser Automation in 2026: ChatGPT Agent, Computer Use, and What Actually Ships
AI browser automation matured in 2024-2026. OpenAI's ChatGPT agent (and its CUA model), Anthropic Computer Use, browser-use, and Playwright MCP all ship. Here's what works in production, what breaks, and how to pick between them — from a team that's shipped agentic browser automation for clients in retail, travel, and ops automation.
10 min readBlockchain Development in 2026: What's Actually Worth Building
After two cycles of hype-and-bust, blockchain in 2026 has a small set of use cases that actually work in production — and a long list that still don't. This is the honest engineer's guide to what's worth building, what's not, and which stack to pick if you must.
13 min readClaude vs GPT vs Gemini in 2026: A Production Engineer's Frontier-Model Comparison
We ship AI in production across 10 in-house SaaS products and 100+ client projects. This is the frontier-model comparison we actually use to pick between the Claude 4.x, GPT-5.x, and Gemini 3.x families — pricing, real context limits, rate-limit behavior, and the failure modes nobody talks about.