When your LLM bill becomes a real line item, you have to optimize. The default "send everything to GPT-5" approach is fine at low volume but expensive at scale. The good news: there are 7 stackable optimizations, each cutting cost by 20-60%, and most of them don't compromise quality.
1. Prompt caching
The single biggest win. Anthropic, OpenAI, and Google all support caching for repeated prompt content (system prompts, RAG context, few-shot examples). Cache hits can cost as little as 10% of the normal input-token rate (the exact discount varies by provider).
For a typical RAG application where the system prompt is 1000 tokens and stays the same:
- Without caching: pay full input rate every query
- With caching: pay 10% of that rate after the first cache write
- Savings: 80-90% on input cost for repeated prefixes
Implementation: mark the cacheable prefix in your API call. Claude uses an explicit `cache_control` breakpoint; OpenAI caches long repeated prefixes automatically. Cache TTL is usually around 5 minutes and is refreshed on each hit, so it works well for chat sessions.
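A minimal sketch of the Claude-style breakpoint using the Anthropic Python SDK; the model ID, prompt, and question are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the ~1000-token system prompt that stays constant across queries

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: any cache-capable model works the same way
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes a cacheable prefix.
            # The first call pays a cache-write premium; later calls within the
            # TTL read it back at the discounted cache-hit rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the refund policy say?"}],
)
# usage reports cache_creation_input_tokens and cache_read_input_tokens,
# so you can verify the cache is actually being hit.
print(response.usage)
```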
2. Model routing
Not every query needs frontier intelligence. A simple classifier (often a smaller model) routes:
- 60% of queries to a cheap model (Haiku 4.5, GPT-5 Mini, DeepSeek)
- 30% to mid-tier (Sonnet 4.5)
- 10% to frontier (Claude Opus 4.7, GPT-5 Pro, o3)
Real example: a customer support agent. Most questions are routine, FAQ-style queries (roughly $0.001 each on Haiku). Complex multi-step issues escalate to Sonnet. Truly novel cases go to Opus.
Result: 5-10× cost reduction with imperceptible quality drop.
Implementation: send a tiny classifier prompt to a small model ("Is this query simple or complex?") and use the answer to pick the model.
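A minimal routing sketch using the OpenAI Python SDK; the model IDs are placeholders — substitute whichever cheap and frontier tiers you actually run:

```python
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the user query as SIMPLE (routine, factual, FAQ-style) or "
    "COMPLEX (multi-step, ambiguous, or novel). Reply with one word."
)

# Placeholder tier map: point these at your provider's cheap and frontier models.
MODELS = {"SIMPLE": "gpt-4o-mini", "COMPLEX": "gpt-4o"}

def answer(query: str) -> str:
    # Tiny, cheap classification call; the routed call below does the real work.
    verdict = (
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": ROUTER_PROMPT},
                {"role": "user", "content": query},
            ],
            max_tokens=5,
        )
        .choices[0].message.content.strip().upper()
    )
    model = MODELS.get(verdict, MODELS["COMPLEX"])  # default to the stronger tier
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```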
3. Output token capping
Most providers charge 3-5× more per output token than per input token. Unnecessarily long completions are pure waste.
- Set `max_tokens` aggressively. If your output is bounded (200 words for a summary), cap at 400 tokens.
- For tool-use loops, cap each turn so the model doesn't ramble.
- Use stop sequences to end output at natural boundaries.
A 50% reduction in average output length can mean a 30-40% reduction in your bill, since output tokens dominate cost.
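A minimal sketch of a capped summarization call; the model ID and the `###` stop sentinel are assumptions (your prompt would ask the model to end with it):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat model
    messages=[{
        "role": "user",
        "content": "Summarize this ticket in under 200 words, then write ###:\n...",
    }],
    max_tokens=400,   # hard cap: a ~200-word summary fits comfortably
    stop=["###"],     # cut the completion at the natural boundary
)
print(resp.choices[0].message.content)
```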
4. Switch from chat to completion-style
For structured tasks (classification, extraction, summarization), don't default to the full synchronous chat API. Some providers offer cheaper completion-style endpoints or batch APIs priced at roughly 50% of synchronous chat.
- OpenAI Batch API: 50% off, 24-hour SLA. Great for non-realtime workloads.
- Anthropic Batch API: 50% off, similar tradeoff.
- Provider-specific deals: check your provider's batch pricing.
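A hedged sketch of the OpenAI Batch API flow — write requests to a JSONL file, upload it, then submit a batch with a 24-hour completion window; the model ID and request bodies are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you match results back to inputs.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model
            "messages": [{"role": "user", "content": f"Classify document {i}: ..."}],
            "max_tokens": 50,
        },
    }
    for i in range(100)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour SLA, at 50% of synchronous pricing
)
print(batch.id, batch.status)  # poll later and download the output file when complete
```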
5. Smaller open-source for commodity tasks
For tasks where quality differences are imperceptible (classification, extraction, simple summarization), self-hosted Llama 3.1 70B, Qwen 2.5, or DeepSeek can replace frontier API calls.
Real numbers from production teams:
- 5M tokens/day on Claude Sonnet: ~$1500/month
- Same volume on self-hosted Llama 3.1 70B: ~$300/month including GPU and ops costs
- Quality drop on simple tasks: minimal
- Quality drop on complex reasoning: substantial — keep frontier for that
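A back-of-the-envelope sketch that reproduces those ballpark figures; every rate below is an assumption chosen for illustration, not a quoted price:

```python
# Assumed figures: 5M tokens/day, a ~$10 per 1M-token blended API rate,
# and ~$300/month of GPU rental + ops for the self-hosted node.
tokens_per_month = 5_000_000 * 30

api_rate_per_m = 10.0                                          # assumed blended $/1M tokens
api_monthly = tokens_per_month / 1_000_000 * api_rate_per_m    # ≈ $1,500

gpu_monthly, ops_monthly = 250.0, 50.0                         # assumed amortized costs
selfhost_monthly = gpu_monthly + ops_monthly                   # ≈ $300

print(f"API: ${api_monthly:,.0f}/mo   self-hosted: ${selfhost_monthly:,.0f}/mo")
```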
6. Reduce context aggressively
- Trim conversation history. Most chats don't need all 20 prior turns.
- Summarize old context. Replace 10 old turns with one summary turn.
- Don't include retrieved chunks if they're not actually relevant. Use a reranker to pick top-3 instead of top-10.
- Audit system prompts. Most are 2-3× longer than they need to be.
A 30% context reduction translates roughly to a 30% input cost reduction.
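A minimal sketch of the summarize-old-turns idea; `summarizer` is a hypothetical callable you would point at a cheap model:

```python
def compact_history(messages, keep_last=6, summarizer=None):
    """Keep the most recent turns verbatim; collapse older turns into one summary turn.

    `messages` is a list of {"role": ..., "content": ...} dicts. `summarizer` is any
    callable that turns a transcript string into a short summary (assumption: a cheap
    model behind a one-line prompt).
    """
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarizer(transcript) if summarizer else transcript[:500]
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```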
7. Batch and async where possible
Many operations can be batched:
- Embed 100 documents in one API call instead of 100 calls (lower per-call overhead)
- Process overnight reports in async batch APIs
- Pre-compute responses for common queries
For real-time UX, batching adds latency. For background processing, batch APIs are 50% off across most providers.
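For example, one embeddings request can carry the whole batch of inputs; this sketch assumes the OpenAI SDK and the `text-embedding-3-small` model:

```python
from openai import OpenAI

client = OpenAI()

documents = [f"Document {i} text ..." for i in range(100)]  # placeholder corpus

# One request for 100 inputs instead of 100 requests: same per-token price,
# far less per-call overhead and latency.
resp = client.embeddings.create(model="text-embedding-3-small", input=documents)
vectors = [item.embedding for item in resp.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```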
Compounding the savings
Applied together, the optimizations multiply rather than add:
- Prompt caching: 50% off input cost (assumes 80% cache hit rate)
- Model routing: 70% off blended cost (most queries on cheap model)
- Output capping: 30% off output cost
- Context trimming: 30% off remaining input cost
- Batch APIs (where applicable): 50% off batch portion
A team that applies all of these can see total bills drop by 70-80% with no perceived quality decrease.
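A rough worked example of the multiplication, under an assumed 40/60 input/output cost split; in practice the factors overlap (routing to cheap models shrinks what caching can save), which is why realized savings land around 70-80% rather than higher:

```python
input_share, output_share = 0.4, 0.6    # assumed split of the original bill

input_cost = input_share * (1 - 0.5)    # prompt caching: 50% off input
input_cost *= (1 - 0.3)                 # context trimming: 30% off the remaining input
output_cost = output_share * (1 - 0.3)  # output capping: 30% off output

blended = (input_cost + output_cost) * (1 - 0.7)         # model routing: 70% off the blend
print(f"remaining bill: {blended:.0%} of the original")  # ≈ 17% on paper
```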
What NOT to do
Don't use the cheapest model for everything. It's a false economy when quality drops lead to customer complaints, more support tickets, and more refunds.
Don't disable streaming to "save money." Streaming doesn't change what you're billed; it only affects UX and perceived time-to-first-token.
Don't switch providers daily based on pricing. The engineering cost of integration changes outweighs the price differences for most teams.
Don't add complexity before measuring. If your bill is $200/month, optimizing isn't worth engineering time. Optimize when it's $2000+/month.
Measuring before optimizing
Before optimizing:
- Log every API call with: model, input tokens, output tokens, cost
- Aggregate by feature / endpoint to find what's expensive
- Identify the top 20% of calls driving 80% of cost
- Optimize those, leave the rest
Without measurement, you'll optimize the wrong things. Set up basic logging in week 1; optimize from week 2.
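A minimal logging wrapper, assuming the OpenAI SDK; the price table, CSV path, and `feature` label are placeholders to adapt:

```python
import csv
import time

from openai import OpenAI

client = OpenAI()

# Assumed prices in $ per 1M tokens; fill in your actual models and rates.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def logged_call(feature: str, **kwargs):
    """Wrap an LLM call so model, token counts, and cost land in a CSV you can aggregate."""
    resp = client.chat.completions.create(**kwargs)
    usage, price = resp.usage, PRICES[kwargs["model"]]
    cost = (usage.prompt_tokens * price["input"]
            + usage.completion_tokens * price["output"]) / 1_000_000
    with open("llm_costs.csv", "a", newline="") as f:
        csv.writer(f).writerow([time.time(), feature, kwargs["model"],
                                usage.prompt_tokens, usage.completion_tokens, f"{cost:.6f}"])
    return resp

# Usage: logged_call("support_faq", model="gpt-4o-mini",
#                    messages=[{"role": "user", "content": "..."}])
```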
Tools that help
- Helicone, Langfuse, Portkey — observability platforms with cost dashboards
- OpenRouter — multi-provider router with built-in fallback and pricing
- Martian, Anyscale — automated model routing services
- PostHog + custom — track LLM costs per feature
For early-stage teams, custom logging is sufficient. For production at scale, use a dedicated platform.
When NOT to obsess
- Total bill under $500/month: focus on product, not optimization
- Pre-product-market-fit: cost optimization can wait
- Burst traffic that's still small: amortized cost is what matters, not peak
Optimize when cost actually matters. Until then, ship faster.
Decision tree
- Bill < $500/month: don't optimize, focus on product
- Bill $500-5k/month: enable prompt caching + output capping + model routing
- Bill $5k-50k/month: all 7 optimizations + observability platform
- Bill > $50k/month: dedicated infrastructure team focus, possibly self-host
Next steps
- Set up cost logging by feature this week
- Enable prompt caching tomorrow (5-minute change for big savings)
- Add a simple model router for your top use case
- Read your provider's batch API docs for non-realtime workloads