When your LLM bill becomes a real line item, you have to optimize. The default "send everything to GPT-5" approach is fine at low volume but expensive at scale. The good news: there are 7 stackable optimizations, each cutting cost by 20-60%, and most of them don't compromise quality.
1. Prompt caching
The single biggest win. Anthropic, OpenAI, and Google all support caching for repeated prompt content (system prompts, RAG context, few-shot examples). Cache hits can cost as little as 10% of the normal input-token rate (the exact discount varies by provider).
For a typical RAG application where the system prompt is 1000 tokens and stays the same:
- Without caching: pay full input rate every query
- With caching: pay 10% of that rate after the first cache write
- Savings: 80-90% on input cost for repeated prefixes
Implementation: mark the cacheable prefix in your API call. Claude uses an explicit `cache_control` breakpoint; OpenAI caches long repeated prefixes automatically. Cache TTL is usually around 5 minutes and is refreshed on each hit, so it works well for chat sessions.
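A minimal sketch of the Claude-style breakpoint using the Anthropic Python SDK; the model ID, prompt, and question are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the ~1000-token system prompt that stays constant across queries

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: any cache-capable model works the same way
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes a cacheable prefix.
            # The first call pays a cache-write premium; later calls within the
            # TTL read it back at the discounted cache-hit rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the refund policy say?"}],
)
# usage reports cache_creation_input_tokens and cache_read_input_tokens,
# so you can verify the cache is actually being hit.
print(response.usage)
```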
2. Model routing
Not every query needs frontier intelligence. A simple classifier (often a smaller model) routes:
- 60% of queries to a cheap model (Haiku 4.5, GPT-5 Mini, DeepSeek)
- 30% to mid-tier (Sonnet 4.5)
- 10% to frontier (Claude Opus 4.7, GPT-5 Pro, o3)
Real example: a customer support agent. Most questions are routine, FAQ-style queries (roughly $0.001 each on Haiku). Complex multi-step issues escalate to Sonnet. Truly novel cases go to Opus.
Result: 5-10× cost reduction with imperceptible quality drop.
Implementation: send a tiny classifier prompt to a small model ("Is this query simple or complex?") and use the answer to pick the model.
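A minimal routing sketch using the OpenAI Python SDK; the model IDs are placeholders — substitute whichever cheap and frontier tiers you actually run:

```python
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the user query as SIMPLE (routine, factual, FAQ-style) or "
    "COMPLEX (multi-step, ambiguous, or novel). Reply with one word."
)

# Placeholder tier map: point these at your provider's cheap and frontier models.
MODELS = {"SIMPLE": "gpt-4o-mini", "COMPLEX": "gpt-4o"}

def answer(query: str) -> str:
    # Tiny, cheap classification call; the routed call below does the real work.
    verdict = (
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": ROUTER_PROMPT},
                {"role": "user", "content": query},
            ],
            max_tokens=5,
        )
        .choices[0].message.content.strip().upper()
    )
    model = MODELS.get(verdict, MODELS["COMPLEX"])  # default to the stronger tier
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```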
3. Output token capping
Most providers charge 3-5× more per output token than per input token. Unnecessarily long completions are pure waste.
- Set `max_tokens` aggressively. If your output is bounded (200 words for a summary), cap at 400 tokens.
- For tool-use loops, cap each turn so the model doesn't ramble.
- Use stop sequences to end output at natural boundaries.
A 50% reduction in average output length can mean a 30-40% reduction in your bill, since output tokens dominate cost.
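A minimal sketch of a capped summarization call; the model ID and the `###` stop sentinel are assumptions (your prompt would ask the model to end with it):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat model
    messages=[{
        "role": "user",
        "content": "Summarize this ticket in under 200 words, then write ###:\n...",
    }],
    max_tokens=400,   # hard cap: a ~200-word summary fits comfortably
    stop=["###"],     # cut the completion at the natural boundary
)
print(resp.choices[0].message.content)
```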
4. Switch from chat to completion-style
For structured tasks (classification, extraction, summarization), don't default to the full synchronous chat API. Some providers offer cheaper completion-style endpoints or batch APIs priced at roughly 50% of synchronous chat.
- OpenAI Batch API: 50% off, 24-hour SLA. Great for non-realtime workloads.
- Anthropic Batch API: 50% off, similar tradeoff.
- Provider-specific deals: check your provider's batch pricing.
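A hedged sketch of the OpenAI Batch API flow — write requests to a JSONL file, upload it, then submit a batch with a 24-hour completion window; the model ID and request bodies are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; custom_id lets you match results back to inputs.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model
            "messages": [{"role": "user", "content": f"Classify document {i}: ..."}],
            "max_tokens": 50,
        },
    }
    for i in range(100)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour SLA, at 50% of synchronous pricing
)
print(batch.id, batch.status)  # poll later and download the output file when complete
```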
5. Smaller open-source for commodity tasks
For tasks where quality differences are imperceptible (classification, extraction, simple summarization), self-hosted Llama 3.1 70B, Qwen 2.5, or DeepSeek can replace frontier API calls.
Real numbers from production teams:
- 5M tokens/day on Claude Sonnet: ~$1500/month
- Same volume on self-hosted Llama 3.1 70B: ~$300/month including GPU and ops costs
- Quality drop on simple tasks: minimal
- Quality drop on complex reasoning: substantial — keep frontier for that
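A back-of-the-envelope sketch that reproduces those ballpark figures; every rate below is an assumption chosen for illustration, not a quoted price:

```python
# Assumed figures: 5M tokens/day, a ~$10 per 1M-token blended API rate,
# and ~$300/month of GPU rental + ops for the self-hosted node.
tokens_per_month = 5_000_000 * 30

api_rate_per_m = 10.0                                          # assumed blended $/1M tokens
api_monthly = tokens_per_month / 1_000_000 * api_rate_per_m    # ≈ $1,500

gpu_monthly, ops_monthly = 250.0, 50.0                         # assumed amortized costs
selfhost_monthly = gpu_monthly + ops_monthly                   # ≈ $300

print(f"API: ${api_monthly:,.0f}/mo   self-hosted: ${selfhost_monthly:,.0f}/mo")
```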
6. Reduce context aggressively
- Trim conversation history. Most chats don't need all 20 prior turns.
- Summarize old context. Replace 10 old turns with one summary turn.
- Don't include retrieved chunks if they're not actually relevant. Use a reranker to pick top-3 instead of top-10.
- Audit system prompts. Most are 2-3× longer than they need to be.
A 30% context reduction translates roughly to a 30% input cost reduction.
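A minimal sketch of the summarize-old-turns idea; `summarizer` is a hypothetical callable you would point at a cheap model:

```python
def compact_history(messages, keep_last=6, summarizer=None):
    """Keep the most recent turns verbatim; collapse older turns into one summary turn.

    `messages` is a list of {"role": ..., "content": ...} dicts. `summarizer` is any
    callable that turns a transcript string into a short summary (assumption: a cheap
    model behind a one-line prompt).
    """
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarizer(transcript) if summarizer else transcript[:500]
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```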
7. Batch and async where possible
Many operations can be batched:
- Embed 100 documents in one API call instead of 100 calls (lower per-call overhead)
- Process overnight reports in async batch APIs
- Pre-compute responses for common queries
For real-time UX, batching adds latency. For background processing, batch APIs are 50% off across most providers.
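For example, one embeddings request can carry the whole batch of inputs; this sketch assumes the OpenAI SDK and the `text-embedding-3-small` model:

```python
from openai import OpenAI

client = OpenAI()

documents = [f"Document {i} text ..." for i in range(100)]  # placeholder corpus

# One request for 100 inputs instead of 100 requests: same per-token price,
# far less per-call overhead and latency.
resp = client.embeddings.create(model="text-embedding-3-small", input=documents)
vectors = [item.embedding for item in resp.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```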
Compounding the savings
Applied together, the optimizations multiply rather than add:
- Prompt caching: 50% off input cost (assumes 80% cache hit rate)
- Model routing: 70% off blended cost (most queries on cheap model)
- Output capping: 30% off output cost
- Context trimming: 30% off remaining input cost
- Batch APIs (where applicable): 50% off batch portion
A team that applies all of these can see total bills drop by 70-80% with no perceived quality decrease.
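A rough worked example of the multiplication, under an assumed 40/60 input/output cost split; in practice the factors overlap (routing to cheap models shrinks what caching can save), which is why realized savings land around 70-80% rather than higher:

```python
input_share, output_share = 0.4, 0.6    # assumed split of the original bill

input_cost = input_share * (1 - 0.5)    # prompt caching: 50% off input
input_cost *= (1 - 0.3)                 # context trimming: 30% off the remaining input
output_cost = output_share * (1 - 0.3)  # output capping: 30% off output

blended = (input_cost + output_cost) * (1 - 0.7)         # model routing: 70% off the blend
print(f"remaining bill: {blended:.0%} of the original")  # ≈ 17% on paper
```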
What NOT to do
Don't use the cheapest model for everything. It's a false economy when quality drops lead to customer complaints, more support tickets, and more refunds.
Don't disable streaming to "save money." Streaming doesn't change what you're billed; it only affects UX and perceived time-to-first-token.
Don't switch providers daily based on pricing. The engineering cost of integration changes outweighs the price differences for most teams.
Don't add complexity before measuring. If your bill is $200/month, optimizing isn't worth engineering time. Optimize when it's $2000+/month.
Measuring before optimizing
Before optimizing:
- Log every API call with: model, input tokens, output tokens, cost
- Aggregate by feature / endpoint to find what's expensive
- Identify the top 20% of calls driving 80% of cost
- Optimize those, leave the rest
Without measurement, you'll optimize the wrong things. Set up basic logging in week 1; optimize from week 2.
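A minimal logging wrapper, assuming the OpenAI SDK; the price table, CSV path, and `feature` label are placeholders to adapt:

```python
import csv
import time

from openai import OpenAI

client = OpenAI()

# Assumed prices in $ per 1M tokens; fill in your actual models and rates.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def logged_call(feature: str, **kwargs):
    """Wrap an LLM call so model, token counts, and cost land in a CSV you can aggregate."""
    resp = client.chat.completions.create(**kwargs)
    usage, price = resp.usage, PRICES[kwargs["model"]]
    cost = (usage.prompt_tokens * price["input"]
            + usage.completion_tokens * price["output"]) / 1_000_000
    with open("llm_costs.csv", "a", newline="") as f:
        csv.writer(f).writerow([time.time(), feature, kwargs["model"],
                                usage.prompt_tokens, usage.completion_tokens, f"{cost:.6f}"])
    return resp

# Usage: logged_call("support_faq", model="gpt-4o-mini",
#                    messages=[{"role": "user", "content": "..."}])
```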
Tools that help
- Helicone, Langfuse, Portkey — observability platforms with cost dashboards
- OpenRouter — multi-provider router with built-in fallback and pricing
- Martian, Anyscale — automated model routing services
- PostHog + custom — track LLM costs per feature
For early-stage teams, custom logging is sufficient. For production at scale, use a dedicated platform.
When NOT to obsess
- Total bill under $500/month: focus on product, not optimization
- Pre-product-market-fit: cost optimization can wait
- Burst traffic that's still small: amortized cost is what matters, not peak
Optimize when cost actually matters. Until then, ship faster.
Decision tree
- Bill < $500/month: don't optimize, focus on product
- Bill $500-5k/month: enable prompt caching + output capping + model routing
- Bill $5k-50k/month: all 7 optimizations + observability platform
- Bill > $50k/month: dedicated infrastructure team focus, possibly self-host
Next steps
- Set up cost logging by feature this week
- Enable prompt caching tomorrow (5-minute change for big savings)
- Add a simple model router for your top use case
- Read your provider's batch API docs for non-realtime workloads