If you're using a frontier model (Claude 4.7 Opus, GPT-5) for every single user query, you're overspending by 5-10×. The vast majority of queries — "what's the weather," "summarize this email," "thanks" — are trivial. They run perfectly well on Haiku, GPT-5 mini, or Gemini Flash at 1/10th the cost.
LLM routing is the discipline of picking the right model per query. Done well, it cuts your bill 60-80% with imperceptible quality loss. Done badly, it routes hard questions to weak models and your users hate the product.
This post is the practical playbook.
The cost ladder
In 2026, frontier API pricing roughly tiers like this:
- Frontier reasoning (GPT-5, Claude 4.7 Opus): $15-25/M input, $75-90/M output.
- Frontier general (Claude Sonnet 4.7, Gemini 3 Pro): $3-5/M input, $15-20/M output.
- Mid-tier (Claude Haiku 4.7, GPT-5 mini, Gemini Flash 3): $0.25-1/M input, $1-5/M output.
- Tiny (Llama 3.2 3B self-hosted, Phi-5, Qwen 2.5 7B): marginal cost approaches $0/M when self-hosted at scale (you're paying for GPUs, not tokens).
The gap from mid-tier to frontier is ~10×. The gap from tiny to frontier is ~50-100×.
If you can identify which queries go in which tier, the savings are massive.
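Here's the arithmetic as a quick sketch. The prices, per-query token counts, and 80/20 traffic split below are illustrative assumptions, not measurements:

# Back-of-the-envelope blended cost per query, using rough tier midpoints
# from the ladder above and ~500 input / ~300 output tokens per query.
PRICE = {  # ($ per 1M input tokens, $ per 1M output tokens)
    "frontier": (20.0, 80.0),
    "mid": (0.5, 3.0),
}

def cost_per_query(tier: str, in_tok: int = 500, out_tok: int = 300) -> float:
    p_in, p_out = PRICE[tier]
    return in_tok / 1e6 * p_in + out_tok / 1e6 * p_out

all_frontier = cost_per_query("frontier")                                  # ~$0.034 per query
routed = 0.8 * cost_per_query("mid") + 0.2 * cost_per_query("frontier")    # ~$0.008 per query
print(f"savings: {1 - routed / all_frontier:.0%}")                         # ~77%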
The taxonomy of routing strategies
Four mainstream approaches in 2026, in order of increasing complexity:
1. Manual rules (the boring solution)
Write if-statements. Different workflows go to different models:
- Customer support FAQ → Haiku.
- Hard reasoning / coding tasks → Opus.
- Document summarization → Sonnet.
- Spam / language detection → tiny self-hosted classifier.
This covers 70% of routing wins for 5% of the engineering effort. Don't skip this in pursuit of fancier methods.
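As a sketch, the whole strategy can be a dictionary plus a fallback. The workflow names and model handles here are illustrative placeholders, not a real API:

# Hypothetical per-workflow routing table; model names are placeholders.
MODEL_FOR_WORKFLOW = {
    "support_faq": "haiku",
    "coding": "opus",
    "summarize": "sonnet",
    "spam_check": "tiny-local-classifier",
}

def pick_model(workflow: str) -> str:
    # Unknown workflows fall back to the mid-tier default.
    return MODEL_FOR_WORKFLOW.get(workflow, "sonnet")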
2. Cascade (escalation)
Try the cheap model first. If it admits low confidence or fails a quality check, escalate.
def answer(query):
    # Try the cheap model first.
    cheap_response = haiku.answer(query)
    # Keep it only if it's confident and passes a quality check; otherwise escalate.
    if cheap_response.confidence > 0.8 and passes_check(cheap_response):
        return cheap_response
    return opus.answer(query)
The "check" can be:
- A quick LLM-as-judge call from another cheap model.
- A heuristic ("did the response actually contain a code block?").
- A user thumbs-down signal that retroactively triggers Opus.
Works well when most queries succeed at the cheap tier. Cost overhead: when you escalate, you pay for both calls.
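For example, a passes_check built from the first two options might look like the sketch below. cheap_judge is a hypothetical client for a small, fast model, and response.text is an assumed field on the response object:

def passes_check(response) -> bool:
    # Cheap heuristic gate first: an empty or one-word answer is an automatic fail.
    if len(response.text.split()) < 3:
        return False
    # Then an LLM-as-judge gate: ask another cheap model for a PASS/FAIL verdict.
    verdict = cheap_judge.complete(
        "Reply with only PASS or FAIL. Is the following answer complete, "
        "relevant, and free of obvious errors?\n\n" + response.text
    )
    return verdict.strip().upper().startswith("PASS")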
3. Classifier-based routing
Train (or prompt) a small model to classify each query into a difficulty tier. Route to the matching model.
def answer(query):
    tier = classifier.classify(query)  # "easy" / "medium" / "hard"
    model = MODEL_BY_CLASS[tier]       # e.g. {"easy": haiku, "medium": sonnet, "hard": opus}
    return model.answer(query)
The classifier is typically:
- A small LLM with a few-shot prompt.
- A fine-tuned BERT-style classifier.
- An embedding-based nearest neighbor on labeled examples.
Faster than cascade (no double-call) but requires you to label difficulty examples up front.
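For instance, the embedding-nearest-neighbor variant fits in a few lines. embed() stands in for whatever embedding model you use, and the labeled examples are illustrative; in practice you'd want dozens per tier:

import numpy as np

# A handful of hand-labeled examples per difficulty tier.
LABELED = [
    ("thanks, that helped!", "easy"),
    ("what's the capital of France?", "easy"),
    ("summarize this email thread for me", "medium"),
    ("why does my async Rust code deadlock here?", "hard"),
]

# embed() is a placeholder for your embedding model (e.g. a small local encoder).
EXAMPLE_VECS = np.array([embed(text) for text, _ in LABELED])
EXAMPLE_TIERS = [tier for _, tier in LABELED]

def classify(query: str) -> str:
    q = np.asarray(embed(query))
    # Cosine similarity against every labeled example; route like the nearest one.
    sims = EXAMPLE_VECS @ q / (np.linalg.norm(EXAMPLE_VECS, axis=1) * np.linalg.norm(q) + 1e-9)
    return EXAMPLE_TIERS[int(np.argmax(sims))]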
4. Learned routing (RouteLLM, LLM-Blender, etc.)
Frameworks like RouteLLM (LMSYS, 2024) train a router model on Chatbot Arena pairwise data. Given a query, it predicts the probability that a strong model would beat a weak model. Route based on that probability.
More accurate than classifier-based, but requires:
- Either using a pre-trained router (works for general chat) or...
- Training your own on your domain's data (expensive, only worth it at scale).
Serious teams in 2026 with high-volume traffic use this. Most teams should not yet.
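Stripped of the framework, the idea reduces to thresholding a predicted win probability. win_prob() below is a stand-in for whatever router model you train or download, and the threshold is a knob you tune; this is a conceptual sketch, not RouteLLM's actual API:

WIN_PROB_THRESHOLD = 0.3  # lower threshold = more traffic goes to the strong model

def route_by_win_prob(query: str) -> str:
    # win_prob() is a hypothetical trained predictor of
    # P(strong model's answer beats the weak model's) for this query.
    if win_prob(query) > WIN_PROB_THRESHOLD:
        return "strong-model"
    return "weak-model"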
A concrete starter implementation
In 2026, here's the simplest router that beats most over-engineered solutions:
def route(query: str, conversation_history: list) -> str:
    # Hard cases: always use the strong model
    if any(kw in query.lower() for kw in ["code", "debug", "why doesn't", "prove"]):
        return "opus"
    if len(query) > 1000:
        return "opus"  # Likely complex/document-heavy
    # Cheap cases: simple chitchat or one-line ask
    if len(query) < 50 and not any(c in query for c in "?:;"):
        return "haiku"
    # Default middle
    return "sonnet"
This gives you roughly 60% cost savings while still sending hard cases to the strongest model. It's also about a dozen lines and completely understandable.
Iterate on it by logging which queries got routed to which model and which generated user thumbs-down. Adjust rules.
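A minimal logging hook for that feedback loop might look like this; the JSONL schema is illustrative:

import json, time

def log_routing_decision(query: str, model: str, escalated: bool, thumbs_down: bool,
                         path: str = "routing_log.jsonl") -> None:
    # Append-only JSONL so you can later group by model, count thumbs-downs,
    # and compute the escalation rate.
    record = {
        "ts": time.time(),
        "model": model,
        "chars": len(query),
        "escalated": escalated,
        "thumbs_down": thumbs_down,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")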
The trap: under-routing on hard cases
The biggest failure mode of routers is mis-classifying a hard case as easy. The cheap model gives a confidently wrong answer; the user receives garbage.
Mitigations:
- Bias toward escalation. When in doubt, use the stronger model. The 10× cost on the rare hard query is much cheaper than user churn.
- Add escape valves. Let the cheap model say "I'm not sure, can you elaborate?" instead of bullshitting. System prompt: "If you cannot answer with high confidence, respond with [ESCALATE]." A sketch of this pattern follows the list.
- A/B test with quality scoring. Send 5% of routed traffic to both models, compare. If quality drops > 3%, your router is too aggressive.
- Use stronger models for first-time users. New users have zero patience for bad answers; bias their queries toward Opus until they've done 10+ messages.
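Here is a sketch of the escape-valve idea, reusing the same haiku/opus pseudo-clients as the cascade example; the system keyword argument and the .text field are assumptions about your client wrapper:

ESCAPE_SYSTEM_PROMPT = (
    "If you cannot answer with high confidence, respond with [ESCALATE] and nothing else."
)

def answer_with_escape_valve(query: str):
    draft = haiku.answer(query, system=ESCAPE_SYSTEM_PROMPT)
    if "[ESCALATE]" in draft.text:
        # The cheap model opted out; hand the query to the stronger model.
        return opus.answer(query)
    return draft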
Two patterns that work in 2026
Pattern A: Multi-tier with escalation
User query →
- Tiny classifier (Phi-5, 3B) decides easy/medium/hard.
- Easy → Haiku. Medium → Sonnet. Hard → Opus.
- After: simple LLM judge checks output quality.
- If judge says "low quality," silently re-run with the next tier up.
Good balance of cost and reliability. ~70% of queries land at Haiku/Sonnet, ~30% at Opus. Total cost ~25% of pure-Opus baseline.
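Wired together, Pattern A is roughly the loop below. classify(), MODELS, and judge_says_ok() are stand-ins for the tiny classifier, your model clients, and the LLM judge, not a specific library:

TIERS = ["haiku", "sonnet", "opus"]  # cheapest to strongest
START_TIER = {"easy": "haiku", "medium": "sonnet", "hard": "opus"}

def answer_multi_tier(query: str):
    start = TIERS.index(START_TIER[classify(query)])
    for model_name in TIERS[start:]:
        response = MODELS[model_name].answer(query)
        # Accept the answer if we're already at the top tier or the judge approves it;
        # otherwise silently retry on the next tier up.
        if model_name == "opus" or judge_says_ok(query, response):
            return response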
Pattern B: Specialized models per task type
User query →
- Intent classifier categorizes into: chitchat, code, search, math, image-gen, etc.
- Each intent has a hand-picked best-cost model:
  - chitchat → Haiku
  - code → Sonnet
  - math (hard) → o3 / Opus reasoning
  - image-gen → Flux / Imagen, not LLM
Useful when your product is structurally multi-modal — chat assistant + code assistant + image generator in one. Less useful for pure text Q&A.
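The dispatch itself is just the table above in code form; intent_classifier and the model handles are placeholders:

MODEL_FOR_INTENT = {
    "chitchat": "haiku",
    "code": "sonnet",
    "math_hard": "opus-reasoning",
    "image_gen": "image-model",  # routed to an image model, not an LLM
}

def dispatch(query: str) -> str:
    intent = intent_classifier.classify(query)  # assumed small intent classifier
    return MODEL_FOR_INTENT.get(intent, "sonnet")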
Tools and frameworks
You don't have to build this from scratch in 2026. Options:
- OpenRouter. Hosted multi-provider gateway with built-in routing. Single API, dynamic provider selection by price/availability. Best for getting started.
- Portkey / TrueFoundry. Enterprise gateways with policy-based routing and observability.
- RouteLLM (open source). Pre-trained routers from LMSYS. Free.
- LangChain Router / LlamaIndex Router. Framework-level routing components. Good if you're already on those stacks.
- Custom proxy. Many teams write a 200-line Python proxy that does manual rules. Works.
When NOT to route
- Low traffic (<1000 req/day). The complexity isn't worth the savings.
- Premium product where every bad answer is a churn risk. Stick with Opus until you have data showing safe downgrade paths.
- You haven't measured quality. Without an eval set you'll never know if your router is hurting users; routing decisions become superstition.
What to measure
For each model in your router:
- Routed traffic share. What % of queries went to each model?
- Cost per query (and total).
- Quality score (LLM-as-judge or human review on a sample).
- Latency. Smaller models are usually faster; verify it's the right tradeoff.
- Escalation rate. How often does the cheap model fail, forcing a re-run on the expensive one?
If escalation rate > 30%, your cheap tier is too weak or your classifier is too eager.
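If you're writing the JSONL log sketched earlier, most of these numbers fall out of a few lines of aggregation; the field names match that illustrative schema:

import json
from collections import Counter

def routing_report(path: str = "routing_log.jsonl") -> None:
    records = [json.loads(line) for line in open(path)]
    share = Counter(r["model"] for r in records)
    print("traffic share:", {m: f"{n / len(records):.0%}" for m, n in share.items()})
    print("escalation rate:", f"{sum(r['escalated'] for r in records) / len(records):.0%}")
    print("thumbs-down rate:", f"{sum(r['thumbs_down'] for r in records) / len(records):.0%}")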
Further reading
- RouteLLM: Learning to Route LLMs with Preference Data (LMSYS, 2024).
- LLM-Blender (Jiang et al., 2023), a pairwise routing approach.
- OpenRouter documentation.
- Look up: cascade inference, model gateway, prompt classification, LLM-as-judge.