There is a chance your product sends every request to the same model. A buyer asks, "What are your opening hours?" and that goes to Claude Opus.
A customer asks you to analyse a 50-page contract and summarise the key risks, and that also goes to Claude Opus. One of those tasks requires a world-class reasoning model. The other could be handled by something 60 times cheaper.
The price gap between model tiers is enormous. Premium models, such as Claude Opus or GPT-4, cost $30-60 per million input tokens.
Mid-tier models, such as Claude Sonnet or GPT-4 Turbo, cost $10-15. Lightweight models like Claude Haiku or GPT-4o-mini cost under $2. The difference is 30-60x.
LLM routing, also known as model routing or smart model selection, sends each request to the smallest and cheapest model that can handle it.
Instead of relying on a single model for everything, a routing layer analyses each prompt, estimates its complexity, and sends it to the appropriate tier.
Simple tasks go to cheap, fast models.
Complex tasks go to expensive, capable ones.
Organisations that implement routing report 30-70% cost reductions while maintaining the same quality scores.
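To see where savings of that size can come from, here is a back-of-the-envelope sketch. The prices are illustrative midpoints of the ranges quoted above, not live provider pricing, and the 50/20/30 traffic split is purely an assumption. Substitute your own numbers.

```python
# Illustrative input prices in USD per million tokens (assumed midpoints
# of the ranges quoted in this article, not live provider pricing).
PRICE_PER_MILLION = {"premium": 45.0, "mid": 12.5, "light": 1.0}


def blended_price(traffic_share: dict[str, float]) -> float:
    """Average price per million tokens for a given traffic split."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(PRICE_PER_MILLION[tier] * share
               for tier, share in traffic_share.items())


# Baseline: every request goes to the premium tier.
baseline = blended_price({"premium": 1.0})

# Hypothetical routed split: 50% light, 20% mid, 30% premium.
routed = blended_price({"light": 0.5, "mid": 0.2, "premium": 0.3})

savings = 1 - routed / baseline
print(f"baseline ${baseline:.2f}/M, routed ${routed:.2f}/M, savings {savings:.0%}")
```

With these assumed numbers, the routed blend comes to about $16.50 per million tokens against $45.00, a saving of roughly 63%. The sketch ignores the small cost of the routing step itself.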
Let’s dig in!
How a Router Works
An LLM router sits between your application and the models it calls. When a request comes in, the router decides which model should see it. That decision has three parts.
1. Classification.
The router determines what type of task this is. Is it summarisation? Code generation? Simple Q&A? Data extraction? Creative writing?
A small, fast model, such as GPT-4o-mini or Claude Haiku, runs a quick classification pass. This classification costs almost nothing: roughly $0.0001 per request.
2. Complexity scoring.
Within each task type, the router estimates the hardness of the specific request. A two-sentence summary request is different from summarising a 50-page legal document.
The router scores complexity based on factors such as token count, question structure, and the amount of reasoning the task requires.
3. Model selection.
Based on the task type and complexity score, the router selects a model. A simple lookup ("What is the capital of France?") goes to Haiku.
A multi-step analysis goes to Opus. Everything in between goes to Sonnet. The entire routing decision adds negligible latency, typically under 50 milliseconds. In many cases, routing actually reduces total latency because cheaper models respond faster than premium ones.

Think of it like a hospital triage desk.
Not everybody needs to see the head surgeon.
The triage nurse evaluates each case and sends a person with a sprained ankle to a nurse practitioner and a person with chest pain to a cardiologist.
The hospital runs more efficiently, patients with simple problems get seen faster, and the most expensive specialists focus on those who actually need them.
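The three steps (classification, complexity scoring, model selection) can be sketched in a few dozen lines. This is a toy under stated assumptions: the keyword rules stand in for the small classifier model, and the cue words, weights, and thresholds are invented for illustration rather than taken from any production router.

```python
from dataclasses import dataclass

# Hypothetical keyword rules standing in for the small classifier model.
TASK_KEYWORDS = {
    "summarisation": ("summarise", "summarize", "summary"),
    "code_generation": ("write a function", "implement", "refactor"),
    "data_extraction": ("extract", "parse"),
}

# Cue words that hint at multi-step reasoning (an illustrative list).
REASONING_CUES = ("analyse", "analyze", "compare", "risks", "trade-off", "why")


@dataclass
class RoutingDecision:
    task_type: str
    complexity: float  # 0.0 (trivial) to 1.0 (hard)
    model: str


def classify(prompt: str) -> str:
    """Step 1: decide the task type (a real router would call a small model)."""
    text = prompt.lower()
    for task_type, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return task_type
    return "simple_qa"


def score_complexity(prompt: str) -> float:
    """Step 2: estimate hardness from prompt length and reasoning cues."""
    text = prompt.lower()
    length_score = min(len(text.split()) / 2000, 1.0)  # saturates at 2,000 words
    cue_score = min(sum(cue in text for cue in REASONING_CUES) / 3, 1.0)
    return round(0.5 * length_score + 0.5 * cue_score, 3)


def select_model(task_type: str, complexity: float) -> str:
    """Step 3: map task type and score to a tier (thresholds are assumptions)."""
    if task_type == "simple_qa" and complexity < 0.2:
        return "haiku"
    if complexity >= 0.5:
        return "opus"
    return "sonnet"


def route(prompt: str) -> RoutingDecision:
    task_type = classify(prompt)
    complexity = score_complexity(prompt)
    return RoutingDecision(task_type, complexity, select_model(task_type, complexity))


print(route("What is the capital of France?"))
print(route("Analyse this contract, compare the clauses and summarise the key risks."))
```

The first prompt lands on the lightweight tier and the second on the premium tier. In practice you would replace `classify` with a call to a small model and calibrate the thresholds against your own evaluation data.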
Five Routing Strategies
There is no single way to route.
Teams choose based on cost, quality, reliability, or speed.
Intent-based routing classifies every request by task type and routes to specialised models.
A coding question gets directed to a model skilled in coding.
A creative writing task hits one strong at prose.
However, this requires knowing which models are strongest at which tasks, and which of them deliver the best quality per dollar.
Cascading routing begins with the cheapest model. The lightweight model always gets the request first and escalates only when needed.
If a quality check flags the response as insufficient (too short, low confidence, or poorly structured), the router retries with a mid-tier model. If that response is still not good enough, it escalates to premium. This approach minimises cost but adds latency for escalated requests.
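A minimal sketch of a cascade, assuming stub functions in place of real model calls and a deliberately crude quality check (minimum length plus an "I don't know" test). All names here are hypothetical.

```python
from typing import Callable

# Each tier is (name, function that takes a prompt and returns a response).
# The tiers list must be ordered cheapest first.
Tier = tuple[str, Callable[[str], str]]


def passes_quality_check(response: str, min_words: int = 5) -> bool:
    """Placeholder check: long enough and not an explicit 'I don't know'."""
    long_enough = len(response.split()) >= min_words
    return long_enough and "i don't know" not in response.lower()


def cascade(prompt: str, tiers: list[Tier]) -> tuple[str, str, int]:
    """Try tiers in order and return (tier name, response, attempts made)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for attempt, (name, call_model) in enumerate(tiers, start=1):
        response = call_model(prompt)
        is_last_tier = attempt == len(tiers)
        if passes_quality_check(response) or is_last_tier:
            # Accept a passing answer, or the top tier's answer as a last resort.
            return name, response, attempt
    raise AssertionError("unreachable: the last tier always returns")


# Stub models standing in for real API calls.
def light_stub(prompt: str) -> str:
    return "I don't know."


def mid_stub(prompt: str) -> str:
    return "The contract carries three key risks: liability, term, and renewal."


def premium_stub(prompt: str) -> str:
    return "A detailed clause-by-clause risk analysis would go here."


tiers = [("light", light_stub), ("mid", mid_stub), ("premium", premium_stub)]
print(cascade("Summarise the key risks in this contract.", tiers))
```

Here the lightweight stub fails the check, so the request is served by the mid tier on the second attempt. Returning the attempt count makes it easy to measure how often requests escalate.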

Cost-aware routing optimises the cost-quality tradeoff dynamically.
The router maintains a quality threshold, say 90% of premium-model quality, and selects the cheapest model that consistently meets it.
As models improve or pricing changes, the routing adjusts automatically.
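The selection rule itself is small. The sketch below assumes you already have a mean evaluation score per model from your own benchmark. The prices and quality scores shown are made-up placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    price_per_million: float  # USD per million input tokens (illustrative)
    quality: float            # mean score on your own eval suite, 0 to 1


def cheapest_meeting_threshold(models: list[ModelStats],
                               premium_name: str,
                               threshold: float = 0.90) -> ModelStats:
    """Return the cheapest model scoring at least threshold x premium quality."""
    premium = next(m for m in models if m.name == premium_name)
    required = threshold * premium.quality
    eligible = [m for m in models if m.quality >= required]
    return min(eligible, key=lambda m: m.price_per_million)


# Placeholder numbers for illustration only.
models = [
    ModelStats("haiku", 1.0, 0.78),
    ModelStats("sonnet", 12.5, 0.88),
    ModelStats("opus", 45.0, 0.94),
]
print(cheapest_meeting_threshold(models, "opus").name)

# If the lightweight model's score later improves, the choice changes
# without anyone editing a routing rule.
improved = [ModelStats("haiku", 1.0, 0.86)] + models[1:]
print(cheapest_meeting_threshold(improved, "opus").name)
```

With the placeholder scores, the mid-tier model is selected first. Once the lightweight model's score clears the bar, the same function picks it instead, which is the "adjusts automatically" behaviour described above.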
Semantic routing relies on embeddings to match requests to the right model based on meaning. The router computes an embedding for the incoming prompt and compares it to clusters of known prompt types.
Similar prompts that previously performed well on a cheaper model get routed there again.
Load-balanced routing distributes requests across providers for reliability and throughput.
If OpenAI's API is slow or returning errors, the router redirects to Anthropic or Google. It is less about cost and more about uptime.
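The failover half of this strategy can be sketched as a loop over providers in priority order. The provider names and the `ProviderError` exception are hypothetical stand-ins. A real client would catch the specific timeout and HTTP errors raised by each provider's SDK.

```python
from typing import Callable


class ProviderError(Exception):
    """Stands in for a timeout, rate limit, or 5xx error from a provider."""


def call_with_failover(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in priority order; return (provider name, response)."""
    errors = []
    for name, call_provider in providers:
        try:
            return name, call_provider(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this one failed
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers: the first simulates an outage, the second is healthy.
def failing_provider(prompt: str) -> str:
    raise ProviderError("503 service unavailable")


def healthy_provider(prompt: str) -> str:
    return f"ok: {prompt}"


providers = [("provider_a", failing_provider), ("provider_b", healthy_provider)]
print(call_with_failover("What are your opening hours?", providers))
```

This covers redirecting around a failing provider only. Spreading traffic for throughput would add a selection policy, such as round-robin or weighted random choice, in front of the same loop.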
Most production systems combine two or three of these.
A common pattern is intent-based routing as the primary strategy, cascading as a fallback, and load balancing across providers for reliability.
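Of the five strategies, semantic routing is the one that benefits most from a concrete illustration. The sketch below assumes prompts have already been converted to embeddings. The three-dimensional vectors, cluster names, and similarity cut-off are toy values chosen by hand, where a real system would use an embedding model and centroids learned from logged traffic.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy cluster centroids mapped to the model that handled them well before.
CLUSTERS = {
    "faq_lookup": ([0.9, 0.1, 0.0], "haiku"),
    "document_analysis": ([0.1, 0.9, 0.2], "opus"),
    "code_help": ([0.0, 0.2, 0.9], "sonnet"),
}


def route_by_embedding(embedding: list[float],
                       min_similarity: float = 0.8,
                       default_model: str = "opus") -> tuple[str, str]:
    """Return (cluster, model) for the nearest centroid, or a safe default."""
    scores = {
        cluster: cosine_similarity(embedding, centroid)
        for cluster, (centroid, _model) in CLUSTERS.items()
    }
    best_cluster = max(scores, key=scores.get)
    if scores[best_cluster] < min_similarity:
        # Nothing similar enough has been seen: fall back to the strongest model.
        return "unknown", default_model
    return best_cluster, CLUSTERS[best_cluster][1]


print(route_by_embedding([0.8, 0.2, 0.1]))  # close to the FAQ centroid
print(route_by_embedding([0.5, 0.5, 0.5]))  # not close to any centroid
```

Falling back to the premium model for unfamiliar prompts is one conservative design choice. A cascade would be another reasonable default.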

The Fallback Pattern
One routing pattern deserves special attention because it solves a problem every PM worries about: what happens when the cheap model gets it wrong?
The answer is fallback chains. The router sends the request to a tier-1 model.
A quality check evaluates the response.
It can be as simple as checking whether the response is above a minimum length, or as sophisticated as using another model to score it.
If the check passes, the response is returned. If it fails, it is automatically retried with a tier-2 model. If that also fails, it escalates to tier-3.
The key discipline is tracking your fallback rate.
If 40% of requests escalate from tier-1 to tier-2, your routing rules need adjustment. For every escalated request, you are paying for two inferences rather than one.
A well-tuned system keeps escalation rates below 10-15%.
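Tracking the fallback rate needs little more than a counter. The sketch below also works through why a high rate hurts, using assumed per-request costs for a two-tier chain. The class and function names are hypothetical.

```python
from collections import Counter


class FallbackTracker:
    """Counts how many attempts each request needed before it was served."""

    def __init__(self) -> None:
        self.served_after = Counter()

    def record(self, attempts: int) -> None:
        self.served_after[attempts] += 1

    def escalation_rate(self) -> float:
        """Share of requests that needed more than one attempt."""
        total = sum(self.served_after.values())
        if total == 0:
            return 0.0
        return (total - self.served_after[1]) / total


def expected_cost_per_request(tier1_cost: float, tier2_cost: float,
                              escalation_rate: float) -> float:
    """Every request pays for tier-1; escalated requests also pay for tier-2."""
    return tier1_cost + escalation_rate * tier2_cost


tracker = FallbackTracker()
for attempts in [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]:
    tracker.record(attempts)
print(f"escalation rate: {tracker.escalation_rate():.0%}")

# Assumed per-request costs in USD, for illustration only.
TIER1, TIER2 = 0.001, 0.0125
for rate in (0.40, 0.10):
    cost = expected_cost_per_request(TIER1, TIER2, rate)
    print(f"at {rate:.0%} escalation: ${cost:.5f} per request")
```

With these assumed costs, a 40% escalation rate works out to $0.006 per request against $0.00225 at 10%, which is why the rate is worth putting on a dashboard.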
When Routing Is Not Worth It
Routing adds complexity. It is not always justified.
If your product handles fewer than a few hundred requests per day, the cost savings are too small to justify the engineering overhead.
If every request needs frontier-model reasoning (advanced code generation or nuanced medical advice, for example), there is nothing to route to a cheaper tier.
Routing also requires evaluation.
You must know what quality level each model tier produces for your specific use case.
Without benchmarks, you are guessing which requests can safely go to a cheaper model. That guessing tends to produce either wasted money (routing too cautiously) or degraded quality (routing too aggressively).
The evaluation step is non-negotiable. Before routing a task type to a cheaper model, run your existing eval suite against both the premium and budget models.
If the budget model passes 95% of the same test cases, route confidently. If it is 70%, you have a quality gap that will show up in production.
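One way to turn that rule into a check, assuming your eval suite yields a pass/fail result per test case for each model. Reading "the same test cases" as the cases the premium model passes is an interpretation, and the 95% bar is the figure from the text rather than a universal constant.

```python
def relative_pass_rate(premium_results: list[bool],
                       budget_results: list[bool]) -> float:
    """Share of the premium model's passing cases that the budget model also passes."""
    if len(premium_results) != len(budget_results):
        raise ValueError("both models must be run on the same test cases")
    premium_passed = [i for i, ok in enumerate(premium_results) if ok]
    if not premium_passed:
        raise ValueError("premium model passed no cases; nothing to compare")
    also_passed = sum(budget_results[i] for i in premium_passed)
    return also_passed / len(premium_passed)


def safe_to_route(premium_results: list[bool], budget_results: list[bool],
                  min_rate: float = 0.95) -> bool:
    return relative_pass_rate(premium_results, budget_results) >= min_rate


# Made-up results for a 20-case suite, for illustration only.
premium = [True] * 20
budget_good = [True] * 19 + [False]      # 19 of 20
budget_weak = [True] * 14 + [False] * 6  # 14 of 20

print(relative_pass_rate(premium, budget_good), safe_to_route(premium, budget_good))
print(relative_pass_rate(premium, budget_weak), safe_to_route(premium, budget_weak))
```

Run a check like this per task type, since a budget model may clear the bar for simple lookups while falling short on extraction or summarisation.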
The numbers, not intuition, should drive routing rules.
The sweet spot is high-volume products with mixed-complexity workloads.
A customer support platform handling thousands of queries per day, where most questions are simple lookups but some require deep investigation, is the ideal case.
One such platform documented cutting monthly LLM spend from $42,000 to $18,000 by routing simple queries to Claude Haiku and complex escalations to Claude Sonnet, with identical customer satisfaction scores.
The Tools
You do not need to build routing from scratch. A growing ecosystem of AI gateways and routers handles this as infrastructure.
Portkey provides a unified API across 1,600+ models with built-in routing, fallback chains, cost tracking, and observability. It adds under 1 millisecond of latency and starts at $49/month.
OpenRouter offers a single API endpoint that routes across many providers, with automated model selection and cost optimisation.
Martian takes an algorithmic approach. It analyses each prompt and automatically selects the best model for the job without manual routing rules.
Not-Diamond maintains a curated catalogue of routing approaches and benchmarks, positioning itself as a research-informed router that selects models based on empirical performance data rather than heuristics.
For teams that want full control, building a custom router is the best option. You can implement a classification prompt, a routing table, and a fallback chain in a few hundred lines of code.
The gateways are valuable when you need observability, multi-provider failover, and managed infrastructure rather than building your own.
The models are getting cheaper and more capable at every tier.
But the gap between tiers is not shrinking. If anything, the spread between a $0.50/million-token model and a $60/million-token model is wider than ever.
Routing is how you use that spread to your advantage.
The question is whether you are paying premium prices for work that a model costing 60 times less could handle just as well.
