Haiku, Sonnet, Opus in 2026: which Claude model to use when
The most expensive mistake in LLM architecture is "let's use the bigger model just in case." In production this turns into a bill 10x larger than necessary, and no one notices because "it works." Meanwhile Sonnet and Opus differ by 5x on price, and Haiku and Opus by 60x.
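The two ratios above imply a third: if Opus is 5x Sonnet and 60x Haiku, then Sonnet is 12x Haiku. A quick sketch of the arithmetic, with relative units only (absolute prices are deliberately omitted; check current Anthropic pricing before budgeting):

```python
# Relative cost units derived from the article's ratios:
# Sonnet:Opus = 1:5 and Haiku:Opus = 1:60, hence Haiku:Sonnet = 1:12.
RELATIVE_COST = {"haiku": 1, "sonnet": 12, "opus": 60}

def cost_ratio(model: str, baseline: str = "haiku") -> float:
    """How many times more `model` costs than `baseline` at equal volume."""
    return RELATIVE_COST[model] / RELATIVE_COST[baseline]

print(cost_ratio("opus"))            # vs Haiku
print(cost_ratio("opus", "sonnet"))  # vs Sonnet
```

Same traffic, same prompts: the only variable is which tier you picked.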
Here's when to reach for which, with real project examples.
Quick cheat sheet
- Haiku — classification, routing, field extraction, simple text transforms. Anything with a short answer and a formalized task.
- Sonnet — the workhorse. Conversations, mid-level reasoning, code generation, agents with a handful of tools, RAG synthesis. 80% of production load should land here.
- Opus — complex planning, multi-step agents, high-cost-of-error tasks where the quality gap pays for the price. Scientific reconciliation, legal analysis, debugging rare bugs.
Rule of thumb: start with Haiku, move up to Sonnet when quality drops, move up to Opus only when Sonnet honestly can't hack it on your evals.
Three real cases
Case 1: support ticket classifier → Opus → Haiku
A team put Opus on classifying incoming tickets into 12 categories, reasoning "getting this wrong is expensive." Bill: $1,800/mo on 200 tickets/day.
I moved it to Haiku with a short system prompt and 5 few-shot examples. Accuracy dropped from 94% to 91%. Bill: $30/mo.
The 3-point accuracy gap cost the team 6 minutes of manual re-checks per day. The saved $1,770/mo bought a lot of other things.
Lesson: for classification, Haiku is almost always enough. Opus there is throwing money away.
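The Haiku setup from this case can be sketched as follows. `call_haiku`, the category names, and the few-shot examples are placeholders, not the client's actual prompt; the stub keeps the validation logic runnable without an API key:

```python
# Minimal few-shot ticket classifier. `call_haiku` stands in for a real
# model call (e.g. via the Anthropic SDK); here it is injected as a stub.
CATEGORIES = ["billing", "bug", "feature_request", "other"]  # 12 in the real case

FEW_SHOT = [
    ("I was charged twice this month", "billing"),
    ("The export button crashes the app", "bug"),
]

def build_prompt(ticket: str) -> str:
    examples = "\n".join(f"Ticket: {t}\nCategory: {c}" for t, c in FEW_SHOT)
    return (
        "Classify the ticket into exactly one category from: "
        + ", ".join(CATEGORIES) + "\n\n" + examples
        + f"\n\nTicket: {ticket}\nCategory:"
    )

def classify(ticket: str, call_haiku) -> str:
    raw = call_haiku(build_prompt(ticket)).strip().lower()
    # Guard against off-list answers: fall back to a safe bucket.
    return raw if raw in CATEGORIES else "other"

# Usage with a fake model:
print(classify("Refund please", lambda prompt: " Billing "))
```

The off-list fallback is the part teams forget: a small model will occasionally answer outside the taxonomy, and catching that in code is cheaper than upgrading the model.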
Case 2: IDE assistant agent → Haiku → Sonnet
A startup shipped an agent that reads code and proposes edits. They picked Haiku for speed. Users complained: the agent lost cross-file context, confused variable names, occasionally hallucinated methods.
I moved it to Sonnet. The bill grew 4x; quality grew far more. User NPS went up 30 points.
Lesson: for code reasoning tasks, Haiku has a low ceiling. Saving here kills the product.
Case 3: RAG over a legal knowledge base → Sonnet + Opus hybrid
Legal RAG: first retrieve relevant documents, then synthesize the answer. Sonnet synthesized well on standard questions, but on corner cases (contested interpretations) it confidently produced wrong answers.
Architecture:
- Sonnet generates an answer with a self-rated `<confidence>` field
- If confidence < 0.8, the same prompt is routed to Opus
- Opus's answer is returned to the user
Result: 85% of queries are served by Sonnet cheaply, 15% by Opus expensively but reliably. Total bill went up 40% vs. pure-Sonnet, but critical errors disappeared.
Lesson: hybrid pipelines with confidence-based routing are the most underrated optimization in LLM architecture.
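The routing from this case fits in a dozen lines. The model callers below are stubs; in production they would wrap real Sonnet/Opus calls, with the confidence parsed out of the `<confidence>` field the Sonnet prompt asks for:

```python
# Confidence-based routing sketch: serve cheap by default,
# escalate to the expensive model only when self-rated confidence is low.
CONFIDENCE_THRESHOLD = 0.8

def answer(query: str, call_sonnet, call_opus):
    text, confidence = call_sonnet(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text, "sonnet"
    # Low confidence: re-run the same prompt on Opus.
    opus_text, _ = call_opus(query)
    return opus_text, "opus"

# Fake callers to show the routing:
sonnet = lambda q: ("draft answer", 0.6)
opus = lambda q: ("careful answer", 0.95)
print(answer("contested clause?", sonnet, opus))
```

One caveat worth stating: self-rated confidence is a heuristic, not a calibrated probability. Validate the 0.8 threshold against your own eval set before trusting it with legal answers.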
When to definitely use Opus
- Multi-step agents with long tool-call chains (5+ steps)
- Tasks where you need to plan before executing (not react, but think)
- Debugging bugs with non-obvious causes — Opus holds the whole chain in its head better
- Code generation > 200 lines with consistent architecture
- Rare languages / domains where Sonnet's knowledge thins out
When to definitely not use Opus
- Classification, tagging, routing
- Plain-text summarization
- Field extraction from structured / semi-structured documents
- Short answers to fixed questions
- Any task where you have an eval dataset and Sonnet scores > 90%
About Haiku
Haiku has grown a lot since 2024. It now handles tasks that a year ago would have demanded Sonnet. The test is simple: write 30 representative requests, run them through Haiku and Sonnet, compare by hand. In half of products the difference is negligible.
Haiku's speed is a separate product factor. Where Sonnet answers in 4 seconds, Haiku answers in 0.8. In chat products where "feel alive" matters, this moves UX more than the quality gain from Sonnet.
How to pick for a new project
- Write 20-30 typical requests.
- Run them through all three models with the same prompt.
- Compare by hand or via LLM-as-judge.
- Multiply cost by projected monthly load.
- Pick the cheapest one that clears your quality bar.
Half a day of work. Pays for itself in the first month of production.
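The selection procedure above reduces to a few lines once you have scores. All numbers below are placeholders (the scores would come from your hand review or LLM-as-judge, the per-request costs from your actual token usage):

```python
# Pick the cheapest model that clears the quality bar on your eval set.
def pick_model(scores: dict, cost_per_1k_requests: dict,
               monthly_requests: int, quality_bar: float) -> str:
    candidates = [
        (cost_per_1k_requests[m] * monthly_requests / 1000, m)
        for m, score in scores.items() if score >= quality_bar
    ]
    if not candidates:
        raise ValueError("No model clears the bar; fix prompts or lower the bar.")
    return min(candidates)[1]  # cheapest passing model

scores = {"haiku": 0.87, "sonnet": 0.93, "opus": 0.95}  # placeholder eval scores
costs = {"haiku": 0.5, "sonnet": 6.0, "opus": 30.0}     # placeholder $/1k requests
print(pick_model(scores, costs, monthly_requests=60000, quality_bar=0.9))
```

Note the failure mode the `ValueError` covers: if nothing passes, the answer is to fix the prompt or the eval, not to silently ship the best loser.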
On model upgrades
Anthropic ships new versions (Sonnet 4.5 → 4.6 and so on). Don't switch blindly: run your eval set on the new model, compare. It happens that on general benchmarks the new one wins, but on your narrow domain it loses. Without your own evals you'll never see it.
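A minimal upgrade guard, assuming you keep per-item pass/fail scores from your eval set (the scores below are illustrative):

```python
# Only switch to a new model version if it doesn't regress on your evals.
def safe_to_switch(old_scores: list, new_scores: list,
                   max_regression: float = 0.01) -> bool:
    old_avg = sum(old_scores) / len(old_scores)
    new_avg = sum(new_scores) / len(new_scores)
    # Small tolerance for noise; block the upgrade on a real drop.
    return new_avg >= old_avg - max_regression

print(safe_to_switch([1, 1, 0, 1], [1, 1, 1, 0]))  # same average: switch is fine
print(safe_to_switch([1, 1, 1, 1], [1, 0, 1, 0]))  # clear regression: block it
```

With 20-30 eval items the averages are noisy, so treat a borderline result as "investigate the failing items," not as a verdict.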
What to do today
Open your Anthropic billing, find your top 3 endpoints by spend. For each ask: "If I downgrade the model here — what breaks?" If you don't know, you've never tested. Test.
Need help picking the right model architecture for your case? Get in touch. This is exactly the "AI infrastructure audit" package.