How to cut your LLM bill 10x without losing quality
The first month after shipping an LLM-powered product, the bill comes as a shock: $800 for what looks like a chatbot serving two hundred users. Teams panic, slap on rate limits, and declare that "this AI is too expensive for us."
Almost always the problem isn't the model, it's how you use it. Here are five techniques that reliably deliver 10x savings on real projects.
1. Prompt caching
The big one. If you have a 5000-token system prompt that goes with every request, you're paying to process it every time. That's 80% of the cost in a typical chat app.
Prompt caching in the Anthropic API: mark the stable part of your prompt, it gets cached for at least 5 minutes, and repeat requests pay roughly 10x less for that chunk (the initial cache write costs slightly more than a normal request).
```javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    // cache_control marks the stable prefix for reuse across requests
    { type: "text", text: LONG_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: userMessage }],
});
```
One extra field. On conversational systems that alone can cut costs by up to 70%.
2. Routing: light models for light tasks
Not everything your app does needs Opus. Classifying an incoming request, extracting parameters from text, simple transformations — Haiku or Sonnet 4.5 handles them fine.
Pattern: first request goes to Haiku — "what category is this, does it need heavy reasoning?" The answer tells you where to route next. Haiku is ~60x cheaper than Opus and distinguishes "yes" from "no" perfectly.
In production, 70–80% of requests can go through the smaller model. Do the math.
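The routing step can be sketched as a pure function. The model IDs and the keyword heuristic below are illustrative stand-ins; in production the classifier would itself be a cheap Haiku call returning a one-word answer:

```typescript
// Hypothetical two-tier router: a cheap classifier decides where the
// real request goes. The classifier is stubbed with a keyword heuristic
// here; swap it for an actual small-model call in production.
type Tier = "light" | "heavy";

function classify(request: string): Tier {
  // Stand-in for a Haiku call like: "Does this need deep reasoning? yes/no"
  const heavyMarkers = ["analyze", "compare", "plan", "why"];
  return heavyMarkers.some((m) => request.toLowerCase().includes(m))
    ? "heavy"
    : "light";
}

function pickModel(request: string): string {
  // Model IDs are examples; substitute whatever tiers you actually use.
  return classify(request) === "heavy" ? "claude-opus-4" : "claude-haiku-3-5";
}

console.log(pickModel("extract the order id from this email")); // → claude-haiku-3-5
```

The point isn't the heuristic, it's the shape: one cheap decision up front, and the expensive model only sees the requests that earn it.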
3. Streaming for UX, batching for backend
If you have a pipeline of several LLM calls whose intermediate output the user never sees, there's no reason to process them in real time. Anthropic's Batch API is 2x cheaper than standard requests, at the price of up to 24h latency.
Keep streaming only where a user is watching. Background jobs, executor agents, broadcasts, classifiers — all batch.
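As a sketch, queued background jobs can be packed into the Message Batches request shape, where each entry pairs a `custom_id` with normal message params. The `Job` type and model ID here are assumptions:

```typescript
// Build a Message Batches payload from queued background jobs.
// Each entry carries a custom_id so results can be matched back
// to the originating job when the batch completes.
interface Job {
  id: string;
  prompt: string;
}

function toBatchRequests(jobs: Job[], model: string) {
  return jobs.map((job) => ({
    custom_id: job.id,
    params: {
      model,
      max_tokens: 500,
      messages: [{ role: "user", content: job.prompt }],
    },
  }));
}

const requests = toBatchRequests(
  [{ id: "job-1", prompt: "Classify this ticket: ..." }],
  "claude-haiku-3-5",
);
// Submit with client.messages.batches.create({ requests }), then poll for results.
```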
4. Output tokens — where they eat money
Output costs ~5x more than input. If the model returns a 2000-token JSON where 1500 tokens are repeated field names and boilerplate, you're paying for noise.
Fix: structured output with a tight schema via tool use. Or ask the model for a compact format (CSV, short codes instead of names) and unpack server-side.
Also: set max_tokens aggressively. Most tasks don't need a 4000-token answer. A cap of 500 won't make the model more concise by itself (it just truncates), so pair it with an instruction to answer briefly.
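A sketch of the compact-format trick: ask the model for one CSV line of short codes and expand it server-side. The code tables below are invented for illustration; the point is that your server owns the verbose names, so the model only emits a few tokens per field:

```typescript
// Instead of verbose JSON, the model is asked to emit one CSV line:
// "category_code,urgency_code,needs_human(0/1)", e.g. "B,H,1".
// The lookup tables live server-side and cost zero output tokens.
const CATEGORY: Record<string, string> = {
  B: "billing",
  T: "technical_support",
  S: "sales",
};
const URGENCY: Record<string, string> = { L: "low", M: "medium", H: "high" };

function unpack(line: string) {
  const [cat, urg, human] = line.trim().split(",");
  return {
    category: CATEGORY[cat] ?? "unknown",
    urgency: URGENCY[urg] ?? "unknown",
    needsHuman: human === "1",
  };
}

console.log(unpack("B,H,1")); // billing ticket, high urgency, escalate to a human
```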
5. Small model as a preprocessor
An underused trick with big savings. Before sending to Opus, run the request through a small local model (Llama 3.2 3B, Phi-4, anything). Its job: clean noise, extract structure, cut redundant text.
Example: user sends a 3000-token request with chat history, noise, duplicates. A local 3B model compresses it to 400 clean tokens in milliseconds. Opus sees less → you pay less on input.
The local model is effectively free if you already have a server; at around 10K requests/month, the savings on the expensive model pay for the infrastructure.
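The break-even math is worth doing explicitly. Every number below is an assumption (Opus-class list price, the 3000-to-400 token compression from the example above, a made-up server cost); plug in your own:

```typescript
// Back-of-envelope break-even for the local preprocessor.
// All constants are assumptions, not real quotes.
const INPUT_PRICE_PER_MTOK = 15;   // big-model input price, $/1M tokens
const TOKENS_BEFORE = 3000;        // raw request with history and noise
const TOKENS_AFTER = 400;          // after local compression
const SERVER_COST_PER_MONTH = 400; // hypothetical GPU box for the 3B model

function savedPerRequest(): number {
  // Dollars of big-model input you no longer pay for, per request
  return ((TOKENS_BEFORE - TOKENS_AFTER) / 1_000_000) * INPUT_PRICE_PER_MTOK;
}

function breakEvenRequests(): number {
  // Monthly volume at which the compression pays for the server
  return Math.ceil(SERVER_COST_PER_MONTH / savedPerRequest());
}

console.log(breakEvenRequests()); // roughly 10K requests/month with these numbers
```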
What not to do
Don't optimize prematurely. First build on the best model, prove it works, then cut cost. Saving 90% on something that doesn't work is pointless.
Don't switch providers for price alone. GPT-4 and Claude have different prices and different behavior. A 20% saving doesn't cover the hours you'll spend debugging the differences.
Don't try to remove the LLM from the pipeline. If a task is doable with regex, it should already be regex. You put an LLM there because it couldn't be done otherwise.
Final strategy
In practice the stack looks like this: prompt caching + 2–3 model routing by complexity + batch for everything non-interactive + structured output to save on responses. That's 5–10x off a naive baseline.
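As a sanity check, the stacked effect can be approximated with a few multipliers. The factors below are assumptions about traffic mix, not guaranteed discounts:

```typescript
// Rough multiplier math for the combined strategy, applied to a
// hypothetical $800/month naive baseline. Each factor is an assumption.
const CACHE_FACTOR = 0.4;    // caching cuts the repeated-prefix input cost
const ROUTING_FACTOR = 0.45; // most traffic lands on a much cheaper tier
const BATCH_FACTOR = 0.75;   // half-price batch on the non-interactive share

function stackedCost(baseline: number): number {
  return baseline * CACHE_FACTOR * ROUTING_FACTOR * BATCH_FACTOR;
}

console.log(stackedCost(800)); // $800 baseline drops into the low hundreds
```

With these particular factors the bill drops by about 7x, squarely inside the 5-10x range; tweak them to match your own traffic.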
Need help optimizing LLM cost in your product? Drop a line — happy to look at the specifics.