How to cut your LLM bill 10x without losing quality
The first month after shipping an LLM-powered product, the bill comes as a shock: $800 for what looks like a chatbot serving two hundred users. Teams panic, slap on rate limits, and declare that "this AI is too expensive for us."
Almost always the problem isn't the model, it's how you use it. Here are five techniques that reliably deliver 10x savings on real projects.
1. Prompt caching
The big one. If you have a 5000-token system prompt that goes with every request, you're paying to process it every time. That's 80% of the cost in a typical chat app.
Prompt caching in the Anthropic API: mark the stable part of your prompt, it gets cached for at least 5 minutes, and repeat requests pay roughly 10x less for that chunk (the initial cache write costs slightly more than a normal request).
```javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    // cache_control marks the stable prefix for reuse across requests
    { type: "text", text: LONG_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: userMessage }],
});
```
One extra field. On conversational systems that alone can cut costs by up to 70%.
2. Routing: light models for light tasks
Not everything your app does needs Opus. Classifying an incoming request, extracting parameters from text, simple transformations — Haiku or Sonnet 4.5 handles them fine.
Pattern: first request goes to Haiku — "what category is this, does it need heavy reasoning?" The answer tells you where to route next. Haiku is ~60x cheaper than Opus and distinguishes "yes" from "no" perfectly.
In production, 70–80% of requests can go through the smaller model. Do the math.
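The routing step can be sketched as a pure function. The model IDs and the keyword heuristic below are illustrative stand-ins; in production the classifier would itself be a cheap Haiku call returning a one-word answer:

```typescript
// Hypothetical two-tier router: a cheap classifier decides where the
// real request goes. The classifier is stubbed with a keyword heuristic
// here; swap it for an actual small-model call in production.
type Tier = "light" | "heavy";

function classify(request: string): Tier {
  // Stand-in for a Haiku call like: "Does this need deep reasoning? yes/no"
  const heavyMarkers = ["analyze", "compare", "plan", "why"];
  return heavyMarkers.some((m) => request.toLowerCase().includes(m))
    ? "heavy"
    : "light";
}

function pickModel(request: string): string {
  // Model IDs are examples; substitute whatever tiers you actually use.
  return classify(request) === "heavy" ? "claude-opus-4" : "claude-haiku-3-5";
}

console.log(pickModel("extract the order id from this email")); // → claude-haiku-3-5
```

The point isn't the heuristic, it's the shape: one cheap decision up front, and the expensive model only sees the requests that earn it.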
3. Streaming for UX, batching for backend
If you have a pipeline of several LLM calls whose intermediate output the user never sees, there's no reason to process them in real time. Anthropic's Batch API is 2x cheaper than standard requests, at the price of up to 24h latency.
Keep streaming only where a user is watching. Background jobs, executor agents, broadcasts, classifiers — all batch.
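As a sketch, queued background jobs can be packed into the Message Batches request shape, where each entry pairs a `custom_id` with normal message params. The `Job` type and model ID here are assumptions:

```typescript
// Build a Message Batches payload from queued background jobs.
// Each entry carries a custom_id so results can be matched back
// to the originating job when the batch completes.
interface Job {
  id: string;
  prompt: string;
}

function toBatchRequests(jobs: Job[], model: string) {
  return jobs.map((job) => ({
    custom_id: job.id,
    params: {
      model,
      max_tokens: 500,
      messages: [{ role: "user", content: job.prompt }],
    },
  }));
}

const requests = toBatchRequests(
  [{ id: "job-1", prompt: "Classify this ticket: ..." }],
  "claude-haiku-3-5",
);
// Submit with client.messages.batches.create({ requests }), then poll for results.
```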
4. Output tokens — where they eat money
Output costs ~5x more than input. If the model returns a 2000-token JSON where 1500 tokens are repeated field names and boilerplate, you're paying for noise.
Fix: structured output with a tight schema via tool use. Or ask the model for a compact format (CSV, short codes instead of names) and unpack server-side.
Also: set max_tokens aggressively. Most tasks don't need a 4000-token answer. A cap of 500 won't make the model more concise by itself (it just truncates), so pair it with an instruction to answer briefly.
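A sketch of the compact-format trick: ask the model for one CSV line of short codes and expand it server-side. The code tables below are invented for illustration; the point is that your server owns the verbose names, so the model only emits a few tokens per field:

```typescript
// Instead of verbose JSON, the model is asked to emit one CSV line:
// "category_code,urgency_code,needs_human(0/1)", e.g. "B,H,1".
// The lookup tables live server-side and cost zero output tokens.
const CATEGORY: Record<string, string> = {
  B: "billing",
  T: "technical_support",
  S: "sales",
};
const URGENCY: Record<string, string> = { L: "low", M: "medium", H: "high" };

function unpack(line: string) {
  const [cat, urg, human] = line.trim().split(",");
  return {
    category: CATEGORY[cat] ?? "unknown",
    urgency: URGENCY[urg] ?? "unknown",
    needsHuman: human === "1",
  };
}

console.log(unpack("B,H,1")); // billing ticket, high urgency, escalate to a human
```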
5. Small model as a preprocessor
An underused trick with big savings. Before sending to Opus, run the request through a small local model (Llama 3.2 3B, Phi-4, anything). Its job: clean noise, extract structure, cut redundant text.
Example: user sends a 3000-token request with chat history, noise, duplicates. A local 3B model compresses it to 400 clean tokens in milliseconds. Opus sees less → you pay less on input.
The local model is effectively free if you already have a server; at around 10K requests/month, the savings on the expensive model pay for the infrastructure.
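The break-even math is worth doing explicitly. Every number below is an assumption (Opus-class list price, the 3000-to-400 token compression from the example above, a made-up server cost); plug in your own:

```typescript
// Back-of-envelope break-even for the local preprocessor.
// All constants are assumptions, not real quotes.
const INPUT_PRICE_PER_MTOK = 15;   // big-model input price, $/1M tokens
const TOKENS_BEFORE = 3000;        // raw request with history and noise
const TOKENS_AFTER = 400;          // after local compression
const SERVER_COST_PER_MONTH = 400; // hypothetical GPU box for the 3B model

function savedPerRequest(): number {
  // Dollars of big-model input you no longer pay for, per request
  return ((TOKENS_BEFORE - TOKENS_AFTER) / 1_000_000) * INPUT_PRICE_PER_MTOK;
}

function breakEvenRequests(): number {
  // Monthly volume at which the compression pays for the server
  return Math.ceil(SERVER_COST_PER_MONTH / savedPerRequest());
}

console.log(breakEvenRequests()); // roughly 10K requests/month with these numbers
```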
What not to do
Don't optimize prematurely. First build on the best model, prove it works, then cut cost. Saving 90% on something that doesn't work is pointless.
Don't switch providers for price alone. GPT-4 and Claude have different prices and different behavior. A 20% saving doesn't cover the hours you'll spend debugging the differences.
Don't try to remove the LLM from the pipeline. If a task is doable with regex, it should already be regex. You put an LLM there because it couldn't be done otherwise.
Final strategy
In practice the stack looks like this: prompt caching + 2–3 model routing by complexity + batch for everything non-interactive + structured output to save on responses. That's 5–10x off a naive baseline.
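As a sanity check, the stacked effect can be approximated with a few multipliers. The factors below are assumptions about traffic mix, not guaranteed discounts:

```typescript
// Rough multiplier math for the combined strategy, applied to a
// hypothetical $800/month naive baseline. Each factor is an assumption.
const CACHE_FACTOR = 0.4;    // caching cuts the repeated-prefix input cost
const ROUTING_FACTOR = 0.45; // most traffic lands on a much cheaper tier
const BATCH_FACTOR = 0.75;   // half-price batch on the non-interactive share

function stackedCost(baseline: number): number {
  return baseline * CACHE_FACTOR * ROUTING_FACTOR * BATCH_FACTOR;
}

console.log(stackedCost(800)); // $800 baseline drops into the low hundreds
```

With these particular factors the bill drops by about 7x, squarely inside the 5-10x range; tweak them to match your own traffic.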
Need help optimizing LLM cost in your product? Drop a line — happy to look at the specifics.