#LLM #prompt-caching #cost #Anthropic

Prompt caching: the one-line change that cut our LLM bill 4x

2026-04-24


The biggest hole in most LLM budgets isn't model choice or verbose outputs — it's resending the same context on every request. A 3k-token system prompt, a 15k knowledge base, 2k of examples — 500 times a day. You pay each time like it's the first.

Prompt caching fixes this with a one-field change and zero quality loss. Here's how it works on the Anthropic API, and why most teams still don't turn it on because of a few wrong intuitions.

What prompt caching actually is

The idea is simple. The Claude API lets you mark parts of a prompt as cacheable. The first request costs slightly more (1.25x), but the prompt prefix result is stored server-side for 5 minutes (or 1 hour in the extended variant). Every subsequent request with the same prefix:

  • Skips reprocessing
  • Costs 10x less for those tokens (0.1x input price)
  • Responds faster — no need to re-chew a large prefix
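The arithmetic is worth making explicit. Here's a minimal sketch of the effective cost multiplier for the cached part of a prompt, using the two multipliers above (1.25x on writes, 0.1x on reads); the function name is mine:

```javascript
// Effective input-cost multiplier for the cached part of a prompt,
// given a cache hit rate between 0 and 1. Misses pay the write price
// (the cache is rebuilt), hits pay the read price.
function effectiveMultiplier(hitRate, writeMult = 1.25, readMult = 0.1) {
  return (1 - hitRate) * writeMult + hitRate * readMult;
}

// At a 90% hit rate the cached tokens cost ~0.215x of the base input
// price, i.e. roughly 78% cheaper than resending them every time.
```

That 0.215x applies only to the cached prefix; the volatile suffix still bills at full price, which is why whole-bill reductions land in the 60-80% range rather than 90%.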

Savings scale with how much of your prompt is repeated. From real projects:

  • RAG bot with a fixed 20k-token knowledge base → -75% bill
  • Agent with long system prompt + tools → -60%
  • Code assistant reading the same file context → -80%

How to turn it on (Anthropic SDK, Node/Python)

One field, cache_control, on the block you want cached:

const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2000,
  system: [
    {
      type: "text",
      text: BIG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: userQuery }]
});

That's it. The first response returns cache_creation_input_tokens: 3000; subsequent ones return cache_read_input_tokens: 3000 at 1/10th the price.
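For monitoring, a small helper that pulls those fields out of the response's usage object. The field names match the API response; the helper itself and the sample shapes are illustrative:

```javascript
// Extract caching stats from a Messages API response's `usage` object.
// Fields that are absent (e.g. no cache configured) default to 0.
function cacheStats(usage) {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const uncached = usage.input_tokens ?? 0;
  return { created, read, uncached, hit: read > 0 };
}

// First request:  { cache_creation_input_tokens: 3000, input_tokens: 12 }
// Later requests: { cache_read_input_tokens: 3000, input_tokens: 15 }
```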

Where to place cache breakpoints

Order matters. Caching works by prefix: if a single byte changes before a cache_control point, the cache invalidates and rebuilds.

The right hierarchy (stable → volatile):

  1. tools (if any) — rarely change
  2. system prompt — static
  3. Knowledge base / document context — changes per session
  4. Chat history — grows
  5. Latest user message — always new

Up to 4 cache_control breakpoints per request. Typical setup: one at the end of system, one at the end of history.
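Sketched in code, that typical two-breakpoint setup looks like this. BIG_SYSTEM_PROMPT, priorTurns, and the request-builder itself are placeholders; the cache_control placement is the point:

```javascript
// Two breakpoints: one at the end of the static system block, one on the
// last turn of the prior history, so the whole history is cached as a prefix.
const BIG_SYSTEM_PROMPT = "You are a support assistant. <static instructions>";

function buildRequest(priorTurns, userQuery) {
  const history = priorTurns.map((m, i) =>
    i === priorTurns.length - 1
      ? // Breakpoint 2: last historical message gets cache_control.
        { role: m.role, content: [{ type: "text", text: m.content, cache_control: { type: "ephemeral" } }] }
      : m
  );
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 2000,
    system: [
      // Breakpoint 1: static system prompt.
      { type: "text", text: BIG_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [...history, { role: "user", content: userQuery }],
  };
}
```

On each new turn the previous breakpoint's prefix is read from cache and only the delta since then is written, which is what keeps multi-turn conversations cheap.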

Why teams don't turn it on

Three reasons I keep hearing.

"Our prompt is dynamic, nothing to cache." Check with token counts, not intuition. Log how many tokens live in system vs. the user message. In 80% of cases system is 70% of the request, and you're only rewriting it between users — not between requests from the same user.

"Scared something will break." It won't. Either the cache hits or misses — response correctness is identical. The only risk is placing the breakpoint wrong (before the volatile part), and then you just see no savings. Watch cache_read_input_tokens in the response.

"5-minute TTL is useless, our traffic is hourly." First, extended cache (1 hour) removes that objection. Second, even 5 minutes covers the typical chat session, RAG bot, or agent chain. If your traffic is truly once an hour, you don't have an LLM product — you have a cron job, and there caching truly doesn't matter.

Edge cases

Tools with dynamic parameters. If your tools JSON is generated on the fly (e.g. available functions depend on user role), the cache misses. Fix: split tools into static and dynamic halves, place the breakpoint between them.
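A sketch of that split (the tool names and the helper are invented; the point is that the breakpoint sits on the last static tool, so changes to the dynamic tail never invalidate the cached prefix):

```javascript
// Static tools go first and end with the cache breakpoint; role-specific
// tools are appended after it, outside the cached prefix.
function buildTools(staticTools, dynamicToolsForRole) {
  const cached = staticTools.map((t, i) =>
    i === staticTools.length - 1 ? { ...t, cache_control: { type: "ephemeral" } } : t
  );
  return [...cached, ...dynamicToolsForRole];
}
```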

Different models, one cache. Cache is per-model. Switch Sonnet → Haiku and it rebuilds.

Paying for cache_creation. The first request costs 1.25x, so if the prefix is never reused (the user walks away), that 25% surcharge is a pure loss. In systems with low hit rates (under 20%), caching makes the bill worse, not better. Track hit rate in your metrics.
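That ~20% threshold isn't arbitrary; it falls out of the multipliers. A quick check, assuming the 1.25x write / 0.1x read pricing from earlier:

```javascript
// Average cost per cached token at hit rate h is (1 - h) * 1.25 + h * 0.1.
// Caching breaks even where that equals 1 (the uncached price):
//   (1 - h) * 1.25 + h * 0.1 = 1  =>  h = 0.25 / 1.15
const breakEvenHitRate = (1.25 - 1) / (1.25 - 0.1);
console.log(breakEvenHitRate.toFixed(3)); // ≈ 0.217
```

Below roughly a 22% hit rate you're paying more than you save, which is where the rule of thumb comes from.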

What to measure

Log three fields from every API response:

  • cache_creation_input_tokens
  • cache_read_input_tokens
  • input_tokens (uncached)

Cache hit rate = cache_read / (cache_read + input_tokens). Target above 60%. Below 30% means you placed cache_control in the wrong spot.
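As code, over a batch of logged responses, that's a one-liner worth of aggregation (a minimal helper using the same formula):

```javascript
// Token-weighted cache hit rate across a batch of logged `usage` objects:
// cache_read / (cache_read + input_tokens).
function cacheHitRate(usageLog) {
  let read = 0;
  let uncached = 0;
  for (const u of usageLog) {
    read += u.cache_read_input_tokens ?? 0;
    uncached += u.input_tokens ?? 0;
  }
  return read / (read + uncached);
}
```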

What to do today

Pick your highest-traffic endpoint and look at the prompt structure. If there are more than ~1024 static tokens before the first volatile piece (the minimum cacheable prefix on most Claude models), add cache_control and check your bill a week later.

A one-off change that pays for itself within the first dozen requests.

Need help tuning LLM cost for your use case? Get in touch. This falls into the "AI infrastructure audit" package.