Evals: how to know your LLM product actually works
The most common mistake in LLM products: a developer writes a prompt, eyeballs five examples, sees it works, ships. A month later, users complain. What changed? No idea. Same prompt, same code, same model. But quality dropped.
The answer is almost always the same: no metrics, so the regression crept in unseen. You had no baseline, no regression tests, no way to compare yesterday to today.
Here's how to fix that.
Why LLM products are especially fragile
In normal code, regressions are visible: a test fails, the build turns red. In an LLM product the output is always something — the model doesn't crash, it just answers worse. 5% worse, then 10% worse, then users quietly leave.
Causes of silent degradation:
- The provider updated the model under the same name (common in 2023–2024, rarer now but still happens)
- You tweaked one prompt and quietly broke an adjacent scenario
- Input distribution shifted — users started sending something different
- A new SDK / embeddings library version shifted behavior
Without evals, you learn about any of these from a complaining customer, not from a metric.
Minimal eval pipeline in a day
Three ingredients.
1. A dataset — 30–100 examples labeled "good/bad" or with ideal answers. Sources: real production logs, business scenarios, corner cases you fear. Go through them by hand, mark the expected outcome. Store them in git as JSONL. Not Notion, not Google Sheets.
2. A runner — a script that runs the current product on the whole dataset and saves responses. Always use temperature: 0 in evals; otherwise you're measuring sampling noise, not the system. (Even at 0, some providers aren't perfectly deterministic, but it removes most of the variance.)
3. A grader — a way to compare output to expectation. Three flavors:
- Exact match (for structured outputs: JSON, classification)
- Similarity (cosine over embeddings for free text — fast but rough)
- LLM-as-judge (a stronger model scores outputs against a rubric — expensive but reliable)
In practice you mix: exact match for classifiers, judge for open-ended answers.
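Put together, the three ingredients fit in one short script. This is a sketch, not a drop-in: `call_pipeline` stands in for whatever function wraps your actual product (with temperature set to 0 inside it), and the JSONL field names are assumptions — use whatever schema fits your task.

```python
import json


def load_dataset(path: str) -> list[dict]:
    """Each JSONL line: {"id": ..., "input": ..., "expected": ...}"""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match(output: str, expected) -> bool:
    """Grader for structured outputs: parse the JSON, compare exactly."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False


def run_evals(dataset: list[dict], call_pipeline) -> dict:
    """Run every example through the product and score the whole set."""
    rows = []
    for ex in dataset:
        output = call_pipeline(ex["input"])  # temperature=0 inside
        rows.append({"id": ex["id"], "output": output,
                     "passed": exact_match(output, ex["expected"])})
    score = sum(r["passed"] for r in rows) / len(rows)
    return {"score": score, "rows": rows}
```

A judge-based grader slots in the same way: replace `exact_match` with a function that calls a stronger model and parses its verdict.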
What to measure beyond quality
Cost per request. If answers are 2x better but 5x more expensive, that's an economic regression, not an improvement.
Latency. Users leave after 8 seconds. Track p50, p95, p99 — averages lie.
Structural error rate. If you demand JSON, what percentage parses? Should be 99.9%, or something's broken.
Output length. Drifts silently — the model starts answering longer, you pay for extra tokens, UX degrades.
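Assuming each result row carries a `latency_ms` field and the raw output (field names are hypothetical), all four secondary metrics fit in a short function:

```python
import json
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; fine for eval-sized samples."""
    s = sorted(values)
    k = max(0, math.ceil(p * len(s) / 100) - 1)
    return s[k]


def _parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def secondary_metrics(results: list[dict]) -> dict:
    """Latency percentiles, JSON parse rate, and output-length drift."""
    latencies = [r["latency_ms"] for r in results]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "json_parse_rate": sum(_parses(r["output"]) for r in results) / len(results),
        "avg_output_chars": sum(len(r["output"]) for r in results) / len(results),
    }
```

Track these alongside the quality score in the same report; a quality win that doubles p95 latency is still a regression.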
CI integration
Evals should run automatically on every PR that touches prompts or pipeline code. Don't gate on an absolute gold score: compare the new run against the previous main. A 3% relative drop is a warning; a 10% drop blocks the merge.
All of this is a half-hour of GitHub Actions. A script, datasets in the repo, the runner calls your pipeline and compares to the previous commit's baseline.
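The gate itself is a relative comparison against the stored main baseline, using the thresholds above (file names and score shape are hypothetical):

```python
WARN_DROP = 0.03   # 3% relative drop vs main: print a warning
BLOCK_DROP = 0.10  # 10% relative drop vs main: fail the CI job


def gate(new_score: float, baseline_score: float) -> int:
    """Return a shell-style exit code: 0 = pass, 1 = block the PR."""
    drop = (baseline_score - new_score) / baseline_score
    if drop >= BLOCK_DROP:
        print(f"BLOCK: score dropped {drop:.1%} vs main")
        return 1
    if drop >= WARN_DROP:
        print(f"WARNING: score dropped {drop:.1%} vs main")
    return 0
```

In a GitHub Actions step you'd call this after the runner, load the two scores from something like `baseline.json` and `results.json`, and `sys.exit()` on the return value — a nonzero exit fails the check.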
Antipatterns
"We checked it by hand, it's fine." You checked 5 cases. Production sees 5000. You can't eyeball that.
Too small a dataset. 10 examples catch nothing. 30 minimum, 100 better. Diversity matters more than volume.
LLM-as-judge using the same model as production. The model grades itself with inflated confidence. Your judge should be a different provider or a stronger model.
Scoring "does it sound good." Fuzzy rubric = useless score. You need concrete criteria: did the model mention N key facts, did it hallucinate, did it obey the format.
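A concrete judge rubric might look like this — the prompt wording and field names are illustrative; the point is yes/no checkable criteria rather than "rate how good this sounds":

```python
JUDGE_RUBRIC = """You are grading a model's answer against a reference.
Answer each question with yes/no, then give a final verdict.

1. Does the answer mention these key facts: {key_facts}?
2. Does the answer contain any claim NOT supported by the reference?
3. Does the answer follow the required format ({format_spec})?

Reference answer:
{expected}

Answer to grade:
{output}

Verdict (PASS/FAIL):"""


def build_judge_prompt(example: dict, output: str) -> str:
    """Fill the rubric for one dataset example and one model output."""
    return JUDGE_RUBRIC.format(
        key_facts=", ".join(example["key_facts"]),
        format_spec=example.get("format_spec", "plain text"),
        expected=example["expected"],
        output=output,
    )
```

Send the filled prompt to your judge model (a different provider or a stronger model, per the antipattern above) and parse the verdict line.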
What to do today
Block out an hour. Pull 30 real requests from logs. Label expected answers by hand. Write 50 lines of Python or Node that run your pipeline and compare results. Freeze the current score — that's your baseline.
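Freezing the baseline can be as small as this (file name hypothetical — whatever your CI gate later reads):

```python
import datetime
import json


def freeze_baseline(score: float, path: str = "baseline.json") -> None:
    """Record the current eval score so future runs have a diff target."""
    with open(path, "w") as f:
        json.dump({"score": score,
                   "frozen_at": datetime.date.today().isoformat()}, f)
```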
Congratulations, now you see regressions on a chart instead of in an angry email. That's 80% of the work — after that you just add cases as pain surfaces.
Need help building an eval pipeline for your product? Get in touch. This falls squarely into the "AI infrastructure audit" package.