Eval suites are the new test pyramid

Apr 15, 2026claudemint5 min read

You wouldn't ship a backend without tests. Yet we keep meeting teams who ship LLM systems with none — no fixtures, no regression set, no number that tells them whether last night's prompt change made things better or worse. Then behaviour drifts in production, and everyone acts surprised.

The discipline that fixes this already has a name in software: the test pyramid. Evals are the same idea, ported to systems whose output is probabilistic instead of deterministic. And like tests, the teams that win aren't the ones with the cleverest model — they're the ones who can change things confidently because they can measure what changed.

"It feels better" is not a measurement. If you can't put a number on the regression, you're not iterating — you're guessing in public.

Why "vibes" stop scaling

Eyeballing a few outputs works for the first week of a prototype. It breaks the moment you have more than one prompt, more than one developer, or more than one model version. A change that fixes one case quietly breaks three others, and without a held-out set you simply never see it. The cost of not having evals isn't visible — which is exactly why it's dangerous.

A starter pattern you can ship this week

  1. Golden set (the wide base). Twenty to fifty real inputs with known-good outputs, drawn from actual usage, not invented. Run them on every prompt or model change. This is your regression suite — cheap, fast, and the thing that catches 80% of silent breakage.
  2. Assertion checks (the middle). Programmatic rules on every response: valid JSON, no leaked PII, required fields present, length in range, no forbidden phrases. These are deterministic and instant — run them in CI like unit tests.
  3. LLM-as-judge (the narrow top). For the subjective dimensions — tone, helpfulness, faithfulness to source — score with a separate model against a rubric. Slower and noisier, so use sparingly and always anchor it with a few human-labelled examples.

Wire all three to run on a single command and on every change. Track the scores over time. The first time a "small" prompt tweak drops your golden-set score by 15%, the suite has already paid for itself.

Evals are a feature, not a phase

The mistake is treating evaluation as something you bolt on before launch. In a healthy LLM build, the eval set exists before the first prompt does — it's how you define "working" in the first place. We ship every system with evals, observability, and cost guardrails on day one, because retrofitting them after the system is live is three times the work and half the coverage.

This is how every LLM Pilot we run ships — evals, observability and guardrails from day one.

See LLM Pilots →