AI · Mar 12, 2026 · 4 min read

LLM evals: the test suite your AI feature is missing

You wouldn't ship code without tests. Most teams ship LLM features without evals. Here's how we measure quality before users do.

Chitt Bhavsar

Imagine shipping software with no tests, where every release might silently break a feature and you'd only find out from angry users. That's how most teams ship LLM features today. Evals are the test suite for AI, and they're not optional.

Why prompts need tests

LLM behaviour is non-deterministic and brittle. A tweak to a prompt that fixes one case can quietly regress ten others. Without a measurement harness, you're tuning blind.

A measurement harness has three parts:

Golden datasets: real inputs paired with acceptable outputs.
Automated scorers: exact-match, semantic similarity, or an LLM-as-judge.
Regression gates: a prompt change merges only if it does not lower the score.
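
To make those three pieces concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular framework: GoldenCase, exact_match, run_eval and regression_gate are hypothetical names, the two "prompts" are stub functions standing in for real model calls, and the scorer is the simplest of the three options (exact match).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    """One golden-dataset entry (hypothetical shape)."""
    prompt_input: str       # a real input captured from usage
    acceptable: List[str]   # outputs a reviewer has marked as acceptable


def exact_match(output: str, case: GoldenCase) -> float:
    """Return 1.0 if the normalised output equals any acceptable answer, else 0.0."""
    norm = output.strip().lower()
    return 1.0 if any(norm == a.strip().lower() for a in case.acceptable) else 0.0


def run_eval(
    generate: Callable[[str], str],
    cases: List[GoldenCase],
    scorer: Callable[[str, GoldenCase], float] = exact_match,
) -> float:
    """Mean score of `generate` over the golden dataset."""
    if not cases:
        raise ValueError("golden dataset is empty")
    return sum(scorer(generate(c.prompt_input), c) for c in cases) / len(cases)


def regression_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the change may merge: the score must not drop below baseline - tolerance."""
    return candidate >= baseline - tolerance


# Stub "prompts" standing in for real model calls (assumption for illustration).
GOLDEN = [
    GoldenCase("2+2", ["4"]),
    GoldenCase("capital of France", ["Paris"]),
]


def old_prompt(text: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}[text]


def new_prompt(text: str) -> str:
    # Simulates a prompt tweak that quietly regresses one case.
    return {"2+2": "4", "capital of France": "Lyon"}[text]


baseline = run_eval(old_prompt, GOLDEN)
candidate = run_eval(new_prompt, GOLDEN)
may_merge = regression_gate(candidate, baseline)
```

Because model output is non-deterministic, a real gate would probably average several runs per case or allow a small tolerance rather than require an exact tie with the baseline.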

LLM-as-judge, carefully

For open-ended tasks we use a stronger model to grade outputs against a rubric. It's not perfect, but it scales human judgment far enough to catch obvious regressions and rank competing prompts.
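
A rubric-based judge can be sketched in a few lines. Everything here is an assumption for illustration: call_judge is a hypothetical callable that sends a prompt to whichever stronger model you use and returns its text reply, and the template wording and 1-5 scale are placeholders rather than a recommended rubric.

```python
import re
from typing import Callable, Optional

# Hypothetical grading template; the wording and the 1-5 scale are placeholders.
JUDGE_TEMPLATE = (
    "You are grading an answer against a rubric.\n"
    "Rubric: {rubric}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Reply with a single line of the form "SCORE: <1-5>".'
)


def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if it is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_output(
    call_judge: Callable[[str], str],
    rubric: str,
    question: str,
    answer: str,
) -> Optional[int]:
    """Ask the judge model to grade one answer and return the parsed score."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)
    return parse_score(call_judge(prompt))


# Stub judge standing in for a real model call (assumption for illustration).
def fake_judge(prompt: str) -> str:
    return "SCORE: 4"


score = judge_output(
    fake_judge,
    rubric="Answer is factually correct and concise.",
    question="What does an eval measure?",
    answer="Output quality against a reference.",
)
```

Returning None for an unparseable reply, instead of guessing a number, keeps judge failures visible so they can be counted separately from low scores.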

Evals turn "the AI feels worse today" into a number you can act on. Once a team has them, prompt engineering stops being vibes and starts being engineering.
