AI Evals · How to Measure What Actually Matters

The Benchmark Crisis · BLEU score · MMLU · perplexity → none of these measure what matters in production

Every AI team has a benchmark story. The model scores 91% on the leaderboard and fails at the actual job. That's not a coincidence. The benchmark was measuring the wrong thing from the beginning — and the gap between benchmark performance and production performance is where real AI systems go to die.

// Andrej Karpathy · March 2025

"There is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years but that's long over. SWE-Bench Verified I really like and is great but itself too narrow."

— @karpathy · OpenAI founding member, former Tesla Director of AI
// The Eval Pyramid · What Each Layer Measures · ↑ harder to game · ↑ more useful

§01 Why Benchmarks Get Gamed

The problem with every public benchmark is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The moment MMLU became the standard LLM leaderboard metric, labs started optimizing for MMLU — through training data selection, prompt formatting, and in some cases direct contamination. A model that scores 91% on MMLU may have seen MMLU questions in training. You have no way to know.

This isn't a conspiracy. It's a structural problem. Benchmarks are public. Training pipelines are private. The benchmark measures a proxy for capability, and proxies get optimized away from the underlying capability as soon as optimization pressure is applied. The benchmark is useful until it isn't — and once it isn't, everyone keeps citing it for another two years out of inertia.

The contamination problem: Research has repeatedly found evidence that test-set questions from major benchmarks appear in LLM training corpora. A model that achieves 96% on MATH-500 may have been trained on those exact 500 problems or close variants. The benchmark can't distinguish memorization from reasoning. Your production system will reveal the difference immediately.

§02 Benchmark Status Report, 2026

MMLU: Saturated · Contaminated · Retired
BLEU: Word overlap ≠ quality · Deprecated
Perplexity: Predicts text, not usefulness
TruthfulQA: Gamed · Training contamination risk
SWE-Bench Verified: Real GitHub issues · Hard to fake · Active
GAIA: Real-world tasks · Tool use · Multi-step
AIME 2026: New each year · Competition math · Hard
Your production data: THE benchmark · Immune to contamination

§03 The Four Evaluation Methods

1 · Multiple Choice / Benchmarks
Fast · Cheap · Scalable · Gameable

Multiple choice questions with predefined correct answers. Easy to run at scale. The problem: real model outputs aren't multiple choice. Pass rates on MMLU don't predict how well the model handles an open-ended production task. Useful for model selection in early stages. Useless as a production quality signal.

Contamination risk · Proxy metric
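For concreteness, a minimal sketch of how a multiple-choice benchmark score is computed; `query_model` and the question schema are placeholders, not any specific benchmark's harness.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `query_model` is a placeholder for whatever model API you call;
# the question format is illustrative, not MMLU's actual schema.

def query_model(prompt: str) -> str:
    """Return the model's raw text answer (stub for a real API call)."""
    raise NotImplementedError

def score_multiple_choice(questions: list[dict]) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        prediction = query_model(prompt).strip().upper()[:1]
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

# Example item:
# {"question": "...", "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
#  "answer": "B"}
```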
2 · Code Verifiers / Unit Tests
Objective · Deterministic · Domain-limited

For coding tasks: does the code run? Does it pass the test suite? Completely objective — no subjective judgment, no contamination risk on a fresh test set. SWE-Bench Verified works because the solutions are verifiable against real GitHub issue resolutions. The limitation: not every task has a verifiable ground truth. Real estate agent evaluation can't use unit tests.

Best for coding · Use where possible
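A minimal sketch of the verifier pattern, assuming the task ships a held-out pytest file: write the generated solution next to the tests and count a pass only if the suite exits cleanly. The file names here are illustrative, not SWE-Bench's actual harness.

```python
# Sketch of a code-verifier eval: write the model's generated solution into a
# temp directory alongside a held-out test file, run pytest, and treat exit
# code 0 as a pass. File names are illustrative.
import pathlib
import subprocess
import tempfile

def passes_tests(generated_code: str, test_code: str) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        (tmp_path / "solution.py").write_text(generated_code)
        (tmp_path / "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["pytest", "-q", "test_solution.py"],
            cwd=tmp_path,
            capture_output=True,
            timeout=60,
        )
        return result.returncode == 0
```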
3 · Leaderboards / Human Preference (Elo)
Crowdsourced · Preference-based · Gameable

Chatbot Arena / LMSYS: humans compare two model responses and vote for the better one. Aggregate votes produce an Elo-style ranking. Captures something real — actual human preference — but the preference distribution of crowdsourced voters may not match your specific user population. An enterprise legal AI ranked #12 overall might be the right tool for your job. Or it might not.

Useful signal · Doesn't replace domain eval
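For intuition, a minimal sketch of how pairwise votes become an Elo-style ranking; the K-factor and starting rating are conventional choices, not Chatbot Arena's exact computation.

```python
# Sketch of turning pairwise preference votes into an Elo-style ranking.
from collections import defaultdict

K = 32  # update step size (conventional choice)

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rank(votes: list[tuple[str, str, float]]) -> dict[str, float]:
    """votes: (model_a, model_b, score), where score is 1.0 if A wins,
    0.0 if B wins, 0.5 for a tie."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for a, b, score in votes:
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score - e_a)
        ratings[b] += K * ((1 - score) - (1 - e_a))
    return dict(ratings)

print(rank([("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]))
```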
4 · LLM-as-Judge
Scalable · Programmable · ~80% human agreement

Use a capable model to evaluate another model's outputs against a rubric. Research shows a well-prompted GPT-4-class judge matches human annotator agreement ~80% of the time — roughly equal to inter-annotator human agreement. Programmable: define exactly what criteria to judge. Scalable: evaluate thousands of outputs without human bottleneck. Risk: verbosity bias (judges favor longer answers), self-preference (judges favor their own style).

Best practical method · Use with rubrics
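A minimal sketch of a rubric-based judge call; `call_judge_model`, the rubric axes, and the JSON output format are illustrative choices, not a standard. Judging single outputs against an explicit rubric (rather than head-to-head comparisons) sidesteps position bias, though verbosity and self-preference bias still warrant spot-checking a sample of judge scores against human labels.

```python
# Sketch of an LLM-as-judge evaluation against an explicit rubric.
# `call_judge_model` stands in for whatever judge API you use, and the code
# assumes the judge returns valid JSON matching the requested format.
import json

RUBRIC = """Score the RESPONSE to the TASK on each axis from 1 (poor) to 5 (excellent):
- factual_accuracy: are all stated facts correct?
- completeness: does it fully address the task?
- groundedness: are claims supported by the provided context?
Return JSON: {"factual_accuracy": n, "completeness": n, "groundedness": n, "rationale": "..."}"""

def call_judge_model(prompt: str) -> str:
    """Return the judge model's raw text output (stub for a real API call)."""
    raise NotImplementedError

def judge(task: str, response: str, context: str = "") -> dict:
    prompt = f"{RUBRIC}\n\nTASK:\n{task}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}"
    return json.loads(call_judge_model(prompt))
```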

§04 The Step Everyone Skips

There is one evaluation method that is immune to benchmark contamination by definition, directly measures what you care about, and is almost universally skipped: a proprietary test set built from your production data.

Build Your Proprietary Test Set

The most valuable eval investment most teams never make: a set of 100–500 real production inputs with expert-annotated expected outputs. Skipping it is the decision teams most consistently regret.

1. Sample 100–500 inputs from real production traffic. Not synthetic, not constructed: actual tasks your system handles.
2. Have qualified annotators label the expected output or an acceptable output range. This is expensive and worth every dollar.
3. Segment by task type to find targeted weaknesses: not just overall accuracy, but performance broken out by category (a minimal harness sketch follows this list).
4. Rotate the test set every six months. Teams unconsciously overfit to a static set; the system gets tuned toward the test cases you wrote rather than the underlying task.
5. 100 examples is a practical floor. 500 gives enough samples per segment to surface real weaknesses. Go to 1,000+ if you can afford the annotation.
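A minimal harness sketch, assuming the test set lives in a JSONL file with `input`, `expected`, and `segment` fields; `run_system` and `grade` are placeholders for your production pipeline and your grading rule (exact match, a verifier, or an LLM-as-judge call).

```python
# Minimal sketch of a proprietary test-set harness: run the system over
# annotated production examples and report accuracy broken out by segment.
import json
from collections import defaultdict

def run_system(inp: str) -> str:
    raise NotImplementedError  # your production pipeline

def grade(output: str, expected: str) -> bool:
    raise NotImplementedError  # exact match, verifier, or judge call

def evaluate(path: str = "eval_set.jsonl") -> None:
    totals, passes = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)  # {"input": ..., "expected": ..., "segment": ...}
            totals[ex["segment"]] += 1
            if grade(run_system(ex["input"]), ex["expected"]):
                passes[ex["segment"]] += 1
    for segment in sorted(totals):
        print(f"{segment:30s} {passes[segment]}/{totals[segment]} "
              f"({passes[segment] / totals[segment]:.0%})")
```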

§05 The Evaluation Rubric

What to actually measure depends on what you're building. The mistake is treating evaluation as a single score. Production systems require evaluation across multiple axes, and different axes matter differently for different applications.

// Evaluation Axis Selector — Match to Your Use Case
| Axis | What It Measures | How to Measure | Critical For |
| --- | --- | --- | --- |
| Factual Accuracy | Does the output contain verifiably correct facts? | LLM-as-judge with a ground-truth reference; human spot-check of 10% | All production systems |
| Task Completion | Did the agent complete the stated goal? | Verifier against spec; human review for ambiguous cases | Agentic systems |
| Groundedness | Are claims grounded in the provided documents? | RAGAS faithfulness score; NLI entailment check | RAG systems, document Q&A |
| Hallucination Rate | How often does the model confidently state falsehoods? | Sample + human verification; knowledge boundary probing | Customer-facing systems |
| Latency p50 / p95 | Time to first token, time to completion | Instrumentation (not an LLM metric, but still an eval metric) | Real-time agents |
| Regression Rate | Did a change break something that was working? | CI/CD eval pipeline against a fixed test set on every deploy (see the sketch below) | Any system with prompt changes |
| User Satisfaction | Do real users find the outputs useful? | Thumbs up/down, session completion, downstream action rates | Consumer products |
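As one way to wire the regression-rate row into CI, a sketch of a gate that fails the build when the pass rate on the fixed test set drops below a stored baseline; the file names and tolerance are illustrative, not a specific tool's convention.

```python
# Sketch of a CI regression gate: compare the current eval pass rate to a
# stored baseline and fail the deploy if it drops more than a tolerance.
import json
import sys

TOLERANCE = 0.02  # allow 2 points of measurement noise

def gate(results_path: str = "eval_results.json",
         baseline_path: str = "eval_baseline.json") -> None:
    with open(results_path) as f:
        current = json.load(f)["pass_rate"]
    with open(baseline_path) as f:
        baseline = json.load(f)["pass_rate"]
    if current < baseline - TOLERANCE:
        print(f"REGRESSION: pass rate {current:.1%} vs baseline {baseline:.1%}")
        sys.exit(1)
    print(f"OK: pass rate {current:.1%} (baseline {baseline:.1%})")

if __name__ == "__main__":
    gate()
```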
The eval in production: At PropTechUSA.ai, the real eval for the boardroom agent system isn't a benchmark — it's whether Donneal and Eric are getting better deal intelligence than they were before. Latency is measured at the API level. Accuracy is verified by cross-checking agent outputs against known deal data. Regression is caught by running test cases on every deployment. No leaderboard required.
The benchmark measures what's easy to measure. Production measures what matters.

A 91% MMLU score and a failing production system are not contradictory. They're the expected outcome when you optimize for the benchmark instead of the job. The eval that matters is the one you build from your own data — the one that can't be contaminated, can't be gamed, and directly measures the thing you actually care about.

Justin Erickson — PropTechUSA.ai
Eval in production · boardroom agent · Carl · Claudia · 87 workers · March 2026
Series 2 Complete