§01 Why Benchmarks Get Gamed
The problem with every public benchmark is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The moment MMLU became the standard LLM leaderboard metric, labs started optimizing for MMLU — through training data selection, prompt formatting, and in some cases direct contamination. A model that scores 91% on MMLU may have seen MMLU questions in training. You have no way to know.
This isn't a conspiracy. It's a structural problem. Benchmarks are public. Training pipelines are private. The benchmark measures a proxy for capability, and proxies get optimized away from the underlying capability as soon as optimization pressure is applied. The benchmark is useful until it isn't — and once it isn't, everyone keeps citing it for another two years out of inertia.
§02 Benchmark Status Report, 2026
§03 The Four Evaluation Methods
**Multiple-Choice Benchmarks**
*Contamination risk · Proxy metric*

Multiple choice questions with predefined correct answers. Easy to run at scale. The problem: real model outputs aren't multiple choice. Pass rates on MMLU don't predict how well a model handles an open-ended production task. Useful for model selection in early stages. Useless as a production quality signal.
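Scoring this method is trivial, which is part of its appeal. A minimal sketch of exact-match accuracy over answer letters; `mcq_accuracy` and the `model_answer` callable are illustrative names, not any benchmark harness's real API:

```python
# Minimal sketch of multiple-choice scoring: exact-match accuracy over
# answer letters. All names here are illustrative, not a real harness API.

def mcq_accuracy(items: list[dict], model_answer) -> float:
    """items: [{"question": ..., "choices": [...], "answer": "B"}, ...].
    model_answer: callable returning a single letter such as "B"."""
    correct = sum(
        1 for item in items
        if model_answer(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)
```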
**Execution-Based Verification**
*Best for coding · Use where possible*

For coding tasks: does the code run? Does it pass the test suite? Completely objective — no subjective judgment, no contamination risk on a fresh test set. SWE-Bench Verified works because the solutions are verifiable against real GitHub issue resolutions. The limitation: not every task has a verifiable ground truth. Real estate agent evaluation can't use unit tests.
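A minimal sketch of the idea, not SWE-Bench's actual harness: write the model's solution into a scratch directory next to the task's tests and let the test runner's exit code be the verdict. pytest is assumed as the runner, and every name is illustrative:

```python
# Sketch of execution-based verification. Assumes pytest as the runner;
# file names and the function signature are illustrative placeholders.
import subprocess
import tempfile
from pathlib import Path

def passes_tests(generated_code: str, test_source: str) -> bool:
    """Return True iff the generated code passes the supplied tests."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(generated_code)
        Path(tmp, "test_solution.py").write_text(test_source)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],  # exit code 0 iff all tests pass
            cwd=tmp, capture_output=True, timeout=120,
        )
        return result.returncode == 0
```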
**Human Preference Ranking**
*Useful signal · Doesn't replace domain eval*

Chatbot Arena / LMSYS: humans compare two model responses and vote for the better one. Aggregate votes produce an Elo-style ranking. Captures something real — actual human preference — but the preference distribution of crowdsourced voters may not match your specific user population. An enterprise legal AI ranked #12 overall might be the right tool for your job. Or it might not.
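For concreteness, here is how pairwise votes become ratings. Arena-style leaderboards now fit Bradley-Terry-style models over the full vote set; the incremental Elo update below is the simplest online version of the same idea, with an assumed K-factor of 32:

```python
# Sketch of turning one A-vs-B vote into rating updates. The K-factor
# and starting ratings are conventional choices, not a fixed standard.

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after a single pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # win probability
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains
```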
**LLM-as-Judge**
*Best practical method · Use with rubrics*

Use a capable model to evaluate another model's outputs against a rubric. Research shows a well-prompted GPT-4-class judge agrees with human annotators ~80% of the time — roughly the rate at which human annotators agree with each other. Programmable: define exactly what criteria to judge. Scalable: evaluate thousands of outputs without a human bottleneck. Risks: verbosity bias (judges favor longer answers) and self-preference (judges favor their own style).
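A minimal judge sketch, assuming an OpenAI-style client; the model name, rubric text, and 1–5 scale are illustrative choices, not a prescribed setup:

```python
# Sketch of a rubric-based LLM judge, assuming an OpenAI-style client.
# Model name, rubric, and scale are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the ANSWER against the REFERENCE on a 1-5 scale. "
    "5 = factually correct, complete, grounded. 1 = wrong or fabricated. "
    "Reply with the integer score only."
)

def judge(question: str, answer: str, reference: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",   # any strong judge model
        temperature=0,    # deterministic scoring
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"QUESTION:\n{question}\n\n"
                f"ANSWER:\n{answer}\n\n"
                f"REFERENCE:\n{reference}"
            )},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

Pinning temperature to 0 and demanding an integer-only reply keeps scores reproducible, and grading against a reference answer rather than in a vacuum helps blunt the verbosity and self-preference biases.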
§04 The Step Everyone Skips
One evaluation method is immune to benchmark contamination by definition, directly measures what you care about, and is almost universally skipped: a proprietary test set built from your own production data.

It is the most valuable eval investment most teams never make — a set of 100–500 real production inputs with expert-annotated expected outputs. Teams that skip this step consistently regret it.
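Concretely, such a set can be as simple as a JSONL file replayed on every change. A sketch under assumed names — the file path, record fields, and `grade` callable are all placeholders for whatever your stack provides:

```python
# Sketch of a proprietary eval set: real production inputs with
# expert-annotated expected outputs, stored as JSONL and replayed
# against the current system. File name and fields are placeholders.
import json

def load_eval_set(path: str = "prod_eval_set.jsonl") -> list[dict]:
    """Each record: {"input": ..., "expected": ..., "annotator": ...}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def run_eval(system, grade) -> float:
    """system: the pipeline under test. grade: scorer returning 0..1
    (an LLM judge with the expected output as reference works here)."""
    cases = load_eval_set()
    scores = [grade(case["expected"], system(case["input"])) for case in cases]
    return sum(scores) / len(scores)
```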
§05 The Evaluation Rubric
What to actually measure depends on what you're building. The mistake is treating evaluation as a single score. Production systems require evaluation across multiple axes, and different axes matter differently for different applications.
| Axis | What It Measures | How to Measure | Critical For |
|---|---|---|---|
| Factual Accuracy | Does the output contain verifiably correct facts? | LLM-as-judge with ground truth reference, human spot-check 10% | All production systems |
| Task Completion | Did the agent complete the stated goal? | Verifier against spec, human review for ambiguous cases | Agentic systems |
| Groundedness | Are claims grounded in the provided documents? | RAGAS faithfulness score, NLI entailment check | RAG systems, document Q&A |
| Hallucination Rate | How often does the model confidently state falsehoods? | Sample + human verification, knowledge boundary probing | Customer-facing systems |
| Latency p50 / p95 | Time to first token, time to completion | Instrumentation — not an LLM metric, but an eval metric | Real-time agents |
| Regression Rate | Did a change break something that was working? | CI/CD eval pipeline against fixed test set on every deploy | Any system with prompt changes |
| User Satisfaction | Do real users find the outputs useful? | Thumbs up/down, session completion, downstream action rates | Consumer products |
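The regression-rate row is the one that pays off daily, and it is the easiest to automate. A sketch of a CI gate, with hypothetical paths and thresholds: fail the deploy if the eval-set score drops more than a small tolerance below the stored baseline.

```python
# Sketch of a CI regression gate: compare the current eval-set score
# to a stored baseline and fail the build on a meaningful drop.
# Path, tolerance, and wiring are illustrative.
import json
import sys

BASELINE_PATH = "eval_baseline.json"
TOLERANCE = 0.02  # absorb scoring noise before failing the build

def check_regression(current_score: float) -> None:
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)["score"]
    if current_score < baseline - TOLERANCE:
        print(f"REGRESSION: {current_score:.3f} < baseline {baseline:.3f}")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"OK: {current_score:.3f} (baseline {baseline:.3f})")
```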
A 91% MMLU score and a failing production system are not contradictory. They're the expected outcome when you optimize for the benchmark instead of the job. The eval that matters is the one you build from your own data — the one that can't be contaminated, can't be gamed, and directly measures the thing you actually care about.