AI Evals · How to Measure What Actually Matters

The Benchmark Crisis · BLEU score · MMLU · perplexity → none of these measure what matters in production

Every AI team has a benchmark story. The model scores 91% on the leaderboard and fails at the actual job. That's not a coincidence. The benchmark was measuring the wrong thing from the beginning — and the gap between benchmark performance and production performance is where real AI systems go to die.

// Andrej Karpathy · March 2025

"There is an evaluation crisis. I don't really know what metrics to look at right now. MMLU was good and useful for a few years but that's long over. SWE-Bench Verified I really like and is great but itself too narrow."

— @karpathy · OpenAI founding member, former Tesla Director of AI
// The Eval Pyramid · What Each Layer Measures · ↑ harder to game · ↑ more useful

§01 Why Benchmarks Get Gamed

The problem with every public benchmark is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The moment MMLU became the standard LLM leaderboard metric, labs started optimizing for MMLU — through training data selection, prompt formatting, and in some cases direct contamination. A model that scores 91% on MMLU may have seen MMLU questions in training. You have no way to know.

This isn't a conspiracy. It's a structural problem. Benchmarks are public. Training pipelines are private. The benchmark measures a proxy for capability, and proxies get optimized away from the underlying capability as soon as optimization pressure is applied. The benchmark is useful until it isn't — and once it isn't, everyone keeps citing it for another two years out of inertia.

The contamination problem: Research has repeatedly found evidence that test-set questions from major benchmarks appear in LLM training corpora. A model that achieves 96% on MATH-500 may have been trained on those exact 500 problems or close variants. The benchmark can't distinguish memorization from reasoning. Your production system will reveal the difference immediately.

§02 Benchmark Status Report, 2026

MMLU: Saturated · Contaminated · Retired
BLEU: Word overlap ≠ quality · Deprecated
Perplexity: Predicts text, not usefulness
TruthfulQA: Gamed · Training contamination risk
SWE-Bench Verified: Real GitHub issues · Hard to fake · Active
GAIA: Real-world tasks · Tool use · Multi-step
AIME 2026: New each year · Competition math · Hard
Your production data: THE benchmark · Immune to contamination

§03 The Four Evaluation Methods

1 · Multiple Choice / Benchmarks
Fast · Cheap · Scalable · Gameable

Multiple choice questions with predefined correct answers. Easy to run at scale. The problem: real model outputs aren't multiple choice. Pass rates on MMLU don't predict how well the model handles an open-ended production task. Useful for model selection in early stages. Useless as a production quality signal.

Contamination risk · Proxy metric
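For concreteness, a minimal sketch of how a multiple-choice benchmark score is computed; `query_model` and the question schema are placeholders, not any specific benchmark's harness.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `query_model` is a placeholder for whatever model API you call;
# the question format is illustrative, not MMLU's actual schema.

def query_model(prompt: str) -> str:
    """Return the model's raw text answer (stub for a real API call)."""
    raise NotImplementedError

def score_multiple_choice(questions: list[dict]) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        prediction = query_model(prompt).strip().upper()[:1]
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

# Example item:
# {"question": "...", "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
#  "answer": "B"}
```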
2 · Code Verifiers / Unit Tests
Objective · Deterministic · Domain-limited

For coding tasks: does the code run? Does it pass the test suite? Completely objective — no subjective judgment, no contamination risk on a fresh test set. SWE-Bench Verified works because the solutions are verifiable against real GitHub issue resolutions. The limitation: not every task has a verifiable ground truth. Real estate agent evaluation can't use unit tests.

Best for coding · Use where possible
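A minimal sketch of the verifier pattern, assuming the task ships a held-out pytest file: write the generated solution next to the tests and count a pass only if the suite exits cleanly. The file names here are illustrative, not SWE-Bench's actual harness.

```python
# Sketch of a code-verifier eval: write the model's generated solution into a
# temp directory alongside a held-out test file, run pytest, and treat exit
# code 0 as a pass. File names are illustrative.
import pathlib
import subprocess
import tempfile

def passes_tests(generated_code: str, test_code: str) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        (tmp_path / "solution.py").write_text(generated_code)
        (tmp_path / "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["pytest", "-q", "test_solution.py"],
            cwd=tmp_path,
            capture_output=True,
            timeout=60,
        )
        return result.returncode == 0
```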
3 · Leaderboards / Human Preference (Elo)
Crowdsourced · Preference-based · Gameable

Chatbot Arena / LMSYS: humans compare two model responses and vote for the better one. Aggregate votes produce an Elo-style ranking. Captures something real — actual human preference — but the preference distribution of crowdsourced voters may not match your specific user population. An enterprise legal AI ranked #12 overall might be the right tool for your job. Or it might not.

Useful signal · Doesn't replace domain eval
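For intuition, a minimal sketch of how pairwise votes become an Elo-style ranking; the K-factor and starting rating are conventional choices, not Chatbot Arena's exact computation.

```python
# Sketch of turning pairwise preference votes into an Elo-style ranking.
from collections import defaultdict

K = 32  # update step size (conventional choice)

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rank(votes: list[tuple[str, str, float]]) -> dict[str, float]:
    """votes: (model_a, model_b, score), where score is 1.0 if A wins,
    0.0 if B wins, 0.5 for a tie."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for a, b, score in votes:
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score - e_a)
        ratings[b] += K * ((1 - score) - (1 - e_a))
    return dict(ratings)

print(rank([("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]))
```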
4 · LLM-as-Judge
Scalable · Programmable · ~80% human agreement

Use a capable model to evaluate another model's outputs against a rubric. Research shows a well-prompted GPT-4-class judge matches human annotator agreement ~80% of the time — roughly equal to inter-annotator human agreement. Programmable: define exactly what criteria to judge. Scalable: evaluate thousands of outputs without human bottleneck. Risk: verbosity bias (judges favor longer answers), self-preference (judges favor their own style).

Best practical method · Use with rubrics
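A minimal sketch of a rubric-based judge call; `call_judge_model`, the rubric axes, and the JSON output format are illustrative choices, not a standard. Judging single outputs against an explicit rubric (rather than head-to-head comparisons) sidesteps position bias, though verbosity and self-preference bias still warrant spot-checking a sample of judge scores against human labels.

```python
# Sketch of an LLM-as-judge evaluation against an explicit rubric.
# `call_judge_model` stands in for whatever judge API you use, and the code
# assumes the judge returns valid JSON matching the requested format.
import json

RUBRIC = """Score the RESPONSE to the TASK on each axis from 1 (poor) to 5 (excellent):
- factual_accuracy: are all stated facts correct?
- completeness: does it fully address the task?
- groundedness: are claims supported by the provided context?
Return JSON: {"factual_accuracy": n, "completeness": n, "groundedness": n, "rationale": "..."}"""

def call_judge_model(prompt: str) -> str:
    """Return the judge model's raw text output (stub for a real API call)."""
    raise NotImplementedError

def judge(task: str, response: str, context: str = "") -> dict:
    prompt = f"{RUBRIC}\n\nTASK:\n{task}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}"
    return json.loads(call_judge_model(prompt))
```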

§04 The Step Everyone Skips

There is one evaluation method that is immune to benchmark contamination by definition, directly measures what you care about, and is almost universally skipped: a proprietary test set built from your production data.

Build Your Proprietary Test Set

The most valuable eval investment most teams never make: a set of 100–500 real production inputs with expert-annotated expected outputs. Skipping it is the decision teams most consistently regret.

1. Sample 100–500 inputs from real production traffic. Not synthetic, not constructed: actual tasks your system handles.
2. Have qualified annotators label the expected output or an acceptable output range. This is expensive and worth every dollar.
3. Segment by task type to find targeted weaknesses: not just overall accuracy, but performance broken out by category (a minimal harness sketch follows this list).
4. Rotate the test set every six months. Teams unconsciously overfit to a static set; the system gets tuned toward the test cases you wrote rather than the underlying task.
5. 100 examples is a practical floor. 500 gives enough samples per segment to surface real weaknesses. Go to 1,000+ if you can afford the annotation.
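A minimal harness sketch, assuming the test set lives in a JSONL file with `input`, `expected`, and `segment` fields; `run_system` and `grade` are placeholders for your production pipeline and your grading rule (exact match, a verifier, or an LLM-as-judge call).

```python
# Minimal sketch of a proprietary test-set harness: run the system over
# annotated production examples and report accuracy broken out by segment.
import json
from collections import defaultdict

def run_system(inp: str) -> str:
    raise NotImplementedError  # your production pipeline

def grade(output: str, expected: str) -> bool:
    raise NotImplementedError  # exact match, verifier, or judge call

def evaluate(path: str = "eval_set.jsonl") -> None:
    totals, passes = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)  # {"input": ..., "expected": ..., "segment": ...}
            totals[ex["segment"]] += 1
            if grade(run_system(ex["input"]), ex["expected"]):
                passes[ex["segment"]] += 1
    for segment in sorted(totals):
        print(f"{segment:30s} {passes[segment]}/{totals[segment]} "
              f"({passes[segment] / totals[segment]:.0%})")
```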

§05 The Evaluation Rubric

What to actually measure depends on what you're building. The mistake is treating evaluation as a single score. Production systems require evaluation across multiple axes, and different axes matter differently for different applications.

// Evaluation Axis Selector — Match to Your Use Case
| Axis | What It Measures | How to Measure | Critical For |
| --- | --- | --- | --- |
| Factual Accuracy | Does the output contain verifiably correct facts? | LLM-as-judge with a ground-truth reference; human spot-check of 10% | All production systems |
| Task Completion | Did the agent complete the stated goal? | Verifier against spec; human review for ambiguous cases | Agentic systems |
| Groundedness | Are claims grounded in the provided documents? | RAGAS faithfulness score; NLI entailment check | RAG systems, document Q&A |
| Hallucination Rate | How often does the model confidently state falsehoods? | Sample + human verification; knowledge boundary probing | Customer-facing systems |
| Latency p50 / p95 | Time to first token, time to completion | Instrumentation (not an LLM metric, but still an eval metric) | Real-time agents |
| Regression Rate | Did a change break something that was working? | CI/CD eval pipeline against a fixed test set on every deploy (see the sketch below) | Any system with prompt changes |
| User Satisfaction | Do real users find the outputs useful? | Thumbs up/down, session completion, downstream action rates | Consumer products |
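As one way to wire the regression-rate row into CI, a sketch of a gate that fails the build when the pass rate on the fixed test set drops below a stored baseline; the file names and tolerance are illustrative, not a specific tool's convention.

```python
# Sketch of a CI regression gate: compare the current eval pass rate to a
# stored baseline and fail the deploy if it drops more than a tolerance.
import json
import sys

TOLERANCE = 0.02  # allow 2 points of measurement noise

def gate(results_path: str = "eval_results.json",
         baseline_path: str = "eval_baseline.json") -> None:
    with open(results_path) as f:
        current = json.load(f)["pass_rate"]
    with open(baseline_path) as f:
        baseline = json.load(f)["pass_rate"]
    if current < baseline - TOLERANCE:
        print(f"REGRESSION: pass rate {current:.1%} vs baseline {baseline:.1%}")
        sys.exit(1)
    print(f"OK: pass rate {current:.1%} (baseline {baseline:.1%})")

if __name__ == "__main__":
    gate()
```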
The eval in production: At PropTechUSA.ai, the real eval for the boardroom agent system isn't a benchmark — it's whether Donneal and Eric are getting better deal intelligence than they were before. Latency is measured at the API level. Accuracy is verified by cross-checking agent outputs against known deal data. Regression is caught by running test cases on every deployment. No leaderboard required.
The benchmark measures what's easy to measure. Production measures what matters.

A 91% MMLU score and a failing production system are not contradictory. They're the expected outcome when you optimize for the benchmark instead of the job. The eval that matters is the one you build from your own data — the one that can't be contaminated, can't be gamed, and directly measures the thing you actually care about.

Justin Erickson — PropTechUSA.ai
Eval in production · boardroom agent · Carl · Claudia · 87 workers · March 2026
Series 2 Complete