The Spec Sheet Is a Marketing Document
When Anthropic says 200K, they mean the model is technically capable of processing 200K tokens in a single request. They do not mean it performs at 200K the way it performs at 20K. The advertised context window and the effective, reliable context window are two different numbers, and the gap between them can be catastrophic in production.
The pattern is consistent across frontier models: performance holds reasonably well up to roughly 65% of the advertised limit, then begins to degrade. Not gradually — suddenly. Researchers describe it as a cliff, not a slope. A model that was working fine at 120K tokens doesn't gracefully degrade at 140K — it drops. And when it drops, it doesn't produce an error. It produces a confident, coherent-sounding wrong answer.
The Architecture That Creates the Blind Spot
The "lost in the middle" effect isn't a bug in a specific model. It's structural to how transformers work. In 2025, MIT researchers identified the architectural cause: causal masking combined with positional attention weight accumulation.
In a transformer, each token can only attend to tokens that came before it — that's causal masking. Token #1 is visible to every subsequent token in the sequence. Token #500,000, sitting in the middle of a 1M-token context, is only visible to tokens #500,001 onward. Earlier tokens accumulate more total attention weight across the model, simply because they have more opportunities to be referenced. The result is a U-shaped attention distribution: strong recall at the beginning (primacy effect), strong recall at the end (recency effect), and a valley of degraded recall for everything in the middle.
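One half of that U-shape, the primacy side, falls straight out of the arithmetic: under causal masking, the number of tokens that can ever attend to a given position is just the count of positions at or after it. A minimal sketch (the function name and sequence length are illustrative, not tied to any specific model):

```python
def attention_opportunities(seq_len: int, position: int) -> int:
    """Under causal masking, position i can be attended to by every
    token at position i or later: seq_len - position opportunities."""
    return seq_len - position

seq_len = 1_000_000
# Token #1 (index 0) can be referenced by the entire sequence.
print(attention_opportunities(seq_len, 0))        # prints 1000000
# A token in the middle gets only half as many chances to accumulate weight.
print(attention_opportunities(seq_len, 500_000))  # prints 500000
```

The recency side of the curve comes from a different mechanism (the tokens nearest the generation point are attended to directly), which is why the valley sits in the middle rather than at the end.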
Five Things That Break at Scale
At 1M tokens, frontier models average ~60% recall on facts distributed across the context. The missing 40% isn't flagged. The model doesn't say "I couldn't find that." It either omits it or — worse — confabulates something plausible in its place. The danger multiplies in agentic workflows where mid-context decisions compound downstream.
Gemini 1.5 · 1M tokens · avg recall ~60%

Models advertised at 200K tokens typically show sudden performance drops around 130K — not gradual degradation. This means the model that was reliable at 120K can be genuinely unreliable at 140K, with no warning signal. The gap between spec and production reality is consistent enough across models that "70% of advertised context" is a rough operational rule of thumb for where reliability starts to erode.
200K spec → ~130K effective · sudden drop, not slope

The KV (key-value) cache — the attention mechanism's working memory — requires approximately 15GB of VRAM per concurrent user at 1M token context length. A 7B parameter model with a 128K context needs ~2.5GB just for the cache alone. Scale to 1M tokens, scale to 1,000 concurrent users, and the infrastructure math becomes extremely unfavorable. This is why frontier model providers charge 2× input token cost for requests exceeding standard context lengths.
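The cache-size arithmetic can be sketched directly. The formula below is the standard one for a decoder-only transformer; the layer, head, and precision values are hypothetical, and real figures swing by an order of magnitude depending on grouped-query attention layout and cache quantization, which is why per-user numbers quoted in the wild differ so widely:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Cache holds two tensors (K and V) per layer, each storing
    n_kv_heads * head_dim values for every token in the sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 7B-class config with grouped-query attention and an fp16 cache.
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     seq_len=128_000) / 2**30
print(f"{gib:.1f} GiB per concurrent user")  # prints "15.6 GiB per concurrent user"
```

Note that the size is linear in sequence length: the same config at 1M tokens is roughly 8× larger, which is the scaling that makes high-concurrency long-context serving so expensive.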
~15GB KV cache per user · 1M token context

Before the model generates a single output token, it must process all input tokens — this is the prefill phase. At maximum context lengths, prefill latency exceeds two minutes. A 50-step agentic workflow where each step processes 20K tokens accumulates 1M tokens total. That's not a single 2-minute wait — it's 2-minute waits stacked across a workflow. Context engineering — not context maximization — is the production answer.
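Prefill latency is input length divided by prefill throughput. The throughput figure below is a hypothetical placeholder (real numbers depend on hardware, batching, and model size), but it shows why million-token prefills land in the minutes range and why the cost stacks across agent steps:

```python
def prefill_seconds(input_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Time to first output token: every input token must be processed first."""
    return input_tokens / prefill_tokens_per_sec

# Assumed throughput of 8,000 tokens/sec -- purely illustrative.
print(f"{prefill_seconds(1_000_000, 8_000):.0f} s")  # prints "125 s"

# A 50-step agent whose context grows by 20K tokens per step re-pays
# prefill on the accumulated context at every single step.
total = sum(prefill_seconds(20_000 * step, 8_000) for step in range(1, 51))
print(f"{total / 60:.0f} min of cumulative prefill")  # prints "53 min ..."
```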
Prefill >2min at max context · 50-step agent = 1M tokens

After approximately 20% context window utilization, some frontier models exhibit contextual memory degradation — confusing past information with current state. The agent takes actions based on what was true 40K tokens ago, not what's true now. This is particularly dangerous in real estate or finance agentic workflows where deal state changes mid-execution. The model doesn't know it's confused. It acts with full confidence on outdated facts.
Gemini 2.5 Flash · degradation onset ~20% window fill

The Real Cost of "Just Use a Bigger Context"
| Context Length | Relative Cost | Prefill Latency | Effective Recall | Recommended For |
|---|---|---|---|---|
| 8K–32K | 1× | <1s | ~95% | Conversational, focused Q&A, real-time agents |
| 32K–128K | 2–4× | 2–10s | ~85% | Document analysis, long-form workflows, RAG augmentation |
| 128K–200K | 4–8× | 15–45s | ~75% | Large codebase analysis, multi-document synthesis |
| 200K–1M | 8–20× | 1–2min+ | ~60% | Batch processing only — not real-time, not conversational |
| >1M | 20×+ | >2min | Unknown | Research/experimental — not production-ready for most use cases |
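At design time the table reduces to a simple lookup. A sketch using the table's own thresholds and tiers (the function and constant names are ours, not an API):

```python
# (upper bound in tokens, relative cost, approx. effective recall, recommended for)
CONTEXT_TIERS = [
    (32_000,    "1x",    0.95, "conversational, focused Q&A, real-time agents"),
    (128_000,   "2-4x",  0.85, "document analysis, long-form workflows, RAG"),
    (200_000,   "4-8x",  0.75, "large codebase analysis, multi-document synthesis"),
    (1_000_000, "8-20x", 0.60, "batch processing only"),
]

def recommend_tier(context_tokens: int):
    """Return the first tier whose upper bound covers the request size."""
    for upper, cost, recall, use_case in CONTEXT_TIERS:
        if context_tokens <= upper:
            return cost, recall, use_case
    return "20x+", None, "research/experimental -- not production-ready"

print(recommend_tier(150_000))  # falls in the 128K-200K tier
```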
That critical detail at token position 487,000 was technically "in context." The attention mechanism just gave it a fraction of the weight it gives to the first and last 20K tokens. The information was present. The model was looking elsewhere. That's not a memory problem. That's an architecture problem — and bigger windows don't fix it.
The Production Answer: Context Engineering
The production response to context window limitations isn't bigger windows — it's smarter context composition. The goal is to maximize the signal density of what's actually in the window, rather than maximizing the window size.
| Rule | Why It Works |
|---|---|
| Put critical information at the edges | Primacy and recency effects are features, not bugs. System prompts (start) and final instructions (end) receive the highest attention weight. Structure your context accordingly. |
| Compress before injecting | 89–95% compression via memory extraction means 40K tokens of conversation history can become 400 tokens of distilled facts — with better recall properties, not worse. Selective extraction outperforms wholesale injection. |
| Use retrieval, not stuffing | A hybrid approach — 32K–128K context with intelligent vector retrieval — consistently outperforms pure long-context on both cost and accuracy. The retrieval step is not a workaround. It's the architecture. |
| Benchmark at your actual target length | Don't trust the spec sheet. Test your specific use case at 80%, 100%, and 120% of your intended operating context length. The degradation curve varies by model, task type, and information distribution. Measure it. |
| Design agentic workflows with token budgets | A 50-step agent workflow at 20K tokens/call = 1M tokens total. Design step-level token budgets before the architecture scales. Context accumulates silently. Observability at the per-step level is non-negotiable for multi-step agents. |
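The last rule, per-step token budgets with observability, can be sketched as a small accounting wrapper. The class name and thresholds here are illustrative, with the workflow limit set near the ~130K effective ceiling for a 200K-spec model discussed above:

```python
class TokenBudget:
    """Tracks cumulative token spend across agent steps and flags the
    point where the workflow crosses its reliability budget."""
    def __init__(self, per_step_limit: int, workflow_limit: int):
        self.per_step_limit = per_step_limit
        self.workflow_limit = workflow_limit
        self.total = 0
        self.steps = []  # per-step log: (name, tokens, running total)

    def record(self, step_name: str, tokens: int) -> bool:
        """Log a step; return False once either budget is exceeded."""
        self.total += tokens
        self.steps.append((step_name, tokens, self.total))
        return tokens <= self.per_step_limit and self.total <= self.workflow_limit

# 50 steps at 20K tokens each silently accumulates 1M tokens --
# this budget surfaces the problem on step 7 instead of in production.
budget = TokenBudget(per_step_limit=20_000, workflow_limit=130_000)
for i in range(50):
    if not budget.record(f"step_{i}", 20_000):
        print(f"budget exceeded at step {i}: {budget.total} tokens")
        break
```

The point is not the bookkeeping itself but that the check runs per step: context accumulation is visible at the moment it crosses the reliability line, not after the workflow has already drifted into the degradation zone.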