§01 The Two Phases Nobody Explains
LLM inference has two distinct computational phases that behave completely differently and fail for completely different reasons. Almost every latency conversation conflates them, which is why most "optimization" efforts miss the point.
Prefill · drives TTFT · scales with prompt length
The model processes your entire input prompt in parallel, building the KV cache — the compressed representation of everything it needs to hold in "memory" to generate a response. This phase is GPU compute-bound. It determines your TTFT. Long prompts, large context windows, RAG documents pasted in — all of it lands here. A 10K-token RAG context takes proportionally longer to prefill than a 500-token chat message. The longer the prompt, the longer the user stares at a cursor.
Decode · drives throughput · memory bandwidth limited
The model generates tokens one at a time, autoregressively. Each new token requires reading the entire KV cache from GPU memory, which is why this phase is memory bandwidth-bound, not compute-bound. You can't parallelize decode — each token depends on the previous one. This phase determines your tokens-per-second throughput. Adding GPUs helps prefill substantially but hits sharp diminishing returns on decode: going from four to eight A100s on Llama2-70B only brings decode latency down to roughly 0.7× of its previous value.
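To see why the two phases fail differently, a back-of-envelope roofline sketch helps. Every number below (an A100-class GPU, a 70B-parameter model in FP16) is an illustrative assumption, and the model ignores attention FLOPs, KV cache reads, and batching:

```python
# Rough latency model for the two phases. All constants are assumed,
# not measured: A100-class peak compute and bandwidth, 70B FP16 model.

PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # FP16 weights
PEAK_FLOPS = 312e12      # assumed GPU peak FP16 throughput, FLOPs/s
MEM_BW = 2.0e12          # assumed GPU memory bandwidth, bytes/s

def prefill_seconds(prompt_tokens: int) -> float:
    """Prefill is compute-bound: ~2 FLOPs per parameter per token,
    with all prompt tokens processed in parallel."""
    flops = 2 * PARAMS * prompt_tokens
    return flops / PEAK_FLOPS

def decode_seconds_per_token() -> float:
    """Decode is bandwidth-bound: every step re-reads the weights
    (and the KV cache, ignored here) from GPU memory."""
    return (PARAMS * BYTES_PER_PARAM) / MEM_BW

print(f"TTFT, 10K-token prompt: {prefill_seconds(10_000):.2f}s")
print(f"TTFT, 500-token prompt: {prefill_seconds(500):.2f}s")
print(f"Decode floor: {decode_seconds_per_token() * 1000:.0f} ms/token")
```

Even this crude sketch reproduces the shape of the problem: TTFT scales linearly with prompt length, while the decode floor is set by how fast memory can feed the weights back in, not by how much compute sits idle.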
§02 The Metrics That Actually Matter
Average latency is a lie. A system with excellent mean TTFT can still have catastrophic P99 latency — which is what 1% of your users experience. If you have ten thousand daily active users, that's a hundred people who had a broken experience. Every day. Track the right numbers:
TTFT · Time to First Token
How long from request submission to the first token appearing in the response stream. This is the latency the user feels. Everything before it is dead silence. Streaming output exists specifically to optimize perceived TTFT — you can start rendering before the full response is generated. If TTFT is bad, no amount of post-first-token speed saves the experience.
TPOT · Time Per Output Token
The average time between consecutive tokens after the first appears. A TPOT of 100ms/token = 10 tokens/second, roughly 450 words per minute — faster than most people read. Below this threshold, generation feels fluid. Above it, users watch the cursor blink between words. Under high concurrency, TPOT degrades first as the decode phase competes for memory bandwidth across simultaneous requests.
P99 · Tail Latency
The latency that 99% of your requests fall under. P99 reveals your worst-case experience for real users — and it's almost always much worse than the mean. A P99 TTFT near 100 seconds means 1 in 100 interactions is essentially broken. Queue spikes during traffic bursts, KV cache evictions under memory pressure, cold start after low-traffic periods — all of them show up in P99 first and stay invisible in the mean until they're catastrophic.
Goodput · Throughput Within SLO
Throughput that actually meets your latency SLOs. You can maximize raw tokens/second by batching aggressively — but if your latency SLO is 500ms TTFT and your batch strategy blows past that for half the requests, your goodput is not your throughput. Goodput is the intersection of speed and usability. High goodput means you're scaling efficiently while still delivering acceptable experiences. It's the metric that connects infrastructure to product outcomes.
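A minimal sketch of computing all four metrics from per-request traces follows; the trace format (TTFT plus inter-token gaps) and the SLO thresholds are assumptions for illustration, not a standard schema:

```python
# Per-request traces: (ttft_seconds, [gaps between consecutive tokens]).
# Three requests is far too few for a real P99; it just shows the math.

def p99(values):
    """Nearest-rank P99: the value 99% of samples fall at or under."""
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

requests = [
    (0.31, [0.04, 0.05, 0.04]),
    (0.42, [0.06, 0.05, 0.07]),
    (4.80, [0.25, 0.30, 0.28]),   # the tail-latency victim
]

ttfts = [ttft for ttft, _ in requests]
tpots = [sum(gaps) / len(gaps) for _, gaps in requests]

SLO_TTFT, SLO_TPOT = 0.5, 0.1     # assumed SLOs: 500ms TTFT, 100ms TPOT
good = sum(t <= SLO_TTFT and p <= SLO_TPOT for t, p in zip(ttfts, tpots))

print(f"mean TTFT {sum(ttfts)/len(ttfts):.2f}s vs P99 TTFT {p99(ttfts):.2f}s")
print(f"goodput: {good}/{len(requests)} requests met both SLOs")
```

Note how the mean (1.84s) looks merely mediocre while the P99 (4.80s) exposes the broken request — exactly the gap the section above describes.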
§03 The Optimization Stack
Every technique below trades something for something else. The mistake is treating any of them as free wins. Know your bottleneck, then apply the lever that addresses it directly:
Speculative Decoding
Use a small, fast "draft" model to generate multiple candidate tokens ahead. A larger verifier model then checks them in parallel. If they match, you skip redundant decode steps. If they don't, you fall back to normal generation — the output is identical either way. Model providers benchmark 2–4× speedups under the right conditions. The catch: the small draft model needs to closely match the large model's distribution or the accept rate drops. The gain is real but requires careful model pairing.
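A toy sketch of the greedy variant of this idea, under loud assumptions: `draft_model` and `target_model` are stand-in functions invented for illustration, not a real API, and the "parallel" verification is simulated with a loop (on a GPU it would be one batched forward pass):

```python
# Toy greedy speculative decoding over integer "tokens". The draft
# agrees with the target most of the time, so most proposals land.

def draft_model(ctx):            # small/fast stand-in: next = last + 1
    return min(ctx[-1] + 1, 9)

def target_model(ctx):           # large/slow stand-in: wraps after 7
    nxt = ctx[-1] + 1
    return nxt if nxt <= 7 else 0

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_model(ctx + proposal))
    # 2) Target verifies every proposed position; on real hardware
    #    this is a single batched forward pass, not k sequential ones.
    accepted = []
    for tok in proposal:
        if target_model(ctx + accepted) == tok:
            accepted.append(tok)   # match: a decode step skipped
        else:
            break                  # mismatch: discard the rest
    # 3) Target always contributes one token, so the output is
    #    identical to running the target model alone.
    accepted.append(target_model(ctx + accepted))
    return accepted

seq = [0]
for _ in range(4):
    seq += speculative_step(seq)
print(seq)   # same sequence target-only greedy decoding would produce
```

The accept-rate dependence is visible even here: while draft and target agree, each step yields up to k+1 tokens for one target pass; the moment they diverge, the proposal is wasted work.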
KV Cache Offloading
KV cache grows with every token in context and competes with new prefill operations for GPU memory. LayerKV proactively offloads non-critical cache layers to CPU memory, freeing GPU resources for incoming prefill tasks. Under cache memory contention, this technique delivers up to 69× average TTFT reduction. The tradeoff: CPU-GPU memory transfer latency. The win: you stop evicting existing KV caches to make room for new requests, which was causing the worst-case latency spikes in the first place.
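An illustrative cache manager in the spirit of layer-wise offloading — not LayerKV's actual algorithm. The GPU budget, the LRU policy, and the dict-based stand-ins for device memory are all assumptions of the sketch:

```python
# Offload-instead-of-evict: under memory pressure, cold layer caches
# move to CPU memory and can be paged back, rather than being dropped
# and recomputed from scratch via a fresh prefill.
from collections import OrderedDict

GPU_BUDGET = 3   # assumed: how many layer-caches fit in GPU memory

gpu: "OrderedDict[str, bytes]" = OrderedDict()   # hot, fast
cpu: "dict[str, bytes]" = {}                     # cold, slower to reach

def put(layer_id: str, cache: bytes) -> None:
    """Admit a layer's KV cache, offloading the least-recently-used
    layer to CPU memory once the GPU budget is exceeded."""
    gpu[layer_id] = cache
    gpu.move_to_end(layer_id)
    while len(gpu) > GPU_BUDGET:
        victim, data = gpu.popitem(last=False)   # LRU layer
        cpu[victim] = data                        # offload, don't evict

def get(layer_id: str) -> bytes:
    """Fetch a layer's cache, paging it back from CPU if needed.
    That transfer is the latency cost named above."""
    if layer_id not in gpu:
        put(layer_id, cpu.pop(layer_id))
    gpu.move_to_end(layer_id)
    return gpu[layer_id]
```

The design point is the policy, not the data structure: an eviction forces a full re-prefill later (a TTFT spike), while an offload costs only a bounded PCIe transfer when the layer is touched again.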
Quantization
Reduce model weight precision from FP32 → FP16 → INT8 → INT4. This delivers 2–4× faster inference and 4–8× lower VRAM usage, enabling deployment on cheaper hardware or significantly higher concurrency on existing hardware. The quality tradeoff at INT8 is nearly invisible for most generation tasks. INT4 starts to show degradation on reasoning-heavy tasks. For most production conversational AI, INT8 quantization is a near-free latency win. Most serving frameworks support it out of the box today.
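A minimal sketch of symmetric per-tensor INT8 weight quantization, the core mechanism behind those numbers; real frameworks use per-channel scales and calibration data, which this omits:

```python
# Quantize an FP32 weight matrix to INT8 and measure what it costs.
import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)   # FP32 weights

scale = np.abs(w).max() / 127.0               # map [-max, max] -> [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale     # dequantized view for matmul

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {w_int8.nbytes / 2**20:.0f} MiB")
print(f"mean abs error: {np.abs(w - w_deq).mean():.5f}")
```

The 4× memory drop shown here (FP32 → INT8) is what turns into bandwidth headroom during decode: the memory-bound phase has a quarter as many weight bytes to stream per token.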
Pipelining & Streaming
Overlap operations that would otherwise run sequentially: tokenization, prefill, decode, and post-processing can execute concurrently across separate threads. Stream tokens to the client as they're generated rather than waiting for full completion. Well-designed async architecture yields a 20–70% wall-clock reduction in production pipelines. Streaming output is the user-facing version of this: by sending tokens as they're decoded, TTFT feels fast even if end-to-end latency is unchanged. The first token appearing is the experience. Everything after is reading.
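A minimal asyncio sketch of the overlap: a producer decodes tokens into a queue while the consumer renders them concurrently. `fake_decode` is a stand-in for a real inference loop:

```python
# Stream tokens as they are produced instead of buffering the full
# response; the consumer starts rendering after the first token.
import asyncio

async def fake_decode(queue: asyncio.Queue) -> None:
    for tok in ["The", " first", " token", " is", " the", " experience", "."]:
        await asyncio.sleep(0.05)      # pretend each decode step takes 50ms
        await queue.put(tok)           # emit immediately, don't buffer
    await queue.put(None)              # end-of-stream sentinel

async def render(queue: asyncio.Queue) -> None:
    while (tok := await queue.get()) is not None:
        print(tok, end="", flush=True) # user sees output as it's decoded
    print()

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(fake_decode(q), render(q))

asyncio.run(main())
```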
Continuous Batching
Process multiple requests simultaneously by filling decode slots as they free up rather than waiting for an entire batch to complete. Increasing batch size from 1 to 64 can boost throughput 14× — but also raises per-request latency 4×. This is the fundamental throughput/latency tradeoff. For high-concurrency applications where aggregate throughput matters more than per-request speed (batch processing, background generation jobs), batching is essential. For interactive applications where every user feels the latency directly, tune batch size carefully against your TTFT SLO.
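A toy scheduler showing the refill policy that makes batching "continuous"; the slot count and request lengths are made up for the sketch:

```python
# Free decode slots are refilled from the queue at every step instead
# of waiting for the whole batch to drain. Requests are just
# (id, tokens_remaining) pairs.
from collections import deque

BATCH_SLOTS = 4
queue = deque((f"req{i}", n) for i, n in enumerate([3, 9, 2, 6, 4, 8]))
slots: "list[tuple[str, int]]" = []

step = 0
while queue or slots:
    # Refill any free slots immediately (the "continuous" part).
    while queue and len(slots) < BATCH_SLOTS:
        slots.append(queue.popleft())
    # One decode step advances every active request by one token.
    step += 1
    slots = [(rid, left - 1) for rid, left in slots]
    for rid, left in slots:
        if left == 0:
            print(f"step {step:2d}: {rid} finished; slot freed")
    slots = [(rid, left) for rid, left in slots if left > 0]
```

With static batching, req2 (two tokens) would wait for req1 (nine tokens) before its slot could be reused; here the slot is handed to the next queued request the moment it frees up.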
The mean tells you about your typical request. P99 tells you about the rest. The 1% of requests that took 90 seconds. The queue spike at 2pm when traffic hit. The cold KV cache after a 20-minute lull. That's where your product actually lives or dies. Amazon found that 100ms of latency costs 1% of sales. They were measuring P99 latency. Optimize the mean and you're solving the wrong problem.
§04 Tradeoff Matrix
| Technique | Latency Impact | Throughput | Quality Cost | Complexity |
|---|---|---|---|---|
| Speculative Decoding | 2–4× decode speedup | ↑ significant | Zero (identical output) | High (model pairing) |
| INT8 Quantization | 2–4× faster | ↑ via 4–8× lower VRAM | Minimal | Low (built-in today) |
| INT4 Quantization | 3–5× faster | ↑ via ~8× lower VRAM | Moderate on reasoning | Low |
| KV Cache Offloading | Up to 69× TTFT reduction under contention | Neutral | None | Medium |
| Streaming Output | Near-zero perceived TTFT | Neutral (same tokens) | None | Low |
| Continuous Batching (64) | 4× worse per-request | 14× better aggregate | None | Medium |
| Prompt Compression | ↓ prefill cost (TTFT) | ↑ moderate | Risk if over-compressed | Low |