INFERENCE PERFORMANCE · TTFT ANALYSIS · MARCH 2026

The Latency Wall // why speed is the next AI frontier — and how to break through

The model can be brilliant. The architecture can be elegant. If the first token takes three seconds, the user is already gone. Latency is not an infrastructure problem. It's a product problem. And most teams are optimizing the wrong number to fix it.

<500ms
TTFT target for chatbots to feel responsive to users
<100ms
TTFT needed for code completion to feel seamless in-editor
1%
Revenue loss per 100ms of added latency (Amazon research)
2–4×
Speed gain from speculative decoding with no output quality loss
// TTFT oscilloscope · prefill / decode phases · P50 / P95 / P99 latency bands

§01 The Two Phases Nobody Explains

LLM inference has two distinct computational phases that behave completely differently and fail for completely different reasons. Almost every latency conversation conflates them, which is why most "optimization" efforts miss the point.

PHASE 01 · COMPUTE-BOUND
Prefill

The model processes your entire input prompt in parallel, building the KV cache — the compressed representation of everything it needs to hold in "memory" to generate a response. This phase is GPU compute-bound. It determines your TTFT. Long prompts, large context windows, RAG documents pasted in — all of it lands here. A 10K-token RAG context takes proportionally longer to prefill than a 500-token chat message. The longer the prompt, the longer the user stares at a cursor.

drives TTFT · scales with prompt length
PHASE 02 · MEMORY-BOUND
Decode

The model generates tokens one at a time, autoregressively. Each new token requires reading the entire KV cache from GPU memory, which is why this phase is memory bandwidth-bound, not compute-bound. You can't parallelize decode — each token depends on the previous one. This phase determines your tokens-per-second throughput. Adding GPUs helps prefill substantially but hits sharp diminishing returns on decode: going from 4 to 8 A100s on Llama2-70B only brings decode latency down to roughly 0.7× its previous value, about a 30% gain for twice the hardware.

drives throughput · memory bandwidth limited
The trap: Most teams throw more GPUs at a latency problem without diagnosing which phase is the bottleneck. If your TTFT is bad, you have a prefill problem — prefill optimization, KV cache compression, and prompt engineering will help. If your tokens-per-second is bad under load, you have a decode/memory-bandwidth problem. Treating both as "need more compute" wastes budget and doesn't fix either.
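To see why the two phases fail differently, here is a back-of-envelope roofline sketch. All numbers are illustrative assumptions, not measurements: A100-class hardware (~312 TFLOP/s FP16, ~2 TB/s HBM bandwidth), a hypothetical 70B-parameter FP16 model, and the standard ~2 FLOPs-per-parameter-per-token estimate for a forward pass. Real serving stacks will land elsewhere, but the scaling behavior is the point:

```python
# Back-of-envelope latency model: prefill is compute-bound, decode is
# memory-bandwidth-bound. Hardware numbers are illustrative (A100-class).
FLOPS = 312e12          # peak FP16 compute, FLOP/s
BANDWIDTH = 2.0e12      # HBM bandwidth, bytes/s
PARAMS = 70e9           # model parameters
BYTES_PER_PARAM = 2     # FP16 weights

def prefill_seconds(prompt_tokens: int) -> float:
    # A forward pass costs ~2 FLOPs per parameter per token; the whole
    # prompt is processed in parallel, so time scales with prompt length.
    return (2 * PARAMS * prompt_tokens) / FLOPS

def decode_seconds_per_token() -> float:
    # Each decode step must stream every weight from HBM once, so the
    # floor is weight bytes / bandwidth (ignoring KV cache reads).
    return (PARAMS * BYTES_PER_PARAM) / BANDWIDTH

print(f"prefill, 500-token chat: {prefill_seconds(500) * 1e3:.0f} ms")
print(f"prefill, 10K-token RAG:  {prefill_seconds(10_000) * 1e3:.0f} ms")
print(f"decode floor:            {decode_seconds_per_token() * 1e3:.0f} ms/token")
```

The sketch makes the diagnosis concrete: the 10K-token RAG prompt prefills 20× slower than the 500-token chat message, while the decode floor is pinned by bandwidth and doesn't care about prompt length at all.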

§02 The Metrics That Actually Matter

Average latency is a lie. A system with excellent mean TTFT can still have catastrophic P99 latency — which is what 1% of your users experience. If you have ten thousand daily active users, that's a hundred people who had a broken experience. Every day. Track the right numbers:

TTFT
Time to First Token
The Perceived Speed Metric

How long from request submission to the first token appearing in the response stream. This is the latency the user feels. Everything before it is dead silence. Streaming output exists specifically to optimize perceived TTFT — you can start rendering before the full response is generated. If TTFT is bad, no amount of post-first-token speed saves the experience.

▸ chat <500ms · code completion <100ms
TPOT
Time Per Output Token
The Reading Speed Metric

The average time between consecutive tokens after the first appears. A TPOT of 100ms/token = 10 tokens/second, roughly 450 words per minute — faster than most people read. Below this threshold, generation feels fluid. Above it, users watch the cursor blink between words. Under high concurrency, TPOT degrades first as the decode phase competes for memory bandwidth across simultaneous requests.

▸ <100ms/token for natural reading pace
P99
99th Percentile Latency
The Reality Check Metric

The latency that 99% of your requests fall under. P99 reveals your worst-case experience for real users — and it's almost always much worse than the mean. A P99 TTFT near 100 seconds means 1 in 100 interactions is essentially broken. Queue spikes during traffic bursts, KV cache evictions under memory pressure, cold start after low-traffic periods — all of them show up in P99 first and stay invisible in the mean until they're catastrophic.

▸ monitor P95 and P99 — not just mean
GDPT
Goodput
The Business Metric

Throughput that actually meets your latency SLOs. You can maximize raw tokens/second by batching aggressively — but if your latency SLO is 500ms TTFT and your batch strategy blows past that for half the requests, your goodput is not your throughput. Goodput is the intersection of speed and usability. High goodput means you're scaling efficiently while still delivering acceptable experiences. It's the metric that connects infrastructure to product outcomes.

▸ goodput = throughput within latency SLOs
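All four metrics are cheap to compute from per-request logs. A self-contained sketch on synthetic data — the log-normal TTFT distribution, the 0.5 s SLO, and the single shared measurement window are illustrative simplifications, and `percentile` is a simple nearest-rank implementation:

```python
import random

def percentile(values, p):
    # Nearest-rank percentile over a sorted copy (no numpy needed).
    s = sorted(values)
    return s[round(p / 100 * (len(s) - 1))]

random.seed(7)
# Synthetic per-request log: (ttft_s, output_tokens, end_to_end_s).
requests = []
for _ in range(1000):
    ttft = random.lognormvariate(-1.2, 0.6)   # median ~0.3 s, long right tail
    n_tok = random.randint(50, 400)
    tpot = random.uniform(0.03, 0.09)         # seconds/token after the first
    requests.append((ttft, n_tok, ttft + (n_tok - 1) * tpot))

ttfts = [r[0] for r in requests]
tpots = [(r[2] - r[0]) / (r[1] - 1) for r in requests]

print(f"mean TTFT  {sum(ttfts) / len(ttfts):.3f} s")
print(f"P50/P95/P99 TTFT  {percentile(ttfts, 50):.3f} / "
      f"{percentile(ttfts, 95):.3f} / {percentile(ttfts, 99):.3f} s")
print(f"mean TPOT  {sum(tpots) / len(tpots) * 1e3:.0f} ms/token")

# Goodput: only tokens from requests that met the TTFT SLO count.
SLO_TTFT = 0.5
window = max(r[2] for r in requests)          # simplification: shared window
all_tokens = sum(r[1] for r in requests)
good_tokens = sum(r[1] for r in requests if r[0] <= SLO_TTFT)
print(f"throughput {all_tokens / window:.0f} tok/s · "
      f"goodput {good_tokens / window:.0f} tok/s")
```

Even on this tame synthetic distribution, P99 TTFT lands well above the mean — which is exactly why dashboards that stop at averages miss the broken tail.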

§03 The Optimization Stack

Every technique below trades something for something else. The mistake is treating any of them as free wins. Know your bottleneck, then apply the lever that addresses it directly:

2–4×
SPEEDUP
Speculative Decoding
Decode phase · No quality loss · Added complexity

Use a small, fast "draft" model to generate multiple candidate tokens ahead. A larger verifier model then checks them in parallel. If they match, you skip redundant decode steps. If they don't, you fall back to normal generation — the output is identical either way. Model providers benchmark 2–4× speedups under the right conditions. The catch: the small draft model needs to closely match the large model's distribution or the accept rate drops. The gain is real but requires careful model pairing.
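The draft/verify loop can be sketched at the token level. This is a toy: `target_next` and `draft_next` are deterministic stand-in functions, not real models, and the verification is written as sequential calls where a real system does one batched forward pass. What it preserves is the guarantee in the text — with greedy decoding, the output is identical to running the target model alone:

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch with toy token-level models.

    target_next / draft_next map a context tuple to the next token id.
    Accepting draft tokens only where the target agrees yields output
    identical to running the target model on its own.
    """
    out = list(prompt)
    target_passes = 0                    # the expensive resource we count
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model speculates k tokens ahead.
        spec, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(tuple(ctx))
            spec.append(t)
            ctx.append(t)
        # 2. One batched target pass scores all k positions at once
        #    (modeled here as k calls, but it is a single forward pass).
        target_passes += 1
        for t in spec:
            real = target_next(tuple(out))
            if real != t:
                out.append(real)         # correction is free: already computed
                break
            out.append(t)                # draft accepted, decode step skipped
    return out[len(prompt):][:n_tokens], target_passes

# Toy deterministic "model": next token is a function of the last token.
target = lambda ctx: (ctx[-1] * 31 + 7) % 1000
out, passes = speculative_decode(target, target, (1,), 40, k=4)
print(len(out), passes)   # 40 tokens in 10 target passes: a 4x reduction
```

The pairing caveat from the text falls out of the loop structure: every draft/target disagreement ends a verification pass early, so a draft whose distribution drifts from the target burns passes without saving work.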

69×
TTFT REDUCTION
LayerKV / KV Cache Offloading
Prefill phase · Memory management · GPU ↔ CPU

KV cache grows with every token in context and competes with new prefill operations for GPU memory. LayerKV proactively offloads non-critical cache layers to CPU memory, freeing GPU resources for incoming prefill tasks. Under cache memory contention, this technique delivers up to 69× average TTFT reduction. The tradeoff: CPU-GPU memory transfer latency. The win: you stop evicting existing KV caches to make room for new requests, which was causing the worst-case latency spikes in the first place.

2–4×
FASTER INFERENCE
Quantization (INT8 / INT4)
Both phases · 4–8× VRAM savings · small quality tradeoff

Reduce model weight precision from FP32 → FP16 → INT8 → INT4. This delivers 2–4× faster inference and 4–8× lower VRAM usage, enabling deployment on cheaper hardware or significantly higher concurrency on existing hardware. The quality tradeoff at INT8 is nearly invisible for most generation tasks. INT4 starts to show degradation on reasoning-heavy tasks. For most production conversational AI, INT8 quantization is a near-free latency win. Most serving frameworks support it out of the box today.
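A minimal sketch of the simplest scheme, symmetric per-tensor INT8 weight quantization in NumPy. Production stacks typically add per-channel scales, calibration data, and fused INT8 kernels; this only demonstrates the memory math and the size of the rounding error:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

# FP32 -> INT8 is a 4x saving per weight; from an FP16 baseline it is 2x.
err = float(np.abs(dequantize(q, scale) - w).max())
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")
print(f"max abs rounding error: {err:.2e}")
```

The worst-case per-weight error is bounded by half the scale step, which is why INT8 is nearly invisible for generation: the noise is far below the weight magnitudes that matter. Halving the bit width again at INT4 doubles that step, which is where reasoning-heavy tasks start to feel it.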

20–70%
WALL-CLOCK
Async Pipeline Execution
System level · CPU / GPU overlap · operational overhead

Overlap operations that would otherwise run sequentially: tokenization, prefill, decode, and post-processing can execute concurrently across separate threads. Stream tokens to the client as they're generated rather than waiting for full completion. Well-designed async architecture delivers a 20–70% wall-clock reduction in production pipelines. Streaming output is the user-facing version of this: by sending tokens as they're decoded, TTFT feels fast even if end-to-end latency is unchanged. The first token appearing is the experience. Everything after is reading.
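The perceived-vs-actual split is easy to demonstrate with an asyncio sketch. `fake_prefill` and `fake_decode` are stand-ins with hypothetical timings (200 ms prefill, 20 ms per decode step), not any real serving API:

```python
import asyncio
import time

async def fake_prefill(prompt: str) -> None:
    await asyncio.sleep(0.2)              # stand-in for the prefill pass

async def fake_decode(n_tokens: int):
    # Yield each token as soon as it exists instead of buffering the reply.
    for i in range(n_tokens):
        await asyncio.sleep(0.02)         # stand-in for one decode step
        yield f"tok{i} "

async def stream_response(prompt: str, n_tokens: int = 20):
    t0 = time.perf_counter()
    await fake_prefill(prompt)
    ttft = None
    async for tok in fake_decode(n_tokens):
        if ttft is None:
            ttft = time.perf_counter() - t0   # user sees output from here on
        # a real server would write `tok` to the SSE / chunked stream here
    total = time.perf_counter() - t0
    print(f"TTFT {ttft * 1e3:.0f} ms · end-to-end {total * 1e3:.0f} ms")
    return ttft, total

ttft, total = asyncio.run(stream_response("hello"))
```

End-to-end latency is unchanged by streaming; what changes is that the user's wait ends at the first yield instead of the last one.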

14×
THROUGHPUT
Continuous Batching
Decode throughput · latency tradeoff · 100+ concurrent req/min

Process multiple requests simultaneously by filling decode slots as they free up rather than waiting for an entire batch to complete. Increasing batch size from 1 to 64 can boost throughput 14× — but also raises per-request latency 4×. This is the fundamental throughput/latency tradeoff. For high-concurrency applications where aggregate throughput matters more than per-request speed (batch processing, background generation jobs), batching is essential. For interactive applications where every user feels the latency directly, tune batch size carefully against your TTFT SLO.
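The slot-refill idea can be shown with a toy simulation that counts decode steps only (prefill, memory pressure, and per-step batch overhead are all ignored; the request lengths are made up). Static batching pays for the longest request in every batch; continuous batching refills a freed slot on the very next step:

```python
def static_batch_time(lengths, batch_size, step=1.0):
    # Static batching: the whole batch holds its slots until the
    # longest request in it finishes.
    total = 0.0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size]) * step
    return total

def continuous_batch_time(lengths, batch_size, step=1.0):
    # Continuous batching: a finished request frees its slot, and a
    # waiting request takes it at the next decode step.
    pending, slots, t = list(lengths), [], 0.0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))   # refill freed slots immediately
        t += step                          # one decode step for the batch
        slots = [n - 1 for n in slots if n > 1]
    return t

# Mixed workload: a few long generations among many short ones.
lengths = [10, 200, 10, 10, 200, 10, 10, 10]
print(static_batch_time(lengths, 4), continuous_batch_time(lengths, 4))
```

In this workload the short requests stop waiting behind the 200-token stragglers, which is where the aggregate win comes from — and also why per-request latency rises as the batch fills with competing work.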

Average latency tells you about your best-served users.

P99 tells you about the rest. The 1% of requests that took 90 seconds. The queue spike at 2pm when traffic hit. The cold KV cache after a 20-minute lull. That's where your product actually lives or dies. Amazon found that 100ms of latency costs 1% of sales. They were measuring P99 latency. Optimize the mean and you're solving the wrong problem.

§04 Tradeoff Matrix

// Optimization Techniques · Latency / Throughput / Complexity / Quality Tradeoffs
Technique                | TTFT Impact         | Throughput            | Quality Cost             | Complexity
Speculative Decoding     | 2–4× speedup        | ↑ significant         | Zero (identical output)  | High (model pairing)
INT8 Quantization        | 2–4× faster         | ↑ 4–8× VRAM savings   | Minimal                  | Low (built-in today)
INT4 Quantization        | 3–5× faster         | ↑ 8× VRAM savings     | Moderate on reasoning    | Low
KV Cache Offloading      | Up to 69× reduction | Neutral               | None                     | Medium
Streaming Output         | Perceived near-zero | Neutral (same tokens) | None                     | Low
Continuous Batching (64) | 4× worse per-user   | 14× better aggregate  | None                     | Medium
Prompt Compression       | ↓ prefill cost      | Moderate              | Risk if over-compressed  | Low
Cloudflare Workers + streaming: Every agent worker in the PropTechUSA.ai stack uses Server-Sent Events to stream responses. The user sees Carl start answering in under 200ms even when the full response takes 4 seconds to generate. Perceived latency and actual latency are not the same thing. Stream everything interactive. Batch everything that can wait.
Justin Erickson · PropTechUSA.ai
87 CF Workers · SSE streaming · Cloudflare AI inference · GED (juvenile detention) · Self-taught · March 2026