§01 The Two Phases Nobody Explains
LLM inference has two distinct computational phases that behave completely differently and fail for completely different reasons. Almost every latency conversation conflates them, which is why most "optimization" efforts miss the point.
Prefill · drives TTFT · scales with prompt length
The model processes your entire input prompt in parallel, building the KV cache — the compressed representation of everything it needs to hold in "memory" to generate a response. This phase is GPU compute-bound. It determines your TTFT. Long prompts, large context windows, RAG documents pasted in — all of it lands here. A 10K-token RAG context takes proportionally longer to prefill than a 500-token chat message. The longer the prompt, the longer the user stares at a cursor.
Decode · drives throughput · memory bandwidth limited
The model generates tokens one at a time, autoregressively. Each new token requires reading the entire KV cache from GPU memory, which is why this phase is memory bandwidth-bound, not compute-bound. You can't parallelize decode — each token depends on the previous one. This phase determines your tokens-per-second throughput. Adding GPUs helps prefill substantially but hits sharp diminishing returns on decode: going from four to eight A100s on Llama2-70B only brings decode latency down to roughly 0.7× of its previous value.
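To see why the two phases fail differently, a back-of-envelope roofline sketch helps. Every number below (an A100-class GPU, a 70B-parameter model in FP16) is an illustrative assumption, and the model ignores attention FLOPs, KV cache reads, and batching:

```python
# Rough latency model for the two phases. All constants are assumed,
# not measured: A100-class peak compute and bandwidth, 70B FP16 model.

PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # FP16 weights
PEAK_FLOPS = 312e12      # assumed GPU peak FP16 throughput, FLOPs/s
MEM_BW = 2.0e12          # assumed GPU memory bandwidth, bytes/s

def prefill_seconds(prompt_tokens: int) -> float:
    """Prefill is compute-bound: ~2 FLOPs per parameter per token,
    with all prompt tokens processed in parallel."""
    flops = 2 * PARAMS * prompt_tokens
    return flops / PEAK_FLOPS

def decode_seconds_per_token() -> float:
    """Decode is bandwidth-bound: every step re-reads the weights
    (and the KV cache, ignored here) from GPU memory."""
    return (PARAMS * BYTES_PER_PARAM) / MEM_BW

print(f"TTFT, 10K-token prompt: {prefill_seconds(10_000):.2f}s")
print(f"TTFT, 500-token prompt: {prefill_seconds(500):.2f}s")
print(f"Decode floor: {decode_seconds_per_token() * 1000:.0f} ms/token")
```

Even this crude sketch reproduces the shape of the problem: TTFT scales linearly with prompt length, while the decode floor is set by how fast memory can feed the weights back in, not by how much compute sits idle.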
§02 The Metrics That Actually Matter
Average latency is a lie. A system with excellent mean TTFT can still have catastrophic P99 latency — which is what 1% of your users experience. If you have ten thousand daily active users, that's a hundred people who had a broken experience. Every day. Track the right numbers:
TTFT · Time to First Token
How long from request submission to the first token appearing in the response stream. This is the latency the user feels. Everything before it is dead silence. Streaming output exists specifically to optimize perceived TTFT — you can start rendering before the full response is generated. If TTFT is bad, no amount of post-first-token speed saves the experience.
TPOT · Time Per Output Token
The average time between consecutive tokens after the first appears. A TPOT of 100ms/token = 10 tokens/second, roughly 450 words per minute — faster than most people read. Below this threshold, generation feels fluid. Above it, users watch the cursor blink between words. Under high concurrency, TPOT degrades first as the decode phase competes for memory bandwidth across simultaneous requests.
P99 · Tail Latency
The latency that 99% of your requests fall under. P99 reveals your worst-case experience for real users — and it's almost always much worse than the mean. A P99 TTFT near 100 seconds means 1 in 100 interactions is essentially broken. Queue spikes during traffic bursts, KV cache evictions under memory pressure, cold start after low-traffic periods — all of them show up in P99 first and stay invisible in the mean until they're catastrophic.
Goodput · Throughput Within SLO
Throughput that actually meets your latency SLOs. You can maximize raw tokens/second by batching aggressively — but if your latency SLO is 500ms TTFT and your batch strategy blows past that for half the requests, your goodput is not your throughput. Goodput is the intersection of speed and usability. High goodput means you're scaling efficiently while still delivering acceptable experiences. It's the metric that connects infrastructure to product outcomes.
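A minimal sketch of computing all four metrics from per-request traces follows; the trace format (TTFT plus inter-token gaps) and the SLO thresholds are assumptions for illustration, not a standard schema:

```python
# Per-request traces: (ttft_seconds, [gaps between consecutive tokens]).
# Three requests is far too few for a real P99; it just shows the math.

def p99(values):
    """Nearest-rank P99: the value 99% of samples fall at or under."""
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

requests = [
    (0.31, [0.04, 0.05, 0.04]),
    (0.42, [0.06, 0.05, 0.07]),
    (4.80, [0.25, 0.30, 0.28]),   # the tail-latency victim
]

ttfts = [ttft for ttft, _ in requests]
tpots = [sum(gaps) / len(gaps) for _, gaps in requests]

SLO_TTFT, SLO_TPOT = 0.5, 0.1     # assumed SLOs: 500ms TTFT, 100ms TPOT
good = sum(t <= SLO_TTFT and p <= SLO_TPOT for t, p in zip(ttfts, tpots))

print(f"mean TTFT {sum(ttfts)/len(ttfts):.2f}s vs P99 TTFT {p99(ttfts):.2f}s")
print(f"goodput: {good}/{len(requests)} requests met both SLOs")
```

Note how the mean (1.84s) looks merely mediocre while the P99 (4.80s) exposes the broken request — exactly the gap the section above describes.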
§03 The Optimization Stack
Every technique below trades something for something else. The mistake is treating any of them as free wins. Know your bottleneck, then apply the lever that addresses it directly:
Speculative Decoding
Use a small, fast "draft" model to generate multiple candidate tokens ahead. A larger verifier model then checks them in parallel. If they match, you skip redundant decode steps. If they don't, you fall back to normal generation — the output is identical either way. Model providers benchmark 2–4× speedups under the right conditions. The catch: the small draft model needs to closely match the large model's distribution or the accept rate drops. The gain is real but requires careful model pairing.
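A toy sketch of the greedy variant of this idea, under loud assumptions: `draft_model` and `target_model` are stand-in functions invented for illustration, not a real API, and the "parallel" verification is simulated with a loop (on a GPU it would be one batched forward pass):

```python
# Toy greedy speculative decoding over integer "tokens". The draft
# agrees with the target most of the time, so most proposals land.

def draft_model(ctx):            # small/fast stand-in: next = last + 1
    return min(ctx[-1] + 1, 9)

def target_model(ctx):           # large/slow stand-in: wraps after 7
    nxt = ctx[-1] + 1
    return nxt if nxt <= 7 else 0

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_model(ctx + proposal))
    # 2) Target verifies every proposed position; on real hardware
    #    this is a single batched forward pass, not k sequential ones.
    accepted = []
    for tok in proposal:
        if target_model(ctx + accepted) == tok:
            accepted.append(tok)   # match: a decode step skipped
        else:
            break                  # mismatch: discard the rest
    # 3) Target always contributes one token, so the output is
    #    identical to running the target model alone.
    accepted.append(target_model(ctx + accepted))
    return accepted

seq = [0]
for _ in range(4):
    seq += speculative_step(seq)
print(seq)   # same sequence target-only greedy decoding would produce
```

The accept-rate dependence is visible even here: while draft and target agree, each step yields up to k+1 tokens for one target pass; the moment they diverge, the proposal is wasted work.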
KV Cache Offloading
KV cache grows with every token in context and competes with new prefill operations for GPU memory. LayerKV proactively offloads non-critical cache layers to CPU memory, freeing GPU resources for incoming prefill tasks. Under cache memory contention, this technique delivers up to 69× average TTFT reduction. The tradeoff: CPU-GPU memory transfer latency. The win: you stop evicting existing KV caches to make room for new requests, which was causing the worst-case latency spikes in the first place.
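An illustrative cache manager in the spirit of layer-wise offloading — not LayerKV's actual algorithm. The GPU budget, the LRU policy, and the dict-based stand-ins for device memory are all assumptions of the sketch:

```python
# Offload-instead-of-evict: under memory pressure, cold layer caches
# move to CPU memory and can be paged back, rather than being dropped
# and recomputed from scratch via a fresh prefill.
from collections import OrderedDict

GPU_BUDGET = 3   # assumed: how many layer-caches fit in GPU memory

gpu: "OrderedDict[str, bytes]" = OrderedDict()   # hot, fast
cpu: "dict[str, bytes]" = {}                     # cold, slower to reach

def put(layer_id: str, cache: bytes) -> None:
    """Admit a layer's KV cache, offloading the least-recently-used
    layer to CPU memory once the GPU budget is exceeded."""
    gpu[layer_id] = cache
    gpu.move_to_end(layer_id)
    while len(gpu) > GPU_BUDGET:
        victim, data = gpu.popitem(last=False)   # LRU layer
        cpu[victim] = data                        # offload, don't evict

def get(layer_id: str) -> bytes:
    """Fetch a layer's cache, paging it back from CPU if needed.
    That transfer is the latency cost named above."""
    if layer_id not in gpu:
        put(layer_id, cpu.pop(layer_id))
    gpu.move_to_end(layer_id)
    return gpu[layer_id]
```

The design point is the policy, not the data structure: an eviction forces a full re-prefill later (a TTFT spike), while an offload costs only a bounded PCIe transfer when the layer is touched again.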
Quantization
Reduce model weight precision from FP32 → FP16 → INT8 → INT4. This delivers 2–4× faster inference and 4–8× lower VRAM usage, enabling deployment on cheaper hardware or significantly higher concurrency on existing hardware. The quality tradeoff at INT8 is nearly invisible for most generation tasks. INT4 starts to show degradation on reasoning-heavy tasks. For most production conversational AI, INT8 quantization is a near-free latency win. Most serving frameworks support it out of the box today.
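A minimal sketch of symmetric per-tensor INT8 weight quantization, the core mechanism behind those numbers; real frameworks use per-channel scales and calibration data, which this omits:

```python
# Quantize an FP32 weight matrix to INT8 and measure what it costs.
import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)   # FP32 weights

scale = np.abs(w).max() / 127.0               # map [-max, max] -> [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale     # dequantized view for matmul

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {w_int8.nbytes / 2**20:.0f} MiB")
print(f"mean abs error: {np.abs(w - w_deq).mean():.5f}")
```

The 4× memory drop shown here (FP32 → INT8) is what turns into bandwidth headroom during decode: the memory-bound phase has a quarter as many weight bytes to stream per token.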
Pipelining & Streaming
Overlap operations that would otherwise run sequentially: tokenization, prefill, decode, and post-processing can execute concurrently across separate threads. Stream tokens to the client as they're generated rather than waiting for full completion. Well-designed async architecture yields a 20–70% wall-clock reduction in production pipelines. Streaming output is the user-facing version of this: by sending tokens as they're decoded, TTFT feels fast even if end-to-end latency is unchanged. The first token appearing is the experience. Everything after is reading.
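A minimal asyncio sketch of the overlap: a producer decodes tokens into a queue while the consumer renders them concurrently. `fake_decode` is a stand-in for a real inference loop:

```python
# Stream tokens as they are produced instead of buffering the full
# response; the consumer starts rendering after the first token.
import asyncio

async def fake_decode(queue: asyncio.Queue) -> None:
    for tok in ["The", " first", " token", " is", " the", " experience", "."]:
        await asyncio.sleep(0.05)      # pretend each decode step takes 50ms
        await queue.put(tok)           # emit immediately, don't buffer
    await queue.put(None)              # end-of-stream sentinel

async def render(queue: asyncio.Queue) -> None:
    while (tok := await queue.get()) is not None:
        print(tok, end="", flush=True) # user sees output as it's decoded
    print()

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(fake_decode(q), render(q))

asyncio.run(main())
```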
Continuous Batching
Process multiple requests simultaneously by filling decode slots as they free up rather than waiting for an entire batch to complete. Increasing batch size from 1 to 64 can boost throughput 14× — but also raises per-request latency 4×. This is the fundamental throughput/latency tradeoff. For high-concurrency applications where aggregate throughput matters more than per-request speed (batch processing, background generation jobs), batching is essential. For interactive applications where every user feels the latency directly, tune batch size carefully against your TTFT SLO.
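A toy scheduler showing the refill policy that makes batching "continuous"; the slot count and request lengths are made up for the sketch:

```python
# Free decode slots are refilled from the queue at every step instead
# of waiting for the whole batch to drain. Requests are just
# (id, tokens_remaining) pairs.
from collections import deque

BATCH_SLOTS = 4
queue = deque((f"req{i}", n) for i, n in enumerate([3, 9, 2, 6, 4, 8]))
slots: "list[tuple[str, int]]" = []

step = 0
while queue or slots:
    # Refill any free slots immediately (the "continuous" part).
    while queue and len(slots) < BATCH_SLOTS:
        slots.append(queue.popleft())
    # One decode step advances every active request by one token.
    step += 1
    slots = [(rid, left - 1) for rid, left in slots]
    for rid, left in slots:
        if left == 0:
            print(f"step {step:2d}: {rid} finished; slot freed")
    slots = [(rid, left) for rid, left in slots if left > 0]
```

With static batching, req2 (two tokens) would wait for req1 (nine tokens) before its slot could be reused; here the slot is handed to the next queued request the moment it frees up.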
The mean tells you about your typical request. P99 tells you about the rest. The 1% of requests that took 90 seconds. The queue spike at 2pm when traffic hit. The cold KV cache after a 20-minute lull. That's where your product actually lives or dies. Amazon found that 100ms of latency costs 1% of sales. They were measuring P99 latency. Optimize the mean and you're solving the wrong problem.
§04 Tradeoff Matrix
| Technique | Latency Impact | Throughput | Quality Cost | Complexity |
|---|---|---|---|---|
| Speculative Decoding | 2–4× decode speedup | ↑ significant | Zero (identical output) | High (model pairing) |
| INT8 Quantization | 2–4× faster | ↑ via 4–8× lower VRAM | Minimal | Low (built-in today) |
| INT4 Quantization | 3–5× faster | ↑ via ~8× lower VRAM | Moderate on reasoning | Low |
| KV Cache Offloading | Up to 69× TTFT reduction under contention | Neutral | None | Medium |
| Streaming Output | Near-zero perceived TTFT | Neutral (same tokens) | None | Low |
| Continuous Batching (64) | 4× worse per-request | 14× better aggregate | None | Medium |
| Prompt Compression | ↓ prefill cost (TTFT) | ↑ moderate | Risk if over-compressed | Low |