How big does the KV cache get?
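Per token, per layer, you store a key and a value vector, both of size $n_{\text{kv}} \times d_{\text{head}}$ (the number of KV heads times the head dimension). The cache therefore takes $2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}}$ values per token, multiplied by the bytes per value, the sequence length, and the batch size.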
Also known as: KV caching, key-value cache, attention cache
The KV cache stores the key and value tensors from previous tokens during autoregressive generation, so each new token only computes attention over its own query against cached keys and values — not a full re-computation.
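The KV cache is the inference-time optimization that makes autoregressive generation tractable. Without it, every new token would require re-running the entire transformer over the entire prefix: each of the $n$ generation steps would reprocess up to $n$ tokens, making the cost of generating $n$ tokens grow at least quadratically in $n$. With the cache, each step processes only the one new token.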
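Recall how a decoder-only model processes a sequence. At every layer, attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_{\text{head}}}}\right) V$$

where $Q$, $K$, and $V$ are the query, key, and value projections of the layer's input and $d_{\text{head}}$ is the per-head dimension.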
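Crucially, because of causal masking, position $i$ attends only to positions $j \le i$. The keys and values at earlier positions therefore never depend on later tokens, and they do not change as the sequence grows.
So during generation, when you produce token $t$, you compute the query, key, and value for that one token only, append the new key and value to the cache, and attend with the new query over all $t$ cached keys and values.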
You never recompute the prior keys and values. Each was computed once, either during the prompt's prefill phase or at the decode step that produced its token, and they live in memory until generation ends.
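The equivalence between cached decoding and full recomputation can be checked on a toy example. The sketch below is a minimal single-head, single-layer illustration in plain Python, not how a real inference engine is written. The function names (`attend`, `project`, `decode_with_cache`, `decode_without_cache`) and the weight matrices are hypothetical, and the token embeddings are taken as given inputs.

```python
import math

def project(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def decode_with_cache(tokens, wq, wk, wv):
    """Each step projects only the new token and appends its K and V to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs

def decode_without_cache(tokens, wq, wk, wv):
    """Each step re-projects the whole prefix before attending (the naive baseline)."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        outputs.append(attend(project(prefix[-1], wq), keys, values))
    return outputs
```

Both functions return the same outputs for the same inputs. The cached version performs one key and one value projection per token, while the naive version performs one per token per step. In a multi-layer model the same argument holds layer by layer, because causal masking keeps earlier hidden states fixed.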
Memory. Lots of memory.
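For each layer, each KV head, each token, you store two vectors of $d_{\text{head}}$ values: one key and one value.
For Llama-3-70B (80 layers, 64 heads with GQA sharing 8 KV heads, head dimension 128) in fp16, that is $2 \times 80 \times 8 \times 128 \times 2$ bytes = 320 KiB per token, or about 2.7 GB at 8K context.
Per sequence. Batched inference of 32 requests at 8K context is roughly 86 GB just in KV cache, on top of the model weights. This is why inference servers care about VRAM more than compute, and why long-context inference at high batch size is so expensive.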
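The sizing arithmetic is easy to reproduce. The helper below is a minimal sketch, and `kv_cache_bytes` is a hypothetical name. The Llama-3-70B figures it is called with (80 layers, 8 KV heads, head dimension 128, 2 bytes per value for fp16) are the assumed configuration, so adjust them for other models or precisions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes of KV cache for a batch of sequences of equal length."""
    # Factor of 2: one key vector and one value vector
    # per token, per layer, per KV head.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed Llama-3-70B configuration in fp16.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
per_sequence = kv_cache_bytes(80, 8, 128, seq_len=8192)
per_batch = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32)

print(per_token)           # 327680 bytes, i.e. 320 KiB
print(per_sequence / 1e9)  # about 2.68 (GB)
print(per_batch / 1e9)     # about 85.9 (GB)
```

Every factor enters linearly, so doubling the context length or the batch size doubles the cache.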
Naive KV cache allocation pre-reserves a contiguous slab of memory for each sequence at its maximum possible length — but most sequences finish far short of that maximum. The unused tail is stranded VRAM the server can’t reclaim, and fragmentation cuts achievable batch size to a fraction of what the GPU could otherwise hold. PagedAttention (vLLM, 2023) breaks the KV cache into fixed-size “pages” of 16 or 32 tokens, allocated on demand from a shared pool. Each sequence holds a page table mapping logical positions to physical pages, exactly like virtual memory in an OS. The result is roughly 2 to 4x higher achievable batch sizes at the same VRAM, because you no longer pay for tokens that never arrive.
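The page-table idea can be sketched in a few lines. This is an illustrative allocator only, not vLLM's implementation: the class `PagedKVAllocator` and its methods are hypothetical, and it tracks only which physical page and offset each token slot maps to, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy page-table bookkeeping for a shared pool of fixed-size KV pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # shared pool of physical pages
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token and return (physical_page, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.page_tables.setdefault(seq_id, [])
        if length == len(table) * self.page_size:
            # All pages held by this sequence are full (or it has none yet),
            # so take one more page from the pool on demand.
            if not self.free_pages:
                raise MemoryError("KV page pool exhausted")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

A sequence of 17 tokens with 16-token pages holds exactly two pages, wasting at most one partially filled page, whereas a contiguous reservation would hold its full maximum length from the start.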
The KV cache is the load-bearing primitive of modern LLM serving.
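Cache size scales linearly with the number of KV heads. Two architectural changes drop this cost: Multi-Query Attention (MQA), in which all query heads share a single set of keys and values, and Grouped-Query Attention (GQA), in which each group of query heads shares one set.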
GQA is now the default for modern open-weight LLMs.
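How query heads map onto shared KV heads, and what that does to cache size, can be sketched as follows. The function names are hypothetical, and the contiguous grouping (query heads 0 to 7 share KV head 0, and so on) is one common layout assumed here for illustration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_reduction_factor(n_q_heads, n_kv_heads):
    """How many times smaller the KV cache is than with one KV head per query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    return n_q_heads // n_kv_heads

# With 64 query heads:
print(kv_reduction_factor(64, 64))  # 1  (standard multi-head attention)
print(kv_reduction_factor(64, 8))   # 8  (GQA with 8 KV heads)
print(kv_reduction_factor(64, 1))   # 64 (MQA)
```

The number of query heads, and therefore the attention compute, is unchanged. Only the number of distinct key and value vectors that must be stored shrinks.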
A second-level optimization: if two requests share the same prefix (a long system prompt, retrieved RAG context, a few-shot example block), the KV cache from the prefix can be reused. vLLM, SGLang, and other modern inference engines implement this by hashing prefix tokens and storing KV blocks keyed by the hash. A new request that matches reuses the cached KV instead of recomputing.
For RAG applications with stable system prompts, prefix caching makes a large fraction of every request’s prefill free. Combined with the per-token decode cache, modern LLM inference is essentially “compute once for novel input; reuse everything else.”
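The hashing scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual vLLM or SGLang code: `PrefixCache`, `block_hashes`, and the `compute_kv` callback are hypothetical names, eviction is omitted, and a stored "KV block" is whatever the callback returns.

```python
import hashlib

def block_hashes(token_ids, block_size):
    """Hash each full block of tokens, chained with the hash of all blocks before it."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest  # chaining makes a hash identify the whole prefix, not just one block
    return hashes

class PrefixCache:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # chained hash -> cached KV block

    def match_and_fill(self, token_ids, compute_kv):
        """Reuse cached blocks where possible; return (reused, computed) block counts."""
        reused = computed = 0
        for i, h in enumerate(block_hashes(token_ids, self.block_size)):
            if h in self.blocks:
                reused += 1
            else:
                start = i * self.block_size
                self.blocks[h] = compute_kv(token_ids[start:start + self.block_size])
                computed += 1
        return reused, computed
```

Two requests that share an 8-token system prompt and differ afterwards will share the first two 4-token blocks. The second request computes KV only for its own suffix. Trailing tokens that do not fill a whole block are left out of the cache in this sketch.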
Without the KV cache, serving a 70B model at conversation latency would be impossible: every token would re-run the full transformer over the whole prefix. Almost every other inference improvement (FlashAttention, paged attention, speculative decoding, prefix caching, GQA) either accelerates KV-cache computation or shrinks its memory footprint.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing keys and values across multiple attention heads. MQA: all query heads share one set of K/V. GQA: groups of query heads share K/V. KV memory shrinks by the ratio of query heads to KV heads, for example 8× for a model with 64 query heads and 8 KV heads under GQA, and by the full head count under MQA, with minimal quality loss; GQA is the modern default.
Yes — prefix caching is the optimization. If two requests share an initial prefix (a system prompt, a long shared context), the KV cache from the prefix can be computed once and reused for both. vLLM, SGLang, and most modern inference engines do this automatically. For RAG with stable system prompts, prefix caching is a major latency win.