KV Cache

Also known as: KV caching, key-value cache, attention cache

TL;DR

The KV cache stores the key and value tensors from previous tokens during autoregressive generation, so each new token only computes attention over its own query against cached keys and values — not a full re-computation.

The KV cache is the inference-time optimization that makes autoregressive generation tractable. Without it, every new token would require re-running the entire transformer over the entire prefix, making the attention cost of generating $n$ tokens $O(n^3)$ in compute. With it, generation drops to $O(n^2)$ total attention work and per-token latency becomes nearly constant in context length (memory bandwidth permitting).

Why caching works

Recall how a decoder-only model processes a sequence. At every layer, attention computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{\text{head}}}}\right)V$$

Crucially, because of causal masking, position $t$'s output depends only on $q_t$, $K_{\le t}$, $V_{\le t}$. The keys and values at positions $i < t$ are computed from the hidden states at those positions, which never change once written.

So during generation, when you produce token $t$:

Compute $q_t$, $k_t$, $v_t$ for the new token only.
Append $k_t$, $v_t$ to the cached $K$, $V$.
Compute $\mathrm{softmax}(q_t K^\top / \sqrt{d_{\text{head}}})$ and use it to attend over the (cached) values.
Pass through the rest of the layer.

You never recompute the prior keys and values. They were computed once during the prompt’s prefill phase, and they live in memory until generation ends.
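The decode loop can be sketched for a single attention head in plain Python. This is a minimal illustration, not any library's implementation: the names `KVCache`, `step`, `attend`, and `full_recompute` are hypothetical, the projection matrices are random stand-ins for trained weights, and real engines use batched tensor kernels rather than lists. The point is only that the cached path produces the same output as recomputing every key and value from scratch.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

class KVCache:
    """Hypothetical single-head cache: keys and values only ever grow."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x, Wq, Wk, Wv):
        q = matvec(Wq, x)                  # query for the new token only
        self.keys.append(matvec(Wk, x))    # append k_t to the cached keys
        self.values.append(matvec(Wv, x))  # append v_t to the cached values
        return attend(q, self.keys, self.values)

def full_recompute(xs, Wq, Wk, Wv):
    """Baseline without a cache: rebuild all keys and values for the prefix."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

random.seed(0)
d = 4  # toy head dimension

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
cached_outputs = [cache.step(x, Wq, Wk, Wv) for x in xs]
baseline = full_recompute(xs, Wq, Wk, Wv)
max_diff = max(abs(a - b) for a, b in zip(cached_outputs[-1], baseline))
```

Each call to `step` does one projection per matrix and one pass over the cache, while `full_recompute` redoes the projections for every earlier position.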

What it costs

Memory. Lots of memory.

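For each layer, each head, each token, you store $k$ and $v$, two vectors of dimension $d_{\text{head}}$. For a model with $L$ layers, $H_{\text{kv}}$ KV heads, dimension $d_{\text{head}}$ per head, and a sequence of $T$ tokens, total KV cache size is:

$$\text{KV cache bytes} = 2 \cdot L \cdot H_{\text{kv}} \cdot d_{\text{head}} \cdot T \cdot \text{bytes per element}$$

For Llama-3-70B (80 layers, 64 heads with GQA 8 KV heads, $d_{\text{head}} = 128$, fp16 = 2 bytes/element) at $T = 8192$:

$$2 \times 80 \times 8 \times 128 \times 8192 \times 2 \ \text{bytes} \approx 2.7 \ \text{GB}$$

Per sequence. Batched inference of 32 requests at 8K context = ~86 GB just in KV cache, on top of the model weights. This is why inference servers care about VRAM more than compute, and why long-context inference at high batch size is so expensive.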
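The arithmetic is easy to check with a small helper. The function name `kv_cache_bytes` is hypothetical, and the head dimension of 128 and the 8,192-token context are assumed values chosen to match the Llama-3-70B figures quoted here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to store keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-3-70B-style configuration: 80 layers, 8 KV heads,
# head dimension 128, fp16 (2 bytes per element), 8,192 tokens.
per_sequence = kv_cache_bytes(80, 8, 128, 8192)
batch_of_32 = 32 * per_sequence

# Same model if every one of the 64 query heads had its own K/V (no GQA).
per_sequence_mha = kv_cache_bytes(80, 64, 128, 8192)

print(f"{per_sequence / 1e9:.2f} GB per sequence")
print(f"{batch_of_32 / 1e9:.1f} GB for a batch of 32")
print(f"{per_sequence_mha / 1e9:.2f} GB per sequence without GQA")
```

The cost is linear in every factor, so doubling the context or the batch size doubles the cache.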

Naive KV cache allocation pre-reserves a contiguous slab of memory for each sequence at its maximum possible length — but most sequences finish far short of that maximum. The unused tail is stranded VRAM the server can’t reclaim, and fragmentation cuts achievable batch size to a fraction of what the GPU could otherwise hold. PagedAttention (vLLM, 2023) breaks the KV cache into fixed-size “pages” of 16 or 32 tokens, allocated on demand from a shared pool. Each sequence holds a page table mapping logical positions to physical pages, exactly like virtual memory in an OS. The result is roughly 2 to 4x higher achievable batch sizes at the same VRAM, because you no longer pay for tokens that never arrive.
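The page-table idea can be sketched as a small allocator. This is an illustration of the bookkeeping only, under stated assumptions: `PagedKVAllocator` and its methods are hypothetical names, not vLLM's API, and the sketch tracks page ids without storing any tensors.

```python
class PagedKVAllocator:
    """Hypothetical on-demand page allocator for KV cache slots."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # shared physical pool
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_page, offset)."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:
            # No page yet, or the last page is full: take one from the pool.
            if not self.free_pages:
                raise MemoryError("KV page pool exhausted")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(20):          # sequence "a" grows to 20 tokens: 2 pages
    slot = alloc.append_token("a")
for _ in range(3):           # sequence "b" has 3 tokens: 1 page
    alloc.append_token("b")
pages_in_use = 8 - len(alloc.free_pages)
```

A sequence only ever wastes the unused part of its last page, instead of the whole gap between its actual length and the maximum length.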

The KV cache is the load-bearing primitive of modern LLM serving.

MQA, GQA, and friends

Cache size scales linearly with the number of KV heads. Two architectural changes drop this cost:

Multi-Query Attention (MQA). All query heads share a single $(K, V)$ head. KV cache shrinks by a factor of $H$ (the number of query heads). Some quality loss.
Grouped-Query Attention (GQA). Query heads are split into groups, and each group shares one $(K, V)$ head. Llama-3-70B uses 8 KV heads for 64 query heads, giving an 8× smaller cache than full MHA with negligible quality loss.

GQA is now the default for modern open-weight LLMs.
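The sharing pattern comes down to an index mapping from query heads to KV heads. The sketch below uses hypothetical helper names and assumes consecutive query heads form a group, which is a common layout but not the only possible one.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def cache_reduction(num_query_heads, num_kv_heads):
    """How many times smaller the KV cache is than with one K/V per query head."""
    return num_query_heads // num_kv_heads

# MHA: every query head has its own K/V (64 KV heads).
mha_map = [kv_head_for(h, 64, 64) for h in range(64)]
# GQA with 8 KV heads: query heads 0-7 share KV head 0, 8-15 share KV head 1, ...
gqa_map = [kv_head_for(h, 64, 8) for h in range(64)]
# MQA: all 64 query heads read the same single K/V.
mqa_map = [kv_head_for(h, 64, 1) for h in range(64)]

gqa_saving = cache_reduction(64, 8)
mqa_saving = cache_reduction(64, 1)
```

Only the number of distinct KV heads is cached, so the saving is the group size: 8× for this GQA configuration and 64× for MQA.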

Prefix caching

A second-level optimization: if two requests share the same prefix (a long system prompt, retrieved RAG context, a few-shot example block), the KV cache from the prefix can be reused. vLLM, SGLang, and other modern inference engines implement this by hashing prefix tokens and storing KV blocks keyed by the hash. A new request that matches reuses the cached KV instead of recomputing.

For RAG applications with stable system prompts, prefix caching makes a large fraction of every request’s prefill free. Combined with the per-token decode cache, modern LLM inference is essentially “compute once for novel input; reuse everything else.”
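A hash-keyed block store along these lines can be sketched as follows. It is a simplified illustration rather than any engine's actual implementation: `PrefixCache`, `block_hashes`, the 4-token block size, and the string placeholder standing in for KV tensors are all assumptions. Each block's hash is chained with the previous block's hash, so a match identifies the entire prefix up to that point and not just the block's own tokens.

```python
import hashlib

BLOCK = 4  # tokens per block; tiny for illustration

def block_hashes(tokens, block=BLOCK):
    """Hash each full block, chained with the hash of the blocks before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        digest = hashlib.sha256(prev + repr(tokens[i:i + block]).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes

class PrefixCache:
    """Hypothetical store mapping prefix-block hashes to cached KV blocks."""

    def __init__(self):
        self.blocks = {}

    def lookup_and_fill(self, tokens):
        """Return how many leading prompt tokens can reuse cached KV."""
        reused, still_matching = 0, True
        for digest in block_hashes(tokens):
            if still_matching and digest in self.blocks:
                reused += BLOCK
            else:
                still_matching = False
                self.blocks[digest] = "kv-block"  # stand-in for computed tensors
        return reused

system_prompt = list(range(8))            # 8 shared tokens = 2 full blocks
request_1 = system_prompt + [100, 101, 102, 103, 104]
request_2 = system_prompt + [200, 201]

cache = PrefixCache()
reused_first = cache.lookup_and_fill(request_1)   # cold cache
reused_second = cache.lookup_and_fill(request_2)  # shares the system prompt
```

The second request skips prefill for the two shared blocks and only needs to compute KV for its own suffix.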

Why this is the central inference optimization

Without the KV cache, serving a 70B model at conversation latency would be impossible: every token would re-run the full transformer over the whole prefix. Almost every other inference improvement (FlashAttention, paged attention, speculative decoding, prefix caching, GQA) either accelerates KV-cache computation or shrinks its memory footprint.

Go further

How big does the KV cache get?

Per token, per layer, you store a key and a value vector, both of size $d_{\text{model}}$ in some models (those using full multi-head attention; smaller with MQA or GQA). For Llama-3-70B at 8K context, the KV cache is roughly 2.7 GB per single sequence in fp16. This is why inference servers spend much of their VRAM on KV cache, not weights, and why batch sizes at long context are tiny.


What are MQA and GQA?

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing keys and values across multiple attention heads. MQA: all query heads share one set of K/V, which shrinks the cache by the number of heads at some cost in quality. GQA: groups of query heads share K/V, for example 8 KV heads serving 64 query heads for an 8× smaller cache with negligible quality loss. GQA is the modern default.


Can you reuse KV cache across requests?

Yes — prefix caching is the optimization. If two requests share an initial prefix (a system prompt, a long shared context), the KV cache from the prefix can be computed once and reused for both. vLLM, SGLang, and most modern inference engines do this automatically. For RAG with stable system prompts, prefix caching is a major latency win.

