The KV cache (key-value cache) is a memory optimisation used during LLM inference that stores the attention key and value tensors computed for each token in the sequence, so they do not need to be recomputed for every newly generated token. Without a KV cache, generating a 1,000-token response would require the model to reprocess the entire prefix at every step, making the total work O(n²) in sequence length. With the KV cache, each new token computes attention against the stored keys and values and adds only its own K and V; total K/V computation drops to O(n), which is what makes real-time LLM inference practical.
Why the KV cache is necessary
In a transformer decoder, generating each new token requires computing attention over all previous tokens. The attention computation multiplies queries (Q) against keys (K) to get attention weights, then weights values (V) by those scores. For a token at position t, the keys and values from positions 1 to t (including its own) are needed. Without caching, a model generating 1,000 tokens would compute these from scratch at every step; the computation for step 1,000 reprocesses the full 999-token prefix. With the KV cache, keys and values are computed once during the prefill phase and stored; each generation step adds only the single new token's K and V.
Standard scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. K and V from previous tokens are cached; only Q for the current token needs to be freshly computed at each generation step.
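The cached decode loop can be sketched for a single attention head with toy NumPy weights. Every shape and parameter here is an illustrative assumption, not a real model configuration:

```python
# Minimal sketch of KV caching for one attention head (toy weights, NumPy).
import numpy as np

d = 16  # head dimension (assumed)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query against cached K, V."""
    scores = (K @ q) / np.sqrt(d)           # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over cached positions
    return weights @ V

def generate(xs):
    """Process a token stream, caching K and V instead of recomputing them."""
    K_cache, V_cache = [], []
    outputs = []
    for x in xs:                            # one token embedding per step
        q = Wq @ x                          # Q is computed fresh each step
        K_cache.append(Wk @ x)              # new token's K appended once
        V_cache.append(Wv @ x)              # new token's V appended once
        outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outputs)

tokens = rng.standard_normal((5, d))
out = generate(tokens)                      # (5, d); no K or V ever recomputed
```

The outputs are identical to recomputing K and V from scratch at every step; the cache only removes redundant work.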
| Scenario | Without KV cache | With KV cache |
|---|---|---|
| Generation step t | Recompute K, V for all t-1 tokens | Load cached K, V; compute only new token |
| K/V projection work per step | O(t · d) | O(d) |
| Total K/V work (n tokens) | O(n² · d) | O(n · d) |
| For 1000-token response | ~500,000 K/V computations | ~1,000 K/V computations |
| Memory requirement | None (recomputes each time) | K and V for every cached token and layer |
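The 1,000-token row of the table can be checked in a few lines: without a cache, step t reprocesses all t tokens, so the total is 1 + 2 + … + 1000.

```python
# Count per-token K/V projections over a 1000-token generation.
n = 1000
without_cache = sum(range(1, n + 1))  # step t recomputes K/V for all t tokens
with_cache = n                        # each token's K and V computed exactly once
print(without_cache, with_cache)      # 500500 1000
```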
KV cache memory: the GPU bottleneck
The KV cache trades compute for memory. For large models serving long contexts, the KV cache can become the dominant GPU memory consumer, rivalling the model weights themselves. For LLaMA 3 70B (FP16, 8 KV heads with GQA) at 128K context, the KV cache requires approximately 43 GB per sequence; a handful of concurrent long-context requests is enough for aggregate KV cache memory to exceed the ~140 GB occupied by the model weights. This is why GPU memory, not compute, is the primary bottleneck for long-context LLM inference.
- Memory formula: KV cache size = 2 (K and V) × num_layers × num_kv_heads × head_dim × sequence_length × bytes_per_element. For LLaMA 3 8B (32 layers, 8 KV heads with GQA, head_dim 128) at 8K context in FP16: 2 × 32 × 8 × 128 × 8192 × 2 bytes ≈ 1.1 GB. A full-MHA variant with 32 KV heads would need ≈ 4.3 GB.
- Grouped Query Attention (GQA): Used in LLaMA 3, Mistral, and most recent models to reduce KV cache memory — multiple query heads share a single K/V head, reducing cache size by 4–8× with minimal quality impact.
- Prompt caching (cloud): Anthropic, OpenAI, and Google offer prompt caching on their APIs — storing the KV cache for repeated system prompts server-side, reducing both cost and latency for applications with long, repeated context.
- Quantised KV cache: Storing K and V in INT8 or INT4 instead of FP16 reduces memory by 2–4× with modest accuracy tradeoffs — a common production optimisation.
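The memory formula above can be turned into a small calculator. The layer and head counts below are the published LLaMA 3 8B values, used here as illustrative assumptions:

```python
# Sketch of the KV cache memory formula as a calculator.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # leading 2 accounts for storing both K and V
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

GB = 10**9
# LLaMA 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128, 8K context, FP16
print(kv_cache_bytes(32, 8, 128, 8192, 2) / GB)    # ~1.07 GB
# Hypothetical full-MHA variant (32 KV heads): 4x larger
print(kv_cache_bytes(32, 32, 128, 8192, 2) / GB)   # ~4.29 GB
# INT8-quantised KV cache halves the GQA figure
print(kv_cache_bytes(32, 8, 128, 8192, 1) / GB)    # ~0.54 GB
```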
Why this matters for developers
KV cache size determines maximum batch size at a given context length. If your inference server runs out of KV cache memory, it either rejects requests or pages the cache to CPU memory (causing 10–100× latency increases). If you are deploying LLMs and experiencing OOM errors or degraded latency at long contexts, KV cache memory exhaustion is the most likely culprit. Solutions: reduce batch size, use GQA models, enable KV quantisation, or upgrade GPU memory.
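A rough capacity check makes the batch-size point concrete. The GPU size, weight footprint, per-sequence cache size, and 10% overhead reserve below are all illustrative assumptions, not measurements:

```python
# Rough estimate: how many concurrent sequences fit once weights are loaded?
def max_batch(gpu_bytes, weight_bytes, per_seq_kv_bytes, overhead=0.9):
    # reserve ~10% of VRAM for activations, fragmentation, etc. (assumed)
    usable = gpu_bytes * overhead - weight_bytes
    return max(int(usable // per_seq_kv_bytes), 0)

GB = 10**9
# e.g. an 8B model (FP16 weights ~16 GB) on an 80 GB GPU, with a
# GQA KV cache of ~1.07 GB per 8K-context sequence
print(max_batch(80 * GB, 16 * GB, 1.07 * GB))   # 52
```

If this number hits zero at your target context length, the request is unservable at any batch size, which is exactly the OOM failure mode described above.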
Practice questions
- A 70B LLM with 80 transformer layers at BF16 precision caches a 32K context. Estimate the KV cache size. (Answer: KV cache size = 2 (K and V) × num_layers × seq_len × num_KV_heads × head_dim × bytes_per_element. For LLaMA 3 70B: 2 × 80 × 32768 × 8 × 128 × 2 bytes ≈ 10.7 GB per sequence. The BF16 weights alone (~140 GB) already exceed a single A100-80GB, so the model must be sharded across GPUs; on top of that, every concurrent 32K sequence adds another ~10.7 GB of KV cache, which is why long-context serving is so memory-tight.)
- What is Grouped Query Attention (GQA) and how does it reduce KV cache size? (Answer: In standard Multi-Head Attention (MHA), each of the h attention heads has its own K and V projections, so the KV cache scales linearly with h. GQA groups heads: g query heads share one K and V head. With 8 query heads per KV head (as in LLaMA 3), the KV cache shrinks by 8×. LLaMA 3 70B uses 8 KV heads vs 64 query heads, reducing the 32K-context KV cache from ~86 GB (full MHA) to ~10.7 GB.)
- What is prompt caching and how do Anthropic and OpenAI implement it? (Answer: Prompt caching stores the KV cache for reused prompt prefixes across multiple API calls. If you call Claude with the same system prompt and context repeatedly, the KV computation for that prefix is cached server-side and not recomputed. Anthropic charges 10% of the normal input-token price for cache reads, plus a 25% premium over the normal input price for the initial cache write. This can reduce costs by 90%+ for applications with large, stable system prompts.)
- During prefill vs decode phases of LLM inference, which is compute-bound and which is memory-bandwidth-bound? (Answer: Prefill (processing the input prompt in parallel): compute-bound; the GPU performs large matrix multiplications at high utilisation, with arithmetic intensity above the roofline ridge point. Decode (generating tokens one by one): memory-bandwidth-bound; each step loads the full KV cache and model weights to compute one token, with low arithmetic intensity. This is why batching many decode requests together improves GPU utilisation.)
- Why does the KV cache grow only linearly with context length, yet long contexts still strain inference disproportionately? (Answer: The cache itself grows linearly, O(L × layers × kv_heads × head_dim). But every decode step reads the ENTIRE KV cache, so memory-bandwidth consumption is O(L) per generated token, and O(n · L) over an n-token response. For very long contexts this bandwidth cost dominates and attention takes longer than the compute. Additionally, long contexts shrink the VRAM available for batching concurrent users, reducing throughput.)
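The head-sharing in the GQA answer above can be sketched with array shapes alone. The 64/8 head split and head_dim follow the LLaMA 3 70B values quoted earlier, treated here as assumptions:

```python
# Shape-level sketch of GQA: 8 stored KV heads serve 64 query heads.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 32
group = n_q_heads // n_kv_heads                  # 8 query heads per KV head

# Only the 8 KV heads are ever stored in the cache:
K = np.zeros((n_kv_heads, seq_len, head_dim))
V = np.zeros((n_kv_heads, seq_len, head_dim))

# At attention time each KV head is broadcast to its group of query heads:
K_expanded = np.repeat(K, group, axis=0)         # (64, seq_len, head_dim)
assert K_expanded.shape == (n_q_heads, seq_len, head_dim)

# Cache elements stored vs what full MHA would store:
stored = 2 * n_kv_heads * seq_len * head_dim
mha = 2 * n_q_heads * seq_len * head_dim
print(mha // stored)                             # 8x reduction
```

The expansion is a compute-time broadcast, which is why quality impact is small: every query head still attends over the full sequence, just through a shared K/V projection.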