
LLM Construction

KV Cache

Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.


Why This Matters

[Figure: KV cache memory vs. sequence length (0 to 4K tokens), comparing a 7B model (~0.5 MB/token) and a 70B model (~2 MB/token) against an 8 GB consumer-GPU budget; the 70B model OOMs at roughly 4K tokens. KV cache = 2 × layers × heads × d_head × seq_len × dtype_bytes. Linear in sequence length: this is why long-context inference is memory-bound.]

Training a transformer is compute-bound. Inference is memory-bound. The single biggest memory consumer during autoregressive generation is the KV cache: the stored key and value tensors from all previous tokens. For a 70B parameter model serving 100K-token contexts, the KV cache alone can require over 100 GB of GPU memory. Understanding the KV cache is essential for understanding why long-context inference is expensive, why models have context length limits in practice, and why architectural innovations like multi-query attention exist.

Mental Model

When a transformer generates text autoregressively (one token at a time), each new token must attend to all previous tokens. Without caching, you would recompute the key and value projections for all previous tokens at every single generation step, wasting enormous compute on redundant matrix multiplications.

The KV cache stores the key and value vectors computed for past tokens, so that at each step you only compute the query, key, and value for the new token. The new query attends to all cached keys, the weighted sum uses all cached values, and the new key-value pair is appended to the cache.

The Redundant Recomputation Problem

During autoregressive generation, at step $t$ the model has produced tokens $x_1, x_2, \ldots, x_t$ and needs to produce $x_{t+1}$. The attention computation for token $t$ requires:

$$q_t = x_t W_Q, \quad K_{1:t} = X_{1:t} W_K, \quad V_{1:t} = X_{1:t} W_V$$

$$\text{output}_t = \text{softmax}\!\left(\frac{q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}$$

Without caching: At step $t$, you compute $K_{1:t}$ and $V_{1:t}$ from scratch. At step $t+1$, you compute $K_{1:t+1}$ and $V_{1:t+1}$ from scratch, recomputing rows $1$ through $t$ that were already computed in the previous step.

Over a generation of $T$ tokens, the total key/value computation is $O(T^2 d)$. This is wasteful because most of it is redundant.
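The redundancy is easy to tally. The sketch below (toy numbers; `d` stands for the model dimension used in the bounds above) compares the naive per-step projection work against the ideal of projecting each token exactly once:

```python
# Hypothetical sizes for illustration only.
T = 1000   # tokens generated
d = 4096   # model dimension

# Without caching: step t recomputes key/value projections for all t tokens so far.
naive = sum(t * d for t in range(1, T + 1))   # O(T^2 d) total

# Ideal: each token's key/value projection is computed exactly once.
once = T * d                                  # O(T d) total

print(naive / once)  # → 500.5, i.e. roughly (T + 1) / 2 times more work
```

For a 1000-token generation the naive scheme does about 500x the necessary projection work, matching the figure quoted later in this section.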

The KV Cache Solution

Definition

KV Cache

The KV cache stores the computed key and value vectors for all past tokens. At generation step $t$:

  1. Compute the new token's projections: $q_t = x_t W_Q$, $k_t = x_t W_K$, $v_t = x_t W_V$
  2. Append $k_t$ and $v_t$ to the cache: $K_{\text{cache}} \leftarrow [K_{\text{cache}}; k_t]$, $V_{\text{cache}} \leftarrow [V_{\text{cache}}; v_t]$
  3. Attend using the full cache: $\text{output}_t = \text{softmax}(q_t K_{\text{cache}}^\top / \sqrt{d_k}) \, V_{\text{cache}}$

The key-value projections for tokens $1, \ldots, t-1$ are never recomputed.
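The three steps above can be sketched in NumPy for a single attention head (a toy illustration, not a production implementation; no causal mask is needed because the query is always the newest token):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 64, 64                      # model dim and head dim (single head for clarity)
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))         # grows by one row per generated token
V_cache = np.empty((0, d_k))

def decode_step(x_t):
    """One generation step with a KV cache: project only the new token,
    append its key/value to the cache, then attend over the full cache."""
    global K_cache, V_cache
    q_t, k_t, v_t = x_t @ W_Q, x_t @ W_K, x_t @ W_V   # step 1
    K_cache = np.vstack([K_cache, k_t])               # step 2
    V_cache = np.vstack([V_cache, v_t])
    scores = q_t @ K_cache.T / np.sqrt(d_k)           # step 3
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape)  # (5, 64): one cached key row per generated token
```

Note that the projection matrices are applied to exactly one token per step; only the attention itself touches all $t$ cached rows.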

Proposition

KV Cache Reduces Per-Step Compute from O(td) to O(d)

Statement

Without a KV cache, generating token $t$ requires $O(t \cdot d)$ work for key/value projections (computing $K_{1:t}$ and $V_{1:t}$) plus $O(t \cdot d_k)$ for attention. Generating all $T$ tokens costs $O(T^2 d)$ total.

With a KV cache, generating token $t$ requires $O(d)$ for the new key/value projections (only one token) plus $O(t \cdot d_k)$ for attending to the cache. Generating all $T$ tokens costs $O(T^2 d_k + T d)$ for attention operations. The key/value projection cost drops from $O(T^2 d)$ to $O(T d)$.

The memory cost of the cache is:

$$\text{KV cache memory} = 2 \times L \times n_{\text{heads}} \times t \times d_k \times \text{bytes per element}$$

where $L$ is the number of layers, the factor of 2 accounts for both keys and values, and $t$ is the current sequence length.
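To make the formula concrete, it can be wrapped in a small helper (`kv_cache_bytes` is a hypothetical name; the 80-layer, 64-KV-head configuration is the hypothetical full-MHA 70B setup discussed in the failure mode below):

```python
def kv_cache_bytes(n_layers, n_kv_heads, seq_len, d_head, dtype_bytes=2):
    """2 x layers x KV heads x seq_len x head_dim x bytes per element (fp16 default)."""
    return 2 * n_layers * n_kv_heads * seq_len * d_head * dtype_bytes

# Hypothetical 70B-class MHA config: 80 layers, 64 KV heads, d_head = 128, fp16.
print(kv_cache_bytes(80, 64, 100_000, 128) / 1e9)  # → 262.144 (GB) at 100K tokens
```

Swapping `n_kv_heads=64` for `8` (GQA-8) drops the same calculation to roughly 33 GB.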

Intuition

The KV cache trades memory for compute. Instead of recomputing past keys and values ($O(t \cdot d)$ compute per step), you store them ($O(t \cdot d)$ memory per layer) and only compute the single new token's projections ($O(d)$ compute). This is the classic time-space tradeoff, and for autoregressive LLMs it is overwhelmingly worthwhile.

Why It Matters

Without KV caching, generating a 1000-token response from a transformer would require roughly 500x more key/value projection compute than with caching. In practice, no production LLM system operates without a KV cache. The cache is what makes real-time text generation feasible.

Failure Mode

The KV cache grows linearly with sequence length. If Llama 2 70B used MHA with $L = 80$ layers, $n_{\text{heads}} = 64$ KV heads, and $d_k = 128$ in fp16, the cache for a single sequence of length $t$ would be $2 \times 80 \times 64 \times t \times 128 \times 2$ bytes $= 2.62\ \text{MB} \times t$. At $t = 100{,}000$: approximately 262 GB, exceeding the memory of any single GPU. In practice Llama 2 70B uses GQA with 8 KV heads, reducing the actual cache by 8x to roughly 33 GB at the same context length, which is precisely why GQA was adopted. The KV cache remains the primary bottleneck for long-context inference.

Prefill vs Decode Phases

Autoregressive inference has two distinct phases:

Prefill (prompt processing): The entire input prompt of length $n_p$ is processed in a single forward pass (parallel, like training). All $n_p$ key-value pairs are computed and stored in the cache. This phase is compute-bound because it processes many tokens in parallel.

Decode (generation): New tokens are generated one at a time. Each step processes a single token, attending to the full cache. This phase is memory-bandwidth-bound because the compute per token is small, but each step must read the entire KV cache from GPU memory.

The decode phase is why LLM inference throughput is often limited by memory bandwidth rather than compute. For a model with 80 layers, generating one token requires reading the entire KV cache (potentially tens of GB) from HBM, but performing only $O(d)$ FLOPs per layer.
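This bandwidth ceiling can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a GQA-8 configuration and a 3.35 TB/s HBM bandwidth (an H100-class figure); both numbers are illustrative, and real decode speed also depends on reading the model weights and on kernel efficiency:

```python
# Rough upper bound on decode speed from KV-cache reads alone.
n_layers, n_kv_heads, d_head, dtype_bytes = 80, 8, 128, 2  # GQA-8 example config
seq_len = 100_000                                          # current context length
hbm_bandwidth = 3.35e12                                    # bytes/s (assumed H100-class)

# Each decode step streams the whole cache from HBM once.
cache_bytes = 2 * n_layers * n_kv_heads * seq_len * d_head * dtype_bytes  # ~32.8 GB
max_tokens_per_s = hbm_bandwidth / cache_bytes
print(round(max_tokens_per_s))  # → 102 tokens/s, ignoring all other memory traffic
```

Even this optimistic bound is around a hundred tokens per second for a single long-context sequence, which is why shrinking the cache (GQA, MLA, quantization) directly raises decode throughput.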

Multi-Query Attention and Grouped-Query Attention

Definition

Multi-Query Attention (MQA)

In standard multi-head attention (MHA), each of $h$ heads has its own $W_K^{(i)}$ and $W_V^{(i)}$ matrices. The KV cache stores $h$ separate key-value pairs per layer per token.

Multi-query attention (MQA) (Shazeer, 2019) shares a single set of keys and values across all heads:

$$K = X W_K, \quad V = X W_V \qquad \text{(shared across all heads)}$$

$$Q^{(i)} = X W_Q^{(i)} \qquad \text{(separate per head)}$$

Each head still has its own query projection but uses the same keys and values.
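A minimal NumPy sketch of the sharing pattern (toy dimensions; real implementations fold this into fused kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, d_k, t = 64, 4, 16, 10
X = rng.standard_normal((t, d))

# MQA: per-head query projections, but a single shared key/value projection.
W_Q = rng.standard_normal((h, d, d_k))
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_k))

K = X @ W_K                            # (t, d_k): cached once, shared by all heads
V = X @ W_V
Q = np.einsum('td,hdk->htk', X, W_Q)   # (h, t, d_k): one query set per head

scores = Q @ K.T / np.sqrt(d_k)        # (h, t, t): every head reads the same keys
print(scores.shape, K.nbytes)          # under MHA the cache would hold h times as many key bytes
```

Only `K` and `V` enter the cache, so the cached footprint is independent of the head count $h$.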

Proposition

MQA Reduces KV Cache by Factor h

Statement

The KV cache size for different attention variants at sequence length $t$:

| Variant | KV pairs per layer | Cache size per layer |
|---|---|---|
| MHA ($h$ heads) | $h$ | $2 \times h \times t \times d_k$ |
| GQA ($g$ groups) | $g$ | $2 \times g \times t \times d_k$ |
| MQA (1 shared) | $1$ | $2 \times 1 \times t \times d_k$ |

MQA reduces the KV cache by a factor of $h$ compared to MHA. GQA with $g$ groups reduces it by a factor of $h/g$.
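Plugging illustrative numbers into the table (64 query heads, a 4096-token sequence, fp16):

```python
h, t, d_k, dtype_bytes = 64, 4096, 128, 2

def cache_per_layer(n_kv_heads):
    """2 x KV heads x seq_len x head_dim x bytes, per layer."""
    return 2 * n_kv_heads * t * d_k * dtype_bytes

for name, n_kv in [("MHA", h), ("GQA-8", 8), ("MQA", 1)]:
    print(name, cache_per_layer(n_kv) // 2**20, "MiB")
# → MHA 128 MiB, GQA-8 16 MiB, MQA 2 MiB per layer
```

The ratios (64x and 8x) are exactly the $h$ and $h/g$ factors from the proposition above.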

Intuition

The key insight is that in MHA, different heads often learn similar key-value patterns. Sharing keys and values across heads eliminates this redundancy at a small quality cost. GQA is the middle ground: instead of all heads sharing one KV set, or each head having its own, groups of heads share KV sets.

Why It Matters

For Llama 2 70B with $h = 64$ heads, MQA would reduce the KV cache by 64x. In practice, Llama 2 70B uses GQA with $g = 8$ groups, reducing the cache by 8x. This makes the difference between requiring 262 GB (hypothetical MHA) and 33 GB (actual GQA-8) for a 100K-token context: the difference between infeasible and barely feasible on a single node.

The quality degradation from GQA is minimal. Ainslie et al. (2023) showed that GQA with small group counts (e.g., $g = 8$) matches MHA quality to within roughly 0.2 points on standard benchmarks while achieving near-MQA inference speed. Llama 2 70B subsequently adopted GQA-8 in production.

Multi-Head Latent Attention (MLA)

The MQA/GQA progression reduces the number of stored KV heads. Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 (2024), instead compresses the KV representation through a low-rank joint projection. Per token per layer, MLA stores a small latent vector $c_{KV} \in \mathbb{R}^{d_c}$ with $d_c \ll h \cdot d_k$, from which keys and values for all heads are reconstructed at attention time via learned up-projections. The reported result is a roughly 7x KV-cache reduction relative to GQA-8 at comparable or better quality on DeepSeek-V2 evaluations. The technique is the current state-of-the-art cache-compression method as of 2024 and represents a distinct axis from MQA/GQA: head sharing reduces the number of cached projections, while MLA reduces their dimensionality.
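A toy sketch of the low-rank idea (this is only the compression skeleton with made-up dimensions; DeepSeek-V2's actual MLA also handles positional encoding with a separate decoupled RoPE path and uses different sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, d_k = 512, 8, 64
d_c = 64                                      # latent dim; d_c << h * d_k = 512

W_down = rng.standard_normal((d, d_c))        # joint down-projection; its output is cached
W_up_K = rng.standard_normal((d_c, h * d_k))  # learned key reconstruction for all heads
W_up_V = rng.standard_normal((d_c, h * d_k))  # learned value reconstruction for all heads

x_t = rng.standard_normal(d)
c_kv = x_t @ W_down                           # only this (d_c,) latent is cached per token
k_all = (c_kv @ W_up_K).reshape(h, d_k)       # keys for all heads, rebuilt at attention time
v_all = (c_kv @ W_up_V).reshape(h, d_k)
print(c_kv.size, 2 * h * d_k)                 # 64 cached floats vs 1024 under full MHA (K + V)
```

The cached object shrinks from $2 h d_k$ values per token to $d_c$, at the cost of extra up-projection compute during decode.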

Paged Attention (vLLM)

Definition

Paged Attention

Paged attention (Kwon et al., 2023) applies the operating system concept of virtual memory paging to KV cache management.

In standard serving, each sequence pre-allocates a contiguous memory block for its maximum possible KV cache size. This wastes memory because:

  • Most sequences do not reach maximum length
  • Memory cannot be shared between sequences
  • Internal fragmentation from alignment requirements

Paged attention divides the KV cache into fixed-size pages (blocks of tokens). Pages are allocated on demand and can be stored non-contiguously in GPU memory, with a page table mapping logical token positions to physical memory locations.

Paged attention enables:

  • Near-zero memory waste: pages are allocated only as needed
  • Memory sharing: sequences with shared prefixes (e.g., the same system prompt) can share KV cache pages via copy-on-write
  • Higher throughput: more sequences can be batched because memory utilization is close to optimal

In benchmarks, vLLM achieves 2-4x higher throughput than naive serving implementations, primarily by reducing KV cache memory waste.
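The bookkeeping behind paged allocation can be sketched in a few lines (a schematic of the page-table idea, not vLLM's implementation; names like `append_token` are hypothetical):

```python
BLOCK = 16                            # tokens per page (vLLM uses small fixed-size blocks)

free_pages = list(range(100))         # pool of physical page ids on the GPU
page_tables = {}                      # sequence id -> ordered list of physical pages

def append_token(seq_id, pos):
    """Allocate a physical page only when a sequence crosses a block boundary;
    return the (physical page, offset) where this token's K/V would be written."""
    table = page_tables.setdefault(seq_id, [])
    if pos % BLOCK == 0:              # first token of a new logical block
        table.append(free_pages.pop())
    return table[pos // BLOCK], pos % BLOCK

for pos in range(40):                 # 40 tokens need ceil(40 / 16) = 3 pages
    append_token("seq-A", pos)
print(len(page_tables["seq-A"]))      # → 3
```

Because pages are allocated lazily and indirected through the table, a sequence never reserves memory for tokens it has not produced, and two sequences can point their tables at the same physical pages for a shared prefix.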

Why KV Cache Is the Memory Bottleneck

For a concrete example, consider serving Llama 3 70B (80 layers, 64 heads, GQA with 8 KV heads, $d_k = 128$, fp16):

  • Model weights: $70 \times 10^9 \times 2$ bytes $= 140$ GB
  • KV cache per token per layer: $2 \times 8 \times 128 \times 2$ bytes $= 4{,}096$ bytes
  • KV cache per token (all layers): $4{,}096 \times 80 = 327{,}680$ bytes $\approx 320$ KB
  • KV cache for 128K context: $320\ \text{KB} \times 128{,}000 \approx 40$ GB

With batching (serving $B$ sequences simultaneously), the total KV cache is $B \times 40$ GB. At batch size 8: 320 GB for the KV cache alone, more than double the model weight memory.

This is why KV cache size, not parameter count, determines the maximum batch size and context length for deployed LLMs.

Common Confusions

Watch Out

KV cache saves compute, not the attention operation itself

The KV cache avoids recomputing key and value projections for past tokens. The attention operation itself (computing $q_t K^\top$ and the weighted sum over values) still requires reading the full cache and scales as $O(t)$ per step. KV caching does not change the complexity of the attention mechanism; it eliminates redundant projection computation.

Watch Out

MQA/GQA affect the KV cache but not the query projections

In MQA and GQA, each attention head still has its own query projection $W_Q^{(i)}$. Only the key and value projections are shared. This means the quality of each head's "question" is preserved; only the "dictionary" being searched is shared.

Watch Out

FlashAttention and KV cache solve different problems

FlashAttention reduces the memory required for the attention computation (the $n \times n$ score matrix) by computing it in tiles. The KV cache reduces redundant key-value projection compute during generation. They are complementary, and both are used in production systems.

Summary

  • Without KV cache: generating $T$ tokens costs $O(T^2 d)$ in key/value projections
  • With KV cache: store past K, V vectors; only compute the new token's projections per step
  • KV cache memory: $2 \times L \times h_{\text{kv}} \times t \times d_k \times \text{bytes}$
  • Prefill is compute-bound (parallel); decode is memory-bandwidth-bound (sequential)
  • MQA shares one KV set across all heads (cache $\div\, h$)
  • GQA shares KV across groups of heads (cache $\div\, h/g$): the practical sweet spot
  • Paged attention manages the cache like virtual memory pages, reducing waste
  • For long contexts, the KV cache dominates GPU memory over model weights

Exercises

ExerciseCore

Problem

A model has $L = 32$ layers, $h = 32$ attention heads (full MHA, no grouping), $d_k = 128$, and uses fp16. How much memory does the KV cache require for a single sequence of length $t = 4096$?

ExerciseAdvanced

Problem

During the decode phase, generating one token requires reading the entire KV cache from GPU HBM. For the model above (32 layers, GQA with 4 groups, $d_k = 128$, fp16) at sequence length $t = 32{,}000$, how many bytes must be read per generated token? If the GPU has 2 TB/s of HBM bandwidth, what is the maximum generation speed (tokens per second) for a single sequence, ignoring all computation time?

References

  • Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (2019), arXiv:1911.02150 (multi-query attention)
  • Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebrón, Sanghai, "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023), arXiv:2305.13245
  • Pope, Douglas, Chowdhery, Devlin, Bradbury, Heek, Xiao, Agrawal, Dean, "Efficiently Scaling Transformer Inference," MLSys 2023, arXiv:2211.05102 (canonical analysis of inference cost, prefill/decode, and KV cache scaling)
  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023, arXiv:2309.06180 (vLLM)
  • DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024), arXiv:2405.04434 (introduces Multi-Head Latent Attention)

Next Topics

The natural next steps from KV cache:

  • Scaling laws: understanding compute-optimal training which determines the model sizes being served
  • Positional encoding: how position information interacts with cached key-value pairs

Last reviewed: April 2026
