
LLM Construction

KV Cache Optimization

Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.

Advanced · Tier 2 · Frontier · ~45 min


Why This Matters

The KV cache is the dominant memory consumer during LLM inference. For a 70B model serving 128K-token contexts, the KV cache requires approximately 40 GB per sequence. Naive memory management wastes 60-80% of allocated GPU memory through fragmentation and over-allocation. The techniques in this topic collectively reduce memory waste, increase batch sizes, and enable longer contexts. They are the difference between serving 4 concurrent requests and serving 20 on the same hardware.

Mental Model

Think of GPU memory for KV cache like a hard drive for a file system. Naive allocation is like giving every file its maximum possible size upfront. Paged attention is like a file system with block allocation: allocate small blocks on demand, use a page table to track them. Prefix caching is like hard links: multiple files sharing the same data blocks. Token eviction is like an LRU cache eviction policy. Quantization is like file compression.

Paged Attention (vLLM)

Standard KV cache allocation pre-allocates a contiguous memory region for each sequence at the maximum context length. Problems:

  1. Internal fragmentation: a sequence using 500 of 4096 allocated tokens wastes 88% of its allocation
  2. External fragmentation: after sequences of different lengths complete, the freed memory has gaps that cannot be used for new large allocations
  3. No sharing: two sequences with the same system prompt store identical KV cache entries separately
Definition

Paged Attention

Paged attention (Kwon et al., 2023) divides the KV cache into fixed-size blocks (pages), each storing key-value pairs for a fixed number of tokens (typically 16 or 32). A block table maps logical token positions to physical memory blocks for each sequence. Blocks are allocated on demand from a shared pool and need not be contiguous.
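The block-table bookkeeping can be sketched in a few lines. This is a hypothetical simplification, not vLLM's actual API: blocks come from a shared free pool, and a per-sequence block table maps logical block indices to physical block ids.

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # last block is full: allocate
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because freed blocks return to a shared pool of uniform size, there is no external fragmentation: any freed block can serve any future allocation.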

Proposition

Paged Attention Achieves Near-Optimal Memory Utilization

Statement

With paged attention and block size $b$, the memory waste per sequence is at most $b - 1$ token slots (the internal fragmentation of the last block). The memory utilization ratio is:

$$\text{utilization} = \frac{t}{\lceil t/b \rceil \cdot b} \geq \frac{t}{t + b - 1}$$

For $t \gg b$, utilization approaches 1. In contrast, contiguous pre-allocation at maximum length $T$ has utilization $t/T$, which can be arbitrarily small.

Kwon et al. (2023) report that vLLM achieves memory utilization above 96% compared to 20-40% for naive pre-allocation in typical serving workloads.

Intuition

The only wasted memory is the unused portion of the last block. With block size 16 and a 1000-token sequence, you allocate 63 blocks (1008 slots), wasting only 8 slots (0.8%). Naive pre-allocation at $T = 4096$ would waste 3096 slots (75.6%).
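The utilization formula is easy to check numerically. A minimal sketch, using the numbers from the text (1000-token sequence, block size 16, maximum length 4096):

```python
import math

def paged_utilization(t: int, b: int) -> float:
    """Fraction of allocated slots actually used with block size b."""
    return t / (math.ceil(t / b) * b)

def naive_utilization(t: int, max_len: int) -> float:
    """Fraction used when pre-allocating the maximum context length."""
    return t / max_len

print(round(paged_utilization(1000, 16), 3))    # 1000 / 1008 -> 0.992
print(round(naive_utilization(1000, 4096), 3))  # 1000 / 4096 -> 0.244
```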

Why It Matters

The 2-4x throughput improvement reported by vLLM comes almost entirely from this memory efficiency gain. Higher utilization means more sequences can fit in GPU memory simultaneously, which means larger batch sizes and better GPU utilization during the memory-bandwidth-bound decode phase.

Failure Mode

Paged attention adds a layer of indirection (the block table lookup) at each attention computation. This adds a small latency overhead per token, typically under 5% on modern hardware. The overhead increases if the block table becomes very large (extremely long sequences with very small block sizes).

Prefix Caching

Many serving scenarios involve repeated prefixes. Every request to a chatbot shares the same system prompt. Every request in a batch of document QA shares the same document context. Prefix caching exploits this by storing KV cache blocks for common prefixes once and sharing them across requests.

Definition

Prefix Caching

Prefix caching identifies requests with shared token prefixes and maps their block tables to the same physical memory blocks for those shared positions. When a request diverges from the prefix, new blocks are allocated for the diverging tokens. This uses copy-on-write semantics: the shared blocks are read-only, and a request only gets its own copy of a block when it needs to modify it (which does not happen for KV cache, since past key-value pairs are immutable).
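One common way to detect shared prefixes is content addressing: key each full block of prompt tokens by a hash of all token ids up to and including that block, so two requests with the same prefix produce the same block ids. The sketch below is illustrative (the hashing scheme and names are assumptions, not a specific system's implementation); note that only full blocks are shared, since a partially filled block may still diverge.

```python
import hashlib

def prefix_block_ids(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Content-addressed ids for each *full* block of a prompt.

    Two prompts that agree on their first k*block_size tokens get the same
    first k ids, so their block tables can point at the same physical blocks.
    """
    ids = []
    for end in range(block_size, len(token_ids) + 1, block_size):
        digest = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        ids.append(digest[:16])
    return ids

system_prompt = list(range(32))             # 32 shared system-prompt tokens
req_a = system_prompt + [100, 101, 102]     # diverges after the prefix
req_b = system_prompt + [200, 201]

a, b = prefix_block_ids(req_a), prefix_block_ids(req_b)
assert a == b   # both full blocks cover only the shared prefix
```

Hashing the entire prefix up to each block (rather than the block's tokens alone) is what makes the id position-dependent: the same 16 tokens appearing at a different offset get a different id, as required for KV cache correctness.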

Proposition

Prefix Sharing Reduces Memory Proportional to Shared Length

Statement

Without prefix caching, the total KV cache memory for $N$ requests of length $T$ is proportional to $N \cdot T$. With prefix caching, a shared prefix of $p$ tokens occupies $\lceil p/b \rceil$ blocks stored once, and each request adds $\lceil (T-p)/b \rceil$ unique blocks. The total memory is:

$$\text{memory} \propto \lceil p/b \rceil + N \cdot \lceil (T-p)/b \rceil$$

The memory savings ratio is approximately:

$$\text{savings} = 1 - \frac{p/b + N(T-p)/b}{NT/b} = \frac{(N-1)p}{NT} = \frac{p}{T} \cdot \frac{N-1}{N}$$

For $N$ large and $p/T$ substantial, savings approach $p/T$. A shared system prompt of 2000 tokens in a 4000-token context with 100 requests saves approximately 50% of KV cache memory.
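The worked example can be verified directly from the block counts, including the ceiling rounding that the closed-form approximation ignores:

```python
import math

def kv_blocks(tokens: int, b: int = 16) -> int:
    """Number of KV cache blocks needed for a token count."""
    return math.ceil(tokens / b)

def prefix_savings(n_requests: int, total_len: int, prefix_len: int, b: int = 16) -> float:
    """Fraction of KV blocks saved by sharing a common prefix across requests."""
    without = n_requests * kv_blocks(total_len, b)
    with_sharing = kv_blocks(prefix_len, b) + n_requests * kv_blocks(total_len - prefix_len, b)
    return 1 - with_sharing / without

# Numbers from the text: 100 requests, 4000-token context, 2000-token shared prompt.
print(round(prefix_savings(100, 4000, 2000), 3))  # -> 0.495, i.e. ~50%
```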

Intuition

If all requests share 50% of their tokens as a common prefix, you store that prefix once instead of $N$ times. The memory for the shared portion drops from $O(N)$ to $O(1)$.

Why It Matters

In production chatbot deployments, system prompts are often 1000-3000 tokens. With thousands of concurrent requests, prefix caching saves terabytes of aggregate KV cache memory across a serving cluster.

Failure Mode

Prefix caching only helps when requests genuinely share a prefix. For workloads with diverse, unique prompts (e.g., code completion on different codebases), the cache hit rate is low and the overhead of tracking potential prefixes may not be worthwhile.

Token Eviction for Long Sequences

For sequences that exceed available memory, or when aggressive memory savings are needed, some tokens can be dropped from the KV cache.

Attention-based eviction. Track the cumulative attention weight each cached token receives. Tokens that consistently receive near-zero attention are unlikely to be needed in the future and can be evicted. H2O (Heavy Hitter Oracle, Zhang et al., 2023) keeps the tokens with highest cumulative attention scores plus a sliding window of recent tokens.

Sliding window attention. Keep only the most recent $w$ tokens in the KV cache plus a set of "sink" tokens (the first few tokens, which empirically always receive high attention). StreamingLLM (Xiao et al., 2023) uses this approach to enable infinite-length generation with fixed memory.

The trade-off is clear: evicted tokens are permanently lost. If the model later needs information from an evicted token, generation quality degrades.
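The sink-plus-window policy amounts to simple bookkeeping over cached token positions. A minimal sketch of the StreamingLLM-style policy (class and parameter names are illustrative):

```python
from collections import deque

class SinkWindowCache:
    """Keep the first `sinks` token positions plus the most recent `window`."""

    def __init__(self, sinks: int = 4, window: int = 8):
        self.sinks = sinks
        self.sink_positions: list[int] = []
        self.recent: deque[int] = deque(maxlen=window)  # deque auto-evicts oldest

    def add(self, pos: int) -> None:
        if len(self.sink_positions) < self.sinks:
            self.sink_positions.append(pos)    # first few tokens are never evicted
        else:
            self.recent.append(pos)            # everything else is a sliding window

    def cached_positions(self) -> list[int]:
        return self.sink_positions + list(self.recent)

cache = SinkWindowCache(sinks=4, window=8)
for pos in range(100):
    cache.add(pos)
# Memory stays fixed at 12 entries no matter how long generation runs.
assert cache.cached_positions() == [0, 1, 2, 3] + list(range(92, 100))
```

Positions 4 through 91 have been permanently evicted, which is exactly the lossy trade-off described above.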

Quantized KV Cache

The KV cache is typically stored in the same precision as model activations (fp16 or bf16). But the key and value vectors tolerate quantization better than model weights because they are consumed once (during the attention computation) rather than reused across the entire forward pass.

INT8 KV cache. Quantize cached keys and values to 8-bit integers. This halves the memory footprint. Dequantization happens on-the-fly during the attention computation. The accuracy degradation is typically under 0.5% on standard benchmarks.

INT4 KV cache. Quantize to 4-bit integers. This quarters the memory footprint but introduces more noticeable quality degradation, especially for tasks requiring precise recall of information from early in the context. Group quantization (quantizing blocks of values with per-group scale factors) mitigates this.

For a 70B model with GQA ($g = 8$ KV heads, $d_k = 128$, 80 layers) serving 128K-token contexts:

  • fp16 KV cache: approximately 40 GB per sequence
  • INT8 KV cache: approximately 20 GB per sequence
  • INT4 KV cache: approximately 10 GB per sequence
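Symmetric per-group quantization of KV vectors can be sketched with NumPy. This is a simplified illustration of the group-quantization idea described above (the group size of 64 is an assumption, and real kernels fuse dequantization into the attention computation rather than materializing fp values):

```python
import numpy as np

def quantize_int8(x: np.ndarray, group: int = 64):
    """Quantize a (tokens, dim) float array to int8, one scale per group."""
    g = x.reshape(x.shape[0], -1, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0  # symmetric range
    scale = np.maximum(scale, 1e-8)                        # avoid divide-by-zero
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)                     # int8 payload + fp16 scales

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct fp values on-the-fly for the attention computation."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 128)).astype(np.float32)  # 1000 cached key vectors
q, s = quantize_int8(keys)
err = np.abs(dequantize_int8(q, s) - keys).max()
assert q.dtype == np.int8 and err < 0.1   # half the bytes of fp16, small error
```

Per-group scales are what keep the error small: one outlier value only inflates the quantization step for its own 64-element group, not for the whole vector.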

Common Confusions

Watch Out

Paged attention is about memory management, not attention computation

Paged attention does not change the mathematical attention operation. The queries, keys, values, and softmax computation are identical. What changes is where the keys and values are stored in GPU memory and how that memory is allocated and tracked. The attention kernel must be modified to handle non-contiguous memory (gathering blocks via the page table), but the mathematical result is the same.
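This invariance is easy to demonstrate: scatter the same keys and values into arbitrarily placed physical blocks, gather them through a block table, and the attention output is bit-for-bit identical to the contiguous case. A small NumPy sketch (single query, single head, illustrative sizes):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

block_size, n_blocks, d = 4, 3, 8
rng = np.random.default_rng(0)
k_contig = rng.standard_normal((block_size * n_blocks, d))
v_contig = rng.standard_normal((block_size * n_blocks, d))
q = rng.standard_normal(d)

# Scatter the logical blocks into a larger physical pool in arbitrary order.
pool_k = np.zeros((10 * block_size, d))
pool_v = np.zeros((10 * block_size, d))
block_table = [7, 2, 5]                       # logical block i -> physical block
for i, phys in enumerate(block_table):
    pool_k[phys*block_size:(phys+1)*block_size] = k_contig[i*block_size:(i+1)*block_size]
    pool_v[phys*block_size:(phys+1)*block_size] = v_contig[i*block_size:(i+1)*block_size]

# Gather via the block table, then run the standard attention computation.
idx = np.concatenate([np.arange(p*block_size, (p+1)*block_size) for p in block_table])
out_paged = softmax(pool_k[idx] @ q / np.sqrt(d)) @ pool_v[idx]
out_ref = softmax(k_contig @ q / np.sqrt(d)) @ v_contig
assert np.allclose(out_paged, out_ref)        # identical result, scattered storage
```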

Watch Out

Token eviction is lossy, prefix caching is lossless

Prefix caching is an exact optimization: the shared KV cache entries are identical to what would be computed independently. Token eviction discards information and can degrade generation quality. They solve different problems: prefix caching reduces redundancy across requests; token eviction reduces memory per request.

Watch Out

KV cache quantization is different from weight quantization

Weight quantization compresses model parameters, reducing model size and compute. KV cache quantization compresses the runtime cache, reducing inference memory. They can be applied independently. A model can use INT4 weights with fp16 KV cache, or fp16 weights with INT8 KV cache, or both.

Summary

  • Paged attention: allocate KV cache in blocks on demand, like virtual memory pages
  • vLLM achieves 96%+ memory utilization vs. 20-40% for naive pre-allocation
  • Prefix caching: share KV blocks for common prefixes across requests, saving $\sim p/T$ of KV memory
  • Token eviction (H2O, StreamingLLM): drop low-attention tokens, enabling unbounded context with fixed memory
  • Quantized KV cache: INT8 halves memory (minimal quality loss), INT4 quarters it (some quality loss)
  • These optimizations compose: paged attention, prefix caching, and INT8 quantization can all be applied together

Exercises

ExerciseCore

Problem

A serving system uses paged attention with block size $b = 16$. A request uses 500 tokens. How many blocks are allocated? What is the memory utilization? Compare to naive pre-allocation at maximum length $T = 8192$.

ExerciseAdvanced

Problem

A serving cluster handles 1000 concurrent chatbot requests. Each request has a 2048-token system prompt and an average conversation length of 1024 user/assistant tokens (total 3072 tokens per request). The model uses GQA with $g = 8$, $d_k = 128$, $L = 80$ layers, in fp16. Compare total KV cache memory with and without prefix caching.

References

Canonical:

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023)

Current:

  • Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" (2023)
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (2023); introduces StreamingLLM
  • Hooper et al., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization" (2024)


Last reviewed: April 2026
