
LLM Construction

KV Cache Optimization

Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.

Advanced · Tier 2 · Frontier · ~45 min


Why This Matters

The KV cache is the dominant memory consumer during LLM inference. For a 70B model serving 128K-token contexts, the KV cache requires approximately 40 GB per sequence. Naive memory management wastes 60-80% of allocated GPU memory through fragmentation and over-allocation. The techniques in this topic collectively reduce memory waste, increase batch sizes, and enable longer contexts. They are the difference between serving 4 concurrent requests and serving 20 on the same hardware.

Mental Model

Think of GPU memory for KV cache like a hard drive for a file system. Naive allocation is like giving every file its maximum possible size upfront. Paged attention is like a file system with block allocation: allocate small blocks on demand, use a page table to track them. Prefix caching is like hard links: multiple files sharing the same data blocks. Token eviction is like an LRU cache eviction policy. Quantization is like file compression.

Paged Attention (vLLM)

Standard KV cache allocation pre-allocates a contiguous memory region for each sequence at the maximum context length. Problems:

  1. Internal fragmentation: a sequence using 500 of 4096 allocated tokens wastes 88% of its allocation
  2. External fragmentation: after sequences of different lengths complete, the freed memory has gaps that cannot be used for new large allocations
  3. No sharing: two sequences with the same system prompt store identical KV cache entries separately
Definition

Paged Attention

Paged attention (Kwon et al., 2023) divides the KV cache into fixed-size blocks (pages), each storing key-value pairs for a fixed number of tokens (typically 16 or 32). A block table maps logical token positions to physical memory blocks for each sequence. Blocks are allocated on demand from a shared pool and need not be contiguous.
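The block-table bookkeeping can be sketched in a few lines. This is a hypothetical simplification, not vLLM's actual API: blocks come from a shared free pool, and a per-sequence block table maps logical block indices to physical block ids.

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # last block is full: allocate
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because freed blocks return to a shared pool of uniform size, there is no external fragmentation: any freed block can serve any future allocation.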

Proposition

Paged Attention Achieves Near-Optimal Memory Utilization

Statement

With paged attention and block size $b$, the memory waste per sequence is at most $b - 1$ token slots (the internal fragmentation of the last block). The memory utilization ratio is:

$$\text{utilization} = \frac{t}{\lceil t/b \rceil \cdot b} \geq \frac{t}{t + b - 1}$$

For $t \gg b$, utilization approaches 1. In contrast, contiguous pre-allocation at maximum length $T$ has utilization $t/T$, which can be arbitrarily small.

Kwon et al. (2023) report that vLLM achieves memory utilization above 96% compared to 20-40% for naive pre-allocation in typical serving workloads.

Intuition

The only wasted memory is the unused portion of the last block. With block size 16 and a 1000-token sequence, you allocate 63 blocks (1008 slots), wasting only 8 slots (0.8%). Naive pre-allocation at $T = 4096$ would waste 3096 slots (75.6%).
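The utilization formula is easy to check numerically. A minimal sketch, using the numbers from the text (1000-token sequence, block size 16, maximum length 4096):

```python
import math

def paged_utilization(t: int, b: int) -> float:
    """Fraction of allocated slots actually used with block size b."""
    return t / (math.ceil(t / b) * b)

def naive_utilization(t: int, max_len: int) -> float:
    """Fraction used when pre-allocating the maximum context length."""
    return t / max_len

print(round(paged_utilization(1000, 16), 3))    # 1000 / 1008 -> 0.992
print(round(naive_utilization(1000, 4096), 3))  # 1000 / 4096 -> 0.244
```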

Why It Matters

The 2-4x throughput improvement reported by vLLM comes almost entirely from this memory efficiency gain. Higher utilization means more sequences can fit in GPU memory simultaneously, which means larger batch sizes and better GPU utilization during the memory-bandwidth-bound decode phase.

Failure Mode

Paged attention adds a layer of indirection (the block table lookup) at each attention computation. This adds a small latency overhead per token, typically under 5% on modern hardware. The overhead increases if the block table becomes very large (extremely long sequences with very small block sizes).

Prefix Caching

Many serving scenarios involve repeated prefixes. Every request to a chatbot shares the same system prompt. Every request in a batch of document QA shares the same document context. Prefix caching exploits this by storing KV cache blocks for common prefixes once and sharing them across requests.

Definition

Prefix Caching

Prefix caching identifies requests with shared token prefixes and maps their block tables to the same physical memory blocks for those shared positions. When a request diverges from the prefix, new blocks are allocated for the diverging tokens. This uses copy-on-write semantics: the shared blocks are read-only, and a request only gets its own copy of a block when it needs to modify it (which does not happen for KV cache, since past key-value pairs are immutable).
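One common way to detect shared prefixes is content addressing: key each full block of prompt tokens by a hash of all token ids up to and including that block, so two requests with the same prefix produce the same block ids. The sketch below is illustrative (the hashing scheme and names are assumptions, not a specific system's implementation); note that only full blocks are shared, since a partially filled block may still diverge.

```python
import hashlib

def prefix_block_ids(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Content-addressed ids for each *full* block of a prompt.

    Two prompts that agree on their first k*block_size tokens get the same
    first k ids, so their block tables can point at the same physical blocks.
    """
    ids = []
    for end in range(block_size, len(token_ids) + 1, block_size):
        digest = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        ids.append(digest[:16])
    return ids

system_prompt = list(range(32))             # 32 shared system-prompt tokens
req_a = system_prompt + [100, 101, 102]     # diverges after the prefix
req_b = system_prompt + [200, 201]

a, b = prefix_block_ids(req_a), prefix_block_ids(req_b)
assert a == b   # both full blocks cover only the shared prefix
```

Hashing the entire prefix up to each block (rather than the block's tokens alone) is what makes the id position-dependent: the same 16 tokens appearing at a different offset get a different id, as required for KV cache correctness.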

Proposition

Prefix Sharing Reduces Memory Proportional to Shared Length

Statement

Without prefix caching, the total KV cache memory for $N$ requests of length $T$ is proportional to $N \cdot T$. With prefix caching, a shared prefix of $p$ tokens occupies $\lceil p/b \rceil$ blocks stored once, and each request adds $\lceil (T-p)/b \rceil$ unique blocks. The total memory is:

$$\text{memory} \propto \lceil p/b \rceil + N \cdot \lceil (T-p)/b \rceil$$

The memory savings ratio is approximately:

$$\text{savings} = 1 - \frac{p/b + N(T-p)/b}{NT/b} = \frac{(N-1)p}{NT} = \frac{p}{T} \cdot \frac{N-1}{N}$$

For $N$ large and $p/T$ substantial, savings approach $p/T$. A shared system prompt of 2000 tokens in a 4000-token context with 100 requests saves approximately 50% of KV cache memory.
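The worked example can be verified directly from the block counts, including the ceiling rounding that the closed-form approximation ignores:

```python
import math

def kv_blocks(tokens: int, b: int = 16) -> int:
    """Number of KV cache blocks needed for a token count."""
    return math.ceil(tokens / b)

def prefix_savings(n_requests: int, total_len: int, prefix_len: int, b: int = 16) -> float:
    """Fraction of KV blocks saved by sharing a common prefix across requests."""
    without = n_requests * kv_blocks(total_len, b)
    with_sharing = kv_blocks(prefix_len, b) + n_requests * kv_blocks(total_len - prefix_len, b)
    return 1 - with_sharing / without

# Numbers from the text: 100 requests, 4000-token context, 2000-token shared prompt.
print(round(prefix_savings(100, 4000, 2000), 3))  # -> 0.495, i.e. ~50%
```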

Intuition

If all requests share 50% of their tokens as a common prefix, you store that prefix once instead of $N$ times. The memory for the shared portion drops from $O(N)$ to $O(1)$.

Why It Matters

In production chatbot deployments, system prompts are often 1000-3000 tokens. With thousands of concurrent requests, prefix caching saves terabytes of aggregate KV cache memory across a serving cluster.

Failure Mode

Prefix caching only helps when requests genuinely share a prefix. For workloads with diverse, unique prompts (e.g., code completion on different codebases), the cache hit rate is low and the overhead of tracking potential prefixes may not be worthwhile.

Token Eviction for Long Sequences

For sequences that exceed available memory, or when aggressive memory savings are needed, some tokens can be dropped from the KV cache.

Attention-based eviction. Track the cumulative attention weight each cached token receives. Tokens that consistently receive near-zero attention are unlikely to be needed in the future and can be evicted. H2O (Heavy Hitter Oracle, Zhang et al., 2023) keeps the tokens with highest cumulative attention scores plus a sliding window of recent tokens.

Sliding window attention. Keep only the most recent $w$ tokens in the KV cache plus a set of "sink" tokens (the first few tokens, which empirically always receive high attention). StreamingLLM (Xiao et al., 2023) uses this approach to enable infinite-length generation with fixed memory.

The trade-off is clear: evicted tokens are permanently lost. If the model later needs information from an evicted token, generation quality degrades.
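The sink-plus-window policy amounts to simple bookkeeping over cached token positions. A minimal sketch of the StreamingLLM-style policy (class and parameter names are illustrative):

```python
from collections import deque

class SinkWindowCache:
    """Keep the first `sinks` token positions plus the most recent `window`."""

    def __init__(self, sinks: int = 4, window: int = 8):
        self.sinks = sinks
        self.sink_positions: list[int] = []
        self.recent: deque[int] = deque(maxlen=window)  # deque auto-evicts oldest

    def add(self, pos: int) -> None:
        if len(self.sink_positions) < self.sinks:
            self.sink_positions.append(pos)    # first few tokens are never evicted
        else:
            self.recent.append(pos)            # everything else is a sliding window

    def cached_positions(self) -> list[int]:
        return self.sink_positions + list(self.recent)

cache = SinkWindowCache(sinks=4, window=8)
for pos in range(100):
    cache.add(pos)
# Memory stays fixed at 12 entries no matter how long generation runs.
assert cache.cached_positions() == [0, 1, 2, 3] + list(range(92, 100))
```

Positions 4 through 91 have been permanently evicted, which is exactly the lossy trade-off described above.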

Quantized KV Cache

The KV cache is typically stored in the same precision as model activations (fp16 or bf16). But the key and value vectors tolerate quantization better than model weights because they are consumed once (during the attention computation) rather than reused across the entire forward pass.

INT8 KV cache. Quantize cached keys and values to 8-bit integers. This halves the memory footprint. Dequantization happens on-the-fly during the attention computation. The accuracy degradation is typically under 0.5% on standard benchmarks.

INT4 KV cache. Quantize to 4-bit integers. This quarters the memory footprint but introduces more noticeable quality degradation, especially for tasks requiring precise recall of information from early in the context. Group quantization (quantizing blocks of values with per-group scale factors) mitigates this.

For a 70B model with GQA ($g = 8$ KV heads, $d_k = 128$, 80 layers) serving 128K-token contexts:

  • fp16 KV cache: approximately 40 GB per sequence
  • INT8 KV cache: approximately 20 GB per sequence
  • INT4 KV cache: approximately 10 GB per sequence
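Symmetric per-group quantization of KV vectors can be sketched with NumPy. This is a simplified illustration of the group-quantization idea described above (the group size of 64 is an assumption, and real kernels fuse dequantization into the attention computation rather than materializing fp values):

```python
import numpy as np

def quantize_int8(x: np.ndarray, group: int = 64):
    """Quantize a (tokens, dim) float array to int8, one scale per group."""
    g = x.reshape(x.shape[0], -1, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0  # symmetric range
    scale = np.maximum(scale, 1e-8)                        # avoid divide-by-zero
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)                     # int8 payload + fp16 scales

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct fp values on-the-fly for the attention computation."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 128)).astype(np.float32)  # 1000 cached key vectors
q, s = quantize_int8(keys)
err = np.abs(dequantize_int8(q, s) - keys).max()
assert q.dtype == np.int8 and err < 0.1   # half the bytes of fp16, small error
```

Per-group scales are what keep the error small: one outlier value only inflates the quantization step for its own 64-element group, not for the whole vector.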

Common Confusions

Watch Out

Paged attention is about memory management, not attention computation

Paged attention does not change the mathematical attention operation. The queries, keys, values, and softmax computation are identical. What changes is where the keys and values are stored in GPU memory and how that memory is allocated and tracked. The attention kernel must be modified to handle non-contiguous memory (gathering blocks via the page table), but the mathematical result is the same.
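This invariance is easy to demonstrate: scatter the same keys and values into arbitrarily placed physical blocks, gather them through a block table, and the attention output is bit-for-bit identical to the contiguous case. A small NumPy sketch (single query, single head, illustrative sizes):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

block_size, n_blocks, d = 4, 3, 8
rng = np.random.default_rng(0)
k_contig = rng.standard_normal((block_size * n_blocks, d))
v_contig = rng.standard_normal((block_size * n_blocks, d))
q = rng.standard_normal(d)

# Scatter the logical blocks into a larger physical pool in arbitrary order.
pool_k = np.zeros((10 * block_size, d))
pool_v = np.zeros((10 * block_size, d))
block_table = [7, 2, 5]                       # logical block i -> physical block
for i, phys in enumerate(block_table):
    pool_k[phys*block_size:(phys+1)*block_size] = k_contig[i*block_size:(i+1)*block_size]
    pool_v[phys*block_size:(phys+1)*block_size] = v_contig[i*block_size:(i+1)*block_size]

# Gather via the block table, then run the standard attention computation.
idx = np.concatenate([np.arange(p*block_size, (p+1)*block_size) for p in block_table])
out_paged = softmax(pool_k[idx] @ q / np.sqrt(d)) @ pool_v[idx]
out_ref = softmax(k_contig @ q / np.sqrt(d)) @ v_contig
assert np.allclose(out_paged, out_ref)        # identical result, scattered storage
```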

Watch Out

Token eviction is lossy, prefix caching is lossless

Prefix caching is an exact optimization: the shared KV cache entries are identical to what would be computed independently. Token eviction discards information and can degrade generation quality. They solve different problems: prefix caching reduces redundancy across requests; token eviction reduces memory per request.

Watch Out

KV cache quantization is different from weight quantization

Weight quantization compresses model parameters, reducing model size and compute. KV cache quantization compresses the runtime cache, reducing inference memory. They can be applied independently. A model can use INT4 weights with fp16 KV cache, or fp16 weights with INT8 KV cache, or both.

Summary

  • Paged attention: allocate KV cache in blocks on demand, like virtual memory pages
  • vLLM achieves 96%+ memory utilization vs. 20-40% for naive pre-allocation
  • Prefix caching: share KV blocks for common prefixes across requests, saving $\sim p/T$ of KV memory
  • Token eviction (H2O, StreamingLLM): drop low-attention tokens, enabling unbounded context with fixed memory
  • Quantized KV cache: INT8 halves memory (minimal quality loss), INT4 quarters it (some quality loss)
  • These optimizations compose: paged attention, prefix caching, and INT8 quantization can all be applied together

Exercises

ExerciseCore

Problem

A serving system uses paged attention with block size $b = 16$. A request uses 500 tokens. How many blocks are allocated? What is the memory utilization? Compare to naive pre-allocation at maximum length $T = 8192$.

ExerciseAdvanced

Problem

A serving cluster handles 1000 concurrent chatbot requests. Each request has a 2048-token system prompt and an average conversation length of 1024 user/assistant tokens (total 3072 tokens per request). The model uses GQA with $g = 8$, $d_k = 128$, $L = 80$ layers, in fp16. Compare total KV cache memory with and without prefix caching.

References

Canonical:

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023)

Current:

  • Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" (2023)
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (2023); introduces StreamingLLM
  • Hooper et al., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization" (2024)


Last reviewed: April 2026
