
LLM Construction

Prefix Caching

Share computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.

Advanced · Tier 2 · Frontier · ~35 min

Why This Matters

In production LLM serving, many requests share the same prefix: a system prompt, few-shot examples, or a shared document context. Without prefix caching, each request recomputes the KV cache for this shared prefix from scratch. For a 2000-token system prompt processed 1000 times per minute, that is 2 million redundant token computations per minute. Prefix caching eliminates this waste.
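The back-of-envelope arithmetic above can be checked directly (the prompt length and request rate are the figures from the paragraph, used illustratively):

```python
# Back-of-envelope: redundant prefill work from recomputing a shared prefix.
prefix_tokens = 2000           # shared system prompt length
requests_per_minute = 1000

# Without prefix caching, every request reprocesses the full prefix.
redundant_tokens_per_minute = prefix_tokens * requests_per_minute
print(redundant_tokens_per_minute)  # 2000000
```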

Mental Model

Standard KV cache: each request computes and stores its own key-value pairs for all tokens. Prefix caching: if two requests share the first k tokens, they share the same KV cache entries for those k tokens. Only the suffix (the tokens that differ) requires new computation.

This is analogous to a trie (prefix tree) for strings: common prefixes are stored once, and branches occur only where sequences diverge.

Formal Setup

Definition

Prefix Cache

A prefix cache stores KV cache entries indexed by token sequences. For a request with token sequence [t_1, t_2, \ldots, t_n], the cache checks whether a prefix [t_1, \ldots, t_k] has been computed before. If so, it reuses those k cached KV entries and only computes entries for tokens t_{k+1}, \ldots, t_n.
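A minimal sketch of this definition, assuming cached KV entries can be stood in for by placeholder strings (a real system stores key/value tensors, and would use a tree rather than this naive linear scan):

```python
# Minimal prefix cache keyed by token prefixes (illustrative only).
class PrefixCache:
    def __init__(self):
        # Maps a token-tuple prefix to its (placeholder) KV entries.
        self._store = {}

    def longest_cached_prefix(self, tokens):
        """Return the length k of the longest cached prefix of `tokens`."""
        for k in range(len(tokens), 0, -1):
            if tuple(tokens[:k]) in self._store:
                return k
        return 0

    def insert(self, tokens, kv_entries):
        self._store[tuple(tokens)] = kv_entries

cache = PrefixCache()
cache.insert([1, 2, 3], "kv(1,2,3)")
k = cache.longest_cached_prefix([1, 2, 3, 4, 5])
print(k)  # 3 -> only tokens 4 and 5 need new KV computation
```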

Definition

Radix Attention Tree

A radix attention tree organizes cached KV entries in a radix tree (compressed trie) structure. Each edge corresponds to a sequence of tokens. Lookup finds the longest matching prefix in O(n) time, where n is the query length. Insertion adds only the new suffix. The tree supports efficient eviction by tracking access recency at each node.
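The two operations the definition names can be sketched as follows. This simplified version stores one token per edge rather than compressed multi-token edges, so it is a plain trie with a recency stamp per node:

```python
# Simplified radix-style tree over token sequences (one token per edge).
class Node:
    def __init__(self):
        self.children = {}      # token -> Node
        self.last_access = 0    # recency stamp, usable for eviction

class RadixTree:
    def __init__(self):
        self.root = Node()
        self.clock = 0

    def match_prefix(self, tokens):
        """Length of the longest stored prefix of `tokens`, in O(n)."""
        self.clock += 1
        node, k = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = self.clock
            k += 1
        return k

    def insert(self, tokens):
        """Add `tokens`, creating nodes only for the unmatched suffix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())

tree = RadixTree()
tree.insert([10, 20, 30, 40])               # first request's tokens
print(tree.match_prefix([10, 20, 30, 99]))  # 3: shared prefix of length 3
```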

Main Theorems

Proposition

Prefix Cache Latency Reduction

Statement

For a batch of B requests each of length n, where all share a common prefix of length k, the prefill computation with prefix caching is:

\text{Cost}_{\text{cached}} = k + B(n - k)

compared to Bn without caching. The speedup factor for the prefill phase is:

\text{Speedup} = \frac{Bn}{k + B(n - k)} = \frac{1}{k/(Bn) + 1 - k/n}

When k/n is large (long shared prefix) and B is large, the speedup approaches n/(n - k).

Intuition

The prefix is computed once instead of B times. If the prefix is 90% of each request (k = 0.9n) and B is large, you compute roughly 10% of what you would without caching: a 10x speedup on prefill.

Proof Sketch

Without caching, each of B requests processes all n tokens: cost Bn. With caching, the prefix of length k is processed once (cost k), and each request processes its unique suffix of length n - k (total cost B(n - k)). Total: k + B(n - k).
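The cost accounting in the sketch can be verified numerically (the batch size and lengths below are illustrative choices):

```python
# Numeric check of the cost model: prefix computed once, suffixes per request.
B, n, k = 100, 1000, 900   # 100 requests, 90% shared prefix

cost_uncached = B * n                    # every request reprocesses n tokens
cost_cached = k + B * (n - k)            # prefix once, then B unique suffixes
speedup = cost_uncached / cost_cached

print(cost_cached)         # 10900
print(round(speedup, 2))   # 9.17, approaching n/(n - k) = 10 as B grows
```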

Why It Matters

In real deployments, the system prompt can be 1000+ tokens while user messages are 50-200 tokens. With k/n \approx 0.85 and hundreds of concurrent requests, prefix caching reduces prefill costs by roughly 5-7x. This directly reduces time-to-first-token (TTFT), the latency metric users care about most.

Failure Mode

Prefix caching provides no benefit when requests do not share prefixes (e.g., each user has a unique context). It also requires exact token-level matching: even a single different token in the prefix breaks the cache hit. If the system prompt changes frequently, the cache hit rate drops. Memory overhead for storing cached KV entries must be managed through eviction policies.

Implementation Details

Automatic Prefix Detection

The serving system does not require manual specification of which tokens are the "prefix." Instead, it hashes token sequences and performs longest-prefix matching automatically. When a new request arrives, the system walks the radix tree to find the longest cached prefix.
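One common way to implement this matching is to hash fixed-size token blocks cumulatively, so that equal prefixes yield equal hash chains. The sketch below assumes a block size of 16 for illustration; it is not any particular system's exact scheme:

```python
# Hash-based prefix detection over fixed-size token blocks (illustrative).
import hashlib

BLOCK = 16  # assumed block size, chosen for illustration

def block_hashes(tokens):
    """Cumulative hash per full block: equal prefixes give equal hashes."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

a = block_hashes(list(range(40)))             # two full blocks hashed
b = block_hashes(list(range(32)) + [999] * 8) # same first two blocks

matched = 0
for x, y in zip(a, b):
    if x != y:
        break
    matched += 1
print(matched * BLOCK)  # 32 prefix tokens reusable from cache
```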

Eviction Policies

GPU memory is finite. When the cache is full, eviction must discard some entries. Common strategies:

  • LRU (Least Recently Used): evict the prefix that was accessed longest ago.
  • Reference counting: evict entries not currently used by any active request.
  • Size-aware: prefer evicting shorter prefixes (they are cheaper to recompute).
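An LRU policy over cached prefixes can be sketched with an ordered map. This toy version measures capacity in tokens; real systems evict fixed-size KV blocks and must also respect reference counts for active requests:

```python
# LRU eviction for cached prefixes, capacity measured in tokens (toy model).
from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.used = 0
        self.entries = OrderedDict()   # prefix tuple -> token count

    def touch(self, prefix):
        """Mark a prefix as recently used."""
        self.entries.move_to_end(prefix)

    def insert(self, prefix):
        self.entries[prefix] = len(prefix)
        self.used += len(prefix)
        while self.used > self.capacity:
            _evicted, size = self.entries.popitem(last=False)  # oldest first
            self.used -= size

cache = LRUPrefixCache(capacity_tokens=5)
cache.insert((1, 2, 3))
cache.insert((4, 5))
cache.touch((1, 2, 3))       # (4, 5) is now least recently used
cache.insert((6,))           # over capacity: evicts (4, 5)
print(list(cache.entries))   # [(1, 2, 3), (6,)]
```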

Implementations

vLLM: supports automatic prefix caching with hash-based lookup. Enabled with a configuration flag.
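As an illustration, vLLM's offline API exposes this as a constructor flag (per vLLM's documentation; the model name below is a placeholder and the exact argument may vary by version):

```python
# Assumes vLLM is installed and a GPU is available; model name is a placeholder.
from vllm import LLM

# enable_prefix_caching turns on automatic, hash-based prefix reuse.
llm = LLM(model="<your-model>", enable_prefix_caching=True)
```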

SGLang: designed around prefix caching from the start. Uses a radix attention tree as the core data structure. Supports tree-structured KV cache management for branching conversations (e.g., multiple candidate responses sharing a prompt).

Common Confusions

Watch Out

Prefix caching only helps the prefill phase

The prefill phase (processing the prompt tokens) is where prefix caching saves computation. The decode phase (generating new tokens one at a time) is unchanged because each new token is unique to the request. For long prompts with short completions, prefix caching has large impact. For short prompts with long completions, the benefit is smaller.
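A toy cost model makes this concrete. It counts "token computations" for both phases, an illustrative simplification (prefill and decode tokens do not actually cost the same), using the prompt sizes from the deployment example above:

```python
# Prefix caching shrinks only the prefill term, not decode (toy cost model).
prompt, prefix, completion = 1700, 1500, 200   # tokens; illustrative values

prefill_speedup = prompt / (prompt - prefix)   # prefill-only: 8.5x
end_to_end = (prompt + completion) / ((prompt - prefix) + completion)

print(prefill_speedup)        # 8.5
print(round(end_to_end, 2))   # 4.75: decode work dilutes the gain
```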

Watch Out

Prefix caching is not the same as prompt caching at the API level

API-level prompt caching (offered by some providers) may use prefix caching internally, but it can also involve caching at higher levels (e.g., caching the full response for identical prompts). Prefix caching specifically refers to reusing KV cache entries at the attention layer, which works even when the suffixes differ.

Exercises

ExerciseCore

Problem

A serving system handles 100 requests per second. Each request has a 1500-token system prompt and a 200-token user message. Assuming all requests share the same system prompt, compute the prefill token savings per second with prefix caching versus without.

ExerciseAdvanced

Problem

Explain why prefix caching requires exact token-level matching and cannot work with semantically similar but lexically different prefixes. What would be needed to cache "soft" prefix matches?

References

Canonical:

  • Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," arXiv:2312.07104 (2023)

Current:

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP (2023)
  • vLLM documentation on automatic prefix caching

Next Topics

  • Speculative decoding: another inference optimization that reduces latency by drafting tokens with a smaller model

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.