LLM Construction
Prefix Caching
Share computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.
Why This Matters
In production LLM serving, many requests share the same prefix: a system prompt, few-shot examples, or a shared document context. Without prefix caching, each request recomputes the KV cache for this shared prefix from scratch. For a 2000-token system prompt processed 1000 times per minute, that is 2 million redundant token computations per minute. Prefix caching eliminates this waste.
Mental Model
Standard KV cache: each request computes and stores its own key-value pairs for all tokens. Prefix caching: if two requests share the first tokens, they share the same KV cache entries for those tokens. Only the suffix (the tokens that differ) requires new computation.
This is analogous to a trie (prefix tree) for strings: common prefixes are stored once, and branches occur only where sequences diverge.
Formal Setup
Prefix Cache
A prefix cache stores KV cache entries indexed by token sequences. For a request with token sequence $t_1, \dots, t_n$, the cache checks whether a prefix $t_1, \dots, t_k$ has been computed before. If so, it reuses those cached KV entries and only computes entries for tokens $t_{k+1}, \dots, t_n$.
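The lookup-and-reuse logic can be sketched with a dictionary keyed by token tuples. This is a simplification of real implementations, which cache at block granularity; `compute_kv` is a stand-in for the model's per-token key/value computation, and storing every prefix explicitly is a sketch-level simplification.

```python
def compute_kv(token):
    # Placeholder for the model's real key/value computation for one token.
    return ("kv", token)

class PrefixCache:
    def __init__(self):
        # Maps token-tuple prefixes to their lists of KV entries.
        # (Storing every prefix is O(n^2) space; fine for a sketch.)
        self.cache = {}

    def get_kv(self, tokens):
        tokens = tuple(tokens)
        # Find the longest previously computed prefix t_1..t_k.
        k = len(tokens)
        while k > 0 and tokens[:k] not in self.cache:
            k -= 1
        cached = self.cache[tokens[:k]] if k > 0 else []
        # Compute KV entries only for the uncached suffix t_{k+1}..t_n.
        kv = cached + [compute_kv(t) for t in tokens[k:]]
        # Register every prefix of this request for future hits.
        for i in range(k + 1, len(tokens) + 1):
            self.cache[tokens[:i]] = kv[:i]
        return kv, k  # k = number of reused entries
```

A second request sharing the first three tokens would report `k = 3` reused entries and compute KV entries only for its unique suffix.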
Radix Attention Tree
A radix attention tree organizes cached KV entries in a radix tree (compressed trie) structure. Each edge corresponds to a sequence of tokens. Lookup finds the longest matching prefix in $O(m)$ time, where $m$ is the query length. Insertion adds only the new suffix. The tree supports efficient eviction by tracking access recency at each node.
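The tree operations can be sketched with an uncompressed token trie. A real radix tree merges single-child chains into one edge, but longest-prefix lookup and suffix-only insertion work the same way; this is an illustrative sketch, not SGLang's implementation.

```python
import time

class Node:
    def __init__(self):
        self.children = {}      # token -> child Node
        self.last_access = 0.0  # recency information, usable by eviction

class TokenTrie:
    def __init__(self):
        self.root = Node()

    def match_longest_prefix(self, tokens):
        # Walk down the tree, stopping at the first miss: O(m) in the
        # query length m. Returns the length of the longest cached prefix.
        node, depth = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = time.monotonic()
            depth += 1
        return depth

    def insert(self, tokens):
        # setdefault creates only the nodes missing from the tree,
        # i.e. only the new suffix is added.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_access = time.monotonic()
```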
Main Theorems
Prefix Cache Latency Reduction
Statement
For a batch of $B$ requests each of length $L$, where all share a common prefix of length $P$, the prefill computation with prefix caching is:

$$P + B(L - P)$$

compared to $BL$ without caching. The speedup factor for the prefill phase is:

$$\text{speedup} = \frac{BL}{P + B(L - P)}$$

When $P/L$ is close to 1 (long shared prefix) and $B$ is large, the speedup approaches $L/(L - P)$.
Intuition
The prefix is computed once instead of $B$ times. If the prefix is 90% of each request ($P = 0.9L$) and $B$ is large, you compute roughly 10% of what you would without caching: a 10x speedup on prefill.
Proof Sketch
Without caching, each of the $B$ requests processes all $L$ tokens: cost $BL$. With caching, the prefix of length $P$ is processed once (cost $P$), and each request processes its unique suffix of length $L - P$ (cost $L - P$ each). Total: $P + B(L - P)$.
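The cost model can be checked numerically with a few lines of arithmetic, following the counts above:

```python
def prefill_speedup(B, L, P):
    # Ratio of uncached prefill cost B*L to cached cost P + B*(L - P).
    return (B * L) / (P + B * (L - P))

# Prefix is 90% of each request: speedup approaches L / (L - P) = 10 as B grows.
print(prefill_speedup(1, 1000, 900))       # single request: no benefit
print(prefill_speedup(10_000, 1000, 900))  # large batch: close to 10x
```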
Why It Matters
In real deployments, the system prompt can be 1000+ tokens while user messages are 50-200 tokens. With $P/L \approx 0.85$ and hundreds of concurrent requests, prefix caching reduces prefill costs by roughly 5-7x. This directly reduces time-to-first-token (TTFT), the latency metric users care about most.
Failure Mode
Prefix caching provides no benefit when requests do not share prefixes (e.g., each user has a unique context). It also requires exact token-level matching: even a single different token in the prefix breaks the cache hit. If the system prompt changes frequently, the cache hit rate drops. Memory overhead for storing cached KV entries must be managed through eviction policies.
Implementation Details
Automatic Prefix Detection
The serving system does not require manual specification of which tokens are the "prefix." Instead, it hashes token sequences and performs longest-prefix matching automatically. When a new request arrives, the system walks the radix tree to find the longest cached prefix.
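One way to implement this detection, sketched here under the assumption of block-level prefix hashing broadly similar to vLLM's approach (the block size and hash construction are illustrative, not the library's actual scheme):

```python
BLOCK = 16  # tokens per cache block (illustrative size)

def block_hashes(tokens):
    # Chain each full block's hash with the hash of everything before it,
    # so a block's hash identifies the entire prefix ending at that block.
    hashes, prev = [], hash(())
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

def longest_cached_prefix(tokens, cached):
    # Count leading full blocks whose chained hash is already cached.
    n = 0
    for h in block_hashes(tokens):
        if h not in cached:
            break
        n += 1
    return n * BLOCK  # length in tokens of the longest cached prefix
```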
Eviction Policies
GPU memory is finite. When the cache is full, eviction must discard some entries. Common strategies:
- LRU (Least Recently Used): evict the prefix that was accessed longest ago.
- Reference counting: evict entries not currently used by any active request.
- Size-aware: prefer evicting shorter prefixes (they are cheaper to recompute).
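The first of these policies, LRU eviction over whole prefixes, can be sketched with an `OrderedDict` whose insertion order tracks recency (entry sizes are in KV blocks; the names here are illustrative):

```python
from collections import OrderedDict

class LRUPrefixStore:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.entries = OrderedDict()  # prefix -> size; oldest entries first
        self.used = 0

    def touch(self, prefix):
        # Mark a prefix as most recently used on a cache hit.
        self.entries.move_to_end(prefix)

    def add(self, prefix, size):
        # Evict least recently used prefixes until the new entry fits.
        while self.used + size > self.capacity and self.entries:
            _, freed = self.entries.popitem(last=False)
            self.used -= freed
        self.entries[prefix] = size
        self.used += size
```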
Implementations
vLLM: supports automatic prefix caching with hash-based lookup. Enabled with a configuration flag.
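As a hedged illustration, the flag can be passed to vLLM's offline `LLM` constructor; the model name below is a placeholder, and since this requires a GPU it is shown as a configuration fragment rather than a tested example:

```python
from vllm import LLM

# enable_prefix_caching turns on vLLM's automatic prefix caching.
# (Model name is a placeholder assumption.)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
```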
SGLang: designed around prefix caching from the start. Uses a radix attention tree as the core data structure. Supports tree-structured KV cache management for branching conversations (e.g., multiple candidate responses sharing a prompt).
Common Confusions
Prefix caching only helps the prefill phase
The prefill phase (processing the prompt tokens) is where prefix caching saves computation. The decode phase (generating new tokens one at a time) is unchanged because each new token is unique to the request. For long prompts with short completions, prefix caching has large impact. For short prompts with long completions, the benefit is smaller.
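The prompt/completion trade-off can be made concrete with the same linear token-count model (a back-of-envelope sketch that assumes the whole prompt is shared and ignores the per-step cost growth of attention):

```python
def total_tokens(B, prompt, completion, cached):
    # Prefill: shared prompt computed once if cached, else once per request.
    prefill = prompt if cached else B * prompt
    # Decode: completion tokens are always computed per request.
    decode = B * completion
    return prefill + decode

# Long prompt, short completion: caching removes most of the work.
long_gain = total_tokens(100, 2000, 50, False) / total_tokens(100, 2000, 50, True)
# Short prompt, long completion: decode dominates and caching barely helps.
short_gain = total_tokens(100, 100, 2000, False) / total_tokens(100, 100, 2000, True)
```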
Prefix caching is not the same as prompt caching at the API level
API-level prompt caching (offered by some providers) may use prefix caching internally, but it can also involve caching at higher levels (e.g., caching the full response for identical prompts). Prefix caching specifically refers to reusing KV cache entries at the attention layer, which works even when the suffixes differ.
Exercises
Problem
A serving system handles 100 requests per second. Each request has a 1500-token system prompt and a 200-token user message. Assuming all requests share the same system prompt, compute the prefill token savings per second with prefix caching versus without.
Problem
Explain why prefix caching requires exact token-level matching and cannot work with semantically similar but lexically different prefixes. What would be needed to cache "soft" prefix matches?
References
Canonical:
- Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," arXiv:2312.07104 (2023)
Current:
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP (2023)
- vLLM documentation on automatic prefix caching
Next Topics
- Speculative decoding: another inference optimization that reduces latency by drafting tokens with a smaller model
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- KV Cache (Layer 5)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- KV Cache Optimization (Layer 5)