
LLM Construction

Prefix Caching

Share computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.

Advanced · Tier 2 · Frontier · ~35 min

Why This Matters

In production LLM serving, many requests share the same prefix: a system prompt, few-shot examples, or a shared document context. Without prefix caching, each request recomputes the KV cache for this shared prefix from scratch. For a 2000-token system prompt processed 1000 times per minute, that is 2 million redundant token computations per minute. Prefix caching eliminates this waste.
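The back-of-envelope arithmetic above can be checked directly (the prompt length and request rate are the figures from the paragraph, used illustratively):

```python
# Back-of-envelope: redundant prefill work from recomputing a shared prefix.
prefix_tokens = 2000           # shared system prompt length
requests_per_minute = 1000

# Without prefix caching, every request reprocesses the full prefix.
redundant_tokens_per_minute = prefix_tokens * requests_per_minute
print(redundant_tokens_per_minute)  # 2000000
```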

Mental Model

Standard KV cache: each request computes and stores its own key-value pairs for all tokens. Prefix caching: if two requests share the first k tokens, they share the same KV cache entries for those k tokens. Only the suffix (the tokens that differ) requires new computation.

This is analogous to a trie (prefix tree) for strings: common prefixes are stored once, and branches occur only where sequences diverge.

Formal Setup

Definition

Prefix Cache

A prefix cache stores KV cache entries indexed by token sequences. For a request with token sequence [t_1, t_2, \ldots, t_n], the cache checks whether a prefix [t_1, \ldots, t_k] has been computed before. If so, it reuses those k cached KV entries and only computes entries for tokens t_{k+1}, \ldots, t_n.
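A minimal sketch of this definition, assuming cached KV entries can be stood in for by placeholder strings (a real system stores key/value tensors, and would use a tree rather than this naive linear scan):

```python
# Minimal prefix cache keyed by token prefixes (illustrative only).
class PrefixCache:
    def __init__(self):
        # Maps a token-tuple prefix to its (placeholder) KV entries.
        self._store = {}

    def longest_cached_prefix(self, tokens):
        """Return the length k of the longest cached prefix of `tokens`."""
        for k in range(len(tokens), 0, -1):
            if tuple(tokens[:k]) in self._store:
                return k
        return 0

    def insert(self, tokens, kv_entries):
        self._store[tuple(tokens)] = kv_entries

cache = PrefixCache()
cache.insert([1, 2, 3], "kv(1,2,3)")
k = cache.longest_cached_prefix([1, 2, 3, 4, 5])
print(k)  # 3 -> only tokens 4 and 5 need new KV computation
```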

Definition

Radix Attention Tree

A radix attention tree organizes cached KV entries in a radix tree (compressed trie) structure. Each edge corresponds to a sequence of tokens. Lookup finds the longest matching prefix in O(n) time, where n is the query length. Insertion adds only the new suffix. The tree supports efficient eviction by tracking access recency at each node.
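The two operations the definition names can be sketched as follows. This simplified version stores one token per edge rather than compressed multi-token edges, so it is a plain trie with a recency stamp per node:

```python
# Simplified radix-style tree over token sequences (one token per edge).
class Node:
    def __init__(self):
        self.children = {}      # token -> Node
        self.last_access = 0    # recency stamp, usable for eviction

class RadixTree:
    def __init__(self):
        self.root = Node()
        self.clock = 0

    def match_prefix(self, tokens):
        """Length of the longest stored prefix of `tokens`, in O(n)."""
        self.clock += 1
        node, k = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = self.clock
            k += 1
        return k

    def insert(self, tokens):
        """Add `tokens`, creating nodes only for the unmatched suffix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())

tree = RadixTree()
tree.insert([10, 20, 30, 40])               # first request's tokens
print(tree.match_prefix([10, 20, 30, 99]))  # 3: shared prefix of length 3
```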

Main Theorems

Proposition

Prefix Cache Latency Reduction

Statement

For a batch of B requests each of length n, where all share a common prefix of length k, the prefill computation with prefix caching is:

\text{Cost}_{\text{cached}} = k + B(n - k)

compared to Bn without caching. The speedup factor for the prefill phase is:

\text{Speedup} = \frac{Bn}{k + B(n - k)} = \frac{1}{k/(Bn) + 1 - k/n}

When k/n is large (long shared prefix) and B is large, the speedup approaches n/(n - k).

Intuition

The prefix is computed once instead of B times. If the prefix is 90% of each request (k = 0.9n) and B is large, you compute roughly 10% of what you would without caching: a 10x speedup on prefill.

Proof Sketch

Without caching, each of B requests processes all n tokens: cost Bn. With caching, the prefix of length k is processed once (cost k), and each request processes its unique suffix of length n - k (total cost B(n - k)). Total: k + B(n - k).
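The cost accounting in the sketch can be verified numerically (the batch size and lengths below are illustrative choices):

```python
# Numeric check of the cost model: prefix computed once, suffixes per request.
B, n, k = 100, 1000, 900   # 100 requests, 90% shared prefix

cost_uncached = B * n                    # every request reprocesses n tokens
cost_cached = k + B * (n - k)            # prefix once, then B unique suffixes
speedup = cost_uncached / cost_cached

print(cost_cached)         # 10900
print(round(speedup, 2))   # 9.17, approaching n/(n - k) = 10 as B grows
```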

Why It Matters

In real deployments, the system prompt can be 1000+ tokens while user messages are 50-200 tokens. With k/n \approx 0.85 and hundreds of concurrent requests, prefix caching reduces prefill costs by roughly 5-7x. This directly reduces time-to-first-token (TTFT), the latency metric users care about most.

Failure Mode

Prefix caching provides no benefit when requests do not share prefixes (e.g., each user has a unique context). It also requires exact token-level matching: even a single different token in the prefix breaks the cache hit. If the system prompt changes frequently, the cache hit rate drops. Memory overhead for storing cached KV entries must be managed through eviction policies.

Implementation Details

Automatic Prefix Detection

The serving system does not require manual specification of which tokens are the "prefix." Instead, it hashes token sequences and performs longest-prefix matching automatically. When a new request arrives, the system walks the radix tree to find the longest cached prefix.
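One common way to implement this matching is to hash fixed-size token blocks cumulatively, so that equal prefixes yield equal hash chains. The sketch below assumes a block size of 16 for illustration; it is not any particular system's exact scheme:

```python
# Hash-based prefix detection over fixed-size token blocks (illustrative).
import hashlib

BLOCK = 16  # assumed block size, chosen for illustration

def block_hashes(tokens):
    """Cumulative hash per full block: equal prefixes give equal hashes."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

a = block_hashes(list(range(40)))             # two full blocks hashed
b = block_hashes(list(range(32)) + [999] * 8) # same first two blocks

matched = 0
for x, y in zip(a, b):
    if x != y:
        break
    matched += 1
print(matched * BLOCK)  # 32 prefix tokens reusable from cache
```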

Eviction Policies

GPU memory is finite. When the cache is full, eviction must discard some entries. Common strategies:

  • LRU (Least Recently Used): evict the prefix that was accessed longest ago.
  • Reference counting: evict entries not currently used by any active request.
  • Size-aware: prefer evicting shorter prefixes (they are cheaper to recompute).
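An LRU policy over cached prefixes can be sketched with an ordered map. This toy version measures capacity in tokens; real systems evict fixed-size KV blocks and must also respect reference counts for active requests:

```python
# LRU eviction for cached prefixes, capacity measured in tokens (toy model).
from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.used = 0
        self.entries = OrderedDict()   # prefix tuple -> token count

    def touch(self, prefix):
        """Mark a prefix as recently used."""
        self.entries.move_to_end(prefix)

    def insert(self, prefix):
        self.entries[prefix] = len(prefix)
        self.used += len(prefix)
        while self.used > self.capacity:
            _evicted, size = self.entries.popitem(last=False)  # oldest first
            self.used -= size

cache = LRUPrefixCache(capacity_tokens=5)
cache.insert((1, 2, 3))
cache.insert((4, 5))
cache.touch((1, 2, 3))       # (4, 5) is now least recently used
cache.insert((6,))           # over capacity: evicts (4, 5)
print(list(cache.entries))   # [(1, 2, 3), (6,)]
```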

Implementations

vLLM: supports automatic prefix caching with hash-based lookup. Enabled with a configuration flag.
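As an illustration, vLLM's offline API exposes this as a constructor flag (per vLLM's documentation; the model name below is a placeholder and the exact argument may vary by version):

```python
# Assumes vLLM is installed and a GPU is available; model name is a placeholder.
from vllm import LLM

# enable_prefix_caching turns on automatic, hash-based prefix reuse.
llm = LLM(model="<your-model>", enable_prefix_caching=True)
```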

SGLang: designed around prefix caching from the start. Uses a radix attention tree as the core data structure. Supports tree-structured KV cache management for branching conversations (e.g., multiple candidate responses sharing a prompt).

Common Confusions

Watch Out

Prefix caching only helps the prefill phase

The prefill phase (processing the prompt tokens) is where prefix caching saves computation. The decode phase (generating new tokens one at a time) is unchanged because each new token is unique to the request. For long prompts with short completions, prefix caching has large impact. For short prompts with long completions, the benefit is smaller.
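A toy cost model makes this concrete. It counts "token computations" for both phases, an illustrative simplification (prefill and decode tokens do not actually cost the same), using the prompt sizes from the deployment example above:

```python
# Prefix caching shrinks only the prefill term, not decode (toy cost model).
prompt, prefix, completion = 1700, 1500, 200   # tokens; illustrative values

prefill_speedup = prompt / (prompt - prefix)   # prefill-only: 8.5x
end_to_end = (prompt + completion) / ((prompt - prefix) + completion)

print(prefill_speedup)        # 8.5
print(round(end_to_end, 2))   # 4.75: decode work dilutes the gain
```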

Watch Out

Prefix caching is not the same as prompt caching at the API level

API-level prompt caching (offered by some providers) may use prefix caching internally, but it can also involve caching at higher levels (e.g., caching the full response for identical prompts). Prefix caching specifically refers to reusing KV cache entries at the attention layer, which works even when the suffixes differ.

Exercises

ExerciseCore

Problem

A serving system handles 100 requests per second. Each request has a 1500-token system prompt and a 200-token user message. Assuming all requests share the same system prompt, compute the prefill token savings per second with prefix caching versus without.

ExerciseAdvanced

Problem

Explain why prefix caching requires exact token-level matching and cannot work with semantically similar but lexically different prefixes. What would be needed to cache "soft" prefix matches?

References

Canonical:

  • Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," arXiv:2312.07104 (2023)

Current:

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP (2023)
  • vLLM documentation on automatic prefix caching

Next Topics

  • Speculative decoding: another inference optimization that reduces latency by drafting tokens with a smaller model

Last reviewed: April 2026

Prerequisites

Foundations this topic depends on.