LLM Construction
Sparse Attention and Long Context
Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.
Why This Matters
Standard self-attention computes all pairwise interactions between tokens, costing O(n^2) in time and memory. For a context of 1 million tokens, this means on the order of 10^12 score computations per attention layer. This is not feasible.
Long-context capability is a central bottleneck for LLMs. Longer context enables processing entire codebases, full books, and long conversations. But "just make the context longer" hits three walls: quadratic compute cost, memory for storing the KV cache, and degraded retrieval quality as context grows.
The Quadratic Bottleneck
Standard Attention Complexity
For a sequence of n tokens with head dimension d, the attention computation requires:
- Time: O(n^2 d) for the QK^T matrix multiply and the attention-weighted sum
- Memory: O(n^2) to store the attention matrix (or O(n) with FlashAttention recomputation, but time remains O(n^2 d))
FlashAttention reduces the memory bottleneck from O(n^2) to O(n) by tiling and recomputation, but the O(n^2 d) time complexity remains. To break the quadratic barrier, you must change which attention scores are computed.
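A back-of-envelope sketch makes the bottleneck concrete (the function name and constants here are illustrative, not from any library):

```python
# Rough cost of full self-attention for one head, assuming O(n^2 d) time
# and an O(n^2) materialized score matrix (i.e., without FlashAttention).

def attention_cost(n: int, d: int, bytes_per_float: int = 2):
    """Return (flops, score_matrix_bytes) for one attention head."""
    # QK^T: n^2 dot products of length d (2 flops per multiply-add),
    # plus the attention-weighted sum over V: another 2 * n^2 * d flops.
    flops = 4 * n * n * d
    score_bytes = n * n * bytes_per_float  # fp16 score matrix
    return flops, score_bytes

flops, mem = attention_cost(n=1_000_000, d=128)
print(f"{flops:.2e} FLOPs, {mem / 2**40:.1f} TiB for the score matrix")
```

At n = 1M the score matrix alone is terabytes per head per layer, which is why FlashAttention's O(n) memory matters even though the time stays quadratic.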
Sparse Attention Patterns
Sparse Attention Reduces Complexity
Statement
If each token attends to at most k other tokens (where k << n), the attention computation requires O(nkd) time and O(nk) memory for the sparse attention matrix. When k = O(sqrt(n)) or k = O(log n), this is subquadratic in n.
Intuition
Instead of computing all n^2 attention scores, compute only nk of them. The question is which nk to choose. Different methods answer this differently, and the quality of the answer determines whether the approximation is good enough.
Proof Sketch
Direct counting. Each of the n tokens computes k attention scores (each costing O(d) for the dot product), applies softmax over k values, and computes a weighted sum of k value vectors (each of dimension d). Total: O(nkd).
Why It Matters
This is the foundation of all efficient attention methods. The specific value of k and the choice of which k tokens each position attends to define the design space.
Failure Mode
If the sparsity pattern misses tokens that carry critical information, the model performs worse than full attention regardless of computational savings. Sparse attention trades off expressiveness for efficiency. The right tradeoff depends on the task.
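A reference implementation shows the common skeleton all the patterns below share: full attention restricted by a boolean mask. This is an illustrative sketch; real kernels never materialize the full n x n mask, which is exactly where the savings come from.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """mask[i, j] = True where token i is allowed to attend to token j."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # drop disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# k = 3 allowed keys per query -> O(nk) useful scores instead of O(n^2)
n, d, k = 8, 4, 3
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    mask[i, max(0, i - k + 1):i + 1] = True       # causal local pattern
out = sparse_attention(Q, K, V, mask)
assert out.shape == (n, d)
```

Each method in this section is just a different recipe for constructing `mask` (or, in Reformer's case, for computing it from the data).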
Local Window Attention (Longformer)
Each token attends to a fixed window of surrounding tokens: token i attends to tokens i-w through i+w. Complexity: O(nwd), linear in n for fixed w.
Longformer (Beltagy et al., 2020) adds a small number of "global" tokens that attend to all positions and are attended to by all positions. This combines local context (from windows) with global context (from designated tokens like [CLS]).
Limitation: information must propagate through O(n/w) hops to travel across the full sequence. For a 100k-token sequence with w = 512, this requires about 200 layers of propagation.
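The window pattern and its hop-count limitation can be sketched in a few lines (window and sequence sizes here are illustrative):

```python
import numpy as np

def local_window_mask(n: int, w: int) -> np.ndarray:
    """Band mask: token i may attend to tokens in [i - w, i + w]."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = local_window_mask(n=16, w=2)
assert mask[5, 3] and mask[5, 7] and not mask[5, 8]

# Information travels at most w positions per layer, so crossing the
# whole sequence needs roughly n / w layers:
n, w = 100_000, 512
print("layers needed:", -(-n // w))  # ceil division: 196 layers
```

This is why Longformer adds global tokens: a single global token gives every pair of positions a two-hop path, regardless of distance.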
Strided Attention (Sparse Transformer)
Child et al. (2019) use two interleaved patterns: a local window pattern and a strided pattern where token i attends to tokens i, i-s, i-2s, ... for stride s ~ sqrt(n). Each pattern has O(sqrt(n)) connections per token, giving O(n^1.5) total complexity.
The strided pattern enables direct long-range connections without multi-hop propagation.
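A minimal sketch of the two interleaved patterns (sizes illustrative; the real kernels use block-sparse layouts rather than dense masks):

```python
import numpy as np

def strided_mask(n: int, s: int) -> np.ndarray:
    """Causal local window of width s, plus attention to i, i-s, i-2s, ..."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    local = (diff >= 0) & (diff < s)         # recent s tokens
    strided = (diff >= 0) & (diff % s == 0)  # every s-th earlier token
    return local | strided

n = 64
s = 8  # ~ sqrt(64), so each token has O(sqrt(n)) connections
mask = strided_mask(n, s)
print("max connections per token:", mask.sum(axis=1).max())
```

With s = sqrt(n), each token touches about 2*sqrt(n) positions, and any two positions are connected in at most two hops: one stride jump plus one local step.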
Hash-Based Attention (Reformer)
LSH Attention Approximation
Statement
Reformer (Kitaev et al., 2020) uses locality-sensitive hashing (LSH) to assign queries and keys to buckets. Tokens only attend within their bucket. If the hash function satisfies Pr[h(q) = h(k)] = f(sim(q, k)) where f is monotonically increasing, then the expected attention pattern concentrates on the highest-scoring key-query pairs. With buckets of average size b, the complexity is O(nbd).
Intuition
Similar queries and keys are likely to land in the same hash bucket. Since attention weights are dominated by the largest dot products (after softmax), ignoring low-similarity pairs has minimal effect on the output.
Proof Sketch
By the LSH property, query-key pairs with high cosine similarity collide with high probability. Softmax concentrates mass on large dot products exponentially. The attention output is therefore approximated well by only summing over tokens in the same bucket, up to an error that decreases with the number of hash rounds.
Why It Matters
LSH attention is O(n log n) when bucket size is O(log n). It adapts the sparsity pattern to the data rather than using a fixed pattern.
Failure Mode
Hash collisions are stochastic: important keys may land in different buckets from their queries. Multiple hash rounds reduce this risk but increase cost. In practice, Reformer has been largely superseded by FlashAttention and other approaches that keep full attention but optimize memory and compute constants.
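The bucketing step can be sketched with angular LSH (random hyperplanes), which is the family Reformer uses; the function below is an illustrative toy, not Reformer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(x: np.ndarray, n_planes: int = 4) -> np.ndarray:
    """Assign each row of x to one of 2**n_planes buckets."""
    planes = rng.normal(size=(x.shape[-1], n_planes))
    bits = (x @ planes) > 0                 # which side of each hyperplane
    return bits.astype(int) @ (1 << np.arange(n_planes))  # pack bits to id

# Nearby vectors usually collide; distant ones usually do not.
q = np.array([[1.0, 0.0], [0.99, 0.05], [-1.0, 0.0]])
print(lsh_buckets(q))
```

Sorting tokens by bucket id and attending within fixed-size chunks of the sorted order is what turns this into an O(n log n) attention scheme; the stochastic misses described above are why Reformer runs several independent hash rounds.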
Ring Attention: Distributing Long Sequences
Ring attention (Liu et al., 2023) distributes a long sequence across p devices. Each device holds n/p tokens. The devices are arranged in a ring. Each device computes attention for its local tokens against one block of KV pairs, then passes the KV block to the next device. After p rounds, each device has attended to all n tokens.
Total memory per device: O((n/p)^2) for the attention matrix (or O(n/p) with FlashAttention). This scales context length linearly with the number of devices.
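A single-process simulation of the ring (a sketch, with the devices as loop iterations) shows why the result is exact: partial results from each KV block are merged with the same online-softmax update FlashAttention uses per tile.

```python
import numpy as np

def ring_attention(Q, K, V, p):
    """Simulate p ring-attention devices; returns exact full attention."""
    n, d = Q.shape
    blk = n // p
    Kb = K.reshape(p, blk, d)
    Vb = V.reshape(p, blk, d)
    out = np.zeros_like(Q)
    for dev in range(p):                    # each device owns one Q block
        q = Q[dev * blk:(dev + 1) * blk]
        m = np.full(blk, -np.inf)           # running row max
        l = np.zeros(blk)                   # running softmax denominator
        o = np.zeros((blk, d))              # running weighted sum
        for r in range(p):                  # KV blocks arrive around the ring
            s = q @ Kb[r].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)       # rescale old partial results
            w = np.exp(s - m_new[:, None])
            l = l * scale + w.sum(axis=1)
            o = o * scale[:, None] + w @ Vb[r]
            m = m_new
        out[dev * blk:(dev + 1) * blk] = o / l[:, None]
    return out
```

In the real algorithm the inner loop overlaps the KV send/receive with compute, so communication hides behind the attention math.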
Streaming with Attention Sinks (StreamingLLM)
Xiao et al. (2023) found empirically that pretrained LLMs allocate disproportionate attention to the first few tokens regardless of their content, naming this pattern "attention sinks." Why this happens mechanistically is still debated; one common hypothesis is that softmax forces attention mass to land somewhere even when no later token is informative, and the first tokens become the default sink. StreamingLLM keeps a fixed-size window of recent tokens plus the first few tokens (the sinks), enabling arbitrarily long inference streams with fixed memory, at the cost of losing access to middle-of-context information.
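The cache policy itself is tiny; a sketch (function name and defaults are illustrative) makes the fixed-memory property and the mid-context loss explicit:

```python
def keep_indices(seq_len: int, n_sinks: int = 4, window: int = 1024):
    """StreamingLLM-style KV retention: first n_sinks tokens + recent window."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))         # everything still fits
    sinks = list(range(n_sinks))            # attention-sink tokens
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent                   # middle tokens are evicted

print(len(keep_indices(1_000_000)))  # cache size is fixed at 1028
```

Everything between the sinks and the recent window is gone for good: any question about that span must be answered from what the model compressed into its hidden states, not from the cache.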
Why Extending Context is Hard
Three problems compound:
1. Quadratic cost. Even with FlashAttention, wall-clock time grows quadratically with context length. Doubling context quadruples attention compute.
2. Lost-in-the-middle. Liu et al. (2023) showed empirically that LLMs retrieve information well from the beginning and end of a long context but poorly from the middle. The phenomenon is robust across models, but the mechanism is still open. Candidate explanations include positional encoding extrapolation, training-data position bias, and attention-sink interactions; the paper itself documents the effect without committing to a single cause.
3. Distributional shift. Models trained on 4k context and extended to 128k via positional interpolation may have degraded quality at the extended lengths. The model has never seen token interactions at distance 100k during training.
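Positional interpolation, mentioned in point 3, can be sketched directly: scale positions down before computing rotary (RoPE) angles so the extended model never sees an angle larger than those seen in training. The function and constants below are an illustrative sketch, not any specific model's code.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    """RoPE rotation angles: one angle per (position, frequency) pair."""
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    return np.outer(positions, inv_freq)    # shape (len(positions), d/2)

L_train, L_target = 4096, 131072
pos = np.arange(L_target)
interp = rope_angles(pos * (L_train / L_target), d=128)
# Scaled positions stay below L_train, so no angle exceeds the training range
# (the largest inverse frequency is 1):
assert interp.max() < L_train
```

The catch is exactly the distributional-shift problem above: positions are now packed 32x more densely than in training, so neighboring tokens are distinguished by much smaller angle differences than the model was trained on, which is why interpolated models are usually fine-tuned at the new length.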
Common Confusions
FlashAttention does not reduce attention complexity
FlashAttention is an exact algorithm that reduces memory from O(n^2) to O(n) through tiling and recomputation. It does not change the O(n^2 d) time complexity. It is an IO optimization, not an algorithmic one. Sparse attention methods change the algorithm itself.
Longer context does not mean better retrieval
A model with 1M token context can hold a full book, but it may not reliably answer questions about a specific paragraph in the middle. Context length is a capacity bound, not a guarantee of utilization. Retrieval quality within context is a separate problem.
Canonical Examples
Longformer vs full attention on document classification
For a 4096-token document with window size w = 256 and 8 global tokens, Longformer computes about 1.1M attention scores. Full attention computes 4096^2 ≈ 16.8M attention scores. Longformer uses about 6.5% of the compute while matching full attention accuracy on document classification tasks (Beltagy et al., 2020).
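The arithmetic can be checked directly (assuming a 256-token window and 8 global tokens, consistent with the ~6.5% figure):

```python
# Longformer compute ratio vs full attention on a 4096-token document.
n, w, g = 4096, 256, 8
local = n * w              # each token scores ~w local neighbors
global_ = 2 * g * n        # global tokens attend to all and are attended by all
sparse_scores = local + global_
full_scores = n * n
print(f"{sparse_scores / full_scores:.1%} of full attention's scores")
```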
Summary
- Standard attention costs O(n^2 d) time; this is the primary bottleneck for long context
- Sparse patterns (local, strided, hash-based) reduce this to O(n^1.5) or O(n log n)
- Ring attention distributes sequences across devices for linear memory scaling
- Attention sinks enable streaming inference at the cost of mid-context access
- Longer context does not guarantee better retrieval due to the lost-in-the-middle effect
- Most production systems combine efficient attention with retrieval (RAG) rather than relying on context length alone
Exercises
Problem
A model uses local window attention with w = 512. How many attention layers are needed for information at token 0 to influence the output at token 8192?
Problem
You have 8 devices with ring attention and want to process a 1M-token sequence. Each device has 80GB of memory. The KV cache stores 2 floats (K and V) per token per head in fp16 (2 bytes each). The model has 32 layers and 32 heads with head dimension 128. Does the KV cache fit on each device?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (2017), Section 3 (baseline complexity)
- Child et al., "Generating Long Sequences with Sparse Transformers" (2019)
Current:
- Beltagy et al., "Longformer: The Long-Document Transformer" (2020)
- Kitaev et al., "Reformer: The Efficient Transformer" (2020)
- Liu et al., "Ring Attention with Blockwise Transformers" (2023)
- Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (2023)
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023)
Next Topics
- Context engineering: system-level strategies for managing what goes into the context window
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Flash Attention (Layer 5)