
LLM Construction

Sparse Attention and Long Context

Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.


Why This Matters

Standard self-attention computes all pairwise interactions between n tokens, costing O(n^2) in time and memory. For a context of 1 million tokens, this means 10^12 operations per attention layer. This is not feasible.

Long-context capability is a central bottleneck for LLMs. Longer context enables processing entire codebases, full books, and long conversations. But "just make the context longer" hits three walls: quadratic compute cost, memory for storing the KV cache, and degraded retrieval quality as context grows.

The Quadratic Bottleneck

Definition

Standard Attention Complexity

For a sequence of n tokens with head dimension d, the attention computation softmax(QK^T / √d)V requires:

  • Time: O(n^2 d) for the QK^T matrix multiply and the attention-weighted sum
  • Memory: O(n^2) to store the attention matrix (or O(n) with FlashAttention recomputation, but time remains O(n^2 d))

FlashAttention reduces the memory bottleneck from O(n^2) to O(n) by tiling and recomputation, but the O(n^2) time complexity remains. To break the quadratic barrier, you must change which attention scores are computed.
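As a concrete sketch (single head, no masking, illustrative sizes), the quadratic object is the (n, n) score matrix that standard attention materializes:

```python
# Naive full attention in NumPy. The (n, n) score matrix makes both the
# O(n^2 d) time and the O(n^2) memory of the standard formulation explicit.
import numpy as np

def full_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): the quadratic object
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # O(n^2 d) again for the weighted sum

rng = np.random.default_rng(0)
n, d = 256, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(Q, K, V)
assert out.shape == (n, d)
```

Doubling n here quadruples the size of `scores`, which is exactly the wall the rest of this section is about.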

Sparse Attention Patterns

Proposition

Sparse Attention Reduces Complexity

Statement

If each token attends to at most k other tokens (where k ≪ n), the attention computation requires O(nkd) time and O(nk) memory for the sparse attention matrix. When k = O(√n) or k = O(log n), this is subquadratic in n.

Intuition

Instead of computing all n^2 attention scores, compute only nk of them. The question is which nk to choose. Different methods answer this differently, and the quality of the answer determines whether the approximation is good enough.

Proof Sketch

Direct counting. Each of n tokens computes k attention scores (each costing O(d) for the dot product), applies softmax over k values, and computes a weighted sum of k value vectors (each of dimension d). Total: O(nkd).

Why It Matters

This is the foundation of all efficient attention methods. The specific value of k and the choice of which tokens each position attends to define the design space.

Failure Mode

If the sparsity pattern misses tokens that carry critical information, the model performs worse than full attention regardless of computational savings. Sparse attention trades off expressiveness for efficiency. The right tradeoff depends on the task.
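The proposition can be made concrete with a generic gather-based sketch (hypothetical sizes; the index set `idx` is random here, which is exactly the "wrong pattern" the failure mode warns about):

```python
# Sparse attention sketch: each query attends only to k key positions.
# Computing nk scores instead of n^2 gives the O(nkd) bound from the
# proposition; the choice of `idx` is where the methods below differ.
import numpy as np

def sparse_attention(Q, K, V, idx):
    # idx[i] lists the k key positions that token i may attend to.
    n, d = Q.shape
    K_sel = K[idx]                                            # (n, k, d) gathered keys
    V_sel = V[idx]                                            # (n, k, d) gathered values
    scores = np.einsum('nd,nkd->nk', Q, K_sel) / np.sqrt(d)   # only nk scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('nk,nkd->nd', w, V_sel)

rng = np.random.default_rng(0)
n, d, k = 128, 32, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
idx = np.stack([rng.choice(n, size=k, replace=False) for _ in range(n)])
out = sparse_attention(Q, K, V, idx)
assert out.shape == (n, d)
```

Local, strided, and hash-based attention are all instances of this template with different rules for constructing `idx`.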

Local Window Attention (Longformer)

Each token attends to a fixed window of w surrounding tokens: token i attends to tokens {i - w/2, ..., i + w/2}. Complexity: O(nwd), linear in n for fixed w.

Longformer (Beltagy et al., 2020) adds a small number of "global" tokens that attend to all positions and are attended to by all positions. This combines local context (from windows) with global context (from designated tokens like [CLS]).

Limitation: information must propagate through O(n/w) hops to travel across the full sequence. With a symmetric window, each layer moves information at most w/2 positions, so a 100k-token sequence with w = 512 requires roughly 100000/256 ≈ 400 layers of propagation.

Strided Attention (Sparse Transformer)

Child et al. (2019) use two interleaved patterns: a local window pattern and a strided pattern where token i attends to tokens {i, i - s, i - 2s, ...} for stride s. With w, s ≈ √n, each pattern has O(√n) connections per token, giving O(n√n · d) total complexity.

The strided pattern enables direct long-range connections without multi-hop propagation.
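Both patterns can be written as boolean masks over the full score matrix (a sketch with illustrative sizes, not the paper's exact implementation), which also makes the connection count per token easy to check:

```python
# Sparse Transformer patterns as boolean masks. With w = s = sqrt(n),
# each pattern contributes O(sqrt(n)) keys per query, hence O(n sqrt(n) d)
# total work instead of O(n^2 d).
import numpy as np

def local_mask(n, w):
    i, j = np.indices((n, n))
    return np.abs(i - j) <= w // 2        # token i sees i - w/2 .. i + w/2

def strided_mask(n, s):
    i, j = np.indices((n, n))
    return (j <= i) & ((i - j) % s == 0)  # i, i - s, i - 2s, ...

n = 1024
w = s = int(np.sqrt(n))                   # 32
mask = local_mask(n, w) | strided_mask(n, s)
per_token = mask.sum(axis=1).mean()
print(per_token)  # a few dozen keys per query, far fewer than n = 1024
```

The strided entries (j = i - s, i - 2s, ...) are the direct long-range connections that local windows alone lack.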

Hash-Based Attention (Reformer)

Proposition

LSH Attention Approximation

Statement

Reformer (Kitaev et al., 2020) uses locality-sensitive hashing (LSH) to assign queries and keys to buckets. Tokens only attend within their bucket. If the hash function satisfies Pr[hash(q) = hash(k)] = f(sim(q, k)) where f is monotonically increasing, then the expected attention pattern concentrates on the highest-scoring key-query pairs. With b buckets of average size n/b, the complexity is O(n · (n/b) · d).

Intuition

Similar queries and keys are likely to land in the same hash bucket. Since attention weights are dominated by the largest dot products (after softmax), ignoring low-similarity pairs has minimal effect on the output.

Proof Sketch

By the LSH property, query-key pairs with high cosine similarity collide with high probability. Softmax concentrates mass on large dot products exponentially. The attention output is therefore approximated well by only summing over tokens in the same bucket, up to an error that decreases with the number of hash rounds.

Why It Matters

LSH attention is O(n log n) when bucket size is O(log n). It adapts the sparsity pattern to the data rather than using a fixed pattern.

Failure Mode

Hash collisions are stochastic: important keys may land in different buckets from their queries. Multiple hash rounds reduce this risk but increase cost. In practice, Reformer has been largely superseded by FlashAttention and other approaches that keep full attention but optimize memory and compute constants.
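A minimal one-round sketch of the bucketing idea, using random-hyperplane hashing and a shared query/key projection (as Reformer does); the sizes and bucket handling here are illustrative, not Reformer's exact setup:

```python
# LSH attention sketch: hash tokens with random hyperplanes, then attend
# only within each hash bucket. One hash round; real Reformer uses several
# rounds to reduce the missed-collision failure mode described above.
import numpy as np

def lsh_buckets(X, n_planes, rng):
    planes = rng.standard_normal((X.shape[1], n_planes))
    bits = (X @ planes) > 0                    # sign pattern per token
    return bits @ (1 << np.arange(n_planes))   # pack sign bits into bucket ids

rng = np.random.default_rng(0)
n, d = 512, 64
X = rng.standard_normal((n, d))                # shared query/key vectors
V = rng.standard_normal((n, d))
buckets = lsh_buckets(X, n_planes=4, rng=rng)  # up to 2^4 = 16 buckets

out = np.zeros((n, d))
for b in np.unique(buckets):
    members = np.where(buckets == b)[0]        # attend only inside the bucket
    S = X[members] @ X[members].T / np.sqrt(d)
    W = np.exp(S - S.max(-1, keepdims=True))
    W /= W.sum(-1, keepdims=True)
    out[members] = W @ V[members]
assert out.shape == (n, d)
```

With roughly balanced buckets of size n/b, each bucket's attention costs O((n/b)^2 d), and summing over b buckets recovers the O(n · (n/b) · d) bound.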

Ring Attention: Distributing Long Sequences

Ring attention (Liu et al., 2023) distributes a long sequence across P devices. Each device holds n/P tokens. The devices are arranged in a ring. Each device computes attention for its local tokens against one block of KV pairs, then passes the KV block to the next device. After P rounds, each device has attended to all tokens.

Total memory per device: O(n^2 / P^2) for the attention matrix (or O(n/P) with FlashAttention). This scales context length linearly with the number of devices.

Streaming with Attention Sinks (StreamingLLM)

Xiao et al. (2023) found empirically that pretrained LLMs allocate disproportionate attention to the first few tokens regardless of their content, naming this pattern "attention sinks." Why this happens mechanistically is still debated; one common hypothesis is that softmax forces attention mass to land somewhere even when no later token is informative, and the first tokens become the default sink. StreamingLLM keeps a fixed-size window of recent tokens plus the first few tokens (the sinks), enabling arbitrarily long inference streams with fixed memory, at the cost of losing access to middle-of-context information.
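The eviction policy itself is tiny; a sketch (with made-up sink and window sizes) of which KV-cache positions survive as the stream grows:

```python
# StreamingLLM-style eviction sketch: keep the first `sinks` positions plus
# the most recent `window` positions; everything in between is dropped.
# Cache size is bounded regardless of how long the stream runs.
def keep_positions(t, sinks=4, window=8):
    """Positions retained in the KV cache after seeing tokens 0..t-1."""
    if t <= sinks + window:
        return list(range(t))            # nothing evicted yet
    return list(range(sinks)) + list(range(t - window, t))

print(keep_positions(6))    # [0, 1, 2, 3, 4, 5]
print(keep_positions(20))   # sinks [0..3] plus recent window [12..19]
```

The middle tokens are gone for good, which is exactly the "losing access to middle-of-context information" tradeoff.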

Why Extending Context is Hard

Three problems compound:

1. Quadratic cost. Even with FlashAttention, O(n^2) wall-clock time grows fast. Doubling context quadruples attention compute.

2. Lost-in-the-middle. Liu et al. (2023) showed empirically that LLMs retrieve information well from the beginning and end of a long context but poorly from the middle. The phenomenon is robust across models, but the mechanism is still open. Candidate explanations include positional encoding extrapolation, training-data position bias, and attention-sink interactions; the paper itself documents the effect without committing to a single cause.

3. Distributional shift. Models trained on 4k context and extended to 128k via positional interpolation may have degraded quality at the extended lengths. The model has never seen token interactions at distance 100k during training.
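A sketch of the linear position-interpolation idea referenced in point 3 (function name and sizes are illustrative): rather than feeding positions beyond the trained range, compress them back into it so no position exceeds what training saw.

```python
# Linear position interpolation sketch: scale positions for a long sequence
# down into [0, trained_len) instead of extrapolating past it. Fractional
# positions are interpolation between trained positions, which models handle
# better than extrapolation -- though quality at extreme lengths may still degrade.
import numpy as np

def interpolated_positions(n_tokens, trained_len):
    pos = np.arange(n_tokens, dtype=np.float64)
    if n_tokens <= trained_len:
        return pos
    return pos * (trained_len / n_tokens)   # uniform scale factor < 1

pos = interpolated_positions(n_tokens=131072, trained_len=4096)
assert pos.max() < 4096   # every position lies inside the trained range
```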

Common Confusions

Watch Out

FlashAttention does not reduce attention complexity

FlashAttention is an exact algorithm that reduces memory from O(n^2) to O(n) through tiling and recomputation. It does not change the O(n^2 d) time complexity. It is an IO optimization, not an algorithmic one. Sparse attention methods change the algorithm itself.

Watch Out

Longer context does not mean better retrieval

A model with 1M token context can hold a full book, but it may not reliably answer questions about a specific paragraph in the middle. Context length is a capacity bound, not a guarantee of utilization. Retrieval quality within context is a separate problem.

Canonical Examples

Example

Longformer vs full attention on document classification

For a 4096-token document with window size w = 256 and 8 global tokens, Longformer computes 4096 × 256 + 8 × 4096 × 2 ≈ 1.1M attention scores. Full attention computes 4096^2 ≈ 16.8M attention scores. Longformer uses about 6.5% of the compute while matching full attention accuracy on document classification tasks (Beltagy et al., 2020).
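The counting above is easy to reproduce (the factor of 2 is because each global token both attends to and is attended to by every position):

```python
# Reproducing the Longformer-vs-full score counts from the example.
n, w, g = 4096, 256, 8
longformer = n * w + 2 * g * n   # window scores + global scores (both directions)
full = n * n                     # every pair
print(longformer)                # 1,114,112 ≈ 1.1M
print(full)                      # 16,777,216 ≈ 16.8M
print(longformer / full)         # ≈ 0.066
```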

Summary

  • Standard attention costs O(n^2 d) time; this is the primary bottleneck for long context
  • Sparse patterns (local, strided, hash-based) reduce this to O(n√n) or O(n log n)
  • Ring attention distributes sequences across devices for linear memory scaling
  • Attention sinks enable streaming inference at the cost of mid-context access
  • Longer context does not guarantee better retrieval due to the lost-in-the-middle effect
  • Most production systems combine efficient attention with retrieval (RAG) rather than relying on context length alone

Exercises

ExerciseCore

Problem

A model uses local window attention with w = 512. How many attention layers are needed for information at token 0 to influence the output at token 8192?

ExerciseAdvanced

Problem

You have 8 devices with ring attention and want to process a 1M-token sequence. Each device has 80 GB of memory. The KV cache stores one K vector and one V vector per token, per head, per layer, in fp16 (2 bytes per element). The model has 32 layers and 32 heads with head dimension 128. Does the KV cache fit on each device?

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (2017), Section 3 (baseline complexity)
  • Child et al., "Generating Long Sequences with Sparse Transformers" (2019)

Current:

  • Beltagy et al., "Longformer: The Long-Document Transformer" (2020)
  • Kitaev et al., "Reformer: The Efficient Transformer" (2020)
  • Liu et al., "Ring Attention with Blockwise Transformers" (2023)
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (2023)
  • Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023)

Next Topics

Last reviewed: April 2026
