LLM Construction
Efficient Transformers Survey
Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.
Why This Matters
Standard self-attention computes all pairwise interactions for a sequence of length $n$, costing $O(n^2)$ time and memory. For long sequences (documents, genomes, high-resolution images), this quadratic cost is prohibitive.
Between 2019 and 2022, dozens of papers proposed sub-quadratic attention variants. Most are now rarely used. FlashAttention (2022) showed that the bottleneck was memory bandwidth, not FLOPs, and that exact attention with IO-aware tiling is faster in practice than most approximate methods up to sequences of length 8K-16K. Understanding why this happened teaches more about systems-level thinking than about any single algorithm.
The Quadratic Bottleneck
For queries $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{n \times d}$, and values $V \in \mathbb{R}^{n \times d}$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

The matrix $QK^\top$ has $n^2$ entries. Storing it costs $O(n^2)$ memory. For $n = 100{,}000$, the attention matrix alone requires 40 GB in float32. The question: can we avoid materializing this matrix?
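The memory arithmetic is easy to check directly; a minimal sketch:

```python
def attention_matrix_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to materialize one n x n attention matrix (float32 default)."""
    return n * n * bytes_per_element

# For n = 100,000 tokens in float32:
print(attention_matrix_bytes(100_000) / 1e9)  # 40.0 (GB)
```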
Linear Attention
Linear Attention via Kernel Decomposition
Statement
Replace the softmax kernel $\exp(q^\top k)$ with a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^r$ such that $\exp(q^\top k) \approx \phi(q)^\top \phi(k)$. Then:

$$\mathrm{Attn}(Q, K, V)_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$

The sums $S = \sum_j \phi(k_j)\, v_j^\top \in \mathbb{R}^{r \times d}$ and $z = \sum_j \phi(k_j) \in \mathbb{R}^r$ can be precomputed in $O(nrd)$ time. Each attention output costs $O(rd)$, giving total cost $O(nrd)$, which is linear in $n$ if $r$ is fixed.
Intuition
Instead of computing the full attention matrix and then multiplying by $V$, change the order of operations. First aggregate key-value pairs into a compact summary (the matrix $S = \sum_j \phi(k_j)\, v_j^\top$), then query this summary. This is the associativity trick: $(\phi(Q)\phi(K)^\top)\,V = \phi(Q)\,(\phi(K)^\top V)$ up to normalization.
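The reordering can be verified numerically. A minimal NumPy sketch, with normalization omitted and a feature map assumed to be already applied to the queries and keys:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 64, 16, 32

# Feature-mapped queries and keys (phi already applied), plus values.
phi_Q = rng.random((n, r))
phi_K = rng.random((n, r))
V = rng.random((n, d))

# Order 1: materialize the n x n similarity matrix, then multiply by V.
out_quadratic = (phi_Q @ phi_K.T) @ V          # O(n^2) intermediate

# Order 2: aggregate keys/values first into an r x d summary, then query it.
S = phi_K.T @ V                                # r x d summary, no n x n matrix
out_linear = phi_Q @ S

# Same result, different cost.
assert np.allclose(out_quadratic, out_linear)
```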
Proof Sketch
Write $\mathrm{Attn}_i = \frac{\sum_j \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_j \phi(q_i)^\top \phi(k_j)}$. Substitute and factor $\phi(q_i)^\top$ out of both sums.
Why It Matters
Linear attention reduces the complexity from $O(n^2 d)$ to $O(nrd)$. For $r \ll n$, this is a large speedup for very long sequences. The idea also enables recurrent computation: process tokens one by one, updating $S$ and $z$ incrementally, giving constant memory per step.
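The recurrent view can be sketched as a toy causal loop, assuming an elementwise-positive feature map has already been applied so the normalizer stays positive:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 32, 8, 16
phi_q = rng.random((n, r))   # feature-mapped queries
phi_k = rng.random((n, r))   # feature-mapped keys
v = rng.random((n, d))

# Causal linear attention as an RNN: update S and z one token at a time.
S = np.zeros((r, d))         # running sum of phi(k_j) v_j^T
z = np.zeros(r)              # running sum of phi(k_j)
outputs = []
for i in range(n):
    S += np.outer(phi_k[i], v[i])
    z += phi_k[i]
    outputs.append(phi_q[i] @ S / (phi_q[i] @ z))  # normalized output, token i
outputs = np.stack(outputs)

# Check the last token against the explicit causal sum over j <= i.
i = n - 1
sims = phi_q[i] @ phi_k[: i + 1].T
expected = (sims @ v[: i + 1]) / sims.sum()
assert np.allclose(outputs[i], expected)
```

The state $(S, z)$ has fixed size $r \times d + r$ regardless of how many tokens have been processed, which is the constant-memory-per-step property mentioned above.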
Failure Mode
The approximation quality depends on $\phi$. No known feature map perfectly approximates softmax attention. The approximation error degrades attention's ability to do sharp, sparse lookups (e.g., copying a specific token). In practice, linear attention models underperform standard attention on tasks requiring precise retrieval from context.
Sparse Attention Methods
Rather than approximating the softmax, sparse methods compute exact attention on a subset of the $n^2$ pairs.
Longformer (Beltagy et al., 2020): Each token attends to a fixed-size local window (e.g., 512 tokens around it) plus a small set of global tokens. Cost is $O(nw)$ where $w$ is the window size. Good for document-level tasks where local context dominates.
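A sketch of this sparsity pattern as a boolean mask (the window size and global-token choice below are illustrative, not Longformer's defaults):

```python
import numpy as np

def longformer_mask(n: int, window: int, global_tokens: list[int]) -> np.ndarray:
    """Boolean mask: True where attention is allowed (local band + global)."""
    idx = np.arange(n)
    # Local band: token i attends to j with |i - j| <= window // 2.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    # Global tokens attend everywhere and are attended to by everyone.
    mask[global_tokens, :] = True
    mask[:, global_tokens] = True
    return mask

m = longformer_mask(n=8, window=2, global_tokens=[0])
# The number of allowed pairs grows as O(n * window), not O(n^2).
print(int(m.sum()), "of", m.size, "pairs")
```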
BigBird (Zaheer et al., 2020): Combines local window attention, global tokens, and random attention edges. The authors proved that this sparse pattern is a universal approximator of sequence-to-sequence functions and is Turing complete, matching the theoretical expressiveness of full attention.
Block-sparse attention: Partition tokens into blocks and attend within blocks. Efficient on GPUs because block operations map well to hardware. This is the foundation of FlashAttention's tiling strategy.
Low-Rank Approximations
Linformer (Wang et al., 2020): Project keys and values to a lower dimension $k \ll n$ using learned projection matrices $E, F \in \mathbb{R}^{k \times n}$:

$$\mathrm{Attention}(Q, EK, FV) = \mathrm{softmax}\!\left(\frac{Q(EK)^\top}{\sqrt{d}}\right)FV$$

Cost is $O(nk)$ instead of $O(n^2)$. The claim: the attention matrix is approximately low-rank for most practical inputs, so projecting to rank $k$ loses little information.
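A minimal sketch of the projected attention, using random matrices in place of the learned projections $E$ and $F$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 256, 32, 64   # sequence length, head dim, projected length (k << n)

Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)   # learned in the real model
F = rng.standard_normal((k, n)) / np.sqrt(n)   # learned in the real model

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Attention over projected keys/values: the score matrix is n x k, not n x n.
scores = Q @ (E @ K).T / np.sqrt(d)   # shape (n, k)
out = softmax(scores) @ (F @ V)       # shape (n, d)
```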
Performer (Choromanski et al., 2021): Uses random orthogonal features to approximate the softmax kernel:

$$\exp(q^\top k) \approx \phi(q)^\top \phi(k)$$

where $\phi(x) = \frac{e^{-\|x\|^2/2}}{\sqrt{m}}\left[e^{\omega_1^\top x}, \ldots, e^{\omega_m^\top x}\right]^\top$ with random $\omega_i \sim \mathcal{N}(0, I_d)$.
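The feature map can be checked empirically. A sketch with a large $m$ so the estimate concentrates (the seed and input scales below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 100_000          # head dim, number of random features

def phi(x, W):
    """Positive random features: E[phi(q)^T phi(k)] = exp(q^T k)."""
    return np.exp(W @ x - x @ x / 2) / np.sqrt(W.shape[0])

W = rng.standard_normal((m, d))    # rows are omega_i ~ N(0, I_d)
q = 0.5 * rng.standard_normal(d)
k = 0.5 * rng.standard_normal(d)

estimate = phi(q, W) @ phi(k, W)
exact = np.exp(q @ k)
# With m = 100,000 features the Monte Carlo estimate should be close to
# exp(q^T k); with small m it is noisy, which is the practical limitation.
```

Note this sketch uses i.i.d. Gaussian features; the actual Performer draws *orthogonal* random features, which reduces the variance further.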
Performer Approximation Quality
Statement
The Performer approximation is unbiased:

$$\mathbb{E}\left[\phi(q)^\top \phi(k)\right] = \exp(q^\top k)$$

The variance decreases as $O(1/m)$, so with $m$ random features, the relative approximation error is $O(1/\sqrt{m})$ with high probability.
Intuition
Random Fourier features (Rahimi and Recht, 2007) applied to the softmax kernel. More random features give a better approximation, but the required $m$ can be large for high accuracy, reducing the practical speedup.
Proof Sketch
The unbiasedness follows from the identity $\exp(q^\top k) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\!\left[e^{\omega^\top q - \|q\|^2/2}\, e^{\omega^\top k - \|k\|^2/2}\right]$. Concentration follows from standard random feature arguments.
Why It Matters
This gives a principled way to trade accuracy for speed. But the required number of features for high accuracy often makes the method no faster than exact attention for typical sequence lengths.
Failure Mode
For tasks requiring precise attention patterns (copying, exact retrieval), the approximation error causes significant quality degradation. The number of features needed for acceptable accuracy grows with the "peakiness" of the attention distribution.
Why FlashAttention Won
FlashAttention computes exact softmax attention but tiles the computation to minimize HBM (high bandwidth memory) reads and writes. The key insight: the attention bottleneck for sequences up to 16K is not FLOPs but memory IO. The attention matrix does not need to be materialized; it can be computed and consumed block by block.
FlashAttention achieves 2-4x wall-clock speedup over standard attention for typical LLM training sequence lengths (2K-8K), while computing the exact same output. This made most approximate attention methods unnecessary for their target use case (moderate-length sequences on modern GPUs).
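The tiling idea can be illustrated with the online-softmax recurrence in NumPy. This is a sketch of the math only: the real kernel keeps each block in GPU SRAM and fuses the backward pass, but the block-by-block accumulation is the same.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    """Exact attention without the n x n matrix: process K/V in blocks,
    maintaining a running row max m, normalizer l, and output accumulator."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)          # running row max
    l = np.zeros(n)                  # running softmax denominator
    acc = np.zeros((n, d))           # running unnormalized output
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale        # only an n x block tile
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)          # rescale previous statistics
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=1)
        acc = acc * correction[:, None] + P @ V[j:j + block]
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The outputs match to floating-point tolerance: the tiled version is exact, not approximate, which is the whole point.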
Sub-quadratic methods still matter when $n$ is truly enormous (100K+ tokens), where the FLOP cost itself dominates. But for most current LLM training and inference, FlashAttention (and its successors) suffice.
Common Confusions
Sub-quadratic FLOPs does not mean faster wall-clock time
A method with $O(n)$ FLOPs can be slower than an $O(n^2)$ method if it has poor memory access patterns, cannot use tensor cores, or requires custom kernels. On modern GPUs, memory bandwidth is often the bottleneck, not arithmetic. This is why FlashAttention (which uses more FLOPs due to recomputation) is faster than standard attention (which materializes the full attention matrix).
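A back-of-envelope roofline estimate makes this concrete. The hardware numbers below are illustrative assumptions (roughly A100-class at fp16), not measurements:

```python
# Roofline sketch for standard attention that round-trips the n x n score
# matrix through HBM. All hardware numbers are illustrative assumptions.
n, d = 4096, 64
flops = 4 * n * n * d                     # QK^T and PV matmuls, ~2*n^2*d each
# Materializing scores costs ~4 passes over an n x n fp16 matrix in HBM:
# write S, read S for softmax, write P, read P for the PV matmul.
bytes_moved = 4 * (n * n * 2)
intensity = flops / bytes_moved           # FLOPs per byte of HBM traffic: d/2

bandwidth = 1.5e12                        # HBM bytes/sec (assumed)
throughput = 312e12                       # fp16 tensor-core FLOP/sec (assumed)
machine_balance = throughput / bandwidth  # FLOPs/byte the hardware can sustain

# Memory-bound when arithmetic intensity < machine balance: 32 << 208 here,
# so the kernel is limited by HBM traffic -- the cost FlashAttention removes.
print(intensity, machine_balance)         # 32.0 208.0
```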
Linear attention is not a drop-in replacement
Linear attention changes the model's expressive power, not just its cost. The softmax attention matrix can be arbitrarily sparse and peaked. Linear attention with a fixed feature map cannot represent sharp attention patterns. Models trained with linear attention often require architectural changes (gating, decay factors) to match softmax attention quality.
Key Takeaways
- Standard attention costs $O(n^2)$ time and memory
- Linear attention uses the kernel trick to achieve $O(n)$ cost but sacrifices sharp attention patterns
- Sparse methods (Longformer, BigBird) compute exact attention on a structured subset of pairs
- Low-rank methods (Linformer, Performer) approximate the attention matrix
- FlashAttention computes exact attention with IO-aware tiling, making approximate methods unnecessary for $n \lesssim 16\text{K}$
- For truly long sequences ($n \gtrsim 100\text{K}$), sub-quadratic methods and state-space models remain relevant
Exercises
Problem
In linear attention with feature map $\phi: \mathbb{R}^d \to \mathbb{R}^r$, what are the shapes of the precomputed matrices $S = \sum_j \phi(k_j)\, v_j^\top$ and $z = \sum_j \phi(k_j)$? What is the total memory cost of storing $S$ and $z$ compared to storing the full attention matrix?
Problem
Explain why FlashAttention uses more FLOPs than standard attention (due to recomputation in the backward pass) yet achieves faster wall-clock time. Quantify the tradeoff: if HBM bandwidth is $B$ bytes/sec and compute throughput is $C$ FLOP/sec, under what condition on $n$ and $d$ is standard attention memory-bound?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
- Katharopoulos et al., "Transformers are RNNs" (ICML 2020), linear attention
Current:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (NeurIPS 2022)
- Tay et al., "Efficient Transformers: A Survey" (ACM Computing Surveys 2022)
- Dao, "FlashAttention-2" (ICLR 2024)
Next Topics
- Mamba and state-space models: a recurrent architecture that avoids attention entirely for long sequences
- KV cache: memory optimization for autoregressive inference
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Variants and Efficiency (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Flash Attention (Layer 5)