
LLM Construction

Efficient Transformers Survey

Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.


Why This Matters

Standard self-attention computes all $n^2$ pairwise interactions for a sequence of length $n$, costing $O(n^2 d)$ time and $O(n^2)$ memory. For long sequences (documents, genomes, high-resolution images), this quadratic cost is prohibitive.

Between 2019 and 2022, dozens of papers proposed sub-quadratic attention variants. Most are now rarely used. FlashAttention (2022) showed that the bottleneck was memory bandwidth, not FLOPs, and that exact attention with IO-aware tiling is faster in practice than most approximate methods up to sequences of length 8K-16K. Understanding why this happened teaches more about systems-level thinking than about any single algorithm.

The Quadratic Bottleneck

For queries $Q$, keys $K$, and values $V$, each in $\mathbb{R}^{n \times d}$:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

The matrix $QK^\top \in \mathbb{R}^{n \times n}$ has $n^2$ entries. Storing it costs $O(n^2)$ memory. For $n = 100{,}000$ and $d = 128$, the attention matrix alone requires 40 GB in float32. The question: can we avoid materializing this matrix?
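To make the $O(n^2)$ object concrete, here is a minimal NumPy sketch of the standard baseline that explicitly materializes the $(n, n)$ score matrix (the function name and sizes are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention, materializing the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): the O(n^2) object
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)                                  # (512, 64)
# Score-matrix memory in float32: ~1 MB at n=512, but 40 GB at n=100,000.
print(n * n * 4 / 1e6, "MB")
```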

Linear Attention

Proposition

Linear Attention via Kernel Decomposition

Statement

Replace the softmax kernel $\kappa(q, k) = \exp(q^\top k / \sqrt{d})$ with a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^m$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$. Then:

$$\text{Attention}(Q, K, V)_i = \frac{\phi(Q_i)^\top \sum_j \phi(K_j) V_j^\top}{\phi(Q_i)^\top \sum_j \phi(K_j)}$$

The sums $S = \sum_j \phi(K_j) V_j^\top \in \mathbb{R}^{m \times d}$ and $z = \sum_j \phi(K_j) \in \mathbb{R}^m$ can be precomputed in $O(nmd)$ time. Each attention output then costs $O(md)$, giving total cost $O(nmd)$, which is linear in $n$ if $m$ is fixed.
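A NumPy sketch of the non-causal case, using the $\mathrm{elu}(x) + 1$ feature map from Katharopoulos et al. (2020) so that $m = d$; the function names here are our own:

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = ELU(x) + 1: elementwise positive features (Katharopoulos et al., 2020)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, phi=elu_feature_map):
    """Non-causal linear attention in O(n*m*d), never forming the n x n matrix."""
    phiQ, phiK = phi(Q), phi(K)                 # (n, m); here m = d
    S = phiK.T @ V                              # (m, d) summary: sum_j phi(K_j) V_j^T
    z = phiK.sum(axis=0)                        # (m,) normalizer: sum_j phi(K_j)
    return (phiQ @ S) / (phiQ @ z)[:, None]     # (n, d)
```

Because the features are positive, the implied attention weights are positive and normalized, so each output row is still a convex combination of value rows.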

Intuition

Instead of computing the full $n \times n$ attention matrix and then multiplying by $V$, change the order of operations. First aggregate key-value pairs into a compact summary (the matrix $S$), then query this summary. This is the associativity trick: $(QK^\top)V = Q(K^\top V)$, up to normalization.

Proof Sketch

Write $\text{Attn}_i = \sum_j \kappa(Q_i, K_j) V_j / \sum_j \kappa(Q_i, K_j)$. Substitute $\kappa(Q_i, K_j) = \phi(Q_i)^\top \phi(K_j)$ and factor $\phi(Q_i)^\top$ out of both sums.

Why It Matters

Linear attention reduces the complexity from $O(n^2 d)$ to $O(nmd)$. For $m \ll n$, this is a large speedup on very long sequences. The idea also enables recurrent computation: process tokens one by one, updating $S$ and $z$ incrementally, giving constant memory per step.
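The recurrent view can be sketched as follows (a causal variant, assuming the same positive feature map as above; only $S$ and $z$ persist between steps):

```python
import numpy as np

def elu1(x):
    """phi(x) = ELU(x) + 1: a simple positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, phi=elu1):
    """Process tokens left to right, keeping only the (m, d) summary S and
    the (m,) normalizer z -- constant memory per step, like an RNN state."""
    m, d = phi(K[0]).shape[0], V.shape[1]
    S = np.zeros((m, d))
    z = np.zeros(m)
    outs = []
    for q, k, v in zip(Q, K, V):
        pk = phi(k)
        S += np.outer(pk, v)          # accumulate sum_{j<=i} phi(K_j) V_j^T
        z += pk                       # accumulate sum_{j<=i} phi(K_j)
        pq = phi(q)
        outs.append((pq @ S) / (pq @ z))
    return np.stack(outs)
```

This is the sense in which "Transformers are RNNs": the summary $(S, z)$ plays the role of a recurrent state.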

Failure Mode

The approximation quality depends on $\phi$. No known feature map perfectly approximates softmax attention. The approximation error degrades attention's ability to do sharp, sparse lookups (e.g., copying a specific token). In practice, linear attention models underperform standard attention on tasks requiring precise retrieval from context.

Sparse Attention Methods

Rather than approximating the softmax, sparse methods compute exact attention on a subset of the $n^2$ pairs.

Longformer (Beltagy et al., 2020): Each token attends to a fixed-size local window (e.g., 512 tokens around it) plus a small set of global tokens. Cost is $O(nw)$, where $w$ is the window size. Good for document-level tasks where local context dominates.
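The attention pattern can be sketched as a boolean mask (a toy illustration of the idea, not Longformer's actual implementation, which uses custom banded kernels rather than a dense mask):

```python
import numpy as np

def local_window_mask(n, w, global_idx=()):
    """Longformer-style pattern: True where attention is allowed.
    Each token sees a band of width w plus designated global tokens."""
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= w // 2   # local sliding window
    for g in global_idx:
        mask[g, :] = True   # global token attends everywhere
        mask[:, g] = True   # and everyone attends to it
    return mask

mask = local_window_mask(8, w=2, global_idx=(0,))
print(int(mask.sum()), "allowed pairs out of", 8 * 8)
```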

BigBird (Zaheer et al., 2020): Combines local window attention, global tokens, and random attention edges. Proved that this sparse pattern is a universal approximator of sequence-to-sequence functions and is Turing complete, matching the theoretical expressiveness of full attention.

Block-sparse attention: Partition tokens into blocks and attend within blocks. Efficient on GPUs because block operations map well to hardware. This is the foundation of FlashAttention's tiling strategy.

Low-Rank Approximations

Linformer (Wang et al., 2020): Project keys and values to a lower dimension $k \ll n$ using learned projection matrices $E, F \in \mathbb{R}^{k \times n}$:

$$\text{Attention}(Q, EK, FV) = \text{softmax}\!\left(\frac{Q(EK)^\top}{\sqrt{d}}\right) FV$$

Cost is $O(nkd)$ instead of $O(n^2 d)$. The claim: the attention matrix is approximately low-rank for most practical inputs, so projecting to rank $k$ loses little information.
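A shape-level sketch in NumPy; in the paper $E$ and $F$ are learned, so the random matrices here only illustrate the dimensions involved:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer: softmax attention against k projected key/value rows."""
    d = Q.shape[-1]
    Kp, Vp = E @ K, F @ V                         # (k, d) each
    scores = Q @ Kp.T / np.sqrt(d)                # (n, k), not (n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ Vp                                 # (n, d)

# E and F stand in for the learned projections; random here for illustration.
n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)
print(linformer_attention(Q, K, V, E, F).shape)   # (1024, 64)
```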

Performer (Choromanski et al., 2021): Uses random orthogonal features to approximate the softmax kernel:

$$\exp(q^\top k / \sqrt{d}) \approx \phi(q)^\top \phi(k)$$

where $\phi(x) = \frac{1}{\sqrt{m}} \exp(-\|x\|^2/2) \left[\exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x)\right]$ with random $\omega_i \sim \mathcal{N}(0, I_d / \sqrt{d})$.
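A sketch of these positive random features in NumPy. For simplicity we draw $\omega_i \sim \mathcal{N}(0, I_d)$ and estimate $\exp(q^\top k)$ for unit-norm $q, k$ (which keeps the estimator's variance small); the $\sqrt{d}$ temperature can be absorbed by rescaling $q$ and $k$ by $d^{-1/4}$:

```python
import numpy as np

def performer_features(x, omega):
    """Positive random features: phi(x)_i = exp(omega_i . x - |x|^2/2) / sqrt(m).
    phi(q) . phi(k) is an unbiased Monte Carlo estimate of exp(q . k)."""
    m = omega.shape[0]
    return np.exp(x @ omega.T - 0.5 * np.sum(x * x, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 20000
omega = rng.standard_normal((m, d))               # omega_i ~ N(0, I_d)
q = rng.standard_normal(d); q /= np.linalg.norm(q)
k = rng.standard_normal(d); k /= np.linalg.norm(k)
exact = np.exp(q @ k)
approx = float(performer_features(q[None], omega) @ performer_features(k[None], omega).T)
print(exact, approx)   # approx converges to exact as m grows
```

Note that the variance grows with $\|q + k\|^2$, which is the quantitative version of the "peakiness" failure mode discussed below.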

Proposition

Performer Approximation Quality

Statement

The Performer approximation $\hat{\kappa}(q, k) = \phi(q)^\top \phi(k)$ satisfies:

$$\mathbb{E}[\hat{\kappa}(q, k)] = \exp(q^\top k / \sqrt{d})$$

The variance decreases as $O(1/m)$, so with $m = O(d \log d / \epsilon^2)$ random features, the relative approximation error is at most $\epsilon$ with high probability.

Intuition

Random Fourier features (Rahimi and Recht, 2007) applied to the softmax kernel. More random features give a better approximation, but the required $m$ can be large for high accuracy, reducing the practical speedup.

Proof Sketch

Unbiasedness follows from the identity $\exp(q^\top k) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I)}\left[\exp(\omega^\top q - \|q\|^2/2)\, \exp(\omega^\top k - \|k\|^2/2)\right]$. Concentration follows from standard random-feature arguments.

Why It Matters

This gives a principled way to trade accuracy for speed. But the required number of features for high accuracy often makes the method no faster than exact attention for typical sequence lengths.

Failure Mode

For tasks requiring precise attention patterns (copying, exact retrieval), the approximation error causes significant quality degradation. The number of features needed for acceptable accuracy grows with the "peakiness" of the attention distribution.

Why FlashAttention Won

FlashAttention computes exact softmax attention but tiles the computation to minimize HBM (high-bandwidth memory) reads and writes. The key insight: for sequences up to 16K, the attention bottleneck is not FLOPs but memory IO. The $n \times n$ attention matrix does not need to be materialized; it can be computed and consumed block by block.
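The algorithmic skeleton is the online softmax: visit one key/value tile at a time, keeping a running row-wise max and denominator and rescaling earlier partial sums. A NumPy sketch of this idea (real FlashAttention kernels also tile over queries and fuse everything in on-chip SRAM, which this toy version cannot show):

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact attention computed tile by tile with a running (online) softmax.
    The full n x n score matrix is never stored."""
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j+block], V[j:j+block]
        S = Q @ Kb.T / np.sqrt(d)                   # (n, block) tile only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)           # rescale old accumulators
        P = np.exp(S - new_max[:, None])
        out = out * scale[:, None] + P @ Vb
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]
```

The output is bitwise-close to the naive version: tiling changes the memory traffic, not the math.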

FlashAttention achieves 2-4x wall-clock speedup over standard attention for typical LLM training sequence lengths (2K-8K), while computing the exact same output. This made most approximate attention methods unnecessary for their target use case (moderate-length sequences on modern GPUs).

Sub-quadratic methods still matter when $n$ is truly enormous (100K+ tokens), where the $O(n^2)$ FLOP cost itself dominates. But for most current LLM training and inference, FlashAttention (and its successors) suffice.

Common Confusions

Watch Out

Sub-quadratic FLOPs does not mean faster wall-clock time

A method with $O(n \log n)$ FLOPs can be slower than one with $O(n^2)$ if it has poor memory access patterns, cannot use tensor cores, or requires custom kernels. On modern GPUs, memory bandwidth is often the bottleneck, not arithmetic. This is why FlashAttention (which uses more FLOPs due to recomputation) is faster than standard attention (which materializes the full attention matrix).
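A back-of-envelope arithmetic-intensity model makes this concrete. The model below counts only the score-matrix round trip through HBM and the two big matmuls, and the hardware numbers in the comment are illustrative, not a benchmark:

```python
def attention_arith_intensity(n, d, bytes_per_elem=2):
    """Rough FLOP-per-HBM-byte ratio for standard attention that round-trips
    the n x n score matrix through HBM (ignores Q/K/V and softmax traffic)."""
    flops = 4 * n * n * d                       # 2*n^2*d for QK^T, 2*n^2*d for PV
    hbm_bytes = 2 * n * n * bytes_per_elem      # write scores once, read once
    return flops / hbm_bytes                    # simplifies to 2*d / bytes_per_elem

# With bf16 scores the intensity is just d FLOP/byte, independent of n.
# On a GPU with ~312e12 FLOP/s and ~2e12 B/s (A100-like, illustrative numbers),
# the ridge point is ~156 FLOP/byte, so this model is memory-bound whenever
# d < ~156 -- which covers typical head dimensions of 64-128.
print(attention_arith_intensity(4096, 64))
```

This is also a useful starting point for the second exercise below.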

Watch Out

Linear attention is not a drop-in replacement

Linear attention changes the model's expressive power, not just its cost. The softmax attention matrix can be arbitrarily sparse and peaked. Linear attention with a fixed feature map cannot represent sharp attention patterns. Models trained with linear attention often require architectural changes (gating, decay factors) to match softmax attention quality.

Key Takeaways

  • Standard attention costs $O(n^2 d)$ time and $O(n^2)$ memory
  • Linear attention uses the kernel trick to achieve $O(nmd)$ but sacrifices sharp attention patterns
  • Sparse methods (Longformer, BigBird) compute exact attention on a structured subset of pairs
  • Low-rank methods (Linformer, Performer) approximate the attention matrix
  • FlashAttention computes exact attention with IO-aware tiling, making approximate methods unnecessary for $n < 16\text{K}$
  • For truly long sequences ($n > 100\text{K}$), sub-quadratic methods and state-space models remain relevant

Exercises

ExerciseCore

Problem

In linear attention with feature map $\phi: \mathbb{R}^d \to \mathbb{R}^m$, what are the shapes of the precomputed quantities $S = \sum_j \phi(K_j) V_j^\top$ and $z = \sum_j \phi(K_j)$? What is the total memory cost of storing $S$ and $z$ compared to storing the full $n \times n$ attention matrix?

ExerciseAdvanced

Problem

Explain why FlashAttention uses more FLOPs than standard attention (due to recomputation in the backward pass) yet achieves faster wall-clock time. Quantify the tradeoff: if HBM bandwidth is $B$ bytes/sec and compute throughput is $C$ FLOP/sec, under what condition on $n$ and $d$ is standard attention memory-bound?

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
  • Katharopoulos et al., "Transformers are RNNs" (ICML 2020), linear attention

Current:

  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (NeurIPS 2022)
  • Tay et al., "Efficient Transformers: A Survey" (ACM Computing Surveys 2022)
  • Dao, "FlashAttention-2" (ICLR 2024)


Last reviewed: April 2026
