LLM Construction
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods.
Why This Matters
Attention is the computational primitive that makes transformers work. Every modern LLM (GPT-4, Claude, Gemini, Llama) processes information through billions of attention operations. Understanding attention mathematically means understanding why the formula takes the specific form it does, what goes wrong without the scaling factor, and how attention relates to classical ideas in statistics and kernel methods.
This topic focuses on the mathematical theory of attention itself, separated from the full transformer architecture, so you can build precise intuition for this single operation before composing it into larger systems.
Mental Model
Attention is a soft dictionary lookup. You have a query ("what am I looking for?"), a set of keys ("what does each entry contain?"), and a set of values ("what information does each entry carry?"). The query is compared against all keys to produce similarity scores. These scores become weights via softmax, and the output is a weighted sum of values.
Unlike a hard dictionary lookup (which returns the value of the exact matching key), attention returns a blend of all values, weighted by how well each key matches the query. This soft matching is what allows attention to combine information from multiple positions in a differentiable way.
Formal Setup and Notation
Let $n$ be the sequence length and $d_{\text{model}}$ the model dimension. The input is a matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ where each row is a token embedding.
We project the input into three spaces:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.
Scaled Dot-Product Attention
The scaled dot-product attention function is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
where the softmax is applied independently to each row of the matrix $QK^\top/\sqrt{d_k}$.
For a single query $q_i$ (the $i$-th row of $Q$), the output is:
$$\mathrm{output}_i = \sum_j \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_l \exp(q_i \cdot k_l / \sqrt{d_k})}$$
The attention weights form a probability distribution over positions: $\alpha_{ij} \ge 0$ and $\sum_j \alpha_{ij} = 1$.
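The row-wise softmax formula above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (function name and shapes are ours, not from any particular library), with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax applied row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize exp (softmax is shift-invariant)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 4
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (n, d_v)
assert np.allclose(w.sum(axis=-1), 1.0)  # rows are probability distributions
```

Each output row is a convex combination of the rows of `V`, exactly as the summation form states.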
Why Scale by $\sqrt{d_k}$
Scaling Prevents Softmax Saturation
Statement
Assume the entries of $q, k \in \mathbb{R}^{d_k}$ are mutually independent random variables, each with mean $0$ and variance $1$. Then the dot product $q \cdot k$ has:
$$\mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$
The scaled dot product $q \cdot k / \sqrt{d_k}$ has variance $1$, regardless of $d_k$.
Intuition
The dot product is a sum of $d_k$ independent terms $q_i k_i$, each with variance $1$. By the independence assumption, the variance of the sum is the sum of the variances: $\mathrm{Var}(q \cdot k) = d_k$. Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax inputs into regions where the gradient is near zero (softmax saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to $1$, keeping the softmax in a regime with useful gradients.
Proof Sketch
Each term $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$ and $\mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$ (using independence and the fact that the means are zero).
The dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ is a sum of $d_k$ independent random variables, so $\mathrm{Var}(q \cdot k) = d_k$.
After scaling: $\mathrm{Var}(q \cdot k / \sqrt{d_k}) = d_k / d_k = 1$.
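The variance claim is easy to check empirically. This short simulation (parameters are arbitrary illustrative choices) draws many independent query-key pairs and verifies that the unscaled dot-product variance tracks $d_k$ while the scaled variance stays near $1$:

```python
import numpy as np

rng = np.random.default_rng(42)
for d_k in (16, 64, 256):
    # 100k independent (q, k) pairs with i.i.d. mean-0, variance-1 entries
    q = rng.normal(size=(100_000, d_k))
    k = rng.normal(size=(100_000, d_k))
    dots = (q * k).sum(axis=1)
    scaled = dots / np.sqrt(d_k)
    # empirical variances should be close to d_k and 1 respectively
    assert abs(dots.var() / d_k - 1.0) < 0.05
    assert abs(scaled.var() - 1.0) < 0.05
```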
Why It Matters
Without scaling, a model with $d_k = 512$ would have dot products with standard deviation $\sqrt{512} \approx 22.6$. Softmax inputs of magnitude 20+ produce outputs extremely close to 0 or 1, with gradients on the order of $e^{-20}$. Training would effectively freeze. The scaling is not a minor numerical convenience; it is essential for trainability.
Failure Mode
The assumption that the $q$ and $k$ entries are independent with unit variance holds approximately at initialization (with proper weight initialization) but may not hold after training. In practice, the model learns to calibrate its own attention logits, so the scaling factor becomes less critical later in training. However, removing it entirely still causes training instability.
Attention as Soft Dictionary Lookup
A hard dictionary lookup with query $q$, keys $\{k_j\}$, and values $\{v_j\}$ returns $v_{j^*}$ where $j^* = \arg\max_j \mathrm{sim}(q, k_j)$ for some similarity function.
Attention replaces the hard $\arg\max$ with a soft weighting:
$$\mathrm{output} = \sum_j \mathrm{softmax}_j\big(\mathrm{sim}(q, k_j)\big)\, v_j$$
The output is a convex combination of all values, with higher weight on values whose keys are more similar to the query.
In the transformer, the similarity function is $\mathrm{sim}(q, k) = q \cdot k / \sqrt{d_k}$. Scaling by $\sqrt{d_k}$ is a fixed normalization, not a learned temperature. When dot-product magnitudes grow (at inference time with unusually large query norms, or when studying asymptotic behavior), the softmax becomes peaky and the soft lookup approaches a hard one.
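The soft-to-hard transition is easy to see numerically. In this sketch (scores and scale factors are arbitrary illustrative values), multiplying the logits by a growing factor makes the softmax weights collapse onto the best-matching key:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])  # similarity of one query to four keys
for scale in (1.0, 5.0, 50.0):
    w = softmax(scale * scores)
    # as logit magnitudes grow, weight concentrates on the argmax key
    print(scale, w.round(3))

# at scale 50 the soft lookup is numerically indistinguishable from a hard argmax
assert softmax(50.0 * scores).argmax() == scores.argmax()
assert softmax(50.0 * scores).max() > 0.999
```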
Attention is not literally dictionary lookup
The soft-dictionary analogy is a useful mental model for the mechanics of a single attention head. It does not capture what attention computes in a trained transformer. Learned projections make queries and keys live in spaces that have no relation to a human-interpretable notion of "matching." Mechanistic interpretability work (Elhage et al. 2021, Olsson et al. 2022) shows heads implementing copying, induction, positional patterns, and composition. Use the analogy to understand the formula, not to predict what heads do.
Self-Attention vs Cross-Attention
Self-Attention
In self-attention, queries, keys, and values all come from the same input sequence:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
Each token attends to all tokens in the same sequence (including itself). Self-attention is how a model builds contextual representations.
Cross-Attention
In cross-attention, queries come from one sequence and keys/values come from another:
$$Q = X_{\text{tgt}}W^Q, \qquad K = X_{\text{src}}W^K, \qquad V = X_{\text{src}}W^V$$
This is used in encoder-decoder models (e.g., for translation: the decoder attends to the encoder output) and in retrieval-augmented generation.
The mathematical formulation is identical. The only difference is whether $Q$ and $K, V$ derive from the same or different input matrices.
Multi-Head Attention
Instead of a single attention function, compute $h$ heads in parallel on lower-dimensional projections:
$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ with $d_k = d_v = d_{\text{model}}/h$.
Concatenate and project:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
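The concatenate-and-project recipe can be sketched directly. This minimal NumPy version (function names and the column-slicing layout are our illustrative choices; real implementations reshape into a head axis instead) stores all $h$ per-head projections side by side in single $d_{\text{model}} \times d_{\text{model}}$ matrices:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Wq/Wk/Wv: (d_model, d_model), split column-wise into h heads; Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        sl = slice(i * d_head, (i + 1) * d_head)   # this head's subspace
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax_rows(scores) @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo      # Concat(head_1..head_h) W^O

rng = np.random.default_rng(1)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
assert out.shape == (n, d_model)
```

Each head attends within its own $d_{\text{model}}/h$-dimensional subspace, and $W^O$ mixes the heads back together.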
Multi-Head Attention Representational Capacity
Statement
Under the Vaswani et al. (2017) convention and ignoring bias terms, multi-head attention has the same weight parameter count as single-head attention with full head dimension $d_k = d_v = d_{\text{model}}$:
$$\underbrace{h \cdot 3 \cdot d_{\text{model}} \cdot \frac{d_{\text{model}}}{h}}_{W_i^Q,\, W_i^K,\, W_i^V} + \underbrace{d_{\text{model}}^2}_{W^O} = 4\, d_{\text{model}}^2$$
However, MHA can represent richer attention patterns: each head can specialize in a different type of relationship (syntactic, semantic, positional), and the output projection learns how to combine these patterns.
The equality breaks under other widely used choices. Including biases adds parameters. Multi-query attention (MQA) shares a single $K$ and $V$ projection across all heads, shrinking the K/V parameter count by a factor of $h$. Grouped-query attention (GQA) interpolates between these extremes. See the MQA/GQA page for the exact counts.
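The parameter-count equality is a two-line arithmetic check. This sketch uses illustrative sizes ($d_{\text{model}} = 512$, $h = 8$) and ignores biases, per the convention above:

```python
# Parameter counts (no biases), following the Vaswani et al. (2017) convention.
d_model, h = 512, 8
d_head = d_model // h

# Multi-head: h heads, each with W_i^Q, W_i^K, W_i^V of shape (d_model, d_head),
# plus output projection W^O of shape (h * d_head, d_model).
mha = h * 3 * d_model * d_head + (h * d_head) * d_model

# Single head with full dimension d_k = d_v = d_model, same output projection.
single = 3 * d_model * d_model + d_model * d_model

assert mha == single == 4 * d_model**2
```

Splitting into heads reallocates the same weights across subspaces; it does not add capacity in raw parameter terms.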
Intuition
Multiple heads give the model multiple independent "attention channels." One head might track subject-verb agreement, another might track coreference, a third might focus on adjacent tokens. Single-head attention is forced to compress all these patterns into a single set of attention weights, which is a lossy compression. Multi-head attention avoids this by giving each pattern its own subspace.
Why It Matters
Empirically, reducing to a single head significantly degrades performance. The multi-head structure is one of the most important design decisions in the transformer. Mechanistic interpretability research has shown that individual heads do specialize: there are "induction heads" that copy patterns, "previous token heads" that attend to the immediately preceding token, and "name mover heads" that track entities.
Computational Complexity
The dominant cost of attention is computing the attention matrix:
- Compute: $QK^\top$ requires $O(n^2 d_k)$ operations. Multiplying the attention weights by $V$ requires $O(n^2 d_v)$. Total: $O(n^2 d)$ where $d = d_{\text{model}}$.
- Memory: Storing $QK^\top$ requires $O(n^2)$ per head, or $O(h n^2)$ total.
For long sequences ($n \gg d$), the attention matrix has $n^2$ entries per head. This quadratic scaling is the fundamental bottleneck for long-context models and motivates FlashAttention (which reduces memory to $O(n)$ by tiling, without changing FLOPs), sparse attention, sub-quadratic architectures, and research into attention sinks and retrieval decay in streaming settings.
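A back-of-envelope calculation makes the quadratic memory concrete. The head count and sequence lengths here are hypothetical illustrative values, and the formula counts only the materialized $n \times n$ attention matrices for one layer in float32:

```python
# Hypothetical sizing exercise: bytes to store the h attention matrices of one layer.
def attn_matrix_bytes(n, h, bytes_per_el=4):  # float32
    return h * n * n * bytes_per_el

h = 32  # hypothetical number of heads
for n in (2_048, 8_192, 32_768):
    gib = attn_matrix_bytes(n, h) / 2**30
    print(f"n={n:>6}: {gib:8.1f} GiB per layer")

# doubling the sequence length quadruples the attention-matrix memory
assert attn_matrix_bytes(16_384, h) == 4 * attn_matrix_bytes(8_192, h)
```

This is precisely the term FlashAttention avoids materializing.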
Connection to Kernel Methods
Attention as a Kernel Smoother
Statement
Scaled dot-product attention can be written as a Nadaraya-Watson kernel regression estimator:
$$\mathrm{output}_i = \frac{\sum_j \kappa(q_i, k_j)\, v_j}{\sum_l \kappa(q_i, k_l)}$$
where the kernel function is $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$.
This is a softmax (exponential dot-product) kernel. It is not a translation-invariant RBF kernel. Using $q \cdot k = \tfrac{1}{2}\big(\|q\|^2 + \|k\|^2 - \|q - k\|^2\big)$:
$$\exp\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \exp\!\left(\frac{\|q\|^2 + \|k\|^2}{2\sqrt{d_k}}\right) \exp\!\left(-\frac{\|q - k\|^2}{2\sqrt{d_k}}\right)$$
The left factor breaks translation invariance: unlike the Gaussian RBF kernel, the softmax kernel depends on $\|q\|$ and $\|k\|$ separately, not just on $\|q - k\|$. Only when $\|q\|$ and $\|k\|$ are (approximately) constant across all query-key pairs does the softmax kernel reduce to an RBF kernel. Layer normalization controls the pre-projection norm, but $q$ and $k$ are not norm-constrained, so this equivalence is heuristic rather than exact. Tsai et al. (2019) treat the softmax kernel as an asymmetric dot-product kernel, which is the view we use here.
Intuition
The Nadaraya-Watson estimator is a classical nonparametric regression method: to estimate a function value at a query point, take a weighted average of observed values, where the weights are determined by a kernel measuring similarity between the query point and each data point. Attention is doing exactly this: it estimates the output at each position as a kernel-weighted average of value vectors.
Why It Matters
This connection has two major implications. First, it explains why attention works as a form of nonparametric in-context learning: the model can adapt its behavior at inference time by "regressing" on the input context, without updating weights. Second, it opens the door to efficient attention approximations via random feature maps for kernels (the "Performers" approach), which approximate the softmax kernel with $O(n)$ complexity.
Failure Mode
The kernel interpretation is cleanest for a single attention head without learned projections. In practice, the learned matrices $W^Q, W^K, W^V$ transform the inputs before the kernel is applied, and multi-head attention applies multiple kernels simultaneously. The kernel analogy is useful for intuition but does not fully capture the representational power of learned multi-head attention.
Common Confusions
Attention weights are NOT learned parameters
The attention weights $\alpha_{ij}$ are computed dynamically from the input at every forward pass. They change for every input sequence. The learned parameters are $W^Q, W^K, W^V$ (and $W^O$), which determine how attention is computed, not what the attention pattern is. This input-dependence is the key difference between attention and fixed linear layers.
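The distinction is easy to demonstrate: hold the projection matrices fixed and feed in two different sequences. In this sketch (random illustrative data), the parameters never change, yet the attention pattern does:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # fixed "learned" parameters

def attention_weights(X):
    """Attention pattern for input X under the fixed projections Wq, Wk."""
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X1, X2 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
# Same parameters, different inputs -> different attention patterns
assert not np.allclose(attention_weights(X1), attention_weights(X2))
```

A fixed linear layer would apply the same weights to every input; attention recomputes its mixing weights from the input itself.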
Additive attention and dot-product attention are different
The original Bahdanau attention (2015) used additive scoring: $\mathrm{score}(q, k) = v_a^\top \tanh(W_1 q + W_2 k)$. Dot-product attention uses $\mathrm{score}(q, k) = q \cdot k / \sqrt{d_k}$. Vaswani et al. (2017) showed that dot-product attention is faster (it is a single matrix multiplication) and performs comparably when properly scaled. Additive attention is more flexible but slower. Modern transformers exclusively use scaled dot-product attention.
O(n^2) is in the sequence length, not the model dimension
When people say attention is "quadratic," they mean quadratic in $n$ (sequence length), not $d$ (model dimension). The cost is $O(n^2 d)$. For a fixed model size, doubling the context window quadruples the attention cost. For a fixed context, doubling the model dimension only doubles the cost.
Summary
- Attention: $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, a soft dictionary lookup
- Scaling by $\sqrt{d_k}$ normalizes dot-product variance to 1, preventing softmax saturation
- Without scaling, dot-product variance grows as $d_k$, killing gradients
- Multi-head attention: $h$ parallel heads with $d_k = d_{\text{model}}/h$, same parameter count as single head
- Self-attention: Q, K, V from same input. Cross-attention: Q from target, K/V from source
- Attention is a kernel smoother (Nadaraya-Watson estimator with softmax kernel)
- Computational cost: $O(n^2 d)$ compute, $O(n^2)$ memory per head
Exercises
Problem
Suppose $d_k = 64$ and the entries of $q$ and $k$ are i.i.d. with mean 0 and variance 1. What is the standard deviation of the unscaled dot product $q \cdot k$? What is the standard deviation after scaling by $\sqrt{d_k}$? If the softmax receives inputs with standard deviation 8, roughly how concentrated will the output distribution be?
Problem
Show that self-attention is permutation-equivariant: if $P$ is a permutation matrix and $X' = PX$, then $\mathrm{SelfAttention}(X') = P\,\mathrm{SelfAttention}(X)$. Why does this mean that a transformer without positional encoding cannot distinguish token order?
Problem
The kernel interpretation says attention uses kernel $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$. The "Performers" paper (Choromanski et al., 2021) proposes approximating this kernel with random features $\phi$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$, enabling $O(n)$ attention. What is the key mathematical identity that makes this possible, and what is the main accuracy tradeoff?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762. The transformer paper. See the paper notes page.
- Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). Original soft-attention alignment.
Current:
- Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Random-feature kernel approximation.
- Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel" (EMNLP 2019). Asymmetric kernel view.
- Elhage et al., "A Mathematical Framework for Transformer Circuits" (2021). Attention-head circuit analysis.
- Olsson et al., "In-context Learning and Induction Heads" (2022). Mechanistic analysis of attention heads.
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 ("Transformers and Large Language Models").
Next Topics
The natural next steps from attention theory:
- KV cache: how autoregressive generation avoids recomputing attention
- Positional encoding: why attention needs position information and the mathematics of RoPE
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- Attention as Kernel Regression (Layer 4)
- Attention Sinks and Retrieval Decay (Layer 4)
- Attention Variants and Efficiency (Layer 4)
- Context Engineering (Layer 5)
- Flash Attention (Layer 5)
- Forgetting Transformer (FoX) (Layer 4)
- Induction Heads (Layer 4)
- KV Cache (Layer 5)
- Mamba and State-Space Models (Layer 4)
- Positional Encoding (Layer 4)
- Sparse Attention and Long Context (Layer 4)
- Transformer Architecture (Layer 4)