

Attention Mechanism Theory

Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods.


Why This Matters

[Figure: attention as a soft dictionary lookup. Input tokens "The", "cat", "sat" are projected into queries ($W_Q \cdot X$, "what am I looking for?"), keys ($W_K \cdot X$, "what do I contain?"), and values ($W_V \cdot X$, "what do I contribute?"). The scores $QK^\top / \sqrt{d}$ pass through a softmax to produce weights $\alpha$, which are multiplied by $V$ to give the output. Example: "sat" attends to "cat" (0.7), "sat" (0.2), and "The" (0.1).]

$$\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k}) \cdot V$$

Attention is the computational primitive that makes transformers work. Every modern LLM (GPT-4, Claude, Gemini, Llama) processes information through billions of attention operations. Understanding attention mathematically means understanding why the specific formula $\text{softmax}(QK^\top / \sqrt{d_k})V$ takes the form it does, what goes wrong without the scaling factor, and how attention relates to classical ideas in statistics and kernel methods.

This topic focuses on the mathematical theory of attention itself, separated from the full transformer architecture, so you can build precise intuition for this single operation before composing it into larger systems.

Mental Model

Attention is a soft dictionary lookup. You have a query ("what am I looking for?"), a set of keys ("what does each entry contain?"), and a set of values ("what information does each entry carry?"). The query is compared against all keys to produce similarity scores. These scores become weights via softmax, and the output is a weighted sum of values.

Unlike a hard dictionary lookup (which returns the value of the exact matching key), attention returns a blend of all values, weighted by how well each key matches the query. This soft matching is what allows attention to combine information from multiple positions in a differentiable way.
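The contrast can be made concrete in a few lines of NumPy. This is an illustrative sketch, not production code: the "similarity" here is a plain dot product, and the key/value vectors are toy values chosen by hand.

```python
import numpy as np

def hard_lookup(q, keys, values):
    """Return the value whose key best matches the query (argmax)."""
    scores = keys @ q                      # similarity of q to every key
    return values[np.argmax(scores)]

def soft_lookup(q, keys, values):
    """Return a softmax-weighted blend of all values (attention-style)."""
    scores = keys @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # weights are a probability distribution
    return weights @ values                # convex combination of values

keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0], [20.0]])
q = np.array([0.9, 0.1])                   # query closer to the first key

hard = hard_lookup(q, keys, values)        # exactly values[0]
soft = soft_lookup(q, keys, values)        # mostly values[0], some values[1]
```

The hard lookup returns a single value and is non-differentiable at the argmax boundary; the soft lookup returns a blend that varies smoothly with `q`, which is what makes attention trainable by gradient descent.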

Formal Setup and Notation

Let $n$ be the sequence length and $d$ the model dimension. The input is a matrix $X \in \mathbb{R}^{n \times d}$ where each row is a token embedding.

We project the input into three spaces:

$$Q = XW_Q \in \mathbb{R}^{n \times d_k}, \quad K = XW_K \in \mathbb{R}^{n \times d_k}, \quad V = XW_V \in \mathbb{R}^{n \times d_v}$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.

Scaled Dot-Product Attention

Definition

Scaled Dot-Product Attention

The scaled dot-product attention function is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where the softmax is applied independently to each row of the $n \times n$ matrix $QK^\top / \sqrt{d_k}$.

For a single query $q_i$ (the $i$-th row of $Q$), the output is:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(q_i^\top k_\ell / \sqrt{d_k})}$$

The attention weights $\alpha_{ij}$ form a probability distribution over positions: $\alpha_{ij} \geq 0$ and $\sum_j \alpha_{ij} = 1$.
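The definition translates directly into NumPy. This is a reference sketch for a single head on toy data (the max-subtraction inside the softmax is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) logits
    scores -= scores.max(axis=-1, keepdims=True)    # stability; cancels in softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 5
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, alpha = scaled_dot_product_attention(Q, K, V)
# alpha[i] is a probability distribution over the n positions;
# out[i] is the corresponding convex combination of value vectors.
```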

Why Scale by $\sqrt{d_k}$

Proposition

Scaling Prevents Softmax Saturation

Statement

Assume $q_1, \ldots, q_{d_k}, k_1, \ldots, k_{d_k}$ are mutually independent random variables, each with mean $0$ and variance $1$. Then the dot product $q^\top k$ has:

$$\mathbb{E}[q^\top k] = 0, \qquad \text{Var}(q^\top k) = d_k$$

The scaled dot product $q^\top k / \sqrt{d_k}$ has variance $1$, regardless of $d_k$.

Intuition

The dot product is a sum of $d_k$ independent terms $q_i k_i$, each with variance $1$. By the independence assumption, the variance of the sum is the sum of the variances: $d_k$. Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax inputs into regions where the gradient is near zero (softmax saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to $1$, keeping the softmax in a regime with useful gradients.

Proof Sketch

Each entry $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\mathbb{E}[k_i] = 0$ and $\text{Var}(q_i k_i) = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] = 1 \cdot 1 = 1$ (using independence and the fact that the means are zero).

The dot product $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ is a sum of $d_k$ independent random variables, so $\text{Var}(q^\top k) = d_k$.

After scaling: $\text{Var}(q^\top k / \sqrt{d_k}) = d_k / d_k = 1$.

Why It Matters

Without scaling, a model with $d_k = 512$ would have dot products with standard deviation $\sqrt{512} \approx 22.6$. Softmax inputs of magnitude 20+ produce outputs extremely close to 0 or 1, with gradients on the order of $10^{-9}$. Training would effectively freeze. The $\sqrt{d_k}$ scaling is not a minor numerical convenience. It is essential for trainability.
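A quick simulation makes both effects visible. This sketch assumes i.i.d. standard normal entries, matching the proposition's hypotheses; the specific seed and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 512, 10_000

# Empirical standard deviation of q.k over many random pairs
q = rng.normal(size=(trials, d_k))
k = rng.normal(size=(trials, d_k))
dots = (q * k).sum(axis=1)

raw_std = dots.std()                        # ~ sqrt(512) ≈ 22.6
scaled_std = (dots / np.sqrt(d_k)).std()    # ~ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Same logits, two temperature regimes
logits = rng.normal(size=32)
saturated = softmax(logits * np.sqrt(d_k))  # unscaled regime: near one-hot
healthy = softmax(logits)                   # scaled regime: spread out
```

Multiplying the logits by $\sqrt{d_k}$ sharply concentrates the softmax output (`saturated.max()` far exceeds `healthy.max()`), which is exactly the saturation regime where gradients vanish.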

Failure Mode

The assumption that $q$ and $k$ entries are independent with unit variance holds approximately at initialization (with proper weight initialization) but may not hold after training. In practice, the model learns to calibrate its own attention logits, so the scaling factor becomes less critical later in training. However, removing it entirely still causes training instability.

Attention as Soft Dictionary Lookup

Definition

Attention as Soft Dictionary Lookup

A hard dictionary lookup with query $q$, keys $\{k_1, \ldots, k_n\}$, and values $\{v_1, \ldots, v_n\}$ returns $v_{j^*}$ where $j^* = \arg\max_j \, \text{sim}(q, k_j)$ for some similarity function.

Attention replaces the hard $\arg\max$ with a soft weighting:

$$\text{output} = \sum_{j=1}^{n} \underbrace{\frac{\exp(\text{sim}(q, k_j))}{\sum_\ell \exp(\text{sim}(q, k_\ell))}}_{\text{soft selection weight}} \cdot v_j$$

The output is a convex combination of all values, with higher weight on values whose keys are more similar to the query.

In the transformer, the similarity function is $\text{sim}(q, k) = q^\top k / \sqrt{d_k}$. Scaling by $1/\sqrt{d_k}$ is a fixed normalization, not a learned temperature. When dot-product magnitudes grow (at inference time with unusually large query norms, or when studying asymptotic behavior), the softmax becomes peaky and the soft lookup approaches a hard one.

Watch Out

Attention is not literally dictionary lookup

The soft-dictionary analogy is a useful mental model for the mechanics of a single attention head. It does not capture what attention computes in a trained transformer. Learned $Q, K, V$ projections make queries and keys live in spaces that have no relation to a human-interpretable notion of "matching." Mechanistic interpretability work (Elhage et al. 2021, Olsson et al. 2022) shows heads implementing copying, induction, positional patterns, and composition. Use the analogy to understand the formula, not to predict what heads do.

Self-Attention vs Cross-Attention

Definition

Self-Attention

In self-attention, queries, keys, and values all come from the same input sequence:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Each token attends to all tokens in the same sequence (including itself). Self-attention is how a model builds contextual representations.

Definition

Cross-Attention

In cross-attention, queries come from one sequence and keys/values come from another:

$$Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V$$

This is used in encoder-decoder models (e.g., for translation: the decoder attends to the encoder output) and in retrieval-augmented generation.

The mathematical formulation is identical. The only difference is whether $Q$ and $K, V$ derive from the same or different input matrices.
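That identity is easiest to see in code: the same function handles both cases, and only the arguments change. A minimal sketch (the `attend` helper and the sequence names here are illustrative, not a real library API):

```python
import numpy as np

def attend(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with row-wise softmax."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 16
X_target = rng.normal(size=(3, d))   # e.g. decoder states
X_source = rng.normal(size=(7, d))   # e.g. encoder output
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: Q, K, V all derived from the same sequence
self_out = attend(X_target @ W_Q, X_target @ W_K, X_target @ W_V)

# Cross-attention: same function, K and V from a different sequence
cross_out = attend(X_target @ W_Q, X_source @ W_K, X_source @ W_V)
```

Note the output length is always the query-side length (3 here), while the number of positions attended over is the key/value-side length (3 vs. 7).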

Multi-Head Attention

Definition

Multi-Head Attention

Instead of a single attention function, compute $h$ heads in parallel on lower-dimensional projections:

$$\text{head}_i = \text{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d \times d_k}$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_v}$ with $d_k = d_v = d / h$.

Concatenate and project:

$$\text{MHA}(X) = [\text{head}_1; \ldots; \text{head}_h] \, W_O$$

where $W_O \in \mathbb{R}^{d \times d}$.

Proposition

Multi-Head Attention Representational Capacity

Statement

Under the Vaswani et al. (2017) convention $d_k = d_v = d / h$ and ignoring bias terms, multi-head attention has the same weight parameter count as single-head attention with full head dimension $d$:

$$\text{Params(MHA)} = h \cdot 3 \cdot d \cdot (d/h) + d^2 = 3d^2 + d^2 = 4d^2$$

$$\text{Params(single-head)} = 3 \cdot d \cdot d + d^2 = 4d^2$$

However, MHA can represent richer attention patterns: each head can specialize in a different type of relationship (syntactic, semantic, positional), and the output projection $W_O$ learns how to combine these patterns.

The equality breaks under other widely used choices. Including biases adds $4d$ parameters. Multi-query attention (MQA) shares $W_K, W_V$ across all heads, giving $d^2 + 2d(d/h) + d^2$ parameters. Grouped-query attention (GQA) interpolates between these extremes. See the MQA/GQA page for the exact counts.
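The counts are easy to verify numerically. A sketch for one representative configuration ($d = 512$, $h = 8$, no biases):

```python
# Weight parameter counts under the d_k = d_v = d/h convention (no biases).
d, h = 512, 8
d_k = d // h

mha = h * 3 * d * d_k + d * d     # h heads x (W_Q, W_K, W_V) + output projection W_O
single = 3 * d * d + d * d        # full-width W_Q, W_K, W_V + output projection
mqa = d * d + 2 * d * d_k + d * d # per-head W_Q, shared W_K and W_V, plus W_O

# mha == single == 4 d^2; MQA is strictly smaller
```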

Intuition

Multiple heads give the model multiple independent "attention channels." One head might track subject-verb agreement, another might track coreference, a third might focus on adjacent tokens. Single-head attention is forced to compress all these patterns into a single set of attention weights, which is a lossy compression. Multi-head attention avoids this by giving each pattern its own subspace.

Why It Matters

Empirically, reducing to a single head significantly degrades performance. The multi-head structure is one of the most important design decisions in the transformer. Mechanistic interpretability research has shown that individual heads do specialize: there are "induction heads" that copy patterns, "previous token heads" that attend to the immediately preceding token, and "name mover heads" that track entities.

Computational Complexity

The dominant cost of attention is computing the $n \times n$ attention matrix:

$$A = \text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times n}$$

  • Compute: $QK^\top$ requires $O(n^2 d_k)$ operations. Multiplying $A \cdot V$ requires $O(n^2 d_v)$. Total: $O(n^2 d)$ where $d = h \cdot d_k$.
  • Memory: Storing $A$ requires $O(n^2)$ per head, or $O(n^2 h)$ total.

For long sequences ($n = 100{,}000$), the attention matrix has $10^{10}$ entries per head. This quadratic scaling is the fundamental bottleneck for long-context models and motivates FlashAttention (which reduces memory to $O(n)$ by tiling, without changing FLOPs), sparse attention, sub-quadratic architectures, and research into attention sinks and retrieval decay in streaming settings.
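The back-of-envelope arithmetic behind that claim, assuming the matrix were naively materialized in fp16 (2 bytes per entry):

```python
# Size of one head's attention matrix at long context, if stored naively.
n = 100_000                 # sequence length
entries = n * n             # 10^10 scores per head
bytes_fp16 = 2 * entries    # fp16: 2 bytes per entry -> 20 GB per head

gb_per_head = bytes_fp16 / 1e9
```

This is exactly why FlashAttention-style tiling, which never materializes the full matrix, is necessary at these context lengths.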

Connection to Kernel Methods

Proposition

Attention as a Kernel Smoother

Statement

Scaled dot-product attention can be written as a Nadaraya-Watson kernel regression estimator:

$$\text{output}_i = \sum_{j=1}^{n} \frac{\kappa(q_i, k_j)}{\sum_{\ell=1}^{n} \kappa(q_i, k_\ell)} \, v_j$$

where the kernel function is $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$.

This is a softmax (exponential dot-product) kernel: $\kappa(q, k) = \exp(\langle q, k \rangle / \sqrt{d_k})$. It is not a translation-invariant RBF kernel. Using $q^\top k = (\|q\|^2 + \|k\|^2 - \|q - k\|^2)/2$:

$$\exp\left(\frac{q^\top k}{\sqrt{d_k}}\right) = \exp\left(\frac{\|q\|^2 + \|k\|^2}{2\sqrt{d_k}}\right) \cdot \exp\left(-\frac{\|q - k\|^2}{2\sqrt{d_k}}\right)$$

The left factor breaks translation invariance: unlike the Gaussian RBF kernel, the softmax kernel depends on $\|q\|$ and $\|k\|$ separately, not just on $\|q - k\|$. Only when $\|q\|$ and $\|k\|$ are (approximately) constant across all query-key pairs does the softmax kernel reduce to an RBF kernel. Layer normalization controls the pre-projection norm, but $q = W_Q x$ and $k = W_K x$ are not norm-constrained, so this equivalence is heuristic rather than exact. Tsai et al. (2019) treat the softmax kernel as an asymmetric dot-product kernel, which is the view we use here.
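The Nadaraya-Watson form is not an analogy but an algebraic identity, which a few lines of NumPy confirm: normalized exponential-kernel weights coincide exactly with softmax attention weights (the max-subtraction in the stable softmax cancels in the normalization).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, 3))

# Softmax attention, written the usual way (with stability shift)
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
attn_out = A @ V

# Nadaraya-Watson estimator with kernel kappa(q, k) = exp(q.k / sqrt(d_k))
kappa = np.exp(Q @ K.T / np.sqrt(d_k))
nw_out = (kappa / kappa.sum(axis=-1, keepdims=True)) @ V
# attn_out and nw_out agree up to floating point
```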

Intuition

The Nadaraya-Watson estimator is a classical nonparametric regression method: to estimate a function value at a query point, take a weighted average of observed values, where the weights are determined by a kernel measuring similarity between the query point and each data point. Attention is doing exactly this: it estimates the output at each position as a kernel-weighted average of value vectors.

Why It Matters

This connection has two major implications. First, it explains why attention works as a form of nonparametric in-context learning: the model can adapt its behavior at inference time by "regressing" on the input context, without updating weights. Second, it opens the door to efficient attention approximations via random feature maps for kernels (the "Performers" approach), which approximate the softmax kernel with $O(n)$ complexity.

Failure Mode

The kernel interpretation is cleanest for a single attention head without learned projections. In practice, the learned $W_Q, W_K, W_V$ matrices transform the inputs before the kernel is applied, and multi-head attention applies multiple kernels simultaneously. The kernel analogy is useful for intuition but does not fully capture the representational power of learned multi-head attention.

Common Confusions

Watch Out

Attention weights are NOT learned parameters

The attention weights $\alpha_{ij}$ are computed dynamically from the input at every forward pass. They change for every input sequence. The learned parameters are $W_Q, W_K, W_V, W_O$, which determine how attention is computed, not what the attention pattern is. This input-dependence is the key difference between attention and fixed linear layers.
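A two-line experiment makes the distinction concrete: hold the projection matrices fixed and feed two different inputs, and the attention maps differ while the parameters never change. (A sketch with random matrices standing in for trained weights.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# "Learned" parameters: fixed for the whole experiment
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attention_weights(X):
    """Attention map alpha for input X under the fixed W_Q, W_K."""
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

X1 = rng.normal(size=(4, d))
X2 = rng.normal(size=(4, d))

A1, A2 = attention_weights(X1), attention_weights(X2)
# Same parameters, different inputs -> different attention patterns
```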

Watch Out

Additive attention and dot-product attention are different

The original Bahdanau attention (2015) used additive scoring: $\text{score}(q, k) = v^\top \tanh(W_q q + W_k k)$. Dot-product attention uses $\text{score}(q, k) = q^\top k$. Vaswani et al. (2017) showed that dot-product attention is faster (it is a single matrix multiplication) and performs comparably when properly scaled. Additive attention is more flexible but slower. Modern transformers exclusively use scaled dot-product attention.

Watch Out

O(n^2) is in the sequence length, not the model dimension

When people say attention is "quadratic," they mean quadratic in $n$ (sequence length), not $d$ (model dimension). The cost is $O(n^2 d)$. For a fixed model size, doubling the context window quadruples the attention cost. For a fixed context, doubling the model dimension only doubles the cost.

Summary

  • Attention: $\text{softmax}(QK^\top / \sqrt{d_k})\, V$, a soft dictionary lookup
  • Scaling by $\sqrt{d_k}$ normalizes dot-product variance to 1, preventing softmax saturation
  • Without scaling, dot-product variance grows as $d_k$, killing gradients
  • Multi-head attention: $h$ parallel heads with $d_k = d/h$, same parameter count as a single full-width head
  • Self-attention: Q, K, V from the same input. Cross-attention: Q from target, K/V from source
  • Attention is a kernel smoother (Nadaraya-Watson estimator with the softmax kernel)
  • Computational cost: $O(n^2 d)$ compute, $O(n^2)$ memory per head

Exercises

ExerciseCore

Problem

Suppose $d_k = 64$ and the entries of $q$ and $k$ are i.i.d. with mean 0 and variance 1. What is the standard deviation of the unscaled dot product $q^\top k$? What is the standard deviation after scaling by $\sqrt{d_k}$? If the softmax receives inputs with standard deviation 8, roughly how concentrated will the output distribution be?

ExerciseAdvanced

Problem

Show that attention is permutation-equivariant: if $P$ is a permutation matrix and $X' = PX$, then $\text{Attention}(X'W_Q, X'W_K, X'W_V) = P \cdot \text{Attention}(XW_Q, XW_K, XW_V)$. Why does this mean that a transformer without positional encoding cannot distinguish token order?

ExerciseResearch

Problem

The kernel interpretation says attention uses the kernel $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$. The "Performers" paper (Choromanski et al., 2021) proposes approximating this kernel with random features $\phi(x)$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$, enabling $O(n)$ attention. What is the key mathematical identity that makes this possible, and what is the main accuracy tradeoff?


References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762. The transformer paper. See the paper notes page.
  • Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). Original soft-attention alignment.

Current:

  • Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Random-feature kernel approximation.
  • Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel" (EMNLP 2019). Asymmetric kernel view.
  • Olsson et al., "In-context Learning and Induction Heads" (2022). Mechanistic analysis of attention heads.
  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 ("Transformers and Large Language Models").

Next Topics

The natural next steps from attention theory:

  • KV cache: how autoregressive generation avoids recomputing attention
  • Positional encoding: why attention needs position information and the mathematics of RoPE

Last reviewed: April 2026
