
LLM Construction

Attention Is All You Need (Paper)

The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs.

Advanced · Tier 1 · Stable · ~45 min

Why This Matters

Figure: scaled dot-product attention. The input tokens ("The", "cat", "sat") are projected into queries ($W_Q X$: "What am I looking for?"), keys ($W_K X$: "What do I contain?"), and values ($W_V X$: "What do I contribute?"). The scores $QK^T/\sqrt{d}$ are passed through a softmax to produce weights $\alpha$, which are applied to $V$ to form the output. Example: "sat" attends to "cat" with high weight (0.7) and to "The" with low weight (0.1). $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})\, V$

Vaswani et al. (2017) proposed replacing recurrence entirely with self-attention for sequence transduction. Before this paper, the dominant architectures for sequence tasks were LSTMs and GRUs with attention. The transformer removed the sequential bottleneck, enabling parallel computation across all positions. Every modern large language model (GPT series, Claude, Gemini, Llama) descends from the architecture described in this paper.

Reading the original paper in 2026 is still valuable, not because every detail survived, but because understanding what changed and why reveals how the field evolved.

Formal Definitions

Definition

Self-Attention

Given an input sequence of $n$ vectors packed into matrices $Q, K, V \in \mathbb{R}^{n \times d_k}$, self-attention computes a weighted combination of value vectors where the weights are determined by pairwise similarity between queries and keys:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The queries, keys, and values are all linear projections of the same input sequence, hence "self." The softmax operates row-wise, producing a stochastic matrix of attention weights. Each output position is a convex combination of all value vectors.
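
A minimal NumPy sketch of this computation (illustrative only; the toy dimensions and random projection matrices below are assumptions, not values from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model). Returns (n, d_v): each row is a convex combination of value rows."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values are projections of the same X
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) pairwise query-key similarities
    weights = softmax(scores, axis=-1)         # row-wise stochastic matrix of attention weights
    return weights @ V

# toy example: n = 4 tokens, d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```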

Definition

Multi-Head Attention

Multi-head attention runs $h$ independent attention functions in parallel, each on a lower-dimensional projection of the input:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ with learned projections $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. The original paper uses $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.
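
A sketch of the multi-head wrapper, with $h = 8$ and $d_{\text{model}} = 512$ as in the paper (the per-head loop and the small initialization scale are implementation choices of mine, not prescriptions from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (n, d_model). W_q/W_k/W_v: (h, d_model, d_k). W_o: (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]           # per-head low-dimensional projections
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (n, n) attention pattern for head i
        heads.append(A @ V)                                     # (n, d_v)
    return np.concatenate(heads, axis=-1) @ W_o                 # concatenate heads, project back to d_model

# d_model = 512, h = 8, d_k = d_v = 64, as in the original paper
rng = np.random.default_rng(0)
n, d_model, h, d_k = 4, 512, 8, 64
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.05 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model)) * 0.05
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (4, 512)
```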

Definition

Positional Encoding

Since self-attention is permutation-equivariant (it treats the input as a set, not a sequence), explicit position information must be injected. The paper uses fixed sinusoidal positional encodings:

$$PE_{(\text{pos}, 2i)} = \sin\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \quad PE_{(\text{pos}, 2i+1)} = \cos\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

These are added (not concatenated) to the input embeddings. The sinusoidal form was chosen because $PE_{\text{pos}+k}$ can be expressed as a linear function of $PE_{\text{pos}}$, which the authors hypothesized would help the model learn relative positions.
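
A sketch of the sinusoidal encoding as defined above (the vectorized indexing and the function name are implementation choices, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Returns (n_positions, d_model): even columns get sin, odd columns get cos."""
    positions = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                       # (1, d_model/2)
    angles = positions / np.power(10000, 2 * i / d_model)      # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(1024, 512)
# The encoding is added to, not concatenated with, the token embeddings:
# X = token_embeddings + pe[:seq_len]
```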

Key Contributions

Self-attention as the sole mechanism. Prior work used attention as an add-on to recurrent neural networks (Bahdanau et al., 2014). Vaswani et al. showed that attention alone, without any recurrence or convolution, could match or beat recurrent models on translation benchmarks.

Multi-head attention. Instead of a single attention function, the paper splits queries, keys, and values into $h$ heads, each operating on a $d_k = d_{\text{model}}/h$-dimensional subspace:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

Scaled dot-product attention. The attention function is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling prevents dot products from growing large in magnitude, which would push the softmax into saturated regions with vanishing gradients.
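
A quick numerical check of this claim: for query and key entries with mean 0 and variance 1, the dot product has standard deviation roughly $\sqrt{d_k}$, so without the scaling the softmax inputs grow with dimension (the sample sizes and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
    # unscaled std grows like sqrt(d_k); the scaled version stays near 1
```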

Positional encoding. Since self-attention is permutation-equivariant, the model has no notion of token order without explicit position information. The paper used sinusoidal positional encodings:

$$PE_{(\text{pos}, 2i)} = \sin\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \quad PE_{(\text{pos}, 2i+1)} = \cos\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

Encoder-decoder architecture. The original transformer had an encoder (6 layers of self-attention + FFN) and a decoder (6 layers of masked self-attention + cross-attention + FFN). The encoder processes the input sequence; the decoder generates the output sequence autoregressively.
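
The "masked" in the decoder's masked self-attention refers to a causal mask that blocks attention to future positions, so position $i$ can only attend to positions $\le i$ during autoregressive generation. A minimal sketch of applying such a mask before the softmax (shapes and the helper name are illustrative assumptions):

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) = Q K^T / sqrt(d_k). Masks out positions j > i before the softmax."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)            # -inf becomes weight 0 after the softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(0).normal(size=(5, 5)))
print(np.triu(w, k=1).max())  # 0.0: no weight ever lands on future tokens
```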

Main Theorems

Proposition

Self-Attention Computational Properties

Statement

Self-attention computes $\text{softmax}(QK^T / \sqrt{d})\, V$ in $O(n^2 d)$ time and requires $O(n^2)$ memory for the attention matrix. Each output token is a weighted combination of all value vectors, where the weights depend on all pairwise query-key interactions.

Intuition

Every token attends to every other token. This gives the model a global receptive field in a single layer, unlike convolutions (local) or recurrence (sequential). The cost is quadratic in sequence length.

Proof Sketch

The matrix $QK^T$ is $n \times n$ and costs $O(n^2 d)$ to compute. The softmax is applied row-wise in $O(n^2)$. The final multiplication with $V$ (which is $n \times d$) costs $O(n^2 d)$. Total: $O(n^2 d)$. Memory for the attention matrix: $O(n^2)$.

Why It Matters

The $O(n^2)$ complexity is the central limitation of transformers. It is why context lengths were initially limited to 512 or 1024 tokens. Flash attention, sparse attention, and linear attention variants all target this bottleneck.

Failure Mode

For long sequences ($n > 10{,}000$), the quadratic cost becomes the training bottleneck. A naive implementation also suffers from memory bandwidth limits due to materializing the full $n \times n$ attention matrix. Flash attention (Dao et al., 2022) avoids materializing this matrix, achieving the same computation in less wall-clock time.
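
A sketch of the online-softmax idea that FlashAttention builds on: keys and values are processed in blocks while running statistics are maintained, so the full $n \times n$ score matrix is never stored. This is a plain NumPy illustration of the math, not the actual tiled, IO-aware GPU kernel; the block size and variable names are my own:

```python
import numpy as np

def streaming_attention(Q, K, V, block=64):
    """Exact attention without ever materializing the full (n, n) score matrix."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    m = np.full(Q.shape[0], -np.inf)            # running row-wise max of scores
    l = np.zeros(Q.shape[0])                    # running softmax denominator
    acc = np.zeros((Q.shape[0], V.shape[-1]))   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Q @ Kb.T) * scale                  # scores against this key block only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                # rescale previously accumulated statistics
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        acc = acc * corr[:, None] + p @ Vb
        m = m_new
    return acc / l[:, None]

# check against the dense computation
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)
dense = np.exp(scores - scores.max(axis=1, keepdims=True))
dense = (dense / dense.sum(axis=1, keepdims=True)) @ V
print(np.allclose(streaming_attention(Q, K, V), dense))  # True
```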

Proposition

Multi-Head Attention Capacity

Statement

Multi-head attention with $h$ heads of dimension $d_k = d_{\text{model}}/h$ has the same total parameter count as single-head attention with dimension $d_{\text{model}}$. However, multi-head attention can represent richer functions: each head can learn a different attention pattern (e.g., one head attends to syntactic relations, another to semantic similarity).

Intuition

Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head must average across all types of relationships.

Proof Sketch

Single-head: $3d_{\text{model}}^2$ parameters for $W^Q, W^K, W^V$ plus $d_{\text{model}}^2$ for $W^O$. Multi-head: $h \cdot 3 \cdot d_{\text{model}} \cdot d_k = 3d_{\text{model}}^2$ plus $d_{\text{model}}^2$ for $W^O$. Same total. But the rank-$d_k$ structure of each head restricts each individual attention pattern, while the concatenation allows $h$ independent patterns.
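
This parameter accounting can be checked in a few lines ($d_{\text{model}} = 512$, $h = 8$ as in the paper):

```python
d_model, h = 512, 8
d_k = d_model // h

single_head = 3 * d_model * d_model + d_model * d_model     # W^Q, W^K, W^V, plus W^O
multi_head = h * 3 * d_model * d_k + (h * d_k) * d_model    # per-head projections, plus W^O
print(single_head, multi_head, single_head == multi_head)   # 1048576 1048576 True
```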

Why It Matters

Multi-head attention is one of the few design choices from the original paper that survived unchanged. Empirically, different heads specialize: some attend to adjacent tokens, some attend to syntactically related tokens, some attend to the beginning of the sequence. This specialization emerges without supervision.

Failure Mode

Many heads become redundant during training. Pruning studies show that removing 20-40% of heads often has minimal effect on performance, suggesting the model is over-parameterized in the multi-head dimension. GQA (grouped query attention) exploits this by sharing key-value heads across query heads.
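
A sketch of the grouped-query idea: all query heads are kept, but several query heads share one key-value head, shrinking the KV cache. The group sizes, shapes, and function name below are illustrative assumptions, not the exact formulation in the GQA paper:

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_query_heads=8, n_kv_heads=2):
    """Q: (n_query_heads, n, d_k); K, V: (n_kv_heads, n, d_k).
    Each group of n_query_heads // n_kv_heads query heads reuses the same K, V."""
    group = n_query_heads // n_kv_heads
    outs = []
    for i in range(n_query_heads):
        Kh, Vh = K[i // group], V[i // group]               # shared KV head for this group
        s = Q[i] @ Kh.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        outs.append(w @ Vh)
    return np.stack(outs)                                   # (n_query_heads, n, d_k)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16, 64))
K, V = rng.normal(size=(2, 16, 64)), rng.normal(size=(2, 16, 64))
print(grouped_query_attention(Q, K, V).shape)  # (8, 16, 64), with a KV cache 4x smaller
```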

What Survived and What Changed

Survived (as of 2026):

  • Self-attention as the core mechanism
  • Multi-head attention
  • Residual connections
  • Layer normalization
  • The $\text{softmax}(QK^T/\sqrt{d_k})\, V$ formula

Changed:

  • Decoder-only replaced encoder-decoder. GPT showed that a decoder-only architecture suffices for language modeling, and it became the default for LLMs. Encoder-decoder survives in some applications (T5, translation).
  • Pre-norm replaced post-norm. The original paper applied layer norm after the residual connection. Modern transformers apply it before (pre-norm), which stabilizes training for deep models. See residual stream internals.
  • RoPE replaced sinusoidal positions. Rotary positional embeddings (Su et al., 2021) encode relative positions through rotation matrices, enabling better length generalization than absolute sinusoidal encodings. See positional encoding for a full comparison.
  • GQA/MQA replaced standard multi-head. Grouped query attention reduces the KV cache size for inference, trading a small quality decrease for major memory savings at serving time. See attention variants.
  • SwiGLU replaced ReLU in FFN. The original FFN used ReLU activation. Modern LLMs use SwiGLU or GeGLU, which empirically improve performance.
  • Flash attention changed the implementation. The algorithm is mathematically identical, but the IO-aware implementation avoids materializing the full $n \times n$ attention matrix, making long contexts practical.

What the Paper Got Right

The core computation. Scaled dot-product attention ($\text{softmax}(QK^T/\sqrt{d_k})\, V$) has not changed in nine years. Every LLM in production uses this exact formula. The scaling-factor derivation in the paper (dot products grow as $\sqrt{d_k}$, pushing the softmax into saturation) is correct and important.

Multi-head attention. The insight that multiple low-rank attention patterns are better than a single full-rank one has held up. Head specialization (syntactic heads, positional heads, rare-token heads) has been confirmed by mechanistic interpretability research.

Residual connections and layer normalization. The paper adopted these from prior work, and they remain in every modern transformer. The skip connection pattern is critical for training deep models.

Parallelism over recurrence. The central thesis of the paper, that attention-only models can replace sequential processing, was correct. This enabled the scaling that drives modern LLMs.

What Aged

The encoder-decoder framing. The paper presented the transformer as a sequence-to-sequence translation model. The field moved to decoder-only causal models for generation and encoder-only models (BERT) for understanding. The encoder-decoder split is no longer the default.

Sinusoidal positional encodings. Replaced by learned positions, ALiBi, and then RoPE, which handles length generalization better.

The training setup. The paper trained on WMT translation data for a few days on 8 GPUs. Modern LLMs train on trillions of tokens across thousands of GPUs. The scaling regime is completely different.

Label smoothing as the main regularization. The paper used label smoothing with $\epsilon = 0.1$. Modern LLMs rely on dropout (or no dropout at scale), weight decay, and data diversity as the primary regularizers.

Common Confusions

Watch Out

The transformer is not just attention

The transformer block is attention + feedforward network + residual connections + layer normalization. The FFN contains roughly 2/3 of the parameters in a standard transformer block. Attention routes information; the FFN processes it. Both are necessary.
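
A minimal sketch of one full block in the original post-norm ordering; the layer ordering, ReLU activation, and dimensions ($d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$) follow Vaswani et al., while the single-head attention, initialization, and toy input are simplifications of mine:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)       # learned gain/bias omitted for brevity

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, p):
    # 1. self-attention sublayer: residual connection, then layer norm (post-norm, as in the paper)
    a = attention(X, p["W_q"], p["W_k"], p["W_v"]) @ p["W_o"]
    X = layer_norm(X + a)
    # 2. position-wise FFN sublayer: two linear maps with ReLU, again residual + layer norm
    h = np.maximum(0, X @ p["W_1"])            # the original paper used ReLU here
    return layer_norm(X + h @ p["W_2"])

d_model, d_ff, n = 512, 2048, 4
rng = np.random.default_rng(0)
params = {k: rng.normal(size=s) * 0.02 for k, s in {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
    "W_o": (d_model, d_model), "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}.items()}
print(transformer_block(rng.normal(size=(n, d_model)), params).shape)  # (4, 512)
```

With $d_{\text{ff}} = 2048$, the FFN sublayer here has $2 \cdot 512 \cdot 2048 \approx 2.1$M parameters versus about $1.0$M in the attention sublayer, consistent with the roughly-2/3 figure above.
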
Watch Out

The original paper was about translation, not language modeling

Vaswani et al. (2017) demonstrated the transformer on machine translation (WMT 2014 English-German and English-French). The application to autoregressive language modeling came later with GPT (Radford et al., 2018). The decoder-only architecture for causal language modeling was not in the original paper.

Exercises

ExerciseCore

Problem

In a transformer with $d_{\text{model}} = 512$ and $h = 8$ heads, what is the dimension $d_k$ of each head? How many parameters are in one multi-head attention sublayer (including $W^Q, W^K, W^V, W^O$, excluding biases)?

ExerciseAdvanced

Problem

Explain why the scaling factor $\sqrt{d_k}$ is necessary. What goes wrong if you remove it? Derive the expected magnitude of $q \cdot k$ when $q$ and $k$ are random vectors with independent entries of mean 0 and variance 1.

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
  • Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). The attention mechanism that the transformer generalized.

Successors and context:

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018). GPT-1: first decoder-only transformer for language modeling.
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (NAACL 2019). Encoder-only variant.

Current evolution:

  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (NeurIPS 2022)
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023). The grouped query attention used in modern LLMs.
  • Phuong & Hutter, "Formal Algorithms for Transformers" (2022). A mathematical reference for the transformer formalism.

Last reviewed: April 2026
