
LLM Construction

Attention Is All You Need (Paper)

The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs.

Advanced · Tier 1 · Stable · ~45 min

Why This Matters

Figure: scaled dot-product attention. The input tokens ("The", "cat", "sat") are projected into queries ($W_Q X$: "What am I looking for?"), keys ($W_K X$: "What do I contain?"), and values ($W_V X$: "What do I contribute?"). The scores $QK^T/\sqrt{d}$ are passed through a softmax to produce weights $\alpha$, which are applied to $V$ to form the output. Example: "sat" attends to "cat" with high weight (0.7) and to "The" with low weight (0.1). $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})\, V$

Vaswani et al. (2017) proposed replacing recurrence entirely with self-attention for sequence transduction. Before this paper, the dominant architectures for sequence tasks were LSTMs and GRUs with attention. The transformer removed the sequential bottleneck, enabling parallel computation across all positions. Every modern large language model (GPT series, Claude, Gemini, Llama) descends from the architecture described in this paper.

Reading the original paper in 2026 is still valuable, not because every detail survived, but because understanding what changed and why reveals how the field evolved.

Formal Definitions

Definition

Self-Attention

Given an input sequence of $n$ vectors packed into matrices $Q, K, V \in \mathbb{R}^{n \times d_k}$, self-attention computes a weighted combination of value vectors where the weights are determined by pairwise similarity between queries and keys:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The queries, keys, and values are all linear projections of the same input sequence, hence "self." The softmax operates row-wise, producing a stochastic matrix of attention weights. Each output position is a convex combination of all value vectors.
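
A minimal NumPy sketch of this computation (illustrative only; the toy dimensions and random projection matrices below are assumptions, not values from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model). Returns (n, d_v): each row is a convex combination of value rows."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values are projections of the same X
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) pairwise query-key similarities
    weights = softmax(scores, axis=-1)         # row-wise stochastic matrix of attention weights
    return weights @ V

# toy example: n = 4 tokens, d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```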

Definition

Multi-Head Attention

Multi-head attention runs $h$ independent attention functions in parallel, each on a lower-dimensional projection of the input:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ with learned projections $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. The original paper uses $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.
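
A sketch of the multi-head wrapper, with $h = 8$ and $d_{\text{model}} = 512$ as in the paper (the per-head loop and the small initialization scale are implementation choices of mine, not prescriptions from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (n, d_model). W_q/W_k/W_v: (h, d_model, d_k). W_o: (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]           # per-head low-dimensional projections
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (n, n) attention pattern for head i
        heads.append(A @ V)                                     # (n, d_v)
    return np.concatenate(heads, axis=-1) @ W_o                 # concatenate heads, project back to d_model

# d_model = 512, h = 8, d_k = d_v = 64, as in the original paper
rng = np.random.default_rng(0)
n, d_model, h, d_k = 4, 512, 8, 64
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.05 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model)) * 0.05
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (4, 512)
```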

Definition

Positional Encoding

Since self-attention is permutation-equivariant (it treats the input as a set, not a sequence), explicit position information must be injected. The paper uses fixed sinusoidal positional encodings:

$$PE_{(\text{pos}, 2i)} = \sin\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \quad PE_{(\text{pos}, 2i+1)} = \cos\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

These are added (not concatenated) to the input embeddings. The sinusoidal form was chosen because $PE_{\text{pos}+k}$ can be expressed as a linear function of $PE_{\text{pos}}$, which the authors hypothesized would help the model learn relative positions.
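
A sketch of the sinusoidal encoding as defined above (the vectorized indexing and the function name are implementation choices, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Returns (n_positions, d_model): even columns get sin, odd columns get cos."""
    positions = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                       # (1, d_model/2)
    angles = positions / np.power(10000, 2 * i / d_model)      # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(1024, 512)
# The encoding is added to, not concatenated with, the token embeddings:
# X = token_embeddings + pe[:seq_len]
```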

Key Contributions

Self-attention as the sole mechanism. Prior work used attention as an add-on to recurrent neural networks (Bahdanau et al., 2014). Vaswani et al. showed that attention alone, without any recurrence or convolution, could match or beat recurrent models on translation benchmarks.

Multi-head attention. Instead of a single attention function, the paper splits queries, keys, and values into $h$ heads, each operating on a $d_k = d_{\text{model}}/h$-dimensional subspace:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

Scaled dot-product attention. The attention function is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling prevents dot products from growing large in magnitude, which would push the softmax into saturated regions with vanishing gradients.
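
A quick numerical check of this claim: for query and key entries with mean 0 and variance 1, the dot product has standard deviation roughly $\sqrt{d_k}$, so without the scaling the softmax inputs grow with dimension (the sample sizes and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
    # unscaled std grows like sqrt(d_k); the scaled version stays near 1
```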

Positional encoding. Since self-attention is permutation-equivariant, the model has no notion of token order without explicit position information. The paper used sinusoidal positional encodings:

$$PE_{(\text{pos}, 2i)} = \sin\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \quad PE_{(\text{pos}, 2i+1)} = \cos\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

Encoder-decoder architecture. The original transformer had an encoder (6 layers of self-attention + FFN) and a decoder (6 layers of masked self-attention + cross-attention + FFN). The encoder processes the input sequence; the decoder generates the output sequence autoregressively.
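
The "masked" in the decoder's masked self-attention refers to a causal mask that blocks attention to future positions, so position $i$ can only attend to positions $\le i$ during autoregressive generation. A minimal sketch of applying such a mask before the softmax (shapes and the helper name are illustrative assumptions):

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) = Q K^T / sqrt(d_k). Masks out positions j > i before the softmax."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)            # -inf becomes weight 0 after the softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(0).normal(size=(5, 5)))
print(np.triu(w, k=1).max())  # 0.0: no weight ever lands on future tokens
```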

Main Theorems

Proposition

Self-Attention Computational Properties

Statement

Self-attention computes $\text{softmax}(QK^T / \sqrt{d})\, V$ in $O(n^2 d)$ time and requires $O(n^2)$ memory for the attention matrix. Each output token is a weighted combination of all value vectors, where the weights depend on all pairwise query-key interactions.

Intuition

Every token attends to every other token. This gives the model a global receptive field in a single layer, unlike convolutions (local) or recurrence (sequential). The cost is quadratic in sequence length.

Proof Sketch

The matrix $QK^T$ is $n \times n$ and costs $O(n^2 d)$ to compute. The softmax is applied row-wise in $O(n^2)$. The final multiplication with $V$ (which is $n \times d$) costs $O(n^2 d)$. Total: $O(n^2 d)$. Memory for the attention matrix: $O(n^2)$.

Why It Matters

The $O(n^2)$ complexity is the central limitation of transformers. It is why context lengths were initially limited to 512 or 1024 tokens. Flash attention, sparse attention, and linear attention variants all target this bottleneck.

Failure Mode

For long sequences ($n > 10{,}000$), the quadratic cost becomes the training bottleneck. A naive implementation also suffers from memory bandwidth limits due to materializing the full $n \times n$ attention matrix. Flash attention (Dao et al., 2022) avoids materializing this matrix, achieving the same computation in less wall-clock time.
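
A sketch of the online-softmax idea that FlashAttention builds on: keys and values are processed in blocks while running statistics are maintained, so the full $n \times n$ score matrix is never stored. This is a plain NumPy illustration of the math, not the actual tiled, IO-aware GPU kernel; the block size and variable names are my own:

```python
import numpy as np

def streaming_attention(Q, K, V, block=64):
    """Exact attention without ever materializing the full (n, n) score matrix."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    m = np.full(Q.shape[0], -np.inf)            # running row-wise max of scores
    l = np.zeros(Q.shape[0])                    # running softmax denominator
    acc = np.zeros((Q.shape[0], V.shape[-1]))   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Q @ Kb.T) * scale                  # scores against this key block only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                # rescale previously accumulated statistics
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        acc = acc * corr[:, None] + p @ Vb
        m = m_new
    return acc / l[:, None]

# check against the dense computation
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)
dense = np.exp(scores - scores.max(axis=1, keepdims=True))
dense = (dense / dense.sum(axis=1, keepdims=True)) @ V
print(np.allclose(streaming_attention(Q, K, V), dense))  # True
```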

Proposition

Multi-Head Attention Capacity

Statement

Multi-head attention with $h$ heads of dimension $d_k = d_{\text{model}}/h$ has the same total parameter count as single-head attention with dimension $d_{\text{model}}$. However, multi-head attention can represent richer functions: each head can learn a different attention pattern (e.g., one head attends to syntactic relations, another to semantic similarity).

Intuition

Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head must average across all types of relationships.

Proof Sketch

Single-head: $3d_{\text{model}}^2$ parameters for $W^Q, W^K, W^V$ plus $d_{\text{model}}^2$ for $W^O$. Multi-head: $h \cdot 3 \cdot d_{\text{model}} \cdot d_k = 3d_{\text{model}}^2$ plus $d_{\text{model}}^2$ for $W^O$. Same total. But the rank-$d_k$ structure of each head restricts each individual attention pattern, while the concatenation allows $h$ independent patterns.
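
This parameter accounting can be checked in a few lines ($d_{\text{model}} = 512$, $h = 8$ as in the paper):

```python
d_model, h = 512, 8
d_k = d_model // h

single_head = 3 * d_model * d_model + d_model * d_model     # W^Q, W^K, W^V, plus W^O
multi_head = h * 3 * d_model * d_k + (h * d_k) * d_model    # per-head projections, plus W^O
print(single_head, multi_head, single_head == multi_head)   # 1048576 1048576 True
```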

Why It Matters

Multi-head attention is one of the few design choices from the original paper that survived unchanged. Empirically, different heads specialize: some attend to adjacent tokens, some attend to syntactically related tokens, some attend to the beginning of the sequence. This specialization emerges without supervision.

Failure Mode

Many heads become redundant during training. Pruning studies show that removing 20-40% of heads often has minimal effect on performance, suggesting the model is over-parameterized in the multi-head dimension. GQA (grouped query attention) exploits this by sharing key-value heads across query heads.
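
A sketch of the grouped-query idea: all query heads are kept, but several query heads share one key-value head, shrinking the KV cache. The group sizes, shapes, and function name below are illustrative assumptions, not the exact formulation in the GQA paper:

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_query_heads=8, n_kv_heads=2):
    """Q: (n_query_heads, n, d_k); K, V: (n_kv_heads, n, d_k).
    Each group of n_query_heads // n_kv_heads query heads reuses the same K, V."""
    group = n_query_heads // n_kv_heads
    outs = []
    for i in range(n_query_heads):
        Kh, Vh = K[i // group], V[i // group]               # shared KV head for this group
        s = Q[i] @ Kh.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        outs.append(w @ Vh)
    return np.stack(outs)                                   # (n_query_heads, n, d_k)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16, 64))
K, V = rng.normal(size=(2, 16, 64)), rng.normal(size=(2, 16, 64))
print(grouped_query_attention(Q, K, V).shape)  # (8, 16, 64), with a KV cache 4x smaller
```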

What Survived and What Changed

Survived (as of 2026):

  • Self-attention as the core mechanism
  • Multi-head attention
  • Residual connections
  • Layer normalization
  • The $\text{softmax}(QK^T/\sqrt{d_k})\, V$ formula

Changed:

  • Decoder-only replaced encoder-decoder. GPT showed that a decoder-only architecture suffices for language modeling, and it became the default for LLMs. Encoder-decoder survives in some applications (T5, translation).
  • Pre-norm replaced post-norm. The original paper applied layer norm after the residual connection. Modern transformers apply it before (pre-norm), which stabilizes training for deep models. See residual stream internals.
  • RoPE replaced sinusoidal positions. Rotary positional embeddings (Su et al., 2021) encode relative positions through rotation matrices, enabling better length generalization than absolute sinusoidal encodings. See positional encoding for a full comparison.
  • GQA/MQA replaced standard multi-head. Grouped query attention reduces the KV cache size for inference, trading a small quality decrease for major memory savings at serving time. See attention variants.
  • SwiGLU replaced ReLU in FFN. The original FFN used ReLU activation. Modern LLMs use SwiGLU or GeGLU, which empirically improve performance.
  • Flash attention changed the implementation. The algorithm is mathematically identical, but the IO-aware implementation avoids materializing the full $n \times n$ attention matrix, making long contexts practical.

What the Paper Got Right

The core computation. Scaled dot-product attention ($\text{softmax}(QK^T/\sqrt{d_k})\, V$) has not changed in nine years. Every LLM in production uses this exact formula. The scaling-factor derivation in the paper (dot products grow as $\sqrt{d_k}$, pushing the softmax into saturation) is correct and important.

Multi-head attention. The insight that multiple low-rank attention patterns are better than a single full-rank one has held up. Head specialization (syntactic heads, positional heads, rare-token heads) has been confirmed by mechanistic interpretability research.

Residual connections and layer normalization. The paper adopted these from prior work, and they remain in every modern transformer. The skip connection pattern is critical for training deep models.

Parallelism over recurrence. The central thesis of the paper, that attention-only models can replace sequential processing, was correct. This enabled the scaling that drives modern LLMs.

What Aged

The encoder-decoder framing. The paper presented the transformer as a sequence-to-sequence translation model. The field moved to decoder-only causal models for generation and encoder-only models (BERT) for understanding. The encoder-decoder split is no longer the default.

Sinusoidal positional encodings. Replaced by learned positions, ALiBi, and then RoPE, which handles length generalization better.

The training setup. The paper trained on WMT translation data for a few days on 8 GPUs. Modern LLMs train on trillions of tokens across thousands of GPUs. The scaling regime is completely different.

Label smoothing as the main regularization. The paper used label smoothing with $\epsilon = 0.1$. Modern LLMs rely on dropout (or no dropout at scale), weight decay, and data diversity as the primary regularizers.

Common Confusions

Watch Out

The transformer is not just attention

The transformer block is attention + feedforward network + residual connections + layer normalization. The FFN contains roughly 2/3 of the parameters in a standard transformer block. Attention routes information; the FFN processes it. Both are necessary.
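
A minimal sketch of one full block in the original post-norm ordering; the layer ordering, ReLU activation, and dimensions ($d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$) follow Vaswani et al., while the single-head attention, initialization, and toy input are simplifications of mine:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)       # learned gain/bias omitted for brevity

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, p):
    # 1. self-attention sublayer: residual connection, then layer norm (post-norm, as in the paper)
    a = attention(X, p["W_q"], p["W_k"], p["W_v"]) @ p["W_o"]
    X = layer_norm(X + a)
    # 2. position-wise FFN sublayer: two linear maps with ReLU, again residual + layer norm
    h = np.maximum(0, X @ p["W_1"])            # the original paper used ReLU here
    return layer_norm(X + h @ p["W_2"])

d_model, d_ff, n = 512, 2048, 4
rng = np.random.default_rng(0)
params = {k: rng.normal(size=s) * 0.02 for k, s in {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
    "W_o": (d_model, d_model), "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}.items()}
print(transformer_block(rng.normal(size=(n, d_model)), params).shape)  # (4, 512)
```

With $d_{\text{ff}} = 2048$, the FFN sublayer here has $2 \cdot 512 \cdot 2048 \approx 2.1$M parameters versus about $1.0$M in the attention sublayer, consistent with the roughly-2/3 figure above.
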
Watch Out

The original paper was about translation, not language modeling

Vaswani et al. (2017) demonstrated the transformer on machine translation (WMT 2014 English-German and English-French). The application to autoregressive language modeling came later with GPT (Radford et al., 2018). The decoder-only architecture for causal language modeling was not in the original paper.

Exercises

ExerciseCore

Problem

In a transformer with $d_{\text{model}} = 512$ and $h = 8$ heads, what is the dimension $d_k$ of each head? How many parameters are in one multi-head attention sublayer (including $W^Q, W^K, W^V, W^O$, excluding biases)?

ExerciseAdvanced

Problem

Explain why the scaling factor $\sqrt{d_k}$ is necessary. What goes wrong if you remove it? Derive the expected magnitude of $q \cdot k$ when $q$ and $k$ are random vectors with independent entries of mean 0 and variance 1.

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
  • Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). The attention mechanism that the transformer generalized.

Successors and context:

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018). GPT-1: first decoder-only transformer for language modeling.
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (NAACL 2019). Encoder-only variant.

Current evolution:

  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (NeurIPS 2022)
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023). The grouped query attention used in modern LLMs.
  • Phuong & Hutter, "Formal Algorithms for Transformers" (2022). A mathematical reference for the transformer formalism.

Last reviewed: April 2026
