
Paper breakdown

Attention Is All You Need

Ashish Vaswani et al. · 2017 · NeurIPS 2017

Replaces recurrence and convolution in sequence transduction with stacked self-attention. Establishes the transformer block — multi-head scaled dot-product attention plus position-wise feed-forward layers — that every modern large language model still uses.

Overview

Vaswani et al. (2017) replaced the recurrent and convolutional backbone of sequence-to-sequence models with stacked self-attention and position-wise feed-forward layers. The paper reports state-of-the-art BLEU on WMT 2014 English-to-German (28.4) and English-to-French (41.8) at a fraction of the training cost of the strongest prior recurrent and convolutional baselines.

The motivation was concrete. Recurrent networks compute hidden states sequentially: $h_t = f(h_{t-1}, x_t)$. That dependence chains training across the time axis and prevents the kind of full-batch parallelism that GPUs are good at. The transformer keeps a residual connection and a layer norm, but routes all token-to-token interaction through attention, which factorises into matrix multiplications over the entire sequence at once.
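
To make the contrast concrete, here is a minimal NumPy sketch (illustrative shapes and random weights, not the paper's model): the recurrent update must walk the sequence one position at a time, while the pairwise attention scores come out of a single matrix product.

```python
import numpy as np

n, d = 6, 8                                # sequence length, model width (illustrative)
x = np.random.randn(n, d)                  # token representations, one row per position

# Recurrent update: h_t depends on h_{t-1}, so the time loop cannot be parallelised.
W_h = np.random.randn(d, d)
W_x = np.random.randn(d, d)
h = np.zeros(d)
for t in range(n):
    h = np.tanh(h @ W_h + x[t] @ W_x)      # one step per position, strictly sequential

# Self-attention scores: every pairwise interaction in a single matrix product.
scores = x @ x.T / np.sqrt(d)              # shape (n, n), no sequential dependence
```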

What the paper shipped was not just a benchmark result but a reusable block. The same transformer block, dropped into BERT, GPT, T5, and every modern open-weight model, has not changed in structure since 2017. What changed around it: scale, data, training objective, and positional encoding.

Mathematical Contributions

Scaled dot-product attention

Given query, key, and value matrices $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, the paper defines:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling is not cosmetic. With $Q$ and $K$ entries drawn iid with mean 0 and variance 1, each entry of $QK^\top$ has variance $d_k$. Without the scale, large $d_k$ pushes the softmax into saturated regions where its Jacobian collapses, the gradient vanishes, and training stalls. Dividing by $\sqrt{d_k}$ keeps the pre-softmax logits at unit variance.
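
A minimal NumPy sketch of the formula, assuming single-example (unbatched) inputs and omitting the dropout and masking used in the full model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                    # (n, m) scaled scores
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_v)

# Illustrative shapes: with unit-variance Q and K, the scaled logits stay near unit variance.
n, m, d_k, d_v = 4, 5, 64, 64
Q, K, V = (np.random.randn(*s) for s in [(n, d_k), (m, d_k), (m, d_v)])
out = scaled_dot_product_attention(Q, K, V)            # shape (n, d_v)
```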

Multi-head attention

Rather than a single attention map of dimension $d_{\text{model}}$, the paper splits the projection into $h$ independent heads:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

with $\text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$ and per-head dimension $d_k = d_v = d_{\text{model}} / h$. The original paper uses $d_{\text{model}} = 512$ and $h = 8$. The total parameter count is held fixed; what changes is that different heads can specialise to different relations (syntactic, positional, lexical) without competing for the same projection.
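
A sketch of the head-splitting, using the paper's $d_{\text{model}} = 512$ and $h = 8$ (so 64 dimensions per head) and reusing the `scaled_dot_product_attention` sketch above; the projection matrices here are random placeholders rather than trained weights.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                                    # 64 dimensions per head
n = 10
X = np.random.randn(n, d_model)                       # token representations

# Per-head projections W_i^Q, W_i^K, W_i^V and the shared output projection W^O.
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model)

heads = [scaled_dot_product_attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
         for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_o            # (n, d_model)
# Same total projection size as one full-width head: 3*h*d_model*d_k + d_model^2 = 4*d_model^2.
```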

Sinusoidal positional encoding

Self-attention is permutation-equivariant: $\text{Attention}(\Pi Q, \Pi K, \Pi V) = \Pi\,\text{Attention}(Q, K, V)$ for any permutation matrix $\Pi$. Positions must be injected explicitly. The paper adds fixed sinusoidal encodings:

$$PE_{(\text{pos}, 2i)} = \sin\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(\text{pos}, 2i+1)} = \cos\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

The geometric frequency schedule means that $PE_{\text{pos}+k}$ is a linear function of $PE_{\text{pos}}$, so the model can in principle learn relative-position attention via dot products. See positional encoding for the breakdown.
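
A sketch of the encoding table in NumPy (`max_len` is an illustrative cap on sequence length, not something the paper fixes):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same argument)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # geometric frequency schedule
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The table is added to the token embeddings before the first block; nothing here is learned.
```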

Encoder and decoder blocks

Each encoder block computes $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}$ is either multi-head self-attention or a position-wise two-layer MLP with ReLU. The decoder adds a third sublayer, cross-attention over the encoder output, and masks its own self-attention so that a position cannot attend to future positions during training. The mask is applied by setting future-position logits to $-\infty$ before the softmax, which makes the post-softmax weight exactly zero.
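
A sketch of the decoder-side mask, assuming the same unbatched shapes as the attention sketch above: future-position logits are set to $-\infty$, so their post-softmax weight is exactly zero.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may attend to positions j <= i only."""
    upper = np.triu(np.ones((n, n)), k=1)           # 1 strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)       # -inf logit -> zero attention weight

def masked_self_attention(Q, K, V):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)                        # exp(-inf) == 0 for masked positions
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```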

Computational cost

The paper analyses sequence-transduction layers along three axes: total complexity per layer, sequential operations, and maximum path length between any two positions. Self-attention is $O(n^2 \cdot d)$ per layer with $O(1)$ sequential operations and path length 1; recurrence is $O(n \cdot d^2)$ with $O(n)$ sequential operations and path length $n$. For typical $n < d$, self-attention is both cheaper and shorter-path; that is the architectural argument.
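
A back-of-the-envelope check of the per-layer costs with the paper's base width $d = 512$ (sequence lengths are illustrative):

```python
d = 512
for n in (64, 512, 4096):
    attention = n * n * d        # self-attention: O(n^2 * d) per layer
    recurrent = n * d * d        # recurrence:     O(n * d^2) per layer
    print(f"n={n:5d}  attention/recurrent cost ratio = {attention / recurrent:.2f}")
# The ratio is n/d, so attention is cheaper whenever n < d; the crossover sits at n = d.
```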

Why It Matters Now

Every production large language model in 2026 (GPT-4 family, Claude family, Gemini, Llama, DeepSeek, Mistral) uses the same encoder/decoder block defined in this paper. What changed: the position encoding (RoPE, ALiBi), the attention sparsity pattern (grouped-query, sliding-window, ring), the activation (SwiGLU instead of ReLU), and the normalisation placement (pre-norm instead of post-norm). The core $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ multiplication and the head-splitting structure are unchanged.

The paper also marks a clean methodological shift: the proposed model is faster to train than the recurrent baseline it beats. Architectural progress in deep learning had often traded compute for accuracy. The transformer traded recurrence for parallel hardware utilisation and got both.

References

Canonical:

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762.

Direct precursors:

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR. arXiv:1409.0473.
  • Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP. arXiv:1508.04025.

Follow-on work the paper enabled:

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. arXiv:1810.04805.
  • Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS. arXiv:2005.14165.

Critique and refinement:

  • Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR. arXiv:2108.12409.
  • Su, J. et al. (2024). "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing. arXiv:2104.09864.

Last reviewed: May 5, 2026