
Paper breakdown

Attention Is All You Need

Ashish Vaswani et al. · 2017 · NeurIPS 2017

Replaces recurrence and convolution in sequence transduction with stacked self-attention. Establishes the transformer block — multi-head scaled dot-product attention plus position-wise feed-forward layers — that every modern large language model still uses.

Overview

Vaswani et al. (2017) replaced the recurrent and convolutional backbone of sequence-to-sequence models with stacked self-attention and position-wise feed-forward layers. The paper reports state-of-the-art BLEU on WMT 2014 English-to-German (28.4) and English-to-French (41.8) at a fraction of the training cost of the strongest prior recurrent and convolutional baselines.

The motivation was concrete. Recurrent networks compute hidden states sequentially: $h_t = f(h_{t-1}, x_t)$. That dependence chains training across the time axis and prevents the kind of full-batch parallelism that GPUs are good at. The transformer keeps a residual connection and a layer norm, but routes all token-to-token interaction through attention, which factorises into matrix multiplications over the entire sequence at once.
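
To make the contrast concrete, here is a minimal NumPy sketch (illustrative shapes and random weights, not the paper's model): the recurrent update must walk the sequence one position at a time, while the pairwise attention scores come out of a single matrix product.

```python
import numpy as np

n, d = 6, 8                                # sequence length, model width (illustrative)
x = np.random.randn(n, d)                  # token representations, one row per position

# Recurrent update: h_t depends on h_{t-1}, so the time loop cannot be parallelised.
W_h = np.random.randn(d, d)
W_x = np.random.randn(d, d)
h = np.zeros(d)
for t in range(n):
    h = np.tanh(h @ W_h + x[t] @ W_x)      # one step per position, strictly sequential

# Self-attention scores: every pairwise interaction in a single matrix product.
scores = x @ x.T / np.sqrt(d)              # shape (n, n), no sequential dependence
```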

What the paper shipped was not just a benchmark result but a reusable block. The same transformer block, dropped into BERT, GPT, T5, and every modern open-weight model, has not changed in structure since 2017. What changed around it: scale, data, training objective, and positional encoding.

Mathematical Contributions

Scaled dot-product attention

Given query, key, and value matrices $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, the paper defines:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling is not cosmetic. With $Q$ and $K$ entries drawn iid with mean 0 and variance 1, each entry of $QK^\top$ has variance $d_k$. Without the scale, large $d_k$ pushes the softmax into saturated regions where its Jacobian collapses, the gradient vanishes, and training stalls. Dividing by $\sqrt{d_k}$ keeps the pre-softmax logits at unit variance.
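
A minimal NumPy sketch of the formula, assuming single-example (unbatched) inputs and omitting the dropout and masking used in the full model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                    # (n, m) scaled scores
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_v)

# Illustrative shapes: with unit-variance Q and K, the scaled logits stay near unit variance.
n, m, d_k, d_v = 4, 5, 64, 64
Q, K, V = (np.random.randn(*s) for s in [(n, d_k), (m, d_k), (m, d_v)])
out = scaled_dot_product_attention(Q, K, V)            # shape (n, d_v)
```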

Multi-head attention

Rather than a single attention map of dimension $d_{\text{model}}$, the paper splits the projection into $h$ independent heads:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

with $\text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$ and per-head dimension $d_k = d_v = d_{\text{model}} / h$. The original paper uses $d_{\text{model}} = 512$ and $h = 8$. The total parameter count is held fixed; what changes is that different heads can specialise to different relations (syntactic, positional, lexical) without competing for the same projection.
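
A sketch of the head-splitting, using the paper's $d_{\text{model}} = 512$ and $h = 8$ (so 64 dimensions per head) and reusing the `scaled_dot_product_attention` sketch above; the projection matrices here are random placeholders rather than trained weights.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                                    # 64 dimensions per head
n = 10
X = np.random.randn(n, d_model)                       # token representations

# Per-head projections W_i^Q, W_i^K, W_i^V and the shared output projection W^O.
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model)

heads = [scaled_dot_product_attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
         for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_o            # (n, d_model)
# Same total projection size as one full-width head: 3*h*d_model*d_k + d_model^2 = 4*d_model^2.
```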

Sinusoidal positional encoding

Self-attention is permutation-equivariant: $\text{Attention}(\Pi Q, \Pi K, \Pi V) = \Pi\,\text{Attention}(Q, K, V)$ for any permutation matrix $\Pi$. Positions must be injected explicitly. The paper adds fixed sinusoidal encodings:

$$PE_{(\text{pos}, 2i)} = \sin\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(\text{pos}, 2i+1)} = \cos\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right)$$

The geometric frequency schedule means that $PE_{\text{pos}+k}$ is a linear function of $PE_{\text{pos}}$, so the model can in principle learn relative-position attention via dot products. See positional encoding for the breakdown.
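
A sketch of the encoding table in NumPy (`max_len` is an illustrative cap on sequence length, not something the paper fixes):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same argument)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # geometric frequency schedule
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The table is added to the token embeddings before the first block; nothing here is learned.
```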

Encoder and decoder blocks

Each encoder block computes $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}$ is either multi-head self-attention or a position-wise two-layer MLP with ReLU. The decoder adds a third sublayer, cross-attention over the encoder output, and masks its own self-attention so that a position cannot attend to future positions during training. The mask is applied by setting future-position logits to $-\infty$ before the softmax, which makes the post-softmax weight exactly zero.
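
A sketch of the decoder-side mask, assuming the same unbatched shapes as the attention sketch above: future-position logits are set to $-\infty$, so their post-softmax weight is exactly zero.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may attend to positions j <= i only."""
    upper = np.triu(np.ones((n, n)), k=1)           # 1 strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)       # -inf logit -> zero attention weight

def masked_self_attention(Q, K, V):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)                        # exp(-inf) == 0 for masked positions
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```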

Computational cost

The paper analyses sequence-transduction layers along three axes: total complexity per layer, sequential operations, and maximum path length between any two positions. Self-attention is $O(n^2 \cdot d)$ per layer with $O(1)$ sequential operations and path length 1; recurrence is $O(n \cdot d^2)$ with $O(n)$ sequential operations and path length $n$. For typical $n < d$, self-attention is both cheaper and shorter-path; that is the architectural argument.
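
A back-of-the-envelope check of the per-layer costs with the paper's base width $d = 512$ (sequence lengths are illustrative):

```python
d = 512
for n in (64, 512, 4096):
    attention = n * n * d        # self-attention: O(n^2 * d) per layer
    recurrent = n * d * d        # recurrence:     O(n * d^2) per layer
    print(f"n={n:5d}  attention/recurrent cost ratio = {attention / recurrent:.2f}")
# The ratio is n/d, so attention is cheaper whenever n < d; the crossover sits at n = d.
```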

Why It Matters Now

Every production large language model in 2026 (GPT-4 family, Claude family, Gemini, Llama, DeepSeek, Mistral) uses the same encoder/decoder block defined in this paper. What changed: the position encoding (RoPE, ALiBi), the attention sparsity pattern (grouped-query, sliding-window, ring), the activation (SwiGLU instead of ReLU), and the normalisation placement (pre-norm instead of post-norm). The core $\text{softmax}(QK^\top/\sqrt{d_k})\,V$ multiplication and the head-splitting structure are unchanged.

The paper also marks a clean methodological shift: the proposed model is faster to train than the recurrent baseline it beats. Architectural progress in deep learning had often traded compute for accuracy. The transformer traded recurrence for parallel hardware utilisation and got both.

References

Canonical:

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762.

Direct precursors:

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR. arXiv:1409.0473.
  • Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP. arXiv:1508.04025.

Follow-on work the paper enabled:

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. arXiv:1810.04805.
  • Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS. arXiv:2005.14165.

Critique and refinement:

  • Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR. arXiv:2108.12409.
  • Su, J. et al. (2024). "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing. arXiv:2104.09864.

Last reviewed: May 5, 2026