LLM Construction
Attention Is All You Need (Paper)
The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs.
Why This Matters
Vaswani et al. (2017) proposed replacing recurrence entirely with self-attention for sequence transduction. Before this paper, the dominant architectures for sequence tasks were LSTMs and GRUs with attention. The transformer removed the sequential bottleneck, enabling parallel computation across all positions. Every modern large language model (GPT series, Claude, Gemini, Llama) descends from the architecture described in this paper.
Reading the original paper in 2026 is still valuable, not because every detail survived, but because understanding what changed and why reveals how the field evolved.
Formal Definitions
Self-Attention
Given an input sequence of $n$ vectors packed into matrices $Q$, $K$, $V$, self-attention computes a weighted combination of value vectors where the weights are determined by pairwise similarity between queries and keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The queries, keys, and values are all linear projections of the same input sequence, hence "self." The softmax operates row-wise, producing a stochastic matrix of attention weights. Each output position is a convex combination of all value vectors.
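The definition above maps directly to a few lines of NumPy. This is a minimal sketch (single sequence, no batching, no masking); the toy dimensions are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-stochastic attention matrix
    return weights @ V, weights                     # each output row is a convex
                                                    # combination of the rows of V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))        # n = 5 tokens, d_k = 16
out, weights = attention(X, X, X)   # "self": Q, K, V all come from the same X
```

Each row of `weights` sums to 1, which is exactly the "convex combination" property stated above.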
Multi-Head Attention
Multi-head attention runs $h$ independent attention functions in parallel, each on a lower-dimensional projection of the input:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ with learned projections $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$. The original paper uses $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.
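A minimal NumPy sketch of the multi-head formula, with shapes following the definitions above (the tiny dimensions are illustrative, not the paper's):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """Concat(head_1, ..., head_h) @ W_O, each head in its own subspace."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # (n, d_k) projections of X
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))         # this head's attention pattern
        outputs.append(A @ V)
    return np.concatenate(outputs, axis=-1) @ W_O   # mix heads back to d_model

rng = np.random.default_rng(0)
n, d_model, h = 6, 64, 8
d_k = d_model // h
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(n, d_model)), heads, W_O)
```

Each head computes its own independent attention pattern `A`; only the final `W_O` mixes information across heads.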
Positional Encoding
Since self-attention is permutation-equivariant (it treats the input as a set, not a sequence), explicit position information must be injected. The paper uses fixed sinusoidal positional encodings:

$$PE_{(pos,\, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

These are added (not concatenated) to the input embeddings. The sinusoidal form was chosen because $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which the authors hypothesized would help the model learn relative positions.
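The sinusoidal table can be generated in a few lines. This sketch fills even columns with sines and odd columns with cosines, matching the formulas above:

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd columns use cos."""
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))     # one column per frequency
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)   # one 64-dim encoding per position, added to embeddings
```

At position 0 every sine column is 0 and every cosine column is 1; higher dimensions oscillate at geometrically decreasing frequencies.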
Key Contributions
Self-attention as the sole mechanism. Prior work used attention as an add-on to recurrent neural networks (Bahdanau et al., 2014). Vaswani et al. showed that attention alone, without any recurrence or convolution, could match or beat recurrent models on translation benchmarks.
Multi-head attention. Instead of a single attention function, the paper splits queries, keys, and values into $h$ heads, each operating on a $(d_{\text{model}}/h)$-dimensional subspace:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where each $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
Scaled dot-product attention. The attention function is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The $1/\sqrt{d_k}$ scaling prevents dot products from growing large in magnitude, which would push softmax into saturated regions with vanishing gradients.
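The effect of the scaling is easy to verify empirically. For random vectors with unit-variance entries, each term $q_j k_j$ has mean 0 and variance 1, so $\mathrm{Var}(q \cdot k) = d_k$ and the logit scale grows as $\sqrt{d_k}$ unless rescaled. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256):
    q = rng.normal(size=(10_000, d_k))          # 10k random query vectors
    k = rng.normal(size=(10_000, d_k))          # 10k random key vectors
    dots = (q * k).sum(axis=1)                  # unscaled q . k for each pair
    # std of the raw logits grows as sqrt(d_k); dividing restores scale ~1
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```

Without the division, a 16x larger $d_k$ gives 4x larger logits, and softmax over such logits concentrates nearly all mass on one position, starving the rest of gradient.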
Positional encoding. Since self-attention is permutation-equivariant, the model has no notion of token order without explicit position information. The paper used sinusoidal positional encodings: $PE_{(pos,\, 2i)} = \sin(pos/10000^{2i/d_{\text{model}}})$ and $PE_{(pos,\, 2i+1)} = \cos(pos/10000^{2i/d_{\text{model}}})$.
Encoder-decoder architecture. The original transformer had an encoder (6 layers of self-attention + FFN) and a decoder (6 layers of masked self-attention + cross-attention + FFN). The encoder processes the input sequence; the decoder generates the output sequence autoregressively.
Main Theorems
Self-Attention Computational Properties
Statement
Self-attention over a sequence of length $n$ with head dimension $d$ computes its output in $O(n^2 d)$ time and requires $O(n^2)$ memory for the attention matrix. Each output token is a weighted combination of all value vectors, where the weights depend on all $n^2$ pairwise query-key interactions.
Intuition
Every token attends to every other token. This gives the model a global receptive field in a single layer, unlike convolutions (local) or recurrence (sequential). The cost is quadratic in sequence length.
Proof Sketch
The matrix $QK^\top$ is $n \times n$ and costs $O(n^2 d)$ to compute. The softmax is applied row-wise in $O(n^2)$. The final multiplication with $V$ (which is $n \times d$) costs $O(n^2 d)$. Total: $O(n^2 d)$. Memory for the attention matrix: $O(n^2)$.
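The operation counts in this sketch can be tabulated directly. A small sketch, using the usual 2-flops-per-multiply-add convention (the constants are illustrative; the asymptotics are what matter):

```python
def attention_cost(n, d):
    """Rough FLOP and memory counts for one attention computation."""
    qk_flops = 2 * n * n * d    # Q K^T: n*n dot products of length d
    av_flops = 2 * n * n * d    # (softmax matrix) @ V
    attn_entries = n * n        # materialized n x n attention matrix
    return qk_flops + av_flops, attn_entries

for n in (512, 1024, 2048):
    flops, mem = attention_cost(n, 64)
    print(n, flops, mem)
```

Doubling the sequence length quadruples both the FLOPs and the attention-matrix memory, which is the quadratic wall discussed below.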
Why It Matters
The $O(n^2)$ complexity is the central limitation of transformers. It is why context lengths were initially limited to 512 or 1024 tokens. Flash attention, sparse attention, and linear attention variants all target this bottleneck.
Failure Mode
For long sequences (large $n$), the quadratic cost becomes the training bottleneck. A naive implementation also suffers from memory bandwidth limits because it materializes the full $n \times n$ attention matrix. Flash attention (Dao et al., 2022) avoids materializing this matrix, achieving the same computation in less wall-clock time.
Multi-Head Attention Capacity
Statement
Multi-head attention with $h$ heads of dimension $d_k = d_{\text{model}}/h$ has the same total parameter count as single-head attention with dimension $d_{\text{model}}$. However, multi-head attention can represent richer functions: each head can learn a different attention pattern (e.g., one head attends to syntactic relations, another to semantic similarity).
Intuition
Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head must average across all types of relationships.
Proof Sketch
Single-head: $3 d_{\text{model}}^2$ parameters for $W^Q, W^K, W^V$ plus $d_{\text{model}}^2$ for $W^O$. Multi-head: $h \cdot 3 d_{\text{model}} d_k = 3 d_{\text{model}}^2$ plus $d_{\text{model}}^2$ for $W^O$. Same total. But the rank-$d_k$ structure of each head restricts each individual attention pattern, while the concatenation allows $h$ independent patterns.
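The counting argument can be checked numerically. A small sketch, assuming the standard shapes ($d_k = d_{\text{model}}/h$, square $W^O$, no biases):

```python
def mha_params(d_model, h):
    """Parameter count for a multi-head attention sublayer, excluding biases."""
    d_k = d_model // h
    per_head = 3 * d_model * d_k           # W_i^Q, W_i^K, W_i^V for one head
    return h * per_head + d_model * d_model  # all heads + output projection W^O

# Single-head (h=1) and 8-head attention have identical parameter counts:
print(mha_params(512, 1), mha_params(512, 8))
```

Both evaluate to $4 d_{\text{model}}^2$; the heads change the rank structure, not the parameter budget.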
Why It Matters
Multi-head attention is one of the few design choices from the original paper that survived unchanged. Empirically, different heads specialize: some attend to adjacent tokens, some attend to syntactically related tokens, some attend to the beginning of the sequence. This specialization emerges without supervision.
Failure Mode
Many heads become redundant during training. Pruning studies show that removing 20-40% of heads often has minimal effect on performance, suggesting the model is over-parameterized in the multi-head dimension. GQA (grouped query attention) exploits this by sharing key-value heads across query heads.
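The memory saving from sharing KV heads is simple arithmetic. A back-of-envelope sketch with hypothetical numbers (32 layers, head dimension 128, 4096-token context, fp16; these are illustrative, not any particular model's config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """KV cache size: one K and one V vector per layer, kv-head, and position."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

full = kv_cache_bytes(32, 32, 128, 4096)  # standard MHA: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 4096)    # GQA: 8 KV heads shared by query heads
print(full / gqa)                          # 4x smaller cache
```

Cutting KV heads from 32 to 8 shrinks the inference cache fourfold while leaving the number of query heads (and hence attention patterns) unchanged.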
What Survived and What Changed
Survived (as of 2026):
- Self-attention as the core mechanism
- Multi-head attention
- Residual connections
- Layer normalization
- The $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ formula
Changed:
- Decoder-only replaced encoder-decoder. GPT showed that a decoder-only architecture suffices for language modeling, and it became the default for LLMs. Encoder-decoder survives in some applications (T5, translation).
- Pre-norm replaced post-norm. The original paper applied layer norm after the residual connection. Modern transformers apply it before (pre-norm), which stabilizes training for deep models. See residual stream internals.
- RoPE replaced sinusoidal positions. Rotary positional embeddings (Su et al., 2021) encode relative positions through rotation matrices, enabling better length generalization than absolute sinusoidal encodings. See positional encoding for a full comparison.
- GQA/MQA replaced standard multi-head. Grouped query attention reduces the KV cache size for inference, trading a small quality decrease for major memory savings at serving time. See attention variants.
- SwiGLU replaced ReLU in FFN. The original FFN used ReLU activation. Modern LLMs use SwiGLU or GeGLU, which empirically improve performance.
- Flash attention changed the implementation. The algorithm is mathematically identical, but the IO-aware implementation avoids materializing the full attention matrix, making long contexts practical.
What the Paper Got Right
The core computation. Scaled dot-product attention ($\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$) has not changed in nine years. Every LLM in production uses this exact formula. The scaling factor derivation in the paper (dot products grow as $\sqrt{d_k}$, pushing softmax into saturation) is correct and important.
Multi-head attention. The insight that multiple low-rank attention patterns are better than a single full-rank one has held up. Head specialization (syntactic heads, positional heads, rare-token heads) has been confirmed by mechanistic interpretability research.
Residual connections and layer normalization. The paper adopted these from prior work, and they remain in every modern transformer. The skip connection pattern is critical for training deep models.
Parallelism over recurrence. The central thesis of the paper, that attention-only models can replace sequential processing, was correct. This enabled the scaling that drives modern LLMs.
What Aged
The encoder-decoder framing. The paper presented the transformer as a sequence-to-sequence translation model. The field moved to decoder-only causal models for generation and encoder-only models (BERT) for understanding. The encoder-decoder split is no longer the default.
Sinusoidal positional encodings. Replaced by learned positions, ALiBi, and then RoPE, which handles length generalization better.
The training setup. The paper trained on WMT translation data for a few days on 8 GPUs. Modern LLMs train on trillions of tokens across thousands of GPUs. The scaling regime is completely different.
Label smoothing as the main regularization. The paper used label smoothing with $\epsilon_{ls} = 0.1$. Modern LLMs rely on dropout (or no dropout at scale), weight decay, and data diversity as the primary regularizers.
Common Confusions
The transformer is not just attention
The transformer block is attention + feedforward network + residual connections + layer normalization. The FFN contains roughly 2/3 of the parameters in a standard transformer block. Attention routes information; the FFN processes it. Both are necessary.
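The 2/3 figure follows from the standard shapes: attention contributes $4d_{\text{model}}^2$ parameters, and the FFN with the usual 4x hidden expansion contributes $8d_{\text{model}}^2$. A quick check (the 4x multiplier is the original paper's $d_{ff} = 4 d_{\text{model}}$ convention):

```python
def block_params(d_model, ffn_mult=4):
    """Per-block parameter split, excluding biases and layer norms."""
    attn = 4 * d_model * d_model               # W^Q, W^K, W^V, W^O
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection + down-projection
    return attn, ffn

attn, ffn = block_params(512)
print(ffn / (attn + ffn))   # the FFN's share of the block
```

With the 4x expansion the FFN share is $8/(4+8) = 2/3$ exactly; SwiGLU variants with a third matrix shift this ratio somewhat.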
The original paper was about translation, not language modeling
Vaswani et al. (2017) demonstrated the transformer on machine translation (WMT 2014 English-German and English-French). The application to autoregressive language modeling came later with GPT (Radford et al., 2018). The decoder-only architecture for causal language modeling was not in the original paper.
Exercises
Problem
In a transformer with $d_{\text{model}} = 512$ and $h = 8$ heads, what is the dimension of each head? How many parameters are in one multi-head attention sublayer (including $W^O$, excluding biases)?
Problem
Explain why the $1/\sqrt{d_k}$ scaling factor is necessary. What goes wrong if you remove it? Derive the expected magnitude of $q \cdot k$ when $q$ and $k$ are random vectors with $d_k$ independent entries of mean 0 and variance 1.
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
- Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). The attention mechanism that the transformer generalized.
Predecessors and context:
- Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018). GPT-1: first decoder-only transformer for language modeling.
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (NAACL 2019). Encoder-only variant.
Current evolution:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (NeurIPS 2022)
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models" (2023). The grouped query attention used in modern LLMs.
- Phuong & Hutter, "Formal Algorithms for Transformers" (2022). A mathematical reference for the transformer formalism.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)