LLM Construction
Attention Mechanisms History
The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.
Why This Matters
Attention is the core operation inside every transformer, and transformers are the architecture behind all modern large language models. But attention did not appear with transformers in 2017. It was introduced for sequence-to-sequence machine translation in 2014, three years before "Attention Is All You Need." Understanding the original motivation, the fixed-length bottleneck problem, makes the design choices of modern attention clear.
Mental Model
In a standard encoder-decoder RNN, the encoder reads an entire input sentence and compresses it into a single fixed-length vector. The decoder must generate the entire output from this one vector. For long sentences, this bottleneck causes information loss.
Attention solves this by letting the decoder look back at all encoder hidden states when generating each output token. Instead of one summary vector, the decoder gets a dynamically weighted combination of encoder states. Different output tokens attend to different parts of the input.
The Fixed-Length Bottleneck
In the original seq2seq model (Sutskever et al., 2014), the encoder processes input tokens $x_1, \dots, x_T$ through an RNN to produce hidden states $h_1, \dots, h_T$. Only the final hidden state $h_T$ is passed to the decoder. For long sequences, $h_T$ must encode everything, and empirically, translation quality degrades sharply for sentences longer than 20-30 tokens.
Bahdanau Attention (2014)
Bahdanau, Cho, and Bengio introduced additive attention to address the bottleneck.
Bahdanau (Additive) Attention
At each decoder time step $t$, compute an alignment score between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_i$:

$$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$$

where $W_a$, $U_a$, and $v_a$ are learned parameters. Normalize to get attention weights:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

The context vector is the weighted sum of encoder states:

$$c_t = \sum_i \alpha_{t,i} h_i$$

This context vector is concatenated with the decoder input and fed to the decoder RNN.

The name "additive" comes from the term $W_a s_{t-1} + U_a h_i$ inside the $\tanh$. The alignment function is a small feedforward network with one hidden layer.
Key insight: the alignment scores are differentiable, so the entire mechanism is end-to-end trainable. No hard alignment supervision is needed. The model learns which source words are relevant for each target word.
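The full mechanism above can be sketched in a few lines of NumPy. This is a minimal single-step illustration, not a trainable implementation: the parameter names ($W_a$, $U_a$, $v_a$) follow the equations, and the random values stand in for learned weights.

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive attention.

    s_prev: (d_s,)     previous decoder state s_{t-1}
    H:      (T, d_h)   encoder hidden states h_1..h_T
    W_a:    (d_a, d_s), U_a: (d_a, d_h), v_a: (d_a,)  learned parameters
    Returns (context vector, attention weights).
    """
    # Alignment scores: e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i)
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a        # shape (T,)
    # Softmax normalization to attention weights alpha_{t,i}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector: convex combination of encoder states
    c = alpha @ H                                       # shape (d_h,)
    return c, alpha

# Illustrative dimensions and random stand-ins for learned parameters
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))
c, alpha = bahdanau_attention(
    rng.normal(size=d_s), H,
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a),
)
print(alpha)  # nonnegative weights that sum to 1
```

Because every operation is differentiable, gradients flow from the translation loss back through `alpha` into the scoring parameters, which is exactly the end-to-end trainability described above.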
Luong Attention (2015)
Luong, Pham, and Manning proposed simpler alternatives.
Luong (Multiplicative) Attention
Dot-product attention: $e_{t,i} = s_t^\top h_i$. No learned parameters in the scoring function.
General attention: $e_{t,i} = s_t^\top W_a h_i$. One learned matrix $W_a$.
Concat attention: $e_{t,i} = v_a^\top \tanh(W_a [s_t; h_i])$. Similar to Bahdanau but concatenates instead of adding.
The rest of the mechanism (softmax normalization, weighted sum) is the same.
Dot-product attention is computationally cheaper than additive attention: it requires no learned parameters in the scoring function and can be computed as a matrix multiplication. This efficiency becomes critical when attention scales to self-attention over long sequences.
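The efficiency claim is visible in code: scores for all decoder steps against all encoder steps reduce to a single matrix multiplication. A minimal sketch, assuming decoder and encoder states share dimension $d$ (which dot-product scoring requires):

```python
import numpy as np

def luong_dot_attention(S, H):
    """Dot-product attention for all decoder steps at once.

    S: (T_dec, d) decoder states, H: (T_enc, d) encoder states.
    Returns (T_dec, d) context vectors.
    """
    scores = S @ H.T                            # (T_dec, T_enc), no learned params
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)  # row-wise softmax
    return alpha @ H                            # weighted sums of encoder states

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 4))   # 3 decoder steps, d = 4
H = rng.normal(size=(5, 4))   # 5 encoder positions
contexts = luong_dot_attention(S, H)
print(contexts.shape)  # (3, 4)
```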
Luong also introduced local attention, where the model attends only to a window of encoder positions near an estimated alignment point, reducing the per-step cost from $O(T)$ to $O(D)$, where $T$ is the source length and $D$ is the window size.
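A simplified sketch of the windowing idea, closest to Luong's local-m variant (fixed alignment point, no Gaussian weighting over the window): the alignment point `p_t` and window radius `D` are inputs here, whereas Luong's local-p variant predicts `p_t` from the decoder state.

```python
import numpy as np

def local_dot_attention(s_t, H, p_t, D):
    """Attend only to encoder positions in [p_t - D, p_t + D].

    s_t: (d,) decoder state, H: (T, d) encoder states,
    p_t: assumed alignment point, D: window radius.
    """
    lo, hi = max(0, p_t - D), min(len(H), p_t + D + 1)
    window = H[lo:hi]                 # at most 2D + 1 encoder states
    e = window @ s_t                  # dot-product scores inside the window
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha @ window             # context from the window only

rng = np.random.default_rng(2)
H = rng.normal(size=(10, 4))
c = local_dot_attention(rng.normal(size=4), H, p_t=5, D=2)
print(c.shape)  # (4,)
```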
Main Theorems
Attention as Differentiable Soft Alignment
Statement
The attention mechanism computes a soft alignment between decoder position $t$ and encoder positions $i = 1, \dots, T$. The context vector $c_t$ is a convex combination of encoder states (since $\alpha_{t,i} \ge 0$ and $\sum_i \alpha_{t,i} = 1$). In the limit where one $\alpha_{t,i} \to 1$ and all others $\to 0$, this recovers hard alignment to a single source position.
Intuition
Traditional machine translation used explicit word alignment (source word 3 maps to target word 5). Attention learns this alignment implicitly as a byproduct of optimizing translation quality. The alignment is soft: a target word can attend to multiple source words simultaneously, which handles one-to-many and many-to-one alignments naturally.
Proof Sketch
By the properties of softmax, $\alpha_{t,i} > 0$ and $\sum_i \alpha_{t,i} = 1$, so $c_t$ lies in the convex hull of $\{h_1, \dots, h_T\}$. As the temperature of the softmax approaches zero, the distribution becomes a one-hot vector, recovering hard alignment.
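The temperature limit can be checked numerically. A small illustration, with the temperature $\tau$ introduced explicitly for the demonstration:

```python
import numpy as np

def softmax(e, tau=1.0):
    """Softmax with temperature tau; tau -> 0 sharpens toward one-hot."""
    z = np.exp((e - e.max()) / tau)
    return z / z.sum()

e = np.array([2.0, 1.0, 0.5])   # alignment scores; argmax is position 0
for tau in [1.0, 0.1, 0.01]:
    print(tau, softmax(e, tau).round(4))
```

As `tau` shrinks, nearly all weight concentrates on the highest-scoring position, i.e. the distribution approaches the one-hot hard alignment described in the proof sketch.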
Why It Matters
Soft alignment allows gradient-based training. Hard alignment is discrete and requires reinforcement learning or marginalization to train. This differentiability is why attention became ubiquitous: it integrates smoothly into any neural architecture.
Failure Mode
When the encoder states are similar (low diversity), the attention weights become nearly uniform and the context vector is approximately the mean of all encoder states. This reverts to the bottleneck problem. This can happen with poorly trained encoders or very long sequences where RNN hidden states converge.
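This failure mode is easy to demonstrate: when the encoder states are near-duplicates, dot-product scores are nearly equal and the softmax output is nearly uniform. A synthetic sketch (the $10^{-3}$ perturbation scale is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=8)                       # decoder query state
base = rng.normal(size=8)
# Five nearly identical encoder states (low diversity)
H = base + 1e-3 * rng.normal(size=(5, 8))

e = H @ s                                    # near-equal dot-product scores
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
print(alpha)  # each weight close to 1/5: context ~ mean of encoder states
```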
From Cross-Attention to Self-Attention
In Bahdanau and Luong attention, the queries come from the decoder and the keys/values come from the encoder. This is cross-attention: attending from one sequence to another.
Self-attention (Vaswani et al., 2017) applies attention within a single sequence. Each token attends to all other tokens in the same sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, $V$ are linear projections of the same input sequence. The $1/\sqrt{d_k}$ scaling prevents dot products from growing large in high dimensions, which would push the softmax into saturation.
Self-attention removes the sequential bottleneck of RNNs entirely. Every token can directly attend to every other token in $O(1)$ sequential steps (though $O(n^2)$ total computation). This parallelism is why transformers train much faster than RNNs on modern hardware.
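A minimal single-head sketch of scaled dot-product self-attention, omitting multi-head splitting, masking, and bias terms; the projection matrices stand in for learned weights:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) input sequence; W_q, W_k, W_v project X to Q, K, V.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n), every pair of tokens
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V                                  # (n, d_v)

rng = np.random.default_rng(3)
n, d = 6, 8
X = rng.normal(size=(n, d))
out = self_attention(X, rng.normal(size=(d, d)),
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)  # (6, 8)
```

Note that the `Q @ K.T` matrix is $n \times n$, which is the $O(n^2)$ total cost, yet no loop over positions is needed: all pairs are scored in one matrix multiplication.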
Timeline Summary
| Year | Contribution | Key Innovation |
|---|---|---|
| 2014 | Bahdanau et al. | Additive attention for seq2seq translation |
| 2015 | Luong et al. | Dot-product attention, local attention |
| 2016 | Decomposable Attention | Attention without recurrence for NLI |
| 2017 | Vaswani et al. | Self-attention, multi-head attention, transformers |
Common Confusions
Attention is not unique to transformers
Attention was used with RNNs for three years before transformers existed. The transformer contribution was showing that self-attention alone (without recurrence or convolution) is sufficient, plus multi-head attention, positional encoding, and the specific architecture.
Attention weights are not explanations
Attention weights show where the model looks, not what it computes. High attention to a token does not mean that token causally determines the output. Jain and Wallace (2019) showed that alternative attention distributions can produce identical predictions. Use attention as a weak signal, not a causal explanation.
Scaled dot-product is not just a convenience
The $1/\sqrt{d_k}$ scaling in transformer attention is necessary, not optional. Without it, for large $d_k$, dot products have variance proportional to $d_k$, pushing softmax outputs toward one-hot vectors. This causes vanishing gradients. The scaling keeps the softmax in a regime where gradients flow.
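The variance claim can be verified empirically: for unit-variance random queries and keys, the dot product has variance close to $d_k$, while the scaled dot product stays near variance 1 regardless of dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d_k in [16, 256]:
    # 10,000 independent query/key pairs with unit-variance components
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)
    # Unscaled variance grows like d_k; scaled variance stays near 1
    results[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(d_k, results[d_k])
```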
Canonical Examples
Alignment in English-French translation
Translating "The agreement on the European Economic Area was signed in August 1992." Bahdanau attention learns that the French word "accord" aligns to "agreement," "économique" aligns to "Economic," and "signé" aligns to "signed." The alignment is monotonic for this pair but can be non-monotonic for languages with different word orders (e.g., English-Japanese).
Exercises
Problem
In Bahdanau attention with encoder hidden states of dimension $d_h$ and decoder states of dimension $d_s$, using an alignment network with hidden dimension $d_a$, how many parameters does the attention mechanism have?
Problem
Show that as the softmax temperature $\tau \to 0$ in $\alpha_{t,i} = \operatorname{softmax}(e_{t,i} / \tau)$, the attention weights converge to a one-hot vector selecting the encoder position with the highest alignment score.
References
Canonical:
- Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (2014)
- Luong, Pham, Manning, "Effective Approaches to Attention-based Neural Machine Translation" (2015)
- Vaswani et al., "Attention Is All You Need" (2017)
Current:
- Jain & Wallace, "Attention is not Explanation" (2019), NAACL
- Wiegreffe & Pinter, "Attention is not not Explanation" (2019), EMNLP
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Next Topics
- Transformer architecture: self-attention, multi-head attention, and the full transformer stack
- Positional encoding: how transformers represent sequence order without recurrence
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)