

Attention Mechanisms History

The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.

Core · Tier 2 · Stable · ~45 min

Why This Matters

Attention is the core operation inside every transformer, and transformers are the architecture behind all modern large language models. But attention did not appear with transformers in 2017. It was introduced for sequence-to-sequence machine translation in 2014, three years before "Attention Is All You Need." Understanding the original motivation, the fixed-length bottleneck problem, makes the design choices of modern attention clear.

Mental Model

In a standard encoder-decoder RNN, the encoder reads an entire input sentence and compresses it into a single fixed-length vector. The decoder must generate the entire output from this one vector. For long sentences, this bottleneck causes information loss.

Attention solves this by letting the decoder look back at all encoder hidden states when generating each output token. Instead of one summary vector, the decoder gets a dynamically weighted combination of encoder states. Different output tokens attend to different parts of the input.

The Fixed-Length Bottleneck

In the original seq2seq model (Sutskever et al., 2014), the encoder processes input tokens $x_1, \ldots, x_T$ through an RNN to produce hidden states $h_1, \ldots, h_T$. Only the final hidden state $h_T$ is passed to the decoder. For long sequences, $h_T$ must encode everything, and empirically, translation quality degrades sharply for sentences longer than 20-30 tokens.

Bahdanau Attention (2014)

Bahdanau, Cho, and Bengio introduced additive attention to address the bottleneck.

Definition

Bahdanau (Additive) Attention

At each decoder time step $t$, compute an alignment score between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_j$:

$$e_{t,j} = v^T \tanh(W_s s_{t-1} + W_h h_j)$$

where $W_s$, $W_h$, and $v$ are learned parameters. Normalize to get attention weights:

$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T}\exp(e_{t,k})}$$

The context vector is the weighted sum of encoder states:

$$c_t = \sum_{j=1}^{T}\alpha_{t,j} h_j$$

This context vector $c_t$ is concatenated with $s_{t-1}$ and fed to the decoder RNN.

The name "additive" comes from the $W_s s_{t-1} + W_h h_j$ term inside the $\tanh$. The alignment function is a small feedforward network with one hidden layer.

Key insight: the attention weights $\alpha_{t,j}$ are differentiable functions of the model parameters, so the entire mechanism is end-to-end trainable. No hard alignment supervision is needed. The model learns which source words are relevant for each target word.
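Putting the definition together, one decoder step of additive attention can be sketched in a few lines of numpy. Dimensions, initialization scales, and variable names below are illustrative assumptions, not code from the Bahdanau et al. paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s, d_a, T = 256, 256, 128, 10    # illustrative dimensions (assumed)

# Learned parameters of the alignment network
W_s = rng.normal(scale=0.1, size=(d_a, d_s))
W_h = rng.normal(scale=0.1, size=(d_a, d_h))
v = rng.normal(scale=0.1, size=d_a)

H = rng.normal(size=(T, d_h))           # encoder hidden states h_1..h_T
s_prev = rng.normal(size=d_s)           # previous decoder state s_{t-1}

# Alignment scores e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j), all j at once
e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v        # shape (T,)

# Softmax normalization -> attention weights alpha_{t,j}
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_t = sum_j alpha_{t,j} h_j
c = alpha @ H                                     # shape (d_h,)
```

Note that the scores for all $T$ encoder positions are computed in one batched matrix operation; the per-position formula in the definition corresponds to one row of that computation.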

Luong Attention (2015)

Luong, Pham, and Manning proposed simpler alternatives.

Definition

Luong (Multiplicative) Attention

Dot-product attention: $e_{t,j} = s_t^T h_j$. No learned parameters in the scoring function.

General attention: $e_{t,j} = s_t^T W h_j$. One learned matrix $W$.

Concat attention: $e_{t,j} = v^T \tanh(W[s_t; h_j])$. Similar to Bahdanau but concatenates instead of adding.

The rest of the mechanism (softmax normalization, weighted sum) is the same.

Dot-product attention is computationally cheaper than additive attention: it requires no learned parameters in the scoring function and can be computed as a matrix multiplication. This efficiency becomes critical when attention scales to self-attention over long sequences.

Luong also introduced local attention, where the model attends only to a window of encoder positions near an estimated alignment point, reducing cost from $O(T)$ to $O(D)$ where $D$ is the window size.
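The three scoring variants differ only in how $e_{t,j}$ is computed; the downstream machinery (softmax, weighted sum) is identical. A minimal numpy sketch, with made-up dimensions and random states (all names here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_a, T = 64, 32, 8                            # illustrative dimensions

s = rng.normal(size=d)                           # decoder state s_t
H = rng.normal(size=(T, d))                      # encoder states h_1..h_T
W = rng.normal(scale=0.1, size=(d, d))           # learned matrix ("general")
W_c = rng.normal(scale=0.1, size=(d_a, 2 * d))   # learned matrix ("concat")
v = rng.normal(scale=0.1, size=d_a)

# Three scoring functions, each producing e_{t,j} for j = 1..T
e_dot = H @ s                                    # s_t^T h_j
e_general = H @ (W.T @ s)                        # s_t^T W h_j
e_concat = np.tanh(
    np.concatenate([np.tile(s, (T, 1)), H], axis=1) @ W_c.T
) @ v                                            # v^T tanh(W [s_t; h_j])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Shared downstream step: normalize scores, take weighted sum of H
contexts = {name: softmax(e) @ H
            for name, e in [("dot", e_dot),
                            ("general", e_general),
                            ("concat", e_concat)]}
```

The dot variant requires the decoder and encoder state dimensions to match; the general and concat variants relax that constraint via their learned matrices.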

Main Theorems

Proposition

Attention as Differentiable Soft Alignment

Statement

The attention mechanism computes a soft alignment between decoder position $t$ and encoder positions $1, \ldots, T$. The context vector $c_t = \sum_j \alpha_{t,j} h_j$ is a convex combination of encoder states (since $\alpha_{t,j} \geq 0$ and $\sum_j \alpha_{t,j} = 1$). In the limit where one $\alpha_{t,j} \to 1$ and all others $\to 0$, this recovers hard alignment to a single source position.

Intuition

Traditional machine translation used explicit word alignment (source word 3 maps to target word 5). Attention learns this alignment implicitly as a byproduct of optimizing translation quality. The alignment is soft: a target word can attend to multiple source words simultaneously, which handles one-to-many and many-to-one alignments naturally.

Proof Sketch

By the properties of softmax, $\alpha_{t,j} \in (0,1)$ and $\sum_j \alpha_{t,j} = 1$, so $c_t$ lies in the convex hull of $\{h_j\}$. As the temperature of the softmax approaches zero, the distribution becomes a one-hot vector, recovering hard alignment.
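This limiting behavior is easy to check numerically with a temperature-scaled softmax. The scores below are arbitrary values chosen for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())     # subtract max for numerical stability
    return z / z.sum()

e = np.array([1.0, 2.5, 0.3, 2.0])   # arbitrary alignment scores

for tau in [1.0, 0.1, 0.01]:
    alpha = softmax(e / tau)
    # Weights stay non-negative and sum to 1: c_t is a convex combination
    assert alpha.min() >= 0 and abs(alpha.sum() - 1.0) < 1e-9

# At low temperature the distribution is effectively one-hot at the argmax
alpha = softmax(e / 0.01)
assert alpha.argmax() == e.argmax() and alpha.max() > 0.999
```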

Why It Matters

Soft alignment allows gradient-based training. Hard alignment is discrete and requires reinforcement learning or marginalization to train. This differentiability is why attention became ubiquitous: it integrates smoothly into any neural architecture.

Failure Mode

When the encoder states $h_j$ are similar (low diversity), the attention weights become nearly uniform and the context vector is approximately the mean of all encoder states. This reverts to the bottleneck problem. This can happen with poorly trained encoders or very long sequences where RNN hidden states converge.
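The failure mode is straightforward to reproduce: with near-identical encoder states, the weights collapse to roughly uniform regardless of the decoder state. A hypothetical sketch (dimensions and perturbation scale are assumptions):

```python
import numpy as np

def attention_weights(s, H):
    """Dot-product attention weights of decoder state s over encoder states H."""
    e = H @ s
    z = np.exp(e - e.max())
    return z / z.sum()

rng = np.random.default_rng(2)
s = rng.normal(size=16)

# Diverse encoder states: weights can concentrate on relevant positions
H_diverse = rng.normal(size=(5, 16))
# Near-identical encoder states: scores are nearly equal, weights ~uniform
H_similar = np.tile(rng.normal(size=16), (5, 1)) + 1e-6 * rng.normal(size=(5, 16))

w_diverse = attention_weights(s, H_diverse)
w_similar = attention_weights(s, H_similar)

# With 5 indistinguishable states, each weight is ~1/5: back to an
# (approximately) fixed summary vector, i.e. the original bottleneck
assert np.allclose(w_similar, 0.2, atol=1e-3)
```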

From Cross-Attention to Self-Attention

In Bahdanau and Luong attention, the queries come from the decoder and the keys/values come from the encoder. This is cross-attention: attending from one sequence to another.

Self-attention (Vaswani et al., 2017) applies attention within a single sequence. Each token attends to all other tokens in the same sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, $V$ are linear projections of the same input sequence. The $\sqrt{d_k}$ scaling prevents dot products from growing large in high dimensions, which would push the softmax into saturation.

Self-attention removes the sequential bottleneck of RNNs entirely. Every token can directly attend to every other token in $O(1)$ sequential steps (though $O(T^2)$ total computation). This parallelism is why transformers train much faster than RNNs on modern hardware.
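A minimal single-head sketch of the formula above in numpy, without masking or multiple heads (dimensions and weight initialization are illustrative assumptions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over sequence X (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) pairwise scaled scores
    scores -= scores.max(axis=-1, keepdims=True)   # stability shift
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
    return A @ V                               # (T, d_k): one context per token

rng = np.random.default_rng(3)
T, d_model, d_k = 6, 32, 8
X = rng.normal(size=(T, d_model))              # one input sequence
Wq = rng.normal(scale=0.1, size=(d_model, d_k))
Wk = rng.normal(scale=0.1, size=(d_model, d_k))
Wv = rng.normal(scale=0.1, size=(d_model, d_k))

out = self_attention(X, Wq, Wk, Wv)
```

The key structural difference from the cross-attention sketches earlier: queries, keys, and values all come from the same sequence $X$, and every token gets its own attention distribution in a single matrix operation.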

Timeline Summary

| Year | Contribution | Key Innovation |
|------|--------------|----------------|
| 2014 | Bahdanau et al. | Additive attention for seq2seq translation |
| 2015 | Luong et al. | Dot-product attention, local attention |
| 2016 | Decomposable Attention | Attention without recurrence for NLI |
| 2017 | Vaswani et al. | Self-attention, multi-head attention, transformers |

Common Confusions

Watch Out

Attention is not unique to transformers

Attention was used with RNNs for three years before transformers existed. The transformer contribution was showing that self-attention alone (without recurrence or convolution) is sufficient, plus multi-head attention, positional encoding, and the specific architecture.

Watch Out

Attention weights are not explanations

Attention weights show where the model looks, not what it computes. High attention to a token does not mean that token causally determines the output. Jain and Wallace (2019) showed that alternative attention distributions can produce identical predictions. Use attention as a weak signal, not a causal explanation.

Watch Out

Scaled dot-product is not just a convenience

The $\sqrt{d_k}$ scaling in transformer attention is necessary, not optional. Without it, for large $d_k$, dot products have variance proportional to $d_k$, pushing softmax outputs toward one-hot vectors. This causes vanishing gradients. The scaling keeps the softmax in a regime where gradients flow.
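The variance claim can be verified empirically: for unit-variance components, the raw dot product has variance roughly $d_k$, while the scaled version stays near 1. Sample sizes and seeds below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000                                 # number of sampled (q, k) pairs

for d_k in [16, 256, 4096]:
    q = rng.normal(size=(n, d_k))          # unit-variance query components
    k = rng.normal(size=(n, d_k))          # unit-variance key components
    dots = (q * k).sum(axis=1)             # raw dot products q . k
    scaled = dots / np.sqrt(d_k)           # transformer-style scaling
    # Raw variance grows linearly with d_k; scaled variance stays near 1,
    # keeping the softmax inputs in a well-conditioned range at any d_k
    assert abs(dots.var() / d_k - 1.0) < 0.2
    assert abs(scaled.var() - 1.0) < 0.2
```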

Canonical Examples

Example

Alignment in English-French translation

Translating "The agreement on the European Economic Area was signed in August 1992." Bahdanau attention learns that the French word "accord" aligns to "agreement," "économique" aligns to "Economic," and "signé" aligns to "signed." The alignment is monotonic for this pair but can be non-monotonic for languages with different word orders (e.g., English-Japanese).

Exercises

ExerciseCore

Problem

In Bahdanau attention with encoder hidden states of dimension $d_h = 256$ and decoder states of dimension $d_s = 256$, using an alignment network with hidden dimension $d_a = 128$, how many parameters does the attention mechanism have?

ExerciseAdvanced

Problem

Show that as the softmax temperature $\tau \to 0$ in $\alpha_{t,j} = \frac{\exp(e_{t,j}/\tau)}{\sum_k \exp(e_{t,k}/\tau)}$, the attention weights converge to a one-hot vector selecting the encoder position with the highest alignment score.

References

Canonical:

  • Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (2014)
  • Luong, Pham, Manning, "Effective Approaches to Attention-based Neural Machine Translation" (2015)
  • Vaswani et al., "Attention Is All You Need" (2017)

Current:

  • Jain & Wallace, "Attention is not Explanation" (2019), NAACL

  • Wiegreffe & Pinter, "Attention is not not Explanation" (2019), EMNLP

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12


Last reviewed: April 2026
