

Attention Mechanisms History

The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.

Core · Tier 2 · Stable · ~45 min

Why This Matters

Attention is the core operation inside every transformer, and transformers are the architecture behind all modern large language models. But attention did not appear with transformers in 2017. It was introduced for sequence-to-sequence machine translation in 2014, three years before "Attention Is All You Need." Understanding the original motivation, the fixed-length bottleneck problem, makes the design choices of modern attention clear.

Mental Model

In a standard encoder-decoder RNN, the encoder reads an entire input sentence and compresses it into a single fixed-length vector. The decoder must generate the entire output from this one vector. For long sentences, this bottleneck causes information loss.

Attention solves this by letting the decoder look back at all encoder hidden states when generating each output token. Instead of one summary vector, the decoder gets a dynamically weighted combination of encoder states. Different output tokens attend to different parts of the input.

The Fixed-Length Bottleneck

In the original seq2seq model (Sutskever et al., 2014), the encoder processes input tokens $x_1, \ldots, x_T$ through an RNN to produce hidden states $h_1, \ldots, h_T$. Only the final hidden state $h_T$ is passed to the decoder. For long sequences, $h_T$ must encode everything, and empirically, translation quality degrades sharply for sentences longer than 20-30 tokens.

Bahdanau Attention (2014)

Bahdanau, Cho, and Bengio introduced additive attention to address the bottleneck.

Definition

Bahdanau (Additive) Attention

At each decoder time step $t$, compute an alignment score between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_j$:

$$e_{t,j} = v^T \tanh(W_s s_{t-1} + W_h h_j)$$

where $W_s$, $W_h$, and $v$ are learned parameters. Normalize to get attention weights:

$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T}\exp(e_{t,k})}$$

The context vector is the weighted sum of encoder states:

$$c_t = \sum_{j=1}^{T}\alpha_{t,j} h_j$$

This context vector $c_t$ is concatenated with $s_{t-1}$ and fed to the decoder RNN.

The name "additive" comes from the $W_s s_{t-1} + W_h h_j$ term inside the $\tanh$. The alignment function is a small feedforward network with one hidden layer.

Key insight: the attention weights $\alpha_{t,j}$ are differentiable functions of the model parameters, so the entire mechanism is end-to-end trainable. No hard alignment supervision is needed. The model learns which source words are relevant for each target word.
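Putting the definition together, one decoder step of additive attention can be sketched in a few lines of numpy. Dimensions, initialization scales, and variable names below are illustrative assumptions, not code from the Bahdanau et al. paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s, d_a, T = 256, 256, 128, 10    # illustrative dimensions (assumed)

# Learned parameters of the alignment network
W_s = rng.normal(scale=0.1, size=(d_a, d_s))
W_h = rng.normal(scale=0.1, size=(d_a, d_h))
v = rng.normal(scale=0.1, size=d_a)

H = rng.normal(size=(T, d_h))           # encoder hidden states h_1..h_T
s_prev = rng.normal(size=d_s)           # previous decoder state s_{t-1}

# Alignment scores e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j), all j at once
e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v        # shape (T,)

# Softmax normalization -> attention weights alpha_{t,j}
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_t = sum_j alpha_{t,j} h_j
c = alpha @ H                                     # shape (d_h,)
```

Note that the scores for all $T$ encoder positions are computed in one batched matrix operation; the per-position formula in the definition corresponds to one row of that computation.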

Luong Attention (2015)

Luong, Pham, and Manning proposed simpler alternatives.

Definition

Luong (Multiplicative) Attention

Dot-product attention: $e_{t,j} = s_t^T h_j$. No learned parameters in the scoring function.

General attention: $e_{t,j} = s_t^T W h_j$. One learned matrix $W$.

Concat attention: $e_{t,j} = v^T \tanh(W[s_t; h_j])$. Similar to Bahdanau but concatenates instead of adding.

The rest of the mechanism (softmax normalization, weighted sum) is the same.

Dot-product attention is computationally cheaper than additive attention: it requires no learned parameters in the scoring function and can be computed as a matrix multiplication. This efficiency becomes critical when attention scales to self-attention over long sequences.

Luong also introduced local attention, where the model attends only to a window of encoder positions near an estimated alignment point, reducing cost from $O(T)$ to $O(D)$ where $D$ is the window size.
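The three scoring variants differ only in how $e_{t,j}$ is computed; the downstream machinery (softmax, weighted sum) is identical. A minimal numpy sketch, with made-up dimensions and random states (all names here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_a, T = 64, 32, 8                            # illustrative dimensions

s = rng.normal(size=d)                           # decoder state s_t
H = rng.normal(size=(T, d))                      # encoder states h_1..h_T
W = rng.normal(scale=0.1, size=(d, d))           # learned matrix ("general")
W_c = rng.normal(scale=0.1, size=(d_a, 2 * d))   # learned matrix ("concat")
v = rng.normal(scale=0.1, size=d_a)

# Three scoring functions, each producing e_{t,j} for j = 1..T
e_dot = H @ s                                    # s_t^T h_j
e_general = H @ (W.T @ s)                        # s_t^T W h_j
e_concat = np.tanh(
    np.concatenate([np.tile(s, (T, 1)), H], axis=1) @ W_c.T
) @ v                                            # v^T tanh(W [s_t; h_j])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Shared downstream step: normalize scores, take weighted sum of H
contexts = {name: softmax(e) @ H
            for name, e in [("dot", e_dot),
                            ("general", e_general),
                            ("concat", e_concat)]}
```

The dot variant requires the decoder and encoder state dimensions to match; the general and concat variants relax that constraint via their learned matrices.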

Main Theorems

Proposition

Attention as Differentiable Soft Alignment

Statement

The attention mechanism computes a soft alignment between decoder position $t$ and encoder positions $1, \ldots, T$. The context vector $c_t = \sum_j \alpha_{t,j} h_j$ is a convex combination of encoder states (since $\alpha_{t,j} \geq 0$ and $\sum_j \alpha_{t,j} = 1$). In the limit where one $\alpha_{t,j} \to 1$ and all others $\to 0$, this recovers hard alignment to a single source position.

Intuition

Traditional machine translation used explicit word alignment (source word 3 maps to target word 5). Attention learns this alignment implicitly as a byproduct of optimizing translation quality. The alignment is soft: a target word can attend to multiple source words simultaneously, which handles one-to-many and many-to-one alignments naturally.

Proof Sketch

By the properties of softmax, $\alpha_{t,j} \in (0,1)$ and $\sum_j \alpha_{t,j} = 1$, so $c_t$ lies in the convex hull of $\{h_j\}$. As the temperature of the softmax approaches zero, the distribution becomes a one-hot vector, recovering hard alignment.
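This limiting behavior is easy to check numerically with a temperature-scaled softmax. The scores below are arbitrary values chosen for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())     # subtract max for numerical stability
    return z / z.sum()

e = np.array([1.0, 2.5, 0.3, 2.0])   # arbitrary alignment scores

for tau in [1.0, 0.1, 0.01]:
    alpha = softmax(e / tau)
    # Weights stay non-negative and sum to 1: c_t is a convex combination
    assert alpha.min() >= 0 and abs(alpha.sum() - 1.0) < 1e-9

# At low temperature the distribution is effectively one-hot at the argmax
alpha = softmax(e / 0.01)
assert alpha.argmax() == e.argmax() and alpha.max() > 0.999
```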

Why It Matters

Soft alignment allows gradient-based training. Hard alignment is discrete and requires reinforcement learning or marginalization to train. This differentiability is why attention became ubiquitous: it integrates smoothly into any neural architecture.

Failure Mode

When the encoder states $h_j$ are similar (low diversity), the attention weights become nearly uniform and the context vector is approximately the mean of all encoder states. This reverts to the bottleneck problem. This can happen with poorly trained encoders or very long sequences where RNN hidden states converge.
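The failure mode is straightforward to reproduce: with near-identical encoder states, the weights collapse to roughly uniform regardless of the decoder state. A hypothetical sketch (dimensions and perturbation scale are assumptions):

```python
import numpy as np

def attention_weights(s, H):
    """Dot-product attention weights of decoder state s over encoder states H."""
    e = H @ s
    z = np.exp(e - e.max())
    return z / z.sum()

rng = np.random.default_rng(2)
s = rng.normal(size=16)

# Diverse encoder states: weights can concentrate on relevant positions
H_diverse = rng.normal(size=(5, 16))
# Near-identical encoder states: scores are nearly equal, weights ~uniform
H_similar = np.tile(rng.normal(size=16), (5, 1)) + 1e-6 * rng.normal(size=(5, 16))

w_diverse = attention_weights(s, H_diverse)
w_similar = attention_weights(s, H_similar)

# With 5 indistinguishable states, each weight is ~1/5: back to an
# (approximately) fixed summary vector, i.e. the original bottleneck
assert np.allclose(w_similar, 0.2, atol=1e-3)
```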

From Cross-Attention to Self-Attention

In Bahdanau and Luong attention, the queries come from the decoder and the keys/values come from the encoder. This is cross-attention: attending from one sequence to another.

Self-attention (Vaswani et al., 2017) applies attention within a single sequence. Each token attends to all other tokens in the same sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, $V$ are linear projections of the same input sequence. The $\sqrt{d_k}$ scaling prevents dot products from growing large in high dimensions, which would push the softmax into saturation.

Self-attention removes the sequential bottleneck of RNNs entirely. Every token can directly attend to every other token in $O(1)$ sequential steps (though $O(T^2)$ total computation). This parallelism is why transformers train much faster than RNNs on modern hardware.
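A minimal single-head sketch of the formula above in numpy, without masking or multiple heads (dimensions and weight initialization are illustrative assumptions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over sequence X (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) pairwise scaled scores
    scores -= scores.max(axis=-1, keepdims=True)   # stability shift
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
    return A @ V                               # (T, d_k): one context per token

rng = np.random.default_rng(3)
T, d_model, d_k = 6, 32, 8
X = rng.normal(size=(T, d_model))              # one input sequence
Wq = rng.normal(scale=0.1, size=(d_model, d_k))
Wk = rng.normal(scale=0.1, size=(d_model, d_k))
Wv = rng.normal(scale=0.1, size=(d_model, d_k))

out = self_attention(X, Wq, Wk, Wv)
```

The key structural difference from the cross-attention sketches earlier: queries, keys, and values all come from the same sequence $X$, and every token gets its own attention distribution in a single matrix operation.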

Timeline Summary

| Year | Contribution | Key Innovation |
|------|--------------|----------------|
| 2014 | Bahdanau et al. | Additive attention for seq2seq translation |
| 2015 | Luong et al. | Dot-product attention, local attention |
| 2016 | Decomposable Attention | Attention without recurrence for NLI |
| 2017 | Vaswani et al. | Self-attention, multi-head attention, transformers |

Common Confusions

Watch Out

Attention is not unique to transformers

Attention was used with RNNs for three years before transformers existed. The transformer contribution was showing that self-attention alone (without recurrence or convolution) is sufficient, plus multi-head attention, positional encoding, and the specific architecture.

Watch Out

Attention weights are not explanations

Attention weights show where the model looks, not what it computes. High attention to a token does not mean that token causally determines the output. Jain and Wallace (2019) showed that alternative attention distributions can produce identical predictions. Use attention as a weak signal, not a causal explanation.

Watch Out

Scaled dot-product is not just a convenience

The $\sqrt{d_k}$ scaling in transformer attention is necessary, not optional. Without it, for large $d_k$, dot products have variance proportional to $d_k$, pushing softmax outputs toward one-hot vectors. This causes vanishing gradients. The scaling keeps the softmax in a regime where gradients flow.
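The variance claim can be verified empirically: for unit-variance components, the raw dot product has variance roughly $d_k$, while the scaled version stays near 1. Sample sizes and seeds below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000                                 # number of sampled (q, k) pairs

for d_k in [16, 256, 4096]:
    q = rng.normal(size=(n, d_k))          # unit-variance query components
    k = rng.normal(size=(n, d_k))          # unit-variance key components
    dots = (q * k).sum(axis=1)             # raw dot products q . k
    scaled = dots / np.sqrt(d_k)           # transformer-style scaling
    # Raw variance grows linearly with d_k; scaled variance stays near 1,
    # keeping the softmax inputs in a well-conditioned range at any d_k
    assert abs(dots.var() / d_k - 1.0) < 0.2
    assert abs(scaled.var() - 1.0) < 0.2
```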

Canonical Examples

Example

Alignment in English-French translation

Translating "The agreement on the European Economic Area was signed in August 1992." Bahdanau attention learns that the French word "accord" aligns to "agreement," "économique" aligns to "Economic," and "signé" aligns to "signed." The alignment is monotonic for this pair but can be non-monotonic for languages with different word orders (e.g., English-Japanese).

Exercises

ExerciseCore

Problem

In Bahdanau attention with encoder hidden states of dimension $d_h = 256$ and decoder states of dimension $d_s = 256$, using an alignment network with hidden dimension $d_a = 128$, how many parameters does the attention mechanism have?

ExerciseAdvanced

Problem

Show that as the softmax temperature $\tau \to 0$ in $\alpha_{t,j} = \frac{\exp(e_{t,j}/\tau)}{\sum_k \exp(e_{t,k}/\tau)}$, the attention weights converge to a one-hot vector selecting the encoder position with the highest alignment score.

References

Canonical:

  • Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (2014)
  • Luong, Pham, Manning, "Effective Approaches to Attention-based Neural Machine Translation" (2015)
  • Vaswani et al., "Attention Is All You Need" (2017)

Current:

  • Jain & Wallace, "Attention is not Explanation" (2019), NAACL

  • Wiegreffe & Pinter, "Attention is not not Explanation" (2019), EMNLP

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12


Last reviewed: April 2026
