
LLM Construction

Induction Heads

Induction heads are attention head circuits that implement pattern completion: given a sequence like [A][B]...[A], they predict [B]. They are a leading candidate mechanism for in-context learning, with strong causal evidence in small attention-only models and correlational evidence in large transformers. They emerge through a phase transition during training.


Why This Matters

[Interactive diagram] Sequence: the tokens A, B, C, D appear at positions 1-4, then A appears again at position 5. What token should follow at position 6?
Layer 0 (previous-token head): each position i attends to position i-1, copying the predecessor's identity into position i's residual stream.
Layer 1 (induction head): at the repeated A (position 5), the query matches any position whose Layer-0 prev-token record equals A. That is position 2 (the first B, where Layer 0 wrote "prev was A").
The induction head then copies the token at the matched position (B) into the current residual stream, and the model predicts B as the next token.

When a language model sees "The capital of France is Paris ... The capital of Germany is" and predicts "Berlin", it is performing in-context learning: using patterns in the prompt to make predictions without updating its weights. Induction heads are the best-characterized circuit implicated in this ability.

The diagram above shows the two-layer mechanism on a toy sequence. Step through the toggles: Layer 0 records each token's predecessor into its own residual stream. Layer 1 then, at the repeated token, queries for positions whose Layer-0 record matches the current token. The matched position's value is copied as the prediction. This is the entire circuit.

Understanding induction heads matters for three reasons. First, they are one of the clearest mechanistic interpretability successes to date: a real, interpretable circuit found inside trained transformers. Second, Olsson et al. (2022) argue they plausibly account for a substantial fraction of in-context learning, with strong causal (ablation) evidence in small attention-only models and correlational co-occurrence evidence in larger models. Third, they emerge through a sudden phase transition during training, connecting to grokking and training dynamics.

The Mechanism

Proposition

Induction Head Circuit

Statement

An induction head is a two-head circuit spanning two attention layers that implements the following pattern completion:

Given a sequence containing ... [A] [B] ... [A], the circuit predicts [B] as the next token after the second [A].

The circuit works through composition of two attention heads:

  1. Previous-token head (Layer L): An attention head in an earlier layer whose attention pattern shifts information one position back. For each position i, it writes the identity of token i-1 into the residual stream at position i. After this head, position i carries information about "what came before me."

  2. Induction head (Layer L' > L): An attention head in a later layer that matches the current token with earlier positions that had the same predecessor. At the second [A], this head attends to the position after the first [A] (which is [B]), because the previous-token head has placed [A]'s identity at [B]'s position.

The composition is: the induction head's query-key matching uses the output of the previous-token head, creating a Q-composition or K-composition circuit.

Intuition

Think of it as a two-step lookup. Step 1: at every position, write a note saying "the token before me was X." Step 2: when you encounter token [A], search for other places where the note says "the token before me was [A]." Attend to those places, and copy what you find there. What you find is [B], because [B] follows the first [A] and its note says "[A] was before me."

This is a bigram lookup table computed dynamically from the context, without being stored in the weights.
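The two-step lookup can be sketched in a few lines of Python. This is an illustrative simulation of the algorithm, not a model's actual computation; the function name and token representation are invented for the example.

```python
def induction_predict(tokens):
    """Sketch of the induction circuit as a two-step lookup.

    Step 1 (previous-token head): at each position, note which token
    came immediately before it. Step 2 (induction head): at the final
    token, find an earlier position whose note matches it and copy
    the token found there.
    """
    # Step 1: "the token before me was X", for every position > 0
    notes = {i: tokens[i - 1] for i in range(1, len(tokens))}

    current = tokens[-1]
    # Step 2: scan earlier positions whose predecessor equals the
    # current token (most recent match first) and copy what sits there
    for i in range(len(tokens) - 2, 0, -1):
        if notes[i] == current:
            return tokens[i]
    return None  # no earlier occurrence: the circuit cannot help

print(induction_predict(["A", "B", "C", "D", "A"]))  # -> B
```

Note that the circuit returns nothing useful when the current token has no earlier occurrence, which is exactly why induction heads only help later in a sequence.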

Why It Matters

Induction heads are the strongest mechanistic account we have for in-context learning of exact repetition, and they require composition across layers (a single-layer transformer cannot implement the circuit). This is one of the clearest arguments that transformer depth is used for compositional computation, not just hierarchical features. The circuit also helps explain why per-token loss on later tokens in a sequence is lower than on early tokens: the induction head can only help when there are previous patterns to match. How much of total in-context learning they explain in large models is an open empirical question.

Failure Mode

Induction heads implement a specific, limited form of pattern matching: exact token repetition. They do not explain more sophisticated in-context learning (e.g., learning a novel function from labeled examples, reasoning by analogy, or arithmetic on unseen inputs). Even for the patterns they do handle, the clean causal evidence is strongest in small attention-only models from Olsson et al.; in large production transformers, evidence is primarily correlational (induction-head scores co-evolve with in-context learning loss), not fully causal.

The Phase Transition

Proposition

Induction Head Phase Transition

Statement

During training, induction heads emerge abruptly at a specific training step, co-occurring with:

  1. A sudden drop in in-context learning loss (the per-token loss on tokens later in the sequence decreases sharply)
  2. A sudden increase in the "induction head score" (the model's ability to complete [A][B]...[A] -> [B] patterns)
  3. A brief training loss spike (the model temporarily gets worse before getting better)

This transition happens at the same training step across all sequence positions and attention heads involved, consistent with a phase transition rather than gradual improvement.
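The "in-context learning loss" in item 1 is often operationalized as the gap between per-token loss late and early in the sequence. A minimal sketch of that metric, with illustrative (not canonical) position choices and a synthetic loss curve:

```python
def icl_score(per_token_loss, early=10, late=100):
    """Loss at a late position minus loss at an early position.
    More negative = the model benefits more from extra context.
    The positions 10 and 100 are illustrative, not the paper's values.
    """
    return per_token_loss[late] - per_token_loss[early]

# Toy per-token loss that drops as context accumulates
# (step-wise halving, so the arithmetic is exact)
losses = [4.0 * 0.5 ** (pos // 50) for pos in range(200)]
print(icl_score(losses))  # -> -3.0
```

At the phase transition, this score drops sharply: induction heads only pay off once there is enough prior context to match against.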

Intuition

Before the transition, the model uses simple bigram statistics (token BB follows token AA based on corpus frequency). At the transition, the model discovers that it can do much better by copying from context. This new strategy requires two heads to coordinate (composition), which is why it appears suddenly: partial composition does not help. Either the circuit works or it does not.

The training loss spike likely occurs because the model is reorganizing its internal representations to support the new circuit, temporarily disrupting existing computations.

Why It Matters

This is one of the best-documented examples of a capability emerging as a phase transition during training. It provides concrete evidence for the hypothesis that capabilities emerge discretely rather than continuously, which has implications for AI safety: sudden capability gains may be hard to predict or control.

Failure Mode

The phase transition is most clearly visible in small models (1-4 layers) on controlled data. In large models trained on diverse data, the transition may be smoother or may occur at different times for different types of pattern completion. The clean phase-transition signature is partly a consequence of the controlled experimental setting.

Composition: How Attention Heads Talk to Each Other

Induction heads require attention head composition: the output of one head feeds into the computation of another. There are three types:

Q-composition: Head B uses the output of Head A to form its queries. "What I'm looking for depends on what Head A found."

K-composition: Head B uses the output of Head A to form its keys. "What I advertise to other positions depends on what Head A wrote."

V-composition: Head B uses the output of Head A to form its values. "What I pass forward depends on what Head A contributed."

Induction heads primarily use K-composition: the previous-token head writes "my predecessor was [A]" into the residual stream, and the induction head reads this record when forming its keys, so positions whose predecessor was [A] produce keys that strongly match the query from the current [A].

This compositional structure is what makes transformer circuits more expressive than single-layer attention. The residual stream acts as a shared memory bus, and heads in different layers can build on each other's outputs.
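A toy numerical sketch of K-composition, using one-hot embeddings and identity query/key maps so the dot products are exact (real heads use learned low-rank projections; every name here is invented for illustration):

```python
vocab = ["A", "B", "C", "D"]
# One-hot toy embeddings: dot products are 1 for a match, 0 otherwise
emb = {t: [1.0 if j == i else 0.0 for j in range(len(vocab))]
       for i, t in enumerate(vocab)}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

tokens = ["A", "B", "C", "D", "A"]

# Previous-token head: the residual stream at position i now carries
# the embedding of token i-1 (a zero vector at position 0)
zero = [0.0] * len(vocab)
prev_written = [zero] + [emb[t] for t in tokens[:-1]]

# K-composition: the induction head forms its keys from the prev-token
# record and its query from the current token, so the query at the
# second A matches the key wherever the predecessor was A -- i.e., at
# the position of the first B
query = emb[tokens[-1]]
scores = [dot(k, query) for k in prev_written]
attended = scores.index(max(scores))
print(tokens[attended])  # -> B
```

With softmax and learned projections the picture is soft rather than 0/1, but the matching logic is the same: the key subspace carries the predecessor's identity, not the position's own token.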

How to Detect Induction Heads

Given a trained transformer, you can identify induction heads by:

  1. Prefix matching score: Feed sequences of the form [random tokens] [A] [B] [random tokens] [A]. Measure how much probability the model assigns to [B] after the second [A]. High score = induction head behavior.

  2. Attention pattern inspection: An induction head's attention pattern on repeated sequences shows a diagonal stripe offset by one position: position i attends to position j + 1, where token j = token i and j < i, so that attention lands on the token after the match.

  3. Ablation: Zero out specific attention heads and measure the change in in-context learning loss. If ablating a head in a later layer destroys in-context learning, and ablating a head in an earlier layer has the same effect, those heads likely form an induction circuit.
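Check 1 can be sketched as a score over an attention pattern. The function and the idealized pattern below are invented for illustration; in practice `attn` would be read out of a real model's attention head on repeated random sequences.

```python
import random

def prefix_matching_score(attn, tokens):
    """Average attention mass each token places on positions that
    immediately follow an earlier occurrence of the same token.
    attn[i][j] is the attention weight from query position i to key j.
    """
    per_pos = []
    for i, t in enumerate(tokens):
        targets = [j + 1 for j in range(i - 1) if tokens[j] == t]
        if targets:
            per_pos.append(sum(attn[i][j] for j in targets))
    return sum(per_pos) / len(per_pos) if per_pos else 0.0

# Synthetic check: a repeated random sequence plus an idealized
# induction pattern (each repeat attends fully to its match + 1)
random.seed(0)
half = [random.choice("abcdefgh") for _ in range(10)]
tokens = half + half
n = len(tokens)
attn = [[0.0] * n for _ in range(n)]
for i in range(n):
    matches = [j + 1 for j in range(i - 1) if tokens[j] == tokens[i]]
    if matches:
        attn[i][matches[-1]] = 1.0  # attend to most recent match + 1
    else:
        attn[i][i] = 1.0            # no earlier match: attend to self

print(prefix_matching_score(attn, tokens))  # -> 1.0
```

A head scoring near 1.0 on such sequences behaves like an induction head; a head scoring near 0 does not, even if it looks interesting for other reasons.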

Common Confusions

Watch Out

Induction heads are not just copying

Induction heads copy tokens from earlier in the sequence, but the circuit is more structured than simple copying. The previous-token head performs a relational operation (shift-by-one), and the induction head performs a content-based lookup. The composition of these two operations creates a pattern-completion algorithm. Calling it "copying" misses the algorithmic structure.

Watch Out

A single attention head cannot be an induction head

The induction mechanism requires two heads across two layers. A single attention head in a one-layer transformer can learn to attend to previous occurrences of the current token, but it cannot implement the full [A][B]...[A] -> [B] pattern because it has no way to shift attention by one position. The composition across layers is the key insight.

Watch Out

Induction heads do not explain all of in-context learning

Induction heads explain exact pattern repetition: if the model has seen [A][B] before in the context, it can predict [B] after [A]. This does not explain: learning a new classification rule from labeled examples, performing arithmetic on novel inputs, or analogical reasoning. These require more complex circuits that may generalize the induction head principle but are not yet fully understood.

Watch Out

The causal evidence is strong in small models, correlational in large ones

Popular summaries often say induction heads "are" the mechanism of in-context learning. Olsson et al. (2022) are more careful: they provide causal evidence (targeted ablations, direct circuit analysis) in small attention-only models, and correlational evidence in larger, more realistic transformers (induction-head score co-evolves with the in-context learning loss curve, prefix-match scores track ICL across scales). They explicitly frame induction heads as accounting for "a substantial fraction" of in-context learning, not all of it. Treat strong claims like "induction heads are the mechanism of ICL" as an extrapolation beyond what is currently established in full-scale models.

Exercises

Exercise (Core)

Problem

A transformer processes the sequence "The cat sat on the mat. The cat sat on the ___." Explain which tokens the induction head attends to when predicting the blank, and why.

Exercise (Advanced)

Problem

Explain why a one-layer transformer cannot implement an induction head, but a two-layer transformer can. What is the minimum number of attention heads needed?

References

Canonical:

  • Olsson et al., "In-context Learning and Induction Heads" (Anthropic, 2022). Sections 2-4 establish the circuit; Sections 5-6 carry the small-vs-large causal/correlational distinction emphasized on this page.
  • Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). Sections on Q/K/V-composition formalize the two-layer structure induction heads exploit.

Supporting causal evidence and extensions:

  • Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023). Circuit-analysis methodology overlaps; useful for contrasting sudden-emergence claims.
  • Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (NeurIPS 2023). Generalizes the ablation/edge-attribution tools used to isolate induction heads.
  • Singh et al., "The Transient Nature of Emergent In-Context Learning in Transformers" (NeurIPS 2023). Shows ICL can appear and then disappear during training, complicating the "induction heads = ICL" story.

Where the claim is softer than popular summaries suggest:

  • Akyürek et al., "What learning algorithm is in-context learning?" (ICLR 2023). Argues in-context learning on regression tasks implements gradient descent, a different mechanism from induction-style pattern copying.
  • Hendel, Geva, Globerson, "In-Context Learning Creates Task Vectors" (EMNLP 2023). Evidence that ICL in large models routes through compressed task-vector representations, not just induction-style lookups.

Next Topics

  • Mechanistic interpretability: the broader research program of understanding transformer internals
  • Residual stream and transformer internals: how information flows through the transformer

Last reviewed: April 2026
