
LLM Construction

Residual Stream and Transformer Internals

The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.


Why This Matters

The standard description of a transformer lists attention, FFN, and layer norm as sequential operations. This framing obscures the real computational structure. The key insight from mechanistic interpretability research is that the residual stream is the primary object. Attention and FFN blocks are side computations that read from the stream, transform the information, and write their results back.

This perspective clarifies why skip connections are not just a training trick (as in ResNets), but the core data bus of the architecture. It also explains why techniques like the logit lens work: you can read off meaningful predictions at every layer because the residual stream carries a running estimate of the output.

Mental Model

Think of the residual stream as a shared whiteboard. Each attention head and each FFN block can read what is on the whiteboard, do some computation, and add its contribution. No block erases what is already there; each block only adds. The final answer is whatever is on the whiteboard after all blocks have written to it.

Formal Setup

Definition

Residual Stream

For a transformer with $L$ layers, the residual stream at position $t$ after layer $l$ is:

$$x_t^{(l)} = x_t^{(0)} + \sum_{i=1}^{l} \left( \text{Attn}^{(i)}(x^{(i-1)})_t + \text{FFN}^{(i)}\!\left(x^{(i-1)} + \text{Attn}^{(i)}(x^{(i-1)})\right)_t \right)$$

where $x_t^{(0)}$ is the token embedding plus positional encoding. Each layer adds two terms: the attention output and the FFN output.
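The recurrence can be sketched in a few lines of numpy. This is a toy model: the attention and FFN sublayers are stand-in random linear maps, not real attention, but the bookkeeping of reads and additive writes is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4

def make_sublayer(rng, d):
    # Toy stand-in for a real sublayer: a small random linear map.
    W = rng.normal(scale=0.1, size=(d, d))
    return lambda x: x @ W.T

attn = [make_sublayer(rng, d_model) for _ in range(n_layers)]
ffn = [make_sublayer(rng, d_model) for _ in range(n_layers)]

x0 = rng.normal(size=d_model)   # token embedding + positional encoding
x = x0.copy()
writes = []                     # each sublayer's additive contribution
for l in range(n_layers):
    a = attn[l](x)              # attention reads the current stream...
    x = x + a                   # ...and writes its output back
    f = ffn[l](x)               # FFN reads the updated stream...
    x = x + f                   # ...and writes its output back
    writes += [a, f]

# The final stream equals the embedding plus the sum of all writes.
assert np.allclose(x, x0 + sum(writes))
```

Nothing ever overwrites the stream; the `writes` list is exactly the set of additive terms in the formula above.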

Main Theorems

Proposition

Residual Stream Decomposition

Statement

The final residual stream $x_t^{(L)}$ can be written as a sum of $1 + L \cdot H + L$ terms:

$$x_t^{(L)} = x_t^{(0)} + \sum_{l=1}^{L} \sum_{h=1}^{H} a_{t,l,h} + \sum_{l=1}^{L} f_{t,l}$$

where $a_{t,l,h}$ is the output of attention head $h$ in layer $l$ and $f_{t,l}$ is the output of the FFN in layer $l$. For a 32-layer model with 32 heads per layer, this is $1 + 1024 + 32 = 1057$ additive terms.

Intuition

Every attention head and every FFN block contributes independently to the final representation. The final logits are a linear function of this sum (via the unembedding matrix). This means individual heads and FFN blocks have interpretable, separable contributions to the output distribution.

Proof Sketch

Expand the residual recurrence $x^{(l)} = x^{(l-1)} + \text{Attn}^{(l)}(\cdot) + \text{FFN}^{(l)}(\cdot)$ telescopically from $l = 1$ to $l = L$. Multi-head attention is already a sum over heads: $\text{Attn}^{(l)} = \sum_h W_O^{(l,h)} \text{head}_h^{(l)}$.
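The per-head step of the proof sketch is worth seeing numerically: concatenating head outputs and applying one fused output projection gives the same vector as slicing $W_O$ into per-head blocks and summing their writes. A minimal check with toy random values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 8, 2
d_head = d_model // n_heads

# Toy per-head outputs (the result of each head's attention-weighted values).
heads = rng.normal(size=(n_heads, d_head))
# Fused output projection, as implemented in most transformer codebases.
W_O = rng.normal(size=(d_model, n_heads * d_head))

# Fused computation: concatenate heads, then one matmul.
fused = W_O @ heads.reshape(-1)

# Per-head computation: slice W_O into per-head blocks W_O^{(l,h)} and sum.
per_head = sum(
    W_O[:, h * d_head:(h + 1) * d_head] @ heads[h] for h in range(n_heads)
)
assert np.allclose(fused, per_head)
```

This is why each head can be treated as writing its own independent additive term to the stream, even though implementations fuse the projection into one matrix.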

Why It Matters

This decomposition is the foundation of mechanistic interpretability. If you want to understand why the model predicts a particular token, you can examine which heads and FFN blocks contribute the most to the logit of that token. This is called the "logit attribution" technique.

Failure Mode

The decomposition ignores that each term is computed from previous terms, not independently. The outputs of layer $l$ depend on all of layers $1, \ldots, l-1$. So while the final representation is a linear sum, the individual terms are nonlinear functions of each other. Direct path patching or causal interventions are needed to establish causal (not just additive) contributions.

Pre-Norm vs Post-Norm

Definition

Post-Norm Transformer

The original transformer (Vaswani et al., 2017) applies layer normalization after the residual addition:

$$x^{(l)} = \text{LayerNorm}\!\left(x^{(l-1)} + \text{Sublayer}(x^{(l-1)})\right)$$

This is called post-norm because normalization comes after the skip.

Definition

Pre-Norm Transformer

GPT-2 and most modern LLMs apply layer normalization before the sublayer:

$$x^{(l)} = x^{(l-1)} + \text{Sublayer}\!\left(\text{LayerNorm}(x^{(l-1)})\right)$$

This is called pre-norm because normalization comes before the sublayer computation.
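The two block variants differ by one line of code. A minimal sketch with a simple LayerNorm and a toy random sublayer `g` (not real attention or an FFN):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
W = rng.normal(scale=0.1, size=(d, d))

def g(x):
    # Toy stand-in for attention or FFN.
    return x @ W.T

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = rng.normal(size=d)

# Post-norm (Vaswani et al., 2017): normalize AFTER the residual add.
# The identity path x + g(x) passes through LayerNorm.
post = layer_norm(x + g(x))

# Pre-norm (GPT-2 style): normalize only the sublayer's INPUT.
# The identity path x + ... is left untouched.
pre = x + g(layer_norm(x))
```

In the pre-norm line, `x` reaches the output unmodified, which is exactly the identity term in the gradient argument below; in the post-norm line it does not.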

Proposition

Pre-Norm Gradient Flow

Statement

In a pre-norm transformer, the gradient of the loss $\mathcal{L}$ with respect to the residual stream at layer $l$ satisfies:

$$\frac{\partial \mathcal{L}}{\partial x^{(l)}} = \frac{\partial \mathcal{L}}{\partial x^{(L)}} + \text{(terms from layers } l+1, \ldots, L\text{)}$$

The first term is a direct path from the loss to layer ll with no multiplicative degradation. This direct path does not exist in post-norm because the normalization after the residual breaks the identity shortcut.

Intuition

Pre-norm preserves a clean identity path in the residual stream. The gradient can flow directly from the output back to any layer without passing through normalization layers. Post-norm inserts normalization into the gradient path, which can cause gradient magnitude issues in deep networks.

Proof Sketch

For pre-norm, $x^{(l)} = x^{(l-1)} + g(x^{(l-1)})$ where $g$ includes the normalization internally. By the chain rule, $\partial x^{(L)} / \partial x^{(l)}$ contains an identity term from the skip connection. For post-norm, $x^{(l)} = \text{LN}\!\left(x^{(l-1)} + g(x^{(l-1)})\right)$, and the layer norm Jacobian has no identity component.

Why It Matters

Pre-norm is why modern LLMs can be trained at depths of 80+ layers without careful learning rate warmup schemes. The original post-norm transformer required warmup and was fragile beyond 12 layers. This architectural choice has as much practical impact as the attention mechanism itself.

Failure Mode

Pre-norm can produce representations that grow in magnitude across layers because there is no normalization on the residual stream itself. This is sometimes addressed by adding a final layer norm before the output projection. Some recent work (e.g., DeepNorm) combines pre-norm and post-norm properties for very deep models.

FFN as Key-Value Memory

Geva et al. (2021) showed that FFN layers can be interpreted as key-value memories. The first projection $W_1$ maps the input to a set of "keys." The activation function selects which keys are active. The second projection $W_2$ maps the active keys to "values" that are written to the residual stream.

For a two-layer FFN with ReLU:

$$\text{FFN}(x) = W_2 \, \text{ReLU}(W_1 x) = \sum_{i=1}^{d_{\text{ff}}} \max(w_{1,i}^T x, 0) \, w_{2,i}$$

Each neuron $i$ fires when the input $x$ has a high dot product with key $w_{1,i}$, and contributes value $w_{2,i}$ to the output. This makes the FFN a sparse, high-dimensional associative memory.
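The key-value reading is a direct rearrangement of the matrix form. A sketch with toy random weights, verifying that the FFN output equals an explicit sum over memory slots:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_ff, d_model))   # rows w_{1,i} are "keys"
W2 = rng.normal(size=(d_model, d_ff))   # columns w_{2,i} are "values"
x = rng.normal(size=d_model)

# Standard matrix form of the two-layer ReLU FFN.
ffn = W2 @ np.maximum(W1 @ x, 0.0)

# Same computation as an explicit sum over key-value memory slots:
# each slot's value vector is weighted by its ReLU-gated key match.
kv = sum(max(W1[i] @ x, 0.0) * W2[:, i] for i in range(d_ff))
assert np.allclose(ffn, kv)

# Sparsity: with ReLU, only neurons whose key matches x contribute at all.
active = int((W1 @ x > 0).sum())
```

For a given input, roughly half the toy keys fire; in trained models the activation pattern is typically much sparser.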

Logit Lens

Definition

Logit Lens

The logit lens projects the residual stream at intermediate layer $l$ through the final unembedding matrix $W_U$ to obtain a distribution over the vocabulary:

$$p^{(l)} = \text{softmax}\!\left(W_U \cdot \text{LayerNorm}(x^{(l)})\right)$$

This shows what token the model would predict if forced to output at layer $l$.

The logit lens reveals that transformers build up their predictions incrementally. Early layers produce vague semantic categories. Middle layers narrow to the correct topic. Final layers select the specific token. This progression is visible as the correct token's rank improving (moving toward rank 1) across layers.
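The mechanics of the lens are just the definition applied at every layer. A sketch assuming you already have per-layer residual streams (here faked with random vectors; in practice you would cache them during a forward pass):

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, vocab, n_layers = 8, 20, 4
W_U = rng.normal(size=(vocab, d_model))           # unembedding matrix
# Stand-in for cached residual streams x^{(0)}, ..., x^{(L)} at one position.
resid = [rng.normal(size=d_model) for _ in range(n_layers + 1)]

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def softmax(z):
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

tops = []
for l, x in enumerate(resid):
    p = softmax(W_U @ layer_norm(x))               # p^{(l)} over the vocab
    tops.append(int(np.argmax(p)))                 # prediction "at layer l"
```

With a real model, plotting the target token's rank in `p` against `l` shows the incremental refinement described above.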

Attention Head Composition

Because attention heads in different layers all read from and write to the same residual stream, later heads can build on earlier heads' outputs. This is called composition and comes in three forms:

Q-composition: Head B in layer $l+1$ forms its queries using the output of Head A in layer $l$. The query at position $t$ depends on what Head A wrote at position $t$.

K-composition: Head B uses Head A's output to form keys. Positions that Head A has annotated become easier or harder to attend to.

V-composition: Head B uses Head A's output as part of the values it reads. The content that Head B extracts depends on what Head A has contributed.

Composition is what makes transformers more than a stack of independent attention operations. It allows induction heads (a pattern-matching circuit that requires K-composition across two layers) and other multi-step algorithms. Without composition, each layer would operate independently on the original embeddings, and the network would be no more expressive than a single wide layer.

Direct Logit Attribution

The residual stream decomposition enables direct logit attribution: quantifying how much each attention head and FFN block contributes to the logit of a specific output token.

For a target token $v$ with unembedding vector $u_v$, the logit contribution of attention head $(l, h)$ is:

$$\text{DLA}_{l,h}(v) = u_v^T a_{t,l,h}$$

where $a_{t,l,h}$ is the head's output vector. The total logit is:

$$\text{logit}(v) = u_v^T x_t^{(L)} = u_v^T x_t^{(0)} + \sum_{l,h} \text{DLA}_{l,h}(v) + \sum_l u_v^T f_{t,l}$$

This allows you to identify which heads "vote for" a particular token and which "vote against" it. A head with a large positive $\text{DLA}_{l,h}(v)$ strongly promotes token $v$; a head with a large negative DLA suppresses it.

DLA is fast to compute (just dot products) and gives a first approximation of circuit structure. Its limitation: it captures only direct contributions, not indirect ones (where head A influences head B which then influences the logit).
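Computationally, DLA is one batch of dot products over the cached head and FFN outputs. A sketch with toy random components (ignoring the final LayerNorm, which in practice must be folded in or approximated):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, n_layers, n_heads, vocab = 8, 2, 2, 10
W_U = rng.normal(size=(vocab, d_model))
# Stand-ins for cached additive writes at one token position t.
a = rng.normal(size=(n_layers, n_heads, d_model))  # head outputs a_{t,l,h}
f = rng.normal(size=(n_layers, d_model))           # FFN outputs f_{t,l}
x0 = rng.normal(size=d_model)                      # embedding term

v = 3                            # target token
u_v = W_U[v]                     # its unembedding vector

dla_heads = a @ u_v              # (n_layers, n_heads): per-head logit votes
dla_ffn = f @ u_v                # (n_layers,): per-FFN logit votes

# Sanity check: the direct contributions sum to the full logit.
x_final = x0 + a.sum(axis=(0, 1)) + f.sum(axis=0)
logit_v = u_v @ x_final
assert np.allclose(logit_v, u_v @ x0 + dla_heads.sum() + dla_ffn.sum())
```

Sorting `dla_heads` then gives a first-pass ranking of which heads promote or suppress token $v$, with the indirect-effect caveat above.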

Common Confusions

Watch Out

Residual connections are not just for gradient flow

In ResNets, skip connections were motivated by gradient flow during training. In transformers, the residual stream serves a different role: it is the communication bus between all heads and FFN blocks. Even if gradient flow were not an issue, the residual structure would still be necessary for the additive composition of different computations.

Watch Out

The logit lens is approximate, not exact

The logit lens applies the final unembedding to intermediate representations, but those representations were not trained to be decoded at intermediate layers. The fact that it works at all is informative, but the predictions at early layers should be interpreted qualitatively, not as precise probability estimates. The "tuned lens" (Belrose et al., 2023) trains a separate affine probe per layer for better intermediate decoding.

Exercises

ExerciseCore

Problem

A 24-layer transformer has 16 attention heads per layer. How many additive terms contribute to the final residual stream representation at a single token position? List the categories of terms.

ExerciseAdvanced

Problem

Explain why post-norm transformers are harder to train at depth 48+ compared to pre-norm, using the gradient flow argument. Specifically, what happens to $\partial \mathcal{L} / \partial x^{(1)}$ in both cases?

References

Canonical:

  • Elhage et al., A Mathematical Framework for Transformer Circuits (2021)
  • Geva et al., Transformer Feed-Forward Layers Are Key-Value Memories (2021)

Current:

  • nostalgebraist, The Logit Lens (2020, blog)
  • Belrose et al., Eliciting Latent Predictions from Transformers with the Tuned Lens (2023)
  • Xiong et al., On Layer Normalization in the Transformer Architecture (2020)

Last reviewed: April 2026