
LLM Construction

Residual Stream and Transformer Internals

The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.


Why This Matters

The standard description of a transformer lists attention, FFN, and layer norm as sequential operations. This framing obscures the real computational structure. The key insight from mechanistic interpretability research is that the residual stream is the primary object. Attention and FFN blocks are side computations that read from the stream, transform the information, and write their results back.

This perspective clarifies why skip connections are not just a training trick (as in ResNets), but the core data bus of the architecture. It also explains why techniques like the logit lens work: you can read off meaningful predictions at every layer because the residual stream carries a running estimate of the output.

Mental Model

Think of the residual stream as a shared whiteboard. Each attention head and each FFN block can read what is on the whiteboard, do some computation, and add its contribution. No block erases what is already there; each block only adds. The final answer is whatever is on the whiteboard after all blocks have written to it.

Formal Setup

Definition

Residual Stream

For a transformer with $L$ layers, the residual stream at position $t$ after layer $l$ is:

$$x_t^{(l)} = x_t^{(0)} + \sum_{i=1}^{l} \left( \text{Attn}^{(i)}(x^{(i-1)})_t + \text{FFN}^{(i)}\!\left(x^{(i-1)} + \text{Attn}^{(i)}(x^{(i-1)})\right)_t \right)$$

where $x_t^{(0)}$ is the token embedding plus positional encoding. Each layer adds two terms: the attention output and the FFN output.
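The recurrence can be sketched in a few lines of numpy. This is a toy model: the attention and FFN sublayers are stand-in random linear maps, not real attention, but the bookkeeping of reads and additive writes is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4

def make_sublayer(rng, d):
    # Toy stand-in for a real sublayer: a small random linear map.
    W = rng.normal(scale=0.1, size=(d, d))
    return lambda x: x @ W.T

attn = [make_sublayer(rng, d_model) for _ in range(n_layers)]
ffn = [make_sublayer(rng, d_model) for _ in range(n_layers)]

x0 = rng.normal(size=d_model)   # token embedding + positional encoding
x = x0.copy()
writes = []                     # each sublayer's additive contribution
for l in range(n_layers):
    a = attn[l](x)              # attention reads the current stream...
    x = x + a                   # ...and writes its output back
    f = ffn[l](x)               # FFN reads the updated stream...
    x = x + f                   # ...and writes its output back
    writes += [a, f]

# The final stream equals the embedding plus the sum of all writes.
assert np.allclose(x, x0 + sum(writes))
```

Nothing ever overwrites the stream; the `writes` list is exactly the set of additive terms in the formula above.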

Main Theorems

Proposition

Residual Stream Decomposition

Statement

The final residual stream $x_t^{(L)}$ can be written as a sum of $1 + L \cdot H + L$ terms:

$$x_t^{(L)} = x_t^{(0)} + \sum_{l=1}^{L} \sum_{h=1}^{H} a_{t,l,h} + \sum_{l=1}^{L} f_{t,l}$$

where $a_{t,l,h}$ is the output of attention head $h$ in layer $l$ and $f_{t,l}$ is the output of the FFN in layer $l$. For a 32-layer model with 32 heads per layer, this is $1 + 1024 + 32 = 1057$ additive terms.

Intuition

Every attention head and every FFN block contributes independently to the final representation. The final logits are a linear function of this sum (via the unembedding matrix). This means individual heads and FFN blocks have interpretable, separable contributions to the output distribution.

Proof Sketch

Expand the residual recurrence $x^{(l)} = x^{(l-1)} + \text{Attn}^{(l)}(\cdot) + \text{FFN}^{(l)}(\cdot)$ telescopically from $l = 1$ to $l = L$. Multi-head attention is already a sum over heads: $\text{Attn}^{(l)} = \sum_h W_O^{(l,h)} \text{head}_h^{(l)}$.
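The per-head step of the proof sketch is worth seeing numerically: concatenating head outputs and applying one fused output projection gives the same vector as slicing $W_O$ into per-head blocks and summing their writes. A minimal check with toy random values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 8, 2
d_head = d_model // n_heads

# Toy per-head outputs (the result of each head's attention-weighted values).
heads = rng.normal(size=(n_heads, d_head))
# Fused output projection, as implemented in most transformer codebases.
W_O = rng.normal(size=(d_model, n_heads * d_head))

# Fused computation: concatenate heads, then one matmul.
fused = W_O @ heads.reshape(-1)

# Per-head computation: slice W_O into per-head blocks W_O^{(l,h)} and sum.
per_head = sum(
    W_O[:, h * d_head:(h + 1) * d_head] @ heads[h] for h in range(n_heads)
)
assert np.allclose(fused, per_head)
```

This is why each head can be treated as writing its own independent additive term to the stream, even though implementations fuse the projection into one matrix.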

Why It Matters

This decomposition is the foundation of mechanistic interpretability. If you want to understand why the model predicts a particular token, you can examine which heads and FFN blocks contribute the most to the logit of that token. This is called the "logit attribution" technique.

Failure Mode

The decomposition ignores that each term is computed from previous terms, not independently. The outputs of layer $l$ depend on all of layers $1, \ldots, l-1$. So while the final representation is a linear sum, the individual terms are nonlinear functions of each other. Direct path patching or causal interventions are needed to establish causal (not just additive) contributions.

Pre-Norm vs Post-Norm

Definition

Post-Norm Transformer

The original transformer (Vaswani et al., 2017) applies layer normalization after the residual addition:

$$x^{(l)} = \text{LayerNorm}\!\left(x^{(l-1)} + \text{Sublayer}(x^{(l-1)})\right)$$

This is called post-norm because normalization comes after the skip.

Definition

Pre-Norm Transformer

GPT-2 and most modern LLMs apply layer normalization before the sublayer:

$$x^{(l)} = x^{(l-1)} + \text{Sublayer}\!\left(\text{LayerNorm}(x^{(l-1)})\right)$$

This is called pre-norm because normalization comes before the sublayer computation.
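The two block variants differ by one line of code. A minimal sketch with a simple LayerNorm and a toy random sublayer `g` (not real attention or an FFN):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
W = rng.normal(scale=0.1, size=(d, d))

def g(x):
    # Toy stand-in for attention or FFN.
    return x @ W.T

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = rng.normal(size=d)

# Post-norm (Vaswani et al., 2017): normalize AFTER the residual add.
# The identity path x + g(x) passes through LayerNorm.
post = layer_norm(x + g(x))

# Pre-norm (GPT-2 style): normalize only the sublayer's INPUT.
# The identity path x + ... is left untouched.
pre = x + g(layer_norm(x))
```

In the pre-norm line, `x` reaches the output unmodified, which is exactly the identity term in the gradient argument below; in the post-norm line it does not.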

Proposition

Pre-Norm Gradient Flow

Statement

In a pre-norm transformer, the gradient of the loss $\mathcal{L}$ with respect to the residual stream at layer $l$ satisfies:

$$\frac{\partial \mathcal{L}}{\partial x^{(l)}} = \frac{\partial \mathcal{L}}{\partial x^{(L)}} + \text{(terms from layers } l+1, \ldots, L\text{)}$$

The first term is a direct path from the loss to layer ll with no multiplicative degradation. This direct path does not exist in post-norm because the normalization after the residual breaks the identity shortcut.

Intuition

Pre-norm preserves a clean identity path in the residual stream. The gradient can flow directly from the output back to any layer without passing through normalization layers. Post-norm inserts normalization into the gradient path, which can cause gradient magnitude issues in deep networks.

Proof Sketch

For pre-norm, $x^{(l)} = x^{(l-1)} + g(x^{(l-1)})$ where $g$ includes the normalization internally. By the chain rule, $\partial x^{(L)} / \partial x^{(l)}$ contains an identity term from the skip connection. For post-norm, $x^{(l)} = \text{LN}\!\left(x^{(l-1)} + g(x^{(l-1)})\right)$, and the layer norm Jacobian has no identity component.

Why It Matters

Pre-norm is why modern LLMs can be trained at depths of 80+ layers without careful learning rate warmup schemes. The original post-norm transformer required warmup and was fragile beyond 12 layers. This architectural choice has as much practical impact as the attention mechanism itself.

Failure Mode

Pre-norm can produce representations that grow in magnitude across layers because there is no normalization on the residual stream itself. This is sometimes addressed by adding a final layer norm before the output projection. Some recent work (e.g., DeepNorm) combines pre-norm and post-norm properties for very deep models.

FFN as Key-Value Memory

Geva et al. (2021) showed that FFN layers can be interpreted as key-value memories. The first projection $W_1$ maps the input to a set of "keys." The activation function selects which keys are active. The second projection $W_2$ maps the active keys to "values" that are written to the residual stream.

For a two-layer FFN with ReLU:

$$\text{FFN}(x) = W_2 \, \text{ReLU}(W_1 x) = \sum_{i=1}^{d_{\text{ff}}} \max(w_{1,i}^T x, 0) \, w_{2,i}$$

Each neuron $i$ fires when the input $x$ has a high dot product with key $w_{1,i}$, and contributes value $w_{2,i}$ to the output. This makes the FFN a sparse, high-dimensional associative memory.
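The key-value reading is a direct rearrangement of the matrix form. A sketch with toy random weights, verifying that the FFN output equals an explicit sum over memory slots:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_ff, d_model))   # rows w_{1,i} are "keys"
W2 = rng.normal(size=(d_model, d_ff))   # columns w_{2,i} are "values"
x = rng.normal(size=d_model)

# Standard matrix form of the two-layer ReLU FFN.
ffn = W2 @ np.maximum(W1 @ x, 0.0)

# Same computation as an explicit sum over key-value memory slots:
# each slot's value vector is weighted by its ReLU-gated key match.
kv = sum(max(W1[i] @ x, 0.0) * W2[:, i] for i in range(d_ff))
assert np.allclose(ffn, kv)

# Sparsity: with ReLU, only neurons whose key matches x contribute at all.
active = int((W1 @ x > 0).sum())
```

For a given input, roughly half the toy keys fire; in trained models the activation pattern is typically much sparser.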

Logit Lens

Definition

Logit Lens

The logit lens projects the residual stream at intermediate layer $l$ through the final unembedding matrix $W_U$ to obtain a distribution over the vocabulary:

$$p^{(l)} = \text{softmax}\!\left(W_U \cdot \text{LayerNorm}(x^{(l)})\right)$$

This shows what token the model would predict if forced to output at layer $l$.

The logit lens reveals that transformers build up their predictions incrementally. Early layers produce vague semantic categories. Middle layers narrow to the correct topic. Final layers select the specific token. This progression is visible as the correct token's rank improving (moving toward rank 1) across layers.
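The mechanics of the lens are just the definition applied at every layer. A sketch assuming you already have per-layer residual streams (here faked with random vectors; in practice you would cache them during a forward pass):

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, vocab, n_layers = 8, 20, 4
W_U = rng.normal(size=(vocab, d_model))           # unembedding matrix
# Stand-in for cached residual streams x^{(0)}, ..., x^{(L)} at one position.
resid = [rng.normal(size=d_model) for _ in range(n_layers + 1)]

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def softmax(z):
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

tops = []
for l, x in enumerate(resid):
    p = softmax(W_U @ layer_norm(x))               # p^{(l)} over the vocab
    tops.append(int(np.argmax(p)))                 # prediction "at layer l"
```

With a real model, plotting the target token's rank in `p` against `l` shows the incremental refinement described above.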

Attention Head Composition

Because attention heads in different layers all read from and write to the same residual stream, later heads can build on earlier heads' outputs. This is called composition and comes in three forms:

Q-composition: Head B in layer $l+1$ forms its queries using the output of Head A in layer $l$. The query at position $t$ depends on what Head A wrote at position $t$.

K-composition: Head B uses Head A's output to form keys. Positions that Head A has annotated become easier or harder to attend to.

V-composition: Head B uses Head A's output as part of the values it reads. The content that Head B extracts depends on what Head A has contributed.

Composition is what makes transformers more than a stack of independent attention operations. It allows induction heads (a pattern-matching circuit that requires K-composition across two layers) and other multi-step algorithms. Without composition, each layer would operate independently on the original embeddings, and the network would be no more expressive than a single wide layer.

Direct Logit Attribution

The residual stream decomposition enables direct logit attribution: quantifying how much each attention head and FFN block contributes to the logit of a specific output token.

For a target token $v$ with unembedding vector $u_v$, the logit contribution of attention head $(l, h)$ is:

$$\text{DLA}_{l,h}(v) = u_v^T a_{t,l,h}$$

where $a_{t,l,h}$ is the head's output vector. The total logit is:

$$\text{logit}(v) = u_v^T x_t^{(L)} = u_v^T x_t^{(0)} + \sum_{l,h} \text{DLA}_{l,h}(v) + \sum_l u_v^T f_{t,l}$$

This allows you to identify which heads "vote for" a particular token and which "vote against" it. A head with a large positive $\text{DLA}_{l,h}(v)$ strongly promotes token $v$; a head with a large negative DLA suppresses it.

DLA is fast to compute (just dot products) and gives a first approximation of circuit structure. Its limitation: it captures only direct contributions, not indirect ones (where head A influences head B which then influences the logit).
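Computationally, DLA is one batch of dot products over the cached head and FFN outputs. A sketch with toy random components (ignoring the final LayerNorm, which in practice must be folded in or approximated):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, n_layers, n_heads, vocab = 8, 2, 2, 10
W_U = rng.normal(size=(vocab, d_model))
# Stand-ins for cached additive writes at one token position t.
a = rng.normal(size=(n_layers, n_heads, d_model))  # head outputs a_{t,l,h}
f = rng.normal(size=(n_layers, d_model))           # FFN outputs f_{t,l}
x0 = rng.normal(size=d_model)                      # embedding term

v = 3                            # target token
u_v = W_U[v]                     # its unembedding vector

dla_heads = a @ u_v              # (n_layers, n_heads): per-head logit votes
dla_ffn = f @ u_v                # (n_layers,): per-FFN logit votes

# Sanity check: the direct contributions sum to the full logit.
x_final = x0 + a.sum(axis=(0, 1)) + f.sum(axis=0)
logit_v = u_v @ x_final
assert np.allclose(logit_v, u_v @ x0 + dla_heads.sum() + dla_ffn.sum())
```

Sorting `dla_heads` then gives a first-pass ranking of which heads promote or suppress token $v$, with the indirect-effect caveat above.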

Common Confusions

Watch Out

Residual connections are not just for gradient flow

In ResNets, skip connections were motivated by gradient flow during training. In transformers, the residual stream serves a different role: it is the communication bus between all heads and FFN blocks. Even if gradient flow were not an issue, the residual structure would still be necessary for the additive composition of different computations.

Watch Out

The logit lens is approximate, not exact

The logit lens applies the final unembedding to intermediate representations, but those representations were not trained to be decoded at intermediate layers. The fact that it works at all is informative, but the predictions at early layers should be interpreted qualitatively, not as precise probability estimates. The "tuned lens" (Belrose et al., 2023) trains a separate affine probe per layer for better intermediate decoding.

Exercises

ExerciseCore

Problem

A 24-layer transformer has 16 attention heads per layer. How many additive terms contribute to the final residual stream representation at a single token position? List the categories of terms.

ExerciseAdvanced

Problem

Explain why post-norm transformers are harder to train at depth 48+ compared to pre-norm, using the gradient flow argument. Specifically, what happens to $\partial \mathcal{L} / \partial x^{(1)}$ in both cases?

References

Canonical:

  • Elhage et al., A Mathematical Framework for Transformer Circuits (2021)
  • Geva et al., Transformer Feed-Forward Layers Are Key-Value Memories (2021)

Current:

  • nostalgebraist, The Logit Lens (2020, blog)
  • Belrose et al., Eliciting Latent Predictions from Transformers with the Tuned Lens (2023)
  • Xiong et al., On Layer Normalization in the Transformer Architecture (2020)

Last reviewed: April 2026