LLM Construction
Residual Stream and Transformer Internals
The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.
Why This Matters
The standard description of a transformer lists attention, FFN, and layer norm as sequential operations. This framing obscures the real computational structure. The key insight from mechanistic interpretability research is that the residual stream is the primary object. Attention and FFN blocks are side computations that read from the stream, transform the information, and write their results back.
This perspective clarifies why skip connections are not just a training trick (as in ResNets), but the core data bus of the architecture. It also explains why techniques like the logit lens work: you can read off meaningful predictions at every layer because the residual stream carries a running estimate of the output.
Mental Model
Think of the residual stream as a shared whiteboard. Each attention head and each FFN block can read what is on the whiteboard, do some computation, and add its contribution. No block erases what is already there; each block only adds. The final answer is whatever is on the whiteboard after all blocks have written to it.
Formal Setup
Residual Stream
For a transformer with $L$ layers, the residual stream $x^{(\ell)}$ at a given token position after layer $\ell$ is:

$$x^{(\ell)} = x^{(\ell-1)} + a^{(\ell)} + m^{(\ell)}$$

where $x^{(0)}$ is the token embedding plus positional encoding, $a^{(\ell)}$ is the attention block output, and $m^{(\ell)}$ is the FFN output. Each layer adds exactly these two terms to the stream.
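This additive bookkeeping can be sketched in a few lines of NumPy. The sublayers below are hypothetical random maps standing in for trained attention and FFN blocks; only the update structure matters here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 4

def attn(x):
    # Hypothetical stand-in for an attention block's output.
    W = rng.standard_normal((d_model, d_model)) * 0.1
    return W @ x

def ffn(x):
    # Hypothetical stand-in for an FFN block's output.
    W1 = rng.standard_normal((4 * d_model, d_model)) * 0.1
    W2 = rng.standard_normal((d_model, 4 * d_model)) * 0.1
    return W2 @ np.maximum(W1 @ x, 0)

x = rng.standard_normal(d_model)   # x^(0): embedding + position
contributions = [x.copy()]         # every term ever added to the stream
for _ in range(n_layers):
    a = attn(x)
    x = x + a
    contributions.append(a)
    m = ffn(x)
    x = x + m
    contributions.append(m)

# The final stream is exactly the sum of all recorded contributions.
assert np.allclose(x, np.sum(contributions, axis=0))
print(f"{len(contributions)} additive terms")
```

No block ever overwrites the stream; deleting any entry from `contributions` would change the final state by exactly that term.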
Main Theorems
Residual Stream Decomposition
Statement
The final residual stream can be written as a sum of $1 + L(H+1)$ terms:

$$x^{(L)} = x^{(0)} + \sum_{\ell=1}^{L} \sum_{h=1}^{H} a^{(\ell)}_h + \sum_{\ell=1}^{L} m^{(\ell)}$$

where $a^{(\ell)}_h$ is the output of attention head $h$ in layer $\ell$ and $m^{(\ell)}$ is the output of the FFN in layer $\ell$. For a 32-layer model with 32 heads per layer, this is $1 + 32 \cdot 32 + 32 = 1057$ additive terms.
Intuition
Every attention head and every FFN block contributes independently to the final representation. The final logits are a linear function of this sum (via the unembedding matrix). This means individual heads and FFN blocks have interpretable, separable contributions to the output distribution.
Proof Sketch
Expand the residual recurrence telescopically from $\ell = 1$ to $L$. Multi-head attention is already a sum over heads, $\mathrm{Attn}^{(\ell)}(x) = \sum_{h=1}^{H} a^{(\ell)}_h$, since each head's output is projected and added independently.
Why It Matters
This decomposition is the foundation of mechanistic interpretability. If you want to understand why the model predicts a particular token, you can examine which heads and FFN blocks contribute the most to the logit of that token. This is called the "logit attribution" technique.
Failure Mode
The decomposition ignores that each term is computed from previous terms, not independently. The outputs of layer $\ell$ depend on all of layers $1, \dots, \ell - 1$. So while the final representation is a linear sum, the individual terms are nonlinear functions of each other. Direct path patching or causal interventions are needed to establish causal (not just additive) contributions.
Pre-Norm vs Post-Norm
Post-Norm Transformer
The original transformer (Vaswani et al., 2017) applies layer normalization after the residual addition:

$$x^{(\ell)} = \mathrm{LN}\!\left(x^{(\ell-1)} + F^{(\ell)}(x^{(\ell-1)})\right)$$

where $F^{(\ell)}$ is the sublayer (attention or FFN).
This is called post-norm because normalization comes after the skip.
Pre-Norm Transformer
GPT-2 and most modern LLMs apply layer normalization before the sublayer:

$$x^{(\ell)} = x^{(\ell-1)} + F^{(\ell)}(\mathrm{LN}(x^{(\ell-1)}))$$
This is called pre-norm because normalization comes before the sublayer computation.
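The two orderings differ only in where the normalization sits relative to the skip connection. A minimal NumPy sketch, using a parameter-free layer norm and a hypothetical stand-in sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm (no learned scale/shift), for clarity.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def post_norm_layer(x, sublayer):
    # Vaswani et al. (2017): normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_layer(x, sublayer):
    # GPT-2 style: normalize the sublayer INPUT; the residual
    # stream itself is never normalized.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
sub = lambda h: 0.1 * h  # hypothetical stand-in sublayer

print(post_norm_layer(x, sub))  # always mean 0, variance ~1
print(pre_norm_layer(x, sub))   # keeps the raw residual's statistics
```

Note that in the pre-norm version the input `x` passes through to the output untouched by any normalization, which is exactly the identity path discussed below.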
Pre-Norm Gradient Flow
Statement
In a pre-norm transformer, the gradient of the loss $\mathcal{L}$ with respect to the residual stream at layer $\ell$ satisfies:

$$\frac{\partial \mathcal{L}}{\partial x^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial x^{(L)}} \left( I + \sum_{k=\ell+1}^{L} \frac{\partial F^{(k)}}{\partial x^{(\ell)}} \right)$$

where $F^{(k)}$ is the $k$-th sublayer (with its input normalization). The first term, $\partial \mathcal{L} / \partial x^{(L)}$, is a direct path from the loss to layer $\ell$ with no multiplicative degradation. This direct path does not exist in post-norm because the normalization applied after the residual addition breaks the identity shortcut.
Intuition
Pre-norm preserves a clean identity path in the residual stream. The gradient can flow directly from the output back to any layer without passing through normalization layers. Post-norm inserts normalization into the gradient path, which can cause gradient magnitude issues in deep networks.
Proof Sketch
For pre-norm, $x^{(\ell)} = x^{(\ell-1)} + F^{(\ell)}(x^{(\ell-1)})$, where $F^{(\ell)}$ includes the normalization internally. By the chain rule, $\partial x^{(\ell)} / \partial x^{(\ell-1)} = I + \partial F^{(\ell)} / \partial x^{(\ell-1)}$ contains an identity term from the skip connection. For post-norm, $x^{(\ell)} = \mathrm{LN}(x^{(\ell-1)} + F^{(\ell)}(x^{(\ell-1)}))$, and the layer norm Jacobian multiplies every path, so no pure identity component survives.
Why It Matters
Pre-norm is why modern LLMs can be trained at depths of 80+ layers without careful learning rate warmup schemes. The original post-norm transformer required warmup and was fragile beyond 12 layers. This architectural choice has as much practical impact as the attention mechanism itself.
Failure Mode
Pre-norm can produce representations that grow in magnitude across layers because there is no normalization on the residual stream itself. This is sometimes addressed by adding a final layer norm before the output projection. Some recent work (e.g., DeepNorm) combines pre-norm and post-norm properties for very deep models.
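This growth is easy to reproduce numerically. With a pre-norm update and random stand-in sublayers (hypothetical weights, not a trained model), the norm of the stream increases steadily with depth:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)
print(f"layer 0: ||x|| = {np.linalg.norm(x):.1f}")
for l in range(32):
    # Pre-norm update: the sublayer sees a normalized input, but its
    # output is added to an unnormalized stream, so ||x|| keeps growing.
    W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in sublayer
    x = x + W @ layer_norm(x)
    if (l + 1) % 8 == 0:
        print(f"layer {l + 1}: ||x|| = {np.linalg.norm(x):.1f}")
```

Each update adds a roughly constant-norm vector to the stream, so $\|x\|^2$ grows approximately linearly with depth; a final layer norm before the unembedding absorbs this.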
FFN as Key-Value Memory
Geva et al. (2021) showed that FFN layers can be interpreted as key-value memories. The first projection maps the input to a set of "keys." The activation function selects which keys are active. The second projection maps the active keys to "values" that are written to the residual stream.
For a two-layer FFN with ReLU:

$$\mathrm{FFN}(x) = \sum_{i=1}^{d_{\mathrm{ff}}} \mathrm{ReLU}(k_i \cdot x)\, v_i$$

where $k_i$ is the $i$-th row of the first weight matrix and $v_i$ is the $i$-th row of the second. Each neuron $i$ fires when the input has high dot product with key $k_i$, and contributes value $v_i$ (scaled by the activation) to the output. This makes the FFN a sparse, high-dimensional associative memory.
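The key-value reading is easy to verify numerically: the standard matrix form and the explicit per-neuron sum compute the same output. The weights below are random stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

W1 = rng.standard_normal((d_ff, d_model))  # row i is key k_i
W2 = rng.standard_normal((d_ff, d_model))  # row i is value v_i

def ffn(x):
    # Standard two-layer FFN: ReLU(W1 x), then project with W2.
    return np.maximum(W1 @ x, 0) @ W2

x = rng.standard_normal(d_model)
out = ffn(x)

# Key-value reading: only neurons whose key matches x contribute,
# each adding its value v_i scaled by the post-ReLU match score.
scores = np.maximum(W1 @ x, 0)
active = np.flatnonzero(scores)
manual = sum(scores[i] * W2[i] for i in active)
assert np.allclose(out, manual)
print(f"{len(active)}/{d_ff} neurons active")
```

With random weights roughly half the neurons fire; in trained models the activation pattern is typically much sparser.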
Logit Lens
Logit Lens
The logit lens projects the residual stream $x^{(\ell)}$ at intermediate layer $\ell$ through the final unembedding matrix $W_U$ to obtain a distribution over the vocabulary:

$$p^{(\ell)} = \mathrm{softmax}\!\left(\mathrm{LN}(x^{(\ell)})\, W_U\right)$$

This shows what token the model would predict if forced to output at layer $\ell$.
The logit lens reveals that transformers build up their predictions incrementally. Early layers produce vague semantic categories. Middle layers narrow to the correct topic. Final layers select the specific token. This progression is visible as the correct token's rank improving (moving toward rank 1) across layers.
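A minimal sketch of the lens mechanics, using a parameter-free layer norm and random stand-ins for the unembedding matrix and per-layer residual states (on random data the per-layer predictions are of course unrelated; in a real model the argmax token stabilizes across depth):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 6

W_U = rng.standard_normal((d_model, vocab))  # stand-in unembedding

# Stand-ins for the residual stream x^(l) at one position per layer.
states = [rng.standard_normal(d_model) for _ in range(n_layers)]

for l, x in enumerate(states):
    p = softmax(layer_norm(x) @ W_U)  # logit lens at layer l
    print(f"layer {l}: argmax token = {p.argmax()}, p = {p.max():.3f}")
```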
Attention Head Composition
Because attention heads in different layers all read from and write to the same residual stream, later heads can build on earlier heads' outputs. This is called composition and comes in three forms:
Q-composition: Head B in a later layer forms its queries using the output of Head A in an earlier layer. The query at position $i$ depends on what Head A wrote to the residual stream at position $i$.
K-composition: Head B uses Head A's output to form keys. Positions that Head A has annotated become easier or harder to attend to.
V-composition: Head B uses Head A's output as part of the values it reads. The content that Head B extracts depends on what Head A has contributed.
Composition is what makes transformers more than a stack of independent attention operations. It allows induction heads (a pattern-matching circuit that requires K-composition across two layers) and other multi-step algorithms. Without composition, each layer would operate independently on the original embeddings, and the network would be no more expressive than a single wide layer.
Direct Logit Attribution
The residual stream decomposition enables direct logit attribution: quantifying how much each attention head and FFN block contributes to the logit of a specific output token.
For a target token $t$ with unembedding vector $u_t$, the logit contribution of attention head $h$ in layer $\ell$ is:

$$\mathrm{DLA}_{\ell, h} = u_t \cdot a^{(\ell)}_h$$

where $a^{(\ell)}_h$ is the head's output vector. Because the logit is linear in the residual stream, the total logit is the sum over all additive terms in the decomposition (treating the final layer norm as folded in or approximated):

$$\mathrm{logit}_t = u_t \cdot x^{(L)} = u_t \cdot x^{(0)} + \sum_{\ell, h} \mathrm{DLA}_{\ell, h} + \sum_{\ell} u_t \cdot m^{(\ell)}$$

This allows you to identify which heads "vote for" a particular token and which "vote against" it. A head with large positive $\mathrm{DLA}_{\ell, h}$ strongly promotes token $t$; a head with large negative DLA suppresses it.
DLA is fast to compute (just dot products) and gives a first approximation of circuit structure. Its limitation: it captures only direct contributions, not indirect ones (where head A influences head B which then influences the logit).
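Because the logit is a dot product with a sum, DLA reduces to one dot product per component. A sketch with hypothetical component outputs (random stand-ins for the embedding, head, and FFN contributions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_components = 16, 6

# Hypothetical outputs written to the residual stream by individual
# components (embedding, attention heads, FFN blocks).
terms = rng.standard_normal((n_components, d_model))
u_t = rng.standard_normal(d_model)  # unembedding row for token t

dla = terms @ u_t                    # one dot product per component
total_logit = terms.sum(axis=0) @ u_t

# Linearity: per-component attributions sum exactly to the logit.
assert np.isclose(dla.sum(), total_logit)
print("components voting for t:", np.flatnonzero(dla > 0))
```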
Common Confusions
Residual connections are not just for gradient flow
In ResNets, skip connections were motivated by gradient flow during training. In transformers, the residual stream serves a different role: it is the communication bus between all heads and FFN blocks. Even if gradient flow were not an issue, the residual structure would still be necessary for the additive composition of different computations.
The logit lens is approximate, not exact
The logit lens applies the final unembedding to intermediate representations, but those representations were not trained to be decoded at intermediate layers. The fact that it works at all is informative, but the predictions at early layers should be interpreted qualitatively, not as precise probability estimates. The "tuned lens" (Belrose et al., 2023) trains a separate affine probe per layer for better intermediate decoding.
Exercises
Problem
A 24-layer transformer has 16 attention heads per layer. How many additive terms contribute to the final residual stream representation at a single token position? List the categories of terms.
Problem
Explain why post-norm transformers are harder to train at depth 48+ compared to pre-norm, using the gradient flow argument. Specifically, what happens to $\partial \mathcal{L} / \partial x^{(\ell)}$ in both cases?
References
Canonical:
- Elhage et al., A Mathematical Framework for Transformer Circuits (2021)
- Geva et al., Transformer Feed-Forward Layers Are Key-Value Memories (2021)
Current:
- nostalgebraist, The Logit Lens (2020, blog)
- Belrose et al., Eliciting Latent Predictions from Transformers with the Tuned Lens (2023)
- Xiong et al., On Layer Normalization in the Transformer Architecture (2020)
Next Topics
- Mechanistic interpretability: using the residual stream decomposition to reverse-engineer model behavior
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)