What Each Configuration Does
Both configurations use the same building blocks: multi-head attention, feedforward networks, residual connections, and layer normalization. They differ only in where normalization is applied relative to the sublayer and the residual addition.
Post-norm (original transformer, Vaswani et al. 2017):
Normalization is applied after the residual addition. The sublayer output is added to the residual stream, then the sum is normalized.
Pre-norm (GPT-2, LLaMA, PaLM):
Normalization is applied to the input before it enters the sublayer. The sublayer output is added directly to the residual stream without further normalization.
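The two placements can be made concrete with a minimal numpy sketch (learned scale/shift parameters and the attention/FFN internals are omitted; `sublayer` stands in for either):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (learned scale/shift omitted)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original transformer: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # GPT-2 style: normalize the input, add the sublayer output directly.
    return x + sublayer(layer_norm(x))
```

Note that only the post-norm block guarantees a normalized output; the pre-norm block leaves the residual stream untouched.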
Gradient Flow: Why Pre-Norm Trains More Stably
Consider a transformer with $L$ layers. In the pre-norm configuration, the residual stream passes through the network with additive updates:

$$x_{l+1} = x_l + F_l(\mathrm{LN}(x_l))$$

Unrolling gives $x_L = x_l + \sum_{k=l}^{L-1} F_k(\mathrm{LN}(x_k))$, so the gradient of the loss with respect to $x_l$ includes a direct path through the identity:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(I + \sum_{k=l}^{L-1} \frac{\partial F_k(\mathrm{LN}(x_k))}{\partial x_l}\right)$$
The identity term ensures that gradients flow from the final layer to the first layer without attenuation, regardless of depth. This is analogous to how skip connections in ResNets prevent vanishing gradients.
In post-norm, the update is instead

$$x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big)$$

and the LayerNorm after the residual addition breaks this direct path. The gradient must pass through every LayerNorm operation:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{k=l}^{L-1} J_{\mathrm{LN}}^{(k)}\left(I + \frac{\partial F_k}{\partial x_k}\right)$$

where $J_{\mathrm{LN}}^{(k)}$ is the Jacobian of the LayerNorm at layer $k$.
Each LayerNorm rescales and recenters, which can amplify or attenuate gradients depending on the layer statistics. At initialization, the output variance of each sublayer must be carefully controlled to prevent the product of Jacobians from exploding or vanishing. This is why post-norm transformers require learning rate warmup: the early training phase needs small updates until the normalization statistics stabilize.
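The depth behavior can be seen numerically with a toy residual stack. This is an illustration, not a faithful transformer: random linear maps stand in for the attention/FFN sublayers, and learned norm parameters are omitted. The pre-norm residual stream grows with depth, while the post-norm stream is renormalized at every layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def random_sublayer(d, rng):
    # A random linear map with ~unit-scale output, standing in for attention/FFN.
    W = rng.normal(0, d ** -0.5, size=(d, d))
    return lambda x: x @ W

d, depth = 64, 48
x_pre = x_post = rng.normal(size=(1, d))
for _ in range(depth):
    f = random_sublayer(d, rng)
    x_pre = x_pre + f(layer_norm(x_pre))      # pre-norm: stream grows with depth
    x_post = layer_norm(x_post + f(x_post))   # post-norm: rescaled each layer

print(np.linalg.norm(x_pre), np.linalg.norm(x_post))
```

The post-norm stream stays at norm roughly $\sqrt{d}$ regardless of depth, while the pre-norm stream keeps accumulating additive updates.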
Side-by-Side Comparison
| Property | Post-Norm | Pre-Norm |
|---|---|---|
| Normalization placement | After residual add | Before sublayer |
| Gradient path | Through all LayerNorms | Direct identity shortcut |
| Training stability | Requires warmup, sensitive to LR | Stable without warmup |
| Final performance | Can be slightly better (when tuned) | Slightly lower ceiling |
| Maximum stable depth | ~12-24 layers without tricks | Hundreds of layers |
| Used in | Original transformer, BERT | GPT-2/3, LLaMA, PaLM, Gemini |
| Output scale | Normalized at each layer | Grows with depth (needs final LN) |
| Initialization sensitivity | High | Low |
Why Modern LLMs Use Pre-Norm
Three practical reasons dominate:
1. Training stability at scale. Post-norm transformers with more than 24 layers become difficult to train without careful initialization (e.g., scaling sublayer outputs by a depth-dependent factor such as $1/\sqrt{2L}$) and extended warmup. Pre-norm trains stably with standard initialization even at 100+ layers. When training a 175B-parameter model for weeks on thousands of GPUs, robustness to hyperparameter choices saves millions of dollars.
2. Removing warmup simplifies schedules. Post-norm requires thousands of warmup steps during which the learning rate increases linearly from near zero. With pre-norm, cosine decay can be used from the start, or with only minimal warmup. This removes one degree of freedom from tuning.
3. Predictable scaling. The scaling laws for LLMs were established primarily with pre-norm architectures. Post-norm introduces an additional source of variance in training dynamics that complicates scaling predictions.
The Performance Gap
Multiple studies (Xiong et al. 2020, Liu et al. 2020) found that post-norm achieves marginally better final performance when training succeeds. The hypothesis: post-norm's LayerNorm after residual addition constrains the representation more tightly, which acts as implicit regularization.
The gap is small (typically under 1% on downstream benchmarks) and closes at large scale. For LLMs with billions of parameters, the stability advantage of pre-norm far outweighs the marginal performance edge of post-norm. No team has published a successful post-norm model at the 100B+ scale without significant architectural modifications.
Variants and Hybrids
RMSNorm replaces LayerNorm in many pre-norm architectures (LLaMA, Gemma). It drops the mean-centering step:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot g$$
This is ~10-15% faster than LayerNorm and empirically equivalent for pre-norm transformers.
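A minimal numpy sketch of both norms makes the difference explicit (gain `g` included, bias omitted, as in typical RMSNorm implementations):

```python
import numpy as np

def layer_norm(x, g, eps=1e-5):
    # Standard LayerNorm: subtract the mean, divide by the std, apply gain g.
    mu = x.mean(-1, keepdims=True)
    var = ((x - mu) ** 2).mean(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps)

def rms_norm(x, g, eps=1e-5):
    # RMSNorm: skip the mean subtraction, rescale by the root-mean-square only.
    ms = (x ** 2).mean(-1, keepdims=True)
    return g * x / np.sqrt(ms + eps)
```

For input that already has zero mean across the feature dimension, the two coincide; the speedup comes from skipping the mean computation and subtraction.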
Sandwich-norm applies normalization both before and after the sublayer, combining aspects of both. This was used in some early GPT variants but is not standard.
DeepNorm (Wang et al. 2022) modifies post-norm by scaling the residual connection: $x_{l+1} = \mathrm{LN}(\alpha x_l + F_l(x_l))$ with $\alpha > 1$ chosen as a function of depth (e.g., $\alpha = (2N)^{1/4}$ for an $N$-layer encoder-only model). This stabilizes post-norm at depth (up to 1,000 layers) while retaining its performance advantage.
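The DeepNorm update can be sketched in numpy as follows; `alpha` is the depth-dependent constant greater than 1, and the accompanying weight-initialization scaling that DeepNorm also prescribes is omitted here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def deepnorm_block(x, sublayer, alpha):
    # Post-norm, but the residual branch is up-weighted by alpha > 1 so the
    # identity path dominates the sublayer update as depth grows.
    return layer_norm(alpha * x + sublayer(x))
```

With `alpha = 1.0` this reduces exactly to the standard post-norm block.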
Common Confusions
Pre-norm and post-norm are not about batch normalization
This comparison concerns layer normalization placement within transformer blocks, not the choice between BatchNorm and LayerNorm. Transformers use LayerNorm (or RMSNorm), normalizing across the feature dimension of a single token. BatchNorm normalizes across the batch dimension and is not used in standard transformers.
Pre-norm requires a final LayerNorm that post-norm does not
In pre-norm, the output of the last residual block is $x_L$, which is not normalized. A final LayerNorm before the output head is necessary. In post-norm, the last layer already passes through LayerNorm. Forgetting the final LayerNorm in pre-norm leads to unbounded output magnitudes that grow with depth.
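A minimal numpy sketch of a pre-norm forward pass makes the placement explicit (random linear maps stand in for the sublayers; learned scale/shift omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def prenorm_forward(x, sublayers):
    for f in sublayers:
        x = x + f(layer_norm(x))  # residual stream is never normalized in the loop
    return layer_norm(x)          # final LN bounds the output scale before the head
```

Dropping the last `layer_norm` call would feed the raw, depth-dependent residual stream directly into the output head.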
The warmup requirement for post-norm is not just about learning rate
Post-norm instability is not solely a learning rate issue. The fundamental problem is that gradients through stacked LayerNorms can have highly variable magnitude at initialization. Warmup gives the normalization statistics time to stabilize, but even with warmup, deep post-norm models (48+ layers) can diverge. The initialization of sublayer weights and the residual scaling both matter.
Pre-norm does not make all normalization irrelevant
Pre-norm moves normalization earlier in the computation graph but does not eliminate it. Without any normalization, transformer training is unstable regardless of placement. The pre-norm benefit is specifically about the interaction between normalization and the residual stream.
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. (Original post-norm transformer architecture.)
- Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. (Theoretical analysis of gradient flow in pre-norm vs post-norm, proves pre-norm gradient bounds.)
- Liu, L. et al. (2020). "Understanding the Difficulty of Training Transformers." EMNLP 2020. (Analysis of post-norm instability and the Admin initialization.)
- Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2, adopting pre-norm for decoder-only models.)
- Baevski, A. and Auli, M. (2019). "Adaptive Input Representations for Neural Language Modeling." ICLR 2019. (Early adoption of pre-norm for language modeling.)
- Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. (RMSNorm, used in LLaMA and other pre-norm architectures.)
- Wang, H. et al. (2022). "DeepNet: Scaling Transformers to 1,000 Layers." arXiv:2203.00555. (DeepNorm for stabilizing post-norm at extreme depth.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. (Pre-norm + RMSNorm as the modern LLM standard.)