What Each Configuration Does
Both configurations use the same building blocks: multi-head attention, feedforward networks, residual connections, and layer normalization. They differ only in where normalization is applied relative to the sublayer and the residual addition.
Post-norm (original transformer, Vaswani et al. 2017):
Normalization is applied after the residual addition. The sublayer output is added to the residual stream, then the sum is normalized.
Pre-norm (GPT-2, LLaMA, PaLM):
Normalization is applied to the input before it enters the sublayer. The sublayer output is added directly to the residual stream without further normalization.
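The two placements can be made concrete with a minimal numpy sketch (learned scale/shift parameters and the attention/FFN internals are omitted; `sublayer` stands in for either):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (learned scale/shift omitted)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original transformer: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # GPT-2 style: normalize the input, add the sublayer output directly.
    return x + sublayer(layer_norm(x))
```

Note that only the post-norm block guarantees a normalized output; the pre-norm block leaves the residual stream untouched.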
Gradient Flow: Why Pre-Norm Trains More Stably
Consider a transformer with $L$ layers. In the pre-norm configuration, the residual stream passes through the network with additive updates:

$$x_{l+1} = x_l + F_l(\mathrm{LN}(x_l))$$

Unrolling gives $x_L = x_l + \sum_{k=l}^{L-1} F_k(\mathrm{LN}(x_k))$, so the gradient of the loss with respect to $x_l$ includes a direct path through the identity:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(I + \sum_{k=l}^{L-1} \frac{\partial F_k(\mathrm{LN}(x_k))}{\partial x_l}\right)$$
The identity term ensures that gradients flow from the final layer to the first layer without attenuation, regardless of depth. This is analogous to how skip connections in ResNets prevent vanishing gradients.
In post-norm, the update is instead

$$x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big)$$

and the LayerNorm after the residual addition breaks this direct path. The gradient must pass through every LayerNorm operation:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{k=l}^{L-1} J_{\mathrm{LN}}^{(k)}\left(I + \frac{\partial F_k}{\partial x_k}\right)$$

where $J_{\mathrm{LN}}^{(k)}$ is the Jacobian of the LayerNorm at layer $k$.
Each LayerNorm rescales and recenters, which can amplify or attenuate gradients depending on the layer statistics. At initialization, the output variance of each sublayer must be carefully controlled to prevent the product of Jacobians from exploding or vanishing. This is why post-norm transformers require learning rate warmup: the early training phase needs small updates until the normalization statistics stabilize.
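The depth behavior can be seen numerically with a toy residual stack. This is an illustration, not a faithful transformer: random linear maps stand in for the attention/FFN sublayers, and learned norm parameters are omitted. The pre-norm residual stream grows with depth, while the post-norm stream is renormalized at every layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def random_sublayer(d, rng):
    # A random linear map with ~unit-scale output, standing in for attention/FFN.
    W = rng.normal(0, d ** -0.5, size=(d, d))
    return lambda x: x @ W

d, depth = 64, 48
x_pre = x_post = rng.normal(size=(1, d))
for _ in range(depth):
    f = random_sublayer(d, rng)
    x_pre = x_pre + f(layer_norm(x_pre))      # pre-norm: stream grows with depth
    x_post = layer_norm(x_post + f(x_post))   # post-norm: rescaled each layer

print(np.linalg.norm(x_pre), np.linalg.norm(x_post))
```

The post-norm stream stays at norm roughly $\sqrt{d}$ regardless of depth, while the pre-norm stream keeps accumulating additive updates.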
Side-by-Side Comparison
| Property | Post-Norm | Pre-Norm |
|---|---|---|
| Normalization placement | After residual add | Before sublayer |
| Gradient path | Through all LayerNorms | Direct identity shortcut |
| Training stability | Requires warmup, sensitive to LR | Stable without warmup |
| Final performance | Can be slightly better (when tuned) | Slightly lower ceiling |
| Maximum stable depth | ~12-24 layers without tricks | Hundreds of layers |
| Used in | Original transformer, BERT | GPT-2/3, LLaMA, PaLM, Gemini |
| Output scale | Normalized at each layer | Grows with depth (needs final LN) |
| Initialization sensitivity | High | Low |
Why Modern LLMs Use Pre-Norm
Three practical reasons dominate:
1. Training stability at scale. Post-norm transformers with more than 24 layers become difficult to train without careful initialization (e.g., scaling sublayer outputs by a depth-dependent factor such as $1/\sqrt{2L}$) and extended warmup. Pre-norm trains stably with standard initialization even at 100+ layers. When training a 175B-parameter model for weeks on thousands of GPUs, robustness to hyperparameter choices saves millions of dollars.
2. Removing warmup simplifies schedules. Post-norm requires thousands of warmup steps during which the learning rate increases linearly from near zero. With pre-norm, cosine decay can be used from the start, or with only minimal warmup. This removes one degree of freedom from tuning.
3. Predictable scaling. The scaling laws for LLMs were established primarily with pre-norm architectures. Post-norm introduces an additional source of variance in training dynamics that complicates scaling predictions.
The Performance Gap
Multiple studies (Xiong et al. 2020, Liu et al. 2020) found that post-norm achieves marginally better final performance when training succeeds. The hypothesis: post-norm's LayerNorm after residual addition constrains the representation more tightly, which acts as implicit regularization.
The gap is small (typically under 1% on downstream benchmarks) and closes at large scale. For LLMs with billions of parameters, the stability advantage of pre-norm far outweighs the marginal performance edge of post-norm. No team has published a successful post-norm model at the 100B+ scale without significant architectural modifications.
Variants and Hybrids
RMSNorm replaces LayerNorm in many pre-norm architectures (LLaMA, Gemma). It drops the mean-centering step:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot g$$
This is ~10-15% faster than LayerNorm and empirically equivalent for pre-norm transformers.
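A minimal numpy sketch of both norms makes the difference explicit (gain `g` included, bias omitted, as in typical RMSNorm implementations):

```python
import numpy as np

def layer_norm(x, g, eps=1e-5):
    # Standard LayerNorm: subtract the mean, divide by the std, apply gain g.
    mu = x.mean(-1, keepdims=True)
    var = ((x - mu) ** 2).mean(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps)

def rms_norm(x, g, eps=1e-5):
    # RMSNorm: skip the mean subtraction, rescale by the root-mean-square only.
    ms = (x ** 2).mean(-1, keepdims=True)
    return g * x / np.sqrt(ms + eps)
```

For input that already has zero mean across the feature dimension, the two coincide; the speedup comes from skipping the mean computation and subtraction.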
Sandwich-norm applies normalization both before and after the sublayer, combining aspects of both. This was used in some early GPT variants but is not standard.
DeepNorm (Wang et al. 2022) modifies post-norm by scaling the residual connection: $x_{l+1} = \mathrm{LN}(\alpha x_l + F_l(x_l))$ with $\alpha > 1$ chosen as a function of depth (e.g., $\alpha = (2N)^{1/4}$ for an $N$-layer encoder-only model). This stabilizes post-norm at depth (up to 1,000 layers) while retaining its performance advantage.
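The DeepNorm update can be sketched in numpy as follows; `alpha` is the depth-dependent constant greater than 1, and the accompanying weight-initialization scaling that DeepNorm also prescribes is omitted here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def deepnorm_block(x, sublayer, alpha):
    # Post-norm, but the residual branch is up-weighted by alpha > 1 so the
    # identity path dominates the sublayer update as depth grows.
    return layer_norm(alpha * x + sublayer(x))
```

With `alpha = 1.0` this reduces exactly to the standard post-norm block.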
Common Confusions
Pre-norm and post-norm are not about batch normalization
This comparison concerns layer normalization placement within transformer blocks, not the choice between BatchNorm and LayerNorm. Transformers use LayerNorm (or RMSNorm), normalizing across the feature dimension of a single token. BatchNorm normalizes across the batch dimension and is not used in standard transformers.
Pre-norm requires a final LayerNorm that post-norm does not
In pre-norm, the output of the last residual block is $x_L$, which is not normalized. A final LayerNorm before the output head is necessary. In post-norm, the last layer already passes through LayerNorm. Forgetting the final LayerNorm in pre-norm leads to unbounded output magnitudes that grow with depth.
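A minimal numpy sketch of a pre-norm forward pass makes the placement explicit (random linear maps stand in for the sublayers; learned scale/shift omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def prenorm_forward(x, sublayers):
    for f in sublayers:
        x = x + f(layer_norm(x))  # residual stream is never normalized in the loop
    return layer_norm(x)          # final LN bounds the output scale before the head
```

Dropping the last `layer_norm` call would feed the raw, depth-dependent residual stream directly into the output head.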
The warmup requirement for post-norm is not just about learning rate
Post-norm instability is not solely a learning rate issue. The fundamental problem is that gradients through stacked LayerNorms can have highly variable magnitude at initialization. Warmup gives the normalization statistics time to stabilize, but even with warmup, deep post-norm models (48+ layers) can diverge. The initialization of sublayer weights and the residual scaling both matter.
Pre-norm does not make all normalization irrelevant
Pre-norm moves normalization earlier in the computation graph but does not eliminate it. Without any normalization, transformer training is unstable regardless of placement. The pre-norm benefit is specifically about the interaction between normalization and the residual stream.
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. (Original post-norm transformer architecture.)
- Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. (Theoretical analysis of gradient flow in pre-norm vs post-norm, proves pre-norm gradient bounds.)
- Liu, L. et al. (2020). "Understanding the Difficulty of Training Transformers." EMNLP 2020. (Analysis of post-norm instability and the Admin initialization.)
- Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2, adopting pre-norm for decoder-only models.)
- Baevski, A. and Auli, M. (2019). "Adaptive Input Representations for Neural Language Modeling." ICLR 2019. (Early adoption of pre-norm for language modeling.)
- Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. (RMSNorm, used in LLaMA and other pre-norm architectures.)
- Wang, H. et al. (2022). "DeepNet: Scaling Transformers to 1,000 Layers." arXiv:2203.00555. (DeepNorm for stabilizing post-norm at extreme depth.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. (Pre-norm + RMSNorm as the modern LLM standard.)