What Each Does
Both methods normalize activations within a single training example across the feature dimension. Given an activation vector $x \in \mathbb{R}^d$, they produce a normalized output with learned scale (and optionally shift) parameters.
LayerNorm (Ba et al., 2016) computes:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ is the mean, $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$ is the variance, $\gamma$ is the learned scale, and $\beta$ is the learned shift. This involves two reduction operations (mean, variance) and two learned parameter vectors.
RMSNorm (Zhang and Sennrich, 2019) computes:

$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\mathrm{RMS}(x)}, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$

where $\gamma$ is the learned scale. There is no mean subtraction and no learned shift $\beta$. This involves one reduction operation (root mean square) and one learned parameter vector.
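The two definitions can be sketched side by side in NumPy. This is a minimal reference implementation for illustration, not an optimized kernel; `eps` is the usual numerical-stability constant:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: center by the mean, scale by the standard deviation,
    then apply the learned affine (gamma, beta)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: scale by the root mean square only -- no centering, no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

Note that `rms_norm` performs a single reduction (the mean of squares), while `layer_norm` needs both the mean and the variance.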
Why Dropping the Mean Works
The key insight is that the re-centering operation in LayerNorm (subtracting $\mu$) does not contribute meaningfully to training stability or final model quality in transformers. The normalization's benefit comes primarily from controlling the scale of activations, not their center.
Consider the two operations separately. Scale normalization (dividing by $\sigma$ or $\mathrm{RMS}(x)$) prevents activations from growing unboundedly across layers, which would cause gradient explosion and numerical overflow. Mean centering ensures the normalized activations have zero mean. But the learned affine parameters $\gamma$ and $\beta$ in LayerNorm can shift the distribution arbitrarily after normalization, making the zero-mean property of the intermediate computation redundant. In RMSNorm, the learned $\gamma$ alone is sufficient to control both the scale and the effective center of the output distribution.
Empirically, Zhang and Sennrich (2019) showed that RMSNorm matches LayerNorm in translation quality across multiple WMT benchmarks, and subsequent work at scale (LLaMA, Gemma, Mistral) has confirmed this finding.
Side-by-Side Comparison
| Property | LayerNorm | RMSNorm |
|---|---|---|
| Centering | Yes (subtracts mean $\mu$) | No |
| Scaling divisor | Standard deviation $\sigma$ | Root mean square $\mathrm{RMS}(x)$ |
| Learned parameters | $\gamma$ (scale) and $\beta$ (shift) | $\gamma$ (scale) only |
| Parameter count | $2d$ per layer | $d$ per layer |
| Reduction operations | 2 (mean, variance) | 1 (sum of squares) |
| Computational cost | Higher (two passes or fused kernel) | Lower (one pass) |
| Wall-clock speedup | Baseline | ~10-15% faster per norm operation |
| Output mean | Zero (before affine) | Not centered |
| Invariance | Invariant to shift and scale of input | Invariant to scale of input only |
| Used in | GPT-2, GPT-3, BERT, original Transformer | LLaMA 1/2/3, Gemma, Mistral, Qwen, PaLM |
| Year introduced | 2016 | 2019 |
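The invariance row can be checked numerically: shifting every input coordinate by a constant leaves LayerNorm's output unchanged but changes RMSNorm's, while both are invariant to rescaling. A small NumPy check, with the norms defined inline (affine parameters omitted, since invariance concerns the normalization itself):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def rms_norm(x, eps=1e-5):
    return x / np.sqrt(np.mean(x ** 2) + eps)

x = np.array([0.5, -1.0, 2.0, 0.1])
shift = 3.0

# LayerNorm is invariant to adding a constant to the input ...
print(np.allclose(layer_norm(x + shift), layer_norm(x)))  # True
# ... RMSNorm is not.
print(np.allclose(rms_norm(x + shift), rms_norm(x)))      # False
# Both are invariant to rescaling the input.
print(np.allclose(rms_norm(2.0 * x), rms_norm(x)))        # True
```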
Computational Savings
The speedup from RMSNorm is modest per operation (~10-15%) but significant at scale because normalization runs at every layer, both after attention and after the feed-forward block. In a 70B parameter model with 80 layers, the normalization operations are called 160 times per forward pass per token. Over trillions of training tokens, a 10% speedup per norm compounds to meaningful savings in GPU-hours.
The reduction from two statistics ($\mu$, $\sigma^2$) to one ($\mathrm{RMS}(x)$) matters because reduction operations require cross-thread synchronization on GPUs. Each reduction involves a global read of the entire activation vector, partial sums across threads, and a synchronization barrier. Eliminating one reduction per norm call reduces memory bandwidth pressure and synchronization overhead.
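To make the one-reduction-versus-two point concrete, here is a sketch in plain Python (illustrative scalar code, not GPU code) of the statistics each norm must accumulate over the activation vector:

```python
def layernorm_stats(x):
    """LayerNorm needs mean and variance. A naive implementation makes two
    passes: one reduction for the mean, then a second for the variance,
    which depends on the result of the first."""
    d = len(x)
    mean = sum(x) / d                             # reduction 1
    var = sum((v - mean) ** 2 for v in x) / d     # reduction 2
    return mean, var

def rmsnorm_stats(x):
    """RMSNorm needs only the mean of squares: a single reduction with
    no dependency on an earlier pass."""
    d = len(x)
    return sum(v * v for v in x) / d              # the only reduction
```

A fused LayerNorm kernel can compute both statistics in one pass via $\mathrm{E}[x^2] - \mathrm{E}[x]^2$, but it still maintains two accumulators and performs the mean subtraction; RMSNorm needs only the sum of squares.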
Removing the $\beta$ parameter also halves the parameter count of the normalization layers. While norm parameters are a tiny fraction of total model parameters, the elimination simplifies the computation graph and kernel fusion.
Pre-Norm vs. Post-Norm
The choice of RMSNorm vs. LayerNorm is orthogonal to the choice of pre-norm vs. post-norm placement. Pre-norm applies normalization before the attention or feed-forward block:

$$x \leftarrow x + \mathrm{Sublayer}(\mathrm{Norm}(x))$$
Post-norm applies normalization after the residual addition:

$$x \leftarrow \mathrm{Norm}(x + \mathrm{Sublayer}(x))$$
Modern LLMs almost universally use pre-norm placement with RMSNorm. The original Transformer used post-norm with LayerNorm. Pre-norm is more stable during training because the residual path is left unmodified, preserving gradient flow (Xiong et al., 2020).
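The two placements differ only in where the norm sits relative to the residual addition. A schematic sketch, with `sublayer` standing in for either attention or the feed-forward block (both function names are assumptions for illustration):

```python
def pre_norm_block(x, sublayer, norm):
    # Pre-norm: the residual path carries x through untouched;
    # normalization is applied only on the branch into the sublayer.
    return x + sublayer(norm(x))

def post_norm_block(x, sublayer, norm):
    # Post-norm: normalization sits on the residual path itself,
    # after the addition (original Transformer placement).
    return norm(x + sublayer(x))
```

In the pre-norm form, the identity path from input to output is never rescaled, which is the structural reason gradients propagate cleanly through deep stacks.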
When LayerNorm is Still Preferred
LayerNorm retains an advantage in encoder-only architectures (BERT-style) where the zero-mean property of the intermediate representation may help with bidirectional attention patterns. Some evidence suggests that mean centering helps when the model must distinguish between "no information" (zero vector) and "information centered elsewhere," which matters more in masked language modeling than in autoregressive generation.
For new autoregressive transformer training, there is no compelling reason to use LayerNorm over RMSNorm.
Common Confusions
RMSNorm is not the same as dividing by the L2 norm
The RMS is $\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$, which equals $\|x\|_2 / \sqrt{d}$. Dividing by RMS gives $\sqrt{d}\, x / \|x\|_2$, which projects $x$ onto a sphere of radius $\sqrt{d}$, not radius 1. This distinction matters for the scale of activations entering downstream layers.
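The relationship can be verified directly in NumPy:

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0, 0.0])
d = x.size

rms = np.sqrt(np.mean(x ** 2))   # RMS of x
l2 = np.linalg.norm(x)           # L2 norm of x

# RMS(x) == ||x||_2 / sqrt(d)
assert np.isclose(rms, l2 / np.sqrt(d))

# Dividing by the RMS lands on a sphere of radius sqrt(d), not 1:
y = x / rms
print(np.isclose(np.linalg.norm(y), np.sqrt(d)))  # True
```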
RMSNorm and LayerNorm are not interchangeable without retuning
Switching from LayerNorm to RMSNorm in a pretrained model requires retraining or, at minimum, fine-tuning. The learned $\gamma$ and $\beta$ parameters of LayerNorm encode information that a single $\gamma$ in RMSNorm represents differently. You cannot simply drop $\beta$ from a trained LayerNorm and expect the same behavior.
The speedup is not from reducing parameter count
The normalization parameters are a negligible fraction of total model parameters. The speedup comes from eliminating one GPU reduction operation (the mean computation) per normalization call, reducing memory bandwidth and synchronization overhead.
References
- Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." arXiv:1607.06450. (Original LayerNorm paper.)
- Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. (RMSNorm paper, ablations showing mean centering is unnecessary.)
- Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (RMSNorm with pre-normalization in LLaMA.)
- Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. (RMSNorm usage in PaLM architecture.)
- Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML 2015. (Context: batch normalization as the predecessor to LayerNorm.)
- Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. (Analysis of pre-norm vs. post-norm placement.)