
Comparison

RMSNorm vs. LayerNorm

LayerNorm normalizes activations by centering (subtracting the mean) and scaling (dividing by the standard deviation), then applies a learned affine transformation. RMSNorm drops the mean-centering step and normalizes by the root mean square only. In transformer training, RMSNorm is roughly 10-15% faster per normalization at comparable model quality. LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm.

What Each Does

Both methods normalize activations within a single training example across the feature dimension. Given an activation vector $x \in \mathbb{R}^d$, they produce a normalized output with learned scale (and optionally shift) parameters.

LayerNorm (Ba et al., 2016) computes:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu = \frac{1}{d}\sum_{i=1}^d x_i$ is the mean, $\sigma^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2$ is the variance, $\gamma \in \mathbb{R}^d$ is the learned scale, and $\beta \in \mathbb{R}^d$ is the learned shift. This involves two reduction operations (mean, variance) and two learned parameter vectors.

RMSNorm (Zhang and Sennrich, 2019) computes:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x) + \epsilon}, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$$

where $\gamma \in \mathbb{R}^d$ is the learned scale. There is no mean subtraction and no learned shift $\beta$. This involves one reduction operation (root mean square) and one learned parameter vector.
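The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a production kernel; note that, following the RMSNorm formula as written here, $\epsilon$ is added outside the square root (some implementations, e.g. PyTorch's, place it inside instead).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Two reductions over the feature dimension: mean and variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # One reduction over the feature dimension: mean of squares.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True))
    return gamma * x / (rms + eps)

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
gamma, beta = np.ones(d), np.zeros(d)  # identity affine, for illustration

y_ln = layer_norm(x, gamma, beta)
y_rms = rms_norm(x, gamma)
print(y_ln.mean())   # ~0: LayerNorm centers before the affine step
print(y_rms.mean())  # generally nonzero: RMSNorm does not center
```

With identity affine parameters, the LayerNorm output has exactly zero mean and the RMSNorm output has unit root mean square; only the latter preserves the direction of $x$.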

Why Dropping the Mean Works

The key insight is that the re-centering operation in LayerNorm (subtracting $\mu$) does not contribute meaningfully to training stability or final model quality in transformers. The normalization's benefit comes primarily from controlling the scale of activations, not their center.

Consider the two operations separately. Scale normalization (dividing by $\sigma$ or RMS) prevents activations from growing unboundedly across layers, which would cause gradient explosion and numerical overflow. Mean centering ensures the normalized activations have zero mean. But the learned affine parameters $\gamma$ and $\beta$ in LayerNorm can shift the distribution arbitrarily after normalization, making the zero-mean property of the intermediate computation redundant. In RMSNorm, the learned $\gamma$ alone is sufficient to control both the scale and the effective center of the output distribution.

Empirically, Zhang and Sennrich (2019) showed that RMSNorm matches LayerNorm in translation quality across multiple WMT benchmarks, and subsequent work at scale (LLaMA, Gemma, Mistral) has confirmed this finding.

Side-by-Side Comparison

| Property | LayerNorm | RMSNorm |
| --- | --- | --- |
| Centering | Yes (subtracts mean $\mu$) | No |
| Scaling divisor | Standard deviation $\sigma$ | Root mean square |
| Learned parameters | $\gamma$ (scale) and $\beta$ (shift) | $\gamma$ (scale) only |
| Parameter count | $2d$ per layer | $d$ per layer |
| Reduction operations | 2 (mean, variance) | 1 (sum of squares) |
| Computational cost | Higher (two passes or fused kernel) | Lower (one pass) |
| Wall-clock speedup | Baseline | ~10-15% faster per norm operation |
| Output mean | Zero (before affine) | Not centered |
| Invariance | Invariant to shift and scale of input | Invariant to scale of input only |
| Used in | GPT-2, GPT-3, BERT, original Transformer | LLaMA 1/2/3, Gemma, Mistral, Qwen, PaLM |
| Year introduced | 2016 | 2019 |
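The invariance difference is easy to check numerically. A quick sketch (affine parameters omitted, since they do not affect the invariance properties):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Centering then scaling: removes both shift and scale of the input.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def rms_norm(x, eps=1e-5):
    # Scaling only: removes scale of the input, but not shift.
    return x / (np.sqrt(np.mean(x * x)) + eps)

x = np.array([1.0, -2.0, 3.0, 0.5])

# Both are (up to epsilon) invariant to rescaling the input:
assert np.allclose(layer_norm(5 * x), layer_norm(x), atol=1e-4)
assert np.allclose(rms_norm(5 * x), rms_norm(x), atol=1e-4)

# Only LayerNorm is invariant to shifting the input:
assert np.allclose(layer_norm(x + 10), layer_norm(x), atol=1e-4)
assert not np.allclose(rms_norm(x + 10), rms_norm(x), atol=1e-4)
```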

Computational Savings

The speedup from RMSNorm is modest per operation (~10-15%) but significant at scale because normalization runs at every layer, both after attention and after the feed-forward block. In a 70B parameter model with 80 layers, the normalization operations are called 160 times per forward pass per token. Over trillions of training tokens, a 10% speedup per norm compounds to meaningful savings in GPU-hours.
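A back-of-envelope sketch of that count, with the per-step norm share chosen purely for illustration (the 80-layer figure is from the text; `norm_share_of_step` is a hypothetical value, not a measurement):

```python
layers = 80
norms_per_layer = 2                  # one after attention, one after the FFN
norm_calls = layers * norms_per_layer
print(norm_calls)                    # 160 norm calls per token per forward pass

per_norm_speedup = 0.125             # midpoint of the ~10-15% range
norm_share_of_step = 0.02            # HYPOTHETICAL fraction of step time in norms
overall_saving = per_norm_speedup * norm_share_of_step
print(f"{overall_saving:.2%}")       # end-to-end saving under these assumptions
```

The end-to-end saving scales with how much of the step actually lives in normalization kernels, which varies by model size, sequence length, and kernel fusion.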

The reduction from two statistics ($\mu$, $\sigma^2$) to one ($\text{RMS}$) matters because reduction operations require cross-thread synchronization on GPUs. Each reduction involves a global read of the entire activation vector, partial sums across threads, and a synchronization barrier. Eliminating one reduction per norm call reduces memory bandwidth pressure and synchronization overhead.

Removing the $\beta$ parameter also halves the parameter count of the normalization layers. While norm parameters are a tiny fraction of total model parameters, the elimination simplifies the computation graph and kernel fusion.

Pre-Norm vs. Post-Norm

The choice of RMSNorm vs. LayerNorm is orthogonal to the choice of pre-norm vs. post-norm placement. Pre-norm applies normalization before the attention or feed-forward block:

$$x_{l+1} = x_l + \text{Block}(\text{Norm}(x_l))$$

Post-norm applies normalization after the residual addition:

$$x_{l+1} = \text{Norm}(x_l + \text{Block}(x_l))$$

Modern LLMs almost universally use pre-norm placement with RMSNorm. The original Transformer used post-norm with LayerNorm. Pre-norm is more stable during training because the residual path is unmodified, preserving gradient flow.
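The two placements can be sketched with a toy sub-layer standing in for attention or the FFN (the `0.5 * x` block is a placeholder, not a real transformer sub-layer):

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    return x / (np.sqrt(np.mean(x * x, axis=-1, keepdims=True)) + eps)

def block(x):
    # Toy stand-in for an attention or feed-forward sub-layer.
    return 0.5 * x

def pre_norm_step(x):
    # Residual path carries x untouched; only the block's input is normalized.
    return x + block(rms_norm(x))

def post_norm_step(x):
    # Normalization sits on the residual path itself.
    return rms_norm(x + block(x))

x = np.array([1.0, 2.0, -1.0, 0.5])
print(pre_norm_step(x))
print(post_norm_step(x))
```

In the pre-norm step, subtracting the block's contribution recovers `x` exactly, which is the "unmodified residual path" property; in the post-norm step, the output's scale is fixed by the norm regardless of what the residual carried.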

When LayerNorm is Still Preferred

LayerNorm retains an advantage in encoder-only architectures (BERT-style) where the zero-mean property of the intermediate representation may help with bidirectional attention patterns. Some evidence suggests that mean centering helps when the model must distinguish between "no information" (zero vector) and "information centered elsewhere," which matters more in masked language modeling than in autoregressive generation.

For new autoregressive transformer training, there is no compelling reason to use LayerNorm over RMSNorm.

Common Confusions

Watch Out

RMSNorm is not the same as dividing by the L2 norm

The RMS is $\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$, which equals $\|x\|_2 / \sqrt{d}$. Dividing by RMS gives $x \cdot \sqrt{d} / \|x\|_2$, which projects onto a sphere of radius $\sqrt{d}$, not radius 1. This distinction matters for the scale of activations entering downstream layers.
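A two-dimensional example makes the distinction concrete (using $x = (3, 4)$ so that $\|x\|_2 = 5$):

```python
import numpy as np

x = np.array([3.0, 4.0])            # ||x||_2 = 5, d = 2
d = x.size

rms = np.sqrt(np.mean(x * x))       # equals ||x||_2 / sqrt(d)
assert np.isclose(rms, np.linalg.norm(x) / np.sqrt(d))

y = x / rms                         # RMS-normalized vector
print(np.linalg.norm(y))            # -> 1.4142... = sqrt(2), not 1.0
```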

Watch Out

RMSNorm and LayerNorm are not interchangeable without retuning

Switching from LayerNorm to RMSNorm in a pretrained model requires retraining or at minimum fine-tuning. The learned $\gamma$ and $\beta$ parameters of LayerNorm encode information that a single $\gamma$ in RMSNorm represents differently. You cannot simply drop $\beta$ from a trained LayerNorm and expect the same behavior.
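A small numerical check illustrates why: reusing a LayerNorm's scale vector in an RMSNorm (and discarding the shift) produces different outputs. The weights below are made up for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    return gamma * x / (np.sqrt(np.mean(x * x)) + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.array([1.0, 0.5, 2.0, 1.0])   # pretend-trained scale
beta = np.array([0.1, -0.2, 0.3, 0.0])   # pretend-trained shift

# Same gamma, beta dropped: the outputs do not match.
print(layer_norm(x, gamma, beta))
print(rms_norm(x, gamma))
```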

Watch Out

The speedup is not from reducing parameter count

The normalization parameters are a negligible fraction of total model parameters. The speedup comes from eliminating one GPU reduction operation (the mean computation) per normalization call, reducing memory bandwidth and synchronization overhead.

References

  1. Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." arXiv:1607.06450. (Original LayerNorm paper.)
  2. Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. (RMSNorm paper, ablations showing mean centering is unnecessary.)
  3. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (RMSNorm with pre-normalization in LLaMA.)
  4. Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. (RMSNorm usage in PaLM architecture.)
  5. Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML 2015. (Context: batch normalization as the predecessor to LayerNorm.)
  6. Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. (Analysis of pre-norm vs. post-norm placement.)