
Comparison

RMSNorm vs. LayerNorm

LayerNorm normalizes activations by centering (subtracting the mean) and scaling (dividing by the standard deviation), then applies a learned affine transformation. RMSNorm drops the mean-centering step and normalizes by the root mean square only. In transformer training, RMSNorm is roughly 10-15% faster per normalization at comparable model quality. LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm.

What Each Does

Both methods normalize activations within a single training example across the feature dimension. Given an activation vector $x \in \mathbb{R}^d$, they produce a normalized output with learned scale (and optionally shift) parameters.

LayerNorm (Ba et al., 2016) computes:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu = \frac{1}{d}\sum_{i=1}^d x_i$ is the mean, $\sigma^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2$ is the variance, $\gamma \in \mathbb{R}^d$ is the learned scale, and $\beta \in \mathbb{R}^d$ is the learned shift. This involves two reduction operations (mean, variance) and two learned parameter vectors.

RMSNorm (Zhang and Sennrich, 2019) computes:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x) + \epsilon}, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$$

where $\gamma \in \mathbb{R}^d$ is the learned scale. There is no mean subtraction and no learned shift $\beta$. This involves one reduction operation (root mean square) and one learned parameter vector.
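The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a production kernel; note that, following the RMSNorm formula as written here, $\epsilon$ is added outside the square root (some implementations, e.g. PyTorch's, place it inside instead).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Two reductions over the feature dimension: mean and variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # One reduction over the feature dimension: mean of squares.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True))
    return gamma * x / (rms + eps)

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
gamma, beta = np.ones(d), np.zeros(d)  # identity affine, for illustration

y_ln = layer_norm(x, gamma, beta)
y_rms = rms_norm(x, gamma)
print(y_ln.mean())   # ~0: LayerNorm centers before the affine step
print(y_rms.mean())  # generally nonzero: RMSNorm does not center
```

With identity affine parameters, the LayerNorm output has exactly zero mean and the RMSNorm output has unit root mean square; only the latter preserves the direction of $x$.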

Why Dropping the Mean Works

The key insight is that the re-centering operation in LayerNorm (subtracting $\mu$) does not contribute meaningfully to training stability or final model quality in transformers. The normalization's benefit comes primarily from controlling the scale of activations, not their center.

Consider the two operations separately. Scale normalization (dividing by $\sigma$ or RMS) prevents activations from growing unboundedly across layers, which would cause gradient explosion and numerical overflow. Mean centering ensures the normalized activations have zero mean. But the learned affine parameters $\gamma$ and $\beta$ in LayerNorm can shift the distribution arbitrarily after normalization, making the zero-mean property of the intermediate computation redundant. In RMSNorm, the learned $\gamma$ alone is sufficient to control both the scale and the effective center of the output distribution.

Empirically, Zhang and Sennrich (2019) showed that RMSNorm matches LayerNorm in translation quality across multiple WMT benchmarks, and subsequent work at scale (LLaMA, Gemma, Mistral) has confirmed this finding.

Side-by-Side Comparison

| Property | LayerNorm | RMSNorm |
| --- | --- | --- |
| Centering | Yes (subtracts mean $\mu$) | No |
| Scaling divisor | Standard deviation $\sigma$ | Root mean square |
| Learned parameters | $\gamma$ (scale) and $\beta$ (shift) | $\gamma$ (scale) only |
| Parameter count | $2d$ per layer | $d$ per layer |
| Reduction operations | 2 (mean, variance) | 1 (sum of squares) |
| Computational cost | Higher (two passes or fused kernel) | Lower (one pass) |
| Wall-clock speedup | Baseline | ~10-15% faster per norm operation |
| Output mean | Zero (before affine) | Not centered |
| Invariance | Invariant to shift and scale of input | Invariant to scale of input only |
| Used in | GPT-2, GPT-3, BERT, original Transformer | LLaMA 1/2/3, Gemma, Mistral, Qwen, PaLM |
| Year introduced | 2016 | 2019 |
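The invariance difference is easy to check numerically. A quick sketch (affine parameters omitted, since they do not affect the invariance properties):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Centering then scaling: removes both shift and scale of the input.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def rms_norm(x, eps=1e-5):
    # Scaling only: removes scale of the input, but not shift.
    return x / (np.sqrt(np.mean(x * x)) + eps)

x = np.array([1.0, -2.0, 3.0, 0.5])

# Both are (up to epsilon) invariant to rescaling the input:
assert np.allclose(layer_norm(5 * x), layer_norm(x), atol=1e-4)
assert np.allclose(rms_norm(5 * x), rms_norm(x), atol=1e-4)

# Only LayerNorm is invariant to shifting the input:
assert np.allclose(layer_norm(x + 10), layer_norm(x), atol=1e-4)
assert not np.allclose(rms_norm(x + 10), rms_norm(x), atol=1e-4)
```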

Computational Savings

The speedup from RMSNorm is modest per operation (~10-15%) but significant at scale because normalization runs at every layer, both after attention and after the feed-forward block. In a 70B parameter model with 80 layers, the normalization operations are called 160 times per forward pass per token. Over trillions of training tokens, a 10% speedup per norm compounds to meaningful savings in GPU-hours.
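A back-of-envelope sketch of that count, with the per-step norm share chosen purely for illustration (the 80-layer figure is from the text; `norm_share_of_step` is a hypothetical value, not a measurement):

```python
layers = 80
norms_per_layer = 2                  # one after attention, one after the FFN
norm_calls = layers * norms_per_layer
print(norm_calls)                    # 160 norm calls per token per forward pass

per_norm_speedup = 0.125             # midpoint of the ~10-15% range
norm_share_of_step = 0.02            # HYPOTHETICAL fraction of step time in norms
overall_saving = per_norm_speedup * norm_share_of_step
print(f"{overall_saving:.2%}")       # end-to-end saving under these assumptions
```

The end-to-end saving scales with how much of the step actually lives in normalization kernels, which varies by model size, sequence length, and kernel fusion.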

The reduction from two statistics ($\mu$, $\sigma^2$) to one ($\text{RMS}$) matters because reduction operations require cross-thread synchronization on GPUs. Each reduction involves a global read of the entire activation vector, partial sums across threads, and a synchronization barrier. Eliminating one reduction per norm call reduces memory bandwidth pressure and synchronization overhead.

Removing the $\beta$ parameter also halves the parameter count of the normalization layers. While norm parameters are a tiny fraction of total model parameters, the elimination simplifies the computation graph and kernel fusion.

Pre-Norm vs. Post-Norm

The choice of RMSNorm vs. LayerNorm is orthogonal to the choice of pre-norm vs. post-norm placement. Pre-norm applies normalization before the attention or feed-forward block:

$$x_{l+1} = x_l + \text{Block}(\text{Norm}(x_l))$$

Post-norm applies normalization after the residual addition:

$$x_{l+1} = \text{Norm}(x_l + \text{Block}(x_l))$$

Modern LLMs almost universally use pre-norm placement with RMSNorm. The original Transformer used post-norm with LayerNorm. Pre-norm is more stable during training because the residual path is unmodified, preserving gradient flow.
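The two placements can be sketched with a toy sub-layer standing in for attention or the FFN (the `0.5 * x` block is a placeholder, not a real transformer sub-layer):

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    return x / (np.sqrt(np.mean(x * x, axis=-1, keepdims=True)) + eps)

def block(x):
    # Toy stand-in for an attention or feed-forward sub-layer.
    return 0.5 * x

def pre_norm_step(x):
    # Residual path carries x untouched; only the block's input is normalized.
    return x + block(rms_norm(x))

def post_norm_step(x):
    # Normalization sits on the residual path itself.
    return rms_norm(x + block(x))

x = np.array([1.0, 2.0, -1.0, 0.5])
print(pre_norm_step(x))
print(post_norm_step(x))
```

In the pre-norm step, subtracting the block's contribution recovers `x` exactly, which is the "unmodified residual path" property; in the post-norm step, the output's scale is fixed by the norm regardless of what the residual carried.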

When LayerNorm is Still Preferred

LayerNorm retains an advantage in encoder-only architectures (BERT-style) where the zero-mean property of the intermediate representation may help with bidirectional attention patterns. Some evidence suggests that mean centering helps when the model must distinguish between "no information" (zero vector) and "information centered elsewhere," which matters more in masked language modeling than in autoregressive generation.

For new autoregressive transformer training, there is no compelling reason to use LayerNorm over RMSNorm.

Common Confusions

Watch Out

RMSNorm is not the same as dividing by the L2 norm

The RMS is $\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$, which equals $\|x\|_2 / \sqrt{d}$. Dividing by RMS gives $x \cdot \sqrt{d} / \|x\|_2$, which projects onto a sphere of radius $\sqrt{d}$, not radius 1. This distinction matters for the scale of activations entering downstream layers.
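A two-dimensional example makes the distinction concrete (using $x = (3, 4)$ so that $\|x\|_2 = 5$):

```python
import numpy as np

x = np.array([3.0, 4.0])            # ||x||_2 = 5, d = 2
d = x.size

rms = np.sqrt(np.mean(x * x))       # equals ||x||_2 / sqrt(d)
assert np.isclose(rms, np.linalg.norm(x) / np.sqrt(d))

y = x / rms                         # RMS-normalized vector
print(np.linalg.norm(y))            # -> 1.4142... = sqrt(2), not 1.0
```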

Watch Out

RMSNorm and LayerNorm are not interchangeable without retuning

Switching from LayerNorm to RMSNorm in a pretrained model requires retraining or at minimum fine-tuning. The learned $\gamma$ and $\beta$ parameters of LayerNorm encode information that a single $\gamma$ in RMSNorm represents differently. You cannot simply drop $\beta$ from a trained LayerNorm and expect the same behavior.
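A small numerical check illustrates why: reusing a LayerNorm's scale vector in an RMSNorm (and discarding the shift) produces different outputs. The weights below are made up for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    return gamma * x / (np.sqrt(np.mean(x * x)) + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.array([1.0, 0.5, 2.0, 1.0])   # pretend-trained scale
beta = np.array([0.1, -0.2, 0.3, 0.0])   # pretend-trained shift

# Same gamma, beta dropped: the outputs do not match.
print(layer_norm(x, gamma, beta))
print(rms_norm(x, gamma))
```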

Watch Out

The speedup is not from reducing parameter count

The normalization parameters are a negligible fraction of total model parameters. The speedup comes from eliminating one GPU reduction operation (the mean computation) per normalization call, reducing memory bandwidth and synchronization overhead.

References

  1. Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." arXiv:1607.06450. (Original LayerNorm paper.)
  2. Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. (RMSNorm paper, ablations showing mean centering is unnecessary.)
  3. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Section 2.1 (RMSNorm with pre-normalization in LLaMA.)
  4. Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. (RMSNorm usage in PaLM architecture.)
  5. Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML 2015. (Context: batch normalization as the predecessor to LayerNorm.)
  6. Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. (Analysis of pre-norm vs. post-norm placement.)