
Training Techniques

Batch Normalization

Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.


Why This Matters

Batch normalization is one of the most widely adopted practical innovations in deep learning. Introduced by Ioffe and Szegedy (2015), it enabled the training of much deeper networks by stabilizing training dynamics. Nearly every modern convolutional architecture uses it (or a close variant). Understanding what it does and why it helps, especially since the original explanation turned out to be incomplete, is essential for any practitioner or theorist.

Figure: Raw activations (feature columns h1-h8, batch rows b1-b6) before and after BatchNorm. Each feature column is normalized independently across the batch dimension via (x - μ)/σ, giving per-feature μ = 0, σ = 1; the network then learns γ·x̂ + β.

Mental Model

Think of each layer in a deep network as receiving inputs from the layer before it. If those input distributions shift wildly during training (because the previous layers' weights keep changing), the current layer is constantly chasing a moving target. Batch norm forces each layer's inputs to have consistent statistics (zero mean, unit variance) before the layer processes them, then lets the network learn the optimal mean and variance through learnable parameters.

The Batch Normalization Transform

Definition

Batch Normalization (Training)

Given a mini-batch B = \{x_1, \ldots, x_m\} of activations at a particular layer, the batch norm transform computes:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y_i = \gamma \hat{x}_i + \beta = \text{BN}_{\gamma,\beta}(x_i)

Here \gamma (scale) and \beta (shift) are learnable parameters, and \epsilon is a small constant for numerical stability. Without \gamma and \beta, the network could not represent the identity transform through the normalization layer.
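As a concrete illustration of the three equations above, here is a minimal pure-Python sketch of the training-mode transform for a single feature channel (the function name and scalar-batch setup are ours, not a particular library's API):

```python
import math

def batchnorm_train(batch, gamma, beta, eps=1e-5):
    """Training-mode BN for one feature channel over a mini-batch
    of scalar activations (toy sketch, not a library API)."""
    m = len(batch)
    mu = sum(batch) / m                           # mu_B
    var = sum((x - mu) ** 2 for x in batch) / m   # sigma_B^2 (biased)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]

# With gamma = 1, beta = 0 the outputs have (approximately)
# zero mean and unit variance across the batch.
out = batchnorm_train([2.0, 4.0, 6.0, 8.0], gamma=1.0, beta=0.0)
```

After the learnable affine step, the batch mean of the outputs equals \beta and the batch standard deviation is approximately \gamma.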

Definition

Batch Normalization (Inference)

At test time, we do not use the current mini-batch statistics. Instead, BN uses running averages accumulated during training:

y = \gamma \cdot \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}} + \beta

where \mathbb{E}[x] and \text{Var}[x] are the running mean and variance computed as exponential moving averages over training batches. This makes inference deterministic and independent of batch composition.
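A sketch of both pieces in pure Python (names ours). The momentum convention here, where the momentum weights the new batch statistics, is one common library default, but conventions vary between frameworks:

```python
import math

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """EMA update of the running statistics during training.
    Here `momentum` weights the NEW batch statistics; some
    libraries use the opposite convention."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    return ((1 - momentum) * running_mean + momentum * mu,
            (1 - momentum) * running_var + momentum * var)

def batchnorm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Inference-mode BN: the output for x is deterministic and
    independent of whatever batch x happens to arrive in."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta
```

Because `batchnorm_eval` touches only the stored statistics, calling it twice on the same input always gives the same output, which is exactly the determinism property described above.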

The Internal Covariate Shift Story (and Why It Is Disputed)

The original paper motivated BN by the concept of internal covariate shift (ICS): as parameters of earlier layers change during training, the distribution of inputs to later layers shifts, forcing those layers to continuously adapt. BN was supposed to fix this by stabilizing those distributions.

However, Santurkar et al. (2018) showed experimentally that:

  1. BN does not significantly reduce internal covariate shift
  2. Networks with BN can exhibit more ICS than networks without it
  3. BN still dramatically helps training despite this

So the ICS story is at best incomplete. The field has since proposed several competing explanations: loss-landscape smoothing (Santurkar et al. 2018), length-direction decoupling of the weight vector (Kohler et al. 2019), and implicit learning-rate scheduling via batch statistics. No single explanation has become consensus; each captures part of the effect.

Why Batch Normalization Actually Helps (Leading Hypothesis)

Proposition

BN Smooths the Loss Landscape

Statement

Under the loss-landscape-smoothing hypothesis (Santurkar et al. 2018), batch normalization reduces the variation of the loss and its gradients: the effective Lipschitz constant of the loss decreases, and the gradients themselves become more Lipschitz (better \beta-smoothness). This is currently the best-supported mechanistic explanation, not a closed question.

Intuition

A smoother loss landscape means that gradient descent steps are more predictive: the gradient at your current point is a better approximation of what the loss will actually do when you take a step. This is why BN allows higher learning rates: the landscape is well-behaved enough that large steps do not overshoot.

Proof Sketch

Santurkar et al. (2018) prove that, for a network with BN and under their assumptions, the gradient of the loss satisfies \|\nabla L_{\text{BN}}\| \leq \|\nabla L\| and the curvature along the gradient direction shrinks. The key observation is that normalization decouples the length of the weight vector from its direction, preventing the loss from growing unboundedly with weight magnitude. Follow-up work (Yang et al. 2019, Kohler et al. 2019) refined the picture and identified cases where the smoothing bounds are loose.

Why It Matters

This gives a plausible mechanism for the practical benefits of BN: faster convergence, tolerance for higher learning rates, and reduced sensitivity to weight initialization. It is the leading hypothesis, not the final word.

Failure Mode

This analysis is for feedforward networks with BN in standard positions. The smoothing effect can be reduced or absent in architectures where BN is applied in unusual ways, or with very small batch sizes where the batch statistics are noisy.

Practical Benefits

  1. Higher learning rates: smoother landscape tolerates larger steps
  2. Less sensitivity to initialization: normalization prevents activations from exploding or vanishing
  3. Regularization effect: the noise in mini-batch statistics acts as a mild regularizer (similar in spirit to dropout)
  4. Faster convergence: empirically 5-10x faster in many settings

Layer Norm vs Batch Norm

Definition

Layer Normalization

Layer normalization normalizes across the feature dimension rather than the batch dimension. For an activation vector x \in \mathbb{R}^d at a single example:

\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d}(x_j - \mu)^2

\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

Key differences from BN:

  • Normalizes across features, not across the batch
  • Each example is normalized independently (no batch dependence)
  • Identical behavior at train and test time
  • Used in transformers where batch norm is impractical (variable sequence lengths, autoregressive generation)
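The contrast is easy to see in a minimal pure-Python sketch (names ours): the statistics come from the d features of a single example, so no other examples are needed:

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for ONE example: mean and variance are taken over
    the d features of x, not over a batch (toy sketch, names ours)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

# Identical behavior at train and test time: nothing here depends
# on a batch or on stored running statistics.
y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```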

RMSNorm

Definition

RMS Normalization

RMSNorm (Zhang and Sennrich, 2019) simplifies layer norm by dropping the mean centering:

\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2}

\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x)}

RMSNorm is computationally cheaper (no mean computation) and empirically performs comparably to layer norm. It is used in LLaMA, Gemma, and several other modern LLM architectures.
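A toy sketch of the transform (names ours). The small eps inside the square root is a common implementation detail for numerical stability, not part of the formula above:

```python
import math

def rms_norm(x, gamma, eps=1e-8):
    """RMSNorm sketch: scale x by its root-mean-square; no mean
    subtraction, so it is cheaper than layer norm."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [g * xi / rms for g, xi in zip(gamma, x)]

# With gamma = 1, the output vector has root-mean-square ~= 1,
# but its mean is NOT forced to zero.
y = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
```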

Common Confusions

Watch Out

BN changes the function at test time

A very common mistake: at test time, BN uses running averages from training, not the current batch statistics. This means the function computed at test time is different from training time. It also means that a single example produces the same output regardless of what other examples are in the batch. If you forget to switch to eval mode (e.g., model.eval() in PyTorch), you will use batch statistics at test time, which gives batch-dependent outputs and causes outright errors, especially with batch size 1.
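A toy model of the train/eval split (pure Python, names ours, not PyTorch's API): in "train mode" the output for a fixed input depends on its batch mates, while in "eval mode" it depends only on the stored running statistics:

```python
import math

def bn_single(x, batch, gamma=1.0, beta=0.0, eps=1e-5, running=None):
    """Use batch statistics when `running` is None ("train mode"),
    otherwise use the stored (mean, var) pair ("eval mode")."""
    if running is None:
        m = len(batch)
        mu = sum(batch) / m
        var = sum((v - mu) ** 2 for v in batch) / m
    else:
        mu, var = running
    return gamma * (x - mu) / math.sqrt(var + eps) + beta

# Train mode: the SAME input x = 2.0 gets different outputs
# depending on which batch it arrives in.
a = bn_single(2.0, batch=[2.0, 4.0, 6.0])
b = bn_single(2.0, batch=[2.0, 4.0, 12.0])
# Eval mode: the batch is irrelevant; only the running stats matter.
c = bn_single(2.0, batch=[2.0, 4.0, 6.0], running=(3.0, 1.0))
d = bn_single(2.0, batch=[2.0, 4.0, 12.0], running=(3.0, 1.0))
```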

Watch Out

Gamma and beta do not undo the normalization

A natural question: if BN normalizes to zero mean and unit variance, and then \gamma and \beta can learn to undo this, what is the point? The answer is that BN changes the optimization landscape. Even if the optimal transform could be learned without BN, the path to finding it through gradient descent is smoother with BN. The reparameterization matters for optimization even if it does not change the representational capacity.
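To see that the learnable parameters really can recover the identity, set \gamma = \sigma_B and \beta = \mu_B. A toy check with \epsilon = 0 (names ours):

```python
import math

def bn_channel(batch, gamma, beta):
    """BN for one channel with eps = 0, so the identity claim
    can be checked exactly (toy sketch, names ours)."""
    m = len(batch)
    mu = sum(batch) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in batch) / m)
    return [gamma * (x - mu) / sigma + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]          # mu_B = 5, sigma_B = sqrt(5)
# With gamma = sigma_B and beta = mu_B, BN reduces to the identity map.
recovered = bn_channel(batch, gamma=math.sqrt(5.0), beta=5.0)
```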

Watch Out

Small batch sizes make BN unreliable

With very small batches (e.g., batch size 2-4), the batch statistics \mu_B and \sigma_B^2 are extremely noisy estimates. This is why BN works poorly in settings with small batches (e.g., object detection with large images, or reinforcement learning). Group Normalization (Wu and He, 2018) was designed specifically to handle this case.
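A quick simulation of this noise (pure Python, names ours): the spread of the batch-mean estimate scales like 1/\sqrt{m}, so tiny batches give wildly varying statistics while large batches are stable:

```python
import random
import statistics

random.seed(0)

def batch_mean_spread(batch_size, n_trials=2000):
    """Draw many batches from a standard normal and measure how much
    the batch mean mu_B varies from batch to batch."""
    means = [statistics.fmean(random.gauss(0.0, 1.0)
                              for _ in range(batch_size))
             for _ in range(n_trials)]
    return statistics.pstdev(means)

# Theory predicts spread ~ 1/sqrt(m): noisy for m = 2,
# far more stable for m = 256.
small = batch_mean_spread(2)
large = batch_mean_spread(256)
```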

Summary

  • BN normalizes to zero mean and unit variance within each mini-batch, then applies learnable scale \gamma and shift \beta
  • The original ICS motivation is disputed; the real benefit is a smoother loss landscape
  • At test time, BN uses running averages. The function changes between train and test modes
  • Layer norm normalizes across features (not batch) and is used in transformers
  • RMSNorm drops the mean centering for efficiency and is used in modern LLMs

Exercises

ExerciseCore

Problem

Write out the full forward pass of BN for a single feature channel in a CNN with batch size m = 4 and activations [2, 4, 6, 8]. Use \gamma = 2, \beta = 1, \epsilon = 0. What are the four output values?

ExerciseAdvanced

Problem

Explain why batch normalization makes the loss invariant to the scale of the weight vector in the preceding layer. Specifically, show that if you replace W with \alpha W (for \alpha > 0) in the layer before BN, the output of BN does not change.

References

Canonical:

  • Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)
  • Santurkar et al., "How Does Batch Normalization Help Optimization?" (2018)

Normalization Variants:

  • Ba, Kiros, Hinton, "Layer Normalization" (2016)
  • Zhang and Sennrich, "Root Mean Square Layer Normalization" (2019)
  • Wu and He, "Group Normalization" (2018)

Next Topics

  • Group normalization: handles small batch sizes where BN fails
  • Weight normalization: an alternative reparameterization for smoothing

Last reviewed: April 2026
