
Training Techniques

Batch Normalization

Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.


Why This Matters

Batch normalization is one of the most widely adopted practical innovations in deep learning. Introduced by Ioffe and Szegedy (2015), it enabled the training of much deeper networks by stabilizing training dynamics. Nearly every modern convolutional architecture uses it (or a close variant). Understanding what it does and why it helps, especially since the original explanation turned out to be incomplete, is essential for any practitioner or theorist.

Figure: Raw activations (feature columns h1-h8, batch rows b1-b6) before and after BatchNorm. Each feature column is normalized independently across the batch dimension via (x - μ)/σ, giving per-feature μ = 0, σ = 1; the network then learns γ·x̂ + β.

Mental Model

Think of each layer in a deep network as receiving inputs from the layer before it. If those input distributions shift wildly during training (because the previous layers' weights keep changing), the current layer is constantly chasing a moving target. Batch norm forces each layer's inputs to have consistent statistics (zero mean, unit variance) before the layer processes them, then lets the network learn the optimal mean and variance through learnable parameters.

The Batch Normalization Transform

Definition

Batch Normalization (Training)

Given a mini-batch B = \{x_1, \ldots, x_m\} of activations at a particular layer, the batch norm transform computes:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y_i = \gamma \hat{x}_i + \beta = \text{BN}_{\gamma,\beta}(x_i)

Here \gamma (scale) and \beta (shift) are learnable parameters, and \epsilon is a small constant for numerical stability. Without \gamma and \beta, the network could not represent the identity transform through the normalization layer.
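As a concrete illustration of the three equations above, here is a minimal pure-Python sketch of the training-mode transform for a single feature channel (the function name and scalar-batch setup are ours, not a particular library's API):

```python
import math

def batchnorm_train(batch, gamma, beta, eps=1e-5):
    """Training-mode BN for one feature channel over a mini-batch
    of scalar activations (toy sketch, not a library API)."""
    m = len(batch)
    mu = sum(batch) / m                           # mu_B
    var = sum((x - mu) ** 2 for x in batch) / m   # sigma_B^2 (biased)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]

# With gamma = 1, beta = 0 the outputs have (approximately)
# zero mean and unit variance across the batch.
out = batchnorm_train([2.0, 4.0, 6.0, 8.0], gamma=1.0, beta=0.0)
```

After the learnable affine step, the batch mean of the outputs equals \beta and the batch standard deviation is approximately \gamma.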

Definition

Batch Normalization (Inference)

At test time, we do not use the current mini-batch statistics. Instead, BN uses running averages accumulated during training:

y = \gamma \cdot \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}} + \beta

where \mathbb{E}[x] and \text{Var}[x] are the running mean and variance computed as exponential moving averages over training batches. This makes inference deterministic and independent of batch composition.
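A sketch of both pieces in pure Python (names ours). The momentum convention here, where the momentum weights the new batch statistics, is one common library default, but conventions vary between frameworks:

```python
import math

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """EMA update of the running statistics during training.
    Here `momentum` weights the NEW batch statistics; some
    libraries use the opposite convention."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    return ((1 - momentum) * running_mean + momentum * mu,
            (1 - momentum) * running_var + momentum * var)

def batchnorm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Inference-mode BN: the output for x is deterministic and
    independent of whatever batch x happens to arrive in."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta
```

Because `batchnorm_eval` touches only the stored statistics, calling it twice on the same input always gives the same output, which is exactly the determinism property described above.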

The Internal Covariate Shift Story (and Why It Is Disputed)

The original paper motivated BN by the concept of internal covariate shift (ICS): as parameters of earlier layers change during training, the distribution of inputs to later layers shifts, forcing those layers to continuously adapt. BN was supposed to fix this by stabilizing those distributions.

However, Santurkar et al. (2018) showed experimentally that:

  1. BN does not significantly reduce internal covariate shift
  2. Networks with BN can exhibit more ICS than networks without it
  3. BN still dramatically helps training despite this

So the ICS story is at best incomplete. The field has since proposed several competing explanations: loss-landscape smoothing (Santurkar et al. 2018), length-direction decoupling of the weight vector (Kohler et al. 2019), and implicit learning-rate scheduling via batch statistics. No single explanation has become consensus; each captures part of the effect.

Why Batch Normalization Actually Helps (Leading Hypothesis)

Proposition

BN Smooths the Loss Landscape

Statement

Under the loss-landscape-smoothing hypothesis (Santurkar et al. 2018), batch normalization reduces the variation of the loss and its gradients: the effective Lipschitz constant of the loss decreases, and the gradients themselves become more Lipschitz (better \beta-smoothness). This is currently the best-supported mechanistic explanation, not a closed question.

Intuition

A smoother loss landscape means that gradient descent steps are more predictive: the gradient at your current point is a better approximation of what the loss will actually do when you take a step. This is why BN allows higher learning rates: the landscape is well-behaved enough that large steps do not overshoot.

Proof Sketch

Santurkar et al. (2018) prove that, for a network with BN and under their assumptions, the gradient of the loss satisfies \|\nabla L_{\text{BN}}\| \leq \|\nabla L\| and the curvature along the gradient direction shrinks. The key observation is that normalization decouples the length of the weight vector from its direction, preventing the loss from growing unboundedly with weight magnitude. Follow-up work (Yang et al. 2019, Kohler et al. 2019) refined the picture and identified cases where the smoothing bounds are loose.

Why It Matters

This gives a plausible mechanism for the practical benefits of BN: faster convergence, tolerance for higher learning rates, and reduced sensitivity to weight initialization. It is the leading hypothesis, not the final word.

Failure Mode

This analysis is for feedforward networks with BN in standard positions. The smoothing effect can be reduced or absent in architectures where BN is applied in unusual ways, or with very small batch sizes where the batch statistics are noisy.

Practical Benefits

  1. Higher learning rates: smoother landscape tolerates larger steps
  2. Less sensitivity to initialization: normalization prevents activations from exploding or vanishing
  3. Regularization effect: the noise in mini-batch statistics acts as a mild regularizer (similar in spirit to dropout)
  4. Faster convergence: empirically 5-10x faster in many settings

Layer Norm vs Batch Norm

Definition

Layer Normalization

Layer normalization normalizes across the feature dimension rather than the batch dimension. For an activation vector x \in \mathbb{R}^d at a single example:

\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d}(x_j - \mu)^2

\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

Key differences from BN:

  • Normalizes across features, not across the batch
  • Each example is normalized independently (no batch dependence)
  • Identical behavior at train and test time
  • Used in transformers where batch norm is impractical (variable sequence lengths, autoregressive generation)
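The contrast is easy to see in a minimal pure-Python sketch (names ours): the statistics come from the d features of a single example, so no other examples are needed:

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for ONE example: mean and variance are taken over
    the d features of x, not over a batch (toy sketch, names ours)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

# Identical behavior at train and test time: nothing here depends
# on a batch or on stored running statistics.
y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```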

RMSNorm

Definition

RMS Normalization

RMSNorm (Zhang and Sennrich, 2019) simplifies layer norm by dropping the mean centering:

\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2}

\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x)}

RMSNorm is computationally cheaper (no mean computation) and empirically performs comparably to layer norm. It is used in LLaMA, Gemma, and several other modern LLM architectures.
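A toy sketch of the transform (names ours). The small eps inside the square root is a common implementation detail for numerical stability, not part of the formula above:

```python
import math

def rms_norm(x, gamma, eps=1e-8):
    """RMSNorm sketch: scale x by its root-mean-square; no mean
    subtraction, so it is cheaper than layer norm."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [g * xi / rms for g, xi in zip(gamma, x)]

# With gamma = 1, the output vector has root-mean-square ~= 1,
# but its mean is NOT forced to zero.
y = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
```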

Common Confusions

Watch Out

BN changes the function at test time

A very common mistake: at test time, BN uses running averages from training, not the current batch statistics. This means the function computed at test time is different from training time. It also means that a single example produces the same output regardless of what other examples are in the batch. If you forget to switch to eval mode (e.g., model.eval() in PyTorch), you will use batch statistics at test time, which gives batch-dependent outputs and causes outright errors, especially with batch size 1.
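A toy model of the train/eval split (pure Python, names ours, not PyTorch's API): in "train mode" the output for a fixed input depends on its batch mates, while in "eval mode" it depends only on the stored running statistics:

```python
import math

def bn_single(x, batch, gamma=1.0, beta=0.0, eps=1e-5, running=None):
    """Use batch statistics when `running` is None ("train mode"),
    otherwise use the stored (mean, var) pair ("eval mode")."""
    if running is None:
        m = len(batch)
        mu = sum(batch) / m
        var = sum((v - mu) ** 2 for v in batch) / m
    else:
        mu, var = running
    return gamma * (x - mu) / math.sqrt(var + eps) + beta

# Train mode: the SAME input x = 2.0 gets different outputs
# depending on which batch it arrives in.
a = bn_single(2.0, batch=[2.0, 4.0, 6.0])
b = bn_single(2.0, batch=[2.0, 4.0, 12.0])
# Eval mode: the batch is irrelevant; only the running stats matter.
c = bn_single(2.0, batch=[2.0, 4.0, 6.0], running=(3.0, 1.0))
d = bn_single(2.0, batch=[2.0, 4.0, 12.0], running=(3.0, 1.0))
```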

Watch Out

Gamma and beta do not undo the normalization

A natural question: if BN normalizes to zero mean and unit variance, and then \gamma and \beta can learn to undo this, what is the point? The answer is that BN changes the optimization landscape. Even if the optimal transform could be learned without BN, the path to finding it through gradient descent is smoother with BN. The reparameterization matters for optimization even if it does not change the representational capacity.
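To see that the learnable parameters really can recover the identity, set \gamma = \sigma_B and \beta = \mu_B. A toy check with \epsilon = 0 (names ours):

```python
import math

def bn_channel(batch, gamma, beta):
    """BN for one channel with eps = 0, so the identity claim
    can be checked exactly (toy sketch, names ours)."""
    m = len(batch)
    mu = sum(batch) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in batch) / m)
    return [gamma * (x - mu) / sigma + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]          # mu_B = 5, sigma_B = sqrt(5)
# With gamma = sigma_B and beta = mu_B, BN reduces to the identity map.
recovered = bn_channel(batch, gamma=math.sqrt(5.0), beta=5.0)
```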

Watch Out

Small batch sizes make BN unreliable

With very small batches (e.g., batch size 2-4), the batch statistics \mu_B and \sigma_B^2 are extremely noisy estimates. This is why BN works poorly in settings with small batches (e.g., object detection with large images, or reinforcement learning). Group Normalization (Wu and He, 2018) was designed specifically to handle this case.
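A quick simulation of this noise (pure Python, names ours): the spread of the batch-mean estimate scales like 1/\sqrt{m}, so tiny batches give wildly varying statistics while large batches are stable:

```python
import random
import statistics

random.seed(0)

def batch_mean_spread(batch_size, n_trials=2000):
    """Draw many batches from a standard normal and measure how much
    the batch mean mu_B varies from batch to batch."""
    means = [statistics.fmean(random.gauss(0.0, 1.0)
                              for _ in range(batch_size))
             for _ in range(n_trials)]
    return statistics.pstdev(means)

# Theory predicts spread ~ 1/sqrt(m): noisy for m = 2,
# far more stable for m = 256.
small = batch_mean_spread(2)
large = batch_mean_spread(256)
```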

Summary

  • BN normalizes to zero mean and unit variance within each mini-batch, then applies learnable scale \gamma and shift \beta
  • The original ICS motivation is disputed; the real benefit is a smoother loss landscape
  • At test time, BN uses running averages. The function changes between train and test modes
  • Layer norm normalizes across features (not batch) and is used in transformers
  • RMSNorm drops the mean centering for efficiency and is used in modern LLMs

Exercises

ExerciseCore

Problem

Write out the full forward pass of BN for a single feature channel in a CNN with batch size m = 4 and activations [2, 4, 6, 8]. Use \gamma = 2, \beta = 1, \epsilon = 0. What are the four output values?

ExerciseAdvanced

Problem

Explain why batch normalization makes the loss invariant to the scale of the weight vector in the preceding layer. Specifically, show that if you replace W with \alpha W (for \alpha > 0) in the layer before BN, the output of BN does not change.

References

Canonical:

  • Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)
  • Santurkar et al., "How Does Batch Normalization Help Optimization?" (2018)

Normalization Variants:

  • Ba, Kiros, Hinton, "Layer Normalization" (2016)
  • Zhang and Sennrich, "Root Mean Square Layer Normalization" (2019)
  • Wu and He, "Group Normalization" (2018)

Next Topics

  • Group normalization: handles small batch sizes where BN fails
  • Weight normalization: an alternative reparameterization for smoothing

Last reviewed: April 2026
