Training Techniques
Batch Normalization
Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.
Why This Matters
Batch normalization is one of the most widely adopted practical innovations in deep learning. Introduced by Ioffe and Szegedy (2015), it enabled training of much deeper networks by stabilizing training dynamics. Nearly every modern convolutional architecture uses it (or a close variant). Understanding what it does and why it helps, especially since the original explanation turned out to be incomplete, is essential for any practitioner or theorist.
Mental Model
Think of each layer in a deep network as receiving inputs from the layer before it. If those input distributions shift wildly during training (because the previous layers' weights keep changing), the current layer is constantly chasing a moving target. Batch norm forces each layer's inputs to have consistent statistics (zero mean, unit variance) before the layer processes them, then lets the network learn the optimal mean and variance through learnable parameters.
The Batch Normalization Transform
Batch Normalization (Training)
Given a mini-batch of activations $\{x_1, \dots, x_m\}$ at a particular layer, the batch norm transform computes:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Here $\gamma$ (scale) and $\beta$ (shift) are learnable parameters, and $\epsilon$ is a small constant for numerical stability. Without $\gamma$ and $\beta$, the network could not represent the identity transform through the normalization layer.
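The training-mode transform above can be sketched in a few lines of NumPy (a minimal illustration, not a framework implementation; the biased variance follows the original paper):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass in training mode for a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance (biased, as in the paper)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

# With gamma = 1 and beta = 0, each output column has mean 0 and variance ~1.
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batchnorm_train(x, gamma=np.ones(2), beta=np.zeros(2))
```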
Batch Normalization (Inference)
At test time, we do not use the current mini-batch statistics. Instead, BN uses running averages accumulated during training:

$$y = \gamma \cdot \frac{x - \mu_{\text{run}}}{\sqrt{\sigma_{\text{run}}^2 + \epsilon}} + \beta$$

where $\mu_{\text{run}}$ and $\sigma_{\text{run}}^2$ are the running mean and variance computed as exponential moving averages over training batches. This makes inference deterministic and independent of batch composition.
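The train/inference split can be sketched as a small class (an illustrative sketch; the EMA "momentum" convention mirrors the PyTorch-style update, which frameworks vary on):

```python
import numpy as np

class BatchNorm1D:
    """Minimal BN sketch tracking running statistics via exponential moving average."""
    def __init__(self, d, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # EMA update of the running statistics used later at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Frozen statistics: inference is deterministic and batch-independent
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

In eval mode, the same example produces the same output no matter which other examples share its batch, which is exactly the determinism property described above.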
The Internal Covariate Shift Story (and Why It Is Disputed)
The original paper motivated BN by the concept of internal covariate shift (ICS): as parameters of earlier layers change during training, the distribution of inputs to later layers shifts, forcing those layers to continuously adapt. BN was supposed to fix this by stabilizing those distributions.
However, Santurkar et al. (2018) showed experimentally that:
- BN does not significantly reduce internal covariate shift
- Networks with BN can exhibit more ICS than networks without it
- BN still dramatically helps training despite this
So the ICS story is at best incomplete. The field has since proposed several competing explanations: loss-landscape smoothing (Santurkar et al. 2018), length-direction decoupling of the weight vector (Kohler et al. 2019), and implicit learning-rate scheduling via batch statistics. No single explanation has become consensus; each captures part of the effect.
Why Batch Normalization Actually Helps (Leading Hypothesis)
BN Smooths the Loss Landscape
Statement
Under the loss-landscape-smoothing hypothesis (Santurkar et al. 2018), batch normalization reduces the variation of the loss and its gradients: the Lipschitz constant of the loss decreases and the second-derivative smoothness improves. This is currently the best-supported mechanistic explanation, not a closed question.
Intuition
A smoother loss landscape means that gradient descent steps are more predictive: the gradient at your current point is a better approximation of what the loss will actually do when you take a step. This is why BN allows higher learning rates: the landscape is well-behaved enough that large steps do not overshoot.
Proof Sketch
Santurkar et al. (2018) prove that for a network with BN, the gradient of the loss obeys a tighter Lipschitz bound (its norm varies less across the landscape) and the Hessian has smaller spectral norm. The key observation is that normalization decouples the length of the weight vector from its direction, preventing the loss from growing unboundedly with weight magnitude. Follow-up work (Yang et al. 2019, Kohler et al. 2019) refined the picture and identified cases where the smoothing bounds are loose.
Why It Matters
This gives a plausible mechanism for the practical benefits of BN: faster convergence, tolerance for higher learning rates, and reduced sensitivity to weight initialization. It is the leading hypothesis, not the final word.
Failure Mode
This analysis is for feedforward networks with BN in standard positions. The smoothing effect can be reduced or absent in architectures where BN is applied in unusual ways, or with very small batch sizes where the batch statistics are noisy.
Practical Benefits
- Higher learning rates: smoother landscape tolerates larger steps
- Less sensitivity to initialization: normalization prevents activations from exploding or vanishing
- Regularization effect: the noise in mini-batch statistics acts as a mild regularizer (similar in spirit to dropout)
- Faster convergence: empirically 5-10x faster in many settings
Layer Norm vs Batch Norm
Layer Normalization
Layer normalization normalizes across the feature dimension rather than the batch dimension. For an activation vector $x \in \mathbb{R}^d$ at a single example:

$$\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \qquad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d}(x_j - \mu)^2, \qquad y_j = \gamma_j \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_j$$
Key differences from BN:
- Normalizes across features, not across the batch
- Each example is normalized independently (no batch dependence)
- Identical behavior at train and test time
- Used in transformers where batch norm is impractical (variable sequence lengths, autoregressive generation)
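The batch independence in the list above is easy to see in code (a minimal sketch; `gamma` and `beta` are per-feature, as in layer norm):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Layer norm: normalize each example over its feature (last) axis,
    independently of every other example in the batch."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Each row is normalized on its own: changing one example never affects another.
x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
y = layernorm(x, gamma=np.ones(3), beta=np.zeros(3))
```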
RMSNorm
RMS Normalization
RMSNorm (Zhang and Sennrich, 2019) simplifies layer norm by dropping the mean centering:

$$y_j = \gamma_j \frac{x_j}{\text{RMS}(x)}, \qquad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}$$
RMSNorm is computationally cheaper (no mean computation) and empirically performs comparably to layer norm. It is used in LLaMA, Gemma, and several other modern LLM architectures.
Common Confusions
BN changes the function at test time
A very common mistake: at test time, BN uses running averages from training, not the current batch statistics. This means the function computed at test time is different from the one computed at training time. It also means that a single example produces the same output regardless of what other examples are in the batch. If you forget to switch to eval mode (e.g., model.eval() in PyTorch), you will use batch statistics at test time, which causes errors, especially with batch size 1 (where the batch variance is zero).
Gamma and beta do not undo the normalization
A natural question: if BN normalizes to zero mean and unit variance, and then $\gamma$ and $\beta$ can learn to undo this, what is the point? The answer is that BN changes the optimization landscape. Even if the optimal transform could be learned without BN, the path to finding it through gradient descent is smoother with BN. The reparameterization matters for optimization even if it does not change the representational capacity.
Small batch sizes make BN unreliable
With very small batches (e.g., batch size 2-4), the batch statistics $\mu_B$ and $\sigma_B^2$ are extremely noisy estimates. This is why BN works poorly in settings with small batches (e.g., object detection with large images, or reinforcement learning). Group Normalization (Wu and He, 2018) was designed specifically to handle this case.
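The noisiness of small-batch statistics is easy to demonstrate with a quick simulation (a sketch under the assumption of i.i.d. standard-normal activations; the population and trial counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "activation population" for one feature channel
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_stat_noise(batch_size, trials=2000):
    """Std of the batch-mean estimate across random mini-batches:
    a proxy for how noisy BN's per-batch statistics are."""
    means = [rng.choice(population, size=batch_size).mean() for _ in range(trials)]
    return np.std(means)

# The batch-mean estimate gets noisier roughly as 1/sqrt(batch_size),
# so batch size 2 is several times noisier than batch size 64.
noise_small = batch_stat_noise(2)
noise_large = batch_stat_noise(64)
```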
Summary
- BN normalizes to zero mean and unit variance within each mini-batch, then applies learnable scale and shift
- The original ICS motivation is disputed; the real benefit is a smoother loss landscape
- At test time, BN uses running averages. The function changes between train and test modes
- Layer norm normalizes across features (not batch) and is used in transformers
- RMSNorm drops the mean centering for efficiency and is used in modern LLMs
Exercises
Problem
Write out the full forward pass of BN for a single feature channel in a CNN with batch size $m = 4$ and activations $x_1, x_2, x_3, x_4$. Using scale $\gamma$, shift $\beta$, and stability constant $\epsilon$, what are the four output values?
Problem
Explain why batch normalization makes the loss invariant to the scale of the weight vector in the preceding layer. Specifically, show that if you replace $w$ with $\alpha w$ (for $\alpha > 0$) in the layer before BN, the output of BN does not change.
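Before proving the invariance, you can sanity-check it numerically (a sketch with arbitrary random data; with $\epsilon > 0$ the invariance is only approximate, becoming exact as $\epsilon \to 0$):

```python
import numpy as np

def bn(z, eps=1e-5):
    """Normalize pre-activations over the batch axis (gamma = 1, beta = 0)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))   # mini-batch of inputs to the preceding linear layer
w = rng.normal(size=5)         # weight vector of that layer
alpha = 7.3                    # arbitrary positive scale

z1 = X @ w                     # pre-activations with w
z2 = X @ (alpha * w)           # pre-activations with alpha * w: scaled by alpha
# Both mean and standard deviation scale by alpha, so the scale cancels in bn().
```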
References
Canonical:
- Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)
- Santurkar et al., "How Does Batch Normalization Help Optimization?" (2018)
Normalization Variants:
- Ba, Kiros, Hinton, "Layer Normalization" (2016)
- Zhang and Sennrich, "Root Mean Square Layer Normalization" (2019)
- Wu and He, "Group Normalization" (2018)
Next Topics
- Group normalization: handles small batch sizes where BN fails
- Weight normalization: an alternative reparameterization for smoothing
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Common Probability Distributions (Layer 0A)