Normalization Lab
Build intuition for BatchNorm, LayerNorm, and RMSNorm before the residual-stream and transformer internals pages: which axis gets normalized, which statistics survive, and when batch size matters.
Normalization decides which statistics get stabilized and which ones survive into the next layer
BatchNorm, LayerNorm, and RMSNorm are not interchangeable cleanup steps. They choose different axes, different invariances, and different failure modes. This lab shows that choice directly: column coupling across a batch, row-wise centering inside one example, or scale control without centering.
The cyan outline shows exactly which row or column is normalized together.
LayerNorm recenters and rescales each example. RMSNorm only rescales. BatchNorm normalizes each feature with shared batch statistics.
Lower the batch size and notice that only BatchNorm starts to depend on who else is in the room.
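As a rough reference for what the interactive view is computing, here is a minimal NumPy sketch of the three families on one activation matrix of shape (batch, features). The function names, the epsilon value, and the omission of the trainable scale/shift parameters are illustrative choices, not any particular library's API; the point is which axis each family reduces over.

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Statistics per feature column, shared across the batch (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layernorm(x, eps=1e-5):
    # Statistics per example row (axis 1): recenter and rescale each row.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-5):
    # No centering: divide each row by its root-mean-square only.
    rms = np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(8, 4))  # batch of 8 examples, 4 features

print(layernorm(x).mean(axis=1))  # ~0: LayerNorm removes each row's mean
print(rmsnorm(x).mean(axis=1))    # generally nonzero: RMSNorm never recenters
print(batchnorm(x).mean(axis=0))  # ~0: BatchNorm removes each feature column's mean
```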
BatchNorm recenters each feature using the current mini-batch, then hands back a trainable scale and shift
The key idea is cross-example coupling. Every example borrows the current batch's feature statistics, which is why BatchNorm can stabilize convolutional training and also become noisy when batches get too small.
All examples in the batch contribute to the same per-feature mean and variance, so one example's normalized output depends on the other examples sampled beside it.
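A small sketch of that coupling, with a trainable per-feature scale (gamma) and shift (beta) included; the helper name and batch contents are made up for illustration. The same row comes out differently depending on which other examples share its mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # Per-feature statistics over the current mini-batch (axis 0),
    # followed by a learned per-feature scale (gamma) and shift (beta).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

features = 4
gamma = np.ones(features)   # trainable scale, initialized to 1
beta = np.zeros(features)   # trainable shift, initialized to 0

example = rng.normal(size=(1, features))                          # one fixed example
batch_a = np.vstack([example, rng.normal(size=(7, features))])
batch_b = np.vstack([example, rng.normal(size=(7, features)) + 5.0])

# Same first row, different batch-mates, different normalized output.
print(batchnorm_train(batch_a, gamma, beta)[0])
print(batchnorm_train(batch_b, gamma, beta)[0])
```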
The same matrix before normalization. Warmer cells are larger positive values; cooler cells are more negative.
The highlighted axis is what this family actually normalizes. Watch whether the row mean disappears, the row scale tightens, or the feature column recenters.
feature columns share statistics
Normalization doing its job
The selected axis is being pulled into a more trainable range, but the mechanism depends on the family: BatchNorm couples examples, LayerNorm centers each row, and RMSNorm only controls row scale.
Jump between the three families on the same scenario. The axis highlight tells you what each one is actually normalizing.
Normalization changes the geometry of optimization by stabilizing internal statistics, but different families choose different axes and different invariances.
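To make "different invariances" concrete, here is a quick check (a sketch reusing the row-wise helpers from the first snippet, with illustrative constants): LayerNorm is invariant to shifting a whole row by a constant, while RMSNorm is invariant to positively rescaling a row but not to shifting it.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

def rmsnorm(x, eps=1e-5):
    return x / np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(1)
row = rng.normal(size=(1, 6))

# LayerNorm: adding a constant to every feature of a row changes nothing.
print(np.allclose(layernorm(row), layernorm(row + 3.0)))              # True
# RMSNorm: rescaling a row by a positive constant changes nothing (up to epsilon)...
print(np.allclose(rmsnorm(row), rmsnorm(2.5 * row), atol=1e-5))       # True
# ...but shifting it does, because RMSNorm never recenters.
print(np.allclose(rmsnorm(row), rmsnorm(row + 3.0)))                  # False
```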
BatchNorm can widen the safe learning-rate range and keep intermediate activations in a trainable range, especially in convolutional nets with healthy effective batch size.
CNN backbones, residual vision models, and older deep feedforward stacks where batch-level statistics are stable and cheap to estimate.
Use it when the model sees reasonably large, homogeneous mini-batches and you want cross-example feature stabilization.
With tiny or highly non-IID batches, the batch statistics themselves become noisy, and the normalization can inject instability rather than remove it.
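A quick way to see that failure mode numerically (a sketch with made-up batch sizes and a single unit-variance feature): estimate the per-batch mean many times at each batch size and watch how noisy the estimate gets. That noise is exactly what BatchNorm feeds into every example in the batch.

```python
import numpy as np

rng = np.random.default_rng(2)

def batch_mean_noise(batch_size, trials=1000):
    # Standard deviation of the per-batch mean estimate for one
    # unit-variance feature; BatchNorm uses this estimate directly.
    means = rng.normal(size=(trials, batch_size)).mean(axis=1)
    return means.std()

for batch_size in (2, 8, 64, 256):
    print(batch_size, round(float(batch_mean_noise(batch_size)), 3))
# Roughly 1/sqrt(batch_size): ~0.71, ~0.35, ~0.13, ~0.06
```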