
Normalization Lab

Build intuition for BatchNorm, LayerNorm, and RMSNorm before the residual-stream and transformer internals pages: which axis gets normalized, which statistics survive, and when batch size matters.

Why normalization matters

Normalization decides which statistics get stabilized and which ones survive into the next layer

BatchNorm, LayerNorm, and RMSNorm are not interchangeable cleanup steps. They choose different axes, different invariances, and different failure modes. This lab shows that choice directly: column coupling across a batch, row-wise centering inside one example, or scale control without centering.
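
To make the axis choice concrete, here is a minimal NumPy sketch of all three families applied to one (batch, features) matrix, inference-style, with the trainable gain and bias omitted. This is an illustrative setup, not the lab's actual code; the axis arguments are the whole story.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.5, scale=1.5, size=(8, 6))  # (batch, features)
    eps = 1e-5

    # BatchNorm: statistics per feature COLUMN, shared by everyone in the batch
    bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

    # LayerNorm: statistics per example ROW, independent of the batch
    ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

    # RMSNorm: per-row root-mean-square; rescales the row but never recenters it
    rms = x / np.sqrt((x**2).mean(axis=1, keepdims=True) + eps)

    print(bn.mean(axis=0).round(3))   # ~0 for every column
    print(ln.mean(axis=1).round(3))   # ~0 for every row
    print(rms.mean(axis=1).round(3))  # not ~0: the row mean survives RMSNorm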

What to watch as you drag
Highlighted axis

The cyan outline shows exactly which row or column is normalized together.

Mean versus scale

LayerNorm recenters and rescales. RMSNorm only rescales. BatchNorm uses shared batch statistics.

Batch sensitivity

Lower the batch size and watch how BatchNorm, alone among the three, starts to depend on who else is in the room.

Normalization board

BatchNorm recenters each feature using the current mini-batch, then hands back a trainable scale and shift

The key idea is cross-example coupling. Every example borrows the current batch's feature statistics, which is why BatchNorm can stabilize convolutional training and also become noisy when batches get too small.

Active formula
Normalize one feature column across the mini-batch

All examples in the batch contribute to the same per-feature mean and variance, so one example's normalized output depends on the other examples sampled beside it.

Raw std: 0.270 (feature 3 across the batch)
Normalized mean: -0.000
Normalized std: 1.000
Coupling: batch-coupled (depends on neighbors)
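
How that readout is computed, as a small NumPy sketch: take one feature column, normalize it with the batch statistics, and re-measure. The toy matrix below is random, so the exact values will not match the board's.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=(8, 6))
    col = x[:, 3]                                   # feature 3 across the batch
    z = (col - col.mean()) / np.sqrt(col.var() + 1e-5)

    print(f"raw std          {col.std():.3f}")
    print(f"normalized mean  {z.mean():+.3f}")      # ~ 0.000
    print(f"normalized std   {z.std():.3f}")        # ~ 1.000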
Raw activations

The same matrix before normalization. Warmer cells are larger positive values; cooler cells are more negative.

Normalized output

The highlighted axis is what this family actually normalizes. Watch whether the row mean disappears, the row scale tightens, or the feature column recenters.

Raw problem
Shift and scale drift make optimization brittle because later layers see wildly changing activation magnitudes.
Normalization action
Feature columns borrow batch statistics, so the whole batch moves together.
Current focus row
Example 3 ends with mean 0.39 and RMS 1.18 after normalization.
feature columns share statistics · cyan outline = axis normalized together · same controls, different invariances
Controls under the board
Choose the family
Jump to teaching scenarios
Current read
Family
BatchNorm
Scenario
Stable batch


Family snapshot

feature columns share statistics

One example's normalized feature depends on the other examples in the same mini-batch.
Diagnosis

Normalization doing its job

The selected axis is being pulled into a more trainable range, but the mechanism depends on the family: BatchNorm couples examples, LayerNorm centers each row, and RMSNorm only controls row scale.
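
The coupling claim is easy to check on toy data: perturb one example and see whose normalized output moves. In the sketch below (an illustrative setup, not the lab's code), row 5 itself never changes, yet its BatchNorm output does; its LayerNorm output does not.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=(8, 6))
    x_perturbed = x.copy()
    x_perturbed[0] += 10.0  # an outlier joins the batch

    def batchnorm(a, eps=1e-5):
        return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

    def layernorm(a, eps=1e-5):
        return (a - a.mean(axis=1, keepdims=True)) / np.sqrt(a.var(axis=1, keepdims=True) + eps)

    # Row 5 was never touched, only row 0 changed.
    print(np.abs(batchnorm(x)[5] - batchnorm(x_perturbed)[5]).max())  # clearly nonzero
    print(np.abs(layernorm(x)[5] - layernorm(x_perturbed)[5]).max())  # exactly 0.0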

Try next

Jump between the three families on the same scenario. The axis highlight tells you what each one is actually normalizing.

ML translation

Normalization changes the geometry of optimization by stabilizing internal statistics, but different families choose different axes and different invariances.

Why this family exists

BatchNorm can widen the safe learning-rate range and keep intermediate activations in a trainable range, especially in convolutional nets with healthy effective batch size.

Where you'll see it

CNN backbones, residual vision models, and older deep feedforward stacks where batch-level statistics are stable and cheap to estimate.

Choose it when

Use it when the model sees reasonably large, homogeneous mini-batches and you want cross-example feature stabilization.

Watch out for

With tiny or highly non-i.i.d. batches, the batch statistics themselves become noisy and the normalization can inject instability.
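
A rough way to see the noise, assuming each feature is drawn i.i.d. from a fixed population: the standard error of the batch mean scales like sigma / sqrt(batch size), so the normalization target itself jitters from step to step at small batches.

    import numpy as np

    rng = np.random.default_rng(2)
    population = rng.normal(loc=1.0, scale=2.0, size=100_000)

    for batch_size in (2, 8, 64, 256):
        means = [rng.choice(population, size=batch_size).mean() for _ in range(1000)]
        print(f"batch={batch_size:>3}  std of the batch-mean estimate: {np.std(means):.3f}")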

Family cheat sheet
BatchNorm
Across-batch feature stats
Feature columns share statistics: all examples in the batch contribute to the same per-feature mean and variance, so one example's normalized output depends on the other examples sampled beside it.
LayerNorm
Within-example centering
Each row gets its own centered scale: each example gets its own mean and variance over features, so neighboring examples do not affect the normalized result.
RMSNorm
Scale only, no centering
Row scale fixed, mean can remain shifted: RMSNorm uses the root-mean-square of one row, rescaling the vector length without explicitly recentering it around zero.
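
For reference, the same cheat sheet as a PyTorch sketch. Note the assumption that nn.RMSNorm is available (recent PyTorch releases, roughly 2.4+); on older versions the NumPy version above applies.

    import torch
    from torch import nn

    torch.manual_seed(0)
    x = torch.randn(8, 6) + 0.5

    bn = nn.BatchNorm1d(6, affine=False)            # columns share batch statistics
    ln = nn.LayerNorm(6, elementwise_affine=False)  # each row centered and scaled
    rms = nn.RMSNorm(6, elementwise_affine=False)   # each row scaled only (assumes PyTorch >= 2.4)

    bn.train()  # normalize with the current batch's statistics, as on the board
    print(bn(x).mean(dim=0))   # per-column means ~ 0
    print(ln(x).mean(dim=1))   # per-row means ~ 0
    print(rms(x).mean(dim=1))  # per-row means stay shifted: no centering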