Normalization Lab
Build intuition for BatchNorm, LayerNorm, and RMSNorm before the residual-stream and transformer internals pages: which axis gets normalized, which statistics survive, and when batch size matters.
Normalization decides which statistics get stabilized and which ones survive into the next layer
BatchNorm, LayerNorm, and RMSNorm are not interchangeable cleanup steps. They choose different axes, different invariances, and different failure modes. This lab shows that choice directly: column coupling across a batch, row-wise centering inside one example, or scale control without centering.
The cyan outline shows exactly which row or column is normalized together.
LayerNorm recenters and rescales each example. RMSNorm only rescales. BatchNorm normalizes each feature with shared batch statistics.
Lower the batch size and notice that only BatchNorm starts to depend on who else is in the room.
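As a rough reference for what the interactive view is computing, here is a minimal NumPy sketch of the three families on one activation matrix of shape (batch, features). The function names, the epsilon value, and the omission of the trainable scale/shift parameters are illustrative choices, not any particular library's API; the point is which axis each family reduces over.

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Statistics per feature column, shared across the batch (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layernorm(x, eps=1e-5):
    # Statistics per example row (axis 1): recenter and rescale each row.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-5):
    # No centering: divide each row by its root-mean-square only.
    rms = np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(8, 4))  # batch of 8 examples, 4 features

print(layernorm(x).mean(axis=1))  # ~0: LayerNorm removes each row's mean
print(rmsnorm(x).mean(axis=1))    # generally nonzero: RMSNorm never recenters
print(batchnorm(x).mean(axis=0))  # ~0: BatchNorm removes each feature column's mean
```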
BatchNorm recenters each feature using the current mini-batch, then hands back a trainable scale and shift
The key idea is cross-example coupling. Every example borrows the current batch's feature statistics, which is why BatchNorm can stabilize convolutional training and also become noisy when batches get too small.
All examples in the batch contribute to the same per-feature mean and variance, so one example's normalized output depends on the other examples sampled beside it.
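A small sketch of that coupling, with a trainable per-feature scale (gamma) and shift (beta) included; the helper name and batch contents are made up for illustration. The same row comes out differently depending on which other examples share its mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # Per-feature statistics over the current mini-batch (axis 0),
    # followed by a learned per-feature scale (gamma) and shift (beta).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

features = 4
gamma = np.ones(features)   # trainable scale, initialized to 1
beta = np.zeros(features)   # trainable shift, initialized to 0

example = rng.normal(size=(1, features))                          # one fixed example
batch_a = np.vstack([example, rng.normal(size=(7, features))])
batch_b = np.vstack([example, rng.normal(size=(7, features)) + 5.0])

# Same first row, different batch-mates, different normalized output.
print(batchnorm_train(batch_a, gamma, beta)[0])
print(batchnorm_train(batch_b, gamma, beta)[0])
```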
The same matrix before normalization. Warmer cells are larger positive values; cooler cells are more negative.
The highlighted axis is what this family actually normalizes. Watch whether the row mean disappears, the row scale tightens, or the feature column recenters.
feature columns share statistics
Normalization doing its job
The selected axis is being pulled into a more trainable range, but the mechanism depends on the family: BatchNorm couples examples, LayerNorm centers each row, and RMSNorm only controls row scale.
Jump between the three families on the same scenario. The axis highlight tells you what each one is actually normalizing.
Normalization changes the geometry of optimization by stabilizing internal statistics, but different families choose different axes and different invariances.
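To make "different invariances" concrete, here is a quick check (a sketch reusing the row-wise helpers from the first snippet, with illustrative constants): LayerNorm is invariant to shifting a whole row by a constant, while RMSNorm is invariant to positively rescaling a row but not to shifting it.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

def rmsnorm(x, eps=1e-5):
    return x / np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(1)
row = rng.normal(size=(1, 6))

# LayerNorm: adding a constant to every feature of a row changes nothing.
print(np.allclose(layernorm(row), layernorm(row + 3.0)))              # True
# RMSNorm: rescaling a row by a positive constant changes nothing (up to epsilon)...
print(np.allclose(rmsnorm(row), rmsnorm(2.5 * row), atol=1e-5))       # True
# ...but shifting it does, because RMSNorm never recenters.
print(np.allclose(rmsnorm(row), rmsnorm(row + 3.0)))                  # False
```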
BatchNorm can widen the safe learning-rate range and keep intermediate activations in a trainable range, especially in convolutional nets with healthy effective batch size.
CNN backbones, residual vision models, and older deep feedforward stacks where batch-level statistics are stable and cheap to estimate.
Use it when the model sees reasonably large, homogeneous mini-batches and you want cross-example feature stabilization.
With tiny or highly non-IID batches, the batch statistics themselves become noisy, and the normalization can inject instability rather than remove it.
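A quick way to see that failure mode numerically (a sketch with made-up batch sizes and a single unit-variance feature): estimate the per-batch mean many times at each batch size and watch how noisy the estimate gets. That noise is exactly what BatchNorm feeds into every example in the batch.

```python
import numpy as np

rng = np.random.default_rng(2)

def batch_mean_noise(batch_size, trials=1000):
    # Standard deviation of the per-batch mean estimate for one
    # unit-variance feature; BatchNorm uses this estimate directly.
    means = rng.normal(size=(trials, batch_size)).mean(axis=1)
    return means.std()

for batch_size in (2, 8, 64, 256):
    print(batch_size, round(float(batch_mean_noise(batch_size)), 3))
# Roughly 1/sqrt(batch_size): ~0.71, ~0.35, ~0.13, ~0.06
```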