
Training Techniques

Batch Size and Learning Dynamics

How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.


Why This Matters

Batch size is not just a memory/compute tradeoff. It changes the noise structure of SGD, which changes the loss landscape regions the optimizer explores, which changes the generalization properties of the trained model. Small-batch SGD and large-batch SGD can converge to different solutions with measurably different test performance.

Understanding this requires connecting optimization (convergence rate), statistics (gradient variance), and geometry (curvature of the loss surface).

Mental Model

SGD computes a gradient estimate from a mini-batch of size $B$. Small $B$ means noisy gradients: the optimizer takes jittery steps that help it escape sharp minima. Large $B$ means accurate gradients: the optimizer takes confident steps but may get stuck in the nearest sharp minimum. The noise is not a bug; it is an implicit regularizer.

The key quantity is the ratio of gradient noise to gradient signal. This ratio determines whether the optimizer behaves more like SGD (noisy, exploratory) or more like full-batch gradient descent (deterministic, exploitative).

Formal Setup

Consider minimizing $L(\theta) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(\theta; z)]$ using mini-batch SGD. A mini-batch $\mathcal{B}$ of size $B$ gives the gradient estimate:

$$g_\mathcal{B}(\theta) = \frac{1}{B}\sum_{z \in \mathcal{B}} \nabla_\theta \ell(\theta; z)$$

Definition

Gradient Noise Covariance

The covariance of the mini-batch gradient estimate is:

$$\Sigma(\theta) = \text{Cov}[g_\mathcal{B}(\theta)] = \frac{1}{B} C(\theta)$$

where $C(\theta) = \text{Cov}_{z}[\nabla_\theta \ell(\theta; z)]$ is the per-sample gradient covariance. The noise scales as $1/B$: doubling the batch size halves the gradient variance.
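This $1/B$ scaling is easy to check numerically. The sketch below uses synthetic per-sample gradients (an illustrative assumption, not any particular model) and estimates the trace of the mini-batch gradient covariance at two batch sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample gradients: 10,000 samples, 5 parameters,
# per-coordinate variance 9, so tr(C) is roughly 45.
grads = rng.normal(loc=1.0, scale=3.0, size=(10_000, 5))

def batch_grad_trace_cov(grads, B, n_trials=2000):
    """Estimate tr(Cov[g_B]) from many random mini-batches of size B."""
    means = np.array([
        grads[rng.choice(len(grads), size=B, replace=False)].mean(axis=0)
        for _ in range(n_trials)
    ])
    return means.var(axis=0).sum()

v32 = batch_grad_trace_cov(grads, 32)
v64 = batch_grad_trace_cov(grads, 64)
print(v32 / v64)  # close to 2: doubling B halves the gradient variance
```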

Definition

Gradient Noise Scale

The gradient noise scale (McCandlish et al., 2018) is:

$$B_{\text{noise}} = \frac{\text{tr}(C(\theta))}{\|\nabla L(\theta)\|^2}$$

This is the ratio of the gradient variance to the squared gradient signal. When $B \ll B_{\text{noise}}$, noise dominates and increasing $B$ gives near-linear speedup. When $B \gg B_{\text{noise}}$, signal dominates and increasing $B$ gives diminishing returns.
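Given a matrix of per-sample gradients, $B_{\text{noise}}$ can be estimated directly. A minimal sketch (the function name and synthetic data are assumptions for illustration; practical estimators such as McCandlish et al.'s work from mini-batch gradients at two batch sizes instead):

```python
import numpy as np

def gradient_noise_scale(per_sample_grads):
    """B_noise = tr(C) / ||grad L||^2, estimated from per-sample gradients.

    per_sample_grads: array of shape (n_samples, n_params).
    """
    mean_grad = per_sample_grads.mean(axis=0)
    tr_C = per_sample_grads.var(axis=0, ddof=1).sum()  # sum of coordinate variances
    return tr_C / np.dot(mean_grad, mean_grad)

rng = np.random.default_rng(0)
# Weak signal (mean 0.1) with strong per-sample noise (std 1.0):
# tr(C) = 10 * 1.0 and ||grad L||^2 = 10 * 0.01, so B_noise should be near 100.
g = rng.normal(loc=0.1, scale=1.0, size=(50_000, 10))
print(gradient_noise_scale(g))
```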

Main Theorems

Proposition

Linear Scaling Rule

Statement

If SGD with learning rate $\eta$ and batch size $B$ produces a certain training trajectory (in the continuous-time SDE approximation), then SGD with learning rate $k\eta$ and batch size $kB$ produces approximately the same trajectory, for any scaling factor $k > 0$.

The effective noise temperature of SGD is:

$$T_{\text{eff}} = \frac{\eta}{B} \cdot \text{tr}(C(\theta))$$

Scaling both $\eta$ and $B$ by $k$ preserves $T_{\text{eff}}$.

Intuition

What matters for the dynamics is the ratio $\eta/B$, not $\eta$ or $B$ individually. This ratio controls the magnitude of the noise injected per step. If you use $4\times$ larger batches, you must use a $4\times$ larger learning rate to maintain the same effective noise.
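As a minimal sketch (the helper name is made up for illustration), the rule amounts to keeping $\eta/B$ fixed:

```python
def scale_hyperparams(lr, batch_size, k):
    """Linear scaling rule: scale batch size by k and learning rate by k,
    which leaves the noise temperature lr / batch_size unchanged."""
    return lr * k, batch_size * k

lr0, B0 = 0.1, 256
lr1, B1 = scale_hyperparams(lr0, B0, 4)
assert lr1 / B1 == lr0 / B0  # effective temperature eta/B preserved
print(lr1, B1)  # 0.4 1024
```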

Proof Sketch

In the continuous-time limit, SGD on the loss $L(\theta)$ is approximated by the SDE:

$$d\theta = -\nabla L(\theta)\,dt + \sqrt{\frac{\eta}{B}}\, C(\theta)^{1/2}\,dW_t$$

The drift term $-\nabla L(\theta)$ is independent of $\eta$ and $B$ (after rescaling time by $\eta$). The diffusion coefficient depends only on the ratio $\eta/B$. Scaling $\eta \to k\eta$ and $B \to kB$ preserves the diffusion coefficient.

Why It Matters

This is the theoretical justification for the practice of scaling the learning rate linearly with batch size, used in large-scale distributed training (Goyal et al., 2017). Without this rule, increasing the batch size would reduce noise and change the implicit regularization of SGD.

Failure Mode

The linear scaling rule breaks when: (1) the learning rate becomes so large that the continuous-time SDE approximation fails (discrete effects dominate), (2) the loss landscape is not smooth enough for the local diffusion approximation, or (3) during the initial transient before SGD reaches a stationary regime. In practice, a warm-up period is needed for large learning rates.

Proposition

Critical Batch Size and Training Efficiency

Statement

Let $B_{\text{noise}} = \text{tr}(C(\theta)) / \|\nabla L(\theta)\|^2$ be the gradient noise scale. The number of optimization steps $S$ to reach a target loss $\epsilon$ scales as:

$$S(B) \approx S_{\min} \cdot \left(1 + \frac{B_{\text{noise}}}{B}\right)$$

where $S_{\min}$ is the minimum number of steps achievable (as $B \to \infty$). The total compute (in units of samples processed) is $E(B) = B \cdot S(B) = S_{\min}(B + B_{\text{noise}})$: it stays near its floor for $B \lesssim B_{\text{noise}}$ and grows linearly beyond, so $B \approx B_{\text{noise}}$ is roughly the largest compute-efficient batch size.

Intuition

For $B \ll B_{\text{noise}}$: noise dominates. Doubling $B$ halves the number of steps at nearly the same total compute (linear speedup). For $B \gg B_{\text{noise}}$: signal dominates. Doubling $B$ barely reduces the step count, so total compute roughly doubles (no speedup). The crossover is at $B \approx B_{\text{noise}}$.
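Plugging illustrative numbers into the step count formula makes the crossover concrete (the $S_{\min}$ and $B_{\text{noise}}$ values here are assumptions, not measurements):

```python
def steps_to_target(B, B_noise, S_min=1000.0):
    """S(B) = S_min * (1 + B_noise / B): steps needed to reach the target loss."""
    return S_min * (1 + B_noise / B)

B_noise = 2048
for B in [64, 512, 2048, 8192, 65536]:
    S = steps_to_target(B, B_noise)
    E = B * S  # total samples processed
    print(f"B={B:6d}  steps={S:8.0f}  samples={E:12.0f}")
# Below B_noise: doubling B roughly halves steps at nearly constant compute.
# Above B_noise: steps plateau near S_min while compute grows linearly.
```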

Proof Sketch

The per-step improvement in loss is approximately $\eta \|\nabla L\|^2 - \eta^2 \text{tr}(\Sigma)/2$. With the optimal $\eta \propto B/(B + B_{\text{noise}})$ (balancing progress against noise), the per-step improvement is $\propto B/(B + B_{\text{noise}})$. Inverting gives the step count formula.

Why It Matters

$B_{\text{noise}}$ is measurable during training (estimate the gradient variance from multiple mini-batches). It tells you the largest useful batch size: going beyond $B_{\text{noise}}$ wastes compute. McCandlish et al. (2018) measured $B_{\text{noise}}$ for language models and found that it increases during training, explaining why larger batches become useful later.

Failure Mode

The analysis assumes the per-sample gradient covariance $C(\theta)$ is approximately constant, but it changes as training progresses. The noise scale $B_{\text{noise}}$ itself varies over training, so the optimal batch size is not a single number but a trajectory.

Sharp vs Flat Minima

Proposition

Small Batch SGD Favors Flat Minima

Statement

In the SDE approximation of SGD, the stationary distribution concentrates on regions where the local loss is low and the Hessian eigenvalues are small. Specifically, the stationary density is approximately proportional to:

$$p(\theta) \propto \exp\left(-\frac{2B}{\eta} L(\theta)\right) \cdot |\det H(\theta)|^{-1/2}$$

where $H(\theta)$ is the Hessian. The second factor favors flat minima (small eigenvalues of $H$).

Intuition

SGD noise acts like a temperature that helps the optimizer escape sharp minima (high curvature) but not flat minima (low curvature). The wider a minimum is, the harder it is for noise to push the iterate out. A higher noise temperature (large $\eta/B$) means only the flattest minima are stable.

Proof Sketch

Model SGD as a Langevin diffusion $d\theta = -\nabla L\,dt + \sigma\,dW$ with $\sigma^2 = \eta \cdot \text{tr}(C)/B$. The stationary distribution of Langevin dynamics is the Gibbs measure $p \propto \exp(-2L/\sigma^2)$, modified by the determinant factor when the noise covariance depends on $\theta$.
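A one-dimensional toy simulation illustrates the claim (the double-well loss and all constants are invented for this sketch): one well is sharp, one is nine times flatter, both are equally deep, and a noisy gradient iterate spends most of its time in the flat well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Piecewise-quadratic double well: sharp minimum at x = -1 (curvature 2),
# flat minimum at x = +3 (curvature 2/9), equal depth, barrier at x = 0.
def grad(x):
    return 2.0 * (x + 1.0) if x < 0 else (2.0 / 9.0) * (x - 3.0)

lr, T = 0.01, 1.0                  # step size and noise "temperature"
n_steps = 200_000
noise = np.sqrt(lr) * T * rng.normal(size=n_steps)

x, time_in_flat = 0.0, 0
for eps in noise:
    x = x - lr * grad(x) + eps     # SGD-as-Langevin update
    time_in_flat += x > 0

print(time_in_flat / n_steps)  # well above 0.5 (Gibbs weights predict ~0.75)
```

The Gibbs measure assigns each well probability proportional to its width ($\propto 1/\sqrt{\text{curvature}}$), so the flat well should capture about three quarters of the time.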

Why It Matters

This provides a theoretical explanation for the empirical observation that small-batch SGD often generalizes better than large-batch SGD: flat minima are associated with better generalization because nearby points have similar loss (the function is stable under perturbation of parameters). This is an observation, not a theorem about generalization itself; the connection between flatness and generalization is debated.

Failure Mode

The sharp/flat minima story has known weaknesses. Dinh et al. (2017) showed you can reparameterize a network to make any minimum arbitrarily sharp without changing the function it computes. The SDE approximation also breaks for large learning rates. Flatness measured by the Hessian trace may not correlate with generalization in all architectures.

Practical Implications

Warm-up. When using the linear scaling rule with large batch sizes, the initial learning rate is large. A warm-up period (linearly increasing $\eta$ from a small value over the first few epochs) prevents divergence during the initial transient, when the loss landscape curvature is high.
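A minimal linear warm-up schedule might look like this (the constant plateau after warm-up is a simplifying assumption; real schedules typically decay the learning rate afterwards):

```python
def lr_with_warmup(step, base_lr, warmup_steps):
    """Ramp the learning rate linearly up to base_lr over the first
    warmup_steps steps, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Hypothetical large-batch run whose base lr was already linearly scaled.
print([round(lr_with_warmup(s, 0.8, 5), 2) for s in range(8)])
# [0.16, 0.32, 0.48, 0.64, 0.8, 0.8, 0.8, 0.8]
```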

LARS and LAMB. For very large batch sizes ($B > 8192$), layer-wise adaptive learning rates (LARS for SGD, LAMB for Adam) adjust the step size per layer based on the ratio of parameter norm to gradient norm. This compensates for different layers having different gradient noise scales.
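The core of LARS is the layer-wise trust ratio. A simplified sketch (omitting the momentum and weight decay terms the real optimizer includes; the trust coefficient value is illustrative):

```python
import numpy as np

def lars_layer_lr(weights, grad, base_lr, trust_coef=0.001):
    """Per-layer learning rate: base_lr scaled by the trust ratio ||w|| / ||g||.

    Layers whose gradients are small relative to their weights take a
    proportionally larger step, and vice versa.
    """
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grad)
    if w_norm == 0.0 or g_norm == 0.0:
        return base_lr
    return base_lr * trust_coef * w_norm / g_norm

# A layer with weight norm 2 and gradient norm 1 gets lr 0.001 * 2 = 0.002.
print(lars_layer_lr(np.ones(4), 0.5 * np.ones(4), base_lr=1.0))
```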

Diminishing returns. For large language model training, $B_{\text{noise}}$ is typically $10^5$ to $10^7$ tokens. Beyond this, more data parallelism does not reduce wall-clock training time proportionally.

Canonical Examples

Example

ResNet-50 on ImageNet: batch size scaling

Goyal et al. (2017) trained ResNet-50 with batch sizes from 256 to 8192. With the linear scaling rule ($\eta \propto B$) and warm-up, they achieved equivalent accuracy across all batch sizes. At $B = 8192$ ($32\times$ the baseline), training completed in $\sim$1 hour on 256 GPUs. Beyond $B \approx 8192$, accuracy began to degrade, consistent with $B_{\text{noise}}$ being roughly in that range for this task.

Example

Gradient noise scale in language modeling

McCandlish et al. (2018) measured $B_{\text{noise}}$ for a Transformer language model during training. Early in training, $B_{\text{noise}} \approx 10^3$ (small batches suffice because gradients are noisy relative to their magnitude). Late in training, $B_{\text{noise}} \approx 10^6$ (the model is near a minimum, gradients are small, so large batches are needed to estimate the descent direction accurately).

Common Confusions

Watch Out

Larger batch does not always mean faster training

Larger batches reduce the number of steps, but each step processes more data. Total compute (samples processed) stays nearly flat up to $B \approx B_{\text{noise}}$ and grows beyond it: past that point, you pay more compute per step without a proportional reduction in steps. Wall-clock time may still decrease if you have idle GPUs, but compute efficiency drops.

Watch Out

The linear scaling rule is not a law

It is an approximation that holds when the SDE limit is valid, which requires $\eta$ to be small relative to the inverse curvature of the loss. For very large learning rates or very large batch sizes, discrete-time effects (finite step size) break the continuous-time approximation. In practice, the rule works well up to some critical batch size and then fails.

Watch Out

Flat minima are not guaranteed to generalize better

The flat minima hypothesis is plausible but not proven. Reparameterization can change the curvature without changing the function. PAC-Bayes bounds provide some theoretical support (flat minima correspond to wide posteriors with low KL penalty), but the connection is not definitive. Treat the flat minima story as a useful heuristic, not a theorem.

Key Takeaways

  • Batch size controls the noise-to-signal ratio of SGD gradients
  • The gradient noise scale $B_{\text{noise}}$ is the critical batch size where noise and signal are balanced
  • Linear scaling rule: scale learning rate proportionally with batch size to preserve dynamics
  • Small batches inject more noise, which can help escape sharp minima (implicit regularization)
  • Beyond $B_{\text{noise}}$, larger batches give diminishing returns in compute efficiency
  • The connection between flatness and generalization is suggestive but not proven

Exercises

ExerciseCore

Problem

You are training with batch size $B = 64$ and learning rate $\eta = 0.01$. You want to switch to $B = 256$. What learning rate should you use according to the linear scaling rule? What quantity is preserved?

ExerciseAdvanced

Problem

The gradient noise scale of your model is $B_{\text{noise}} = 2048$. You have access to 16 GPUs, each handling a local batch of 128, so your global batch size is $B = 2048$. A colleague offers you 16 more GPUs. Should you double the global batch size to 4096? Justify your answer using the step count formula.

References

Canonical:

  • Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (2017)
  • McCandlish et al., "An Empirical Model of Large-Batch Training" (2018)

Current:

  • Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size" (ICLR, 2018)
  • Hoffer et al., "Train Longer, Generalize Better" (NeurIPS, 2017)
  • Dinh et al., "Sharp Minima Can Generalize for Deep Nets" (ICML, 2017)

Next Topics

Natural extensions from batch size dynamics:

  • Learning rate schedules: how to adjust learning rate over training, complementary to batch size choices
  • Distributed training theory: communication costs and gradient compression when parallelizing across many workers

Last reviewed: April 2026
