
Optimization

SGD as a Stochastic Differential Equation

The continuous-time SDE limit of mini-batch SGD: $dW = -\nabla L\, dt + \sqrt{\eta\, \Sigma / |B|}\, dB$. Order-1 weak approximation (Li–Tai–E), Mandt–Hoffman–Blei stationary distribution, Bayesian interpretation, the linear scaling rule for batch size, and the modified-equation correction that exposes SGD's implicit gradient-norm regularizer.


Why This Matters

Stochastic gradient descent is a discrete iteration. Its trajectory in weight space is a sequence of jumps, one per mini-batch, of size proportional to the learning rate $\eta$. In the small-$\eta$ regime those jumps look — on the right time scale — like the increments of a stochastic differential equation whose drift is the population gradient $-\nabla L(W)$ and whose diffusion matrix is set by the mini-batch gradient covariance $\Sigma(W)$. This is the SDE limit of SGD, formalized by Mandt, Hoffman, and Blei (2017) and Li, Tai, and E (2017, full version in JMLR 2019).

The reason to take this limit seriously is that it gives a clean analytical handle on questions the discrete recursion answers awkwardly. Stationary distribution: under constant step size, what does SGD asymptotically sample from? Batch-size and learning-rate scaling: what combinations of $(\eta, |B|)$ leave the trajectory geometry invariant? Bayesian interpretation: in what regime does SGD behave like an approximate posterior sampler, and when does it not? Implicit bias: what continuous-time vector field does SGD secretly follow that gradient flow does not? All four questions become tractable once SGD is reframed as an SDE with a Fokker–Planck dual.

The framing has limits. The approximation is order-1 in $\eta$ on a fixed time interval; at the step sizes used in deep-learning training ($\eta \sim 0.1$ for ResNet, larger still for transformers), the approximation can break down and higher-order corrections matter (Yaida 2019). The Gaussian noise assumption fails when gradient noise is heavy-tailed (Şimşekli et al. 2019), in which case the correct continuous limit is a Lévy SDE, not a Brownian one. Despite these caveats, the SDE view is the standard analytical tool for SGD theory and the bridge to Langevin dynamics, score-based sampling, and flat-minima generalization arguments.

Mental Model

Write the SGD update as

$$W_{k+1} = W_k - \eta\, \nabla L_{B_k}(W_k),$$

where $\nabla L_{B_k}(W) = \frac{1}{|B_k|} \sum_{x \in B_k} \nabla \ell(W; x)$ is a mini-batch gradient on a uniformly sampled batch $B_k$ of size $|B|$ from the training set. Decompose the mini-batch gradient into its mean and a zero-mean fluctuation:

$$\nabla L_{B_k}(W) = \nabla L(W) + \xi_k(W), \qquad \mathbb{E}[\xi_k] = 0, \quad \mathrm{Cov}(\xi_k) = \tfrac{1}{|B|}\, \Sigma(W),$$

with per-example covariance $\Sigma(W) = \mathrm{Cov}_{x \sim D}[\nabla \ell(W; x)]$. The recursion becomes $W_{k+1} - W_k = -\eta\, \nabla L(W_k) - \eta\, \xi_k(W_k)$. The deterministic part scales as $\eta$; the noise part has variance scaling as $\eta^2 \Sigma / |B|$, so its standard deviation scales as $\eta / \sqrt{|B|}$.
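This decomposition can be checked numerically. The sketch below (synthetic least-squares data; all names and parameter values are illustrative, and batches are drawn with replacement to match the i.i.d. noise model) estimates the mini-batch gradient covariance at a fixed $W$ and compares its trace against $\mathrm{Tr}(\Sigma)/|B|$ for two batch sizes.

```python
import numpy as np

# Sketch: verify Cov(mini-batch gradient) ~ Sigma / |B| on a least-squares loss.
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)
W = rng.normal(size=d)                      # arbitrary query point in weight space

per_example = (X @ W - y)[:, None] * X      # grad of 0.5*(x^T W - y)^2 per example
Sigma = np.cov(per_example.T)               # per-example gradient covariance

def minibatch_grad(B):
    idx = rng.choice(N, size=B, replace=True)
    return per_example[idx].mean(axis=0)

for B in (10, 100):
    draws = np.stack([minibatch_grad(B) for _ in range(20_000)])
    emp_cov = np.cov(draws.T)
    # the ratio of traces should be close to 1.0 for both batch sizes
    ratio = np.trace(emp_cov) / (np.trace(Sigma) / B)
    print(B, round(ratio, 2))
```

The same experiment with larger $|B|$ shows the noise shrinking as $1/|B|$ while the mean gradient stays fixed, which is exactly the separation of scales the SDE limit exploits.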

Rescale time so that one SGD step corresponds to $\eta$ units of continuous time, $t = k\eta$. Over a continuous interval $[t, t+\eta]$ the deterministic increment is $-\eta\, \nabla L(W)$ and the stochastic increment has variance $(\eta^2 / |B|)\, \Sigma$, which we rewrite as $\eta \cdot (\eta / |B|)\, \Sigma$. Matching this to a Brownian increment $\sqrt{\eta\, \Sigma(W)/|B|}\, \Delta B$ with $\mathbb{E}[\Delta B (\Delta B)^\top] = \eta\, I$ gives the SDE

$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma(W_t)/|B|}\,dB_t.$$

The factor $\sqrt{\eta}$ in the diffusion is the signature of the SDE limit: in the small-step regime, the per-step noise $\eta\, \xi$ has standard deviation $O(\eta)$, but the cumulative noise over a unit continuous-time interval (which contains $1/\eta$ steps) has standard deviation $\sqrt{1/\eta} \cdot \eta = \sqrt{\eta}$. This is the same $\sqrt{N}$ scaling that produces Brownian motion as a continuum limit of random walks.

Formal Statement

Definition

SGD-SDE Approximation

Let $L: \mathbb{R}^d \to \mathbb{R}$ be a population loss with mini-batch gradients $\nabla L_B(W)$ satisfying $\mathbb{E}[\nabla L_B(W)] = \nabla L(W)$ and $\mathrm{Cov}(\nabla L_B(W)) = \Sigma(W)/|B|$ for batches of size $|B|$. The SGD-SDE approximation is the Itô SDE

$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma(W_t)/|B|}\;dB_t,$$

where the square root is a matrix square root of $\Sigma$ and $B_t$ is a standard $d$-dimensional Brownian motion. The time rescaling is $t = k\eta$: one SGD step corresponds to $\eta$ units of continuous time. The drift is the population gradient; the diffusion is set by the per-example gradient covariance, the batch size, and the learning rate.

The matching of one SGD step to $\eta$ continuous-time units is what makes the diffusion coefficient depend on $\eta$. A naive reading that "the noise vanishes as $\eta \to 0$" is wrong on the natural time scale. On a fixed number of steps the noise vanishes; on a fixed continuous time horizon it stays $O(1)$.

Li–Tai–E Approximation Theorem

Theorem

Li–Tai–E Order-1 Weak Approximation

Statement

Under the assumptions above, on any fixed time horizon $[0, T]$, the SGD iterates $W_k$ at step $k = \lfloor T/\eta \rfloor$ and the solution $W_T$ of the SGD-SDE driven by the same initial condition satisfy

$$\bigl|\mathbb{E}[g(W_k)] - \mathbb{E}[g(W_T)]\bigr| \le C(T, g)\,\eta$$

for every $C^4$ test function $g$ with polynomial growth. The constant $C(T, g)$ depends on derivatives of $g$ and on Lipschitz / boundedness constants of $L$ and $\Sigma$, but not on $\eta$. This is weak approximation of order 1.

Intuition

The SGD update and one Itô–Taylor step of the SDE agree in their first two moments to order $\eta$. Higher moments differ at order $\eta^{3/2}$ and above. A test function $g$ probes only the law of the iterate, not its pathwise realization, so first-two-moment matching plus a Grönwall argument over $T/\eta$ steps gives the order-1 bound on the cumulative error in expectation.

Proof Sketch

Taylor-expand $g(W_{k+1}) = g(W_k - \eta\, \nabla L_{B_k}(W_k))$ to third order in $\eta$. Take conditional expectation over the batch: the linear term gives $-\eta\, (\nabla g)^\top \nabla L$; the quadratic term contributes both a Hessian-of-$L$ piece and a noise-covariance piece $\tfrac{1}{2} (\eta^2/|B|)\, \mathrm{Tr}(\Sigma\, \nabla^2 g)$, where the $\eta^2 / |B|$ from $\mathrm{Cov}(\nabla L_B) = \Sigma/|B|$ matches the $\eta\, \Sigma/|B|$ diffusion of the SDE under the time rescaling $t = k\eta$. Compare against the Itô–Taylor expansion of $\mathbb{E}[g(W_T)]$ on the increment $[t, t+\eta]$; the leading terms agree, and the $O(\eta^2)$ residual sums over $T/\eta$ steps to $O(\eta)$. Grönwall gives the bound. This is Theorem 1 of Li, Tai, and E (2017); the JMLR version (2019) gives the order-2 weak approximation when one tracks an extra correction term in the drift.

Why It Matters

Order-1 weak approximation is the formal statement that distributional quantities computed from the SDE — stationary covariances, escape rates from saddles, mixing times, expected losses — are correct to leading order in $\eta$. It justifies replacing the discrete recursion with the continuous-time object for any analysis that ultimately cares about expectations. It does not justify pathwise statements: a single realization of SGD and a single realization of the SDE coupled to the same Brownian motion can be order-$\sqrt{\eta}$ apart in path norm, the same strong-vs-weak gap that appears in Euler–Maruyama analysis.
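The order-1 rate is easy to see numerically in a case where both sides are available in closed form. For a 1-D quadratic $L(w) = \tfrac{\lambda}{2} w^2$ and the test function $g(w) = w$, the noise is mean-zero, so the discrete mean is exactly $(1-\eta\lambda)^{T/\eta} w_0$ while the SDE mean is $e^{-\lambda T} w_0$. The sketch below (parameter values chosen for illustration) checks that halving $\eta$ roughly halves the gap.

```python
import numpy as np

# Weak error for g(w) = w on a 1-D quadratic: compare the exact SGD mean
# (1 - eta*lam)^(T/eta) * w0 against the exact SDE mean exp(-lam*T) * w0.
lam, w0, T = 1.0, 1.0, 2.0
etas = [0.1, 0.05, 0.025, 0.0125]
errors = []
for eta in etas:
    k = int(round(T / eta))
    discrete_mean = (1.0 - eta * lam) ** k * w0   # E[W_k]: noise averages out
    sde_mean = np.exp(-lam * T) * w0              # E[W_T] for the OU limit
    errors.append(abs(discrete_mean - sde_mean))

# consecutive error ratios: order-1 weak approximation predicts ~2 per halving
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print([round(r, 2) for r in ratios])   # each ratio close to 2
```

A strong (pathwise) comparison with a shared noise sequence would instead show an $O(\sqrt{\eta})$ gap, which is the weak-vs-strong distinction the paragraph above draws.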

Failure Mode

The order-1 bound has a hidden multiplicative constant in the Lipschitz and boundedness assumptions. Deep networks at the start of training have loss landscapes where third and fourth derivatives are very large (sharp near-zero curvature, exponential activations), and the constant $C(T, g)$ can be large enough that the approximation is poor at $\eta = 0.1$ even though it is asymptotically of order $\eta$. Yaida (ICLR 2019) develops fluctuation–dissipation relations that are exact to higher order and useful for diagnosing when the SDE limit fits empirical SGD trajectories. A second failure mode: when per-example gradient norms are heavy-tailed (infinite variance), the central-limit step in the SDE derivation breaks and the right continuous limit is an $\alpha$-stable Lévy SDE, not a Brownian one (Şimşekli, Sagun, and Gürbüzbalaban, ICML 2019).

Stationary Distribution and Bayesian Interpretation

Under constant $\eta$ and assuming the SDE has a unique invariant measure, the SGD iterate has an approximate stationary distribution. The Fokker–Planck equation associated with the SGD-SDE is

$$\partial_t p = \nabla \cdot (\nabla L\, p) + \tfrac{\eta}{2|B|}\, \nabla \cdot \nabla \cdot (\Sigma\, p),$$

where the second term is shorthand for $\sum_{ij} \partial_i \partial_j (\Sigma_{ij}\, p)$. The stationary density solves $\nabla \cdot (\nabla L\, p_\infty) + (\eta / 2|B|)\, \nabla \cdot \nabla \cdot (\Sigma\, p_\infty) = 0$. In general this PDE has no closed-form solution because $\Sigma$ depends on $W$ in a non-Gibbs way.

Two simplifying regimes admit closed forms. First, when the noise is isotropic and homogeneous, $\Sigma(W) = \sigma^2 I$, the SDE coincides with overdamped Langevin dynamics on $L$ at inverse temperature $\beta = 2|B| / (\eta \sigma^2)$. The stationary distribution is the Gibbs measure

$$p_\infty(W) \propto \exp\!\bigl(-\tfrac{2|B|}{\eta\sigma^2}\, L(W)\bigr), \qquad T_{\mathrm{eff}} = \tfrac{\eta\sigma^2}{2|B|}.$$

The effective temperature $T_{\mathrm{eff}}$ scales linearly in $\eta$ and inversely in $|B|$. SGD at higher learning rate (or smaller batch) samples from a hotter Gibbs measure: more spread-out, more weight on flatter regions, less concentration at the global minimum.

Second, when the loss is quadratic and the noise is constant, the SDE is an Ornstein–Uhlenbeck process and the stationary distribution is explicitly Gaussian (worked out below).

The Bayesian reading, due to Mandt, Hoffman, and Blei (2017): if $L(W) = -\log p(\mathcal{D} \mid W) - \log p(W)$ is a Bayesian negative log posterior (up to an additive constant), then SGD with isotropic noise and tuned $(\eta, |B|)$ asymptotically samples from a tempered posterior $p(W \mid \mathcal{D})^{1/T_{\mathrm{eff}}}$. At $T_{\mathrm{eff}} = 1$ this is the true posterior; at $T_{\mathrm{eff}} \ll 1$ it is a sharpened, more MAP-like distribution. Mandt, Hoffman, and Blei use this to repurpose constant-step SGD as an approximate posterior sampler, calibrating $(\eta, |B|)$ so that $T_{\mathrm{eff}}$ matches the desired tempering. The full method requires a preconditioner that aligns the noise covariance with the inverse Fisher; without it, $\Sigma$ is anisotropic and the stationary distribution is no longer Gibbs.
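The Gibbs claim in the isotropic regime can be verified symbolically. The sketch below (1-D, generic smooth loss $L$, constant noise $\sigma^2$) checks that $p \propto \exp(-L/T_{\mathrm{eff}})$ makes the stationary Fokker–Planck probability flux $L'(w)\,p + T_{\mathrm{eff}}\, p'(w)$ vanish identically.

```python
import sympy as sp

# Symbolic check: the Gibbs density solves the 1-D stationary Fokker-Planck
# equation d/dw [ L'(w) p + T_eff p'(w) ] = 0 with T_eff = eta*sigma^2/(2|B|).
w = sp.symbols('w', real=True)
eta, sigma2, B = sp.symbols('eta sigma2 B', positive=True)
L = sp.Function('L')(w)                 # generic smooth loss
T_eff = eta * sigma2 / (2 * B)

p = sp.exp(-L / T_eff)                  # unnormalized Gibbs density
flux = sp.diff(L, w) * p + T_eff * sp.diff(p, w)   # stationary probability flux
print(sp.simplify(flux))                # prints 0: the flux vanishes identically
```

Zero flux (not merely zero divergence) is the detailed-balance property that makes the Langevin case special; with anisotropic, $W$-dependent $\Sigma$ the stationary flux is generically nonzero and the Gibbs form fails, as the paragraph above notes.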

Linear Scaling Rule

Read the SDE again: $dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma / |B|}\, dB_t$. The drift is independent of $\eta$ and $|B|$. The diffusion depends on the combination $\eta / |B|$. So holding $\eta / |B|$ constant leaves the SDE, and therefore the trajectory geometry on any fixed continuous-time interval, invariant.

This is the linear scaling rule for batch size: when you increase the batch size by a factor $k$, increase the learning rate by the same factor to preserve the SDE. Goyal et al. (arXiv 1706.02677, 2017) demonstrated empirically that the rule holds for ImageNet ResNet-50 training up to batch size $\sim 8192$, allowing one-hour training without loss of generalization. Smith and Le (ICLR 2018) gave the SDE-based derivation. Hoffer, Hubara, and Soudry (NeurIPS 2017) noted a refinement: to match generalization, train longer at large batch (or more precisely, match the number of SDE-time-units, not the number of epochs), because the SDE-time per epoch is $\eta \cdot N / |B|$ where $N$ is the dataset size.

The rule breaks at very large batch ($|B| \gtrsim 10^4$ for typical vision models, larger for transformers). Three mechanisms contribute: the SDE approximation needs $\eta$ small, but linear scaling drives $\eta$ up; the mini-batch noise covariance $\Sigma/|B|$ shrinks as $1/|B|$ until other sources of randomness (data ordering, augmentation) dominate; and at small noise the system is close to deterministic gradient descent and loses the implicit-regularization benefit of the noise. McCandlish et al. (2018) introduced the "critical batch size" framework that quantifies where this transition occurs in terms of the gradient covariance.
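The invariance is easy to see in simulation. The sketch below (1-D quadratic $L(w) = \tfrac{\lambda}{2} w^2$ with synthetic Gaussian gradient noise; all parameter values illustrative) runs discrete SGD at $(\eta, |B|)$ and at the scaled pair $(2\eta, 2|B|)$ and compares the stationary variances, which should both sit near $T_{\mathrm{eff}}/\lambda$.

```python
import numpy as np

# Linear scaling rule on a 1-D quadratic: doubling both eta and |B|
# keeps eta/|B|, hence T_eff and the stationary variance, unchanged.
def stationary_var(eta, batch, lam=1.0, sigma=1.0, steps=400_000, seed=0):
    rng = np.random.default_rng(seed)
    w, samples = 0.0, []
    for k in range(steps):
        noise = sigma / np.sqrt(batch) * rng.normal()   # mini-batch gradient noise
        w = w - eta * (lam * w + noise)                 # one SGD step
        if k > steps // 2:                              # discard burn-in
            samples.append(w)
    return np.var(samples)

v1 = stationary_var(eta=0.01, batch=10)
v2 = stationary_var(eta=0.02, batch=20, seed=1)   # scaled pair, same eta/|B|
t_eff = 0.01 * 1.0 / (2 * 10)                     # eta*sigma^2 / (2|B|)
print(v1, v2, t_eff)   # both variances near T_eff / lam = 5e-4
```

Pushing `eta` toward $1/\lambda$ in this toy makes the discrete stationary variance $\eta\sigma^2 / (|B|\,\lambda\,(2 - \eta\lambda))$ drift away from the SDE prediction, a miniature of the large-batch breakdown described above.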

Worked Example

Quadratic loss $L(W) = \tfrac{1}{2} W^\top H W$ with $H \succ 0$, and isotropic constant noise $\Sigma = \sigma^2 I$. The SGD-SDE is the Ornstein–Uhlenbeck process

$$dW_t = -H W_t\,dt + \sqrt{\tfrac{\eta\sigma^2}{|B|}}\;dB_t.$$

Solve in closed form: $W_t = e^{-Ht} W_0 + \sqrt{\eta\sigma^2/|B|} \int_0^t e^{-H(t-s)}\,dB_s$. The mean $e^{-Ht} W_0 \to 0$. Stationary covariance: solve the Lyapunov equation $H \Sigma_\infty + \Sigma_\infty H = (\eta\sigma^2/|B|)\, I$, giving $\Sigma_\infty = T_{\mathrm{eff}}\, H^{-1}$ with $T_{\mathrm{eff}} = \eta\sigma^2/(2|B|)$. The stationary distribution is

$$W_\infty \sim \mathcal{N}\!\bigl(0,\; T_{\mathrm{eff}}\, H^{-1}\bigr).$$

In the eigenbasis of $H$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d > 0$, the stationary variance along the $i$-th eigendirection is $T_{\mathrm{eff}} / \lambda_i$. SGD spreads most in the directions where $L$ is flattest (small $\lambda_i$) and concentrates in the sharpest directions (large $\lambda_i$). This is the precise content of the heuristic that "SGD spends most time in flat regions of the loss."
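The Lyapunov step can be checked numerically for a random positive-definite $H$; the sketch below (illustrative dimensions and parameter values) solves $H S + S H = (\eta\sigma^2/|B|)\, I$ with SciPy and confirms $S = T_{\mathrm{eff}}\, H^{-1}$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Worked-example check: the stationary covariance of the OU process
# dW = -H W dt + sqrt(eta*sigma^2/|B|) dB solves H S + S H = (eta*sigma^2/|B|) I,
# and the solution is T_eff * H^{-1} with T_eff = eta*sigma^2 / (2|B|).
rng = np.random.default_rng(0)
d, eta, sigma2, B = 4, 0.01, 1.0, 10
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)                  # random symmetric positive definite
D = (eta * sigma2 / B) * np.eye(d)           # diffusion matrix

S = solve_continuous_lyapunov(H, D)          # solves H S + S H = D
T_eff = eta * sigma2 / (2 * B)
print(np.allclose(S, T_eff * np.linalg.inv(H)))   # True
```

Diagonalizing `S` in the eigenbasis of `H` recovers the per-direction variances $T_{\mathrm{eff}}/\lambda_i$ quoted above.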

The flat-minima generalization story (Hochreiter and Schmidhuber 1997; Keskar et al. ICLR 2017) reads this as a mechanism: if generalization correlates with flatness (measured, for example, by the trace of $H$ near the solution), then the SGD-SDE biases toward flatter minima and hence toward better-generalizing minima. The argument is suggestive, not a theorem in non-convex deep nets; flatness measures are reparametrization-dependent (Dinh, Pascanu, Bengio, and Bengio, ICML 2017), and the SDE limit captures only one mechanism among several candidates for SGD's implicit bias.

Common Confusions

Watch Out

The Gaussian-noise assumption can fail

The SDE derivation invokes a central-limit-type approximation: the mini-batch gradient noise, summed over many small steps, looks Gaussian. This is fine when per-example gradient norms have finite variance and the network is not too deep. Şimşekli, Sagun, and Gürbüzbalaban (ICML 2019) measured per-example gradient norms in deep networks and reported heavy-tailed distributions whose tail index $\alpha$ falls below 2, violating the finite-variance hypothesis. Under their measurements the right continuous limit is an $\alpha$-stable Lévy SDE, whose escape times from local minima scale polynomially rather than exponentially in the barrier height. The Brownian-SDE picture and the Lévy-SDE picture make qualitatively different predictions about saddle escape; which one applies in any given training run is an empirical question.

Watch Out

'SGD prefers flat minima' is a heuristic, not a theorem

The OU calculation shows that on a quadratic, the SGD-SDE stationary covariance is largest along flat eigendirections. Generalizing this to non-convex deep nets requires several steps that are not airtight: the loss is not quadratic; the noise covariance is anisotropic and depends on $W$; flatness measures are not reparametrization-invariant; and the SDE limit only approximates the discrete dynamics. Implicit bias of SGD is an active area with multiple competing explanations: the SDE stationary distribution, the modified-equation drift correction, label noise, edge-of-stability dynamics, and weight-decay interactions are all candidates and none subsumes the others. Treat the SDE flat-minima argument as one mechanism, not the explanation.

Watch Out

The SDE treats noise as Markov; real SGD has correlation structure

The SDE assumes the noise increments at distinct continuous times are independent, by construction of Brownian motion. Real SGD draws batches without replacement within each epoch; the per-step noises are negatively correlated across an epoch and the cumulative noise over an epoch is exactly zero (every example contributes once). This violates the Markov assumption at the epoch scale. Smith, Elsen, and De (ICML 2020) study the epoch-level effect; for typical training horizons the deviation from the Brownian SDE is small but not zero, and the mismatch matters when reasoning about variance of long-time averages. With-replacement sampling restores the Markov assumption at the cost of slower per-epoch progress.
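The exact-cancellation claim is a two-line calculation, but it is also easy to check directly. The sketch below (stand-in per-example gradients, illustrative sizes) partitions one epoch into without-replacement batches and verifies that the noise terms $\xi_k = \nabla L_{B_k} - \nabla L$ sum to exactly zero over the epoch.

```python
import numpy as np

# Without-replacement check: every example appears exactly once per epoch,
# so the sum of batch means equals (N/|B|) times the full gradient and the
# mini-batch noises cancel exactly over the epoch.
rng = np.random.default_rng(0)
N, B = 12, 3                                  # 4 batches per epoch
per_example = rng.normal(size=(N, 2))         # stand-in per-example gradients
full = per_example.mean(axis=0)               # full-batch gradient

perm = rng.permutation(N)                     # one epoch's shuffle
noises = [per_example[perm[i:i + B]].mean(axis=0) - full
          for i in range(0, N, B)]
print(np.allclose(np.sum(noises, axis=0), 0.0))   # True
```

Repeating the experiment with `rng.choice(N, size=B, replace=True)` batches gives a nonzero epoch sum, which is the with-replacement regime the Brownian SDE actually models.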

Exercises

ExerciseCore

Problem

Take $L(W) = \tfrac{1}{2} W^\top H W$ with $H = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_i > 0$, and isotropic noise $\Sigma = \sigma^2 I$. Write down the SGD-SDE, solve the Lyapunov equation for the stationary covariance, and verify that the stationary variance along the $i$-th eigendirection is $T_{\mathrm{eff}}/\lambda_i$ with $T_{\mathrm{eff}} = \eta\sigma^2/(2|B|)$.

ExerciseAdvanced

Problem

Derive the modified equation for SGD: show that to first order in $\eta$, the SGD iterate $W_k$ tracks not the gradient flow $\dot W = -\nabla L$ but a corrected SDE with drift $-\nabla \tilde L(W) = -\nabla L(W) - \tfrac{\eta}{4} \nabla \lVert \nabla L(W) \rVert^2$, plus the same diffusion as before. Conclude that SGD has an implicit gradient-norm regularizer of strength $\eta/4$.

Next Topics

  • Langevin Dynamics: the SDE that the SGD-SDE reduces to under isotropic noise; the source of the Gibbs stationary distribution and the Bayesian reading.
  • Fokker–Planck Equation: the PDE governing the time-evolving density of the SGD-SDE, where stationary distributions and mixing rates live.
  • Stochastic Gradient Descent Convergence: the discrete-iteration convergence theory the SDE limit complements; non-asymptotic rates without the small-step approximation.
  • Implicit Bias and Modern Generalization: the broader question of which minima SGD selects, of which the SDE flat-minima story is one mechanism among several.
  • Stochastic Differential Equations: the parent SDE framework, including Euler–Maruyama (which the SGD update mirrors) and weak-vs-strong approximation theory.

Last reviewed: April 18, 2026
