
Optimization

SGD as a Stochastic Differential Equation

The continuous-time SDE limit of mini-batch SGD: $dW = -\nabla L\, dt + \sqrt{\eta\, \Sigma / |B|}\, dB$. Order-1 weak approximation (Li–Tai–E), Mandt–Hoffman–Blei stationary distribution, Bayesian interpretation, the linear scaling rule for batch size, and the modified-equation correction that exposes SGD's implicit gradient-norm regularizer.


Why This Matters

Stochastic gradient descent is a discrete iteration. Its trajectory in weight space is a sequence of jumps, one per mini-batch, of size proportional to the learning rate $\eta$. In the small-$\eta$ regime those jumps look — on the right time scale — like the increments of a stochastic differential equation whose drift is the population gradient $-\nabla L(W)$ and whose diffusion matrix is set by the mini-batch gradient covariance $\Sigma(W)$. This is the SDE limit of SGD, formalized by Mandt, Hoffman, and Blei (2017) and Li, Tai, and E (2017, full version in JMLR 2019).

The reason to take this limit seriously is that it gives a clean analytical handle on questions the discrete recursion answers awkwardly. Stationary distribution: under constant step size, what does SGD asymptotically sample from? Batch-size and learning-rate scaling: what combinations of $(\eta, |B|)$ leave the trajectory geometry invariant? Bayesian interpretation: in what regime does SGD behave like an approximate posterior sampler, and when does it not? Implicit bias: what continuous-time vector field does SGD secretly follow that gradient flow does not? All four questions become tractable once SGD is reframed as an SDE with a Fokker–Planck dual.

The framing has limits. The approximation is order-1 in $\eta$ on a fixed time interval; at the step sizes used in deep-learning training ($\eta \sim 0.1$ for ResNet, larger still for transformers), the approximation can break down and higher-order corrections matter (Yaida 2019). The Gaussian noise assumption fails when gradient noise is heavy-tailed (Şimşekli et al. 2019), in which case the correct continuous limit is a Lévy SDE, not a Brownian one. Despite these caveats, the SDE view is the standard analytical tool for SGD theory and the bridge to Langevin dynamics, score-based sampling, and flat-minima generalization arguments.

Mental Model

Write the SGD update as

$$W_{k+1} = W_k - \eta\, \nabla L_{B_k}(W_k),$$

where $\nabla L_{B_k}(W) = \frac{1}{|B_k|} \sum_{x \in B_k} \nabla \ell(W; x)$ is a mini-batch gradient on a uniformly sampled batch $B_k$ of size $|B|$ from the training set. Decompose the mini-batch gradient into its mean and a zero-mean fluctuation:

$$\nabla L_{B_k}(W) = \nabla L(W) + \xi_k(W), \qquad \mathbb{E}[\xi_k] = 0, \quad \mathrm{Cov}(\xi_k) = \tfrac{1}{|B|}\, \Sigma(W),$$

with per-example covariance $\Sigma(W) = \mathrm{Cov}_{x \sim D}[\nabla \ell(W; x)]$. The recursion becomes $W_{k+1} - W_k = -\eta\, \nabla L(W_k) - \eta\, \xi_k(W_k)$. The deterministic part scales as $\eta$; the noise part has variance scaling as $\eta^2 \Sigma / |B|$, so its standard deviation scales as $\eta / \sqrt{|B|}$.
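This decomposition can be checked numerically. The sketch below (synthetic least-squares data; all names and parameter values are illustrative, and batches are drawn with replacement to match the i.i.d. noise model) estimates the mini-batch gradient covariance at a fixed $W$ and compares its trace against $\mathrm{Tr}(\Sigma)/|B|$ for two batch sizes.

```python
import numpy as np

# Sketch: verify Cov(mini-batch gradient) ~ Sigma / |B| on a least-squares loss.
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=N)
W = rng.normal(size=d)                      # arbitrary query point in weight space

per_example = (X @ W - y)[:, None] * X      # grad of 0.5*(x^T W - y)^2 per example
Sigma = np.cov(per_example.T)               # per-example gradient covariance

def minibatch_grad(B):
    idx = rng.choice(N, size=B, replace=True)
    return per_example[idx].mean(axis=0)

for B in (10, 100):
    draws = np.stack([minibatch_grad(B) for _ in range(20_000)])
    emp_cov = np.cov(draws.T)
    # the ratio of traces should be close to 1.0 for both batch sizes
    ratio = np.trace(emp_cov) / (np.trace(Sigma) / B)
    print(B, round(ratio, 2))
```

The same experiment with larger $|B|$ shows the noise shrinking as $1/|B|$ while the mean gradient stays fixed, which is exactly the separation of scales the SDE limit exploits.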

Rescale time so that one SGD step corresponds to $\eta$ units of continuous time, $t = k\eta$. Over a continuous interval $[t, t+\eta]$ the deterministic increment is $-\eta\, \nabla L(W)$ and the stochastic increment has variance $(\eta^2 / |B|)\, \Sigma$, which we rewrite as $\eta \cdot (\eta / |B|)\, \Sigma$. Matching this to a Brownian increment $\sqrt{\eta\, \Sigma(W)/|B|}\, \Delta B$ with $\mathbb{E}[\Delta B (\Delta B)^\top] = \eta\, I$ gives the SDE

$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma(W_t)/|B|}\,dB_t.$$

The factor $\sqrt{\eta}$ in the diffusion is the signature of the SDE limit: in the small-step regime, the per-step noise $\eta\, \xi$ has standard deviation $O(\eta)$, but the cumulative noise over a unit continuous-time interval (which contains $1/\eta$ steps) has standard deviation $\sqrt{1/\eta} \cdot \eta = \sqrt{\eta}$. This is the same $\sqrt{N}$ scaling that produces Brownian motion as a continuum limit of random walks.

Formal Statement

Definition

SGD-SDE Approximation

Let $L: \mathbb{R}^d \to \mathbb{R}$ be a population loss with mini-batch gradients $\nabla L_B(W)$ satisfying $\mathbb{E}[\nabla L_B(W)] = \nabla L(W)$ and $\mathrm{Cov}(\nabla L_B(W)) = \Sigma(W)/|B|$ for batches of size $|B|$. The SGD-SDE approximation is the Itô SDE

$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma(W_t)/|B|}\;dB_t,$$

where the square root is a matrix square root of $\Sigma$ and $B_t$ is a standard $d$-dimensional Brownian motion. The time rescaling is $t = k\eta$: one SGD step corresponds to $\eta$ units of continuous time. The drift is the population gradient; the diffusion is set by the per-example gradient covariance, the batch size, and the learning rate.

The matching of one SGD step to $\eta$ continuous-time units is what makes the diffusion coefficient depend on $\eta$. A naive reading that "the noise vanishes as $\eta \to 0$" is wrong on the natural time scale. On a fixed number of steps the noise vanishes; on a fixed continuous time horizon it stays $O(1)$.

Li–Tai–E Approximation Theorem

Theorem

Li–Tai–E Order-1 Weak Approximation

Statement

Under the assumptions above, on any fixed time horizon $[0, T]$, the SGD iterates $W_k$ at step $k = \lfloor T/\eta \rfloor$ and the solution $W_T$ of the SGD-SDE driven by the same initial condition satisfy

$$\bigl|\mathbb{E}[g(W_k)] - \mathbb{E}[g(W_T)]\bigr| \le C(T, g)\,\eta$$

for every $C^4$ test function $g$ with polynomial growth. The constant $C(T, g)$ depends on derivatives of $g$ and on Lipschitz / boundedness constants of $L$ and $\Sigma$, but not on $\eta$. This is weak approximation of order 1.

Intuition

The SGD update and one Itô–Taylor step of the SDE agree in their first two moments to order $\eta$. Higher moments differ at order $\eta^{3/2}$ and above. A test function $g$ probes only the law of the iterate, not its pathwise realization, so first-two-moment matching plus a Grönwall argument over $T/\eta$ steps gives the order-1 bound on the cumulative error in expectation.

Proof Sketch

Taylor-expand $g(W_{k+1}) = g(W_k - \eta\, \nabla L_{B_k}(W_k))$ to third order in $\eta$. Take conditional expectation over the batch: the linear term gives $-\eta\, (\nabla g)^\top \nabla L$; the quadratic term contributes both a Hessian-of-$L$ piece and a noise-covariance piece $\tfrac{1}{2} (\eta^2/|B|)\, \mathrm{Tr}(\Sigma\, \nabla^2 g)$, where the $\eta^2 / |B|$ from $\mathrm{Cov}(\nabla L_B) = \Sigma/|B|$ matches the $\eta\, \Sigma/|B|$ diffusion of the SDE under the time rescaling $t = k\eta$. Compare against the Itô–Taylor expansion of $\mathbb{E}[g(W_T)]$ on the increment $[t, t+\eta]$; the leading terms agree, and the $O(\eta^2)$ residual sums over $T/\eta$ steps to $O(\eta)$. Grönwall gives the bound. This is Theorem 1 of Li, Tai, and E (2017); the JMLR version (2019) gives the order-2 weak approximation when one tracks an extra correction term in the drift.

Why It Matters

Order-1 weak approximation is the formal statement that distributional quantities computed from the SDE — stationary covariances, escape rates from saddles, mixing times, expected losses — are correct to leading order in $\eta$. It justifies replacing the discrete recursion with the continuous-time object for any analysis that ultimately cares about expectations. It does not justify pathwise statements: a single realization of SGD and a single realization of the SDE coupled to the same Brownian motion can be order-$\sqrt{\eta}$ apart in path norm, the same strong-vs-weak gap that appears in Euler–Maruyama analysis.
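The order-1 rate is easy to see numerically in a case where both sides are available in closed form. For a 1-D quadratic $L(w) = \tfrac{\lambda}{2} w^2$ and the test function $g(w) = w$, the noise is mean-zero, so the discrete mean is exactly $(1-\eta\lambda)^{T/\eta} w_0$ while the SDE mean is $e^{-\lambda T} w_0$. The sketch below (parameter values chosen for illustration) checks that halving $\eta$ roughly halves the gap.

```python
import numpy as np

# Weak error for g(w) = w on a 1-D quadratic: compare the exact SGD mean
# (1 - eta*lam)^(T/eta) * w0 against the exact SDE mean exp(-lam*T) * w0.
lam, w0, T = 1.0, 1.0, 2.0
etas = [0.1, 0.05, 0.025, 0.0125]
errors = []
for eta in etas:
    k = int(round(T / eta))
    discrete_mean = (1.0 - eta * lam) ** k * w0   # E[W_k]: noise averages out
    sde_mean = np.exp(-lam * T) * w0              # E[W_T] for the OU limit
    errors.append(abs(discrete_mean - sde_mean))

# consecutive error ratios: order-1 weak approximation predicts ~2 per halving
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print([round(r, 2) for r in ratios])   # each ratio close to 2
```

A strong (pathwise) comparison with a shared noise sequence would instead show an $O(\sqrt{\eta})$ gap, which is the weak-vs-strong distinction the paragraph above draws.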

Failure Mode

The order-1 bound has a hidden multiplicative constant in the Lipschitz and boundedness assumptions. Deep networks at the start of training have loss landscapes where third and fourth derivatives are very large (sharp near-zero curvature, exponential activations), and the constant $C(T, g)$ can be large enough that the approximation is poor at $\eta = 0.1$ even though it is asymptotically of order $\eta$. Yaida (ICLR 2019) develops fluctuation–dissipation relations that are exact to higher order and useful for diagnosing when the SDE limit fits empirical SGD trajectories. A second failure mode: when per-example gradient norms are heavy-tailed (infinite variance), the central-limit step in the SDE derivation breaks and the right continuous limit is an $\alpha$-stable Lévy SDE, not a Brownian one (Şimşekli, Sagun, and Gürbüzbalaban, ICML 2019).

Stationary Distribution and Bayesian Interpretation

Under constant $\eta$ and assuming the SDE has a unique invariant measure, the SGD iterate has an approximate stationary distribution. The Fokker–Planck equation associated with the SGD-SDE is

$$\partial_t p = \nabla \cdot (\nabla L\, p) + \tfrac{\eta}{2|B|}\, \nabla \cdot \nabla \cdot (\Sigma\, p),$$

where the second term is shorthand for $\sum_{ij} \partial_i \partial_j (\Sigma_{ij}\, p)$. The stationary density solves $\nabla \cdot (\nabla L\, p_\infty) + (\eta / 2|B|)\, \nabla \cdot \nabla \cdot (\Sigma\, p_\infty) = 0$. In general this PDE has no closed-form solution because $\Sigma$ depends on $W$ in a non-Gibbs way.

Two simplifying regimes admit closed forms. First, when the noise is isotropic and homogeneous, $\Sigma(W) = \sigma^2 I$, the SDE coincides with overdamped Langevin dynamics on $L$ at inverse temperature $\beta = 2|B| / (\eta \sigma^2)$. The stationary distribution is the Gibbs measure

$$p_\infty(W) \propto \exp\!\bigl(-\tfrac{2|B|}{\eta\sigma^2}\, L(W)\bigr), \qquad T_{\mathrm{eff}} = \tfrac{\eta\sigma^2}{2|B|}.$$

The effective temperature $T_{\mathrm{eff}}$ scales linearly in $\eta$ and inversely in $|B|$. SGD at higher learning rate (or smaller batch) samples from a hotter Gibbs measure: more spread-out, more weight on flatter regions, less concentration at the global minimum.

Second, when the loss is quadratic and the noise is constant, the SDE is an Ornstein–Uhlenbeck process and the stationary distribution is explicitly Gaussian (worked out below).

The Bayesian reading, due to Mandt, Hoffman, and Blei (2017): if $L(W) = -\log p(\mathcal{D} \mid W) - \log p(W)$ is a Bayesian negative log posterior (up to an additive constant), then SGD with isotropic noise and tuned $(\eta, |B|)$ asymptotically samples from a tempered posterior $p(W \mid \mathcal{D})^{1/T_{\mathrm{eff}}}$. At $T_{\mathrm{eff}} = 1$ this is the true posterior; at $T_{\mathrm{eff}} \ll 1$ it is a sharpened, more MAP-like distribution. Mandt, Hoffman, and Blei use this to repurpose constant-step SGD as an approximate posterior sampler, calibrating $(\eta, |B|)$ so that $T_{\mathrm{eff}}$ matches the desired tempering. The full method requires a preconditioner that aligns the noise covariance with the inverse Fisher; without it, $\Sigma$ is anisotropic and the stationary distribution is no longer Gibbs.
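The Gibbs claim in the isotropic regime can be verified symbolically. The sketch below (1-D, generic smooth loss $L$, constant noise $\sigma^2$) checks that $p \propto \exp(-L/T_{\mathrm{eff}})$ makes the stationary Fokker–Planck probability flux $L'(w)\,p + T_{\mathrm{eff}}\, p'(w)$ vanish identically.

```python
import sympy as sp

# Symbolic check: the Gibbs density solves the 1-D stationary Fokker-Planck
# equation d/dw [ L'(w) p + T_eff p'(w) ] = 0 with T_eff = eta*sigma^2/(2|B|).
w = sp.symbols('w', real=True)
eta, sigma2, B = sp.symbols('eta sigma2 B', positive=True)
L = sp.Function('L')(w)                 # generic smooth loss
T_eff = eta * sigma2 / (2 * B)

p = sp.exp(-L / T_eff)                  # unnormalized Gibbs density
flux = sp.diff(L, w) * p + T_eff * sp.diff(p, w)   # stationary probability flux
print(sp.simplify(flux))                # prints 0: the flux vanishes identically
```

Zero flux (not merely zero divergence) is the detailed-balance property that makes the Langevin case special; with anisotropic, $W$-dependent $\Sigma$ the stationary flux is generically nonzero and the Gibbs form fails, as the paragraph above notes.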

Linear Scaling Rule

Read the SDE again: $dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\, \Sigma / |B|}\, dB_t$. The drift is independent of $\eta$ and $|B|$. The diffusion depends on the combination $\eta / |B|$. So holding $\eta / |B|$ constant leaves the SDE, and therefore the trajectory geometry on any fixed continuous-time interval, invariant.

This is the linear scaling rule for batch size: when you increase the batch size by a factor $k$, increase the learning rate by the same factor to preserve the SDE. Goyal et al. (arXiv 1706.02677, 2017) demonstrated empirically that the rule holds for ImageNet ResNet-50 training up to batch size $\sim 8192$, allowing one-hour training without loss of generalization. Smith and Le (ICLR 2018) gave the SDE-based derivation. Hoffer, Hubara, and Soudry (NeurIPS 2017) noted a refinement: to match generalization, train longer at large batch (or more precisely, match the number of SDE-time-units, not the number of epochs), because the SDE-time per epoch is $\eta \cdot N / |B|$ where $N$ is the dataset size.

The rule breaks at very large batch ($|B| \gtrsim 10^4$ for typical vision models, larger for transformers). Three mechanisms contribute: the SDE approximation needs $\eta$ small, but linear scaling drives $\eta$ up; the mini-batch noise covariance $\Sigma/|B|$ shrinks as $1/|B|$ until other sources of randomness (data ordering, augmentation) dominate; and at small noise the system is close to deterministic gradient descent and loses the implicit-regularization benefit of the noise. McCandlish et al. (2018) introduced the "critical batch size" framework that quantifies where this transition occurs in terms of the gradient covariance.
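The invariance is easy to see in simulation. The sketch below (1-D quadratic $L(w) = \tfrac{\lambda}{2} w^2$ with synthetic Gaussian gradient noise; all parameter values illustrative) runs discrete SGD at $(\eta, |B|)$ and at the scaled pair $(2\eta, 2|B|)$ and compares the stationary variances, which should both sit near $T_{\mathrm{eff}}/\lambda$.

```python
import numpy as np

# Linear scaling rule on a 1-D quadratic: doubling both eta and |B|
# keeps eta/|B|, hence T_eff and the stationary variance, unchanged.
def stationary_var(eta, batch, lam=1.0, sigma=1.0, steps=400_000, seed=0):
    rng = np.random.default_rng(seed)
    w, samples = 0.0, []
    for k in range(steps):
        noise = sigma / np.sqrt(batch) * rng.normal()   # mini-batch gradient noise
        w = w - eta * (lam * w + noise)                 # one SGD step
        if k > steps // 2:                              # discard burn-in
            samples.append(w)
    return np.var(samples)

v1 = stationary_var(eta=0.01, batch=10)
v2 = stationary_var(eta=0.02, batch=20, seed=1)   # scaled pair, same eta/|B|
t_eff = 0.01 * 1.0 / (2 * 10)                     # eta*sigma^2 / (2|B|)
print(v1, v2, t_eff)   # both variances near T_eff / lam = 5e-4
```

Pushing `eta` toward $1/\lambda$ in this toy makes the discrete stationary variance $\eta\sigma^2 / (|B|\,\lambda\,(2 - \eta\lambda))$ drift away from the SDE prediction, a miniature of the large-batch breakdown described above.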

Worked Example

Quadratic loss $L(W) = \tfrac{1}{2} W^\top H W$ with $H \succ 0$, and isotropic constant noise $\Sigma = \sigma^2 I$. The SGD-SDE is the Ornstein–Uhlenbeck process

$$dW_t = -H W_t\,dt + \sqrt{\tfrac{\eta\sigma^2}{|B|}}\;dB_t.$$

Solve in closed form: $W_t = e^{-Ht} W_0 + \sqrt{\eta\sigma^2/|B|} \int_0^t e^{-H(t-s)}\,dB_s$. The mean $e^{-Ht} W_0 \to 0$. Stationary covariance: solve the Lyapunov equation $H \Sigma_\infty + \Sigma_\infty H = (\eta\sigma^2/|B|)\, I$, giving $\Sigma_\infty = T_{\mathrm{eff}}\, H^{-1}$ with $T_{\mathrm{eff}} = \eta\sigma^2/(2|B|)$. The stationary distribution is

$$W_\infty \sim \mathcal{N}\!\bigl(0,\; T_{\mathrm{eff}}\, H^{-1}\bigr).$$

In the eigenbasis of $H$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d > 0$, the stationary variance along the $i$-th eigendirection is $T_{\mathrm{eff}} / \lambda_i$. SGD spreads most in the directions where $L$ is flattest (small $\lambda_i$) and concentrates in the sharpest directions (large $\lambda_i$). This is the precise content of the heuristic that "SGD spends most time in flat regions of the loss."
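The Lyapunov step can be checked numerically for a random positive-definite $H$; the sketch below (illustrative dimensions and parameter values) solves $H S + S H = (\eta\sigma^2/|B|)\, I$ with SciPy and confirms $S = T_{\mathrm{eff}}\, H^{-1}$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Worked-example check: the stationary covariance of the OU process
# dW = -H W dt + sqrt(eta*sigma^2/|B|) dB solves H S + S H = (eta*sigma^2/|B|) I,
# and the solution is T_eff * H^{-1} with T_eff = eta*sigma^2 / (2|B|).
rng = np.random.default_rng(0)
d, eta, sigma2, B = 4, 0.01, 1.0, 10
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)                  # random symmetric positive definite
D = (eta * sigma2 / B) * np.eye(d)           # diffusion matrix

S = solve_continuous_lyapunov(H, D)          # solves H S + S H = D
T_eff = eta * sigma2 / (2 * B)
print(np.allclose(S, T_eff * np.linalg.inv(H)))   # True
```

Diagonalizing `S` in the eigenbasis of `H` recovers the per-direction variances $T_{\mathrm{eff}}/\lambda_i$ quoted above.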

The flat-minima generalization story (Hochreiter and Schmidhuber 1997; Keskar et al. ICLR 2017) reads this as a mechanism: if generalization correlates with flatness (measured, for example, by the trace of $H$ near the solution), then the SGD-SDE biases toward flatter minima and hence toward better-generalizing minima. The argument is suggestive, not a theorem in non-convex deep nets; flatness measures are reparametrization-dependent (Dinh, Pascanu, Bengio, and Bengio, ICML 2017), and the SDE limit captures only one mechanism among several candidates for SGD's implicit bias.

Common Confusions

Watch Out

The Gaussian-noise assumption can fail

The SDE derivation invokes a central-limit-type approximation: the mini-batch gradient noise, summed over many small steps, looks Gaussian. This is fine when per-example gradient norms have finite variance and the network is not too deep. Şimşekli, Sagun, and Gürbüzbalaban (ICML 2019) measured per-example gradient norms in deep networks and reported heavy-tailed distributions whose tail index $\alpha$ falls below 2, violating the finite-variance hypothesis. Under their measurements the right continuous limit is an $\alpha$-stable Lévy SDE, whose escape times from local minima scale polynomially rather than exponentially in the barrier height. The Brownian-SDE picture and the Lévy-SDE picture make qualitatively different predictions about saddle escape; which one applies in any given training run is an empirical question.

Watch Out

'SGD prefers flat minima' is a heuristic, not a theorem

The OU calculation shows that on a quadratic, the SGD-SDE stationary covariance is largest along flat eigendirections. Generalizing this to non-convex deep nets requires several steps that are not airtight: the loss is not quadratic; the noise covariance is anisotropic and depends on $W$; flatness measures are not reparametrization-invariant; and the SDE limit only approximates the discrete dynamics. Implicit bias of SGD is an active area with multiple competing explanations: the SDE stationary distribution, the modified-equation drift correction, label noise, edge-of-stability dynamics, and weight-decay interactions are all candidates and none subsumes the others. Treat the SDE flat-minima argument as one mechanism, not the explanation.

Watch Out

The SDE treats noise as Markov; real SGD has correlation structure

The SDE assumes the noise increments at distinct continuous times are independent, by construction of Brownian motion. Real SGD draws batches without replacement within each epoch; the per-step noises are negatively correlated across an epoch and the cumulative noise over an epoch is exactly zero (every example contributes once). This violates the Markov assumption at the epoch scale. Smith, Elsen, and De (ICML 2020) study the epoch-level effect; for typical training horizons the deviation from the Brownian SDE is small but not zero, and the mismatch matters when reasoning about variance of long-time averages. With-replacement sampling restores the Markov assumption at the cost of slower per-epoch progress.
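The exact-cancellation claim is a two-line calculation, but it is also easy to check directly. The sketch below (stand-in per-example gradients, illustrative sizes) partitions one epoch into without-replacement batches and verifies that the noise terms $\xi_k = \nabla L_{B_k} - \nabla L$ sum to exactly zero over the epoch.

```python
import numpy as np

# Without-replacement check: every example appears exactly once per epoch,
# so the sum of batch means equals (N/|B|) times the full gradient and the
# mini-batch noises cancel exactly over the epoch.
rng = np.random.default_rng(0)
N, B = 12, 3                                  # 4 batches per epoch
per_example = rng.normal(size=(N, 2))         # stand-in per-example gradients
full = per_example.mean(axis=0)               # full-batch gradient

perm = rng.permutation(N)                     # one epoch's shuffle
noises = [per_example[perm[i:i + B]].mean(axis=0) - full
          for i in range(0, N, B)]
print(np.allclose(np.sum(noises, axis=0), 0.0))   # True
```

Repeating the experiment with `rng.choice(N, size=B, replace=True)` batches gives a nonzero epoch sum, which is the with-replacement regime the Brownian SDE actually models.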

Exercises

ExerciseCore

Problem

Take $L(W) = \tfrac{1}{2} W^\top H W$ with $H = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_i > 0$, and isotropic noise $\Sigma = \sigma^2 I$. Write down the SGD-SDE, solve the Lyapunov equation for the stationary covariance, and verify that the stationary variance along the $i$-th eigendirection is $T_{\mathrm{eff}}/\lambda_i$ with $T_{\mathrm{eff}} = \eta\sigma^2/(2|B|)$.

ExerciseAdvanced

Problem

Derive the modified equation for SGD: show that to first order in $\eta$, the SGD iterate $W_k$ tracks not the gradient flow $\dot W = -\nabla L$ but a corrected SDE with drift $-\nabla \tilde L(W) = -\nabla L(W) - \tfrac{\eta}{4} \nabla \lVert \nabla L(W) \rVert^2$, plus the same diffusion as before. Conclude that SGD has an implicit gradient-norm regularizer of strength $\eta/4$.

Next Topics

  • Langevin Dynamics: the SDE that the SGD-SDE reduces to under isotropic noise; the source of the Gibbs stationary distribution and the Bayesian reading.
  • Fokker–Planck Equation: the PDE governing the time-evolving density of the SGD-SDE, where stationary distributions and mixing rates live.
  • Stochastic Gradient Descent Convergence: the discrete-iteration convergence theory the SDE limit complements; non-asymptotic rates without the small-step approximation.
  • Implicit Bias and Modern Generalization: the broader question of which minima SGD selects, of which the SDE flat-minima story is one mechanism among several.
  • Stochastic Differential Equations: the parent SDE framework, including Euler–Maruyama (which the SGD update mirrors) and weak-vs-strong approximation theory.

Last reviewed: April 18, 2026
