Optimization
SGD as a Stochastic Differential Equation
The continuous-time SDE limit of mini-batch SGD: $dW_t = -\nabla L\,dt + \sqrt{\eta\,\Sigma/B}\,dB_t$. Order-1 weak approximation (Li–Tai–E), Mandt–Hoffman–Blei stationary distribution, Bayesian interpretation, the linear scaling rule for batch size, and the modified-equation correction that exposes SGD's implicit gradient-norm regularizer.
Why This Matters
Stochastic gradient descent is a discrete iteration. Its trajectory in weight space is a sequence of jumps, one per mini-batch, of size proportional to the learning rate $\eta$. In the small-$\eta$ regime those jumps look, on the right time scale, like the increments of a stochastic differential equation whose drift is the population gradient $-\nabla L$ and whose diffusion matrix is set by the mini-batch gradient covariance $\Sigma/B$. This is the SDE limit of SGD, formalized by Mandt, Hoffman, and Blei (2017) and Li, Tai, and E (2017, full version in JMLR 2019).
The reason to take this limit seriously is that it gives a clean analytical handle on questions the discrete recursion answers awkwardly. Stationary distribution: under constant step size, what does SGD asymptotically sample from? Batch-size and learning-rate scaling: what combinations of $(\eta, B)$ leave the trajectory geometry invariant? Bayesian interpretation: in what regime does SGD behave like an approximate posterior sampler, and when does it not? Implicit bias: what continuous-time vector field does SGD secretly follow that gradient flow does not? All four questions become tractable once SGD is reframed as an SDE with a Fokker–Planck dual.
The framing has limits. The approximation is order-1 in $\eta$ on a fixed time interval; at the step sizes used in deep-learning training ($\eta \sim 10^{-1}$ for ResNet, larger still for transformers), the approximation can break down and higher-order corrections matter (Yaida 2019). The Gaussian noise assumption fails when gradient noise is heavy-tailed (Şimşekli et al. 2019), in which case the correct continuous limit is a Lévy SDE, not a Brownian one. Despite these caveats, the SDE view is the standard analytical tool for SGD theory and the bridge to Langevin dynamics, score-based sampling, and flat-minima generalization arguments.
Mental Model
Write the SGD update as
$$w_{k+1} = w_k - \eta\,\hat g_B(w_k),$$
where $\hat g_B(w_k)$ is a mini-batch gradient on a uniformly sampled batch of size $B$ from the training set. Decompose the mini-batch gradient into its mean and a zero-mean fluctuation:
$$\hat g_B(w) = \nabla L(w) + \xi(w), \qquad \mathbb{E}[\xi(w)] = 0,$$
with $\mathrm{Cov}(\xi(w)) = \Sigma(w)/B$ for per-example covariance $\Sigma(w)$. The recursion becomes $w_{k+1} = w_k - \eta\,\nabla L(w_k) - \eta\,\xi_k$. The deterministic part scales as $\eta$; the noise part has variance scaling as $\eta^2\,\Sigma/B$, so its standard deviation scales as $\eta$.
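The $\Sigma/B$ covariance scaling of the mini-batch gradient can be checked numerically. A minimal NumPy sketch with synthetic per-example gradients (the dataset size, dimension, and batch size here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 20_000, 3, 16          # dataset size, dimension, batch size (illustrative)

# Synthetic per-example gradients with a fixed mean and covariance Sigma.
g = rng.normal(size=(n, d)) @ np.diag([2.0, 1.0, 0.5]) + 1.0
g_bar = g.mean(axis=0)
Sigma = np.cov(g, rowvar=False)  # per-example covariance

# Empirical covariance of the mini-batch gradient over many sampled batches.
batch_means = np.array([
    g[rng.choice(n, size=B, replace=False)].mean(axis=0)
    for _ in range(5_000)
])
emp_cov = np.cov(batch_means, rowvar=False)

# Mini-batch gradient covariance should be close to Sigma / B
# (up to a finite-population correction, negligible when B << n).
print(np.linalg.norm(emp_cov - Sigma / B) / np.linalg.norm(Sigma / B))
```

The relative Frobenius error printed at the end is a few percent, dominated by the Monte Carlo estimation of the covariance rather than by any deviation from the $1/B$ law.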
Rescale time so that one SGD step corresponds to $\eta$ units of continuous time, $t = k\eta$. Over a continuous interval $\Delta t$ (containing $\Delta t/\eta$ steps) the deterministic increment is $-\nabla L\,\Delta t$ and the stochastic increment has variance $(\Delta t/\eta)\cdot\eta^2\,\Sigma/B$, which we rewrite as $(\eta\,\Sigma/B)\,\Delta t$. Matching this to a Brownian increment with covariance $(\eta\,\Sigma/B)\,\Delta t$ gives the SDE
$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\frac{\eta}{B}}\;\Sigma(W_t)^{1/2}\,dB_t.$$
The factor $\sqrt{\eta}$ in the diffusion is the signature of the SDE limit: in the small-step regime, the per-step noise has standard deviation $O(\eta)$, but the cumulative noise over a unit continuous-time interval (which contains $1/\eta$ steps) has standard deviation $O(\sqrt{\eta})$. This is the same scaling that produces Brownian motion as a continuum limit of random walks.
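The $\sqrt{\eta}$ scaling can be seen directly: summing $1/\eta$ per-step noises of standard deviation $O(\eta)$ gives cumulative variance $\eta\,\sigma^2/B$, independently of the shape of the per-step noise (uniform here, to make the central-limit point). A sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_over_B = 4.0                       # Var(xi) = sigma^2/B, illustrative value
samples = 100_000

for eta in (0.1, 0.01):
    steps = int(1.0 / eta)                # one unit of continuous time = 1/eta steps
    # Per-step noise eta * xi_k with non-Gaussian (uniform) xi, Var(xi) = sigma2_over_B.
    half_width = np.sqrt(3 * sigma2_over_B)
    xi = rng.uniform(-half_width, half_width, size=(samples, steps))
    cumulative = (eta * xi).sum(axis=1)   # total noise accumulated over unit time
    # Theory: Var = steps * eta^2 * Var(xi) = eta * sigma2_over_B, so std = O(sqrt(eta)).
    print(eta, cumulative.var(), eta * sigma2_over_B)
```

Each printed empirical variance matches $\eta\,\sigma^2/B$ to within Monte Carlo error, and the histogram of `cumulative` is approximately Gaussian even though each per-step noise is uniform.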
Formal Statement
SGD-SDE Approximation
Let $L:\mathbb{R}^d \to \mathbb{R}$ be a population loss with mini-batch gradients $\hat g_B$ satisfying $\mathbb{E}[\hat g_B(w)] = \nabla L(w)$ and $\mathrm{Cov}(\hat g_B(w)) = \Sigma(w)/B$ for batches of size $B$. The SGD-SDE approximation is the Itô SDE
$$dW_t = -\nabla L(W_t)\,dt + \sqrt{\frac{\eta}{B}}\;\Sigma(W_t)^{1/2}\,dB_t,$$
where $\Sigma(W_t)^{1/2}$ is a matrix square root of $\Sigma(W_t)$ and $B_t$ is a standard $d$-dimensional Brownian motion. The time rescaling is $t = k\eta$: one SGD step corresponds to $\eta$ units of continuous time. The drift is the population gradient; the diffusion is set by the per-example gradient covariance, the batch size, and the learning rate.
The matching of one SGD step to $\eta$ continuous-time units is what makes the diffusion coefficient depend on $\eta$. A naive reading that "the noise vanishes as $\eta \to 0$" is wrong on the natural time scale. On a fixed number of steps the noise vanishes; on a fixed continuous time horizon it stays $O(\sqrt{\eta})$.
Li–Tai–E Approximation Theorem
Li–Tai–E Order-1 Weak Approximation
Statement
Under the assumptions above, on any fixed time horizon $T$, the SGD iterates $w_k$ at step $k$ and the solution $W_t$ of the SGD-SDE driven by the same initial condition satisfy
$$\max_{0 \le k \le \lfloor T/\eta \rfloor}\;\bigl|\,\mathbb{E}[\varphi(w_k)] - \mathbb{E}[\varphi(W_{k\eta})]\,\bigr| \;\le\; C\,\eta$$
for every test function $\varphi$ with polynomial growth. The constant $C$ depends on $T$, on derivatives of $\varphi$, and on Lipschitz / boundedness constants of $\nabla L$ and $\Sigma$, but not on $\eta$. This is weak approximation of order 1.
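On a 1-D quadratic both sides of the bound are available in closed form, so the order-1 weak error can be checked exactly: halving $\eta$ should roughly halve the error for the test function $\varphi(w) = w^2$. A sketch (curvature, noise level, and horizon are illustrative):

```python
import numpy as np

lam, sigma2_over_B, w0, T = 1.0, 1.0, 1.0, 1.0   # illustrative constants

def sgd_second_moment(eta):
    """Exact E[w_k^2] for SGD on L(w) = lam*w^2/2 with gradient-noise
    variance sigma2/B per step, after k = T/eta steps."""
    k = int(round(T / eta))
    m = w0**2
    for _ in range(k):
        m = (1 - eta * lam)**2 * m + eta**2 * sigma2_over_B
    return m

def sde_second_moment(eta):
    """Exact E[W_T^2] for the OU limit dW = -lam*W dt + sqrt(eta*sigma2/B) dB."""
    s_inf = eta * sigma2_over_B / (2 * lam)      # stationary variance
    return w0**2 * np.exp(-2 * lam * T) + s_inf * (1 - np.exp(-2 * lam * T))

errs = [abs(sgd_second_moment(eta) - sde_second_moment(eta)) for eta in (0.02, 0.01)]
print(errs[0] / errs[1])   # close to 2, consistent with weak error O(eta)
```

The error ratio is close to 2 (about 1.96 with these constants), the hallmark of a first-order method: each halving of $\eta$ halves the expectation-level discrepancy.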
Intuition
The SGD update and one Itô–Taylor step of the SDE agree in their first two moments to order $\eta^2$ per step. Higher moments differ at order $\eta^3$ and above. A test function probes only the law of the iterate, not its pathwise realization, so first-two-moment matching plus a Grönwall argument over $O(T/\eta)$ steps gives the order-1 bound on the cumulative error in expectation.
Proof Sketch
Taylor-expand $\varphi(w_{k+1})$ to third order in $\eta$. Take conditional expectation over the batch: the linear term gives $-\eta\,\nabla\varphi(w_k)^\top \nabla L(w_k)$; the quadratic term contributes both a Hessian-of-$\varphi$ piece $\tfrac{\eta^2}{2}\,\nabla L^\top \nabla^2\varphi\,\nabla L$ and a noise-covariance piece $\tfrac{\eta^2}{2B}\,\mathrm{tr}\bigl(\nabla^2\varphi\,\Sigma\bigr)$, where the $\eta^2/B$ from $\mathbb{E}[\xi\xi^\top] = \Sigma/B$ matches the diffusion $\eta\,\Sigma/B$ of the SDE under the time rescaling $t = k\eta$. Compare against the Itô–Taylor expansion of $\varphi(W_t)$ on the increment $[k\eta, (k+1)\eta]$; the leading terms agree, and the residual is $O(\eta^2)$ per step, summing over $\lfloor T/\eta \rfloor$ steps to $O(\eta)$. Grönwall gives the bound. This is Theorem 1 of Li, Tai, and E (2017); the JMLR version (2019) gives the order-2 weak approximation when one tracks an extra correction term in the drift.
Why It Matters
Order-1 weak approximation is the formal statement that distributional quantities computed from the SDE (stationary covariances, escape rates from saddles, mixing times, expected losses) are correct to leading order in $\eta$. It justifies replacing the discrete recursion with the continuous-time object for any analysis that ultimately cares about expectations. It does not justify pathwise statements: a single realization of SGD and a single realization of the SDE coupled to the same Brownian motion can be order-$\sqrt{\eta}$ apart in path norm, the same strong-vs-weak gap that appears in Euler–Maruyama analysis. For any fixed horizon $T$ and any test function $\varphi$ with polynomial growth, the SGD iterate $w_k$ after $k \le \lfloor T/\eta \rfloor$ steps and the SDE solution at $t = k\eta$ satisfy $|\mathbb{E}[\varphi(w_k)] - \mathbb{E}[\varphi(W_{k\eta})]| = O(\eta)$: weak approximation order 1 in $\eta$.
Failure Mode
The order-1 bound has a hidden multiplicative constant $C$ set by the Lipschitz and boundedness assumptions. Deep networks at the start of training have loss landscapes where third and fourth derivatives are very large (sharp near-zero curvature, exponential activations), and the constant can be large enough that the approximation is poor at practical step sizes even though the error is asymptotically of order $\eta$. Yaida (ICLR 2019) develops fluctuation–dissipation relations that are exact to higher order and useful for diagnosing when the SDE limit fits empirical SGD trajectories. A second failure mode: when per-example gradient norms are heavy-tailed (infinite variance), the central-limit step in the SDE derivation breaks and the right continuous limit is an $\alpha$-stable Lévy SDE, not a Brownian one (Şimşekli, Sagun, and Gürbüzbalaban, ICML 2019).
Stationary Distribution and Bayesian Interpretation
Under constant $\eta$ and assuming the SDE has a unique invariant measure, the SGD iterate has an approximate stationary distribution. The Fokker–Planck equation associated with the SGD-SDE is
$$\partial_t \rho \;=\; \nabla\!\cdot\!\bigl(\rho\,\nabla L\bigr) \;+\; \frac{\eta}{2B}\,\sum_{i,j}\partial_i \partial_j \bigl(\Sigma_{ij}\,\rho\bigr),$$
where the second term is shorthand for $\frac{\eta}{2B}\sum_{i,j}\partial_i\partial_j\bigl(\Sigma_{ij}(w)\,\rho\bigr)$. The stationary density $\rho_\infty$ solves $\partial_t \rho = 0$. In general this PDE has no closed-form solution because $\Sigma$ depends on $w$ in a non-Gibbs way.
Two simplifying regimes admit closed forms. First, when the noise is isotropic and homogeneous, $\Sigma(w) = \sigma^2 I$, the SDE coincides with overdamped Langevin dynamics on $L$ at inverse temperature $\beta = 2B/(\eta\sigma^2)$. The stationary distribution is the Gibbs measure
$$\rho_\infty(w) \;\propto\; \exp\!\bigl(-\beta\,L(w)\bigr), \qquad \beta = \frac{2B}{\eta\,\sigma^2}.$$
The effective temperature $1/\beta = \eta\sigma^2/(2B)$ scales linearly in $\eta$ and inversely in $B$. SGD at higher learning rate (or smaller batch) samples from a hotter Gibbs measure: more spread-out, more weight on flatter regions, less concentration at the global minimum.
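The Gibbs claim can be verified symbolically: for a generic smooth 1-D loss, the density $\exp(-\beta L)$ with $\beta = 2B/(\eta\sigma^2)$ annihilates the stationary Fokker–Planck operator. A SymPy sketch:

```python
import sympy as sp

x, eta, B, s2 = sp.symbols('x eta B sigma2', positive=True)
L = sp.Function('L')(x)                 # arbitrary smooth 1-D loss

beta = 2 * B / (eta * s2)               # inverse temperature 2B/(eta*sigma^2)
rho = sp.exp(-beta * L)                 # Gibbs density (unnormalized)

# Stationary Fokker-Planck in 1-D with isotropic homogeneous noise:
#   d/dx(rho * L') + (eta*sigma^2/(2B)) * d^2/dx^2 rho = 0
fp = sp.diff(rho * sp.diff(L, x), x) + (eta * s2 / (2 * B)) * sp.diff(rho, x, 2)
print(sp.simplify(fp))                  # -> 0
```

The cancellation works because the drift term produces $\rho\,(L'' - \beta L'^2)$ and the diffusion term produces exactly its negative; no property of $L$ beyond smoothness is used.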
Second, when the loss is quadratic and the noise is constant, the SDE is an Ornstein–Uhlenbeck process and the stationary distribution is explicitly Gaussian (worked out below).
The Bayesian reading, due to Mandt, Hoffman, and Blei (2017): if $L$ is a Bayesian negative log posterior, then SGD with isotropic noise and tuned $\eta$ asymptotically samples from a tempered posterior $\propto p(w \mid \mathcal{D})^{\beta}$. At $\beta = 1$ this is the true posterior; at $\beta > 1$ it is a sharpened, more MAP-like distribution. Mandt, Hoffman, and Blei use this to repurpose constant-step SGD as an approximate posterior sampler, calibrating $\eta$ so that $\beta = 2B/(\eta\sigma^2)$ matches the desired tempering. The full method requires a preconditioner that aligns the noise covariance with the inverse Fisher; without it, $\Sigma$ is anisotropic and the stationary distribution is no longer Gibbs.
Linear Scaling Rule
Read the SDE again: $dW_t = -\nabla L(W_t)\,dt + \sqrt{\eta\,\Sigma(W_t)/B}\,dB_t$. The drift is independent of $\eta$ and $B$. The diffusion depends only on the combination $\eta/B$. So holding $\eta/B$ constant leaves the SDE, and therefore the trajectory geometry on any fixed continuous-time interval, invariant.
This is the linear scaling rule for batch size: when you increase the batch size by a factor $\kappa$, increase the learning rate by the same factor to preserve the SDE. Goyal et al. (arXiv 1706.02677, 2017) demonstrated empirically that the rule holds for ImageNet ResNet-50 training up to batch size 8192, allowing one-hour training without loss of generalization. Smith and Le (ICLR 2018) gave the SDE-based derivation. Hoffer, Hubara, and Soudry (NeurIPS 2017) noted a refinement: to match generalization, train longer at large batch (or more precisely, match the number of SDE-time-units, not the number of epochs), because the SDE-time per epoch is $\eta N / B$, where $N$ is the dataset size.
The rule breaks at very large batch (empirically a few thousand for typical vision models, larger for transformers). Three mechanisms contribute: the SDE approximation needs $\eta$ small, but linear scaling drives $\eta$ up; the noise covariance shrinks as $1/B$ until other sources of randomness (data ordering, augmentation) dominate; and at small noise the system is close to deterministic gradient descent and loses the implicit-regularization benefit of the noise. McCandlish et al. (2018) introduced the "critical batch size" framework that quantifies where this transition occurs in terms of the gradient covariance.
Worked Example
Quadratic loss $L(w) = \tfrac12 w^\top H w$ with $H \succ 0$, and isotropic constant noise $\Sigma = \sigma^2 I$. The SGD-SDE is the Ornstein–Uhlenbeck process
$$dW_t = -H\,W_t\,dt + \sqrt{\frac{\eta\,\sigma^2}{B}}\;dB_t.$$
Solve in closed form: $W_t = e^{-Ht}W_0 + \sqrt{\eta\sigma^2/B}\int_0^t e^{-H(t-s)}\,dB_s$. Mean $\mathbb{E}[W_t] = e^{-Ht}W_0 \to 0$. Stationary covariance: solve the Lyapunov equation $H S + S H = \frac{\eta\sigma^2}{B}\,I$, giving $S = c\,H^{-1}$ with $c = \frac{\eta\sigma^2}{2B}$. The stationary distribution is
$$\rho_\infty = \mathcal{N}\!\Bigl(0,\; \frac{\eta\,\sigma^2}{2B}\,H^{-1}\Bigr).$$
In the eigenbasis of $H$ with eigenvalues $\lambda_1, \dots, \lambda_d$, the stationary variance along the $i$-th eigendirection is $\frac{\eta\sigma^2}{2B\lambda_i}$. SGD spreads most in the directions where $L$ is flattest (small $\lambda_i$) and concentrates in the sharpest directions (large $\lambda_i$). This is the precise content of the heuristic that "SGD spends most time in flat regions of the loss."
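The eigendirection-variance formula can be checked by simulating the discrete SGD recursion (equivalently, Euler–Maruyama for the OU process) on a two-dimensional quadratic with one sharp and one flat direction; constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
eta, B, sigma2 = 0.01, 8, 1.0
lams = np.array([4.0, 0.25])               # sharp and flat eigendirections of H

steps, burn = 400_000, 50_000
xi = rng.normal(scale=np.sqrt(sigma2 / B), size=(steps, 2))
w = np.zeros(2)
traj = np.empty((steps, 2))
for t in range(steps):
    w = w - eta * (lams * w + xi[t])       # SGD step in the eigenbasis of H
    traj[t] = w
var = traj[burn:].var(axis=0)
print(var, eta * sigma2 / (2 * B * lams))  # flat direction spreads ~16x more
```

With these eigenvalues the variance ratio between the flat and sharp directions is close to $\lambda_1/\lambda_2 = 16$, matching $\eta\sigma^2/(2B\lambda_i)$ up to Monte Carlo error and the $O(\eta\lambda_i)$ discretization correction.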
The flat-minima generalization story (Hochreiter and Schmidhuber 1997; Keskar et al. ICLR 2017) reads this as a mechanism: if generalization correlates with flatness (measured, for example, by the trace of $\nabla^2 L$ near the solution), then the SGD-SDE biases toward flatter minima and hence toward better-generalizing minima. The argument is suggestive, not a theorem in non-convex deep nets; flatness measures are reparametrization-dependent (Dinh, Pascanu, Bengio, and Bengio, ICML 2017), and the SDE limit captures only one mechanism among several candidates for SGD's implicit bias.
Common Confusions
The Gaussian-noise assumption can fail
The SDE derivation invokes a central-limit-type approximation: the mini-batch gradient noise, summed over many small steps, looks Gaussian. This is fine when per-example gradient norms have finite variance and the network is not too deep. Şimşekli, Sagun, and Gürbüzbalaban (ICML 2019) measured per-example gradient norms in deep networks and reported heavy-tailed distributions whose tail index $\alpha$ falls below 2, violating the finite-variance hypothesis. Under their measurements the right continuous limit is an $\alpha$-stable Lévy SDE, whose escape times from local minima scale polynomially rather than exponentially in the barrier height. The Brownian-SDE picture and the Lévy-SDE picture make qualitatively different predictions about saddle escape; which one applies in any given training run is an empirical question.
"SGD prefers flat minima" is a heuristic, not a theorem
The OU calculation shows that on a quadratic, the SGD-SDE stationary covariance is largest along flat eigendirections. Generalizing this to non-convex deep nets requires several steps that are not airtight: the loss is not quadratic; the noise covariance is anisotropic and depends on $w$; flatness measures are not reparametrization-invariant; and the SDE limit only approximates the discrete dynamics. Implicit bias of SGD is an active area with multiple competing explanations: the SDE stationary distribution, the modified-equation drift correction, label noise, edge-of-stability dynamics, and weight-decay interactions are all candidates and none subsumes the others. Treat the SDE flat-minima argument as one mechanism, not the explanation.
The SDE treats noise as Markov; real SGD has correlation structure
The SDE assumes the noise increments at distinct continuous times are independent, by construction of Brownian motion. Real SGD draws batches without replacement within each epoch; the per-step noises are negatively correlated across an epoch and the cumulative noise over an epoch is exactly zero (every example contributes once). This violates the Markov assumption at the epoch scale. Smith, Elsen, and De (ICML 2020) study the epoch-level effect; for typical training horizons the deviation from the Brownian SDE is small but not zero, and the mismatch matters when reasoning about variance of long-time averages. With-replacement sampling restores the Markov assumption at the cost of slower per-epoch progress.
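The exact epoch-level cancellation under without-replacement sampling is easy to see numerically: over one shuffled epoch, the per-batch noises sum to zero to machine precision. A sketch with fixed synthetic per-example gradients (scalar for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 120, 12                       # dataset size and batch size (B divides n)
g = rng.normal(size=n)               # fixed per-example gradients, illustrative
g_bar = g.mean()

# One epoch of without-replacement sampling: shuffle once, then take disjoint batches.
perm = rng.permutation(n)
noises = [g[perm[i:i + B]].mean() - g_bar for i in range(0, n, B)]

print(sum(noises))                   # exactly 0, up to float rounding
```

The sum is zero because every example appears in exactly one batch per epoch, so the batch means average to the full-data mean; with-replacement sampling would instead give a sum with variance of order $\sigma^2/B$ per batch.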
Exercises
Problem
Take $L(w) = \tfrac12 w^\top H w$ with $H = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_i > 0$, and isotropic noise $\Sigma = \sigma^2 I$. Write down the SGD-SDE, solve the Lyapunov equation for the stationary covariance, and verify that the stationary variance along the $i$-th eigendirection is $c/\lambda_i$ with $c = \eta\sigma^2/(2B)$.
Problem
Derive the modified equation for SGD: show that to first order in $\eta$, the SGD iterate tracks not the gradient flow $\dot w = -\nabla L(w)$ but a corrected SDE with drift $-\nabla\bigl(L + \tfrac{\eta}{4}\|\nabla L\|^2\bigr)$, plus the same diffusion as before. Conclude that SGD has an implicit gradient-norm regularizer of strength $\eta/4$.
Next Topics
- Langevin Dynamics: the SDE that the SGD-SDE reduces to under isotropic noise; the source of the Gibbs stationary distribution and the Bayesian reading.
- Fokker–Planck Equation: the PDE governing the time-evolving density of the SGD-SDE, where stationary distributions and mixing rates live.
- Stochastic Gradient Descent Convergence: the discrete-iteration convergence theory the SDE limit complements; non-asymptotic rates without the small-step approximation.
- Implicit Bias and Modern Generalization: the broader question of which minima SGD selects, of which the SDE flat-minima story is one mechanism among several.
- Stochastic Differential Equations: the parent SDE framework, including Euler–Maruyama (which the SGD update mirrors) and weak-vs-strong approximation theory.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Stochastic Differential Equations (Layer 3)
- Brownian Motion (Layer 2)
- Measure-Theoretic Probability (Layer 0B)
- Martingale Theory (Layer 0B)
- Ito's Lemma (Layer 3)
- Stochastic Calculus for ML (Layer 3)
- Stochastic Gradient Descent Convergence (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Fokker–Planck Equation (Layer 3)
- PDE Fundamentals for Machine Learning (Layer 1)
- Fast Fourier Transform (Layer 1)
- Exponential Function Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Functional Analysis Core (Layer 0B)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Inner Product Spaces and Orthogonality (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)