Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Mathematical Infrastructure

Stochastic Calculus for ML

Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.

Advanced · Tier 3 · Stable · ~65 min

Why This Matters

Diffusion models (DDPM, score-based models) generate data by reversing a stochastic differential equation. Langevin dynamics, used in MCMC sampling and score matching, is a specific SDE. Understanding these models requires stochastic calculus: the extension of ordinary calculus to processes driven by Brownian motion.

The central surprise of stochastic calculus is that the chain rule changes. When you apply a smooth function to a process driven by Brownian motion, you pick up an extra second-order term that does not appear in ordinary calculus. This is Ito's lemma, and it is the single most important formula in this topic.

Brownian Motion

Definition

Standard Brownian Motion

A continuous-time stochastic process $\{W_t\}_{t \geq 0}$ satisfying:

  1. $W_0 = 0$
  2. Independent increments: $W_t - W_s$ is independent of $\{W_u : u \leq s\}$ for $s < t$
  3. Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$
  4. Continuous paths: $t \mapsto W_t$ is continuous almost surely

Key properties that make Brownian motion different from smooth functions:

  • Nowhere differentiable: $W_t$ is continuous but has no derivative at any point, almost surely. This is why Riemann-Stieltjes integration fails.
  • Quadratic variation: $\sum_{i} (W_{t_{i+1}} - W_{t_i})^2 \to t$ as the partition becomes finer. For smooth functions, quadratic variation is zero. This nonzero quadratic variation is the source of the extra term in Ito's lemma.
  • Scaling: $\{c^{-1/2} W_{ct}\}_{t \geq 0}$ is also a standard Brownian motion. The $\sqrt{t}$ scaling of increments (not linear in $t$) is characteristic.
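Both the quadratic-variation and scaling properties can be checked numerically. A minimal NumPy sketch (the path count and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, paths = 1.0, 2_000, 500
dt = T / n

# Brownian increments: independent N(0, dt)
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.cumsum(dW, axis=1)

# Quadratic variation: sum of squared increments over [0, T] -> T
qv = (dW ** 2).sum(axis=1)
print("mean quadratic variation:", qv.mean())  # close to T = 1

# sqrt(t) scaling: Var(W_T) = T (not T^2, as linear scaling would give)
print("Var(W_T):", W[:, -1].var())  # close to T = 1
```

Refining the partition (larger `n`) tightens the quadratic-variation estimate around $T$, which is exactly the statement above.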

The Ito Integral

Ordinary calculus defines $\int_0^T f(t) \, dg(t)$ when $g$ has bounded variation. Brownian paths have unbounded variation (they wiggle too much), so this definition fails.

Definition

Ito Integral

For an adapted process $f_t$ satisfying $\mathbb{E}\left[\int_0^T f_t^2 \, dt\right] < \infty$, the Ito integral is:

$$\int_0^T f_t \, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i}(W_{t_{i+1}} - W_{t_i})$$

The limit is in $L^2$. The crucial feature: $f$ is evaluated at the left endpoint $t_i$, not the right endpoint or midpoint. This choice makes the integral a martingale but means Ito's calculus differs from Stratonovich's.

The left-endpoint evaluation is not arbitrary. It ensures that the integrand is independent of the increment $W_{t_{i+1}} - W_{t_i}$, which is required for the integral to be a martingale and for the Ito isometry to hold.

Main Theorems

Theorem

Ito Isometry

Statement

$$\mathbb{E}\left[\left(\int_0^T f_t \, dW_t\right)^2\right] = \mathbb{E}\left[\int_0^T f_t^2 \, dt\right]$$

Intuition

The Ito integral converts an $L^2$ function of time into a random variable, and it does so isometrically: the variance of the integral equals the integral of the variance. This is because the cross-terms in the square vanish due to independence of non-overlapping Brownian increments.

Proof Sketch

For simple (step) functions, expand the square of the sum. Cross-terms have the form $f_{t_i} f_{t_j} (W_{t_{i+1}} - W_{t_i})(W_{t_{j+1}} - W_{t_j})$ with $i \neq j$. By independence of increments, these have zero expectation. Only the diagonal terms survive, giving $\sum_i \mathbb{E}[f_{t_i}^2] \cdot (t_{i+1} - t_i)$. Extend by density to general integrands.

Why It Matters

The Ito isometry is the tool for computing variances of stochastic integrals. It also provides the $L^2$ framework needed to define the integral for general integrands by approximation with simple processes.

Failure Mode

The isometry fails if $f_t$ is not adapted (i.e., it looks into the future). It also fails for the Stratonovich integral, where the integrand is evaluated at the midpoint rather than the left endpoint.
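The isometry can be checked by Monte Carlo. A sketch for the integrand $f_t = W_t$ on $[0, 1]$, where both sides equal $T^2/2 = 0.5$ (sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, paths = 1.0, 1_000, 20_000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((paths, 1)), W[:, :-1]])  # left endpoints W_{t_i}

# Left-endpoint (Ito) sum approximating the integral of W dW
ito = (W_left * dW).sum(axis=1)

lhs = (ito ** 2).mean()                        # E[(int W dW)^2]
rhs = ((W_left ** 2) * dt).sum(axis=1).mean()  # E[int W^2 dt]
print(lhs, rhs)  # both close to T^2 / 2 = 0.5
```

Evaluating `W` at the right endpoint instead of `W_left` would break the agreement, which is the non-adapted failure mode described above.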

Theorem

Ito's Lemma

Statement

If $X_t$ satisfies $dX_t = \mu_t \, dt + \sigma_t \, dW_t$ and $f \in C^2$, then:

$$df(X_t) = f'(X_t) \, dX_t + \frac{1}{2} f''(X_t) \sigma_t^2 \, dt$$

Equivalently:

$$df(X_t) = \left[\mu_t f'(X_t) + \frac{1}{2} \sigma_t^2 f''(X_t)\right] dt + \sigma_t f'(X_t) \, dW_t$$

Intuition

This is the chain rule for stochastic processes. In ordinary calculus, $df(x) = f'(x) \, dx$ and we stop at first order because $(dx)^2$ is negligible. For Brownian motion, $(dW_t)^2 = dt$ (heuristically), so the second-order Taylor term $\frac{1}{2} f''(X_t)(dX_t)^2$ contributes a non-negligible $dt$ term. This is the extra correction.

Proof Sketch

Apply a second-order Taylor expansion: $f(X_{t+\Delta t}) - f(X_t) \approx f'(X_t) \, \Delta X + \frac{1}{2} f''(X_t) (\Delta X)^2$. Compute $(\Delta X)^2 = (\mu \Delta t + \sigma \Delta W)^2$. The $(\Delta W)^2$ term converges to $\sigma^2 \, dt$ as the partition refines (quadratic variation of Brownian motion). The $\Delta t \cdot \Delta W$ and $(\Delta t)^2$ terms vanish.

Why It Matters

Ito's lemma is the computational workhorse of stochastic calculus. Every derivation involving SDEs uses it: computing the dynamics of transformed processes, deriving the Fokker-Planck equation, proving properties of diffusion models. You will use this constantly.

Failure Mode

If $f$ is not $C^2$, the lemma does not apply in its standard form (though generalizations exist via Tanaka's formula). If the underlying process is not an Ito process (e.g., it has jumps), you need the jump-diffusion version of Ito's lemma.
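Ito's lemma can be verified pathwise for $f(x) = x^2$ with $X_t = W_t$ (so $\mu = 0$, $\sigma = 1$): the lemma gives $W_T^2 = 2\int_0^T W_t \, dW_t + T$, and the naive chain rule misses the $T$. A sketch on a single simulated path:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 200_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Ito's lemma for f(x) = x^2: d(W^2) = 2 W dW + dt
chain_rule_only = 2.0 * np.sum(W[:-1] * dW)  # 2 * int W dW (left endpoint)
with_correction = chain_rule_only + T        # add int (1/2) f'' sigma^2 dt = T

print(W[-1] ** 2, with_correction)           # agree pathwise
print(W[-1] ** 2 - chain_rule_only)          # close to T: the missing Ito term
```

The discrepancy between $W_T^2$ and the chain-rule-only sum is exactly the accumulated quadratic variation, approximately $T$ on every path.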

Stochastic Differential Equations

Definition

Stochastic Differential Equation

An SDE has the form:

$$dX_t = \mu(X_t, t) \, dt + \sigma(X_t, t) \, dW_t$$

where $\mu$ is the drift (deterministic tendency) and $\sigma$ is the diffusion coefficient (noise intensity). This is shorthand for the integral equation $X_t = X_0 + \int_0^t \mu(X_s, s) \, ds + \int_0^t \sigma(X_s, s) \, dW_s$.
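SDEs are typically simulated with the Euler-Maruyama scheme: replace $dt$ by a step $\Delta t$ and $dW_t$ by $\sqrt{\Delta t}\,\xi$ with $\xi \sim \mathcal{N}(0,1)$. A sketch applied to the Ornstein-Uhlenbeck process (parameter values are arbitrary):

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T, n, rng):
    """Simulate dX = mu(X, t) dt + sigma(X, t) dW with n Euler-Maruyama steps.

    x0: array of initial values, one per sample path.
    """
    dt = T / n
    x = np.array(x0, dtype=float)
    for i in range(n):
        t = i * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + mu(x, t) * dt + sigma(x, t) * dW
    return x

# Ornstein-Uhlenbeck: dX = -theta X dt + sig dW,
# stationary distribution N(0, sig^2 / (2 theta))
rng = np.random.default_rng(3)
theta, sig = 2.0, 1.0
x = euler_maruyama(lambda x, t: -theta * x,
                   lambda x, t: sig * np.ones_like(x),
                   np.zeros(10_000), T=5.0, n=500, rng=rng)
print(x.mean(), x.var())  # close to 0 and sig^2 / (2 theta) = 0.25
```

The scheme converges (strongly, order 1/2) under the same Lipschitz and linear growth conditions that the existence theorem below requires.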

Theorem

Existence and Uniqueness for SDEs

Statement

Under Lipschitz and linear growth conditions on $\mu$ and $\sigma$, the SDE has a unique strong solution $X_t$ that is adapted to the filtration generated by $W_t$ and satisfies $\mathbb{E}[\sup_{0 \leq t \leq T} X_t^2] < \infty$.

Intuition

This is the stochastic analog of Picard-Lindelöf for ODEs. Lipschitz continuity prevents solutions from splitting apart (uniqueness), and linear growth prevents solutions from exploding to infinity in finite time (existence).

Proof Sketch

Picard iteration: define $X_t^{(0)} = X_0$ and $X_t^{(n+1)} = X_0 + \int_0^t \mu(X_s^{(n)}, s) \, ds + \int_0^t \sigma(X_s^{(n)}, s) \, dW_s$. Use the Ito isometry and the Lipschitz condition to show that the iterates form a Cauchy sequence in $L^2$ (with the supremum over $[0, T]$ inside the expectation). Completeness gives convergence to a unique limit.

Why It Matters

This theorem guarantees that the forward and reverse SDEs in diffusion models are well-defined. Without existence and uniqueness, the generative process would not be mathematically sound.

Failure Mode

Many practically important SDEs violate Lipschitz continuity. The CIR process ($dX_t = a(b - X_t) \, dt + \sigma \sqrt{X_t} \, dW_t$) has $\sigma(x) = \sigma\sqrt{x}$, which is not Lipschitz at $x = 0$. In such cases, existence and uniqueness can still be established by other methods, but the standard theorem does not apply directly.

Connections to ML

Diffusion models: the forward process $dX_t = f(t) X_t \, dt + g(t) \, dW_t$ gradually adds noise to data. The reverse-time process (Anderson, 1982) is also an SDE: $dX_t = [f(t) X_t - g(t)^2 \nabla_x \log p_t(X_t)] \, dt + g(t) \, d\bar{W}_t$, where $\nabla_x \log p_t$ is the score function and $\bar{W}_t$ is a reverse-time Brownian motion. The neural network learns to approximate this score.
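The forward process can be seen in a few lines. This sketch assumes a constant noise schedule $\beta$ with the variance-preserving choice $f(t) = -\tfrac{1}{2}\beta$, $g(t) = \sqrt{\beta}$, and uses a toy bimodal "dataset"; all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, T, n, paths = 1.0, 10.0, 1_000, 20_000
dt = T / n

# Toy "data": bimodal mixture, far from Gaussian
x = np.where(rng.random(paths) < 0.5, -2.0, 2.0) + 0.1 * rng.normal(size=paths)

# Forward VP-type SDE: dX = -0.5 beta X dt + sqrt(beta) dW (Euler-Maruyama)
for _ in range(n):
    x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=paths)

print(x.mean(), x.var())  # close to 0 and 1: data is carried to N(0, 1)
```

Whatever the initial distribution, the terminal distribution is approximately $\mathcal{N}(0, 1)$; generation then amounts to simulating the reverse-time SDE with a learned score in place of $\nabla_x \log p_t$.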

Langevin dynamics: the SDE $dX_t = \nabla \log p(X_t) \, dt + \sqrt{2} \, dW_t$ has $p$ as its stationary distribution under mild conditions. Discretizing this SDE gives the Langevin MCMC sampler.
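A minimal sketch of the discretized sampler (unadjusted Langevin), targeting $p = \mathcal{N}(2, 1)$ so the score is available in closed form; step size and chain counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
eps, steps, chains = 0.01, 2_000, 10_000

# Target p = N(2, 1), so grad log p(x) = -(x - 2)
def score(x):
    return -(x - 2.0)

x = rng.normal(size=chains)  # arbitrary initialization
for _ in range(steps):
    # Euler-Maruyama step: x += score * eps + sqrt(2 eps) * noise
    x = x + score(x) * eps + np.sqrt(2 * eps) * rng.normal(size=chains)

print(x.mean(), x.var())  # close to 2 and 1: chains reach the target
```

The finite step size introduces a small bias in the stationary distribution; Metropolis-adjusted Langevin (MALA) removes it with an accept/reject step.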

SGD as SDE: with small learning rate $\eta$, SGD on a loss $L$ approximately follows $dX_t = -\nabla L(X_t) \, dt + \sqrt{\eta \Sigma(X_t)} \, dW_t$, where $\Sigma$ is the minibatch gradient covariance. This SDE viewpoint explains implicit regularization effects.

Common Confusions

Watch Out

Ito vs Stratonovich

The Ito integral evaluates the integrand at the left endpoint; the Stratonovich integral evaluates at the midpoint. They give different results for the same integrand. Ito is standard in probability and finance because Ito integrals are martingales. Stratonovich is common in physics because it preserves the ordinary chain rule. For diffusion models in ML, the Ito convention is standard.

Watch Out

$(dW_t)^2$ is not zero

In ordinary calculus, $(dx)^2 = 0$ because it is second-order. In stochastic calculus, $(dW_t)^2 = dt$ (in the formal sense of quadratic variation). This is why Ito's lemma has an extra term. Forgetting this is the most common error.

Summary

  • Brownian motion has continuous but nowhere differentiable paths
  • Ito integrals use left-endpoint evaluation, making them martingales
  • Ito's lemma: $df = f' \, dX + \frac{1}{2} f'' \sigma^2 \, dt$ (the extra second-order term is the key difference from ordinary calculus)
  • SDEs exist and are unique under Lipschitz and linear growth conditions
  • Diffusion models, Langevin dynamics, and SGD analysis all use SDEs

Exercises

Exercise (Core)

Problem

Let $W_t$ be a standard Brownian motion. Use Ito's lemma to find $d(W_t^2)$. What are the drift and diffusion coefficients of the process $Y_t = W_t^2$?

Exercise (Advanced)

Problem

The Ornstein-Uhlenbeck process satisfies $dX_t = -\theta X_t \, dt + \sigma \, dW_t$ with $\theta > 0$. Use Ito's lemma on $Y_t = X_t e^{\theta t}$ to find the explicit solution for $X_t$.

References

Canonical:

  • Øksendal, Stochastic Differential Equations (6th ed.), Chapters 3-5
  • Karatzas & Shreve, Brownian Motion and Stochastic Calculus, Chapter 3

Current:

  • Song et al., Score-Based Generative Modeling through Stochastic Differential Equations (2021), Section 2
  • Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (2020)

Next Topics

  • Diffusion models: the primary ML application of SDEs and score functions

Last reviewed: April 2026