Mathematical Infrastructure
Stochastic Calculus for ML
Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.
Why This Matters
Diffusion models (DDPM, score-based models) generate data by reversing a stochastic differential equation. Langevin dynamics, used in MCMC sampling and score matching, is a specific SDE. Understanding these models requires stochastic calculus: the extension of ordinary calculus to processes driven by Brownian motion.
The central surprise of stochastic calculus is that the chain rule changes. When you apply a smooth function to a process driven by Brownian motion, you pick up an extra second-order term that does not appear in ordinary calculus. This is Ito's lemma, and it is the single most important formula in this topic.
Brownian Motion
Standard Brownian Motion
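A continuous-time stochastic process $(W_t)_{t \geq 0}$ with $W_0 = 0$ satisfying:
- Independent increments: $W_t - W_s$ is independent of $\{W_u : u \leq s\}$ for $s < t$
- Gaussian increments: $W_t - W_s \sim \mathcal{N}(0, t - s)$ for $s < t$
- Continuous paths: $t \mapsto W_t$ is continuous almost surely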
Key properties that make Brownian motion different from smooth functions:
- Nowhere differentiable: $t \mapsto W_t$ is continuous but has no derivative at any point, almost surely. This is why Riemann-Stieltjes integration fails.
- Quadratic variation: $\sum_i (W_{t_{i+1}} - W_{t_i})^2 \to T$ as the partition of $[0, T]$ becomes finer. For smooth functions, quadratic variation is zero. This nonzero quadratic variation is the source of the extra term in Ito's lemma.
- Scaling: $c^{-1/2} W_{ct}$ is also a standard Brownian motion for any $c > 0$. The $\sqrt{t}$ scaling of increments (not linear in $t$) is characteristic.
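The quadratic variation property can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the seed, and the grid sizes are choices made here for illustration. Each value of `n` draws a fresh path rather than refining a single one, which is enough to see the sum of squared increments settle near $T$ while the sum of absolute increments keeps growing.

```python
import math
import random

def brownian_increments(T, n, rng):
    """Return n independent N(0, T/n) increments of a Brownian path on [0, T]."""
    sd = math.sqrt(T / n)
    return [rng.gauss(0.0, sd) for _ in range(n)]

def quadratic_variation(increments):
    """Sum of squared increments over the partition."""
    return sum(dw * dw for dw in increments)

def total_variation(increments):
    """Sum of absolute increments over the partition."""
    return sum(abs(dw) for dw in increments)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
T = 1.0
for n in (100, 10_000, 100_000):
    dW = brownian_increments(T, n, rng)
    # Quadratic variation should hover near T; total variation grows like sqrt(n).
    print(n, round(quadratic_variation(dW), 3), round(total_variation(dW), 1))
```

The growing second column of output is the numerical face of unbounded variation: the path is too rough for a pathwise Riemann-Stieltjes integral.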
The Ito Integral
Ordinary calculus defines $\int f \, dg$ when $g$ has bounded variation. Brownian paths have unbounded variation (they wiggle too much), so this definition fails.
Ito Integral
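For an adapted process $f_t$ satisfying $\mathbb{E}\left[\int_0^T f_t^2 \, dt\right] < \infty$, the Ito integral is:
$$\int_0^T f_t \, dW_t = \lim_{n \to \infty} \sum_{i=0}^{n-1} f_{t_i} \left(W_{t_{i+1}} - W_{t_i}\right)$$
The limit is in $L^2$, taken over partitions $0 = t_0 < t_1 < \dots < t_n = T$ whose mesh tends to zero. The crucial feature: $f$ is evaluated at the left endpoint $t_i$, not the right endpoint or midpoint. This choice makes the integral a martingale but means Ito's calculus differs from Stratonovich's.
The left-endpoint evaluation is not arbitrary. It ensures that the integrand $f_{t_i}$ is independent of the increment $W_{t_{i+1}} - W_{t_i}$, which is required for the integral to be a martingale and for the Ito isometry to hold.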
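To see that the evaluation point changes the answer, the sketch below approximates the integral of $W$ against $dW$ on one simulated path in two ways. The helper names and the seed are assumptions for this example, and the Stratonovich-style sum uses the average of the two endpoint values, a common stand-in for midpoint evaluation. The left-endpoint sum should land near $(W_T^2 - T)/2$, while the averaged sum telescopes to $W_T^2 / 2$.

```python
import math
import random

def simulate_path(T, n, rng):
    """Brownian path W[0..n] on a uniform grid of [0, T], with W[0] = 0."""
    sd = math.sqrt(T / n)
    W = [0.0]
    for _ in range(n):
        W.append(W[-1] + rng.gauss(0.0, sd))
    return W

def left_endpoint_sum(W):
    """Ito-style sum: integrand W evaluated at the left endpoint t_i."""
    return sum(W[i] * (W[i + 1] - W[i]) for i in range(len(W) - 1))

def averaged_endpoint_sum(W):
    """Stratonovich-style sum: integrand averaged over both endpoints."""
    return sum(0.5 * (W[i] + W[i + 1]) * (W[i + 1] - W[i])
               for i in range(len(W) - 1))

rng = random.Random(0)  # arbitrary fixed seed
T, n = 1.0, 100_000
W = simulate_path(T, n, rng)
ito = left_endpoint_sum(W)
strat = averaged_endpoint_sum(W)
print("left endpoint:", ito, "vs (W_T^2 - T)/2 =", (W[-1] ** 2 - T) / 2)
print("averaged     :", strat, "vs W_T^2 / 2 =", W[-1] ** 2 / 2)
```

The gap between the two sums is half the sum of squared increments, so it approaches $T/2$ instead of vanishing as the grid is refined.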
Main Theorems
Ito Isometry
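Statement
For an adapted process $f_t$ with $\mathbb{E}\left[\int_0^T f_t^2 \, dt\right] < \infty$:
$$\mathbb{E}\left[\left(\int_0^T f_t \, dW_t\right)^2\right] = \mathbb{E}\left[\int_0^T f_t^2 \, dt\right]$$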
Intuition
The Ito integral maps a square-integrable adapted integrand to a random variable, and it does so isometrically: the second moment of the integral (which is also its variance, since the integral has mean zero) equals the expected integral of the squared integrand. The cross-terms in the square vanish because each Brownian increment has mean zero and is independent of everything that came before it.
Proof Sketch
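For simple (step) processes, expand the square of the sum. Cross-terms have the form $\mathbb{E}[f_{t_i} f_{t_j} \, \Delta W_i \, \Delta W_j]$ with $i < j$, where $\Delta W_i = W_{t_{i+1}} - W_{t_i}$. Because $\Delta W_j$ has mean zero and is independent of $f_{t_i} f_{t_j} \, \Delta W_i$, these have zero expectation. Only the diagonal terms survive, giving $\sum_i \mathbb{E}[f_{t_i}^2] \, \Delta t_i$. Extend by density to general integrands.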
Why It Matters
The Ito isometry is the tool for computing variances of stochastic integrals. It also provides the framework needed to define the integral for general integrands by approximation with simple processes.
Failure Mode
The isometry fails if $f$ is not adapted (i.e., it looks into the future). It also fails for the Stratonovich integral, where the integrand is evaluated at the midpoint rather than the left endpoint.
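A Monte Carlo check makes the isometry concrete. This sketch takes the integrand to be $W_t$ itself (an illustrative choice, as are the path counts and seed), for which the right-hand side is $\int_0^T \mathbb{E}[W_t^2] \, dt = T^2 / 2$. It is a rough numerical sanity check under those assumptions, not a proof.

```python
import math
import random

def ito_sum_W_dW(T, n, rng):
    """Left-endpoint sum approximating the Ito integral of W against dW on [0, T]."""
    sd = math.sqrt(T / n)
    w, total = 0.0, 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, sd)
        total += w * dw  # integrand uses w before the increment is added
        w += dw
    return total

rng = random.Random(0)  # arbitrary fixed seed
T, n_steps, n_paths = 1.0, 100, 4000
samples = [ito_sum_W_dW(T, n_steps, rng) for _ in range(n_paths)]

mean = sum(samples) / n_paths
second_moment = sum(x * x for x in samples) / n_paths
print("sample mean of the integral :", mean)           # martingale: expect roughly 0
print("sample second moment        :", second_moment)  # isometry: expect roughly T^2 / 2
print("integral of E[W_t^2] over t :", T ** 2 / 2)
```

Moving `w += dw` above the `total += w * dw` line turns this into a right-endpoint sum, which is no longer adapted; the sample mean then drifts toward $T$ instead of zero.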
Ito's Lemma
Statement
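If $X_t$ satisfies $dX_t = \mu_t \, dt + \sigma_t \, dW_t$ and $f \in C^2$, then:
$$df(X_t) = f'(X_t) \, dX_t + \frac{1}{2} f''(X_t) \, \sigma_t^2 \, dt$$
Equivalently:
$$df(X_t) = \left(f'(X_t) \, \mu_t + \frac{1}{2} f''(X_t) \, \sigma_t^2\right) dt + f'(X_t) \, \sigma_t \, dW_t$$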
Intuition
This is the chain rule for stochastic processes. In ordinary calculus, $df = f'(x) \, dx$ and we stop at first order because $(dx)^2$ is negligible. For Brownian motion, $(dW_t)^2 = dt$ (heuristically), so the second-order Taylor term contributes a non-negligible $\frac{1}{2} f'' \sigma^2 \, dt$ term. This is the extra correction.
Proof Sketch
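Apply a second-order Taylor expansion: $df \approx f'(X_t) \, dX_t + \frac{1}{2} f''(X_t) \, (dX_t)^2$. Compute $(dX_t)^2 = \mu_t^2 \, (dt)^2 + 2 \mu_t \sigma_t \, dt \, dW_t + \sigma_t^2 \, (dW_t)^2$. The $(dW_t)^2$ term converges to $dt$ as the partition refines (quadratic variation of Brownian motion). The $(dt)^2$ and $dt \, dW_t$ terms vanish.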
Why It Matters
Ito's lemma is the computational workhorse of stochastic calculus. Every derivation involving SDEs uses it: computing the dynamics of transformed processes, deriving the Fokker-Planck equation, proving properties of diffusion models. You will use this constantly.
Failure Mode
If $f$ is not $C^2$, the lemma does not apply in its standard form (though generalizations exist via Tanaka's formula). If the underlying process is not an Ito process (e.g., it has jumps), you need the jump-diffusion version of Ito's lemma.
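A short worked example shows the correction term in action. Take geometric Brownian motion with constant coefficients $\mu$ and $\sigma$ and a positive starting value (this process is introduced here purely as an illustration) and apply the lemma with $f(x) = \log x$, which is $C^2$ on $(0, \infty)$:

```latex
\begin{aligned}
dS_t &= \mu S_t \, dt + \sigma S_t \, dW_t,
  \qquad f(x) = \log x, \quad f'(x) = \frac{1}{x}, \quad f''(x) = -\frac{1}{x^2} \\
d(\log S_t) &= \frac{1}{S_t} \, dS_t
  + \frac{1}{2} \left(-\frac{1}{S_t^2}\right) \sigma^2 S_t^2 \, dt \\
&= \left(\mu - \tfrac{1}{2} \sigma^2\right) dt + \sigma \, dW_t
\end{aligned}
```

Integrating gives $S_t = S_0 \exp\left((\mu - \tfrac{1}{2}\sigma^2) t + \sigma W_t\right)$. The ordinary chain rule would have produced drift $\mu$ for $\log S_t$; the $-\tfrac{1}{2}\sigma^2$ shift is exactly the second-order term.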
Stochastic Differential Equations
Stochastic Differential Equation
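An SDE has the form:
$$dX_t = \mu(X_t, t) \, dt + \sigma(X_t, t) \, dW_t$$
where $\mu$ is the drift (deterministic tendency) and $\sigma$ is the diffusion coefficient (noise intensity). This is shorthand for the integral equation $X_t = X_0 + \int_0^t \mu(X_s, s) \, ds + \int_0^t \sigma(X_s, s) \, dW_s$.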
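In practice an SDE with drift `mu(x, t)` and diffusion coefficient `sigma(x, t)` is usually simulated by discretizing the integral equation. The sketch below uses the Euler-Maruyama scheme, the simplest such discretization; the function name, its signature, and the example coefficients are assumptions made for this illustration.

```python
import math
import random

def euler_maruyama(mu, sigma, x0, T, n, rng):
    """Simulate one path of dX = mu(X, t) dt + sigma(X, t) dW on a uniform grid.

    Returns the list [X_0, X_dt, ..., X_T] of n + 1 values.
    """
    dt = T / n
    sqrt_dt = math.sqrt(dt)
    x, t = x0, 0.0
    path = [x]
    for _ in range(n):
        dw = rng.gauss(0.0, sqrt_dt)  # Brownian increment ~ N(0, dt)
        # Coefficients are evaluated at the left endpoint, matching the Ito convention.
        x = x + mu(x, t) * dt + sigma(x, t) * dw
        t += dt
        path.append(x)
    return path

# Illustrative coefficients: mean-reverting drift with constant noise.
rng = random.Random(0)  # arbitrary fixed seed
path = euler_maruyama(lambda x, t: -x, lambda x, t: 0.3,
                      x0=1.0, T=1.0, n=1000, rng=rng)
print(path[-1])
```

Setting `sigma` to zero reduces the scheme to the forward Euler method for the ODE $dx/dt = \mu(x, t)$, which is a convenient way to sanity-check an implementation.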
Existence and Uniqueness for SDEs
Statement
Under Lipschitz and linear growth conditions on $\mu$ and $\sigma$, the SDE has a unique strong solution $X_t$ that is adapted to the filtration generated by $W_t$ and satisfies $\mathbb{E}\left[\int_0^T X_t^2 \, dt\right] < \infty$.
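For reference, one common way to write the two conditions is shown below. The constants $K$ and $C$ are notation introduced here, and the inequalities are required for all $x, y$ and all $t \in [0, T]$; textbooks differ in minor details such as squaring both sides.

```latex
|\mu(x, t) - \mu(y, t)| + |\sigma(x, t) - \sigma(y, t)| \le K \, |x - y|
  \qquad \text{(Lipschitz)}

|\mu(x, t)| + |\sigma(x, t)| \le C \, (1 + |x|)
  \qquad \text{(linear growth)}
```

As a quick check of what each condition rules out: the drift $\mu(x) = -x$ with constant $\sigma$ satisfies both, while $\mu(x) = x^2$ violates linear growth, and already in the noiseless case $dx/dt = x^2$ the solution blows up in finite time.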
Intuition
This is the stochastic analog of Picard-Lindelof for ODEs. Lipschitz continuity prevents solutions from splitting apart (uniqueness), and linear growth prevents solutions from exploding to infinity in finite time (existence).
Proof Sketch
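Picard iteration: define $X_t^{(0)} = X_0$ and $X_t^{(n+1)} = X_0 + \int_0^t \mu(X_s^{(n)}, s) \, ds + \int_0^t \sigma(X_s^{(n)}, s) \, dW_s$. Use the Ito isometry and the Lipschitz condition to show that the iterates form a Cauchy sequence in $L^2$. Completeness gives convergence to a unique limit.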
Why It Matters
This theorem guarantees that the forward and reverse SDEs in diffusion models are well-defined. Without existence and uniqueness, the generative process would not be mathematically sound.
Failure Mode
Many practically important SDEs violate Lipschitz continuity. The CIR process ($dX_t = \kappa(\theta - X_t) \, dt + \sigma \sqrt{X_t} \, dW_t$) has diffusion coefficient $\sigma \sqrt{x}$, which is not Lipschitz at $x = 0$. In such cases, existence and uniqueness can still be established by other methods, but the standard theorem does not apply directly.
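The square-root coefficient also causes trouble numerically: a plain Euler step can push the state below zero, after which the square root is undefined. The sketch below uses a "full truncation" variant of the Euler scheme, which clips the state at zero inside the coefficients. It is one of several known fixes and is shown as an illustration, not a recommendation; the parameter values are arbitrary.

```python
import math
import random

def cir_full_truncation(kappa, theta, sigma, x0, T, n, rng):
    """Euler scheme for dX = kappa (theta - X) dt + sigma sqrt(X) dW.

    The state is clipped at zero inside the coefficients ("full truncation"),
    so the square root never receives a negative argument.
    """
    dt = T / n
    sqrt_dt = math.sqrt(dt)
    x = x0
    for _ in range(n):
        x_pos = max(x, 0.0)
        dw = rng.gauss(0.0, sqrt_dt)
        x = x + kappa * (theta - x_pos) * dt + sigma * math.sqrt(x_pos) * dw
    return max(x, 0.0)

# Parameter values are arbitrary illustrations, not taken from the text.
rng = random.Random(0)
samples = [cir_full_truncation(2.0, 0.5, 0.8, 0.1, 1.0, 200, rng)
           for _ in range(2000)]
print("sample mean:", sum(samples) / len(samples))
# Closed-form mean of the CIR process: theta + (x0 - theta) * exp(-kappa * T)
print("exact mean :", 0.5 + (0.1 - 0.5) * math.exp(-2.0))
```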
Connections to ML
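Diffusion models: the forward process $dX_t = f(X_t, t) \, dt + g(t) \, dW_t$ gradually adds noise to data. The reverse process (Anderson, 1982) is also an SDE, run backward in time: $dX_t = \left[f(X_t, t) - g(t)^2 \, \nabla_x \log p_t(X_t)\right] dt + g(t) \, d\bar{W}_t$, where $\bar{W}_t$ is a Brownian motion in reverse time and $\nabla_x \log p_t(x)$ is the score function of the marginal density $p_t$. The neural network learns to approximate this score.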
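A one-dimensional toy case shows the reverse process recovering the data distribution. The sketch below assumes Gaussian data $\mathcal{N}(M, S_2)$ and the simplest forward process $dX_t = dW_t$ (zero drift, unit noise), so the marginal at time $t$ is $\mathcal{N}(M, S_2 + t)$ and its score is known in closed form. That exact score stands in for the neural network, and sampling starts from the exact noised marginal rather than a fixed prior; every constant and name here is an assumption for illustration.

```python
import math
import random

M, S2 = 2.0, 0.25  # assumed toy data distribution N(M, S2)
T = 1.0            # forward process dX = dW, so p_t = N(M, S2 + t)

def score(x, t):
    """Exact score of p_t = N(M, S2 + t); stands in for a learned network."""
    return -(x - M) / (S2 + t)

def reverse_sample(n_steps, rng):
    """Integrate the reverse SDE from t = T down to t = 0 with Euler steps."""
    dt = T / n_steps
    x = rng.gauss(M, math.sqrt(S2 + T))  # start from the noised marginal p_T
    t = T
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # Zero drift and unit noise: a backward step of size dt adds score * dt.
        x = x + score(x, t) * dt + math.sqrt(dt) * z
        t -= dt
    return x

rng = random.Random(0)  # arbitrary fixed seed
xs = [reverse_sample(200, rng) for _ in range(2000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print("mean:", mean, "target:", M)
print("var :", var, "target:", S2)
```

The sign of the score term flips relative to the displayed SDE because the loop steps in the negative time direction.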
Langevin dynamics: the SDE $dX_t = \nabla \log p(X_t) \, dt + \sqrt{2} \, dW_t$ has $p$ as its stationary distribution under mild conditions. Discretizing this SDE gives the Langevin MCMC sampler.
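The discretized sampler fits in a few lines. This sketch applies an Euler-Maruyama step to the Langevin SDE with a standard normal target, a variant often called the unadjusted Langevin algorithm because it has no Metropolis correction; the target, step size, chain length, and burn-in are assumptions chosen for illustration.

```python
import math
import random

def grad_log_p(x):
    """Score of the assumed target N(0, 1): d/dx log p(x) = -x."""
    return -x

def langevin_chain(step, n_steps, x0, rng):
    """Unadjusted Langevin algorithm: Euler-Maruyama applied to the Langevin SDE."""
    x = x0
    chain = []
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

rng = random.Random(0)  # arbitrary fixed seed
chain = langevin_chain(step=0.1, n_steps=50_000, x0=5.0, rng=rng)
kept = chain[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print("mean:", mean)
print("var :", var)  # slightly above 1: discretization biases the stationary law
```

For this linear score the bias can be computed by hand: the chain's stationary variance is $1 / (1 - \text{step}/2)$, which tends to the target variance 1 only as the step size goes to zero.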
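SGD as SDE: with small learning rate $\eta$, SGD on a loss $L(\theta)$ approximately follows $d\theta_t = -\nabla L(\theta_t) \, dt + \sqrt{\eta \, \Sigma(\theta_t)} \, dW_t$, where $\Sigma(\theta)$ is the minibatch gradient covariance. This SDE viewpoint explains implicit regularization effects.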
Common Confusions
Ito vs Stratonovich
The Ito integral evaluates the integrand at the left endpoint; the Stratonovich integral evaluates at the midpoint. They give different results for the same integrand. Ito is standard in probability and finance because Ito integrals are martingales. Stratonovich is common in physics because it preserves the ordinary chain rule. For diffusion models in ML, the Ito convention is standard.
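The two conventions are related by an explicit correction term. For a smooth function $f$ of Brownian motion, with $\circ$ denoting the Stratonovich integral (notation introduced here), the conversion reads:

```latex
\int_0^T f(W_t) \circ dW_t
  = \int_0^T f(W_t) \, dW_t + \frac{1}{2} \int_0^T f'(W_t) \, dt
```

Taking $f(x) = x$ gives $\tfrac{1}{2} W_T^2$ on the left and $\tfrac{1}{2}(W_T^2 - T) + \tfrac{1}{2} T$ on the right, so the Stratonovich answer matches what the ordinary chain rule would predict, while the Ito answer carries the $-T/2$ correction.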
dW_t squared is not zero
In ordinary calculus, $(dx)^2 = 0$ because it is second-order. In stochastic calculus, $(dW_t)^2 = dt$ (in the formal sense of quadratic variation). This is why Ito's lemma has an extra term. Forgetting this is the most common error.
Summary
- Brownian motion has continuous but nowhere differentiable paths
- Ito integrals use left-endpoint evaluation, making them martingales
- Ito's lemma: $df = f' \, dX_t + \frac{1}{2} f'' \sigma^2 \, dt$ (the extra second-order term is the key difference from ordinary calculus)
- SDEs have unique strong solutions under Lipschitz and linear growth conditions
- Diffusion models, Langevin dynamics, and SGD analysis all use SDEs
Exercises
Problem
Let $W_t$ be a standard Brownian motion. Use Ito's lemma to find $d(W_t^2)$. What are the drift and diffusion coefficients of the process $W_t^2$?
Problem
The Ornstein-Uhlenbeck process satisfies $dX_t = -\theta X_t \, dt + \sigma \, dW_t$ with $\theta > 0$. Use Ito's lemma on $e^{\theta t} X_t$ to find the explicit solution for $X_t$.
References
Canonical:
- Oksendal, Stochastic Differential Equations (6th ed.), Chapters 3-5
- Karatzas & Shreve, Brownian Motion and Stochastic Calculus, Chapter 3
Current:
- Song et al., Score-Based Generative Modeling through Stochastic Differential Equations (2021), Section 2
- Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (2020)
Next Topics
- Diffusion models: the primary ML application of SDEs and score functions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Martingale Theory (Layer 0B)
- Measure-Theoretic Probability (Layer 0B)
Builds on This
- Ito's Lemma (Layer 3)