Fokker–Planck Equation

The deterministic PDE for the time-evolving density of an SDE. The bridge that lets you reason about Langevin samplers, diffusion models, and stochastic optimization in PDE language: stationary distributions become null spaces, mixing times become spectral gaps, score functions become drift-density couplings.

Advanced · Tier 2 · Stable · ~50 min

Why This Matters

Solving an SDE gives you sample paths; solving its Fokker–Planck equation gives you the density of those paths at every time. For most ML applications you do not actually want one trajectory. You want to know what distribution your sampler is converging to, how fast it gets there, and whether any of that depends on a hyperparameter you can tune. Those are PDE questions, not SDE questions, and the Fokker–Planck equation is what turns one into the other.

This is why the Fokker–Planck equation shows up implicitly under nearly every sampling and generative-modeling result in modern ML. The stationary density of Langevin dynamics spans the kernel of its Fokker–Planck operator; the convergence rate of the continuous-time dynamics that SGLD and ULA discretize is governed by the spectral gap of that operator; the forward noising process in diffusion models is a Fokker–Planck PDE evolving the data distribution toward a standard Gaussian; the score-matching loss is a re-expression of one term in the Fokker–Planck operator.

The same equation also appears in physics (chemical kinetics, polymer dynamics, plasma transport), finance (option pricing under nonconstant volatility), and biology (population genetics under drift). Few PDEs are shared across as many disciplines.

Mental Model

An SDE pushes a single particle around with drift $b$ and noise $\sigma$. If you instead release a cloud of independent particles all from the same initial distribution and watch their density $p(x, t)$ evolve, that density satisfies a deterministic PDE: the drift $b$ advects the cloud, and the noise $\sigma$ smears it. The Fokker–Planck equation is exactly this advection–diffusion PDE. The smearing is anisotropic and position-dependent because $\sigma$ can be a matrix that varies with $x$.

A useful aphorism: the SDE is the Lagrangian view (follow one particle), the Fokker–Planck equation is the Eulerian view (watch the field of particles). They contain the same information, expressed in dual languages.

Formal Derivation

Definition

Fokker–Planck Equation (Forward Kolmogorov)

Let $X_t \in \mathbb{R}^d$ solve the Itô SDE $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$ with $X_0 \sim p_0$, and suppose $X_t$ admits a smooth density $p(x, t)$. Then $p$ satisfies

$$\partial_t p(x, t) = -\sum_{i=1}^d \partial_i \big(b_i(x, t)\, p(x, t)\big) + \tfrac{1}{2} \sum_{i,j=1}^d \partial_i \partial_j \big( D_{ij}(x, t)\, p(x, t) \big),$$

with diffusion tensor $D = \sigma \sigma^\top$, initial condition $p(x, 0) = p_0(x)$, and decay-at-infinity boundary conditions.

The right-hand side is the forward Kolmogorov operator $\mathcal{L}^* p$. Its formal adjoint $\mathcal{L} f = b \cdot \nabla f + \tfrac{1}{2} \operatorname{Tr}(D \nabla^2 f)$ is the infinitesimal generator of the SDE.

Theorem

Fokker–Planck Derivation from Itô

Statement

Under the assumptions above, for every test function $f \in C_c^\infty(\mathbb{R}^d)$ and time $t \ge 0$, $\frac{d}{dt} \mathbb{E}[f(X_t)] = \mathbb{E}[\mathcal{L} f(X_t)]$. Writing this as $\int f\, \partial_t p\,dx = \int (\mathcal{L} f)\, p\,dx$ and integrating by parts gives $\partial_t p = \mathcal{L}^* p$ as a distributional identity, which equals the Fokker–Planck equation in the display above.

Intuition

The generator $\mathcal{L}$ is "what happens to the expectation of a function as the particle moves." Its adjoint flips the perspective: "what happens to the density as the cloud of particles moves." The Itô formula gives the first; integration by parts trades it for the second.

Proof Sketch

By Itô, $df(X_t) = \mathcal{L} f(X_t)\,dt + (\nabla f)^\top \sigma\,dB_t$. The stochastic integral is a martingale (test functions have compact support), so $\mathbb{E}[f(X_t)] = \mathbb{E}[f(X_0)] + \int_0^t \mathbb{E}[\mathcal{L} f(X_s)]\,ds$. Differentiating in $t$ and writing both sides as integrals against $p(\cdot, s)$ gives $\int f\, \partial_s p = \int (\mathcal{L} f)\, p = \int f\, (\mathcal{L}^* p)$ for every $f$, which forces $\partial_s p = \mathcal{L}^* p$ pointwise on the support.
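
The integration by parts in the last step can be written out term by term. This is a sketch, assuming $f$ has compact support and $b$, $D$, $p$ are smooth enough that every boundary term vanishes:

$$
\begin{aligned}
\int (b \cdot \nabla f)\, p\,dx &= -\int f\, \nabla \cdot (b\, p)\,dx, \\
\int \tfrac{1}{2} \sum_{i,j} D_{ij}\, (\partial_i \partial_j f)\, p\,dx &= \int f\, \tfrac{1}{2} \sum_{i,j} \partial_i \partial_j (D_{ij}\, p)\,dx.
\end{aligned}
$$

Adding the two lines gives $\int (\mathcal{L} f)\, p\,dx = \int f\, (\mathcal{L}^* p)\,dx$. The drift term is integrated by parts once, which produces the minus sign; the diffusion term is integrated by parts twice, so its sign is unchanged.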

Why It Matters

Every claim about "what distribution the SDE converges to" or "how fast it mixes" is, formally, a claim about $\mathcal{L}^*$. Spectral analysis of $\mathcal{L}^*$ gives mixing rates; null-space analysis gives stationary distributions; positivity and mass-conservation properties live in the PDE. None of this is visible from the SDE itself. The density $p$ satisfies $\partial_t p = -\nabla \cdot (bp) + \tfrac{1}{2} \sum_{i,j} \partial_i \partial_j ((\sigma\sigma^\top)_{ij}\, p)$ in the weak sense.

Failure Mode

The derivation assumes a smooth density exists. SDEs with degenerate diffusion (e.g., CIR near zero, or systems where $\sigma$ is rank-deficient) may have singular or measure-valued solutions to the Fokker–Planck equation. When $\sigma$ is rank-deficient, smoothness of the density has to be established separately, typically through Hörmander's theorem (hypoellipticity) or Malliavin calculus. The clean PDE form holds without that extra work only for elliptic, uniformly nondegenerate diffusions.

Stationary Distributions

A stationary density $p_\infty$ is a fixed point of the Fokker–Planck flow: $\mathcal{L}^* p_\infty = 0$. For the canonical "overdamped Langevin" SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2/\beta}\,dB_t$ with potential $U$ and inverse temperature $\beta$, the stationary distribution is the Gibbs measure

$$p_\infty(x) = \frac{1}{Z} e^{-\beta U(x)}, \qquad Z = \int e^{-\beta U(x)}\,dx.$$

Plug it into $\mathcal{L}^*$ and the drift and diffusion terms cancel exactly. This is the entire foundation of MCMC sampling from energy-based models, of SGLD as approximate Bayesian inference, and of the equilibrium statistical mechanics that generative models implicitly target.
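
The cancellation can also be checked numerically. The sketch below is one possible check, not a prescribed method: it assumes a one-dimensional double-well potential $U(x) = (x^2 - 1)^2$, $\beta = 2$, and a uniform grid (all illustrative choices), and evaluates the two terms of $\mathcal{L}^* p_\infty$ with finite differences.

```python
import numpy as np

# Illustrative assumptions: double-well potential, beta = 2, uniform grid.
beta = 2.0
x = np.linspace(-3.0, 3.0, 2001)
h = x[1] - x[0]

U = (x**2 - 1.0) ** 2          # potential U(x)
dU = 4.0 * x * (x**2 - 1.0)    # U'(x), computed analytically

# Gibbs density, normalized with a simple Riemann sum.
p = np.exp(-beta * U)
p /= p.sum() * h

# Fokker-Planck right-hand side for dX = -U'(X) dt + sqrt(2/beta) dB:
#   L* p = d/dx (U' p) + (1/beta) d^2 p / dx^2
drift_term = np.gradient(dU * p, h)
diffusion_term = np.gradient(np.gradient(p, h), h) / beta
residual = drift_term + diffusion_term

print("max |drift term|    :", np.abs(drift_term).max())
print("max |diffusion term|:", np.abs(diffusion_term).max())
print("max |residual|      :", np.abs(residual).max())
```

Each term is order one on its own, while their sum is at the level of the finite-difference truncation error, which shrinks as the grid is refined.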

For non-reversible SDEs (those whose drift is not $-\nabla U$ for any $U$), a stationary distribution still exists and is unique when $\mathcal{L}^*$ has a one-dimensional kernel spanned by a normalizable density, but in general it no longer has an explicit Gibbs form in terms of the drift, and the analysis is much harder.

Convergence Rates and Spectral Gap

For reversible (Langevin) dynamics, the Fokker–Planck operator $\mathcal{L}^*$ is self-adjoint and negative semidefinite on the weighted space $L^2(1/p_\infty)$ (equivalently, the generator $\mathcal{L}$ is self-adjoint on $L^2(p_\infty)$), with $0$ as a simple eigenvalue and eigenvector $p_\infty$. The next eigenvalue $-\lambda_1 < 0$ controls exponential convergence:

$$\lVert p(\cdot, t) - p_\infty \rVert_{L^2(1/p_\infty)}^2 \le e^{-2 \lambda_1 t}\, \lVert p_0 - p_\infty \rVert_{L^2(1/p_\infty)}^2.$$

The constant $\lambda_1$ is the spectral gap, and Bakry–Émery theory gives sharp lower bounds in terms of the convexity of $U$: if $\nabla^2 U \succeq m\, I$ for some $m > 0$, then $\lambda_1 \ge m$ and the law of $X_t$ converges exponentially fast at rate $m$. The temperature drops out because the generator $-\nabla U \cdot \nabla + \tfrac{1}{\beta}\Delta$ is $\tfrac{1}{\beta}$ times the unit-temperature generator for the potential $\beta U$, whose convexity constant is $\beta m$. This is the cleanest statement of "Langevin mixes fast on log-concave targets," and it is the starting point for most convergence guarantees for gradient-based MCMC on convex problems.

The stronger log-Sobolev inequality $\operatorname{Ent}_{p_\infty}(f^2) \le \frac{2}{\rho_{\text{LSI}}} \mathbb{E}_{p_\infty}[\lVert \nabla f \rVert^2]$ implies the same exponential rate in KL divergence, not just $L^2$, and is the right tool whenever you need entropic rather than energetic convergence guarantees.
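
To see a spectral gap concretely, one can discretize a Fokker–Planck operator and inspect its leading eigenvalues. The sketch below assumes the one-dimensional linear SDE $dX_t = -\theta X_t\,dt + \sigma\,dB_t$ with $U(x) = \tfrac{1}{2}\theta x^2$ (so $m = \theta$), illustrative values $\theta = 1.5$, $\sigma = 1$, and a central-difference grid with zero density outside it. These are choices made for this example only.

```python
import numpy as np

# Illustrative parameters for dX = -theta X dt + sigma dB.
theta, sigma = 1.5, 1.0

# Uniform grid covering about +/- 7 stationary standard deviations.
x = np.linspace(-4.0, 4.0, 401)
h = x[1] - x[0]
n = x.size

# Central-difference matrix for
#   L* p = theta d/dx (x p) + (sigma^2 / 2) d^2 p / dx^2,
# with p = 0 imposed outside the grid (the density is negligible there).
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, idx] = -sigma**2 / h**2
A[idx[:-1], idx[1:]] = theta * x[1:] / (2 * h) + sigma**2 / (2 * h**2)
A[idx[1:], idx[:-1]] = -theta * x[:-1] / (2 * h) + sigma**2 / (2 * h**2)

# Largest eigenvalues: 0 (stationary density) first, then -lambda_1, ...
eigs = np.sort(np.linalg.eigvals(A).real)[::-1]
leading = eigs[:4]
print("leading eigenvalues:", np.round(leading, 3))
print("estimated spectral gap:", -leading[1], "vs m = theta =", theta)
```

The estimated gap comes out close to $\theta$ and does not move when $\sigma$ is changed, which matches the bound $\lambda_1 \ge m$ being attained for a quadratic potential.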

Worked Example: Ornstein–Uhlenbeck Density

Take $dX_t = -\theta X_t\,dt + \sigma\,dB_t$ in $\mathbb{R}$. The Fokker–Planck equation is $\partial_t p = \theta\, \partial_x (x\, p) + \tfrac{1}{2} \sigma^2\, \partial_{xx} p$. With Gaussian initial data $X_0 \sim \mathcal{N}(\mu_0, v_0)$ the solution stays Gaussian with

$$\mu(t) = e^{-\theta t} \mu_0, \qquad v(t) = e^{-2\theta t} v_0 + \frac{\sigma^2}{2\theta}\big(1 - e^{-2\theta t}\big).$$

The stationary distribution is $\mathcal{N}(0, \sigma^2/(2\theta))$, exactly the Gibbs measure $e^{-\beta U}/Z$ for $U(x) = \tfrac{1}{2}\theta x^2$ and $\beta = 2/\sigma^2$ (so that $\sqrt{2/\beta} = \sigma$). The spectral gap is $\theta$, the same $\theta$ that controls mean-reversion in the SDE, so the SDE and PDE views agree on the convergence rate, as they must.
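
The closed-form moments can be compared against a particle simulation of the SDE. The following is a minimal sketch using Euler–Maruyama; the parameter values, step size, particle count, and seed are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative parameters and initial Gaussian law N(mu0, v0).
theta, sigma = 1.0, 1.0
mu0, v0 = 2.0, 0.25
T, dt = 1.0, 0.005
n_steps = int(round(T / dt))
n_particles = 100_000

rng = np.random.default_rng(0)
X = mu0 + np.sqrt(v0) * rng.standard_normal(n_particles)

# Euler-Maruyama: X <- X - theta X dt + sigma sqrt(dt) xi, with xi ~ N(0, 1).
for _ in range(n_steps):
    X += -theta * X * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_particles)

# Closed-form Fokker-Planck moments at time T.
decay = np.exp(-2 * theta * T)
mu_exact = np.exp(-theta * T) * mu0
v_exact = decay * v0 + sigma**2 / (2 * theta) * (1 - decay)

print(f"mean:     simulated {X.mean():.4f}, exact {mu_exact:.4f}")
print(f"variance: simulated {X.var():.4f}, exact {v_exact:.4f}")
```

The simulated values agree with the formulas up to Monte Carlo noise and the $O(dt)$ bias of the Euler–Maruyama scheme; shrinking `dt` and raising `n_particles` tightens the match.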

Common Confusions

Watch Out

Forward and backward Kolmogorov are different equations

The Fokker–Planck equation is the forward Kolmogorov equation: it evolves the density of $X_t$ forward in time. The backward Kolmogorov equation $\partial_t u + \mathcal{L} u = 0$ with terminal condition $u(T, x) = g(x)$ evolves the value function $u(t, x) = \mathbb{E}[g(X_T) \mid X_t = x]$ backward in time and uses the generator $\mathcal{L}$, not its adjoint. Forward asks "where will the particle be?", backward asks "what is the expected payoff if it ends here?". The Feynman–Kac formula and the deep BSDE method live in the backward picture.
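
As a concrete check of the backward equation, take the Ornstein–Uhlenbeck process $dX_t = -\theta X_t\,dt + \sigma\,dB_t$ from the worked example, with payoff $g(x) = x^2$ (an illustrative choice). Writing $\tau = T - t$, the conditional second moment is

$$u(t, x) = x^2 e^{-2\theta\tau} + \frac{\sigma^2}{2\theta}\big(1 - e^{-2\theta\tau}\big),$$

and differentiating gives

$$
\begin{aligned}
\partial_t u &= 2\theta x^2 e^{-2\theta\tau} - \sigma^2 e^{-2\theta\tau}, \\
\mathcal{L} u &= -\theta x\, \partial_x u + \tfrac{1}{2}\sigma^2\, \partial_{xx} u = -2\theta x^2 e^{-2\theta\tau} + \sigma^2 e^{-2\theta\tau}.
\end{aligned}
$$

The two lines sum to zero and $u(T, x) = x^2$, so $u$ solves the backward equation with the generator $\mathcal{L}$ and a terminal condition. The forward equation for the same process uses $\mathcal{L}^*$ and an initial condition.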

Watch Out

The diffusion tensor is $\sigma\sigma^\top$, not $\sigma$

A common bug is to write the second derivative term as $\partial_i \partial_j (\sigma_{ij}\, p)$ instead of $\partial_i \partial_j ((\sigma \sigma^\top)_{ij}\, p)$. The diffusion enters the Fokker–Planck equation only through its covariance matrix $D = \sigma \sigma^\top$; two SDEs with the same drift and the same $\sigma \sigma^\top$ have the same Fokker–Planck equation, even if their $\sigma$ matrices differ (e.g., differ by a right rotation). This is also why diffusion models with isotropic noise can specify it through a scalar schedule $g(t)$: any $\sigma$ with $\sigma\sigma^\top = g(t)^2 I$ yields the same density evolution.
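
The invariance under a right rotation is easy to confirm numerically. In this sketch the matrix `sigma`, the rotation angle, the step size, and the sample count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 2x2 noise matrix and a rotation by an arbitrary angle.
sigma = np.array([[1.0, 0.5],
                  [0.0, 2.0]])
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
sigma_rot = sigma @ R  # right rotation of sigma

# The diffusion tensors agree exactly (up to floating-point error).
D = sigma @ sigma.T
D_rot = sigma_rot @ sigma_rot.T
print("same diffusion tensor:", np.allclose(D, D_rot))

# Empirical check: the noise increment sigma dB has covariance D * dt.
dt = 0.01
dB = np.sqrt(dt) * rng.standard_normal((200_000, 2))
cov = np.cov(dB @ sigma.T, rowvar=False) / dt
cov_rot = np.cov(dB @ sigma_rot.T, rowvar=False) / dt
print(np.round(cov, 2))
print(np.round(cov_rot, 2))
```

Both empirical covariances estimate the same matrix $D$, even though the individual noise increments produced by `sigma` and `sigma_rot` differ sample by sample.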

Watch Out

A stationary distribution is not the same as a steady-state flux

$\mathcal{L}^* p = 0$ has two meaningful sub-cases. **Reversible** (Langevin) dynamics: drift $b = -\nabla U$ and the equilibrium current $J = b\, p_\infty - \tfrac{1}{2} D \nabla p_\infty$ is identically zero; the system is in *detailed balance*. **Non-reversible** dynamics: $\mathcal{L}^* p_\infty = 0$ but $J \ne 0$; the system has steady probability currents, only their divergence vanishes. Non-reversible Langevin samplers exploit exactly this to mix faster than reversible ones; they sit at a stationary distribution but with persistent rotational flow.
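
A minimal two-dimensional illustration, under assumed choices made only for this example: take $U(x) = \tfrac{1}{2}\lVert x \rVert^2$ and $\beta = 1$ (so $D = 2I$), and add a rotational drift of strength $a$ through the antisymmetric matrix $A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$, giving $b(x) = -(I + aA)\,x$. The Gaussian $p_\infty \propto e^{-\lVert x \rVert^2 / 2}$ satisfies $\nabla p_\infty = -x\, p_\infty$, so

$$
\begin{aligned}
J &= b\, p_\infty - \tfrac{1}{2} D \nabla p_\infty = -(I + aA)\,x\, p_\infty + x\, p_\infty = -a\, A x\, p_\infty, \\
\nabla \cdot J &= -a \big( \operatorname{Tr}(A) - x^\top A x \big)\, p_\infty = 0,
\end{aligned}
$$

because $\operatorname{Tr}(A) = 0$ and $x^\top A x = 0$ for antisymmetric $A$. The same Gaussian is stationary for every $a$, but for $a \ne 0$ the current circulates around the origin; $a = 0$ recovers detailed balance.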

Exercises

Exercise (Core)

Problem

Verify by direct substitution that the Gibbs measure $p_\infty(x) = e^{-\beta U(x)}/Z$ is stationary for the overdamped Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2/\beta}\,dB_t$. Identify the cancellation that makes this work.

Exercise (Advanced)

Problem

For the OU process $dX_t = -\theta X_t\,dt + \sigma\,dB_t$, compute the full spectrum of the Fokker–Planck operator on $L^2(1/p_\infty)$ (equivalently, of the generator on $L^2(p_\infty)$). Show that the eigenvalues are $-n\theta$ for $n = 0, 1, 2, \dots$ and identify the corresponding eigenfunctions.

Next Topics

  • Langevin Dynamics: the canonical SDE whose Fokker–Planck equation has a Gibbs stationary distribution.
  • Score Matching: training a network on the score $\nabla \log p$ that appears inside the Fokker–Planck operator.
  • Diffusion Models: forward noising SDEs and their Fokker–Planck duals; reverse-time samplers.
  • Time Reversal of SDEs: how the reversed SDE's drift involves the score $\nabla \log p$ that the forward Fokker–Planck operator already contains.
  • SGD as SDE: SGD viewed as a discretization of a Langevin-type SDE; the implicit Fokker–Planck equation for SGD's stationary distribution.

Last reviewed: April 18, 2026
