
Langevin Dynamics

The overdamped Langevin SDE: dX_t = -∇U(X_t) dt + sqrt(2/β) dB_t. Gradient descent plus calibrated Gaussian noise. The mathematical object that makes SGLD, ULA, and energy-based MCMC samplers work, and the simplest sampler with provable polynomial-time convergence on log-concave targets.


Why This Matters

If you want to sample from a target distribution $p_\infty(x) \propto \exp(-\beta U(x))$ and you can compute $\nabla U$, you have a clean recipe: run the SDE

$$dX_t = -\nabla U(X_t)\,dt + \sqrt{2/\beta}\,dB_t,$$

and at large $t$ the law of $X_t$ converges to $p_\infty(x) \propto \exp(-\beta U(x))$. This is overdamped Langevin dynamics. It is gradient descent on $U$ plus a precisely calibrated Gaussian noise, and the calibration is what makes the distribution, not just the trajectory, converge to the right place.

Almost every gradient-based MCMC sampler in modern ML is a discretization or perturbation of this SDE. The unadjusted Langevin algorithm (ULA) is its Euler–Maruyama discretization. SGLD (Welling–Teh 2011) replaces $\nabla U$ with a stochastic mini-batch estimate. Score-based diffusion models generate samples by running a Langevin-like reverse SDE whose drift is the learned score $\nabla \log p$. MALA, HMC, and proximal Langevin are all targeted refinements of the same machinery.

Langevin dynamics is also the SDE that gives the cleanest non-asymptotic guarantees in modern sampling theory. Dalalyan (2017) and Durmus–Moulines (2017) showed that ULA on a strongly log-concave target reaches a prescribed total-variation distance $\varepsilon$ from $p_\infty$ in $\tilde{O}(d / \varepsilon^2)$ steps: polynomial in dimension, with no intractable constants. This is the kind of guarantee Metropolis–Hastings methods cannot provide, and it explains why Langevin-based methods are the default for sampling from energy-based models in dimensions above ~50.

Mental Model

Read the dynamics as a competition between two forces. The drift $-\nabla U$ pulls the particle downhill toward modes of $p_\infty$, exactly like gradient descent. The noise $\sqrt{2/\beta}\,dB$ pushes the particle around with a magnitude set by the inverse temperature $\beta$. In the zero-temperature limit $\beta \to \infty$ (no noise) you recover deterministic gradient flow, which gets stuck at the nearest local minimum. In the infinite-temperature limit $\beta \to 0$ (noise dominates) you get essentially pure Brownian motion, which never localizes anywhere. The Gibbs measure $\propto e^{-\beta U}$ is the unique stationary balance: probability concentrates near minima of $U$ but spreads out according to the temperature, and the relative weights between minima depend on $\beta$ times the energy gap.

The key calibration is the factor $\sqrt{2/\beta}$ in front of $dB$: any other coefficient gives a different stationary distribution. This is not a tunable hyperparameter; it is set by the requirement that the Fokker–Planck equation have $e^{-\beta U}$ as its stationary density.
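The calibration claim can be checked numerically. The following sketch (illustrative, not from the original text) runs the Euler discretization of the SDE on the quadratic potential $U(x) = x^2/2$ at $\beta = 2$, once with the correct noise scale $\sqrt{2/\beta}$ and once with the miscalibrated scale $\sqrt{1/\beta}$; the long-run variances land near $1/\beta = 0.5$ and $1/(2\beta) = 0.25$ respectively:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, h = 2.0, 0.01

def ula_long_run_variance(noise_scale, n_steps=2000, n_particles=100_000):
    """Discretized Langevin on U(x) = x^2/2: x <- x - h U'(x) + noise_scale * sqrt(h) * xi."""
    x = np.zeros(n_particles)
    for _ in range(n_steps):
        x = x - h * x + noise_scale * np.sqrt(h) * rng.standard_normal(n_particles)
    return x.var()

var_correct = ula_long_run_variance(np.sqrt(2.0 / beta))  # calibrated: stationary law ~ e^{-beta U}, Var = 1/beta
var_wrong = ula_long_run_variance(np.sqrt(1.0 / beta))    # miscalibrated: stationary law ~ e^{-2 beta U}, Var = 1/(2 beta)
print(var_correct, var_wrong)  # close to 0.5 and 0.25 for beta = 2
```

Halving the noise variance does not merely "explore less": it samples a genuinely different (sharper) distribution.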

Formal Definition

Definition

Overdamped Langevin SDE

For a potential $U: \mathbb{R}^d \to \mathbb{R}$ with $e^{-\beta U}$ integrable and a standard $d$-dimensional Brownian motion $B_t$, the overdamped Langevin SDE at inverse temperature $\beta$ is

$$dX_t = -\nabla U(X_t)\,dt + \sqrt{2/\beta}\,dB_t.$$

The associated Fokker–Planck equation is $\partial_t p = \nabla \cdot (\nabla U\, p) + \beta^{-1} \Delta p$, which can be rewritten in gradient-flow form $\partial_t p = \nabla \cdot \big(p\, \nabla \log(p / p_\infty)\big) / \beta$, where $p_\infty(x) = e^{-\beta U(x)}/Z$. This is the formal sense in which Langevin dynamics is the gradient flow of the relative entropy $\operatorname{KL}(p \,\|\, p_\infty)$ in the Wasserstein-2 metric (Jordan, Kinderlehrer, and Otto, 1998).

The "overdamped" qualifier distinguishes this from the second-order underdamped Langevin SDE that includes a momentum variable and is the basis of Hamiltonian Monte Carlo.

Stationary Distribution

Theorem

Gibbs Measure is the Stationary Distribution of Langevin

Statement

Under the assumptions above, $p_\infty(x) = e^{-\beta U(x)}/Z$ with $Z = \int e^{-\beta U(x)}\,dx$ is the unique invariant density of the Langevin Fokker–Planck operator $\mathcal{L}^* p = \nabla \cdot (\nabla U\, p) + \beta^{-1} \Delta p$. The dynamics is reversible with respect to $p_\infty$, and detailed balance holds: the equilibrium probability current $J = -p_\infty \nabla U - \beta^{-1} \nabla p_\infty$ is identically zero.

Intuition

Plug $p_\infty$ into $\mathcal{L}^*$ and watch the cancellation. The drift term is $\nabla \cdot (\nabla U\, p_\infty) = (\Delta U)\, p_\infty - \beta\, \lvert \nabla U \rvert^2 p_\infty$. The diffusion term is $\beta^{-1} \Delta p_\infty = -\beta^{-1}(\beta\, \Delta U)\, p_\infty + \beta^{-1} \beta^2\, \lvert \nabla U \rvert^2 p_\infty$. They sum to zero. The factor $\sqrt{2/\beta}$ in the SDE is exactly what makes this cancellation work.
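The cancellation can also be verified symbolically for a generic one-dimensional potential. This sketch (assuming SymPy is available) checks that the drift and diffusion terms applied to the unnormalized Gibbs density sum to zero:

```python
import sympy as sp

x, beta = sp.symbols('x beta', positive=True)
U = sp.Function('U')(x)           # generic 1-d potential
p_inf = sp.exp(-beta * U)         # unnormalized Gibbs density e^{-beta U}

drift_term = sp.diff(sp.diff(U, x) * p_inf, x)   # d/dx ( U'(x) p_inf )
diffusion_term = sp.diff(p_inf, x, 2) / beta     # beta^{-1} d^2/dx^2 p_inf
print(sp.simplify(drift_term + diffusion_term))  # prints 0
```

No property of $U$ beyond differentiability is used, which is why the stationarity of the Gibbs measure holds for any confining potential.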

Proof Sketch

For uniqueness, restrict to $L^2(p_\infty)$. The Fokker–Planck operator, when written in symmetrized form $\mathcal{L}^* = \nabla \cdot \big(p_\infty \nabla (\cdot / p_\infty)\big) / \beta$, is self-adjoint and negative semidefinite, with $0$ a simple eigenvalue when $U$ has connected sublevel sets. The eigenvector for $0$ is $p_\infty$. Standard semigroup theory then gives $p(\cdot, t) \to p_\infty$ in $L^2(p_\infty)$ for any initial $p_0$ with finite relative entropy.

Why It Matters

This is the entire reason Langevin dynamics works as a sampler: with the right calibration, the SDE has the target distribution as its unique equilibrium. Tampering with the noise scale (e.g., using $\sqrt{1/\beta}$ instead of $\sqrt{2/\beta}$) changes the stationary distribution to $\propto e^{-2\beta U}$: the same modes, but at a sharper effective temperature. Methods that anneal $\beta$ (simulated annealing) exploit this dependence to escape local minima at high temperature, then localize at low temperature.

Failure Mode

The result requires $e^{-\beta U}$ to be integrable. For $U(x) = c\, x$ (linear potential) this fails: the dynamics drifts to infinity and has no stationary distribution. The same happens whenever $U$ grows too slowly, or decreases, at infinity in some direction. Practical samplers handle this with confinement: add a quadratic regularizer $\lambda \lVert x \rVert^2 / 2$ to $U$ if the target is not provably tight.

Convergence Rate: Log-Sobolev and Bakry–Émery

The asymptotic statement "$X_t$ converges to $p_\infty$" is qualitative. The quantitative statement requires a functional inequality for the stationary measure. The cleanest one is the log-Sobolev inequality (LSI) $\operatorname{Ent}_{p_\infty}(f^2) \le \frac{2}{\rho_{\text{LSI}}}\, \mathbb{E}_{p_\infty}[\lVert \nabla f \rVert^2]$. Whenever $p_\infty$ satisfies the LSI with constant $\rho_{\text{LSI}}$, the law of $X_t$ converges to $p_\infty$ in KL divergence at exponential rate $\rho_{\text{LSI}} / \beta$:

$$\operatorname{KL}\big(p(\cdot, t)\, \|\, p_\infty\big) \le e^{-2 \rho_{\text{LSI}} t / \beta}\, \operatorname{KL}\big(p_0\, \|\, p_\infty\big).$$

The Bakry–Émery curvature criterion says: if $\nabla^2 (\beta U) \succeq m\, I$ for some $m > 0$ (i.e., the rescaled potential is strongly convex), then $p_\infty$ satisfies the LSI with constant at least $m$, and the convergence rate is at least $m / \beta$. This is the sharpest and cleanest convergence theorem for Langevin dynamics on log-concave targets.
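For the quadratic potential $U(x) = x^2/2$ the SDE is an Ornstein–Uhlenbeck process, so the law of $X_t$ is Gaussian in closed form and the predicted rate can be checked exactly. In this sketch (parameter choices are illustrative), $\nabla^2(\beta U) = \beta$, so $m = \beta$ and the predicted KL decay is $e^{-2t}$:

```python
import numpy as np

beta, x0 = 2.0, 3.0  # illustrative inverse temperature and deterministic start

def kl_to_gibbs(t):
    """KL( law of X_t || p_inf ) for dX = -X dt + sqrt(2/beta) dB with X_0 = x0.
    The solution is Gaussian: mean x0 e^{-t}, variance (1 - e^{-2t})/beta,
    and p_inf = N(0, 1/beta)."""
    mu = x0 * np.exp(-t)
    var = (1.0 - np.exp(-2.0 * t)) / beta
    s2 = 1.0 / beta
    return 0.5 * (var / s2 + mu ** 2 / s2 - 1.0 - np.log(var / s2))

# Hessian of beta*U is beta, so m = beta and the predicted KL decay rate is e^{-2t}:
# each unit of time should shrink the KL divergence by a factor of about exp(-2).
ratios = [kl_to_gibbs(t + 1.0) / kl_to_gibbs(t) for t in (2.0, 3.0, 4.0)]
print(ratios)  # each close to exp(-2) ~ 0.135
```

Note the $\beta$-independence of the rate here: the stronger curvature of $\beta U$ exactly offsets the weaker noise, which is what the $m/\beta$ formula predicts.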

For non-log-concave $p_\infty$ (multimodal energy landscapes), the LSI can fail catastrophically. Holley–Stroock perturbation gives an LSI constant that scales like $e^{-\beta\, \Delta U}$, where $\Delta U$ is the energy barrier between modes, producing exponential slowdown in the barrier height. This is the mathematical statement of "Langevin gets stuck at low temperature."

Numerical Discretization: ULA

The Unadjusted Langevin Algorithm is Euler–Maruyama applied to the Langevin SDE:

$$X_{k+1} = X_k - h\, \nabla U(X_k) + \sqrt{2 h / \beta}\,\xi_k, \qquad \xi_k \stackrel{\text{iid}}{\sim} \mathcal{N}(0, I_d).$$

ULA does not exactly preserve the Gibbs measure. The discretization introduces a bias of size $O(h)$ in total variation. There is a clean non-asymptotic theory.
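A quadratic potential makes the bias exactly computable: for $U(x) = x^2/2$ at $\beta = 1$, the ULA iterates form an AR(1) chain whose stationary variance is $1/(1 - h/2)$ rather than the true Gibbs variance $1$, an $O(h)$ bias. This illustrative simulation compares the empirical variance against that closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

def ula_stationary_var(h, n_steps=2000, n_particles=50_000):
    """Empirical long-run variance of ULA on U(x) = x^2/2 at beta = 1."""
    x = np.zeros(n_particles)
    for _ in range(n_steps):
        x = (1.0 - h) * x + np.sqrt(2.0 * h) * rng.standard_normal(n_particles)
    return x.var()

# For this quadratic potential the ULA chain is AR(1) with stationary variance
#   2h / (1 - (1 - h)^2) = 1 / (1 - h/2),
# versus the true Gibbs variance 1: the bias shrinks linearly in h.
step_sizes = (0.2, 0.1, 0.05)
pairs = [(ula_stationary_var(h), 1.0 / (1.0 - h / 2.0)) for h in step_sizes]
for h, (emp, exact) in zip(step_sizes, pairs):
    print(f"h={h}: empirical {emp:.4f}, exact discrete stationary {exact:.4f}")
```

Halving $h$ roughly halves the variance bias, which is the one-dimensional face of the general $O(h)$ total-variation bias.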

Theorem

Dalalyan's Non-Asymptotic ULA Bound

Statement

Under the assumptions above, the law $\mu_K^{\text{ULA}}$ of $X_K$ satisfies $\operatorname{TV}(\mu_K^{\text{ULA}}, p_\infty) \le \varepsilon$ provided $K \ge C\, \frac{d}{m\, \varepsilon^2}\, \log\!\big(\frac{1}{\varepsilon}\big)$ for an absolute constant $C$ and a step size $h$ tuned proportionally to $\varepsilon^2 / d$.

Intuition

The proof couples the continuous-time Langevin SDE (which converges exponentially fast by Bakry–Émery) with its Euler–Maruyama discretization. The discretization bias accumulates linearly in the number of steps but is $O(h)$ per step, so the total bias is $O(K h)$. Choosing $h$ small enough balances the continuous-time convergence error against the discretization bias.

Proof Sketch

Let $\mu_K^{\text{cont}}$ be the law of the continuous Langevin SDE at time $Kh$. Triangle inequality: $\operatorname{TV}(\mu_K^{\text{ULA}}, p_\infty) \le \operatorname{TV}(\mu_K^{\text{ULA}}, \mu_K^{\text{cont}}) + \operatorname{TV}(\mu_K^{\text{cont}}, p_\infty)$. The first term is the discretization bias, bounded via Girsanov's theorem and the Lipschitz hypothesis on $\nabla U$; it scales as $\sqrt{K h^3 d L^2}$ (Dalalyan 2017, Theorem 1). The second term decays as $e^{-m K h}$ by Bakry–Émery. Optimize $h$ over the sum.

Why It Matters

This was the first dimension-polynomial sampling result for a gradient-based MCMC method on log-concave targets. Earlier methods (MH, HMC) only had asymptotic guarantees with constants that were either intractable or known to scale exponentially in $d$. The result lit a research program: faster Langevin variants (MALA: $O(d/\varepsilon)$ gradient queries; underdamped Langevin: $\tilde{O}(\sqrt{d}/\varepsilon)$; proximal Langevin for non-smooth $U$), all measured against this baseline.

Failure Mode

The result requires strong log-concavity (positive lower bound on the Hessian). Weakly log-concave or multimodal targets fall outside the hypothesis, and ULA's bias does not vanish in the standard sense. For non-log-concave problems, the Metropolis adjustment (MALA) is needed to remove the bias, but MALA's mixing time can be exponentially worse in the energy-barrier height.

Stochastic-Gradient Variant: SGLD

Welling and Teh (2011) introduced Stochastic Gradient Langevin Dynamics (SGLD): replace the exact gradient $\nabla U(X_k)$ in ULA with a mini-batch estimate $\hat{\nabla} U(X_k)$ computed from a random subset of the data, and decay the step size $h_k \to 0$ so that the discretization bias and the gradient-noise bias both vanish in the limit:

$$X_{k+1} = X_k - h_k\, \hat{\nabla} U(X_k) + \sqrt{2 h_k / \beta}\,\xi_k.$$

The intuition: at large $k$, the injected Gaussian noise $\sqrt{2 h_k / \beta}\,\xi_k$ dominates the mini-batch gradient noise, which enters at scale $h_k$ (because $\sqrt{h_k} \gg h_k$), and the algorithm transitions from SGD-like exploration to Langevin-like sampling. Without the decay, the mini-batch noise dominates indefinitely and the stationary distribution is not the Gibbs measure but a perturbation of it depending on the gradient covariance, a phenomenon that motivates the entire SGD-as-SDE literature.
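A minimal SGLD sketch on a conjugate toy model (all parameter choices here are illustrative, not from Welling–Teh): $y_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$, so the exact posterior is $\mathcal{N}\big(\sum_i y_i/(n+1),\, 1/(n+1)\big)$ and the SGLD samples can be checked against it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model: y_i ~ N(theta, 1), prior theta ~ N(0, 1).
# Exact posterior: N( sum(y)/(n+1), 1/(n+1) ), our ground truth.
n, batch = 1000, 50
y = rng.normal(1.5, 1.0, size=n)
post_mean = y.sum() / (n + 1)

# U(theta) = theta^2/2 + sum_i (theta - y_i)^2 / 2  (negative log posterior, beta = 1)
theta, samples = 0.0, []
for k in range(20_000):
    h = 0.5 / (n * (10.0 + k) ** 0.55)                       # polynomially decaying step size
    idx = rng.integers(0, n, size=batch)                     # random mini-batch
    grad_hat = theta + (n / batch) * np.sum(theta - y[idx])  # unbiased estimate of grad U
    theta = theta - h * grad_hat + np.sqrt(2.0 * h) * rng.standard_normal()
    if k >= 5_000:                                           # discard burn-in
        samples.append(theta)

print(np.mean(samples), post_mean)  # the SGLD sample mean tracks the posterior mean
```

The decay exponent $0.55$ sits in the usual $(0.5, 1]$ range: slow enough that the total injected time $\sum_k h_k$ diverges (the chain keeps mixing), fast enough that the per-step biases vanish.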

Common Confusions

Watch Out

Langevin dynamics is not gradient flow

Gradient flow $\dot{x} = -\nabla U(x)$ converges to a minimizer of $U$; Langevin dynamics converges to the Gibbs distribution $\propto e^{-\beta U}$. These give different answers when $U$ is multimodal, when $\beta$ is finite, or when you care about uncertainty rather than a point estimate. The noise term is not "regularization" or an "exploration heuristic": it is what makes Langevin a sampler instead of an optimizer, and removing it loses the entire Bayesian / sampling story.

Watch Out

The step size of ULA controls bias, not variance

Standard SGD intuition: small step sizes give low variance and slow convergence, large step sizes give the opposite. For ULA the trade-off is qualitatively different: small step sizes give low bias against the true Gibbs measure but slow continuous-time mixing; large step sizes give high bias (the iterates' stationary distribution is a perturbation of Gibbs). Throwing CPU at smaller and smaller $h$ is the right move when bias is the bottleneck; there is no diminishing return on step-size shrinkage that mirrors the variance / step-size trade-off in optimization.

Watch Out

Underdamped Langevin is a different SDE, not just a numerical trick

Overdamped Langevin lives in state space $\mathbb{R}^d$. Underdamped Langevin lives in phase space $\mathbb{R}^{2d}$ with a momentum variable $V_t$ and friction coefficient $\gamma$: $dX_t = V_t\,dt$, $dV_t = -\nabla U(X_t)\,dt - \gamma V_t\,dt + \sqrt{2 \gamma / \beta}\,dB_t$. The stationary distribution is $\propto e^{-\beta (U + \lvert v \rvert^2 / 2)}$, which marginalizes back to Gibbs in $x$. The mixing time scales as $\sqrt{d}$ instead of $d$ because momentum gives the dynamics inertia to traverse low-curvature regions faster. This is the SDE behind HMC.
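A rough numerical check (Euler–Maruyama with illustrative parameters; in practice splitting schemes are preferred for this SDE) that the underdamped dynamics equilibrates to the claimed phase-space Gibbs measure on a quadratic potential:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, gamma, h = 1.0, 1.0, 0.01
n_particles, n_steps = 50_000, 3000

# Euler-Maruyama for underdamped Langevin with U(x) = x^2 / 2:
#   dX = V dt,   dV = -grad U(X) dt - gamma V dt + sqrt(2 gamma / beta) dB.
x = np.zeros(n_particles)
v = np.zeros(n_particles)
for _ in range(n_steps):
    x_new = x + h * v
    v = v + h * (-x - gamma * v) + np.sqrt(2.0 * gamma * h / beta) * rng.standard_normal(n_particles)
    x = x_new

# Stationary law ~ e^{-beta (U(x) + v^2/2)}: both marginals should be N(0, 1/beta).
print(x.var(), v.var())  # both near 1.0, up to O(h) discretization bias
```

Only the momentum receives noise; the position equation is deterministic, yet the $x$-marginal still equilibrates to Gibbs because noise is transported from $v$ to $x$ through the coupling.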

Exercises

ExerciseCore

Problem

For the Gaussian target $p_\infty \propto e^{-\beta x^2 / 2}$ on $\mathbb{R}$, write out the Langevin SDE explicitly, solve it in closed form starting from $X_0 = x_0$, and compute the variance of $X_t$. Confirm that $\operatorname{Var}(X_t) \to 1/\beta$ as $t \to \infty$, matching the Gibbs marginal.

ExerciseAdvanced

Problem

Consider the "double-well" potential $U(x) = (x^2 - 1)^2 / 4$ on $\mathbb{R}$. Show that the LSI constant of the Gibbs measure $p_\infty \propto e^{-\beta U}$ degrades exponentially in $\beta$ as $\beta \to \infty$, and explain what this implies for the mixing time of Langevin dynamics on this target.


Next Topics

  • Score Matching: training a network on $\nabla \log p$, the same vector field that drives Langevin dynamics when $p = p_\infty$.
  • SGD as SDE: SGD viewed as a Langevin-like SDE where the gradient noise plays the role of $\sqrt{2/\beta}\,dB$.
  • Diffusion Models: generative models that sample by running a reverse Langevin-style SDE with learned drift.
  • Time Reversal of SDEs: the SDE-reversal result that turns forward Langevin noising into a backward sampler.
  • Stochastic Differential Equations: the parent framework Langevin specializes.

Last reviewed: April 18, 2026
