
Mathematical Infrastructure

Time Reversal of SDEs

Anderson 1982: any forward Itô SDE has an explicit time-reversed SDE whose drift is the forward drift corrected by the score term $\sigma\sigma^\top \nabla \log p_t$ (plus a divergence term when $\sigma$ is state-dependent). The single result that turns a forward noising process into a generative sampler and underlies every score-based diffusion model.


Why This Matters

Modern score-based diffusion models work by running an SDE backward in time. The forward SDE corrupts data into noise (typically a variance-preserving or variance-exploding noising schedule ending at a Gaussian); the backward SDE generates new samples by reversing this process, starting from Gaussian noise and integrating back to the data distribution. The fact that this is possible at all — and that the backward drift has a closed form involving the score of the forward marginal — is the content of Anderson's time-reversal theorem (1982).

Anderson's theorem is the single mathematical result that licenses every score-based generative model: DDPM, NCSN, DDIM, EDM, score-SDE, flow-matching variants, and most controlled-generation methods. It tells you that learning $\nabla \log p_t(x)$ for the forward noising marginal is sufficient to invert the noising and sample from the data distribution. The score-matching loss is what makes this score learnable; the time-reversal theorem is what makes the learned score useful.

Beyond generative modeling, time reversal sits behind detailed-balance arguments for non-reversible Langevin samplers, dual representations in stochastic control, and the path-measure identities that drive Schrödinger-bridge and Föllmer-process methods. It is one of the most consequential results in stochastic calculus for ML.

Mental Model

A forward SDE $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$ pushes a particle from time $0$ to time $T$. Its marginal density evolves by the Fokker–Planck equation and "spreads out" as time advances. To run the dynamics backward, starting at time $T$ from the spread-out marginal $p_T$ and recovering a sample from $p_0$, you need a different SDE that pushes particles in the opposite direction. Anderson's theorem gives you that SDE explicitly.

The backward SDE has the same diffusion $\sigma$ as the forward one. The backward drift is the forward drift, minus a correction involving the score $\nabla \log p_t(x)$:

$$\bar{b}(x, t) = b(x, t) - \sigma(x, t)\,\sigma(x, t)^\top\, \nabla \log p_t(x).$$

The term $\sigma\sigma^\top \nabla \log p_t$ is exactly what the Fokker–Planck equation needs subtracted to make the marginal $p_t$ flow in reverse. The score field is the only extra information required; everything else (drift, diffusion, time grid) is shared between the forward and backward processes.
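As a sanity check of this drift formula, here is a minimal Python sketch (an illustrative 1D example, not from the original): the OU process with $b(x) = -x$ and $\sigma = \sqrt{2}$ has stationary marginal $\mathcal{N}(0, 1)$, whose score is $-x$ in closed form, so the backward drift can be evaluated directly.

```python
# 1D illustration of Anderson's drift correction, assuming sigma is constant
# and the marginal is the known stationary density N(0, 1).
def forward_drift(x):
    return -x          # OU drift b(x) = -x

def score(x):
    return -x          # d/dx log p(x) for p = N(0, 1)

sigma_sq = 2.0         # sigma * sigma^T for sigma = sqrt(2)

def backward_drift(x):
    # Anderson's correction: b_bar = b - sigma sigma^T * score
    return forward_drift(x) - sigma_sq * score(x)

print(backward_drift(1.5))   # -1.5 - 2 * (-1.5) = 1.5
```

The sign flip relative to the forward drift is exactly the score correction at work; without it the backward drift would just be $-x$ again.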

Formal Statement

Definition

Reverse-Time SDE (Anderson 1982)

Let $X_t \in \mathbb{R}^d$ solve the forward Itô SDE $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$ on $[0, T]$ with marginal density $p_t(x)$ satisfying the Fokker–Planck equation. Then the reverse-time process $\bar{X}_t = X_{T-t}$ satisfies the SDE

$$d\bar{X}_t = \big[\,b(\bar{X}_t, T - t) - \nabla \cdot \big(\sigma\sigma^\top\big)(\bar{X}_t, T - t) - \sigma\sigma^\top(\bar{X}_t, T - t)\, \nabla \log p_{T-t}(\bar{X}_t)\,\big]\, (-dt) + \sigma(\bar{X}_t, T - t)\,d\tilde{B}_t,$$

where $\tilde{B}_t$ is a $\bar{X}$-adapted Brownian motion. In the common case where $\sigma$ does not depend on $x$, the divergence term $\nabla \cdot (\sigma\sigma^\top)$ vanishes and the formula simplifies to

$$d\bar{X}_t = \big[\,b(\bar{X}_t, T-t) - \sigma\sigma^\top(T - t)\, \nabla \log p_{T-t}(\bar{X}_t)\,\big]\,(-dt) + \sigma(T - t)\,d\tilde{B}_t.$$

Equivalently, in forward-time coordinates running from $T$ to $0$: $dX_t = \big[b - \sigma\sigma^\top\, \nabla \log p_t\big]\,dt + \sigma\,d\bar{B}_t$ (with $dt < 0$ and $\bar{B}_t$ a backward Brownian motion).

The score field $\nabla \log p_t(x)$ is the gradient of the log-density of the forward marginal at time $t$. It is the only place where information about the original data distribution enters the backward dynamics.

The Theorem

Theorem

Anderson's Time-Reversal Theorem

Statement

Under the assumptions above, the law of the reverse-time process $\bar{X}_t = X_{T-t}$ on $[0, T]$ matches the law of the SDE $d\bar{X}_t = \big[\,-b + \nabla\cdot(\sigma\sigma^\top) + \sigma\sigma^\top\, \nabla \log p_{T-t}\,\big](\bar{X}_t, T-t)\,dt + \sigma(\bar{X}_t, T-t)\,d\tilde{B}_t$ with initial distribution $\bar{X}_0 \sim p_T$. (In the reversed time variable the forward drift flips sign; this is the same equation as the $(-dt)$ form above.) The terminal distribution at time $T$ of the backward SDE is exactly $p_0$, the original initial distribution of the forward SDE.

Intuition

Both the forward and the reverse-time process have the same marginals $\{p_t\}_{t \in [0, T]}$; they just traverse them in opposite directions. The Fokker–Planck equation is a continuity equation for these marginals, and there is exactly one drift field that produces the marginal flow in reverse: it is the forward drift minus the "Stein gradient" $\sigma\sigma^\top \nabla \log p_t$. The diffusion $\sigma$ stays the same because time reversal flips the sign of the first-order drift term in the Fokker–Planck equation but leaves the second-order diffusion term unchanged.

Proof Sketch

Write the forward Fokker–Planck equation as a continuity equation $\partial_t p_t + \nabla \cdot J_t = 0$ with current $J_t = b\, p_t - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top p_t)$. Decompose $J_t = b_{\text{eff}}\, p_t$ for an effective drift $b_{\text{eff}} = b - \tfrac{1}{2}\sigma\sigma^\top\, \nabla \log p_t - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top)$. The reverse-time process must produce the same current with the opposite sign of the time direction, which (after symmetry-of-noise arguments via Girsanov) forces the backward drift, in the $(-dt)$ convention above, to be $b - \sigma\sigma^\top\, \nabla \log p_t - \nabla \cdot (\sigma \sigma^\top)$. Anderson's original 1982 paper does this directly via Itô's lemma on the time-reversed filtration; Haussmann and Pardoux (1986) provided the modern measure-theoretic proof.
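The sketch can be checked numerically in the Gaussian OU case, where $p_t$ is known in closed form. The snippet below is a finite-difference sanity check, not part of the proof; the parameters are illustrative, and in the reversed time variable $s = T - t$ the drift becomes $-b + \sigma\sigma^\top \nabla \log p_{T-s}$ (the sign of $b$ flips relative to the $(-dt)$ convention).

```python
import numpy as np

mu0, v0, T = 1.0, 0.25, 2.0   # X_0 ~ N(mu0, v0) for the forward OU SDE dX = -X dt + sqrt(2) dB

def m(t):  return np.exp(-t) * mu0                            # forward mean
def v(t):  return np.exp(-2 * t) * v0 + 1.0 - np.exp(-2 * t)  # forward variance

def p(x, t):   # forward marginal density (Gaussian)
    return np.exp(-(x - m(t)) ** 2 / (2 * v(t))) / np.sqrt(2 * np.pi * v(t))

def score(x, t):
    return -(x - m(t)) / v(t)

def drift_rev(x, s):
    # reversed-time drift: -b + sigma sigma^T * score, with b = -x, sigma^2 = 2
    return x + 2.0 * score(x, T - s)

def residual(x, s, h=1e-4):
    # ptilde_s(x) = p_{T-s}(x) should satisfy the forward Fokker-Planck equation
    #   d_s ptilde + d_x (drift_rev * ptilde) - (sigma^2 / 2) d_xx ptilde = 0
    pt   = lambda xx, ss: p(xx, T - ss)
    flux = lambda xx: drift_rev(xx, s) * pt(xx, s)
    d_s  = (pt(x, s + h) - pt(x, s - h)) / (2 * h)
    d_x  = (flux(x + h) - flux(x - h)) / (2 * h)
    d_xx = (pt(x + h, s) - 2 * pt(x, s) + pt(x - h, s)) / h ** 2
    return d_s + d_x - d_xx

print(abs(residual(0.7, 0.8)))   # ~ 0 up to finite-difference error
```

The residual vanishes (to finite-difference accuracy) at any $(x, s)$, which is exactly the statement that the reversed marginals solve the Fokker–Planck equation of the Anderson drift.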

Why It Matters

This is the result that made score-based diffusion possible. Before 2020, generative models used either invertible flows (which constrain architectures to be invertible) or GANs (which are unstable and have no likelihood). Score-based diffusion (Song et al., 2021) said: train a network to learn $\nabla \log p_t(x)$ for a known forward noising SDE, then plug the learned score into Anderson's reverse SDE and sample. Architecture constraints vanish, since any function approximator works for the score. The training objective is regression, not adversarial. The reverse SDE is exactly Anderson's formula; the only innovation was learning the score parametrically and treating the noising horizon (variance-exploding with $T \to \infty$, or variance-preserving with $T = 1$) as a design choice.

Failure Mode

The theorem requires $p_t$ to be smooth and strictly positive everywhere $\bar{X}_t$ might visit. For data distributions supported on a low-dimensional manifold in $\mathbb{R}^d$ (which is the realistic case for images), $p_0$ is not a density on $\mathbb{R}^d$; it is a measure concentrated on the manifold. The forward noising process smooths it into a strictly positive $p_t$ for $t > 0$, so the score $\nabla \log p_t$ is well-defined on the bulk. But the backward SDE evaluated near $t = 0$ encounters a score field that diverges or is poorly defined near the manifold. This is why diffusion samplers stop the reverse process at small $t > 0$ rather than running all the way to $t = 0$, and why singular-perturbation behavior near $t = 0$ is the hardest part of diffusion sampler engineering (Karras et al. 2022).

Score-Based Diffusion: The Canonical Use

The forward "variance-preserving" noising SDE is $dX_t = -\tfrac{1}{2}\beta(t)\, X_t\,dt + \sqrt{\beta(t)}\,dB_t$ for a schedule $\beta(t) > 0$. With $X_0 \sim p_{\text{data}}$ and large $T$, the marginal $p_T$ is approximately standard Gaussian.
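A quick numerical illustration (an assumed linear schedule, simulated with Euler–Maruyama; not from the original): push a deliberately non-Gaussian $X_0$ through the forward VP SDE and observe that $p_T$ is close to $\mathcal{N}(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def beta(t):                  # illustrative linear schedule, an assumption
    return 0.1 + 19.9 * t

# Euler-Maruyama on the forward VP SDE dX = -0.5 beta X dt + sqrt(beta) dB,
# starting from a bimodal (clearly non-Gaussian) "data" distribution.
n, T, steps = 100_000, 1.0, 1000
dt = T / steps
x = rng.choice([-2.0, 2.0], size=n)
t = 0.0
for _ in range(steps):
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.normal(size=n)
    t += dt

print(x.mean(), x.std())      # close to 0 and 1: p_T is near N(0, 1)
```

The bimodal structure of $X_0$ is destroyed by the noising; only the score field remembers it.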

By Anderson, the corresponding backward SDE is

$$dX_t = \Big[\,-\tfrac{1}{2}\beta(t) X_t - \beta(t)\, \nabla \log p_t(X_t)\, \Big]\,dt + \sqrt{\beta(t)}\,d\bar{B}_t \quad (\text{integrate from } T \text{ to } 0).$$

A score network $s_\theta(x, t) \approx \nabla \log p_t(x)$ is trained with score matching. At sampling time, replace $\nabla \log p_t$ with $s_\theta$ in the backward SDE and integrate numerically (Euler–Maruyama, predictor–corrector, or higher-order solvers). Each integration step pushes a Gaussian sample closer to the data distribution, and at $t \approx 0$ the sample is approximately drawn from $p_{\text{data}}$.
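Here is a minimal Euler–Maruyama sketch of this reverse-time loop. It is illustrative only: the schedule is an assumed linear one, and the "score network" is replaced by the analytic score for standard-normal data (for which the VP marginals remain exactly $\mathcal{N}(0, 1)$), so the loop runs end-to-end without any training.

```python
import numpy as np

rng = np.random.default_rng(1)

def beta(t):                  # illustrative linear schedule, an assumption
    return 0.1 + 19.9 * t

# Stand-in for a trained score network s_theta(x, t): for data ~ N(0, 1) the
# VP marginals are all N(0, 1), so the true score is simply -x.
def score(x, t):
    return -x

T, steps, n = 1.0, 1000, 50_000
dt = T / steps
x = rng.normal(size=n)        # start from the N(0, 1) prior at t = T
t = T
for _ in range(steps):
    # backward drift: -0.5 beta x - beta * score, stepped with dt < 0
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
    x = x + drift * (-dt) + np.sqrt(beta(t) * dt) * rng.normal(size=n)
    t -= dt

print(x.mean(), x.std())      # for this toy score, approximately 0 and 1
```

Swapping in a learned `score(x, t)` and a vector-valued state turns this sketch into the standard score-SDE sampler loop.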

The probability-flow ODE (Song et al. 2021) is the deterministic dual of the same backward dynamics; see probability-flow-ode for the closed-form connection.

Worked Example: Time-Reversed OU

Take the forward OU SDE $dX_t = -X_t\,dt + \sqrt{2}\,dB_t$ on $[0, T]$ with $X_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The marginal is $X_t \sim \mathcal{N}\big(e^{-t} \mu_0,\; e^{-2t} \sigma_0^2 + (1 - e^{-2t})\big)$. The score is

$$\nabla \log p_t(x) = -\frac{x - e^{-t} \mu_0}{e^{-2t} \sigma_0^2 + (1 - e^{-2t})}.$$

Anderson's reverse SDE (with $\sigma = \sqrt{2}$ constant) is $dX_t = [\,-X_t - 2 \nabla \log p_t(X_t)\,]\,dt + \sqrt{2}\,d\bar{B}_t$, integrated from $T$ to $0$. Substituting the score and simplifying gives an explicit Gauss–Markov reverse process whose terminal distribution at $t = 0$ is exactly $\mathcal{N}(\mu_0, \sigma_0^2)$. This is the cleanest verification of Anderson's theorem: forward OU and its time reversal are both linear Gaussian processes, and you can check their marginals match analytically.
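The analytic check can also be run as a Monte Carlo simulation. The sketch below (with assumed parameters $\mu_0 = 1$, $\sigma_0^2 = 0.25$, $T = 2$) integrates the reverse SDE with the analytic score via Euler–Maruyama and compares the terminal statistics to $\mathcal{N}(\mu_0, \sigma_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, s0sq, T = 1.0, 0.25, 2.0   # X_0 ~ N(1, 0.25), horizon T = 2

def m(t):  return np.exp(-t) * mu0                            # forward mean
def v(t):  return np.exp(-2 * t) * s0sq + 1.0 - np.exp(-2 * t)  # forward variance

def score(x, t):                # analytic score of the OU marginal
    return -(x - m(t)) / v(t)

# Reverse-time Euler-Maruyama from t = T down to t = 0:
# drift = -x - 2 * score (Anderson, dt < 0), diffusion sqrt(2).
n, steps = 200_000, 2000
dt = T / steps
x = rng.normal(m(T), np.sqrt(v(T)), size=n)   # start from the forward marginal p_T
t = T
for _ in range(steps):
    drift = -x - 2.0 * score(x, t)
    x = x + drift * (-dt) + np.sqrt(2.0 * dt) * rng.normal(size=n)
    t -= dt

print(x.mean(), x.std())   # should be close to mu0 = 1.0 and sigma_0 = 0.5
```

Up to discretization and Monte Carlo error, the terminal sample recovers the original $\mathcal{N}(\mu_0, \sigma_0^2)$, matching the analytic statement above.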

Common Confusions

Watch Out

The reverse SDE is NOT the forward SDE with negated drift

A naive guess: to reverse an SDE, just flip the sign of $b$. This is wrong: it gives the wrong stationary distribution and the wrong intermediate marginals. The correct backward drift is $b - \sigma\sigma^\top \nabla \log p_t$, with the score correction. The score correction is the nontrivial content of Anderson's theorem; without it, the theorem would reduce to a trivial sign flip.
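The failure is easy to see numerically. In this toy OU check (assumed parameters, not from the original), reversing the forward OU of the worked example with drift $-b = +x$ alone, dropping the score term, turns a contraction into an expansion: the variance explodes instead of returning to $\sigma_0^2 = 0.25$.

```python
import numpy as np

rng = np.random.default_rng(3)

mu0, s0sq, T = 1.0, 0.25, 2.0           # forward OU: dX = -X dt + sqrt(2) dB
m_T = np.exp(-T) * mu0                  # mean of the forward marginal p_T
v_T = np.exp(-2 * T) * s0sq + 1.0 - np.exp(-2 * T)   # variance of p_T

n, steps = 100_000, 2000
dt = T / steps
x = rng.normal(m_T, np.sqrt(v_T), size=n)   # start from p_T, as a reverse sampler would
for _ in range(steps):
    # naive "reversal": drift -b = +x, same noise, NO score correction
    x = x + x * dt + np.sqrt(2.0 * dt) * rng.normal(size=n)

print(x.std())   # blows up far past the correct sigma_0 = 0.5
```

With the score term restored (as in the worked example above), the same loop lands back on $\mathcal{N}(\mu_0, \sigma_0^2)$.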

Watch Out

The backward Brownian motion is a different Brownian motion

The forward and backward SDEs use different Brownian motions: $B_t$ is adapted to the forward filtration, while $\tilde{B}_t$ (or $\bar{B}_t$) is adapted to the backward filtration. They are not the same process run in reverse; the driving noise cannot be shared. In particular, reversing a forward trajectory does not reproduce the path you would get by solving the backward SDE with an independent Brownian motion; the two agree only in distribution. This matters for any analysis that tries to couple forward and backward trajectories.

Watch Out

Time reversal works for any forward SDE; the score is the only data-dependent piece

Some papers describe diffusion as if it required a special "noise schedule" to make reversal possible. It does not. Anderson's theorem applies to any forward SDE with smooth positive marginals. The choice of forward SDE (variance-preserving, variance-exploding, sub-VP, EDM-style) only affects how easy the score is to learn and how well-behaved the reverse sampler is, not whether reversal is mathematically valid.

Exercises

ExerciseCore

Problem

Verify Anderson's formula in the simplest case: standard Brownian motion $dX_t = dB_t$ on $[0, T]$ with $X_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Write down $p_t(x)$, compute $\nabla \log p_t(x)$, write the backward SDE, and confirm that the backward dynamics produce the right marginals.

ExerciseAdvanced

Problem

Prove the divergence-of-current calculation that underlies Anderson's formula: starting from the forward Fokker–Planck equation $\partial_t p + \nabla \cdot J = 0$ with $J = b\, p - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top p)$, show that the time-reversed density $\tilde{p}_s(x) = p_{T-s}(x)$ satisfies a forward Fokker–Planck equation with drift $\bar{b}(x, s) = -b + \sigma\sigma^\top \nabla \log p_{T-s} + \nabla \cdot (\sigma \sigma^\top)$ (coefficients evaluated at time $T - s$) and the same diffusion $\sigma$.


Last reviewed: April 18, 2026
