
Mathematical Infrastructure

Time Reversal of SDEs

Anderson 1982: any forward Itô SDE has an explicit time-reversed SDE whose drift is the forward drift corrected by the score term $\sigma\sigma^\top \nabla \log p_t$ (plus a divergence term when $\sigma$ is state-dependent). The single result that turns a forward noising process into a generative sampler and underlies every score-based diffusion model.


Why This Matters

Modern score-based diffusion models work by running an SDE backward in time. The forward SDE corrupts data into noise (typically a variance-preserving or variance-exploding noising schedule ending at a Gaussian); the backward SDE generates new samples by reversing this process, starting from Gaussian noise and integrating back to the data distribution. The fact that this is possible at all — and that the backward drift has a closed form involving the score of the forward marginal — is the content of Anderson's time-reversal theorem (1982).

Anderson's theorem is the single mathematical result that licenses every score-based generative model: DDPM, NCSN, DDIM, EDM, score-SDE, flow-matching variants, and most controlled-generation methods. It tells you that learning $\nabla \log p_t(x)$ for the forward noising marginal is sufficient to invert the noising and sample from the data distribution. The score-matching loss is what makes this score learnable; the time-reversal theorem is what makes the learned score useful.

Beyond generative modeling, time reversal sits behind detailed-balance arguments for non-reversible Langevin samplers, dual representations in stochastic control, and the path-measure identities that drive Schrödinger-bridge and Föllmer-process methods. It is one of the most consequential results in stochastic calculus for ML.

Mental Model

A forward SDE $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$ pushes a particle from time $0$ to time $T$. Its marginal density evolves by the Fokker–Planck equation and "spreads out" as time advances. To run the dynamics backward, starting at time $T$ from the spread-out marginal $p_T$ and recovering a sample from $p_0$, you need a different SDE that pushes particles in the opposite direction. Anderson's theorem gives you that SDE explicitly.

The backward SDE has the same diffusion $\sigma$ as the forward one. The backward drift is the forward drift, minus a correction involving the score $\nabla \log p_t(x)$:

$$\bar{b}(x, t) = b(x, t) - \sigma(x, t)\,\sigma(x, t)^\top\, \nabla \log p_t(x).$$

The term $\sigma\sigma^\top \nabla \log p_t$ is exactly what the Fokker–Planck equation needs subtracted to make the marginal $p_t$ flow in reverse. The score field is the only extra information required; everything else (drift, diffusion, time grid) is shared between the forward and backward processes.
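As a sanity check of this drift formula, here is a minimal Python sketch (an illustrative 1D example, not from the original): the OU process with $b(x) = -x$ and $\sigma = \sqrt{2}$ has stationary marginal $\mathcal{N}(0, 1)$, whose score is $-x$ in closed form, so the backward drift can be evaluated directly.

```python
# 1D illustration of Anderson's drift correction, assuming sigma is constant
# and the marginal is the known stationary density N(0, 1).
def forward_drift(x):
    return -x          # OU drift b(x) = -x

def score(x):
    return -x          # d/dx log p(x) for p = N(0, 1)

sigma_sq = 2.0         # sigma * sigma^T for sigma = sqrt(2)

def backward_drift(x):
    # Anderson's correction: b_bar = b - sigma sigma^T * score
    return forward_drift(x) - sigma_sq * score(x)

print(backward_drift(1.5))   # -1.5 - 2 * (-1.5) = 1.5
```

The sign flip relative to the forward drift is exactly the score correction at work; without it the backward drift would just be $-x$ again.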

Formal Statement

Definition

Reverse-Time SDE (Anderson 1982)

Let $X_t \in \mathbb{R}^d$ solve the forward Itô SDE $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$ on $[0, T]$ with marginal density $p_t(x)$ satisfying the Fokker–Planck equation. Then the reverse-time process $\bar{X}_t = X_{T-t}$ satisfies the SDE

$$d\bar{X}_t = \big[\,b(\bar{X}_t, T - t) - \nabla \cdot \big(\sigma\sigma^\top\big)(\bar{X}_t, T - t) - \sigma\sigma^\top(\bar{X}_t, T - t)\, \nabla \log p_{T-t}(\bar{X}_t)\,\big]\, (-dt) + \sigma(\bar{X}_t, T - t)\,d\tilde{B}_t,$$

where $\tilde{B}_t$ is a $\bar{X}$-adapted Brownian motion. In the common case where $\sigma$ does not depend on $x$, the divergence term $\nabla \cdot (\sigma\sigma^\top)$ vanishes and the formula simplifies to

$$d\bar{X}_t = \big[\,b(\bar{X}_t, T-t) - \sigma\sigma^\top(T - t)\, \nabla \log p_{T-t}(\bar{X}_t)\,\big]\,(-dt) + \sigma(T - t)\,d\tilde{B}_t.$$

Equivalently, in forward-time coordinates running from $T$ to $0$: $dX_t = \big[b - \sigma\sigma^\top\, \nabla \log p_t\big]\,dt + \sigma\,d\bar{B}_t$ (with $dt < 0$ and $\bar{B}_t$ a backward Brownian motion).

The score field $\nabla \log p_t(x)$ is the gradient of the log-density of the forward marginal at time $t$. It is the only place where information about the original data distribution enters the backward dynamics.

The Theorem

Theorem

Anderson's Time-Reversal Theorem

Statement

Under the assumptions above, the law of the reverse-time process $\bar{X}_t = X_{T-t}$ on $[0, T]$ matches the law of the SDE $d\bar{X}_t = \big[\,-b + \nabla\cdot(\sigma\sigma^\top) + \sigma\sigma^\top\, \nabla \log p_{T-t}\,\big](\bar{X}_t, T-t)\,dt + \sigma(\bar{X}_t, T-t)\,d\tilde{B}_t$ with initial distribution $\bar{X}_0 \sim p_T$. (In the reversed time variable the forward drift flips sign; this is the same equation as the $(-dt)$ form above.) The terminal distribution at time $T$ of the backward SDE is exactly $p_0$, the original initial distribution of the forward SDE.

Intuition

Both the forward and the reverse-time process have the same marginals $\{p_t\}_{t \in [0, T]}$; they just traverse them in opposite directions. The Fokker–Planck equation is a continuity equation for these marginals, and there is exactly one drift field that produces the marginal flow in reverse: it is the forward drift minus the "Stein gradient" $\sigma\sigma^\top \nabla \log p_t$. The diffusion $\sigma$ stays the same because time reversal flips the sign of the first-order drift term in the Fokker–Planck equation but leaves the second-order diffusion term unchanged.

Proof Sketch

Write the forward Fokker–Planck equation as a continuity equation $\partial_t p_t + \nabla \cdot J_t = 0$ with current $J_t = b\, p_t - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top p_t)$. Decompose $J_t = b_{\text{eff}}\, p_t$ for an effective drift $b_{\text{eff}} = b - \tfrac{1}{2}\sigma\sigma^\top\, \nabla \log p_t - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top)$. The reverse-time process must produce the same current with the opposite sign of the time direction, which (after symmetry-of-noise arguments via Girsanov) forces the backward drift, in the $(-dt)$ convention above, to be $b - \sigma\sigma^\top\, \nabla \log p_t - \nabla \cdot (\sigma \sigma^\top)$. Anderson's original 1982 paper does this directly via Itô's lemma on the time-reversed filtration; Haussmann and Pardoux (1986) provided the modern measure-theoretic proof.
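The sketch can be checked numerically in the Gaussian OU case, where $p_t$ is known in closed form. The snippet below is a finite-difference sanity check, not part of the proof; the parameters are illustrative, and in the reversed time variable $s = T - t$ the drift becomes $-b + \sigma\sigma^\top \nabla \log p_{T-s}$ (the sign of $b$ flips relative to the $(-dt)$ convention).

```python
import numpy as np

mu0, v0, T = 1.0, 0.25, 2.0   # X_0 ~ N(mu0, v0) for the forward OU SDE dX = -X dt + sqrt(2) dB

def m(t):  return np.exp(-t) * mu0                            # forward mean
def v(t):  return np.exp(-2 * t) * v0 + 1.0 - np.exp(-2 * t)  # forward variance

def p(x, t):   # forward marginal density (Gaussian)
    return np.exp(-(x - m(t)) ** 2 / (2 * v(t))) / np.sqrt(2 * np.pi * v(t))

def score(x, t):
    return -(x - m(t)) / v(t)

def drift_rev(x, s):
    # reversed-time drift: -b + sigma sigma^T * score, with b = -x, sigma^2 = 2
    return x + 2.0 * score(x, T - s)

def residual(x, s, h=1e-4):
    # ptilde_s(x) = p_{T-s}(x) should satisfy the forward Fokker-Planck equation
    #   d_s ptilde + d_x (drift_rev * ptilde) - (sigma^2 / 2) d_xx ptilde = 0
    pt   = lambda xx, ss: p(xx, T - ss)
    flux = lambda xx: drift_rev(xx, s) * pt(xx, s)
    d_s  = (pt(x, s + h) - pt(x, s - h)) / (2 * h)
    d_x  = (flux(x + h) - flux(x - h)) / (2 * h)
    d_xx = (pt(x + h, s) - 2 * pt(x, s) + pt(x - h, s)) / h ** 2
    return d_s + d_x - d_xx

print(abs(residual(0.7, 0.8)))   # ~ 0 up to finite-difference error
```

The residual vanishes (to finite-difference accuracy) at any $(x, s)$, which is exactly the statement that the reversed marginals solve the Fokker–Planck equation of the Anderson drift.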

Why It Matters

This is the result that made score-based diffusion possible. Before 2020, generative models used either invertible flows (which constrain architectures to be invertible) or GANs (which are unstable and have no likelihood). Score-based diffusion (Song et al., 2021) said: train a network to learn $\nabla \log p_t(x)$ for a known forward noising SDE, then plug the learned score into Anderson's reverse SDE and sample. Architecture constraints vanish, since any function approximator works for the score. The training objective is regression, not adversarial. The reverse SDE is exactly Anderson's formula; the only innovation was learning the score parametrically and treating the noising horizon (variance-exploding with $T \to \infty$, or variance-preserving with $T = 1$) as a design choice.

Failure Mode

The theorem requires $p_t$ to be smooth and strictly positive everywhere $\bar{X}_t$ might visit. For data distributions supported on a low-dimensional manifold in $\mathbb{R}^d$ (which is the realistic case for images), $p_0$ is not a density on $\mathbb{R}^d$; it is a measure concentrated on the manifold. The forward noising process smooths it into a strictly positive $p_t$ for $t > 0$, so the score $\nabla \log p_t$ is well-defined on the bulk. But the backward SDE evaluated near $t = 0$ encounters a score field that diverges or is poorly defined near the manifold. This is why diffusion samplers stop the reverse process at small $t > 0$ rather than running all the way to $t = 0$, and why singular-perturbation behavior near $t = 0$ is the hardest part of diffusion sampler engineering (Karras et al. 2022).

Score-Based Diffusion: The Canonical Use

The forward "variance-preserving" noising SDE is $dX_t = -\tfrac{1}{2}\beta(t)\, X_t\,dt + \sqrt{\beta(t)}\,dB_t$ for a schedule $\beta(t) > 0$. With $X_0 \sim p_{\text{data}}$ and large $T$, the marginal $p_T$ is approximately standard Gaussian.
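A quick numerical illustration (an assumed linear schedule, simulated with Euler–Maruyama; not from the original): push a deliberately non-Gaussian $X_0$ through the forward VP SDE and observe that $p_T$ is close to $\mathcal{N}(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def beta(t):                  # illustrative linear schedule, an assumption
    return 0.1 + 19.9 * t

# Euler-Maruyama on the forward VP SDE dX = -0.5 beta X dt + sqrt(beta) dB,
# starting from a bimodal (clearly non-Gaussian) "data" distribution.
n, T, steps = 100_000, 1.0, 1000
dt = T / steps
x = rng.choice([-2.0, 2.0], size=n)
t = 0.0
for _ in range(steps):
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * rng.normal(size=n)
    t += dt

print(x.mean(), x.std())      # close to 0 and 1: p_T is near N(0, 1)
```

The bimodal structure of $X_0$ is destroyed by the noising; only the score field remembers it.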

By Anderson, the corresponding backward SDE is

$$dX_t = \Big[\,-\tfrac{1}{2}\beta(t) X_t - \beta(t)\, \nabla \log p_t(X_t)\, \Big]\,dt + \sqrt{\beta(t)}\,d\bar{B}_t \quad (\text{integrate from } T \text{ to } 0).$$

A score network $s_\theta(x, t) \approx \nabla \log p_t(x)$ is trained with score matching. At sampling time, replace $\nabla \log p_t$ with $s_\theta$ in the backward SDE and integrate numerically (Euler–Maruyama, predictor–corrector, or higher-order solvers). Each integration step pushes a Gaussian sample closer to the data distribution, and at $t \approx 0$ the sample is approximately drawn from $p_{\text{data}}$.
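Here is a minimal Euler–Maruyama sketch of this reverse-time loop. It is illustrative only: the schedule is an assumed linear one, and the "score network" is replaced by the analytic score for standard-normal data (for which the VP marginals remain exactly $\mathcal{N}(0, 1)$), so the loop runs end-to-end without any training.

```python
import numpy as np

rng = np.random.default_rng(1)

def beta(t):                  # illustrative linear schedule, an assumption
    return 0.1 + 19.9 * t

# Stand-in for a trained score network s_theta(x, t): for data ~ N(0, 1) the
# VP marginals are all N(0, 1), so the true score is simply -x.
def score(x, t):
    return -x

T, steps, n = 1.0, 1000, 50_000
dt = T / steps
x = rng.normal(size=n)        # start from the N(0, 1) prior at t = T
t = T
for _ in range(steps):
    # backward drift: -0.5 beta x - beta * score, stepped with dt < 0
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
    x = x + drift * (-dt) + np.sqrt(beta(t) * dt) * rng.normal(size=n)
    t -= dt

print(x.mean(), x.std())      # for this toy score, approximately 0 and 1
```

Swapping in a learned `score(x, t)` and a vector-valued state turns this sketch into the standard score-SDE sampler loop.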

The probability-flow ODE (Song et al. 2021) is the deterministic dual of the same backward dynamics; see probability-flow-ode for the closed-form connection.

Worked Example: Time-Reversed OU

Take the forward OU SDE $dX_t = -X_t\,dt + \sqrt{2}\,dB_t$ on $[0, T]$ with $X_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The marginal is $X_t \sim \mathcal{N}\big(e^{-t} \mu_0,\; e^{-2t} \sigma_0^2 + (1 - e^{-2t})\big)$. The score is

$$\nabla \log p_t(x) = -\frac{x - e^{-t} \mu_0}{e^{-2t} \sigma_0^2 + (1 - e^{-2t})}.$$

Anderson's reverse SDE (with $\sigma = \sqrt{2}$ constant) is $dX_t = [\,-X_t - 2 \nabla \log p_t(X_t)\,]\,dt + \sqrt{2}\,d\bar{B}_t$, integrated from $T$ to $0$. Substituting the score and simplifying gives an explicit Gauss–Markov reverse process whose terminal distribution at $t = 0$ is exactly $\mathcal{N}(\mu_0, \sigma_0^2)$. This is the cleanest verification of Anderson's theorem: forward OU and its time reversal are both linear Gaussian processes, and you can check their marginals match analytically.
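The analytic check can also be run as a Monte Carlo simulation. The sketch below (with assumed parameters $\mu_0 = 1$, $\sigma_0^2 = 0.25$, $T = 2$) integrates the reverse SDE with the analytic score via Euler–Maruyama and compares the terminal statistics to $\mathcal{N}(\mu_0, \sigma_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, s0sq, T = 1.0, 0.25, 2.0   # X_0 ~ N(1, 0.25), horizon T = 2

def m(t):  return np.exp(-t) * mu0                            # forward mean
def v(t):  return np.exp(-2 * t) * s0sq + 1.0 - np.exp(-2 * t)  # forward variance

def score(x, t):                # analytic score of the OU marginal
    return -(x - m(t)) / v(t)

# Reverse-time Euler-Maruyama from t = T down to t = 0:
# drift = -x - 2 * score (Anderson, dt < 0), diffusion sqrt(2).
n, steps = 200_000, 2000
dt = T / steps
x = rng.normal(m(T), np.sqrt(v(T)), size=n)   # start from the forward marginal p_T
t = T
for _ in range(steps):
    drift = -x - 2.0 * score(x, t)
    x = x + drift * (-dt) + np.sqrt(2.0 * dt) * rng.normal(size=n)
    t -= dt

print(x.mean(), x.std())   # should be close to mu0 = 1.0 and sigma_0 = 0.5
```

Up to discretization and Monte Carlo error, the terminal sample recovers the original $\mathcal{N}(\mu_0, \sigma_0^2)$, matching the analytic statement above.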

Common Confusions

Watch Out

The reverse SDE is NOT the forward SDE with negated drift

A naive guess: to reverse an SDE, just flip the sign of $b$. This is wrong: it gives the wrong stationary distribution and the wrong intermediate marginals. The correct backward drift is $b - \sigma\sigma^\top \nabla \log p_t$, with the score correction. The score correction is the nontrivial content of Anderson's theorem; without it, the theorem would reduce to a trivial sign flip.
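The failure is easy to see numerically. In this toy OU check (assumed parameters, not from the original), reversing the forward OU of the worked example with drift $-b = +x$ alone, dropping the score term, turns a contraction into an expansion: the variance explodes instead of returning to $\sigma_0^2 = 0.25$.

```python
import numpy as np

rng = np.random.default_rng(3)

mu0, s0sq, T = 1.0, 0.25, 2.0           # forward OU: dX = -X dt + sqrt(2) dB
m_T = np.exp(-T) * mu0                  # mean of the forward marginal p_T
v_T = np.exp(-2 * T) * s0sq + 1.0 - np.exp(-2 * T)   # variance of p_T

n, steps = 100_000, 2000
dt = T / steps
x = rng.normal(m_T, np.sqrt(v_T), size=n)   # start from p_T, as a reverse sampler would
for _ in range(steps):
    # naive "reversal": drift -b = +x, same noise, NO score correction
    x = x + x * dt + np.sqrt(2.0 * dt) * rng.normal(size=n)

print(x.std())   # blows up far past the correct sigma_0 = 0.5
```

With the score term restored (as in the worked example above), the same loop lands back on $\mathcal{N}(\mu_0, \sigma_0^2)$.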

Watch Out

The backward Brownian motion is a different Brownian motion

The forward and backward SDEs use different Brownian motions: $B_t$ is adapted to the forward filtration, while $\tilde{B}_t$ (or $\bar{B}_t$) is adapted to the backward filtration. They are not the same process run in reverse; the driving noise cannot be shared. In particular, reversing a forward trajectory does not reproduce the path you would get by solving the backward SDE with an independent Brownian motion; the two agree only in distribution. This matters for any analysis that tries to couple forward and backward trajectories.

Watch Out

Time reversal works for any forward SDE; the score is the only data-dependent piece

Some papers describe diffusion as if it required a special "noise schedule" to make reversal possible. It does not. Anderson's theorem applies to any forward SDE with smooth positive marginals. The choice of forward SDE (variance-preserving, variance-exploding, sub-VP, EDM-style) only affects how easy the score is to learn and how well-behaved the reverse sampler is, not whether reversal is mathematically valid.

Exercises

ExerciseCore

Problem

Verify Anderson's formula in the simplest case: standard Brownian motion $dX_t = dB_t$ on $[0, T]$ with $X_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Write down $p_t(x)$, compute $\nabla \log p_t(x)$, write the backward SDE, and confirm that the backward dynamics produce the right marginals.

ExerciseAdvanced

Problem

Prove the divergence-of-current calculation that underlies Anderson's formula: starting from the forward Fokker–Planck equation $\partial_t p + \nabla \cdot J = 0$ with $J = b\, p - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top p)$, show that the time-reversed density $\tilde{p}_s(x) = p_{T-s}(x)$ satisfies a forward Fokker–Planck equation with drift $\bar{b}(x, s) = -b + \sigma\sigma^\top \nabla \log p_{T-s} + \nabla \cdot (\sigma \sigma^\top)$ (coefficients evaluated at time $T - s$) and the same diffusion $\sigma$.


Last reviewed: April 18, 2026
