Mathematical Infrastructure
Time Reversal of SDEs
Anderson 1982: any forward Itô SDE has an explicit time-reversed SDE whose drift is the original drift minus a correction term built from σσ^⊤ ∇ log p_t. This single result turns a forward noising process into a generative sampler and underlies every score-based diffusion model.
Why This Matters
Modern score-based diffusion models work by running an SDE backward in time. The forward SDE corrupts data into noise (typically a variance-preserving or variance-exploding noising schedule ending at a Gaussian); the backward SDE generates new samples by reversing this process, starting from Gaussian noise and integrating back to the data distribution. The fact that this is possible at all — and that the backward drift has a closed form involving the score of the forward marginal — is the content of Anderson's time-reversal theorem (1982).
Anderson's theorem is the single mathematical result that licenses every score-based generative model: DDPM, NCSN, DDIM, EDM, score-SDE, flow matching variants, and most controlled-generation methods. It tells you that learning the score ∇ log p_t of the forward noising marginals is sufficient to invert the noising and sample from the data distribution. The score-matching loss makes the score learnable; the time-reversal theorem makes the learned score useful.
Beyond generative modeling, time reversal sits behind detailed-balance arguments for non-reversible Langevin samplers, dual representations in stochastic control, and the path-measure identities that drive Schrödinger-bridge and Föllmer-process methods. It is one of the most consequential results in stochastic calculus for ML.
Mental Model
A forward SDE pushes a particle from time 0 to time T. Its marginal density p_t evolves by the Fokker–Planck equation and "spreads out" as time advances. To run the dynamics backward, starting at time T distributed according to the spread-out marginal p_T and recovering a sample from p_0, you need a different SDE that pushes particles in the opposite direction. Anderson's theorem gives you that SDE explicitly.
The backward SDE has the same diffusion coefficient as the forward one. The backward drift is the forward drift, minus a correction involving the score ∇ log p_t(x): for state-independent σ,

f̄(x, t) = f(x, t) − σσ^⊤(t) ∇ log p_t(x).

The correction σσ^⊤ ∇ log p_t is exactly what the Fokker–Planck equation needs subtracted from the drift to make the marginal flow run in reverse. The score field ∇ log p_t is the only extra information required; everything else (drift, diffusion, time grid) is shared between the forward and backward processes.
Formal Statement
Reverse-Time SDE (Anderson 1982)
Let (X_t)_{t∈[0,T]} solve the forward Itô SDE

dX_t = f(X_t, t) dt + σ(X_t, t) dW_t

on ℝ^d, with marginal density p_t satisfying the Fokker–Planck equation. Write a = σσ^⊤. Then the reverse-time process X̄_s = X_{T−s} satisfies the SDE

dX̄_s = [−f(X̄_s, T−s) + ∇·a(X̄_s, T−s) + a(X̄_s, T−s) ∇ log p_{T−s}(X̄_s)] ds + σ(X̄_s, T−s) dW̄_s,

where W̄_s is a Brownian motion adapted to the backward filtration. In the common case where σ = σ(t) does not depend on x, the divergence term ∇·a vanishes and the formula simplifies to

dX̄_s = [−f(X̄_s, T−s) + σσ^⊤(T−s) ∇ log p_{T−s}(X̄_s)] ds + σ(T−s) dW̄_s.

Equivalently, in forward time-coordinates running from t = T down to t = 0:

dX_t = [f(X_t, t) − σσ^⊤(t) ∇ log p_t(X_t)] dt + σ(t) dW̄_t

(with dt a negative time increment and W̄_t a backward Brownian motion).
The score field s(x, t) = ∇ log p_t(x) is the gradient of the log-density of the forward marginal at time t. It is the only place where information about the original data distribution enters the backward dynamics.
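For state-independent σ, the backward drift of the simplified formula above can be written as a small helper. A minimal NumPy sketch (the function and variable names are illustrative, not from any library), checked on a 1-D Ornstein–Uhlenbeck process at stationarity, where the exact score is ∇ log p(x) = −x:

```python
import numpy as np

def reverse_drift(f, sigma, score):
    """Backward drift f(x, t) - sigma(t) sigma(t)^T grad log p_t(x)
    for a forward SDE with state-independent diffusion sigma(t).
    Forward-time convention: integrate with a negative dt from T to 0."""
    def f_bar(x, t):
        a = sigma(t) @ sigma(t).T          # a = sigma sigma^T
        return f(x, t) - a @ score(x, t)
    return f_bar

# Sanity check: dX = -X dt + sqrt(2) dW has stationary marginal N(0, 1),
# so the stationary score is grad log p(x) = -x.
f = lambda x, t: -x
sigma = lambda t: np.sqrt(2.0) * np.eye(1)
score = lambda x, t: -x
f_bar = reverse_drift(f, sigma, score)

x = np.array([1.5])
# backward drift: -x - 2 * (-x) = +x
print(f_bar(x, 0.0))                       # prints [1.5]
```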
The Theorem
Anderson's Time-Reversal Theorem
Statement
Under the assumptions above, the law of the reverse-time process (X_{T−s})_{s∈[0,T]} matches the law of the backward SDE with initial distribution p_T. The terminal distribution at s = T of the backward SDE is exactly p_0, the original initial distribution of the forward SDE.
Intuition
Both the forward process and the reverse-time process have the same marginals p_t; they just traverse them in opposite directions. The Fokker–Planck equation is a continuity equation for these marginals, and there is exactly one drift field that produces the marginal flow in reverse: the forward drift minus the correction σσ^⊤ ∇ log p_t. The diffusion stays the same because time reversal leaves the second-order diffusion term of the Fokker–Planck equation unchanged while flipping the sign of the first-order drift term.
Proof Sketch
Write the forward Fokker–Planck equation as a continuity equation ∂_t p_t = −∇·J_t with current J_t = f p_t − (1/2) ∇·(a p_t), where a = σσ^⊤. Decompose J_t = v_t p_t for an effective drift v_t = f − (1/2)(∇·(a p_t))/p_t. The reverse-time process must produce the same current with the opposite sign of the time direction, which (after symmetry-of-noise arguments via Girsanov) forces the backward drift to be −f + ∇·a + a ∇ log p_t in reversed time. Anderson's original 1982 paper does this directly via Itô's lemma on the time-reversed filtration; Haussmann and Pardoux (1986) gave the modern measure-theoretic proof.
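The divergence-of-current calculation can be checked symbolically in the simplest one-dimensional case, f = 0 and a = 1 (Brownian motion started at a unit Gaussian, so p_t = N(0, 1 + t)). A sketch using SymPy, with all symbol names illustrative:

```python
import sympy as sp

x, s = sp.symbols('x s', real=True)
t = sp.Symbol('t', positive=True)
T = sp.Symbol('T', positive=True)

# Forward marginal of Brownian motion started at N(0, 1): p_t = N(0, 1 + t),
# which solves the forward Fokker-Planck (heat) equation d_t p = (1/2) d_xx p.
p = sp.exp(-x**2 / (2 * (1 + t))) / sp.sqrt(2 * sp.pi * (1 + t))
assert sp.simplify(sp.diff(p, t) - sp.Rational(1, 2) * sp.diff(p, x, 2)) == 0

# Time-reversed density q(x, s) = p(x, T - s) and the backward drift for
# f = 0, a = 1: f_bar = grad log p evaluated at time T - s.
q = p.subs(t, T - s)
f_bar = sp.diff(sp.log(p), x).subs(t, T - s)

# q must solve a *forward* Fokker-Planck equation with drift f_bar:
# d_s q = -d_x(f_bar * q) + (1/2) d_xx q.
residual = sp.diff(q, s) + sp.diff(f_bar * q, x) - sp.Rational(1, 2) * sp.diff(q, x, 2)
assert sp.simplify(residual) == 0
print("time-reversed Fokker-Planck identity verified")
```

Note that the final identity holds because f̄ q = (∂_x log p) p = ∂_x p, so the drift term contributes a full ∂_xx q that flips the sign of the diffusion term, exactly as the proof sketch requires.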
Why It Matters
This is the result that made score-based diffusion possible. Before 2020, generative models used either invertible flows (which constrain architectures to be invertible) or GANs (which are unstable to train and have no likelihood). Score-based diffusion (Song et al., 2021) said: train a network s_θ(x, t) to approximate ∇ log p_t(x) for a known forward noising SDE, then plug the learned score into Anderson's reverse SDE and sample. Architecture constraints vanish: any function approximator works for the score. The training objective is regression, not adversarial. The reverse SDE is exactly Anderson's formula, with drift built from the score of the forward marginal; the only innovation was learning the score parametrically and treating the noise schedule σ(t) (variance-exploding) or β(t) (variance-preserving) as a hyperparameter.
Failure Mode
The theorem requires p_t to be smooth and strictly positive everywhere the process might visit. For data distributions supported on a low-dimensional manifold in ℝ^d (the realistic case for images), p_0 is not a density on ℝ^d; it is a measure concentrated on the manifold. The forward noising process smooths it into a strictly positive p_t for t > 0, so the score is well-defined in the bulk. But the backward SDE evaluated near t = 0 encounters a score field that diverges or is poorly defined near the manifold. This is why diffusion samplers stop the reverse process at a small t = ε > 0 rather than running all the way to t = 0, and why the singular behavior near t = 0 is the hardest part of diffusion-sampler engineering (Karras et al. 2022).
Score-Based Diffusion: The Canonical Use
The forward "variance-preserving" noising SDE is

dX_t = −(1/2) β(t) X_t dt + √β(t) dW_t

for a schedule β(t) > 0. With X_0 ~ p_data and large T, the marginal p_T is approximately standard Gaussian.

By Anderson, the corresponding backward SDE is

dX_t = [−(1/2) β(t) X_t − β(t) ∇ log p_t(X_t)] dt + √β(t) dW̄_t,

integrated from t = T down to t = 0.

A score network s_θ(x, t) is trained with score matching. At sampling time, replace ∇ log p_t with s_θ in the backward SDE and integrate numerically (Euler–Maruyama, predictor–corrector, or higher-order solvers). Each integration step pushes a Gaussian sample closer to the data distribution, and at t ≈ 0 the sample is approximately drawn from p_data.
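For a 1-D Gaussian data distribution the VP score is available in closed form, so the whole reverse-sampling pipeline can be tested end to end without a network. A sketch assuming a constant schedule β(t) = β; the parameter values (β = 2, T = 3, data N(2, 0.25)) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps, n_paths = 2.0, 3.0, 600, 100_000
mu, s0 = 2.0, 0.5                      # data distribution N(mu, s0^2)

def score(x, t):
    # VP marginal for Gaussian data: p_t = N(mu*alpha, s0^2*alpha^2 + 1 - alpha^2)
    # with alpha = exp(-beta*t/2) for constant beta.
    alpha = np.exp(-0.5 * beta * t)
    var = s0**2 * alpha**2 + 1.0 - alpha**2
    return -(x - mu * alpha) / var

dt = T / n_steps
x = rng.standard_normal(n_paths)       # start from N(0, 1), approximately p_T
for k in range(n_steps):
    t = T - k * dt                     # integrate from t = T down to t = 0
    drift = -0.5 * beta * x - beta * score(x, t)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_paths)

print(x.mean(), x.std())               # ~ 2.0, ~ 0.5: the data distribution
```

Swapping `score` for a trained network s_θ(x, t) turns this loop into a standard Euler–Maruyama diffusion sampler.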
The probability-flow ODE (Song et al. 2021) is the deterministic dual of the same backward dynamics; see probability-flow-ode for the closed-form connection.
Worked Example: Time-Reversed OU
Take, for concreteness, the forward OU SDE dX_t = −X_t dt + √2 dW_t on ℝ with X_0 ~ N(m_0, v_0). The marginal is p_t = N(m_0 e^{−t}, v_t) with v_t = 1 + (v_0 − 1) e^{−2t}, and the score is

∇ log p_t(x) = −(x − m_0 e^{−t}) / v_t.

Anderson's reverse SDE (with σ = √2 constant) is

dX_t = [−X_t − 2 ∇ log p_t(X_t)] dt + √2 dW̄_t,

integrated from t = T down to t = 0. Substituting the score and simplifying gives an explicit Gauss–Markov reverse process whose terminal distribution at t = 0 is exactly N(m_0, v_0). This is the cleanest verification of Anderson's theorem: the forward OU process and its time reversal are both linear Gaussian processes, and you can check that their marginals match analytically.
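The OU reversal can also be checked numerically: simulate the backward SDE with the closed-form score and confirm that the terminal distribution matches the initial one. A sketch assuming the concrete instance dX_t = −X_t dt + √2 dW_t with X_0 ~ N(m_0, v_0); the values m_0 = 1, v_0 = 1/4, T = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m0, v0, T = 1.0, 0.25, 2.0
n_steps, n_paths = 800, 100_000

def marginal(t):
    # dX = -X dt + sqrt(2) dW: mean m0*exp(-t), variance 1 + (v0 - 1)*exp(-2t)
    return m0 * np.exp(-t), 1.0 + (v0 - 1.0) * np.exp(-2.0 * t)

dt = T / n_steps
mT, vT = marginal(T)
x = mT + np.sqrt(vT) * rng.standard_normal(n_paths)   # start exactly at p_T
for k in range(n_steps):
    t = T - k * dt
    m, v = marginal(t)
    score = -(x - m) / v
    drift = -x - 2.0 * score          # f - sigma^2 * score with sigma = sqrt(2)
    x = x - drift * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

print(x.mean(), x.var())              # ~ m0 = 1.0, ~ v0 = 0.25
```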
Common Confusions
The reverse SDE is NOT the forward SDE with negated drift
A naive guess: to reverse an SDE, just flip the sign of the drift f. This is wrong; it gives the wrong intermediate marginals and the wrong terminal distribution. The correct backward drift is f − σσ^⊤ ∇ log p_t, with the score correction. The score correction is the nontrivial content of Anderson's theorem; without it, reversal would need nothing more than a sign flip and the theorem would be trivial.
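The failure is easy to see numerically for Brownian motion started at N(0, 1): the drift is zero, so flipping its sign changes nothing and the "reversed" process keeps diffusing, while Anderson's score-corrected drift shrinks the variance back to 1. A minimal sketch (parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 500, 100_000
dt = T / n_steps

# Forward: dX = dW, X_0 ~ N(0, 1), so p_t = N(0, 1 + t) and p_T = N(0, 2).
x_naive = np.sqrt(1 + T) * rng.standard_normal(n_paths)
x_anderson = x_naive.copy()

for k in range(n_steps):
    t = T - k * dt
    # Naive "reversal": f = 0, so sign-flipping the drift leaves pure diffusion.
    x_naive = x_naive + np.sqrt(dt) * rng.standard_normal(n_paths)
    # Anderson: backward drift f - sigma^2 grad log p_t = x / (1 + t),
    # integrated with a negative time step.
    score = -x_anderson / (1 + t)
    drift = 0.0 - score
    x_anderson = x_anderson - drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(x_naive.var())      # ~ 3.0: wrong, variance grew to 1 + 2T
print(x_anderson.var())   # ~ 1.0: recovers N(0, 1)
```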
The backward Brownian motion is a different Brownian motion
The forward and backward SDEs are driven by different Brownian motions: W_t is adapted to the forward filtration, while W̄_t is adapted to the backward filtration and is constructed from the forward path rather than being W run in reverse. In particular, you cannot reuse the forward noise increments (or their negation) to drive the backward SDE: pathwise, the identification X̄_s = X_{T−s} holds only with the specially constructed W̄. This matters for any analysis that tries to couple forward and backward trajectories.
Time reversal works for any forward SDE; the score is the only data-dependent piece
Some papers describe diffusion as if it required a special noise schedule to make reversal possible. It does not. Anderson's theorem applies to any forward SDE with smooth, strictly positive marginals. The choice of forward SDE (variance-preserving, variance-exploding, sub-VP, EDM-style) affects only how easy the score is to learn and how well-behaved the reverse sampler is, not whether reversal is mathematically valid.
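As a sanity check that nothing about noising schedules is special, the same recipe reverses an expanding SDE dX_t = X_t dt + dW_t, which is nobody's noise schedule; with Gaussian initial data its marginals and score stay closed-form. A sketch with illustrative parameters (m_0 = 1, v_0 = 1/2, T = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
m0, v0, T = 1.0, 0.5, 1.0
n_steps, n_paths = 500, 100_000
dt = T / n_steps

def marginal(t):
    # dX = X dt + dW: mean m0*exp(t), variance (v0 + 1/2)*exp(2t) - 1/2
    return m0 * np.exp(t), (v0 + 0.5) * np.exp(2.0 * t) - 0.5

mT, vT = marginal(T)
x = mT + np.sqrt(vT) * rng.standard_normal(n_paths)   # start at p_T
for k in range(n_steps):
    t = T - k * dt
    m, v = marginal(t)
    score = -(x - m) / v
    drift = x - score                 # f - sigma^2 * score with sigma = 1
    x = x - drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)

print(x.mean(), x.var())              # ~ m0 = 1.0, ~ v0 = 0.5
```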
Exercises
Problem
Verify Anderson's formula in the simplest case: standard Brownian motion dX_t = dW_t on ℝ with X_0 ~ N(0, 1). Write down p_t, compute ∇ log p_t, write the backward SDE, and confirm that the backward dynamics produce the right marginals.
Problem
Prove the divergence-of-current calculation that underlies Anderson's formula: starting from the forward Fokker–Planck equation ∂_t p_t = −∇·(f p_t) + (1/2) ∇·∇·(a p_t) with a = σσ^⊤, show that the time-reversed density q_s = p_{T−s} satisfies a forward Fokker–Planck equation with drift −f + ∇·a + a ∇ log p_{T−s} and the same diffusion a.
References
- B. D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
- U. G. Haussmann and E. Pardoux. Time reversal of diffusions. Annals of Probability, 14(4):1188–1205, 1986.
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
- T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022.
Next Topics
- Score Matching: the training objective that fits the score network s_θ ≈ ∇ log p_t used in the reverse SDE.
- Diffusion Models: the family of generative models built directly on Anderson's theorem.
- Probability Flow ODE: the deterministic dual of the reverse SDE with the same marginals.
- Fokker–Planck Equation: the PDE machinery behind the proof.
- Stochastic Differential Equations: the parent framework of forward and backward processes.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Stochastic Differential Equations (Layer 3)
- Brownian Motion (Layer 2)
- Measure-Theoretic Probability (Layer 0B)
- Martingale Theory (Layer 0B)
- Itô's Lemma (Layer 3)
- Stochastic Calculus for ML (Layer 3)
- Fokker–Planck Equation (Layer 3)
- PDE Fundamentals for Machine Learning (Layer 1)
- Fast Fourier Transform (Layer 1)
- Exponential Function Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Functional Analysis Core (Layer 0B)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Inner Product Spaces and Orthogonality (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)