
Probability Flow ODE

Song et al. 2021: every diffusion SDE has a deterministic ODE that produces the same time-marginals. It is the deterministic dual of Anderson's reverse SDE and the basis of DDIM, DPM-Solver, EDM samplers, exact-likelihood computation, and the conceptual bridge to flow matching.


Why This Matters

Every fast deterministic sampler used in production diffusion models — DDIM, DPM-Solver, DPM-Solver++, EDM, UniPC — is integrating the same underlying object: the probability flow ODE of Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (ICLR 2021). Anderson's reverse SDE is the stochastic way to invert a forward noising process; the probability flow ODE is the deterministic way that produces the same intermediate marginals. The choice between them happens purely at sampling time, with the same trained score network.

The ODE form is what enables three things that the SDE form cannot give. It admits exact log-likelihood computation via the instantaneous change-of-variables formula (Chen, Rubanova, Bettencourt, and Duvenaud, 2018), so a diffusion model becomes a continuous normalizing flow at inference time. It allows high-order solvers (DPM-Solver, Heun, exponential integrators) that take 20-50 steps where Euler-Maruyama on the reverse SDE needs hundreds or thousands. It is deterministic: the same noise sample maps to the same image, which is the property that makes DDIM-style image editing, latent-space interpolation, and consistency models possible at all.

The probability flow ODE is also the conceptual bridge from score-based diffusion to flow matching (Lipman, Chen, Ben-Hamu, Nickel, and Le, ICLR 2023) and rectified flow (Liu, Gong, and Liu, ICLR 2023). Flow matching trains a velocity field $v_\theta(x, t)$ to directly match the probability flow drift, bypassing the score parameterization entirely. Once you see the ODE, the score is just one way to specify a transport field; flow matching is what you get if you specify the transport field directly.

Beyond generative modeling, the same construction shows up in optimal transport (Otto's gradient-flow / JKO scheme), Fokker-Planck-driven particle methods (Maoutsa, Reich, and Opper, 2020), and any setting where you want to replace a stochastic sampler with a deterministic transport map without changing the marginals.

Mental Model

A Fokker-Planck equation is a continuity equation for densities: $\partial_t p_t + \nabla \cdot J_t = 0$, where $J_t$ is the probability current. The current can be split into a part that comes from the SDE drift and a part that comes from the diffusion. The first key observation is that many different vector fields produce the same current, because the current $J_t = v\, p_t$ underdetermines $v$ on regions where $p_t$ is constant. The second is that for a given Fokker-Planck flow there is exactly one deterministic transport that reproduces it: the unique $v$ such that $J_t = v\, p_t$ with no diffusion term.

That unique deterministic drift is the probability flow ODE field. It carries each initial point through time along a noiseless characteristic curve, and the distribution of those curves at every time matches the SDE's marginal. Pathwise the SDE and the ODE are very different (Brownian fluctuations versus smooth ODE trajectories), but their marginal laws at each intermediate time are identical.

A useful aphorism: the Fokker-Planck equation is what the densities do; the SDE and the ODE are two different particle realizations of those densities. Anderson's reverse SDE adds noise to average over many trajectories; the probability flow ODE removes noise to follow one trajectory deterministically. Both are correct, and both are useful, for different jobs.

Formal Statement

Definition

Probability Flow ODE

Let $X_t \in \mathbb{R}^d$ solve the forward Itô SDE $dX_t = b(X_t, t)\,dt + \sigma(t)\,dB_t$ on $[0, T]$ with marginal density $p_t(x)$ (assume the diffusion coefficient $\sigma(t) \in \mathbb{R}^{d \times d}$ is independent of $x$ for simplicity). The probability flow ODE associated to this SDE is the deterministic ODE

$$\frac{dx}{dt} = b(x, t) - \tfrac{1}{2}\, \sigma(t)\sigma(t)^\top\, \nabla \log p_t(x).$$

If $X_0 \sim p_0$ and $\{Y_t\}_{t \in [0, T]}$ solves the ODE with $Y_0 \sim p_0$, then $Y_t$ has marginal density $p_t$ for every $t \in [0, T]$. The score field $\nabla \log p_t$ is the same object that appears in Anderson's reverse-time SDE; the only difference is the factor $\tfrac{1}{2}$ in front of the score correction.

In the general case where $\sigma$ depends on $x$, the formula picks up an additional divergence term and reads $dx/dt = b - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top) - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t$. For variance-preserving and variance-exploding diffusion schedules $\sigma$ is $x$-independent, and the simpler form above is the one used in practice.

Marginal Equivalence Theorem

Theorem

Probability Flow ODE Marginal Equivalence

Statement

Under the assumptions above, let $p_t$ be the marginal density of the SDE $dX_t = b(X_t, t)\,dt + \sigma(t)\,dB_t$ with $X_0 \sim p_0$. Define the deterministic vector field $v(x, t) = b(x, t) - \tfrac{1}{2} \sigma(t) \sigma(t)^\top \nabla \log p_t(x)$ and let $\{Y_t\}$ solve the ODE $dY_t/dt = v(Y_t, t)$ with $Y_0 \sim p_0$. Then for every $t \in [0, T]$ the law of $Y_t$ has density $p_t$, the same density the SDE produces.

Intuition

Both processes must satisfy a continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ for some velocity field $v$. The SDE's Fokker-Planck operator $-\nabla \cdot (b\, p) + \tfrac{1}{2} \nabla^2 : (\sigma \sigma^\top p)$ rewrites as a continuity equation with $v_{\text{eff}} = b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p$. That $v_{\text{eff}}$ is the unique deterministic drift compatible with the marginal flow. The diffusion has been absorbed into the drift via the score; the noise term disappears because deterministic transport already carries the right amount of mass.

Proof Sketch

Start from the Fokker-Planck equation $\partial_t p_t = -\nabla \cdot (b\, p_t) + \tfrac{1}{2} \sum_{i, j} \partial_i \partial_j \big((\sigma \sigma^\top)_{ij}\, p_t\big)$. With $\sigma$ independent of $x$, the second term simplifies to $\tfrac{1}{2} \nabla^2 : (\sigma \sigma^\top p_t) = \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top \nabla p_t)$. Use the identity $\nabla p_t = p_t\, \nabla \log p_t$ to rewrite this as $\tfrac{1}{2} \nabla \cdot \big(\sigma \sigma^\top (\nabla \log p_t)\, p_t\big)$. Substitute back into the Fokker-Planck equation and collect terms: $\partial_t p_t = -\nabla \cdot \big[(b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t)\, p_t\big]$. This is exactly the continuity equation $\partial_t p_t + \nabla \cdot (v\, p_t) = 0$ for $v = b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t$. The characteristic curves of this continuity equation are the trajectories of the ODE $dx/dt = v(x, t)$, and mass is carried along characteristics, so the marginal of any ODE solution started from $p_0$ is $p_t$.
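The theorem can be checked numerically in a case where the score is known in closed form. Below is a minimal sketch, not from the original text, using assumed toy constants: a 1-D VP-type process $dX_t = -\tfrac{1}{2}\beta X_t\,dt + \sqrt{\beta}\,dB_t$ with Gaussian initial data, for which $p_t = \mathcal{N}(0, v_t)$ with $v_t = s_0^2 e^{-\beta t} + 1 - e^{-\beta t}$ and score $-x/v_t$. The particle cloud from Euler-Maruyama on the SDE and the cloud from Euler on the probability flow ODE should both match the analytic variance.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps, n_particles, s0 = 1.0, 3.0, 3000, 50_000, 4.0
dt = T / n_steps

def var_t(t):
    # closed-form marginal variance when X_0 ~ N(0, s0^2)
    return s0**2 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def score(x, t):
    # exact score of the Gaussian marginal p_t = N(0, v_t)
    return -x / var_t(t)

x_sde = rng.normal(0.0, s0, n_particles)   # Euler-Maruyama on the SDE
x_ode = x_sde.copy()                       # same initial cloud, PF ODE
for k in range(n_steps):
    t = k * dt
    x_sde = x_sde - 0.5 * beta * x_sde * dt \
        + np.sqrt(beta * dt) * rng.normal(size=n_particles)
    # probability flow drift: b - (1/2) sigma^2 * score
    x_ode = x_ode + (-0.5 * beta * x_ode - 0.5 * beta * score(x_ode, t)) * dt

print(x_sde.var(), x_ode.var(), var_t(T))  # all three should agree closely
```

Individual SDE paths are jagged and individual ODE paths are smooth contractions, yet the two final histograms are statistically indistinguishable — which is exactly the marginal-equivalence claim.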

Why It Matters

Two practical consequences follow. First, you can sample from a diffusion model by integrating a deterministic ODE instead of a stochastic SDE, which allows higher-order ODE solvers (Heun, RK4, exponential integrators, DPM-Solver multistep) that match SDE-sampler quality in 20-50 function evaluations instead of 250-1000. Second, because the ODE is a continuous normalizing flow, you can compute exact log-likelihoods via the instantaneous change-of-variables formula (next theorem), which the stochastic sampler cannot do.
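The solver-order claim can be illustrated on the same kind of analytically solvable case. A hedged sketch under assumed constants (not from the text): for Gaussian marginals $p_t = \mathcal{N}(0, v_t)$ the VP probability flow ODE is linear in $x$ with exact solution $x(T) = x(0)\sqrt{v_T/v_0}$, so first-order Euler and second-order Heun can be compared against ground truth at an equal step budget.

```python
import numpy as np

beta, T, s0 = 1.0, 3.0, 4.0

def var_t(t):
    return s0**2 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def f(x, t):
    # probability flow drift with the exact Gaussian score -x / v_t
    return -0.5 * beta * (x - x / var_t(t))

x0 = 3.0
exact = x0 * np.sqrt(var_t(T) / var_t(0.0))

def euler(n):
    x, h = x0, T / n
    for k in range(n):
        x = x + h * f(x, k * h)
    return x

def heun(n):
    x, h = x0, T / n
    for k in range(n):
        t = k * h
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)   # predictor-corrector (Heun) step
        x = x + 0.5 * h * (k1 + k2)
    return x

err_euler = abs(euler(20) - exact)
err_heun = abs(heun(20) - exact)
print(err_euler, err_heun)  # Heun is markedly more accurate at 20 steps
```

The same step count buys roughly an order of magnitude less error with the second-order method, which is the whole economic argument for Heun/DPM-Solver-style samplers.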

Failure Mode

The score $\nabla \log p_t$ becomes singular near the data manifold as $t \to 0$. For data concentrated on a low-dimensional submanifold of $\mathbb{R}^d$ (the realistic case for images), $p_0$ is not a density on $\mathbb{R}^d$ at all, and $p_t$ for small $t$ has a nearly singular score near the manifold. The ODE then becomes stiff: standard explicit solvers either blow up or take vanishingly small steps near $t = 0$. EDM (Karras et al., 2022) addresses this by reparameterizing the ODE so the singular behavior is absorbed into a preconditioner, and DPM-Solver uses exponential integrators that are stable for the linear part of the dynamics. Naive Euler integration of the probability flow ODE on unprocessed forward SDEs is the canonical failure mode.

Instantaneous Change of Variables and Exact Likelihood

Theorem

Instantaneous Change of Variables (Chen et al. 2018)

Statement

Let $x_t$ solve $dx/dt = v(x, t)$ with initial condition $x_0$, and let $p_t(x)$ be the density of the random variable $x_t$ when $x_0 \sim p_0$. Then along the trajectory,

$$\frac{d}{dt} \log p_t(x_t) = -\,\mathrm{tr}\!\left(\nabla_x v(x_t, t)\right) = -\,\nabla \cdot v(x_t, t).$$

Integrating from $0$ to $T$ gives the exact log-likelihood identity $\log p_T(x_T) = \log p_0(x_0) - \int_0^T \nabla \cdot v(x_s, s)\,ds$, where $x_s$ is the ODE trajectory connecting $x_0$ and $x_T$.

Intuition

The density along the flow is controlled by the Jacobian determinant of the flow map. Differentiating $\log \det$ in time gives the trace of the time-derivative of the Jacobian, which is $\mathrm{tr}(\nabla v)$. The minus sign comes from the continuity equation: where the velocity field has positive divergence, mass is being stretched apart and density must drop.

Proof Sketch

By the continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ and the chain rule along characteristics, $\frac{d}{dt} p_t(x_t) = \partial_t p_t(x_t) + \nabla p_t(x_t) \cdot v(x_t, t) = -\nabla \cdot (v\, p_t)(x_t) + \nabla p_t \cdot v = -p_t\, \nabla \cdot v$ at $x = x_t$. Divide by $p_t(x_t)$ to get $\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot v(x_t, t)$. The integral form follows by the fundamental theorem of calculus.

Why It Matters

Combined with the probability flow ODE, this gives an exact likelihood estimator for diffusion models. Algorithm: integrate the ODE forward from a data point $x_0$ to noise $x_T$, simultaneously accumulating the trace integral $\int_0^T \mathrm{tr}(\nabla v)\,ds$. The log-likelihood is $\log p_0(x_0) = \log p_T(x_T) + \int_0^T \mathrm{tr}(\nabla v(x_s, s))\,ds$, where $\log p_T(x_T)$ is the standard Gaussian log-density at the noise endpoint. This is the only way to get exact (not lower-bound) likelihoods from a diffusion model, and it is what Song et al. 2021 used to report bits-per-dim on CIFAR-10 competitive with autoregressive models. In high dimensions the trace $\mathrm{tr}(\nabla v)$ is estimated stochastically via Hutchinson's trick: $\mathrm{tr}(\nabla v) = \mathbb{E}_\epsilon[\epsilon^\top \nabla v\, \epsilon]$ for $\epsilon \sim \mathcal{N}(0, I)$ or Rademacher, which costs one Jacobian-vector product per estimate.
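The whole algorithm fits in a few lines for a field where everything is checkable by hand. A sketch under assumed toy settings: a linear velocity field $v(x, t) = Ax$ with diagonal $A$, so the terminal density of the flow is Gaussian in closed form and Hutchinson with Rademacher probes is exact per sample. (In a real diffusion model, $v$ would be the learned probability flow drift and the Jacobian-vector product would come from autodiff.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([-0.5, -0.3])       # toy linear velocity field v(x, t) = A x
T, n_steps = 2.0, 2000
dt = T / n_steps

def v(x, t):
    return A @ x

x = np.array([1.3, -0.4])       # "data" point x_0, with p_0 = N(0, I)
x0 = x.copy()
div_integral = 0.0
for k in range(n_steps):
    # Hutchinson: eps^T (dv/dx) eps estimates tr(dv/dx); for this linear
    # field the Jacobian-vector product is simply A @ eps
    eps = rng.choice([-1.0, 1.0], size=2)
    div_integral += eps @ (A @ eps) * dt
    x = x + v(x, k * dt) * dt   # Euler step of the flow ODE

# log p_0(x_0) = log p_T(x_T) + int tr(dv/dx) dt, where the flow's true
# terminal density for this linear field is p_T = N(0, exp(2 A T))
var_T = np.exp(2.0 * np.diag(A) * T)
log_pT = -0.5 * np.sum(np.log(2 * np.pi * var_T) + x**2 / var_T)
log_p0_est = log_pT + div_integral
log_p0_true = -0.5 * np.sum(np.log(2 * np.pi) + x0**2)
print(log_p0_est, log_p0_true)  # should agree to integration tolerance
```

Note that the recovered value matches the standard-normal log-density of the data point; any residual gap is pure Euler integration error, illustrating why reported "exact" likelihoods carry a solver tolerance.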

Failure Mode

The Hutchinson estimator has variance that grows with the dimension; for high-dimensional images, getting a tight likelihood estimate requires many samples per data point. The ODE integration error also compounds with the trace-integral error; halving the step size doubles both the integration work and the trace-estimator work. Reported "exact" likelihoods always carry an integration tolerance that practitioners sometimes skip past in the comparison tables.

Connection to DDIM, EDM, and Flow Matching

DDIM (Song, Meng, and Ermon, ICLR 2021) was discovered before the continuous-time formulation; the original paper described a non-Markovian discrete forward process whose deterministic reverse is closed form. Once Song et al. 2021 introduced the SDE framework, DDIM was recognized as exactly the discretization of the probability flow ODE for the variance-preserving diffusion schedule. The "deterministic DDIM" sampler that ships in every diffusion library is integrating $dx/dt = -\tfrac{1}{2} \beta(t)\big[x + \nabla \log p_t(x)\big]$ with a first-order Euler step in a transformed time variable. There is no separate trained model; it is the same DDPM weights, sampled differently.
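A minimal sketch of the deterministic DDIM update (the standard $\eta = 0$ form), with a stand-in analytic $\epsilon$-predictor in place of trained weights — an assumption for illustration, not the original construction. If the "data" distribution is itself $\mathcal{N}(0, 1)$, then $\hat\epsilon(x, t) = \sqrt{1 - \bar\alpha_t}\,x$ is the exact predictor, and the sampler should approximately carry a unit-variance noise cloud to a unit-variance data cloud.

```python
import numpy as np

rng = np.random.default_rng(0)
n_T = 1000
betas = np.linspace(1e-4, 0.02, n_T)   # standard DDPM linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_hat(x, t):
    # exact epsilon-predictor when the data distribution is N(0, 1)
    return np.sqrt(1.0 - alpha_bar[t]) * x

x = rng.normal(size=100_000)           # start from the noise prior
for t in range(n_T - 1, 0, -1):
    e = eps_hat(x, t)
    # predicted clean sample, then deterministic (eta = 0) DDIM step
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * e) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * e

print(x.var())  # approximately 1: the marginal is (nearly) preserved
```

Because the update is deterministic, re-running the loop on the same starting noise reproduces the same outputs exactly — the property that makes DDIM inversion and editing possible.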

EDM (Karras, Aittala, Aila, and Laine, NeurIPS 2022) reparameterizes the probability flow ODE in terms of a noise level $\sigma$ rather than a time $t$, and applies preconditioning to the network so the singular behavior near the data manifold is absorbed into the parameterization. With Heun's second-order solver and the EDM preconditioner, 30-40 function evaluations match the FID of 1000-step DDPM sampling. DPM-Solver (Lu, Zhou, Bao, Chen, Li, and Zhu, NeurIPS 2022) goes further by exploiting the linear structure of the variance-preserving forward process: the linear term has a closed-form solution, so only the score-driven nonlinear term needs numerical integration, yielding 10-20 step samplers with little quality loss.

Flow matching (Lipman et al. 2023) takes the reframing one step further. Train a network $v_\theta(x, t)$ to directly regress the probability flow drift, conditional on a chosen forward path between data and noise. The training loss is $\mathbb{E}_{t, x_0, x_1, x_t}\big\|v_\theta(x_t, t) - u_t(x_t \mid x_0, x_1)\big\|^2$, where $u_t$ is the conditional velocity along the chosen path. The score parameterization disappears entirely; what remains is a direct regression on the velocity field of a continuous normalizing flow. For Gaussian source paths this reproduces score-matching diffusion exactly, but the framework also accommodates other interpolation paths (rectified flow, optimal-transport-coupled paths) and is the formulation behind Stable Diffusion 3.
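The objective is easy to state in code. A hedged sketch under assumed settings: the linear (rectified-flow) path $x_t = (1-t)x_0 + t\,x_1$ with conditional velocity $u = x_1 - x_0$, and a Gaussian source/target pair for which the loss-minimizing marginal velocity $\mathbb{E}[x_1 - x_0 \mid x_t, t]$ has a closed form, so we can verify that it beats a naive velocity model under the same loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 50_000, 3.0
x0 = rng.normal(size=(n, 2))            # source (noise) samples
x1 = rng.normal(size=(n, 2)) + mu       # target ("data") samples
t = rng.uniform(size=(n, 1))

# linear (rectified-flow) path and its conditional velocity target
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

def v_marginal(x, t):
    # closed-form minimizer E[x1 - x0 | x_t] for this Gaussian pair
    s2 = (1.0 - t) ** 2 + t ** 2
    return mu + (2.0 * t - 1.0) / s2 * (x - mu * t)

def cfm_loss(pred):
    # conditional flow matching loss: regression onto u
    return np.mean(np.sum((pred - u) ** 2, axis=1))

loss_opt = cfm_loss(v_marginal(xt, t))
loss_naive = cfm_loss(np.zeros_like(u))  # a deliberately bad velocity model
print(loss_opt, loss_naive)  # the marginal velocity achieves the lower loss
```

In training, `v_marginal` would be replaced by the network $v_\theta$ and the loss minimized by SGD; the point of the demo is that the regression target is the cheap conditional velocity, while the implicit minimizer is the marginal transport field.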

Worked Example: Variance-Preserving SDE

The variance-preserving (VP) forward SDE is $dX_t = -\tfrac{1}{2} \beta(t) X_t\,dt + \sqrt{\beta(t)}\,dB_t$ for a positive schedule $\beta(t)$. The diffusion coefficient is $\sigma(t) = \sqrt{\beta(t)}$, so $\sigma\sigma^\top = \beta(t)\, I$. The probability flow ODE is

$$\frac{dx}{dt} = -\tfrac{1}{2} \beta(t)\, x - \tfrac{1}{2} \beta(t)\, \nabla \log p_t(x) = -\tfrac{1}{2} \beta(t) \big[x + \nabla \log p_t(x)\big].$$

Compare with Anderson's reverse SDE for the same forward process: $dx = -\tfrac{1}{2} \beta(t) \big[x + 2 \nabla \log p_t(x)\big]\,dt + \sqrt{\beta(t)}\,d\bar{B}_t$. The deterministic ODE has a half score correction where the SDE has a full one. The factor of $2$ is what makes the SDE trajectory wander stochastically while the ODE trajectory stays on a single deterministic curve.

As a sanity check, suppose the VP process has reached its stationary marginal $p_t = \mathcal{N}(0, I)$ (the limit as $t \to \infty$ for the VP schedule). The score is $\nabla \log p_t(x) = -x$, and the ODE becomes $dx/dt = -\tfrac{1}{2} \beta(t)\,(x - x) = 0$. The deterministic transport correctly identifies that once the marginals have stopped evolving, no transport is needed. For finite-time intermediate marginals, integrate the ODE numerically with a learned score $s_\theta(x, t) \approx \nabla \log p_t(x)$ in place of the unknown true score; this is what every deterministic diffusion sampler does at inference time.
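The worked example can also be run in reverse, which is exactly the deterministic-sampler pattern: start from (approximate) noise at $t = T$, integrate the VP probability flow ODE backward with the analytic Gaussian score standing in for $s_\theta$, and recover the data distribution. A sketch with assumed constants ($\beta \equiv 1$, Gaussian data $\mathcal{N}(0, s_0^2)$, so $p_t = \mathcal{N}(0, v_t)$ with $v_t = s_0^2 e^{-t} + 1 - e^{-t}$):

```python
import numpy as np

rng = np.random.default_rng(0)
s0, T, n_steps, n_particles = 4.0, 10.0, 5000, 50_000
dt = T / n_steps

def var_t(t):
    return s0**2 * np.exp(-t) + 1.0 - np.exp(-t)

def score(x, t):
    return -x / var_t(t)        # stand-in for the learned s_theta(x, t)

x = rng.normal(0.0, np.sqrt(var_t(T)), n_particles)  # ~ N(0, 1) at t = T
for k in range(n_steps):
    t = T - k * dt
    # VP probability flow ODE dx/dt = -1/2 [x + score], stepped backward
    x = x - (-0.5 * (x + score(x, t))) * dt

print(x.var(), s0**2)  # the backward ODE recovers the data variance
```

Swapping the analytic `score` for a trained network turns this loop into a real (if crude, first-order) deterministic diffusion sampler.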

Common Confusions

Watch Out

Same marginals are not the same joint distribution

The probability flow ODE and the SDE produce the same time-marginals $p_t$, but they are very different processes pathwise. The SDE's trajectory is a Brownian-driven random function with Hölder-$1/2$ regularity; the ODE's trajectory is a smooth deterministic curve. Quantities that depend on more than one time slice — joint distributions of $(X_{t_1}, X_{t_2})$, exit times, path-functional expectations — generally differ between the two processes. The marginal-equivalence theorem only asserts equality at single times.

Watch Out

The probability flow ODE is unique among gradient-class transports, not among all transports

Many vector fields produce the same density evolution $p_t$. The probability flow ODE is the unique drift you get by absorbing the diffusion into the deterministic transport via the identity $\sigma \sigma^\top \nabla p = \sigma \sigma^\top (\nabla \log p)\, p$. If you allow vector fields outside this gradient-class structure (for example, adding a perturbation $w$ with $\nabla \cdot (w\, p) = 0$), you get a different ODE that produces the same marginals. Rectified flow exploits exactly this freedom by post-processing trajectories to make them straighter. The probability flow ODE is canonical because it is the unique gradient-class deterministic transport, not because it is the only deterministic transport.

Watch Out

DDIM is not a different model from DDPM

A common belief: DDPM and DDIM are different generative models, with DDIM being faster and lower quality. They are the same model. DDPM and DDIM share weights, training loss, and forward process. They differ only at sampling time: DDPM integrates the reverse SDE (stochastic) with first-order Euler steps; DDIM integrates the probability flow ODE (deterministic). The trained network $\epsilon_\theta(x, t)$ is identical. This is why you can take any pretrained DDPM checkpoint and run DDIM sampling on it without retraining. The same identity applies to score-SDE models and EDM samplers.

Exercises

ExerciseCore

Problem

Verify the probability flow ODE for standard Brownian motion: $dX_t = dB_t$ with $X_0 \sim \mathcal{N}(0, 1)$ on $\mathbb{R}$. Compute the marginal $p_t$ and the score $\nabla \log p_t$. Write the probability flow ODE explicitly. Solve the ODE starting from $Y_0 = y_0$ and confirm that $Y_t \sim \mathcal{N}(0, 1 + t)$ when $Y_0 \sim \mathcal{N}(0, 1)$.

ExerciseAdvanced

Problem

Derive the instantaneous change-of-variables formula $\frac{d}{dt} \log p_t(x_t) = -\mathrm{tr}(\nabla_x v(x_t, t))$ from the continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ along characteristics of the ODE $dx/dt = v(x, t)$.


Next Topics

  • Diffusion Models: the generative-modeling family in which the probability flow ODE is the deterministic sampler.
  • Score Matching: the training objective for the score $\nabla \log p_t$ that appears in the ODE drift.
  • Time Reversal of SDEs: Anderson's stochastic dual to the probability flow ODE; same marginals, different paths.
  • Fokker-Planck Equation: the PDE machinery the marginal-equivalence proof reduces to.
  • Stochastic Differential Equations: the parent framework for both the SDE and its ODE counterpart.

Last reviewed: April 18, 2026
