
Probability Flow ODE

Song et al. 2021: every diffusion SDE has a deterministic ODE that produces the same time-marginals. It is the deterministic dual of Anderson's reverse SDE and the basis of DDIM, DPM-Solver, EDM samplers, exact-likelihood computation, and the conceptual bridge to flow matching.


Why This Matters

Every fast deterministic sampler used in production diffusion models — DDIM, DPM-Solver, DPM-Solver++, EDM, UniPC — is integrating the same underlying object: the probability flow ODE of Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (ICLR 2021). Anderson's reverse SDE is the stochastic way to invert a forward noising process; the probability flow ODE is the deterministic way that produces the same intermediate marginals. The choice between them happens purely at sampling time, with the same trained score network.

The ODE form is what enables three things that the SDE form cannot give. It admits exact log-likelihood computation via the instantaneous change-of-variables formula (Chen, Rubanova, Bettencourt, and Duvenaud, 2018), so a diffusion model becomes a continuous normalizing flow at inference time. It allows high-order solvers (DPM-Solver, Heun, exponential integrators) that take 20-50 steps where Euler-Maruyama on the reverse SDE needs hundreds or thousands. It is deterministic: the same noise sample maps to the same image, which is the property that makes DDIM-style image editing, latent-space interpolation, and consistency models possible at all.

The probability flow ODE is also the conceptual bridge from score-based diffusion to flow matching (Lipman, Chen, Ben-Hamu, Nickel, and Le, ICLR 2023) and rectified flow (Liu, Gong, and Liu, ICLR 2023). Flow matching trains a velocity field $v_\theta(x, t)$ to directly match the probability flow drift, bypassing the score parameterization entirely. Once you see the ODE, the score is just one way to specify a transport field; flow matching is what you get if you specify the transport field directly.

Beyond generative modeling, the same construction shows up in optimal transport (Otto's gradient-flow / JKO scheme), Fokker-Planck-driven particle methods (Maoutsa, Reich, and Opper, 2020), and any setting where you want to replace a stochastic sampler with a deterministic transport map without changing the marginals.

Mental Model

A Fokker-Planck equation is a continuity equation for densities: $\partial_t p_t + \nabla \cdot J_t = 0$, where $J_t$ is the probability current. The current can be split into a part that comes from the SDE drift and a part that comes from the diffusion. The first key observation is that many different vector fields produce the same current, because the current $J_t = v\, p_t$ underdetermines $v$ on regions where $p_t$ is constant. The second is that for a given Fokker-Planck flow there is exactly one deterministic transport that reproduces it: the unique $v$ such that $J_t = v\, p_t$ with no diffusion term.

That unique deterministic drift is the probability flow ODE field. It carries each initial point through time along a noiseless characteristic curve, and the distribution of those curves at every time matches the SDE's marginal. Pathwise the SDE and the ODE are very different (Brownian fluctuations versus smooth ODE trajectories), but their marginal laws at each intermediate time are identical.

A useful aphorism: the Fokker-Planck equation is what the densities do; the SDE and the ODE are two different particle realizations of those densities. Anderson's reverse SDE adds noise to average over many trajectories; the probability flow ODE removes noise to follow one trajectory deterministically. Both are correct, and both are useful, for different jobs.

Formal Statement

Definition

Probability Flow ODE

Let $X_t \in \mathbb{R}^d$ solve the forward Itô SDE $dX_t = b(X_t, t)\,dt + \sigma(t)\,dB_t$ on $[0, T]$ with marginal density $p_t(x)$ (assume the diffusion coefficient $\sigma(t) \in \mathbb{R}^{d \times d}$ is independent of $x$ for simplicity). The probability flow ODE associated to this SDE is the deterministic ODE

$$\frac{dx}{dt} = b(x, t) - \tfrac{1}{2}\, \sigma(t)\sigma(t)^\top\, \nabla \log p_t(x).$$

If $X_0 \sim p_0$ and $\{Y_t\}_{t \in [0, T]}$ solves the ODE with $Y_0 \sim p_0$, then $Y_t$ has marginal density $p_t$ for every $t \in [0, T]$. The score field $\nabla \log p_t$ is the same object that appears in Anderson's reverse-time SDE; the only difference is the factor $\tfrac{1}{2}$ in front of the score correction.

In the general case where $\sigma$ depends on $x$, the formula picks up an additional divergence term and reads $dx/dt = b - \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top) - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t$. For variance-preserving and variance-exploding diffusion schedules $\sigma$ is $x$-independent, and the simpler form above is the one used in practice.

Marginal Equivalence Theorem

Theorem

Probability Flow ODE Marginal Equivalence

Statement

Under the assumptions above, let $p_t$ be the marginal density of the SDE $dX_t = b(X_t, t)\,dt + \sigma(t)\,dB_t$ with $X_0 \sim p_0$. Define the deterministic vector field $v(x, t) = b(x, t) - \tfrac{1}{2} \sigma(t) \sigma(t)^\top \nabla \log p_t(x)$ and let $\{Y_t\}$ solve the ODE $dY_t/dt = v(Y_t, t)$ with $Y_0 \sim p_0$. Then for every $t \in [0, T]$ the law of $Y_t$ has density $p_t$, the same density the SDE produces.

Intuition

Both processes must satisfy a continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ for some velocity field $v$. The SDE's Fokker-Planck operator $-\nabla \cdot (b\, p) + \tfrac{1}{2} \nabla^2 : (\sigma \sigma^\top p)$ rewrites as a continuity equation with $v_{\text{eff}} = b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p$. That $v_{\text{eff}}$ is the unique deterministic drift compatible with the marginal flow. The diffusion has been absorbed into the drift via the score; the noise term disappears because deterministic transport already carries the right amount of mass.

Proof Sketch

Start from the Fokker-Planck equation $\partial_t p_t = -\nabla \cdot (b\, p_t) + \tfrac{1}{2} \sum_{i, j} \partial_i \partial_j \big((\sigma \sigma^\top)_{ij}\, p_t\big)$. With $\sigma$ independent of $x$, the second term simplifies to $\tfrac{1}{2} \nabla^2 : (\sigma \sigma^\top p_t) = \tfrac{1}{2} \nabla \cdot (\sigma \sigma^\top \nabla p_t)$. Use the identity $\nabla p_t = p_t\, \nabla \log p_t$ to rewrite this as $\tfrac{1}{2} \nabla \cdot \big(\sigma \sigma^\top (\nabla \log p_t)\, p_t\big)$. Substitute back into the Fokker-Planck equation and collect terms: $\partial_t p_t = -\nabla \cdot \big[(b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t)\, p_t\big]$. This is exactly the continuity equation $\partial_t p_t + \nabla \cdot (v\, p_t) = 0$ for $v = b - \tfrac{1}{2} \sigma \sigma^\top \nabla \log p_t$. The characteristic curves of this continuity equation are the trajectories of the ODE $dx/dt = v(x, t)$, and mass is carried along characteristics, so the marginal of any ODE solution started from $p_0$ is $p_t$.
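The theorem can be checked numerically in a case where the score is known in closed form. Below is a minimal sketch, not from the original text, using assumed toy constants: a 1-D VP-type process $dX_t = -\tfrac{1}{2}\beta X_t\,dt + \sqrt{\beta}\,dB_t$ with Gaussian initial data, for which $p_t = \mathcal{N}(0, v_t)$ with $v_t = s_0^2 e^{-\beta t} + 1 - e^{-\beta t}$ and score $-x/v_t$. The particle cloud from Euler-Maruyama on the SDE and the cloud from Euler on the probability flow ODE should both match the analytic variance.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n_steps, n_particles, s0 = 1.0, 3.0, 3000, 50_000, 4.0
dt = T / n_steps

def var_t(t):
    # closed-form marginal variance when X_0 ~ N(0, s0^2)
    return s0**2 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def score(x, t):
    # exact score of the Gaussian marginal p_t = N(0, v_t)
    return -x / var_t(t)

x_sde = rng.normal(0.0, s0, n_particles)   # Euler-Maruyama on the SDE
x_ode = x_sde.copy()                       # same initial cloud, PF ODE
for k in range(n_steps):
    t = k * dt
    x_sde = x_sde - 0.5 * beta * x_sde * dt \
        + np.sqrt(beta * dt) * rng.normal(size=n_particles)
    # probability flow drift: b - (1/2) sigma^2 * score
    x_ode = x_ode + (-0.5 * beta * x_ode - 0.5 * beta * score(x_ode, t)) * dt

print(x_sde.var(), x_ode.var(), var_t(T))  # all three should agree closely
```

Individual SDE paths are jagged and individual ODE paths are smooth contractions, yet the two final histograms are statistically indistinguishable — which is exactly the marginal-equivalence claim.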

Why It Matters

Two practical consequences follow. First, you can sample from a diffusion model by integrating a deterministic ODE instead of a stochastic SDE, which allows higher-order ODE solvers (Heun, RK4, exponential integrators, DPM-Solver multistep) that match SDE-sampler quality in 20-50 function evaluations instead of 250-1000. Second, because the ODE is a continuous normalizing flow, you can compute exact log-likelihoods via the instantaneous change-of-variables formula (next theorem), which the stochastic sampler cannot do.
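The solver-order claim can be illustrated on the same kind of analytically solvable case. A hedged sketch under assumed constants (not from the text): for Gaussian marginals $p_t = \mathcal{N}(0, v_t)$ the VP probability flow ODE is linear in $x$ with exact solution $x(T) = x(0)\sqrt{v_T/v_0}$, so first-order Euler and second-order Heun can be compared against ground truth at an equal step budget.

```python
import numpy as np

beta, T, s0 = 1.0, 3.0, 4.0

def var_t(t):
    return s0**2 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def f(x, t):
    # probability flow drift with the exact Gaussian score -x / v_t
    return -0.5 * beta * (x - x / var_t(t))

x0 = 3.0
exact = x0 * np.sqrt(var_t(T) / var_t(0.0))

def euler(n):
    x, h = x0, T / n
    for k in range(n):
        x = x + h * f(x, k * h)
    return x

def heun(n):
    x, h = x0, T / n
    for k in range(n):
        t = k * h
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)   # predictor-corrector (Heun) step
        x = x + 0.5 * h * (k1 + k2)
    return x

err_euler = abs(euler(20) - exact)
err_heun = abs(heun(20) - exact)
print(err_euler, err_heun)  # Heun is markedly more accurate at 20 steps
```

The same step count buys roughly an order of magnitude less error with the second-order method, which is the whole economic argument for Heun/DPM-Solver-style samplers.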

Failure Mode

The score $\nabla \log p_t$ becomes singular near the data manifold as $t \to 0$. For data concentrated on a low-dimensional submanifold of $\mathbb{R}^d$ (the realistic case for images), $p_0$ is not a density on $\mathbb{R}^d$ at all, and $p_t$ for small $t$ has a nearly singular score near the manifold. The ODE then becomes stiff: standard explicit solvers either blow up or take vanishingly small steps near $t = 0$. EDM (Karras et al., 2022) addresses this by reparameterizing the ODE so the singular behavior is absorbed into a preconditioner, and DPM-Solver uses exponential integrators that are stable for the linear part of the dynamics. Naive Euler integration of the probability flow ODE on unprocessed forward SDEs is the canonical failure mode.

Instantaneous Change of Variables and Exact Likelihood

Theorem

Instantaneous Change of Variables (Chen et al. 2018)

Statement

Let $x_t$ solve $dx/dt = v(x, t)$ with initial condition $x_0$, and let $p_t(x)$ be the density of the random variable $x_t$ when $x_0 \sim p_0$. Then along the trajectory,

$$\frac{d}{dt} \log p_t(x_t) = -\,\mathrm{tr}\!\left(\nabla_x v(x_t, t)\right) = -\,\nabla \cdot v(x_t, t).$$

Integrating from $0$ to $T$ gives the exact log-likelihood identity $\log p_T(x_T) = \log p_0(x_0) - \int_0^T \nabla \cdot v(x_s, s)\,ds$, where $x_s$ is the ODE trajectory connecting $x_0$ and $x_T$.

Intuition

The density along the flow is controlled by the Jacobian determinant of the flow map. Differentiating $\log \det$ in time gives the trace of the time-derivative of the Jacobian, which is $\mathrm{tr}(\nabla v)$. The minus sign comes from the continuity equation: where the velocity field has positive divergence, mass is being stretched apart and density must drop.

Proof Sketch

By the continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ and the chain rule along characteristics, $\frac{d}{dt} p_t(x_t) = \partial_t p_t(x_t) + \nabla p_t(x_t) \cdot v(x_t, t) = -\nabla \cdot (v\, p_t)(x_t) + \nabla p_t \cdot v = -p_t\, \nabla \cdot v$ at $x = x_t$. Divide by $p_t(x_t)$ to get $\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot v(x_t, t)$. The integral form follows by the fundamental theorem of calculus.

Why It Matters

Combined with the probability flow ODE, this gives an exact likelihood estimator for diffusion models. Algorithm: integrate the ODE forward from a data point $x_0$ to noise $x_T$, simultaneously accumulating the trace integral $\int_0^T \mathrm{tr}(\nabla v)\,ds$. The log-likelihood is $\log p_0(x_0) = \log p_T(x_T) + \int_0^T \mathrm{tr}(\nabla v(x_s, s))\,ds$, where $\log p_T(x_T)$ is the standard Gaussian log-density at the noise endpoint. This is the only way to get exact (not lower-bound) likelihoods from a diffusion model, and it is what Song et al. 2021 used to report bits-per-dim on CIFAR-10 competitive with autoregressive models. In high dimensions the trace $\mathrm{tr}(\nabla v)$ is estimated stochastically via Hutchinson's trick: $\mathrm{tr}(\nabla v) = \mathbb{E}_\epsilon[\epsilon^\top \nabla v\, \epsilon]$ for $\epsilon \sim \mathcal{N}(0, I)$ or Rademacher, which costs one Jacobian-vector product per estimate.
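The whole algorithm fits in a few lines for a field where everything is checkable by hand. A sketch under assumed toy settings: a linear velocity field $v(x, t) = Ax$ with diagonal $A$, so the terminal density of the flow is Gaussian in closed form and Hutchinson with Rademacher probes is exact per sample. (In a real diffusion model, $v$ would be the learned probability flow drift and the Jacobian-vector product would come from autodiff.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([-0.5, -0.3])       # toy linear velocity field v(x, t) = A x
T, n_steps = 2.0, 2000
dt = T / n_steps

def v(x, t):
    return A @ x

x = np.array([1.3, -0.4])       # "data" point x_0, with p_0 = N(0, I)
x0 = x.copy()
div_integral = 0.0
for k in range(n_steps):
    # Hutchinson: eps^T (dv/dx) eps estimates tr(dv/dx); for this linear
    # field the Jacobian-vector product is simply A @ eps
    eps = rng.choice([-1.0, 1.0], size=2)
    div_integral += eps @ (A @ eps) * dt
    x = x + v(x, k * dt) * dt   # Euler step of the flow ODE

# log p_0(x_0) = log p_T(x_T) + int tr(dv/dx) dt, where the flow's true
# terminal density for this linear field is p_T = N(0, exp(2 A T))
var_T = np.exp(2.0 * np.diag(A) * T)
log_pT = -0.5 * np.sum(np.log(2 * np.pi * var_T) + x**2 / var_T)
log_p0_est = log_pT + div_integral
log_p0_true = -0.5 * np.sum(np.log(2 * np.pi) + x0**2)
print(log_p0_est, log_p0_true)  # should agree to integration tolerance
```

Note that the recovered value matches the standard-normal log-density of the data point; any residual gap is pure Euler integration error, illustrating why reported "exact" likelihoods carry a solver tolerance.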

Failure Mode

The Hutchinson estimator has variance that grows with the dimension; for high-dimensional images, getting a tight likelihood estimate requires many samples per data point. The ODE integration error also compounds with the trace-integral error; halving the step size doubles both the integration work and the trace-estimator work. Reported "exact" likelihoods always carry an integration tolerance that practitioners sometimes skip past in the comparison tables.

Connection to DDIM, EDM, and Flow Matching

DDIM (Song, Meng, and Ermon, ICLR 2021) was discovered before the continuous-time formulation; the original paper described a non-Markovian discrete forward process whose deterministic reverse is closed form. Once Song et al. 2021 introduced the SDE framework, DDIM was recognized as exactly the discretization of the probability flow ODE for the variance-preserving diffusion schedule. The "deterministic DDIM" sampler that ships in every diffusion library is integrating $dx/dt = -\tfrac{1}{2} \beta(t)\big[x + \nabla \log p_t(x)\big]$ with a first-order Euler step in a transformed time variable. There is no separate trained model; it is the same DDPM weights, sampled differently.
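A minimal sketch of the deterministic DDIM update (the standard $\eta = 0$ form), with a stand-in analytic $\epsilon$-predictor in place of trained weights — an assumption for illustration, not the original construction. If the "data" distribution is itself $\mathcal{N}(0, 1)$, then $\hat\epsilon(x, t) = \sqrt{1 - \bar\alpha_t}\,x$ is the exact predictor, and the sampler should approximately carry a unit-variance noise cloud to a unit-variance data cloud.

```python
import numpy as np

rng = np.random.default_rng(0)
n_T = 1000
betas = np.linspace(1e-4, 0.02, n_T)   # standard DDPM linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_hat(x, t):
    # exact epsilon-predictor when the data distribution is N(0, 1)
    return np.sqrt(1.0 - alpha_bar[t]) * x

x = rng.normal(size=100_000)           # start from the noise prior
for t in range(n_T - 1, 0, -1):
    e = eps_hat(x, t)
    # predicted clean sample, then deterministic (eta = 0) DDIM step
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * e) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * e

print(x.var())  # approximately 1: the marginal is (nearly) preserved
```

Because the update is deterministic, re-running the loop on the same starting noise reproduces the same outputs exactly — the property that makes DDIM inversion and editing possible.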

EDM (Karras, Aittala, Aila, and Laine, NeurIPS 2022) reparameterizes the probability flow ODE in terms of a noise level $\sigma$ rather than a time $t$, and applies preconditioning to the network so the singular behavior near the data manifold is absorbed into the parameterization. With Heun's second-order solver and the EDM preconditioner, 30-40 function evaluations match the FID of 1000-step DDPM sampling. DPM-Solver (Lu, Zhou, Bao, Chen, Li, and Zhu, NeurIPS 2022) goes further by exploiting the linear structure of the variance-preserving forward process: the linear term has a closed-form solution, so only the score-driven nonlinear term needs numerical integration, yielding 10-20 step samplers with little quality loss.

Flow matching (Lipman et al. 2023) takes the reframing one step further. Train a network $v_\theta(x, t)$ to directly regress the probability flow drift, conditional on a chosen forward path between data and noise. The training loss is $\mathbb{E}_{t, x_0, x_1, x_t}\big\|v_\theta(x_t, t) - u_t(x_t \mid x_0, x_1)\big\|^2$, where $u_t$ is the conditional velocity along the chosen path. The score parameterization disappears entirely; what remains is a direct regression on the velocity field of a continuous normalizing flow. For Gaussian source paths this reproduces score-matching diffusion exactly, but the framework also accommodates other interpolation paths (rectified flow, optimal-transport-coupled paths) and is the formulation behind Stable Diffusion 3.
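The objective is easy to state in code. A hedged sketch under assumed settings: the linear (rectified-flow) path $x_t = (1-t)x_0 + t\,x_1$ with conditional velocity $u = x_1 - x_0$, and a Gaussian source/target pair for which the loss-minimizing marginal velocity $\mathbb{E}[x_1 - x_0 \mid x_t, t]$ has a closed form, so we can verify that it beats a naive velocity model under the same loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 50_000, 3.0
x0 = rng.normal(size=(n, 2))            # source (noise) samples
x1 = rng.normal(size=(n, 2)) + mu       # target ("data") samples
t = rng.uniform(size=(n, 1))

# linear (rectified-flow) path and its conditional velocity target
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

def v_marginal(x, t):
    # closed-form minimizer E[x1 - x0 | x_t] for this Gaussian pair
    s2 = (1.0 - t) ** 2 + t ** 2
    return mu + (2.0 * t - 1.0) / s2 * (x - mu * t)

def cfm_loss(pred):
    # conditional flow matching loss: regression onto u
    return np.mean(np.sum((pred - u) ** 2, axis=1))

loss_opt = cfm_loss(v_marginal(xt, t))
loss_naive = cfm_loss(np.zeros_like(u))  # a deliberately bad velocity model
print(loss_opt, loss_naive)  # the marginal velocity achieves the lower loss
```

In training, `v_marginal` would be replaced by the network $v_\theta$ and the loss minimized by SGD; the point of the demo is that the regression target is the cheap conditional velocity, while the implicit minimizer is the marginal transport field.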

Worked Example: Variance-Preserving SDE

The variance-preserving (VP) forward SDE is $dX_t = -\tfrac{1}{2} \beta(t) X_t\,dt + \sqrt{\beta(t)}\,dB_t$ for a positive schedule $\beta(t)$. The diffusion coefficient is $\sigma(t) = \sqrt{\beta(t)}$, so $\sigma\sigma^\top = \beta(t)\, I$. The probability flow ODE is

$$\frac{dx}{dt} = -\tfrac{1}{2} \beta(t)\, x - \tfrac{1}{2} \beta(t)\, \nabla \log p_t(x) = -\tfrac{1}{2} \beta(t) \big[x + \nabla \log p_t(x)\big].$$

Compare with Anderson's reverse SDE for the same forward process: $dx = -\tfrac{1}{2} \beta(t) \big[x + 2 \nabla \log p_t(x)\big]\,dt + \sqrt{\beta(t)}\,d\bar{B}_t$. The deterministic ODE has a half score correction where the SDE has a full one. The factor of $2$ is what makes the SDE trajectory wander stochastically while the ODE trajectory stays on a single deterministic curve.

As a sanity check, suppose the VP process has reached its stationary marginal $p_t = \mathcal{N}(0, I)$ (the limit as $t \to \infty$ for the VP schedule). The score is $\nabla \log p_t(x) = -x$, and the ODE becomes $dx/dt = -\tfrac{1}{2} \beta(t)\,(x - x) = 0$. The deterministic transport correctly identifies that once the marginals have stopped evolving, no transport is needed. For finite-time intermediate marginals, integrate the ODE numerically with a learned score $s_\theta(x, t) \approx \nabla \log p_t(x)$ in place of the unknown true score; this is what every deterministic diffusion sampler does at inference time.
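The worked example can also be run in reverse, which is exactly the deterministic-sampler pattern: start from (approximate) noise at $t = T$, integrate the VP probability flow ODE backward with the analytic Gaussian score standing in for $s_\theta$, and recover the data distribution. A sketch with assumed constants ($\beta \equiv 1$, Gaussian data $\mathcal{N}(0, s_0^2)$, so $p_t = \mathcal{N}(0, v_t)$ with $v_t = s_0^2 e^{-t} + 1 - e^{-t}$):

```python
import numpy as np

rng = np.random.default_rng(0)
s0, T, n_steps, n_particles = 4.0, 10.0, 5000, 50_000
dt = T / n_steps

def var_t(t):
    return s0**2 * np.exp(-t) + 1.0 - np.exp(-t)

def score(x, t):
    return -x / var_t(t)        # stand-in for the learned s_theta(x, t)

x = rng.normal(0.0, np.sqrt(var_t(T)), n_particles)  # ~ N(0, 1) at t = T
for k in range(n_steps):
    t = T - k * dt
    # VP probability flow ODE dx/dt = -1/2 [x + score], stepped backward
    x = x - (-0.5 * (x + score(x, t))) * dt

print(x.var(), s0**2)  # the backward ODE recovers the data variance
```

Swapping the analytic `score` for a trained network turns this loop into a real (if crude, first-order) deterministic diffusion sampler.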

Common Confusions

Watch Out

Same marginals are not the same joint distribution

The probability flow ODE and the SDE produce the same time-marginals $p_t$, but they are very different processes pathwise. The SDE's trajectory is a Brownian-driven random function with Hölder-$1/2$ regularity; the ODE's trajectory is a smooth deterministic curve. Quantities that depend on more than one time slice — joint distributions of $(X_{t_1}, X_{t_2})$, exit times, path-functional expectations — generally differ between the two processes. The marginal-equivalence theorem only asserts equality at single times.

Watch Out

The probability flow ODE is unique among gradient-class transports, not among all transports

Many vector fields produce the same density evolution $p_t$. The probability flow ODE is the unique drift you get by absorbing the diffusion into the deterministic transport via the identity $\sigma \sigma^\top \nabla p = \sigma \sigma^\top (\nabla \log p)\, p$. If you allow vector fields outside this gradient-class structure (for example, adding a perturbation $w$ with $\nabla \cdot (w\, p) = 0$), you get a different ODE that produces the same marginals. Rectified flow exploits exactly this freedom by post-processing trajectories to make them straighter. The probability flow ODE is canonical because it is the unique gradient-class deterministic transport, not because it is the only deterministic transport.

Watch Out

DDIM is not a different model from DDPM

A common belief: DDPM and DDIM are different generative models, with DDIM being faster and lower quality. They are the same model. DDPM and DDIM share weights, training loss, and forward process. They differ only at sampling time: DDPM integrates the reverse SDE (stochastic) with first-order Euler steps; DDIM integrates the probability flow ODE (deterministic). The trained network $\epsilon_\theta(x, t)$ is identical. This is why you can take any pretrained DDPM checkpoint and run DDIM sampling on it without retraining. The same identity applies to score-SDE models and EDM samplers.

Exercises

ExerciseCore

Problem

Verify the probability flow ODE for standard Brownian motion: $dX_t = dB_t$ with $X_0 \sim \mathcal{N}(0, 1)$ on $\mathbb{R}$. Compute the marginal $p_t$ and the score $\nabla \log p_t$. Write the probability flow ODE explicitly. Solve the ODE starting from $Y_0 = y_0$ and confirm that $Y_t \sim \mathcal{N}(0, 1 + t)$ when $Y_0 \sim \mathcal{N}(0, 1)$.

ExerciseAdvanced

Problem

Derive the instantaneous change-of-variables formula $\frac{d}{dt} \log p_t(x_t) = -\mathrm{tr}(\nabla_x v(x_t, t))$ from the continuity equation $\partial_t p + \nabla \cdot (v\, p) = 0$ along characteristics of the ODE $dx/dt = v(x, t)$.


Next Topics

  • Diffusion Models: the generative-modeling family in which the probability flow ODE is the deterministic sampler.
  • Score Matching: the training objective for the score $\nabla \log p_t$ that appears in the ODE drift.
  • Time Reversal of SDEs: Anderson's stochastic dual to the probability flow ODE; same marginals, different paths.
  • Fokker-Planck Equation: the PDE machinery the marginal-equivalence proof reduces to.
  • Stochastic Differential Equations: the parent framework for both the SDE and its ODE counterpart.

Last reviewed: April 18, 2026
