
Scientific ML

Continuous Normalizing Flows

Generative models that replace the stack of invertible layers in a normalizing flow with a learned ODE, trading the per-layer Jacobian determinant for an O(d) trace via Hutchinson's estimator.

Advanced · Tier 3 · ~45 min

Why This Matters

Discrete normalizing flows pay an architectural tax: every layer must be invertible with a tractable Jacobian determinant. That constraint is what forced coupling layers, masked autoregressive structure, and the alternation patterns that limit expressiveness.

Continuous normalizing flows (CNFs) replace the discrete stack with a single Neural ODE. Two consequences fall out for free. First, invertibility is automatic: the ODE flow map is a diffeomorphism whenever the drift is Lipschitz. Second, the change-of-variables formula becomes an instantaneous one, in which the log-determinant of the Jacobian is replaced by an integral of its trace. The trace of a $d \times d$ Jacobian costs $O(d^2)$ deterministically, but Hutchinson's stochastic estimator gives an unbiased $O(d)$ approximation. That is the FFJORD trick (Grathwohl et al., 2018) and what made CNFs scale beyond toy problems.

CNFs sit between classical normalizing flows and diffusion models. The probability flow ODE used in diffusion sampling is, formally, a CNF whose drift is the learned score. Understanding CNFs is the cleanest route to seeing why diffusion-style methods replaced flows for image generation.

The Instantaneous Change of Variables

Let $z(t) \in \mathbb{R}^d$ evolve under the ODE $\dot z = f_\theta(z(t), t)$ from $t = 0$ to $t = T$. Write $p_t$ for the density of $z(t)$. Discrete flows accumulate $\log\det$ terms layer by layer; the continuous version replaces this with a differential equation for the log-density along the trajectory.

Theorem

Instantaneous Change of Variables

Statement

Along the ODE trajectory $z(t)$,

$$\frac{d \log p_t(z(t))}{dt} = -\mathrm{tr}\!\left(\frac{\partial f_\theta(z(t), t)}{\partial z}\right).$$

Integrating from $0$ to $T$ gives

$$\log p_T(x) = \log p_0(z(0)) - \int_0^T \mathrm{tr}\!\left(\frac{\partial f_\theta(z(t), t)}{\partial z}\right) dt$$

where $z(T) = x$ and $z(0)$ is recovered by integrating the ODE backward from $x$.

Intuition

A discrete invertible layer multiplies the local volume by $|\det J|$. A continuous flow does this in the limit of infinitely many infinitesimal steps. The product of determinants becomes an integral, and Jacobi's formula gives $\frac{d}{dt}\log\det e^{tA} = \mathrm{tr}(A)$. So the Jacobian's log-determinant is replaced by the time integral of its trace.
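Jacobi's identity is easy to sanity-check numerically. The sketch below is numpy-only; the matrix exponential is a truncated Taylor series (an assumption made so the snippet stays self-contained — it is adequate for a small, well-scaled matrix) and the matrix $A$ is an arbitrary random choice:

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential via truncated Taylor series (fine for small, well-scaled M)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
A = 0.3 * rng.standard_normal((4, 4))   # small norm so the series converges quickly

for t in (0.5, 1.0, 2.0):
    sign, logdet = np.linalg.slogdet(expm_taylor(t * A))
    assert sign > 0                               # det(e^{tA}) = e^{t tr A} > 0
    assert abs(logdet - t * np.trace(A)) < 1e-8   # Jacobi: log det e^{tA} = t tr(A)
```

The determinant of the flow map is always positive here, which is exactly the automatic-invertibility point above: the exponential of a finite-trace drift never collapses volume to zero.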

Proof Sketch

Apply the transport equation (continuity equation) for the density of an ODE flow: $\partial_t p_t + \nabla \cdot (p_t f_\theta) = 0$. Expand the divergence: $\partial_t p_t = -p_t \, \mathrm{tr}(\partial f_\theta / \partial z) - \nabla p_t \cdot f_\theta$. The total derivative of $\log p_t$ along the trajectory $z(t)$ is $\frac{d}{dt}\log p_t(z(t)) = \frac{1}{p_t}(\partial_t p_t + \nabla p_t \cdot f_\theta)$. Substituting cancels the advection term and leaves $-\mathrm{tr}(\partial f_\theta / \partial z)$.

Why It Matters

The Jacobian determinant of a discrete coupling layer cost $O(d)$ only because the architecture forced a triangular structure. The continuous version costs $O(d^2)$ for an exact trace (sum of $d$ diagonal entries, each requiring one backward pass) and $O(d)$ for the Hutchinson estimator. The trace is architecture-free: $f_\theta$ can be any neural network. No coupling, no masking, no input partitioning.
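As a minimal numeric check of the instantaneous change of variables, here is a 1D linear drift where every quantity is available in closed form (the rate $\alpha$, horizon $T$, and data point $x$ below are arbitrary illustrative choices):

```python
import numpy as np

alpha, T = 0.7, 1.5
x = 0.4                                  # data point observed at time T

# Backward solve of ż = -αz is analytic here: z(0) = e^{αT} x
z0 = np.exp(alpha * T) * x

# tr(∂f/∂z) = -α, so the trace integral is -αT and
# log p_T(x) = log p_0(z0) - ∫ tr dt = log p_0(z0) + αT
log_p0 = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)   # standard normal base
log_pT_cnf = log_p0 + alpha * T

# Analytic marginal: z(T) = e^{-αT} z(0) ~ N(0, e^{-2αT})
var_T = np.exp(-2 * alpha * T)
log_pT_exact = -0.5 * x**2 / var_T - 0.5 * np.log(2 * np.pi * var_T)

assert abs(log_pT_cnf - log_pT_exact) < 1e-12
```

The contraction $\dot z = -\alpha z$ shrinks volume at rate $\alpha$, and the theorem books exactly that shrinkage as the $+\alpha T$ correction to the base log-density.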

Failure Mode

The trace integral runs along the entire trajectory. If the ODE is stiff or requires many adaptive solver steps, training cost balloons. CNFs that chase highly localized density features force the solver to take small steps near those features, making each gradient evaluation expensive. This is the empirical reason CNFs have been displaced by score-based diffusion for image generation.

FFJORD: Making the Trace Cheap

The $O(d^2)$ cost of the deterministic trace is still too high for image-scale data. FFJORD (Grathwohl et al., 2018) makes the trace stochastic.

Proposition

Hutchinson Trace Estimator for CNFs

Statement

For any matrix $A \in \mathbb{R}^{d \times d}$ and noise vector $\epsilon \in \mathbb{R}^d$ with $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon \epsilon^\top] = I$,

$$\mathrm{tr}(A) = \mathbb{E}_\epsilon\!\left[\epsilon^\top A \, \epsilon\right].$$

Applied to the CNF,

$$\mathrm{tr}\!\left(\frac{\partial f_\theta}{\partial z}\right) = \mathbb{E}_\epsilon\!\left[\epsilon^\top \frac{\partial f_\theta}{\partial z} \epsilon\right]$$

and $\epsilon^\top (\partial f_\theta / \partial z)$ is a single vector-Jacobian product, available from one backward pass through $f_\theta$. Cost: $O(d)$ per sample.
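A numpy sketch of one draw of the estimator. Real implementations get $(\partial f_\theta/\partial z)\,\epsilon$ from autodiff in a single pass; here a central-difference Jacobian-vector product stands in so the example stays self-contained, and the toy drift $\tanh(Wz)$ is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
W = rng.standard_normal((d, d)) / np.sqrt(d)

def f(z):
    """Toy drift f(z) = tanh(Wz); its Jacobian is diag(1 - tanh²(Wz)) @ W."""
    return np.tanh(W @ z)

def hutchinson_sample(z, eps, h=1e-5):
    """One draw of ε·((∂f/∂z)ε). The product (∂f/∂z)ε is taken by central
    differences here; in a real CNF autodiff supplies it without forming J."""
    jvp = (f(z + h * eps) - f(z - h * eps)) / (2 * h)
    return eps @ jvp

z = rng.standard_normal(d)
J = np.diag(1 - np.tanh(W @ z) ** 2) @ W      # O(d²): built only for comparison
exact = np.trace(J)

samples = [hutchinson_sample(z, rng.choice([-1.0, 1.0], size=d))
           for _ in range(20000)]
assert abs(np.mean(samples) - exact) < 0.2     # unbiased; error shrinks as 1/√n
```

The full Jacobian is assembled only to verify the estimate; the estimator itself touches $f$ through directional evaluations alone, which is the whole point.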

Intuition

You don't need every diagonal entry of $\partial f_\theta / \partial z$. You need their sum. The random quadratic form $\epsilon^\top (\partial f / \partial z)\, \epsilon$ is unbiased for that sum because the cross terms have zero expectation when $\epsilon$ has identity covariance. Drawing a fresh $\epsilon$ each step gives stochastic gradients of the log-likelihood.

Proof Sketch

Expand $\epsilon^\top A \epsilon = \sum_{i,j} A_{ij} \epsilon_i \epsilon_j$. Take expectation: $\mathbb{E}[\epsilon_i \epsilon_j] = \delta_{ij}$ by the identity-covariance assumption. So the sum collapses to $\sum_i A_{ii} = \mathrm{tr}(A)$.

Why It Matters

Common choices for $\epsilon$ are standard normal or Rademacher (entries $\pm 1$). Rademacher has lower variance because $\epsilon_i^2 = 1$ deterministically, removing one source of noise. With a single random projection per ODE step, training a CNF costs the same per step as training the underlying neural ODE, plus one extra vector-Jacobian product.
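The Rademacher advantage is easy to see numerically. The sketch below uses an explicit diagonally dominant matrix (an arbitrary toy, not a real CNF Jacobian) so the exact trace is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 20000
# Large diagonal, small off-diagonal: the regime where Rademacher shines.
A = np.diag(rng.uniform(1, 3, size=d)) + 0.05 * rng.standard_normal((d, d))

def hutch(A, probes):
    """One ε^T A ε value per row of `probes`."""
    return np.einsum("ni,ij,nj->n", probes, A, probes)

gauss = hutch(A, rng.standard_normal((n, d)))
rade = hutch(A, rng.choice([-1.0, 1.0], size=(n, d)))

# Both are unbiased for tr(A)...
assert abs(gauss.mean() - np.trace(A)) < 0.5
assert abs(rade.mean() - np.trace(A)) < 0.5
# ...but Rademacher kills the diagonal contribution (ε_i² ≡ 1), so its
# variance comes only from off-diagonal terms and is far smaller here.
assert rade.var() < 0.1 * gauss.var()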

Failure Mode

Variance of the Hutchinson estimator scales with $\|\partial f_\theta / \partial z\|_F^2$. For drifts with large Jacobians (sharp features, near-singular dynamics), the estimator's variance dominates and gradients become noisy. In practice this caps the practical step count and limits density sharpness.

Training and Sampling

A single ODE solve serves each task: integrate backward from the data for $\log p_T(x)$, forward from the base distribution for a sample.

Density evaluation (given data $x$):

  1. Solve the ODE backward from $x$ at $t = T$ to $z(0)$ at $t = 0$, jointly with the log-density ODE $\dot \ell = -\epsilon^\top (\partial f_\theta / \partial z)\, \epsilon$ with $\ell(T) = 0$.
  2. Return $\log p_T(x) = \log p_0(z(0)) - \ell(0)$.

Sampling (no data):

  1. Draw $z(0) \sim p_0$ (typically standard Gaussian).
  2. Solve the ODE forward from $t = 0$ to $t = T$.
  3. Return $x = z(T)$.

Both use the adjoint sensitivity method for gradients with $O(1)$ memory in the depth dimension. The whole package trains by maximum likelihood on $\log p_T(x)$.
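The density-evaluation recipe fits in a few lines for a 1D drift, where the trace is just the scalar derivative $f'(z)$ and no Hutchinson noise is needed. This sketch uses fixed-step RK4 rather than an adaptive solver, skips the adjoint, and takes $f(z) = \tanh(z)$ as an arbitrary toy drift; it cross-checks the result against the discrete change of variables through the forward map:

```python
import numpy as np

def f(z):                      # 1D toy drift; in 1D the "trace" is just f'(z)
    return np.tanh(z)

def df(z):
    return 1.0 - np.tanh(z) ** 2

def rk4(rhs, state, t0, t1, steps=200):
    """Fixed-step RK4 for an autonomous system; h < 0 integrates backward."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * h * k1)
        k3 = rhs(state + 0.5 * h * k2)
        k4 = rhs(state + h * k3)
        state = state + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

T, x = 1.0, 0.8

# Step 1: backward from (x, 0) at t = T; augmented state (z, ℓ) with ℓ̇ = -f'(z).
rhs = lambda s: np.array([f(s[0]), -df(s[0])])
z0, ell0 = rk4(rhs, np.array([x, 0.0]), T, 0.0)
log_p0 = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)   # standard normal base
log_pT = log_p0 - ell0                             # log p_T(x) = log p_0(z0) - ℓ(0)

# Cross-check against the discrete change of variables p_T(x) = p_0(z0)/|dx/dz0|,
# with dx/dz0 measured by finite-differencing the forward solve.
fwd = lambda z: rk4(lambda s: np.array([f(s[0])]), np.array([z]), 0.0, T)[0]
eps = 1e-5
dxdz0 = (fwd(z0 + eps) - fwd(z0 - eps)) / (2 * eps)
assert abs(fwd(z0) - x) < 1e-7                     # backward/forward round trip
assert abs(log_pT - (log_p0 - np.log(abs(dxdz0)))) < 1e-6
```

The second assertion is exactly the continuous-vs-discrete correspondence: the accumulated trace integral equals the log-derivative of the flow map.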

CNFs vs Discrete Flows

| Property | Discrete flow (RealNVP, MAF) | Continuous flow (FFJORD) |
| --- | --- | --- |
| Per-layer Jacobian | Triangular by construction | Trace via Hutchinson |
| Architecture constraint | Coupling, masking, alternation | Any Lipschitz network |
| Cost per evaluation | $O(K \cdot d)$ for $K$ layers | $O(N_{\text{steps}} \cdot d)$ for solver steps |
| Memory in depth | $O(K)$ activations | $O(1)$ via adjoint |
| Density estimate | Exact | Exact in expectation, stochastic gradients |
| Sample quality on images | Lower than diffusion | Lower than diffusion |

The architectural freedom of CNFs is real but did not translate into state-of-the-art image generation. Score-based and diffusion models won the empirical race because they sidestep invertibility entirely while still giving access to a probability flow ODE for likelihood when needed.

Connection to Diffusion Models

Score-based diffusion models (Song et al., 2021) train a score network $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ and sample by reversing an SDE. The corresponding probability flow ODE,

$$dx = \left[\mu(x, t) - \tfrac{1}{2} g(t)^2 \, s_\theta(x, t)\right] dt,$$

has the same marginals as the SDE at every $t$. This ODE is a continuous normalizing flow, with drift $f_\theta(x, t) = \mu - \tfrac{1}{2} g^2 s_\theta$. So a trained diffusion model gives you a CNF for free, with the score playing the role of the learned drift. See the neural SDEs page for the full construction.

The CNF perspective explains why deterministic diffusion samplers (DDIM, DPM-Solver, EDM Heun) work: they are integrating a CNF whose drift is the learned score. It also explains why exact likelihood is reachable through diffusion: integrate the trace of the score's Jacobian along the probability flow trajectory.
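A toy check that the probability flow ODE reproduces the SDE's marginals, in the one setting where the score is available in closed form: a VP-style SDE with constant $\beta$ and a Gaussian $p_0$ (both assumptions of this sketch, with arbitrary parameter values):

```python
import numpy as np

# VP forward SDE dx = -½βx dt + √β dW with constant β; if p_0 = N(0, σ0²)
# the marginal stays Gaussian, σ_t² = σ0² e^{-βt} + (1 - e^{-βt}),
# and the score is analytic: ∇_x log p_t(x) = -x / σ_t².
beta, sigma0 = 1.0, 2.0
var = lambda t: sigma0**2 * np.exp(-beta * t) + 1 - np.exp(-beta * t)

def drift(x, t):
    # probability flow ODE: dx/dt = μ - ½ g² s = -½βx + ½βx/σ_t²
    return -0.5 * beta * x - 0.5 * beta * (-x / var(t))

# Push N(0, σ0²) samples through the ODE with RK4; the empirical spread
# should match the SDE marginal σ_T even though no noise is injected.
rng = np.random.default_rng(3)
x = sigma0 * rng.standard_normal(100000)
t, T, steps = 0.0, 1.0, 200
h = T / steps
for _ in range(steps):
    k1 = drift(x, t)
    k2 = drift(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = drift(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = drift(x + h * k3, t + h)
    x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    t += h

assert abs(x.std() - np.sqrt(var(T))) < 0.02
```

Every particle moves deterministically, yet the population's density matches the stochastic process at every time slice; that equality of marginals is what makes deterministic diffusion samplers valid.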

Common Confusions

Watch Out

CNFs are not normalizing flows with more layers

A discrete flow stacks $K$ invertible layers with a learned Jacobian per layer. A CNF is a single ODE: in the limit $K \to \infty$ with infinitesimal steps, the per-layer determinants compose into a trace integral. The architectures available to each are different. A CNF can use a vanilla MLP for the drift; a discrete flow cannot use that MLP as a layer.

Watch Out

The trace is exact in expectation, not in any single backward pass

The Hutchinson estimator is unbiased, but each gradient step uses one (or a few) random $\epsilon$. Training is stochastic gradient ascent on log-likelihood, and the noise from the estimator adds to the noise from minibatching. With enough Monte Carlo samples per step you recover the deterministic trace, but in practice one sample per step is standard.

Watch Out

CNFs warp density but cannot change its dimensionality

The base distribution at $t = 0$ and the data distribution at $t = T$ have the same dimensionality. The flow does not change manifold dimension; it warps the density. If the data lives on a low-dimensional manifold inside $\mathbb{R}^d$, the CNF will assign nonzero density off that manifold (the integrated trace remains finite). This is a known limitation shared with discrete flows.

Summary

  • CNFs replace the stack of invertible layers in a normalizing flow with a single Neural ODE.
  • The instantaneous change of variables turns the per-layer $\log\det J$ into a time integral of $\mathrm{tr}(\partial f_\theta / \partial z)$.
  • FFJORD makes the trace cheap by replacing it with the Hutchinson stochastic estimator, giving an unbiased $O(d)$ cost per evaluation.
  • The drift $f_\theta$ can be any Lipschitz network, removing the coupling/masking architectural constraints that limited discrete flows.
  • The probability flow ODE used in diffusion sampling is formally a CNF whose drift is the learned score; CNFs are the bridge from classical flows to diffusion-based likelihood.

Exercises

Exercise · Core

Problem

Consider the 1D linear CNF $\dot z = -\alpha z$ with $\alpha > 0$, integrated from $t = 0$ to $t = T$, starting from $z(0) \sim \mathcal{N}(0, 1)$. Use the instantaneous change of variables to compute $\log p_T(x)$ in closed form. Compare against the analytic solution $z(T) = e^{-\alpha T} z(0)$.

Exercise · Advanced

Problem

For a CNF with drift $f_\theta(z, t) \in \mathbb{R}^d$, you have a vector $\epsilon \in \mathbb{R}^d$ with $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon \epsilon^\top] = I$. Show that the variance of the Hutchinson estimator $\hat T = \epsilon^\top (\partial f_\theta / \partial z)\, \epsilon$ depends on the off-diagonal entries of $\partial f_\theta / \partial z$ when $\epsilon$ is standard Gaussian, and that switching to Rademacher $\epsilon$ (entries $\pm 1$ with equal probability) eliminates the contribution of $\epsilon_i^2$ variance. Why does this make Rademacher the default choice in FFJORD implementations?

References

Canonical:

  • Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, arXiv:1806.07366), Section 4 introduces the continuous normalizing flow construction
  • Grathwohl, Chen, Bettencourt, Sutskever, Duvenaud, "FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019, arXiv:1810.01367), Sections 2-3

Survey:

  • Papamakarios, Nalisnick, Rezende, Mohamed, Lakshminarayanan, "Normalizing Flows for Probabilistic Modeling and Inference" (JMLR 2021), Chapter 5 covers continuous flows
  • Kobyzev, Prince, Brubaker, "Normalizing Flows: An Introduction and Review of Current Methods" (TPAMI 2020), Section IV

Related:

  • Hutchinson, "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines" (Communications in Statistics 1990) — original trace estimator
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021, arXiv:2011.13456), Section 4.3 derives the probability flow ODE
  • Kidger, "On Neural Differential Equations" (PhD thesis 2022, arXiv:2202.02435), Chapter 5

Next Topics

  • Diffusion models: the dominant generative paradigm whose probability flow ODE is a CNF in disguise
  • Neural SDEs: stochastic generalization, recovering CNFs as the deterministic limit
  • Adjoint sensitivity method: how CNF gradients are computed without storing solver state

Last reviewed: April 17, 2026
