
Scientific ML

Continuous Normalizing Flows

Generative models that replace the stack of invertible layers in a normalizing flow with a learned ODE, trading the per-layer Jacobian determinant for an O(d) trace via Hutchinson's estimator.

Advanced · Tier 3 · ~45 min

Why This Matters

Discrete normalizing flows pay an architectural tax: every layer must be invertible with a tractable Jacobian determinant. That constraint is what forced coupling layers, masked autoregressive structure, and the alternation patterns that limit expressiveness.

Continuous normalizing flows (CNFs) replace the discrete stack with a single Neural ODE. Two consequences fall out for free. First, invertibility is automatic: the ODE flow map is a diffeomorphism whenever the drift is Lipschitz. Second, the change-of-variables formula becomes an instantaneous one, in which the log-determinant of the Jacobian is replaced by an integral of its trace. The trace of a $d \times d$ Jacobian costs $O(d^2)$ deterministically, but Hutchinson's stochastic estimator gives an unbiased $O(d)$ approximation. That is the FFJORD trick (Grathwohl et al., 2018) and what made CNFs scale beyond toy problems.

CNFs sit between classical normalizing flows and diffusion models. The probability flow ODE used in diffusion sampling is, formally, a CNF whose drift is the learned score. Understanding CNFs is the cleanest route to seeing why diffusion-style methods replaced flows for image generation.

The Instantaneous Change of Variables

Let $z(t) \in \mathbb{R}^d$ evolve under the ODE $\dot z = f_\theta(z(t), t)$ from $t = 0$ to $t = T$. Write $p_t$ for the density of $z(t)$. Discrete flows accumulate $\log\det$ terms layer by layer; the continuous version replaces this with a differential equation for the log-density along the trajectory.

Theorem

Instantaneous Change of Variables

Statement

Along the ODE trajectory $z(t)$,

$$\frac{d \log p_t(z(t))}{dt} = -\mathrm{tr}\!\left(\frac{\partial f_\theta(z(t), t)}{\partial z}\right).$$

Integrating from $0$ to $T$ gives

$$\log p_T(x) = \log p_0(z(0)) - \int_0^T \mathrm{tr}\!\left(\frac{\partial f_\theta(z(t), t)}{\partial z}\right) dt$$

where $z(T) = x$ and $z(0)$ is recovered by integrating the ODE backward from $x$.

Intuition

A discrete invertible layer multiplies the local volume by $|\det J|$. A continuous flow does this in the limit of infinitely many infinitesimal steps. The product of determinants becomes an integral, and Jacobi's formula gives $\frac{d}{dt}\log\det e^{tA} = \mathrm{tr}(A)$. So the Jacobian's log-determinant is replaced by the time integral of its trace.
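Jacobi's identity is easy to sanity-check numerically. The sketch below is numpy-only; the matrix exponential is a truncated Taylor series (an assumption made so the snippet stays self-contained — it is adequate for a small, well-scaled matrix) and the matrix $A$ is an arbitrary random choice:

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential via truncated Taylor series (fine for small, well-scaled M)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
A = 0.3 * rng.standard_normal((4, 4))   # small norm so the series converges quickly

for t in (0.5, 1.0, 2.0):
    sign, logdet = np.linalg.slogdet(expm_taylor(t * A))
    assert sign > 0                               # det(e^{tA}) = e^{t tr A} > 0
    assert abs(logdet - t * np.trace(A)) < 1e-8   # Jacobi: log det e^{tA} = t tr(A)
```

The determinant of the flow map is always positive here, which is exactly the automatic-invertibility point above: the exponential of a finite-trace drift never collapses volume to zero.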

Proof Sketch

Apply the transport equation (continuity equation) for the density of an ODE flow: $\partial_t p_t + \nabla \cdot (p_t f_\theta) = 0$. Expand the divergence: $\partial_t p_t = -p_t \, \mathrm{tr}(\partial f_\theta / \partial z) - \nabla p_t \cdot f_\theta$. The total derivative of $\log p_t$ along the trajectory $z(t)$ is $\frac{d}{dt}\log p_t(z(t)) = \frac{1}{p_t}(\partial_t p_t + \nabla p_t \cdot f_\theta)$. Substituting cancels the advection term and leaves $-\mathrm{tr}(\partial f_\theta / \partial z)$.

Why It Matters

The Jacobian determinant of a discrete coupling layer cost $O(d)$ only because the architecture forced a triangular structure. The continuous version costs $O(d^2)$ for an exact trace (sum of $d$ diagonal entries, each requiring one backward pass) and $O(d)$ for the Hutchinson estimator. The trace is architecture-free: $f_\theta$ can be any neural network. No coupling, no masking, no input partitioning.
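As a minimal numeric check of the instantaneous change of variables, here is a 1D linear drift where every quantity is available in closed form (the rate $\alpha$, horizon $T$, and data point $x$ below are arbitrary illustrative choices):

```python
import numpy as np

alpha, T = 0.7, 1.5
x = 0.4                                  # data point observed at time T

# Backward solve of ż = -αz is analytic here: z(0) = e^{αT} x
z0 = np.exp(alpha * T) * x

# tr(∂f/∂z) = -α, so the trace integral is -αT and
# log p_T(x) = log p_0(z0) - ∫ tr dt = log p_0(z0) + αT
log_p0 = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)   # standard normal base
log_pT_cnf = log_p0 + alpha * T

# Analytic marginal: z(T) = e^{-αT} z(0) ~ N(0, e^{-2αT})
var_T = np.exp(-2 * alpha * T)
log_pT_exact = -0.5 * x**2 / var_T - 0.5 * np.log(2 * np.pi * var_T)

assert abs(log_pT_cnf - log_pT_exact) < 1e-12
```

The contraction $\dot z = -\alpha z$ shrinks volume at rate $\alpha$, and the theorem books exactly that shrinkage as the $+\alpha T$ correction to the base log-density.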

Failure Mode

The trace integral runs along the entire trajectory. If the ODE is stiff or requires many adaptive solver steps, training cost balloons. CNFs that chase highly localized density features force the solver to take small steps near those features, making each gradient evaluation expensive. This is the empirical reason CNFs have been displaced by score-based diffusion for image generation.

FFJORD: Making the Trace Cheap

The $O(d^2)$ cost of the deterministic trace is still too high for image-scale data. FFJORD (Grathwohl et al., 2018) makes the trace stochastic.

Proposition

Hutchinson Trace Estimator for CNFs

Statement

For any matrix $A \in \mathbb{R}^{d \times d}$ and noise vector $\epsilon \in \mathbb{R}^d$ with $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon \epsilon^\top] = I$,

$$\mathrm{tr}(A) = \mathbb{E}_\epsilon\!\left[\epsilon^\top A \, \epsilon\right].$$

Applied to the CNF,

$$\mathrm{tr}\!\left(\frac{\partial f_\theta}{\partial z}\right) = \mathbb{E}_\epsilon\!\left[\epsilon^\top \frac{\partial f_\theta}{\partial z} \epsilon\right]$$

and $\epsilon^\top (\partial f_\theta / \partial z)$ is a single vector-Jacobian product, available from one backward pass through $f_\theta$. Cost: $O(d)$ per sample.
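A numpy sketch of one draw of the estimator. Real implementations get $(\partial f_\theta/\partial z)\,\epsilon$ from autodiff in a single pass; here a central-difference Jacobian-vector product stands in so the example stays self-contained, and the toy drift $\tanh(Wz)$ is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
W = rng.standard_normal((d, d)) / np.sqrt(d)

def f(z):
    """Toy drift f(z) = tanh(Wz); its Jacobian is diag(1 - tanh²(Wz)) @ W."""
    return np.tanh(W @ z)

def hutchinson_sample(z, eps, h=1e-5):
    """One draw of ε·((∂f/∂z)ε). The product (∂f/∂z)ε is taken by central
    differences here; in a real CNF autodiff supplies it without forming J."""
    jvp = (f(z + h * eps) - f(z - h * eps)) / (2 * h)
    return eps @ jvp

z = rng.standard_normal(d)
J = np.diag(1 - np.tanh(W @ z) ** 2) @ W      # O(d²): built only for comparison
exact = np.trace(J)

samples = [hutchinson_sample(z, rng.choice([-1.0, 1.0], size=d))
           for _ in range(20000)]
assert abs(np.mean(samples) - exact) < 0.2     # unbiased; error shrinks as 1/√n
```

The full Jacobian is assembled only to verify the estimate; the estimator itself touches $f$ through directional evaluations alone, which is the whole point.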

Intuition

You don't need every diagonal entry of $\partial f_\theta / \partial z$. You need their sum. The random quadratic form $\epsilon^\top (\partial f / \partial z)\, \epsilon$ is unbiased for that sum because the cross terms have zero expectation when $\epsilon$ has identity covariance. Drawing a fresh $\epsilon$ each step gives stochastic gradients of the log-likelihood.

Proof Sketch

Expand $\epsilon^\top A \epsilon = \sum_{i,j} A_{ij} \epsilon_i \epsilon_j$. Take expectation: $\mathbb{E}[\epsilon_i \epsilon_j] = \delta_{ij}$ by the identity-covariance assumption. So the sum collapses to $\sum_i A_{ii} = \mathrm{tr}(A)$.

Why It Matters

Common choices for $\epsilon$ are standard normal or Rademacher (entries $\pm 1$). Rademacher has lower variance because $\epsilon_i^2 = 1$ deterministically, removing one source of noise. With a single random projection per ODE step, training a CNF costs the same per step as training the underlying neural ODE, plus one extra vector-Jacobian product.
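The Rademacher advantage is easy to see numerically. The sketch below uses an explicit diagonally dominant matrix (an arbitrary toy, not a real CNF Jacobian) so the exact trace is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 30, 20000
# Large diagonal, small off-diagonal: the regime where Rademacher shines.
A = np.diag(rng.uniform(1, 3, size=d)) + 0.05 * rng.standard_normal((d, d))

def hutch(A, probes):
    """One ε^T A ε value per row of `probes`."""
    return np.einsum("ni,ij,nj->n", probes, A, probes)

gauss = hutch(A, rng.standard_normal((n, d)))
rade = hutch(A, rng.choice([-1.0, 1.0], size=(n, d)))

# Both are unbiased for tr(A)...
assert abs(gauss.mean() - np.trace(A)) < 0.5
assert abs(rade.mean() - np.trace(A)) < 0.5
# ...but Rademacher kills the diagonal contribution (ε_i² ≡ 1), so its
# variance comes only from off-diagonal terms and is far smaller here.
assert rade.var() < 0.1 * gauss.var()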

Failure Mode

Variance of the Hutchinson estimator scales with $\|\partial f_\theta / \partial z\|_F^2$. For drifts with large Jacobians (sharp features, near-singular dynamics), the estimator's variance dominates and gradients become noisy. In practice this caps the practical step count and limits density sharpness.

Training and Sampling

A single ODE solve serves each task: integrate backward from the data for $\log p_T(x)$, forward from the base distribution for a sample.

Density evaluation (given data $x$):

  1. Solve the ODE backward from $x$ at $t = T$ to $z(0)$ at $t = 0$, jointly with the log-density ODE $\dot \ell = -\epsilon^\top (\partial f_\theta / \partial z)\, \epsilon$ with $\ell(T) = 0$.
  2. Return $\log p_T(x) = \log p_0(z(0)) - \ell(0)$.

Sampling (no data):

  1. Draw $z(0) \sim p_0$ (typically standard Gaussian).
  2. Solve the ODE forward from $t = 0$ to $t = T$.
  3. Return $x = z(T)$.

Both use the adjoint sensitivity method for gradients with $O(1)$ memory in the depth dimension. The whole package trains by maximum likelihood on $\log p_T(x)$.
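The density-evaluation recipe fits in a few lines for a 1D drift, where the trace is just the scalar derivative $f'(z)$ and no Hutchinson noise is needed. This sketch uses fixed-step RK4 rather than an adaptive solver, skips the adjoint, and takes $f(z) = \tanh(z)$ as an arbitrary toy drift; it cross-checks the result against the discrete change of variables through the forward map:

```python
import numpy as np

def f(z):                      # 1D toy drift; in 1D the "trace" is just f'(z)
    return np.tanh(z)

def df(z):
    return 1.0 - np.tanh(z) ** 2

def rk4(rhs, state, t0, t1, steps=200):
    """Fixed-step RK4 for an autonomous system; h < 0 integrates backward."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * h * k1)
        k3 = rhs(state + 0.5 * h * k2)
        k4 = rhs(state + h * k3)
        state = state + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

T, x = 1.0, 0.8

# Step 1: backward from (x, 0) at t = T; augmented state (z, ℓ) with ℓ̇ = -f'(z).
rhs = lambda s: np.array([f(s[0]), -df(s[0])])
z0, ell0 = rk4(rhs, np.array([x, 0.0]), T, 0.0)
log_p0 = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)   # standard normal base
log_pT = log_p0 - ell0                             # log p_T(x) = log p_0(z0) - ℓ(0)

# Cross-check against the discrete change of variables p_T(x) = p_0(z0)/|dx/dz0|,
# with dx/dz0 measured by finite-differencing the forward solve.
fwd = lambda z: rk4(lambda s: np.array([f(s[0])]), np.array([z]), 0.0, T)[0]
eps = 1e-5
dxdz0 = (fwd(z0 + eps) - fwd(z0 - eps)) / (2 * eps)
assert abs(fwd(z0) - x) < 1e-7                     # backward/forward round trip
assert abs(log_pT - (log_p0 - np.log(abs(dxdz0)))) < 1e-6
```

The second assertion is exactly the continuous-vs-discrete correspondence: the accumulated trace integral equals the log-derivative of the flow map.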

CNFs vs Discrete Flows

| Property | Discrete flow (RealNVP, MAF) | Continuous flow (FFJORD) |
| --- | --- | --- |
| Per-layer Jacobian | Triangular by construction | Trace via Hutchinson |
| Architecture constraint | Coupling, masking, alternation | Any Lipschitz network |
| Cost per evaluation | $O(K \cdot d)$ for $K$ layers | $O(N_{\text{steps}} \cdot d)$ for solver steps |
| Memory in depth | $O(K)$ activations | $O(1)$ via adjoint |
| Density estimate | Exact | Exact in expectation, stochastic gradients |
| Sample quality on images | Lower than diffusion | Lower than diffusion |

The architectural freedom of CNFs is real but did not translate into state-of-the-art image generation. Score-based and diffusion models won the empirical race because they sidestep invertibility entirely while still giving access to a probability flow ODE for likelihood when needed.

Connection to Diffusion Models

Score-based diffusion models (Song et al., 2021) train a score network $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ and sample by reversing an SDE. The corresponding probability flow ODE,

$$dx = \left[\mu(x, t) - \tfrac{1}{2} g(t)^2 \, s_\theta(x, t)\right] dt,$$

has the same marginals as the SDE at every $t$. This ODE is a continuous normalizing flow, with drift $f_\theta(x, t) = \mu - \tfrac{1}{2} g^2 s_\theta$. So a trained diffusion model gives you a CNF for free, with the score playing the role of the learned drift. See the neural SDEs page for the full construction.

The CNF perspective explains why deterministic diffusion samplers (DDIM, DPM-Solver, EDM Heun) work: they are integrating a CNF whose drift is the learned score. It also explains why exact likelihood is reachable through diffusion: integrate the trace of the score's Jacobian along the probability flow trajectory.
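A toy check that the probability flow ODE reproduces the SDE's marginals, in the one setting where the score is available in closed form: a VP-style SDE with constant $\beta$ and a Gaussian $p_0$ (both assumptions of this sketch, with arbitrary parameter values):

```python
import numpy as np

# VP forward SDE dx = -½βx dt + √β dW with constant β; if p_0 = N(0, σ0²)
# the marginal stays Gaussian, σ_t² = σ0² e^{-βt} + (1 - e^{-βt}),
# and the score is analytic: ∇_x log p_t(x) = -x / σ_t².
beta, sigma0 = 1.0, 2.0
var = lambda t: sigma0**2 * np.exp(-beta * t) + 1 - np.exp(-beta * t)

def drift(x, t):
    # probability flow ODE: dx/dt = μ - ½ g² s = -½βx + ½βx/σ_t²
    return -0.5 * beta * x - 0.5 * beta * (-x / var(t))

# Push N(0, σ0²) samples through the ODE with RK4; the empirical spread
# should match the SDE marginal σ_T even though no noise is injected.
rng = np.random.default_rng(3)
x = sigma0 * rng.standard_normal(100000)
t, T, steps = 0.0, 1.0, 200
h = T / steps
for _ in range(steps):
    k1 = drift(x, t)
    k2 = drift(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = drift(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = drift(x + h * k3, t + h)
    x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    t += h

assert abs(x.std() - np.sqrt(var(T))) < 0.02
```

Every particle moves deterministically, yet the population's density matches the stochastic process at every time slice; that equality of marginals is what makes deterministic diffusion samplers valid.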

Common Confusions

Watch Out

CNFs are not normalizing flows with more layers

A discrete flow stacks $K$ invertible layers with a learned Jacobian per layer. A CNF is a single ODE: in the limit $K \to \infty$ with infinitesimal steps, the per-layer determinants compose into a trace integral. The architectures available to each are different. A CNF can use a vanilla MLP for the drift; a discrete flow cannot use that MLP as a layer.

Watch Out

The trace is exact in expectation, not in any single backward pass

The Hutchinson estimator is unbiased, but each gradient step uses one (or a few) random $\epsilon$. Training is stochastic gradient ascent on log-likelihood, and the noise from the estimator adds to the noise from minibatching. With enough Monte Carlo samples per step you recover the deterministic trace, but in practice one sample per step is standard.

Watch Out

CNFs warp density but cannot change its dimensionality

The base distribution at $t = 0$ and the data distribution at $t = T$ have the same dimensionality. The flow does not change manifold dimension; it warps the density. If the data lives on a low-dimensional manifold inside $\mathbb{R}^d$, the CNF will assign nonzero density off that manifold (the integrated trace remains finite). This is a known limitation shared with discrete flows.

Summary

  • CNFs replace the stack of invertible layers in a normalizing flow with a single Neural ODE.
  • The instantaneous change of variables turns the per-layer $\log\det J$ into a time integral of $\mathrm{tr}(\partial f_\theta / \partial z)$.
  • FFJORD makes the trace cheap by replacing it with the Hutchinson stochastic estimator, giving an unbiased $O(d)$ cost per evaluation.
  • The drift $f_\theta$ can be any Lipschitz network, removing the coupling/masking architectural constraints that limited discrete flows.
  • The probability flow ODE used in diffusion sampling is formally a CNF whose drift is the learned score; CNFs are the bridge from classical flows to diffusion-based likelihood.

Exercises

Exercise · Core

Problem

Consider the 1D linear CNF $\dot z = -\alpha z$ with $\alpha > 0$, integrated from $t = 0$ to $t = T$, starting from $z(0) \sim \mathcal{N}(0, 1)$. Use the instantaneous change of variables to compute $\log p_T(x)$ in closed form. Compare against the analytic solution $z(T) = e^{-\alpha T} z(0)$.

Exercise · Advanced

Problem

For a CNF with drift $f_\theta(z, t) \in \mathbb{R}^d$, you have a vector $\epsilon \in \mathbb{R}^d$ with $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon \epsilon^\top] = I$. Show that the variance of the Hutchinson estimator $\hat T = \epsilon^\top (\partial f_\theta / \partial z)\, \epsilon$ depends on the off-diagonal entries of $\partial f_\theta / \partial z$ when $\epsilon$ is standard Gaussian, and that switching to Rademacher $\epsilon$ (entries $\pm 1$ with equal probability) eliminates the contribution of $\epsilon_i^2$ variance. Why does this make Rademacher the default choice in FFJORD implementations?

References

Canonical:

  • Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, arXiv:1806.07366), Section 4 introduces the continuous normalizing flow construction
  • Grathwohl, Chen, Bettencourt, Sutskever, Duvenaud, "FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019, arXiv:1810.01367), Sections 2-3

Survey:

  • Papamakarios, Nalisnick, Rezende, Mohamed, Lakshminarayanan, "Normalizing Flows for Probabilistic Modeling and Inference" (JMLR 2021), Chapter 5 covers continuous flows
  • Kobyzev, Prince, Brubaker, "Normalizing Flows: An Introduction and Review of Current Methods" (TPAMI 2020), Section IV

Related:

  • Hutchinson, "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines" (Communications in Statistics 1990) — original trace estimator
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021, arXiv:2011.13456), Section 4.3 derives the probability flow ODE
  • Kidger, "On Neural Differential Equations" (PhD thesis 2022, arXiv:2202.02435), Chapter 5

Next Topics

  • Diffusion models: the dominant generative paradigm whose probability flow ODE is a CNF in disguise
  • Neural SDEs: stochastic generalization, recovering CNFs as the deterministic limit
  • Adjoint sensitivity method: how CNF gradients are computed without storing solver state

Last reviewed: April 17, 2026
