Continuous Normalizing Flows
Generative models that replace the stack of invertible layers in a normalizing flow with a learned ODE, trading the per-layer Jacobian determinant for an O(d) trace via Hutchinson's estimator.
Why This Matters
Discrete normalizing flows pay an architectural tax: every layer must be invertible with a tractable Jacobian determinant. That constraint is what forced coupling layers, masked autoregressive structure, and the alternation patterns that limit expressiveness.
Continuous normalizing flows (CNFs) replace the discrete stack with a single Neural ODE. Two consequences fall out for free. First, invertibility is automatic: the ODE flow map is a diffeomorphism whenever the drift is Lipschitz. Second, the change-of-variables formula becomes an instantaneous one, in which the log-determinant of the Jacobian is replaced by an integral of its trace. Computing that trace deterministically costs $O(d)$ backward passes, but Hutchinson's stochastic estimator gives an unbiased approximation from a single one. That is the FFJORD trick (Grathwohl et al., 2018) and what made CNFs scale beyond toy problems.
CNFs sit between classical normalizing flows and diffusion models. The probability flow ODE used in diffusion sampling is, formally, a CNF whose drift is the learned score. Understanding CNFs is the cleanest route to seeing why diffusion-style methods replaced flows for image generation.
The Instantaneous Change of Variables
Let $x(t)$ evolve under the ODE $\dot{x} = f(x, t)$ from $t = 0$ to $t = T$. Write $p_t$ for the density of $x(t)$. Discrete flows accumulate $\log\lvert\det J\rvert$ terms layer by layer; the continuous version replaces this with a differential equation for the log-density along the trajectory.
Instantaneous Change of Variables
Statement
Along the ODE trajectory $x(t)$,
$$\frac{d}{dt} \log p_t(x(t)) = -\mathrm{tr}\!\left(\frac{\partial f}{\partial x}(x(t), t)\right).$$
Integrating from $t = 0$ to $t = T$ gives
$$\log p_T(x(T)) = \log p_0(x(0)) - \int_0^T \mathrm{tr}\!\left(\frac{\partial f}{\partial x}(x(t), t)\right) dt,$$
where $p_0$ is the base density and $x(0)$ is recovered by integrating the ODE backward from $x(T)$.
Intuition
A discrete invertible layer multiplies the local volume by $\lvert\det J\rvert$. A continuous flow does this in the limit of infinitely many infinitesimal steps. The product of determinants becomes an integral, and Jacobi's formula turns $\frac{d}{dt}\log\lvert\det J_t\rvert$ into $\mathrm{tr}\!\left(\frac{\partial f}{\partial x}\right)$. So the Jacobian's log-determinant is replaced by the time integral of its trace.
Proof Sketch
Apply the transport equation (continuity equation) for the density of an ODE flow: $\partial_t p + \nabla \cdot (p f) = 0$. Expand the divergence: $\nabla \cdot (p f) = (\nabla p) \cdot f + p\, \nabla \cdot f$. The total derivative of $\log p$ along the trajectory is $\frac{d}{dt} \log p_t(x(t)) = \frac{1}{p}\left(\partial_t p + (\nabla p) \cdot f\right)$. Substituting the continuity equation cancels the advection term and leaves $\frac{d}{dt} \log p_t(x(t)) = -\nabla \cdot f = -\mathrm{tr}\!\left(\frac{\partial f}{\partial x}\right)$.
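As a sanity check, the identity can be verified end to end on a linear drift, where both the flow map and the pushforward density are available in closed form. This is an illustrative sketch (the drift matrix and test point are made up, not from any CNF library):

```python
import numpy as np

# Linear drift f(x) = A @ x with symmetric A: the flow map is
# x(T) = expm(A*T) @ x(0), and d/dt log p_t(x(t)) = -tr(A) is constant in t.
A = np.array([[0.3, 0.1], [0.1, 0.5]])
T, d = 1.0, 2
x0 = np.array([0.7, -0.4])

# Matrix exponential via eigendecomposition (valid because A is symmetric).
w, V = np.linalg.eigh(A)
M = V @ np.diag(np.exp(w * T)) @ V.T
xT = M @ x0

def gauss_logpdf(x, cov):
    # log-density of N(0, cov) at x
    k = len(x)
    return -0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + x @ np.linalg.solve(cov, x))

# Instantaneous change of variables, integrated in closed form:
log_pT_cnf = gauss_logpdf(x0, np.eye(d)) - T * np.trace(A)

# Ground truth: pushing N(0, I) through the linear map M gives N(0, M M^T).
log_pT_exact = gauss_logpdf(xT, M @ M.T)
assert np.isclose(log_pT_cnf, log_pT_exact)
```

The agreement is exact (up to floating point) because the trace is constant along the trajectory, so the time integral reduces to $T \cdot \mathrm{tr}(A)$.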
Why It Matters
The Jacobian determinant of a discrete coupling layer cost $O(d)$ only because the architecture forced a triangular structure. The continuous version costs $O(d)$ backward passes for an exact trace (sum of diagonal entries, each requiring one backward pass) and a single backward pass for the Hutchinson estimator. The trace is architecture-free: $f$ can be any neural network. No coupling, no masking, no input partitioning.
Failure Mode
The trace is integrated along the entire trajectory. If the ODE is stiff or requires many adaptive solver steps, training cost balloons. CNFs that chase highly localized density features force the solver to take small steps near those features, making each gradient evaluation expensive. This is the empirical reason CNFs have been displaced by score-based diffusion for image generation.
FFJORD: Making the Trace Cheap
The cost of the deterministic trace is still too high for image-scale data. FFJORD (Grathwohl et al., 2018) makes the trace stochastic.
Hutchinson Trace Estimator for CNFs
Statement
For any matrix $A \in \mathbb{R}^{d \times d}$ and noise vector $\epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Cov}(\epsilon) = I$,
$$\mathrm{tr}(A) = \mathbb{E}_\epsilon\!\left[\epsilon^\top A \epsilon\right].$$
Applied to the CNF,
$$\frac{d}{dt} \log p_t(x(t)) = -\mathbb{E}_\epsilon\!\left[\epsilon^\top \frac{\partial f}{\partial x} \epsilon\right],$$
and the inner product $\epsilon^\top \frac{\partial f}{\partial x}$ is a single vector-Jacobian product. Cost: $O(d)$ per sample.
Intuition
You don't need every diagonal entry of $\frac{\partial f}{\partial x}$. You need their sum. A random projection $\epsilon^\top \frac{\partial f}{\partial x} \epsilon$ is unbiased for that sum because cross terms have zero expectation when $\epsilon$ has identity covariance. Drawing a fresh $\epsilon$ each step gives stochastic gradients of the log-likelihood.
Proof Sketch
Expand $\epsilon^\top A \epsilon = \sum_{i,j} A_{ij}\, \epsilon_i \epsilon_j$. Take expectation: $\mathbb{E}[\epsilon_i \epsilon_j] = \delta_{ij}$ by the identity-covariance assumption. So the sum collapses to $\sum_i A_{ii} = \mathrm{tr}(A)$.
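The collapse to the trace is easy to see numerically. A minimal NumPy sketch, where the matrix `A` is a random stand-in for the Jacobian rather than a trained drift:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))        # stand-in for the Jacobian df/dx

# Monte Carlo average of eps^T A eps over Gaussian probes eps ~ N(0, I).
n = 100_000
eps = rng.normal(size=(n, d))
estimates = ((eps @ A) * eps).sum(axis=1)   # one eps^T A eps per row

# The average converges to tr(A); any single estimate is noisy.
assert abs(estimates.mean() - np.trace(A)) < 1.0
```

Note that each estimate needs only the product $\epsilon^\top A$ followed by a dot with $\epsilon$, never the full diagonal of $A$; that is exactly what makes the estimator cheap when $A$ is only available through vector-Jacobian products.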
Why It Matters
Common choices for $\epsilon$ are standard normal or Rademacher (entries $\pm 1$ with equal probability). Rademacher has lower variance because $\epsilon_i^2 = 1$ deterministically, so the diagonal terms contribute exactly $\mathrm{tr}(A)$ with no noise. With a single random projection per ODE step, training a CNF costs the same per step as training the underlying neural ODE, plus one extra vector-Jacobian product.
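The variance gap can be checked directly. In this sketch the Jacobian is again a random stand-in, with its diagonal inflated (an arbitrary choice of mine) so the gap is visible at modest sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 100_000
A = rng.normal(size=(d, d))       # stand-in for the Jacobian df/dx
A[np.diag_indices(d)] += 3.0      # inflate the diagonal to make the gap visible

def hutchinson(eps):
    # One eps^T A eps estimate per row of eps.
    return ((eps @ A) * eps).sum(axis=1)

gauss = hutchinson(rng.normal(size=(n, d)))
rade = hutchinson(rng.choice([-1.0, 1.0], size=(n, d)))

# Both are unbiased for tr(A)...
assert abs(gauss.mean() - np.trace(A)) < 2.0
assert abs(rade.mean() - np.trace(A)) < 2.0
# ...but Rademacher kills the diagonal terms' noise (eps_i^2 = 1 always),
# so its variance is strictly smaller.
assert gauss.var() > rade.var()
```

Both estimators still carry variance from the off-diagonal entries; Rademacher only removes the diagonal's contribution, which is why it lowers rather than eliminates the noise.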
Failure Mode
Variance of the Hutchinson estimator scales with the squared magnitude of the Jacobian's off-diagonal entries. For drifts with large Jacobians (sharp features, near-singular dynamics), the estimator's variance dominates and gradients become noisy. In practice this caps the practical step count and limits density sharpness.
Training and Sampling
Both operations are a single solve of an augmented ODE that carries the state together with the accumulated log-density change.
Density evaluation (given data $x$):
- Solve the ODE backward from $x$ at $t = T$ to $z$ at $t = 0$, jointly accumulating $\Delta = \int_0^T \mathrm{tr}\!\left(\frac{\partial f}{\partial x}(x(t), t)\right) dt$ along the way.
- Return $\log p_T(x) = \log p_0(z) - \Delta$.
Sampling (no data):
- Draw $z \sim p_0$ (typically standard Gaussian).
- Solve the ODE forward from $x(0) = z$ at $t = 0$ to $t = T$.
- Return $x(T)$.
Both use the adjoint sensitivity method for gradients, with $O(1)$ memory in the depth dimension. The whole package trains by maximum likelihood: maximize $\log p_T(x)$ over the training data.
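The density-evaluation recipe can be sketched with a fixed-step solver on a toy 1D linear drift, where the answer is known in closed form. This is a hypothetical, self-contained example; real implementations use adaptive solvers and the adjoint method:

```python
import numpy as np

# Toy 1D linear drift f(x, t) = a*x, so tr(df/dx) = a and the exact
# log-density is available for checking.
a, T = 0.8, 1.0

def aug_rhs(t, state):
    x, _ = state
    # d/dt of [x, running integral of tr(df/dx)]
    return np.array([a * x, a])

def rk4(rhs, state, t0, t1, steps=200):
    # Fixed-step RK4; a stand-in for an adaptive solver.
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = rhs(t, state)
        k2 = rhs(t + h / 2, state + h / 2 * k1)
        k3 = rhs(t + h / 2, state + h / 2 * k2)
        k4 = rhs(t + h, state + h * k3)
        state = state + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return state

x_data = 1.3
# Integrate backward from t = T to t = 0; the second component ends at
# -int_0^T tr(df/dx) dt, the correction in the change of variables.
z, trace_corr = rk4(aug_rhs, np.array([x_data, 0.0]), T, 0.0)
log_pT = -0.5 * (np.log(2 * np.pi) + z**2) + trace_corr  # log p_0(z) + correction

# Analytic check: this flow pushes N(0, 1) forward to N(0, exp(2*a*T)).
log_pT_exact = -0.5 * (np.log(2 * np.pi) + 2 * a * T
                       + x_data**2 * np.exp(-2 * a * T))
assert np.isclose(log_pT, log_pT_exact, atol=1e-6)
assert np.isclose(z, x_data * np.exp(-a * T), atol=1e-6)
```

Sampling is the same solve run forward from $z \sim p_0$, dropping the log-density component if only a sample is needed.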
CNFs vs Discrete Flows
| Property | Discrete flow (RealNVP, MAF) | Continuous flow (FFJORD) |
|---|---|---|
| Per-layer Jacobian | Triangular by construction | Trace via Hutchinson |
| Architecture constraint | Coupling, masking, alternation | Any Lipschitz network |
| Cost per evaluation | $O(L)$ network passes for $L$ layers | $O(S)$ drift evaluations for $S$ solver steps |
| Memory in depth | $O(L)$ activations | $O(1)$ via adjoint |
| Density estimate | Exact | Exact in expectation, stochastic gradients |
| Sample quality on images | Lower than diffusion | Lower than diffusion |
The architectural freedom of CNFs is real but did not translate into state-of-the-art image generation. Score-based and diffusion models won the empirical race because they sidestep invertibility entirely while still giving access to a probability flow ODE for likelihood when needed.
Connection to Diffusion Models
Score-based diffusion models (Song et al., 2021) train a score network $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ and sample by reversing an SDE. The corresponding probability flow ODE,
$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x),$$
has the same marginals $p_t$ as the SDE at every $t$. This ODE is a continuous normalizing flow, with drift $f(x, t) - \tfrac{1}{2} g(t)^2 s_\theta(x, t)$. So a trained diffusion model gives you a CNF for free, with the score playing the role of the learned drift. See the neural SDEs page for the full construction.
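For a pure-diffusion SDE with Gaussian marginals the score is known exactly, so the marginal-matching claim can be verified by integrating the probability flow ODE on a cloud of samples. A toy sketch; the SDE, parameters, and Euler discretization are illustrative choices, not from the cited papers:

```python
import numpy as np

# Toy SDE dx = g dW with x(0) ~ N(0, s0^2): the marginals are
# p_t = N(0, s0^2 + g^2 t) and the score is grad_x log p_t(x) = -x / (s0^2 + g^2 t),
# both in closed form. The probability flow ODE dx/dt = f - (1/2) g^2 * score
# (here f = 0) is then a CNF with a known drift.
rng = np.random.default_rng(0)
s0, g, T = 1.0, 1.5, 2.0
n, steps = 100_000, 500
dt = T / steps

x = rng.normal(scale=s0, size=n)          # samples from p_0
for k in range(steps):
    t = k * dt
    score = -x / (s0**2 + g**2 * t)
    x = x + dt * (-0.5 * g**2 * score)    # Euler step of the probability flow ODE

# The deterministic flow reproduces the SDE's marginal variance at t = T.
assert abs(x.var() - (s0**2 + g**2 * T)) < 0.15
```

No noise is injected during integration, yet the transported samples match the SDE's marginal at $T$: the stochasticity of the SDE has been traded for a deterministic expansion of the sample cloud.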
The CNF perspective explains why deterministic diffusion samplers (DDIM, DPM-Solver, EDM Heun) work: they are integrating a CNF whose drift is the learned score. It also explains why exact likelihood is reachable through diffusion: integrate the trace of the score's Jacobian along the probability flow trajectory.
Common Confusions
CNFs are not normalizing flows with more layers
A discrete flow stacks invertible layers, each with a tractable Jacobian determinant. A CNF is a single ODE: in the limit of infinitely many infinitesimal steps, the per-layer determinants compose into a trace integral. The architectures available to each are different. A CNF can use a vanilla MLP for the drift; a discrete flow cannot use that MLP as a layer.
The trace is exact in expectation, not in any single backward pass
The Hutchinson estimator is unbiased, but each gradient step uses one (or a few) random $\epsilon$. Training is stochastic gradient ascent on log-likelihood, and the noise from the estimator adds to the noise from minibatching. With enough Monte Carlo samples per step you recover the deterministic trace, but in practice one sample per step is standard.
CNFs do not change the dimensionality of the data
The base distribution at $t = 0$ and the data distribution at $t = T$ have the same dimensionality. The flow does not change manifold dimension; it warps the density. If the data lives on a low-dimensional manifold inside $\mathbb{R}^d$, the CNF will assign nonzero density off that manifold (the integrated trace remains finite). This is a known limitation shared with discrete flows.
Summary
- CNFs replace the stack of invertible layers in a normalizing flow with a single Neural ODE.
- The instantaneous change of variables turns the per-layer $\log\lvert\det J\rvert$ into a time integral of $\mathrm{tr}\!\left(\frac{\partial f}{\partial x}\right)$.
- FFJORD makes the trace cheap by replacing it with the Hutchinson stochastic estimator, giving an unbiased estimate at the cost of one vector-Jacobian product per evaluation.
- The drift can be any Lipschitz network, removing the coupling/masking architectural constraints that limited discrete flows.
- The probability flow ODE used in diffusion sampling is formally a CNF whose drift is the learned score; CNFs are the bridge from classical flows to diffusion-based likelihood.
Exercises
Problem
Consider the 1D linear CNF with drift $f(x, t) = a x$, integrated from $t = 0$ to $t = T$, starting from $p_0 = \mathcal{N}(0, 1)$. Use the instantaneous change of variables to compute $\log p_T(x(T))$ in closed form. Compare against the analytic solution $p_T = \mathcal{N}(0, e^{2aT})$.
Problem
For a CNF with drift $f$, the Hutchinson estimator uses a probe vector $\epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Cov}(\epsilon) = I$. Show that the variance of $\epsilon^\top A \epsilon$, with $A = \partial f/\partial x$, picks up a contribution from the diagonal entries of $A$ when $\epsilon$ is standard Gaussian, and that switching to Rademacher (entries $\pm 1$ with equal probability) eliminates that diagonal contribution to the variance. Why does this make Rademacher the default choice in FFJORD implementations?
References
Canonical:
- Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, arXiv:1806.07366), Section 4 introduces the continuous normalizing flow construction
- Grathwohl, Chen, Bettencourt, Sutskever, Duvenaud, "FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019, arXiv:1810.01367), Sections 2-3
Survey:
- Papamakarios, Nalisnick, Rezende, Mohamed, Lakshminarayanan, "Normalizing Flows for Probabilistic Modeling and Inference" (JMLR 2021), Chapter 5 covers continuous flows
- Kobyzev, Prince, Brubaker, "Normalizing Flows: An Introduction and Review of Current Methods" (TPAMI 2020), Section IV
Related:
- Hutchinson, "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines" (Communications in Statistics 1990) — original trace estimator
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021, arXiv:2011.13456), Section 4.3 derives the probability flow ODE
- Kidger, "On Neural Differential Equations" (PhD thesis 2022, arXiv:2202.02435), Chapter 5
Next Topics
- Diffusion models: the dominant generative paradigm whose probability flow ODE is a CNF in disguise
- Neural SDEs: stochastic generalization, recovering CNFs as the deterministic limit
- Adjoint sensitivity method: how CNF gradients are computed without storing solver state
Last reviewed: April 17, 2026
Prerequisites
Foundations this topic depends on.
- Neural ODEs and Continuous-Depth Networks (Layer 4)
- Classical ODEs: Existence, Stability, and Numerical Methods (Layer 1)
- Continuity in R^n (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- The Jacobian Matrix (Layer 0A)
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Automatic Differentiation (Layer 1)
- Normalizing Flows (Layer 3)
- Common Probability Distributions (Layer 0A)