

Normalizing Flows

Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula.


Why This Matters

Normalizing flows are the only deep generative model family that provides exact, tractable log-likelihoods without variational approximations or adversarial training. This makes them theoretically clean: you know exactly what you are optimizing. They also provide exact sampling and exact density evaluation in one model. However, flows have largely been displaced by diffusion models for image generation because of architectural constraints imposed by invertibility.

Understanding flows is still valuable: they clarify what you gain and lose by requiring invertibility, and the change-of-variables formula underlies many other methods including continuous normalizing flows and flow matching.

The Core Idea

Start with a simple base distribution $p_Z(z) = \mathcal{N}(z; 0, I)$. Apply a sequence of invertible, differentiable transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$ to get $x = f(z)$. The density of $x$ is determined by the change of variables formula.

Definition

Normalizing Flow

A normalizing flow is a sequence of invertible transformations $f_1, f_2, \ldots, f_K$ mapping a base distribution $p_Z$ to a target distribution $p_X$. "Normalizing" refers to the change of variables that ensures the transformed density integrates to 1. "Flow" refers to the successive transformations that warp the density.

The Change of Variables Formula

Theorem

Change of Variables for Normalizing Flows

Statement

If $x = f(z)$ where $f: \mathbb{R}^d \to \mathbb{R}^d$ is a diffeomorphism and $z \sim p_Z$, then:

$$\log p_X(x) = \log p_Z(f^{-1}(x)) - \log \left|\det \frac{\partial f}{\partial z}\right|_{z=f^{-1}(x)}$$

For a composition $f = f_K \circ \cdots \circ f_1$:

$$\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|$$

where $z_0 = f^{-1}(x)$ and $z_k = f_k(z_{k-1})$.

Intuition

The Jacobian determinant measures how much $f$ locally stretches or compresses volume. If $f$ expands a region by a factor of 10, the density in that region must decrease by a factor of 10 to keep the total probability at 1. The log-determinant accounts for this volume change.

Proof Sketch

Start from the requirement $\int p_X(x)\, dx = 1$. Substitute $x = f(z)$, so $dx = |\det J_f|\, dz$. Then $p_X(f(z))\, |\det J_f| = p_Z(z)$, giving $p_X(x) = p_Z(f^{-1}(x)) / |\det J_f|$. Take logs.
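The substitution above can be checked numerically in one dimension. The sketch below uses the illustrative flow $x = f(z) = e^z$ with a standard normal base (a choice made for this sketch, not taken from the text) and confirms that the transformed density still integrates to 1:

```python
import numpy as np

# Change of variables in 1D for x = f(z) = exp(z), z ~ N(0, 1).
# Here f^{-1}(x) = ln x and |det J_f| = exp(z) = x,
# so p_X(x) = p_Z(ln x) / x for x > 0.

def p_z(z):
    # standard normal density
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def p_x(x):
    # base density evaluated at f^{-1}(x), divided by the volume change x
    return p_z(np.log(x)) / x

x = np.linspace(1e-6, 50.0, 200_000)
px = p_x(x)
total = np.sum(0.5 * (px[1:] + px[:-1]) * np.diff(x))  # trapezoid rule
print(f"integral of p_X over (0, 50) = {total:.4f}")
```

The integral comes out as 1 up to truncation and quadrature error, as the proof sketch requires.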

Why It Matters

This is the entire basis of normalizing flows. Unlike VAEs (which optimize a lower bound) or GANs (which use adversarial training), flows optimize the exact log-likelihood directly. No approximation, no mode collapse, no posterior gap. The cost is that you must design $f$ so that both $f^{-1}$ and $\det J_f$ are tractable to compute.

Failure Mode

Computing $\det J_f$ for a general $d \times d$ matrix costs $O(d^3)$. For high-dimensional data (images with $d > 10^4$), this is prohibitive unless the Jacobian has special structure (triangular, block-diagonal, etc.). This architectural constraint is the central limitation of flows.

Architectural Solutions

Coupling Layers (RealNVP)

Proposition

Coupling Layer Jacobian is Triangular

Statement

For a coupling layer that splits $z = (z_a, z_b)$ and computes:

$$x_a = z_a, \quad x_b = z_b \odot \exp(s(z_a)) + t(z_a)$$

where $s$ and $t$ are arbitrary neural networks, the Jacobian is lower triangular with determinant:

$$\det J = \prod_j \exp(s(z_a)_j) = \exp\left(\sum_j s(z_a)_j\right)$$

This costs $O(d)$ to compute, not $O(d^3)$.

Intuition

Since $x_a = z_a$ (identity), the top-left block of the Jacobian is $I$. Since $x_a$ does not depend on $z_b$ at all, the top-right block is zero, so the Jacobian is block lower triangular; the dependence of $x_b$ on $z_a$ through $s$ and $t$ only fills in the bottom-left block. The determinant of a triangular matrix is the product of its diagonal entries.

Proof Sketch

Write the full Jacobian in block form: $J = \begin{pmatrix} I & 0 \\ \partial x_b / \partial z_a & \operatorname{diag}(\exp(s(z_a))) \end{pmatrix}$. The determinant of a block-triangular matrix is the product of the determinants of the diagonal blocks: $\det(I) \cdot \det(\operatorname{diag}(\exp(s))) = \exp\left(\sum_j s_j\right)$.

Why It Matters

This is the key architectural trick that makes flows practical. The networks $s$ and $t$ can be arbitrarily complex (deep ResNets, attention layers) without affecting the cost of the log-determinant computation. Expressiveness comes from stacking many coupling layers with alternating partitions.

Failure Mode

A single coupling layer leaves half the dimensions unchanged. You need to alternate which dimensions are "active" across layers. With poor alternation patterns, some dimensions may never interact, limiting expressiveness.
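A minimal NumPy sketch of a single affine coupling layer, with toy linear-plus-tanh maps standing in for the networks $s$ and $t$ (all names, shapes, and the conditioner choice here are illustrative), checking the analytic log-determinant against a brute-force finite-difference Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
half = d // 2
Ws = rng.normal(size=(half, half))  # stand-in weights for the "scale network"
Wt = rng.normal(size=(half, half))  # stand-in weights for the "translation network"

def s(za):
    # toy scale network: linear map squashed by tanh for numerical stability
    return np.tanh(za @ Ws.T)

def t(za):
    # toy translation network
    return za @ Wt.T

def coupling_forward(z):
    za, zb = z[:half], z[half:]
    xa = za                                  # identity on the first half
    xb = zb * np.exp(s(za)) + t(za)          # affine transform of the second half
    logdet = np.sum(s(za))                   # log|det J| = sum_j s(z_a)_j
    return np.concatenate([xa, xb]), logdet

z = rng.normal(size=d)
x, logdet = coupling_forward(z)

# Brute-force Jacobian via forward differences
eps = 1e-6
J = np.zeros((d, d))
for j in range(d):
    zp = z.copy()
    zp[j] += eps
    J[:, j] = (coupling_forward(zp)[0] - x) / eps

numeric_logdet = np.log(abs(np.linalg.det(J)))
print(logdet, numeric_logdet)  # the two values should match closely
```

The analytic $O(d)$ log-determinant agrees with the $O(d^3)$ numerical one, which is the whole point of the coupling construction.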

Autoregressive Flows

Autoregressive flows (MAF, IAF) use the autoregressive property: dimension $x_i$ depends only on $x_1, \ldots, x_{i-1}$. The Jacobian is triangular by construction.

MAF (Masked Autoregressive Flow): fast density evaluation (parallel), slow sampling (sequential, one dimension at a time).

IAF (Inverse Autoregressive Flow): fast sampling (parallel), slow density evaluation. The inverse of MAF.

The tradeoff between MAF and IAF is a direct consequence of the asymmetry between forward and inverse passes in autoregressive models.
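This asymmetry can be made concrete with a toy affine autoregressive map, where a strictly lower-triangular matrix stands in for a masked conditioner (the functions and names below are illustrative, not MAF's actual architecture):

```python
import numpy as np

# Toy autoregressive flow: mu_i and sigma_i depend only on x_{<i},
# enforced by a strictly lower-triangular conditioning matrix W.
# Density direction: all z_i computed at once. Sampling: sequential loop.

rng = np.random.default_rng(1)
d = 5
W = np.tril(rng.normal(size=(d, d)), k=-1)  # row i sees only entries < i

def mu_sigma(x):
    h = W @ x
    return h, np.exp(0.1 * np.tanh(h))  # shift and positive scale

def density_direction(x):
    # z_i = (x_i - mu_i(x_{<i})) / sigma_i(x_{<i}): fully parallel in i
    mu, sigma = mu_sigma(x)
    return (x - mu) / sigma

def sampling_direction(z):
    # x_i = mu_i(x_{<i}) + sigma_i(x_{<i}) * z_i: must loop over i,
    # because each x_i needs the previously generated dimensions
    x = np.zeros(d)
    for i in range(d):
        mu, sigma = mu_sigma(x)  # row i only uses the already-filled x[:i]
        x[i] = mu[i] + sigma[i] * z[i]
    return x

z = rng.normal(size=d)
x = sampling_direction(z)
print(np.allclose(density_direction(x), z))  # round trip recovers z
```

IAF simply swaps which direction gets the parallel computation, which is why its sampling is fast and its density evaluation is sequential.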

Why Flows Lost to Diffusion

Flows require exact invertibility, which constrains architecture: input and output must have the same dimensionality, and every layer must be invertible. This prevents using standard architectures (U-Nets, standard ResNets). Diffusion models avoid this by learning a denoising process that does not require invertibility, allowing more expressive architectures. The result: diffusion models achieve better sample quality on images with simpler training procedures.

Flows remain useful for density estimation, variational inference (as flexible posterior approximations), and physics simulations where exact likelihood matters.

Common Confusions

Watch Out

Flows are not just fancy coordinate transforms

While each layer is a coordinate transformation, the composition of many layers with learned parameters can represent highly complex distributions. The universal approximation results for flows (Huang et al., 2018) show that sufficiently deep flows can approximate any target density.

Watch Out

The base distribution choice matters less than you think

A standard Gaussian base is used in nearly all flow models. The flow layers are expressive enough to warp any unimodal base into a complex multimodal target. Using a more complex base distribution rarely helps in practice.

Key Takeaways

  • Flows compute exact log-likelihoods via the change of variables formula
  • The computational bottleneck is the Jacobian determinant, which costs $O(d^3)$ in general but $O(d)$ with coupling or autoregressive structure
  • Coupling layers (RealNVP) let $s$ and $t$ be arbitrary networks while keeping the determinant tractable
  • MAF is fast for density evaluation; IAF is fast for sampling
  • Diffusion models displaced flows for image generation because invertibility constrains architecture, but flows remain valuable where exact likelihood is needed

Exercises

ExerciseCore

Problem

Write the change of variables formula for a 1D normalizing flow $x = f(z) = z^3$ where $z \sim \mathcal{N}(0,1)$. What is $p_X(x)$?

ExerciseAdvanced

Problem

A coupling layer splits $z \in \mathbb{R}^4$ as $z_a = (z_1, z_2)$ and $z_b = (z_3, z_4)$. The scale network outputs $s(z_a) = (1.0, -0.5)$ and the translation network outputs $t(z_a) = (0.3, 0.7)$. Compute the output $x$ and the log-determinant of the Jacobian when $z = (1, 2, 3, 4)$.

References

Canonical:

  • Dinh et al., "Density estimation using Real NVP" (2017), Section 3
  • Rezende & Mohamed, "Variational Inference with Normalizing Flows" (2015), Section 3

Current:

  • Papamakarios et al., "Normalizing Flows for Probabilistic Modeling and Inference" (2021), Chapters 3-4

  • Kobyzev et al., "Normalizing Flows: An Introduction and Review" (2020)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

  • Diffusion models: the generative paradigm that displaced flows
  • Energy-based models: density modeling without normalization

Last reviewed: April 2026
