
Modern Generalization

Optimal Transport and Earth Mover's Distance

The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.

Advanced · Tier 2 · ~55 min

Why This Matters

Optimal transport (OT) gives you a principled way to measure the distance between probability distributions that respects the geometry of the underlying space. Unlike KL divergence or total variation, the Wasserstein distance does not blow up when distributions have non-overlapping support. This property made it central to Wasserstein GANs, and it continues to appear in domain adaptation, fairness auditing, and dataset comparison.

Formal Setup

Let \mu and \nu be probability measures on a metric space (X, d). We want to quantify the "cost" of transforming \mu into \nu.

Definition

Monge Problem

Find a transport map T: X \to X such that T_\# \mu = \nu (the pushforward of \mu under T equals \nu) and the total transport cost is minimized:

\inf_{T:\, T_\# \mu = \nu} \int_X c(x, T(x)) \, d\mu(x)

where c(x, y) is a cost function (typically c(x, y) = d(x, y)^p).

The Monge problem is hard: it requires a deterministic map, which may not exist (e.g., transporting a Dirac mass to a sum of two Dirac masses).

Definition

Kantorovich Relaxation

Replace the deterministic map with a coupling (joint distribution) \gamma on X \times X with marginals \mu and \nu:

W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y) \, d\gamma(x, y)

where \Pi(\mu, \nu) is the set of all joint distributions with first marginal \mu and second marginal \nu. For c(x, y) = d(x, y)^p, the p-th root of this quantity is the p-Wasserstein distance W_p(\mu, \nu).

The Kantorovich relaxation is a linear program: minimize a linear objective over a convex polytope of couplings.
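As a concrete instance of this linear program, the small example below (distributions and cost matrix are illustrative, not from the text) solves discrete OT with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete OT as an LP: minimize <C, gamma> over couplings gamma >= 0
# whose row sums equal mu and column sums equal nu.
mu = np.array([0.5, 0.3, 0.2])   # source weights (illustrative)
nu = np.array([0.2, 0.3, 0.5])   # target weights (illustrative)
C = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))  # cost |i - j|

n, m = len(mu), len(nu)
A_eq = np.zeros((n + m, n * m))  # gamma flattened row-major: gamma[i, j] -> i*m + j
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row i sums to mu[i]
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column j sums to nu[j]
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
W1 = res.fun   # optimal transport cost; equals W_1 since c = d here
```

For these weights the 1D cumulative-distribution formula gives the same answer (0.6), which is a useful sanity check on the LP setup.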

Main Theorems

Theorem

Kantorovich Duality

Statement

The Kantorovich primal has a dual:

W_1(\mu, \nu) = \sup_{f: \text{Lip}(f) \leq 1} \left( \int f \, d\mu - \int f \, d\nu \right)

For the p = 1 case with c(x, y) = d(x, y), the supremum is over all 1-Lipschitz functions f.

Intuition

The primal asks: what is the cheapest way to move mass from \mu to \nu? The dual asks: what is the largest price difference a 1-Lipschitz "potential function" can extract between the two distributions? Strong duality says these two quantities are equal.
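This equality can be checked by hand on a toy two-point space (the distributions below are illustrative):

```python
import numpy as np

mu = np.array([0.8, 0.2])   # illustrative distribution on {0, 1}
nu = np.array([0.5, 0.5])   # with ground metric d(0, 1) = 1

# Primal: on two points, the only move is shifting mass between them,
# so the optimal cost is |mu[0] - nu[0]|.
primal = abs(mu[0] - nu[0])

# Dual: potentials are defined up to an additive constant, so fix f(0) = 0
# and grid-search f(1) over [-1, 1], the range allowed by Lip(f) <= 1.
f1 = np.linspace(-1.0, 1.0, 2001)
objective = 0.0 * (mu[0] - nu[0]) + f1 * (mu[1] - nu[1])
dual = objective.max()
```

Both quantities come out to 0.3, as strong duality promises.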

Proof Sketch

This follows from LP duality for the discrete case, or from Fenchel-Rockafellar convex duality in the continuous case. The constraint \gamma \in \Pi(\mu, \nu) introduces dual variables (one for each marginal constraint), which become the potential functions f and g. The 1-Lipschitz constraint arises from f(x) - g(y) \leq c(x, y), and at optimality g = f^c (the c-transform).

Why It Matters

The dual is what makes Wasserstein GANs work. Instead of solving a combinatorial transport problem, the WGAN critic learns a 1-Lipschitz function f that maximizes \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_{\text{gen}}}[f(x)]. This is exactly the Kantorovich dual.
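A minimal numerical sketch of this dual objective: for two 1D samples that differ by a pure shift, the identity map f(x) = x is a 1-Lipschitz potential whose expectation gap matches the exact empirical W_1 (computed by sorting), so the dual bound is tight here:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=10_000)   # samples from mu (illustrative)
b = a - 1.5                   # samples from nu: mu shifted left by 1.5

# Exact empirical W_1 in 1D: sort both samples and average the gaps.
w1 = np.mean(np.abs(np.sort(a) - np.sort(b)))

# Dual value of the 1-Lipschitz test function f(x) = x:
# E_mu[f] - E_nu[f] <= W_1, with equality for a pure shift.
dual_value = a.mean() - b.mean()
```

For general distributions f(x) = x only gives a lower bound; the WGAN critic searches over a parametrized family of (approximately) 1-Lipschitz functions to tighten it.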

Failure Mode

Computing the exact W_1 via the dual requires enforcing the Lipschitz constraint on f. Weight clipping (original WGAN) is crude and causes training instability. Gradient penalty (WGAN-GP) is better but adds computational cost. Spectral normalization provides a tighter constraint but still does not guarantee exact 1-Lipschitz behavior.

Sinkhorn Algorithm

Exact OT in discrete settings costs O(n^3 \log n) via the network simplex. This is prohibitive for large datasets.

Definition

Entropic Regularization

Add an entropy penalty to the Kantorovich problem:

W_\epsilon(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int c \, d\gamma - \epsilon \, H(\gamma)

where H(\gamma) = -\int \gamma \log \gamma is the entropy of the coupling (subtracting it rewards spread-out couplings). As \epsilon \to 0, this recovers the original OT problem.

Theorem

Sinkhorn Convergence

Statement

The Sinkhorn algorithm (alternating row and column normalization of K_{ij} = e^{-C_{ij}/\epsilon}) converges linearly to the unique optimal entropic coupling. Each iteration costs O(n^2), and the algorithm typically needs O(1/\epsilon) iterations for a fixed accuracy.

Intuition

Entropic regularization makes the optimal coupling strictly positive and unique. The Sinkhorn iteration is coordinate ascent on the dual, alternating between optimizing the two marginal constraints. The matrix K acts as a soft assignment kernel.
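The iteration can be sketched in a few lines of numpy (a non-stabilized version; the toy data, \epsilon, and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(20, 2))         # source samples
y = rng.uniform(size=(30, 2)) + 0.5   # target samples (shifted cloud)
mu = np.full(20, 1 / 20)
nu = np.full(30, 1 / 30)

C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # ground cost d(x, y)
eps = 0.2
K = np.exp(-C / eps)                  # Gibbs kernel K_ij = e^{-C_ij / eps}

u = np.ones(20)
v = np.ones(30)
for _ in range(2000):                 # alternating row/column rescalings
    u = mu / (K @ v)
    v = nu / (K.T @ u)

gamma = u[:, None] * K * v[None, :]   # entropic optimal coupling
cost = np.sum(gamma * C)              # regularized transport cost
```

At convergence the coupling's row and column sums match \mu and \nu, which is the practical test that the iteration has finished.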

Proof Sketch

The entropically regularized dual decomposes into two blocks of variables (one per marginal). Each block update has a closed-form solution involving a row or column normalization of K. Convergence follows from the contraction property of these updates in Hilbert's projective metric.

Why It Matters

Sinkhorn reduced OT computation from O(n^3) to roughly O(n^2 / \epsilon), making it practical for mini-batch computation in ML pipelines. It is differentiable, enabling end-to-end training with OT-based losses.

Failure Mode

Small \epsilon is needed for accuracy but causes numerical underflow: the kernel entries e^{-C_{ij}/\epsilon} fall below machine precision. Log-domain stabilization is necessary in practice. Large \epsilon gives fast convergence but a blurred, inaccurate transport plan.
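A log-domain variant can be sketched as follows: keep the scalings as dual potentials and update them with logsumexp, so an \epsilon small enough to underflow e^{-C_{ij}/\epsilon} entrywise still works (the helper name sinkhorn_log and the example data are ours):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, mu, nu, eps, iters=500):
    # Dual potentials f, g replace the multiplicative scalings u, v.
    f = np.zeros(len(mu))
    g = np.zeros(len(nu))
    for _ in range(iters):
        f += eps * (np.log(mu) - logsumexp((f[:, None] + g[None, :] - C) / eps, axis=1))
        g += eps * (np.log(nu) - logsumexp((f[:, None] + g[None, :] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)   # coupling gamma

# An epsilon this small underflows exp(-C / eps) off the diagonal, breaking
# the naive scaling form, but the log-domain updates remain stable.
pts = np.linspace(0.0, 1.0, 5)
C = np.abs(np.subtract.outer(pts, pts))
mu = np.full(5, 0.2)
nu = np.full(5, 0.2)
gamma = sinkhorn_log(C, mu, nu, eps=1e-3)
```

The returned coupling still satisfies both marginal constraints, which is the check that the stabilization did not change the fixed point.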

Applications in ML

Wasserstein GANs. The original GAN loss uses JS divergence, which has zero gradient when generator and data supports do not overlap. The W_1 loss has informative gradients everywhere, stabilizing training. The WGAN critic approximates the Kantorovich dual.
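The zero-gradient claim is visible on the textbook two-Dirac example: JS(\delta_\theta, \delta_0) is \log 2 for every \theta \neq 0, flat in \theta, while W_1 = |\theta| shrinks as the generator improves (the helper function below is an illustrative sketch):

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between discrete distributions, in nats.
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# delta_theta vs delta_0, represented on the two-point support {0, theta}:
p = np.array([0.0, 1.0])   # all mass at theta
q = np.array([1.0, 0.0])   # all mass at 0
for theta in (0.1, 1.0, 10.0):
    js = js_divergence(p, q)   # log 2 for every theta != 0: no gradient signal
    w1 = abs(theta)            # |theta|: decreases as theta -> 0
```

The JS value never changes with \theta, so its gradient with respect to the generator parameter is zero; W_1 decreases monotonically toward the target.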

Domain adaptation. If source and target domains have distributions \mu_S and \mu_T, bounding generalization on the target requires a measure of distribution shift. OT provides this measure while respecting feature geometry.

Fairness. Measuring whether a model treats two demographic groups similarly can be framed as measuring the OT distance between the conditional output distributions for each group.

Sliced Wasserstein. Computing W_p in high dimensions is expensive. The sliced Wasserstein distance projects distributions onto random 1D lines (where OT reduces to sorting, costing O(n \log n)) and averages. This is a valid metric and scales to high-dimensional settings.
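A sketch of the sliced estimator under the stated reduction to 1D sorting (equal sample counts assumed; the function name, parameters, and toy data are ours):

```python
import numpy as np

def sliced_w1(x, y, n_proj=100, seed=0):
    # Average the 1D W_1 over random projection directions.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=x.shape[1])
        theta /= np.linalg.norm(theta)        # random unit direction
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py))     # 1D OT solved by sorting
    return total / n_proj

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3))
y = rng.normal(size=(200, 3)) + 2.0           # cloud shifted by (2, 2, 2)
sw = sliced_w1(x, y)
```

Each projection costs a sort, so the whole estimate is O(n_proj · n log n) rather than the cubic cost of exact high-dimensional OT.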

Common Confusions

Watch Out

Wasserstein distance is not just another f-divergence

KL divergence, JS divergence, and total variation are all f-divergences. The Wasserstein distance is not. It requires a metric on the underlying space and metrizes weak convergence (convergence in distribution), which f-divergences do not. This is precisely why W_1 gives useful gradients when supports do not overlap.

Watch Out

Earth mover's distance is W_1, not W_p for arbitrary p

The name "earth mover's distance" specifically refers to W_1 with the ground-metric cost c(x, y) = d(x, y). The W_2 distance (quadratic cost) has different properties and a different dual formulation involving Brenier's theorem.

Exercises

ExerciseCore

Problem

For two discrete distributions \mu = (1/3, 2/3) and \nu = (1/2, 1/2) on \{0, 1\} with ground metric d(0, 1) = 1, compute W_1(\mu, \nu) by writing out the coupling polytope and solving the LP.

ExerciseAdvanced

Problem

Show that the Kantorovich dual for W_1 on a finite metric space reduces to a linear program. How many constraints does the LP have if both distributions are supported on n points?

References

Canonical:

  • Villani, Optimal Transport: Old and New (2009), Chapters 1-6
  • Peyre & Cuturi, Computational Optimal Transport (2019), Chapters 2-4

Current:

  • Arjovsky, Chintala, Bottou, "Wasserstein GAN" (2017)
  • Cuturi, "Sinkhorn Distances" (NeurIPS 2013)

Last reviewed: April 2026
