Optimal Transport and Earth Mover's Distance

Sneiderman, Robby

The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.

Advanced · Tier 2 · Current · Supporting · ~55 min

Prerequisites

Convex Duality, Wasserstein Distances

Why This Matters

Optimal transport (OT) gives you a principled way to measure the distance between probability distributions that respects the geometry of the underlying space. Unlike KL divergence or total variation, the Wasserstein distance does not blow up when distributions have non-overlapping support. This property made it central to Wasserstein GANs, and it continues to appear in domain adaptation, fairness auditing, and dataset comparison.

Figure: Optimal transport defines a distance between distributions by the cost of moving mass. Sinkhorn regularization makes it tractable for ML. (Infographic covering the discrete coupling minimization, the Kantorovich relaxation, the Wasserstein-p distance, the Sinkhorn-Knopp regularized solver, and ML applications: generative model evaluation, distribution alignment, domain adaptation, and the Wasserstein GAN.)

Formal Setup

Let $\mu$ and $\nu$ be probability measures on a metric space $(X, d)$ . We want to quantify the "cost" of transforming $\mu$ into $\nu$ .

Definition

Monge Problem $T : X \to X$

Find a transport map $T: X \to X$ such that $T_\# \mu = \nu$ (the pushforward of $\mu$ under $T$ equals $\nu$ ) and the total transport cost is minimized:

$\inf_{T: T_\# \mu = \nu} \int_X c(x, T(x)) \, d\mu(x)$

where $c(x, y)$ is a cost function (typically $c(x,y) = d(x,y)^p$ ).

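The Monge problem is hard: the constraint $T_\# \mu = \nu$ is nonlinear in $T$ , and a feasible deterministic map may not exist (e.g., no map can transport a single Dirac mass to a mixture of two Dirac masses, because a map cannot split mass).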
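
To make the Dirac example concrete (a worked illustration, with the specific points chosen here for convenience), take $\mu = \delta_0$ and $\nu = \tfrac12 \delta_{-1} + \tfrac12 \delta_{1}$ on $\mathbb{R}$ . For any map $T$ ,

$T_\# \delta_0 = \delta_{T(0)}$

which is a single Dirac mass, so it can never equal $\nu$ . The feasible set is empty because a map must send all the mass at $0$ to one location.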

Definition

Kantorovich Relaxation $\gamma \in \Pi(\mu, \nu)$

Replace the deterministic map with a coupling (joint distribution) $\gamma$ on $X \times X$ with marginals $\mu$ and $\nu$ :

$W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y) \, d\gamma(x, y)$

where $\Pi(\mu, \nu)$ is the set of all joint distributions with first marginal $\mu$ and second marginal $\nu$ . For $c(x,y) = d(x,y)^p$ , the $p$ -th root of this quantity is the $p$ -Wasserstein distance $W_p(\mu, \nu)$ .

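The Kantorovich relaxation is a linear program: the objective is linear in $\gamma$ and the feasible set $\Pi(\mu, \nu)$ is convex. For discrete measures it is a finite linear program over a convex polytope of couplings.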
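
In the discrete case the linear program can be written out explicitly. As a sketch, assume (notation introduced here, not above) $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ , $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ , and $C_{ij} = c(x_i, y_j)$ . A coupling is then a nonnegative matrix $P$ with row sums $a$ and column sums $b$ :

$\min_{P \geq 0} \sum_{i,j} C_{ij} P_{ij} \quad \text{s.t.} \quad \sum_{j} P_{ij} = a_i \ \ \forall i, \qquad \sum_{i} P_{ij} = b_j \ \ \forall j$

The program has $nm$ variables and $n + m$ equality constraints, one of which is redundant because both marginals sum to 1.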

Main Theorems

Theorem

Kantorovich Duality

Statement

For the $p = 1$ case with cost $c(x,y) = d(x,y)$ , the Kantorovich primal has the dual

$W_1(\mu, \nu) = \sup_{f: \text{Lip}(f) \leq 1} \left( \int f \, d\mu - \int f \, d\nu \right)$

where the supremum is over all 1-Lipschitz functions $f : X \to \mathbb{R}$ .

Intuition

The primal asks: what is the cheapest way to move mass from $\mu$ to $\nu$ ? The dual asks: what is the largest price difference a 1-Lipschitz "potential function" can extract between the two distributions? Strong duality says these two quantities are equal.

Proof Sketch

This follows from LP duality for the discrete case, or from Fenchel-Rockafellar convex duality in the continuous case. The constraint $\gamma \in \Pi(\mu, \nu)$ introduces dual variables (one for each marginal constraint), which become the potential functions $f$ and $g$ . The 1-Lipschitz constraint arises from $f(x) - g(y) \leq c(x,y)$ , and at optimality $g = f^c$ (the $c$ -transform).
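
The step from two potentials to a single 1-Lipschitz function can be spelled out as follows (a sketch using the sign convention of the constraint above). The general dual is

$\sup_{f, g} \left\{ \int f \, d\mu - \int g \, d\nu \ : \ f(x) - g(y) \leq c(x, y) \ \ \forall x, y \right\}$

For fixed $f$ , the best admissible $g$ is the smallest one, which is the $c$ -transform:

$f^c(y) = \sup_{x} \left[ f(x) - c(x, y) \right]$

When $c = d$ , the function $f^c$ is a supremum of 1-Lipschitz functions of $y$ , so it is 1-Lipschitz, and $f^c \geq f$ (take $x = y$ ). Replacing $f$ by $f^c$ therefore keeps the constraint satisfied and does not decrease the objective, and for a 1-Lipschitz $f$ one has $f(x) - d(x, y) \leq f(y)$ , so $f^c = f$ . The dual collapses to a supremum over a single 1-Lipschitz function.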

Why It Matters

The dual is what makes Wasserstein GANs work. Instead of solving a combinatorial transport problem, the WGAN critic learns a 1-Lipschitz function $f$ that maximizes $\mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_{\text{gen}}}[f(x)]$ . This is exactly the Kantorovich dual.
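
The relationship between a critic and $W_1$ can be checked numerically in one dimension. The following is a minimal sketch, not the WGAN training procedure: it uses fixed hand-picked 1-Lipschitz functions in place of a learned critic, and the sample distributions and function names (`w1_sorted`, `dual_objective`) are illustrative assumptions. It relies on the fact that in 1D, with equal sample counts, the optimal coupling matches sorted samples in order.

```python
import random


def w1_sorted(xs, ys):
    """Exact W1 between two equal-size empirical distributions on the line.

    In 1D the optimal coupling pairs the sorted samples in order.
    """
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_objective(f, xs, ys):
    """Sample estimate of E_mu[f] - E_nu[f] for a candidate critic f."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(500)]  # stands in for p_data
gen = [rng.gauss(0.0, 1.5) for _ in range(500)]   # stands in for p_gen

# Three hand-picked 1-Lipschitz functions (hypothetical critics).
critics = {
    "identity": lambda x: x,
    "abs": lambda x: abs(x),
    "clipped": lambda x: max(-1.0, min(1.0, x)),
}

exact = w1_sorted(data, gen)
bounds = {name: dual_objective(f, data, gen) for name, f in critics.items()}

print(f"exact W1 on the samples: {exact:.4f}")
for name, value in bounds.items():
    print(f"dual objective with {name} critic: {value:.4f}")
```

Every dual value is a lower bound on the exact $W_1$ of the two samples; training a critic amounts to searching for the 1-Lipschitz function that makes this bound as tight as possible.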

Failure Mode

Computing the exact $W_1$ via the dual requires enforcing the Lipschitz constraint on $f$ . Weight clipping (original WGAN) is crude and causes training instability. Gradient penalty (WGAN-GP) is better but adds computational cost. Spectral normalization provides a tighter constraint but still does not guarantee exact 1-Lipschitz behavior.

Sinkhorn Algorithm

For two distributions supported on $n$ points each, exact OT costs $O(n^3 \log n)$ via network simplex. This is prohibitive for large datasets.

Definition

Entropic Regularization

Subtract an entropy term from the Kantorovich objective, which penalizes low-entropy (sparse) couplings:

$W_\epsilon(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int c \, d\gamma - \epsilon \, H(\gamma)$

where $H(\gamma) = -\int \gamma \log \gamma$ is the entropy of the coupling and $\epsilon > 0$ is the regularization strength. As $\epsilon \to 0$ , this recovers the original OT problem.
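
For discrete measures the regularized problem has an explicit solution structure, sketched here under assumed notation (weights $a_i$ , $b_j$ , cost matrix $C_{ij}$ , coupling matrix $P$ , multipliers $\alpha_i$ , $\beta_j$ ). Writing the objective as cost minus $\epsilon$ times entropy,

$\min_{P} \sum_{i,j} C_{ij} P_{ij} + \epsilon \sum_{i,j} P_{ij} \log P_{ij} \quad \text{s.t.} \quad \sum_j P_{ij} = a_i, \ \ \sum_i P_{ij} = b_j$

and setting the derivative of the Lagrangian with respect to $P_{ij}$ to zero gives

$C_{ij} + \epsilon (\log P_{ij} + 1) - \alpha_i - \beta_j = 0 \quad \Longrightarrow \quad P_{ij} = u_i \, e^{-C_{ij}/\epsilon} \, v_j$

with $u_i = e^{\alpha_i/\epsilon - 1/2}$ and $v_j = e^{\beta_j/\epsilon - 1/2}$ . The optimal coupling is therefore a diagonal rescaling of the kernel $e^{-C_{ij}/\epsilon}$ , and the only unknowns are the two positive scaling vectors.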

Theorem

Sinkhorn Convergence

Statement

The Sinkhorn algorithm (alternating row and column normalization of $K_{ij} = e^{-C_{ij}/\epsilon}$ ) converges linearly to the unique optimal entropic coupling. Each iteration costs $O(n^2)$ . To approximate the unregularized OT cost within additive error $\delta$ (a target accuracy, distinct from the regularization strength $\epsilon$ , which is then chosen as a function of $\delta$ ), the iteration complexity is $\tilde O(1/\delta^2)$ when $\|C\|_\infty$ is treated as a constant (Altschuler, Weed, Rigollet 2017, arXiv:1705.09634), giving a total time of $\tilde O(n^2 / \delta^2)$ . Some expositions instead state an $O(1/\delta)$ bound for convergence in the KL sense at fixed regularization; the $\tilde O(1/\delta^2)$ form is the relevant complexity for approximating the original OT problem.

Intuition

Entropic regularization makes the optimal coupling strictly positive and unique. The Sinkhorn iteration is block coordinate ascent on the dual: it alternates between the two dual potentials, and each update enforces one marginal constraint exactly. The matrix $K$ acts as a soft assignment kernel.

Proof Sketch

The entropically regularized dual decomposes into two blocks of variables (one per marginal). Each block update has a closed-form solution involving a row or column normalization of $K$ . Convergence follows from the contraction property of these updates in Hilbert's projective metric.
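
The alternating normalization can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a small dense cost matrix, strictly positive marginals, and a moderate $\epsilon$ ; the function names and the three-point example are assumptions made for this sketch, and it is not a numerically robust solver.

```python
import math


def sinkhorn(a, b, C, eps, n_iters=500):
    """Return the coupling diag(u) K diag(v) after n_iters Sinkhorn sweeps."""
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row step: pick u so that the row sums of diag(u) K diag(v) equal a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column step: pick v so that the column sums equal b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(P, C):
    """Linear transport cost <C, P> of a coupling P."""
    return sum(P[i][j] * C[i][j] for i in range(len(P)) for j in range(len(P[0])))


# Three points on a line with ground cost |x - y|.
points = [0.0, 1.0, 2.0]
C = [[abs(x - y) for y in points] for x in points]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]

P = sinkhorn(a, b, C, eps=0.5)
print("transport cost of the entropic plan:", transport_cost(P, C))
```

For this example the unregularized $W_1$ is 0.6 (the two CDFs differ by 0.3 on each unit interval), so the transport cost of the entropic plan sits above 0.6. Reducing `eps` moves it closer, at the price of more iterations and the numerical issues discussed under Failure Mode.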

Why It Matters

Sinkhorn reduced OT computation from $O(n^3 \log n)$ (exact LP via network simplex) to $\tilde O(n^2 / \delta^2)$ for a $\delta$ -approximate OT cost (Altschuler, Weed, Rigollet 2017), making it practical for mini-batch computation in ML pipelines. Every step of the iteration is differentiable, enabling end-to-end training with OT-based losses.

Failure Mode

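Small $\epsilon$ is needed for accuracy but causes numerical instability: the entries of $K = e^{-C/\epsilon}$ underflow to zero, and the scaling vectors must grow correspondingly large. Log-domain stabilization is necessary in practice, and convergence also slows as $\epsilon$ shrinks. Large $\epsilon$ gives fast convergence but a blurred, inaccurate transport plan.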
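
Log-domain stabilization can be sketched as follows. This is a minimal illustration rather than a production solver: it iterates on potentials `f` and `g` with $P_{ij} = \exp((f_i + g_j - C_{ij})/\epsilon)$ (a sign convention chosen for this sketch) and never forms the kernel. The example points are an assumption picked so that the plain kernel underflows to zero in double precision.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) by factoring out the maximum."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_log(a, b, C, eps, n_iters=100):
    """Sinkhorn on dual potentials; requires strictly positive a and b."""
    n, m = len(a), len(b)
    log_a = [math.log(x) for x in a]
    log_b = [math.log(x) for x in b]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(n_iters):
        # Row step in log space: enforce row sums equal to a.
        f = [eps * (log_a[i] - logsumexp([(g[j] - C[i][j]) / eps for j in range(m)]))
             for i in range(n)]
        # Column step in log space: enforce column sums equal to b.
        g = [eps * (log_b[j] - logsumexp([(f[i] - C[i][j]) / eps for i in range(n)]))
             for j in range(m)]
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(m)]
            for i in range(n)]


# Two well-separated supports: every kernel entry exp(-C/eps) underflows to 0.0,
# so the plain row normalization would divide by zero.
x_pts = [0.0, 1.0]
y_pts = [1000.0, 1001.0]
C = [[abs(x - y) for y in y_pts] for x in x_pts]
a = [0.5, 0.5]
b = [0.5, 0.5]
eps = 1.0

print("plain kernel entry:", math.exp(-C[0][0] / eps))
P = sinkhorn_log(a, b, C, eps)
cost = sum(P[i][j] * C[i][j] for i in range(2) for j in range(2))
print("coupling:", P)
print("transport cost:", cost)
```

In this particular example every coupling has the same transport cost (all source points lie to the left of all target points), so the entropic solution is the independent coupling with each entry equal to 0.25 and cost 1000.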

Applications in ML

Wasserstein GANs. The original GAN loss corresponds to JS divergence, which is constant (and so has zero gradient) when generator and data supports do not overlap. The $W_1$ loss still varies with the distance between the supports, so it provides informative gradients and stabilizes training. The WGAN critic approximates the Kantorovich dual.

Domain adaptation. If source and target domains have distributions $\mu_S$ and $\mu_T$ , bounding generalization on the target requires a measure of distribution shift. OT provides this measure while respecting feature geometry.

Fairness. Measuring whether a model treats two demographic groups similarly can be framed as measuring the OT distance between the conditional output distributions for each group.
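
For scalar model outputs this comparison reduces to one-dimensional OT, where $W_1$ equals the area between the two empirical CDFs. The sketch below assumes scalar scores and allows unequal group sizes; the function name and the score lists are hypothetical placeholders, not data from any real audit.

```python
def w1_empirical_1d(xs, ys):
    """W1 between two 1D empirical distributions as the integral of |F_x - F_y|."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(grid, grid[1:]):
        # Advance the counters so i/len(xs) and j/len(ys) are the CDFs on [left, right).
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


# Hypothetical model scores for two demographic groups (illustrative values only).
scores_group_a = [0.20, 0.35, 0.50, 0.65, 0.80]
scores_group_b = [0.30, 0.55, 0.60, 0.90]

gap = w1_empirical_1d(scores_group_a, scores_group_b)
print(f"W1 between the two score distributions: {gap:.4f}")
```

A value of zero means the two groups receive identically distributed scores; the distance is in the same units as the scores, which makes it easier to interpret than a divergence.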

Sliced Wasserstein. Computing $W_p$ in high dimensions is expensive. The sliced Wasserstein distance projects distributions onto random 1D lines (where OT reduces to sorting, costing $O(n \log n)$ ) and averages. This is a valid metric and scales to high-dimensional settings.
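
The projection-and-sort idea can be sketched as a Monte Carlo estimate. This minimal version assumes $p = 1$ , equal sample counts in both point clouds, and directions drawn uniformly from the unit sphere by normalizing Gaussian vectors; the function names and the number of projections are illustrative choices.

```python
import math
import random


def w1_sorted(xs, ys):
    """Exact 1D W1 for equal-size samples: match sorted values in order."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def sliced_w1(X, Y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced W1 distance between point clouds X and Y."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction: normalize a standard Gaussian vector.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both clouds onto the line spanned by theta.
        proj_x = [sum(t * c for t, c in zip(theta, row)) for row in X]
        proj_y = [sum(t * c for t, c in zip(theta, row)) for row in Y]
        total += w1_sorted(proj_x, proj_y)
    return total / n_projections


# A unit square and a copy translated by (3, 4).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[x + 3.0, y + 4.0] for x, y in X]

print("sliced W1 estimate:", sliced_w1(X, Y))
```

For a pure translation, each projected distance is the absolute projection of the shift vector, so the sliced value is smaller than the full $W_1$ (here 5). The sliced distance is a different metric from $W_p$ , not an approximation with the same scale.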

Common Confusions

Watch Out

Wasserstein distance is not just another f-divergence

KL divergence, JS divergence, and total variation are all $f$ -divergences. The Wasserstein distance is not. It requires a metric on the underlying space and, on compact spaces, metrizes weak convergence (convergence in distribution), while $f$ -divergences do not. This is precisely why $W_1$ gives useful gradients when supports do not overlap.

Watch Out

Earth mover's distance is W1, not Wp for arbitrary p

The name "earth mover's distance" specifically refers to $W_1$ with the ground metric cost $c(x,y) = d(x,y)$ . The $W_2$ distance (quadratic cost) has different properties and a different dual formulation involving the Brenier theorem.

Exercises

Exercise (Core)

Problem

For two discrete distributions $\mu = (1/3, 2/3)$ and $\nu = (1/2, 1/2)$ on $\{0, 1\}$ with ground metric $d(0,1) = 1$ , compute $W_1(\mu, \nu)$ by writing out the coupling polytope and solving the LP.

Exercise (Advanced)

Problem

Show that the Kantorovich dual for $W_1$ on a finite metric space reduces to a linear program. How many constraints does the LP have if both distributions are supported on $n$ points?

References

Canonical:

Villani, Optimal Transport: Old and New (2009), Chapters 1-6
Peyré & Cuturi, Computational Optimal Transport (2019), Chapters 2-4

Current:

Arjovsky, Chintala, Bottou, "Wasserstein GAN" (2017)
Cuturi, "Sinkhorn Distances" (NeurIPS 2013)
Altschuler, Weed, Rigollet, "Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration" (NeurIPS 2017, arXiv:1705.09634), tight $\tilde O(n^2/\epsilon^2)$ complexity analysis

Last reviewed: April 14, 2026
