
Modern Generalization

Wasserstein Distances

The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.


Why This Matters

The divergences you most likely already know, KL divergence and total variation, share a critical failure mode: they blow up or become uninformative when the distributions have non-overlapping support. This happens constantly in machine learning. A generative model that produces images slightly shifted from the real data distribution has zero overlap with it in high-dimensional pixel space, making the KL divergence infinite.

The Wasserstein distance fixes this. It measures how much "work" it takes to reshape one distribution into another, and it remains well-behaved even when the distributions live on different low-dimensional manifolds. This property made it the foundation of WGANs and a key tool in distributional robustness.

Mental Model

Imagine two piles of sand with the same total mass but different shapes. The Wasserstein distance is the minimum total cost of shoveling sand from one pile to reshape it into the other, where cost is the amount of sand moved times the distance it travels.

For probability distributions: you are looking for the cheapest way to transport all the probability mass from distribution $P$ to distribution $Q$.

Formal Setup

Let $(M, d)$ be a metric space and let $P, Q$ be probability distributions on $M$.

Definition

Coupling

A coupling of $P$ and $Q$ is a joint distribution $\gamma$ on $M \times M$ whose marginals are $P$ and $Q$:

$$\gamma(A \times M) = P(A), \quad \gamma(M \times B) = Q(B)$$

for all measurable sets $A, B$. The set of all couplings is denoted $\Gamma(P, Q)$.

A coupling specifies a "transport plan": $\gamma(x, y)$ describes how much mass at location $x$ is sent to location $y$.
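For finite distributions, a coupling is just a nonnegative matrix with prescribed row and column sums, and the product coupling shows that the feasible set is never empty. A minimal sketch (the two marginals here are made-up toy numbers):

```python
# A coupling of two discrete distributions is a joint table whose row sums
# recover one marginal and whose column sums recover the other.
P = [0.2, 0.5, 0.3]   # toy marginal over locations x0, x1, x2
Q = [0.4, 0.4, 0.2]   # toy marginal over locations y0, y1, y2

# The product coupling gamma(x, y) = P(x) * Q(y) is always a valid
# transport plan, so Gamma(P, Q) is never empty.
gamma = [[p * q for q in Q] for p in P]

row_sums = [sum(row) for row in gamma]
col_sums = [sum(gamma[i][j] for i in range(len(P))) for j in range(len(Q))]

assert all(abs(r - p) < 1e-12 for r, p in zip(row_sums, P))  # marginal is P
assert all(abs(c - q) < 1e-12 for c, q in zip(col_sums, Q))  # marginal is Q
```

The product coupling is usually far from optimal; the Wasserstein distance comes from searching over all such tables for the cheapest one.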

Definition

Wasserstein-p Distance

The Wasserstein-$p$ distance between $P$ and $Q$ is:

$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$

for $p \geq 1$. The most common cases are $p = 1$ (the earth mover's distance) and $p = 2$ (used throughout optimal transport theory).

Definition

Wasserstein-1 Distance (Earth Mover's Distance)

The special case $p = 1$:

$$W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y) \, d\gamma(x, y)$$

This is the total cost of the cheapest transport plan, where cost is mass times distance.
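On the real line this infimum has a simple closed form for equal-weight samples: the monotone coupling (sort both sample sets and pair them in order) is optimal. A minimal sketch; the helper name is my own:

```python
def w1_empirical(xs, ys):
    """W1 between two equal-size, equal-weight empirical distributions on R.
    Sorting both sample sets and pairing in order is the optimal coupling
    in one dimension."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Two point masses at 0 and 3: all the mass must travel distance 3.
print(w1_empirical([0.0], [3.0]))                 # 3.0

# Shifting every sample by a constant shifts the distribution rigidly,
# so the cheapest plan costs exactly the shift.
xs = [0.1, 0.4, 0.7]
print(w1_empirical(xs, [x + 2.0 for x in xs]))    # approximately 2.0
```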

Main Theorems

Theorem

Kantorovich-Rubinstein Duality

Statement

The Wasserstein-1 distance has the dual representation:

$$W_1(P, Q) = \sup_{\|f\|_L \leq 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$

where the supremum is over all 1-Lipschitz functions $f: M \to \mathbb{R}$ (functions satisfying $|f(x) - f(y)| \leq d(x, y)$ for all $x, y$). The identity requires $P$ and $Q$ to have finite first moments so that both sides are finite (Villani 2009, Theorem 5.10 and Definition 6.4).

Intuition

The primal formulation asks: what is the cheapest transport plan? The dual formulation asks: what is the largest difference in expected value of a "smooth" (Lipschitz) test function between the two distributions? The primal and dual are equivalent via linear programming duality.

Proof Sketch

The primal is a linear program: minimize a linear objective (transport cost) subject to linear constraints (the marginals must match $P$ and $Q$). The dual of this linear program introduces a potential function for each marginal constraint; for a metric cost, the resulting constraint $f(x) - f(y) \leq d(x, y)$ for all $x, y$ is equivalent to the 1-Lipschitz condition. Strong duality holds because the primal is feasible (the product coupling always exists).

Why It Matters

The dual form is computationally useful: instead of optimizing over the infinite-dimensional space of couplings, you optimize over Lipschitz functions. This is exactly what the WGAN critic does: it parameterizes a neural network $f$ and enforces (approximately) the Lipschitz constraint.
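In one dimension the optimal critic can be written down exactly, which makes the dual concrete: the witness $f$ has slope $\mathrm{sign}(F_Q - F_P)$ between grid points, and its dual value equals $W_1 = \int |F_P - F_Q|$. A sketch on a discrete grid (the function name and toy inputs are mine):

```python
def dual_witness(p, q, h):
    """Optimal 1-Lipschitz dual witness for W1 on a 1-D grid with spacing h.
    p, q: probability masses at the grid points. The slope between grid
    points i and i+1 is sign(F_Q(i) - F_P(i)), which maximizes the dual
    objective E_P[f] - E_Q[f]. Returns (f values, dual value)."""
    f = [0.0]
    Fp = Fq = 0.0
    for i in range(len(p) - 1):
        Fp += p[i]
        Fq += q[i]
        slope = 1.0 if Fq > Fp else -1.0   # sign(F_Q - F_P)
        f.append(f[-1] + slope * h)        # |f[i+1] - f[i]| = h: 1-Lipschitz
    dual = sum((pi - qi) * fi for pi, qi, fi in zip(p, q, f))
    return f, dual

# Point mass at grid index 0 vs point mass at index 3 (spacing h = 1):
# moving the mass costs 3, and the dual witness recovers exactly that.
p = [1.0, 0.0, 0.0, 0.0]
q = [0.0, 0.0, 0.0, 1.0]
f, dual = dual_witness(p, q, 1.0)
print(dual)   # 3.0
```

A WGAN critic is doing the high-dimensional analogue of this search: looking through the 1-Lipschitz class for the function that best separates the two distributions in expectation.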

Failure Mode

The duality takes this simple Lipschitz form only for $W_1$. For $W_p$ with $p > 1$, the dual formulation is more complex and involves $c$-conjugate functions rather than simple Lipschitz constraints.

Why Wasserstein Beats KL and TV

Proposition

Wasserstein Metrizes Weak Convergence

Statement

On a compact metric space, convergence in $W_p$ is equivalent to weak convergence (convergence in distribution). In particular:

  1. If $P_n \to P$ weakly, then $W_p(P_n, P) \to 0$.
  2. If $W_p(P_n, P) \to 0$, then $P_n \to P$ weakly.

This does not hold for KL divergence or total variation in general.

Intuition

KL divergence and TV distance can be maximally large between distributions that are "close" in an intuitive sense. Consider $P = \delta_0$ and $Q = \delta_\epsilon$: two point masses that are $\epsilon$ apart. The TV distance is 1 (maximally far) and the KL divergence is infinite. But $W_1(P, Q) = \epsilon$, which correctly captures that the distributions are close. Wasserstein respects the geometry of the underlying space.
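A quick numerical contrast (helper names are mine): on a grid with spacing $h$, total variation is half the $\ell_1$ distance between the mass vectors, while in one dimension $W_1 = h \sum_i |F_P(i) - F_Q(i)|$. Shifting a point mass leaves TV pinned at its maximum but moves $W_1$ linearly:

```python
def tv(p, q):
    """Total variation = half the l1 distance between the mass vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def w1_grid(p, q, h):
    """1-D closed form: W1 = h * sum over grid points of |F_P - F_Q|."""
    Fp = Fq = acc = 0.0
    for a, b in zip(p, q):
        Fp += a
        Fq += b
        acc += abs(Fp - Fq)
    return h * acc

# A point mass shifted by 1, 2, or 5 grid cells (spacing h = 0.1):
# TV is maximal (1.0) in every case, while W1 grows with the shift.
for shift in [1, 2, 5]:
    p = [0.0] * 10
    q = [0.0] * 10
    p[0] = 1.0
    q[shift] = 1.0
    print(shift, tv(p, q), w1_grid(p, q, 0.1))
```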

Proof Sketch

Use the dual formulation. If $P_n \to P$ weakly, then $\mathbb{E}_{P_n}[f] \to \mathbb{E}_P[f]$ for all continuous bounded $f$. Lipschitz functions are continuous and bounded (on a compact space), so the supremum over 1-Lipschitz functions converges. The reverse direction follows because Lipschitz functions are dense in the continuous functions on compact spaces.

Why It Matters

This is the reason WGANs work better than the original GAN loss. The original GAN minimizes the JS divergence (via the cross-entropy loss), which gives no useful gradient when the generator distribution and the real distribution have disjoint supports (which happens almost always in high dimensions). The Wasserstein distance provides meaningful gradients that guide the generator toward the data distribution.

Failure Mode

On non-compact spaces, Wasserstein convergence is strictly stronger than weak convergence. You additionally need moment conditions (e.g., $\mathbb{E}[\|X\|^p]$ uniformly bounded).

Comparison of Probability Metrics

| Property | KL Divergence | Total Variation | Wasserstein-1 |
| --- | --- | --- | --- |
| Metric? | No (asymmetric) | Yes | Yes |
| Finite for disjoint support? | No ($= \infty$) | Yes ($= 1$) | Yes |
| Respects geometry? | No | No | Yes |
| Useful gradients for GANs? | Often not | Often not | Yes |
| Computational cost | Cheap (if densities known) | Moderate | Expensive |

Applications in Machine Learning

WGANs: The critic (discriminator) in a WGAN approximates the 1-Lipschitz function achieving the supremum in the Kantorovich-Rubinstein dual. The generator minimizes $W_1(P_{\text{data}}, P_{\text{gen}})$. The Lipschitz constraint is enforced by weight clipping (original WGAN) or a gradient penalty (WGAN-GP).

Distributional robustness: Instead of optimizing the expected loss under a single distribution $P$, optimize the worst-case expected loss over all distributions within a Wasserstein ball around $P$:

$$\min_\theta \sup_{Q: W_p(Q, P) \leq \epsilon} \mathbb{E}_Q[\ell(\theta; x, y)]$$

This gives robustness guarantees against distribution shift.

Fairness: Wasserstein distance can measure how different the model's output distributions are across demographic groups, providing a smooth, geometrically meaningful fairness metric.

Computational Aspects

For discrete distributions with $n$ points each, computing $W_p$ requires solving a linear program with $O(n^2)$ variables. The Sinkhorn algorithm provides an efficient approximation by adding an entropic regularization term:

$$W_1^{\epsilon}(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int d(x, y) \, d\gamma + \epsilon \, \mathrm{KL}(\gamma \,\|\, P \otimes Q)$$

Sinkhorn's algorithm solves this by iterated matrix scaling. An $\epsilon$-approximation of $W_1$ can be computed this way in roughly $\tilde{O}(n^2 / \epsilon^2)$ time (Altschuler, Weed, Rigollet, NeurIPS 2017, arXiv:1705.09634). Some references state looser $O(n^2 / \epsilon)$-type bounds depending on whether $\epsilon$ refers to the regularization parameter or the final approximation accuracy. Either way, Sinkhorn is practical for moderate-sized problems.
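A minimal pure-Python sketch of Sinkhorn iteration (the function name and toy problem are mine; production code would use a library such as POT): rows and columns of the kernel $K = \exp(-C/\epsilon)$ are alternately rescaled until both marginals match.

```python
import math

def sinkhorn(p, q, C, eps, iters=1000):
    """Entropically regularized transport cost <gamma, C> between discrete
    distributions p and q with cost matrix C, via Sinkhorn matrix scaling."""
    n, m = len(p), len(q)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale so that gamma = diag(u) K diag(v) has
        # row sums p and column sums q.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    gamma = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(gamma[i][j] * C[i][j] for i in range(n) for j in range(m))

# Three points each on a line, cost = |x - y|: Q is P shifted by 0.5,
# so the unregularized W1 is 0.5.
xs = [0.0, 1.0, 2.0]
ys = [x + 0.5 for x in xs]
p = [1 / 3] * 3
q = [1 / 3] * 3
C = [[abs(x - y) for y in ys] for x in xs]
print(sinkhorn(p, q, C, eps=0.05))   # close to 0.5 for small eps
```

For very small $\epsilon$ the entries of $K$ underflow and the iteration becomes unstable; practical implementations work in the log domain.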

Common Confusions

Watch Out

Wasserstein distance is not a divergence

KL and JS are divergences (not true metrics). Wasserstein distances are true metrics: they are symmetric, satisfy the triangle inequality, and $W_p(P, Q) = 0$ if and only if $P = Q$. This makes them more mathematically well-behaved but also harder to compute.

Watch Out

The Lipschitz constraint in WGANs is approximate

The Kantorovich-Rubinstein dual requires an exact 1-Lipschitz function. In practice, neural network critics are only approximately Lipschitz (via gradient penalty or spectral normalization). The gap between the approximate and exact Wasserstein distance is an active research topic.

Watch Out

W1 and W2 have different properties

$W_1$ has the clean Kantorovich-Rubinstein dual. $W_2$ has nicer geometric properties: the space of distributions under $W_2$ is formally a Riemannian manifold (the Otto calculus). Different applications favor different $p$.
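The difference is easy to see in one dimension, where the sorted pairing is optimal for every $p \geq 1$ (the helper name is mine): $W_2$ weights one long move more heavily than several short ones.

```python
def wp_empirical(xs, ys, p):
    """W_p between equal-weight samples on R via the sorted pairing,
    which is optimal in 1-D for every p >= 1."""
    n = len(xs)
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (cost / n) ** (1 / p)

# Three short moves of length 1 plus one long move of length 3:
xs = [0.0, 0.0, 0.0, 4.0]
ys = [1.0, 1.0, 1.0, 1.0]
print(wp_empirical(xs, ys, 1))   # 1.5
print(wp_empirical(xs, ys, 2))   # about 1.73 (sqrt(3)): the long move dominates
```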

Summary

  • Wasserstein distance measures the minimum cost of transporting one distribution to another
  • $W_1$ has a dual form: the supremum of the difference in expectations over 1-Lipschitz functions
  • Unlike KL/TV, Wasserstein respects the geometry of the underlying space
  • Wasserstein stays well-behaved for distributions with disjoint support
  • The dual form is the theoretical basis for WGANs
  • Computational cost is higher than KL/TV but manageable with Sinkhorn

Exercises

ExerciseCore

Problem

Compute $W_1$ between $P = \delta_0$ and $Q = \delta_a$ on $\mathbb{R}$ (two point masses at 0 and $a > 0$). What is the optimal coupling?

ExerciseAdvanced

Problem

Let $P = \mathrm{Uniform}[0, 1]$ and $Q = \mathrm{Uniform}[a, a+1]$ for $a > 0$. Compute $W_1(P, Q)$ using the Kantorovich-Rubinstein dual. What is the optimal Lipschitz function?

ExerciseResearch

Problem

Why does the original GAN (using JS divergence) suffer from vanishing gradients when the generator distribution and data distribution have disjoint supports, while WGAN does not? Explain using the properties of JS divergence versus Wasserstein distance.

References

Canonical:

  • Villani, Optimal Transport: Old and New (2009), Chapters 1-6
  • Kantorovich, "On the Translocation of Masses" (1942)

Current:

  • Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)
  • Peyre & Cuturi, Computational Optimal Transport (2019)
  • Altschuler, Weed, Rigollet, "Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration" (NeurIPS 2017, arXiv:1705.09634). $\tilde{O}(n^2/\epsilon^2)$ complexity bound for Sinkhorn-based $W_1$ approximation.
  • Alvarez-Melis & Jaakkola, "Gromov-Wasserstein Alignment of Word Embedding Spaces" (EMNLP 2018). Uses Gromov-Wasserstein distance to align embedding spaces across languages without shared ground space.
  • Genevay, Dulac-Arnold, Vert, "Differentiable Deep Clustering with Cluster Size Constraints" (2019). Recasts k-means as an entropic optimal transport problem for end-to-end differentiable clustering.

Planned Additions

This page currently develops the Kantorovich (coupling) formulation and the $W_1$ dual. Several closely related pieces of optimal-transport theory are deferred to dedicated pages:

  • Monge formulation: the original deterministic-transport-map problem $\inf_T \int c(x, T(x)) \, dP(x)$ subject to $T_\# P = Q$. More restrictive than Kantorovich (a minimizer may not exist; relaxing to couplings is what makes the problem tractable).
  • Brenier's theorem: for $W_2$ on $\mathbb{R}^d$ with $P$ absolutely continuous w.r.t. Lebesgue measure, the optimal transport map exists, is unique, and equals the gradient of a convex function (Brenier 1991, Comm. Pure Appl. Math.; Villani 2009, Theorem 9.4).
  • Wasserstein barycenters: the Fréchet mean of a family of distributions under $W_2$, $\arg\min_\mu \sum_i \lambda_i W_2^2(\mu, \mu_i)$ (Agueh & Carlier 2011, SIAM J. Math. Anal.). Used in domain adaptation and shape interpolation.
  • Sliced Wasserstein distance: averages one-dimensional $W_p$ distances along random projections, giving a computationally cheap proxy (Bonneel, Rabin, Peyré, Pfister 2015, J. Math. Imaging Vis.).

Next Topics

The natural next steps from Wasserstein distances:

  • WGAN theory: how the Wasserstein distance is used in generative modeling
  • Distributional robustness: using Wasserstein balls for robust optimization
  • Optimal transport: the full mathematical theory

Last reviewed: April 2026
