Modern Generalization
Wasserstein Distances
The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.
Why This Matters
Most distances between probability distributions that you already know, such as KL divergence and total variation, have a critical failure mode: they blow up or become uninformative when the distributions have non-overlapping support. This happens constantly in machine learning. A generative model that produces images slightly shifted from the real data distribution can have zero overlap with it in high-dimensional pixel space, making the KL divergence infinite.
The Wasserstein distance fixes this. It measures how much "work" it takes to reshape one distribution into another, and it remains well-behaved even when the distributions live on different low-dimensional manifolds. This property made it the foundation of WGANs and a key tool in distributional robustness.
Mental Model
Imagine two piles of sand with the same total mass but different shapes. The Wasserstein distance is the minimum total cost of shoveling sand from one pile to reshape it into the other, where cost is the amount of sand moved times the distance it travels.
For probability distributions: you are looking for the cheapest way to transport all the probability mass from distribution $\mu$ to distribution $\nu$.
Formal Setup
Let $(\mathcal{X}, d)$ be a metric space and let $\mu$ and $\nu$ be probability distributions on $\mathcal{X}$.
Coupling
A coupling of $\mu$ and $\nu$ is a joint distribution $\gamma$ on $\mathcal{X} \times \mathcal{X}$ whose marginals are $\mu$ and $\nu$:

$$\gamma(A \times \mathcal{X}) = \mu(A), \qquad \gamma(\mathcal{X} \times B) = \nu(B)$$

for all measurable sets $A, B \subseteq \mathcal{X}$. The set of all couplings is denoted $\Pi(\mu, \nu)$.
A coupling specifies a "transport plan": $\gamma(x, y)$ describes how much mass at location $x$ is sent to location $y$.
Wasserstein-p Distance
The Wasserstein-$p$ distance between $\mu$ and $\nu$ is:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$

for $p \ge 1$. The most common cases are $p = 1$ (earth mover's distance) and $p = 2$ (used in optimal transport theory).
Wasserstein-1 Distance (Earth Mover's Distance)
The special case $p = 1$:

$$W_1(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int d(x, y) \, d\gamma(x, y)$$
This is the total cost of the cheapest transport plan, where cost is mass times distance.
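For equal-weight samples on the real line, the optimal coupling is monotone (smallest point to smallest point), so $W_1$ reduces to an average of matched distances. A minimal sketch, with the hypothetical helper name `w1_sorted`:

```python
import numpy as np

def w1_sorted(x, y):
    """W1 between two equal-weight 1-D samples.

    In one dimension the optimal coupling pairs the i-th smallest
    point of x with the i-th smallest point of y.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes assumed"
    return np.mean(np.abs(x - y))  # average mass-times-distance cost

# Two "piles of sand": point masses at {0, 1} vs {2, 3}.
print(w1_sorted([0.0, 1.0], [2.0, 3.0]))  # each unit of mass moves 2.0, so W1 = 2.0
```

For general (unequal-weight, multi-dimensional) discrete distributions the coupling must be found by linear programming, as discussed under Computational Aspects below.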
Main Theorems
Kantorovich-Rubinstein Duality
Statement
The Wasserstein-1 distance has the dual representation:

$$W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \right)$$

where the supremum is over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$ (functions satisfying $|f(x) - f(y)| \le d(x, y)$ for all $x, y$). The identity requires $\mu$ and $\nu$ to have finite first moments so that both sides are finite (Villani 2009, Theorem 5.10 and Definition 6.4).
Intuition
The primal formulation asks: what is the cheapest transport plan? The dual formulation asks: what is the largest difference in expected value of a "smooth" (Lipschitz) test function between the two distributions? The primal and dual are equivalent via linear programming duality.
Proof Sketch
The primal is a linear program: minimize a linear objective (transport cost) subject to linear constraints (the marginals of $\gamma$ match $\mu$ and $\nu$). The dual of this linear program introduces a function for each marginal constraint: $f$ for $\mu$ and $g$ for $\nu$, subject to $f(x) + g(y) \le d(x, y)$ for all $x, y$. At the optimum one can take $g = -f$, and the resulting constraint $f(x) - f(y) \le d(x, y)$ is exactly the 1-Lipschitz condition. Strong duality holds because the primal is feasible (the product coupling $\mu \otimes \nu$ always exists).
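In the finite-support case the primal and dual of the sketch above are concrete linear programs. With cost matrix $C_{ij} = d(x_i, y_j)$:

```latex
\begin{aligned}
\text{primal:}\quad & \min_{\gamma \ge 0} \; \sum_{i,j} C_{ij}\,\gamma_{ij}
  \quad \text{s.t.}\quad \sum_j \gamma_{ij} = \mu_i,\;\; \sum_i \gamma_{ij} = \nu_j \\
\text{dual:}\quad & \max_{f,\,g} \; \sum_i f_i\,\mu_i + \sum_j g_j\,\nu_j
  \quad \text{s.t.}\quad f_i + g_j \le C_{ij} \;\text{ for all } i, j
\end{aligned}
```

Setting $g = -f$ turns the dual constraint into $f_i - f_j \le C_{ij}$, the discrete 1-Lipschitz condition.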
Why It Matters
The dual form is computationally useful: instead of optimizing over the infinite-dimensional space of couplings, you optimize over Lipschitz functions. This is exactly what the WGAN critic does: it parameterizes $f$ as a neural network and enforces (approximately) the Lipschitz constraint.
Failure Mode
The duality is exact for $p = 1$ only. For $W_p$ with $p > 1$, the dual formulation is more complex and involves $c$-conjugate functions rather than simple Lipschitz constraints.
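A quick numeric check of the duality, under the assumption that $\nu$ is a pure shift of $\mu$ in one dimension (in which case $f(x) = x$ is a 1-Lipschitz function attaining the supremum):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=10_000)   # samples from mu
c = 0.7
nu = mu + c                    # nu is mu shifted by c, so W1(mu, nu) = c

# Primal: in 1-D the optimal coupling pairs sorted samples.
primal = np.mean(np.abs(np.sort(mu) - np.sort(nu)))

# Dual: f(x) = x is 1-Lipschitz and attains the supremum for a shift.
dual = abs(np.mean(mu) - np.mean(nu))

print(round(primal, 6), round(dual, 6))  # both equal the shift c = 0.7
```

The primal and dual values agree (up to floating-point error), as Kantorovich-Rubinstein duality promises.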
Why Wasserstein Beats KL and TV
Wasserstein Metrizes Weak Convergence
Statement
On a compact metric space, convergence in $W_p$ is equivalent to weak convergence (convergence in distribution). In particular:
- If $\mu_n \to \mu$ weakly, then $W_p(\mu_n, \mu) \to 0$.
- If $W_p(\mu_n, \mu) \to 0$, then $\mu_n \to \mu$ weakly.
This does not hold for KL divergence or total variation in general.
Intuition
KL divergence and TV distance can be maximally large between distributions that are "close" in an intuitive sense. Consider $\delta_0$ and $\delta_\epsilon$: two point masses that are $\epsilon$ apart. The TV distance is 1 (maximally far) and the KL divergence is infinite. But $W_1(\delta_0, \delta_\epsilon) = \epsilon$, which correctly captures that the distributions are close. Wasserstein respects the geometry of the underlying space.
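The point-mass computation is a one-liner, since two Dirac measures admit exactly one coupling:

```latex
W_1(\delta_0, \delta_\epsilon)
  = \inf_{\gamma \in \Pi(\delta_0,\, \delta_\epsilon)} \int |x - y| \, d\gamma(x, y)
  = |0 - \epsilon| \cdot 1
  = \epsilon
```

The only coupling is $\gamma = \delta_{(0, \epsilon)}$: all the mass at $0$ is transported to $\epsilon$.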
Proof Sketch
Use the dual formulation. If $\mu_n \to \mu$ weakly, then $\mathbb{E}_{\mu_n}[f] \to \mathbb{E}_{\mu}[f]$ for all continuous bounded $f$. Lipschitz functions are continuous and bounded (on a compact space), so the supremum over 1-Lipschitz functions converges. The reverse direction follows because Lipschitz functions are dense in the continuous functions on compact spaces.
Why It Matters
This is the reason WGANs work better than the original GAN loss. The original GAN uses the JS divergence (related to the cross-entropy loss), which gives no useful gradient when the generator distribution and the real distribution have disjoint supports (which happens almost always in high dimensions). The Wasserstein distance provides meaningful gradients that guide the generator toward the data distribution.
Failure Mode
On non-compact spaces, Wasserstein convergence is strictly stronger than weak convergence. You additionally need moment conditions (e.g., $\int d(x, x_0)^p \, d\mu_n(x)$ is uniformly bounded).
Comparison of Probability Metrics
| Property | KL Divergence | Total Variation | Wasserstein-1 |
|---|---|---|---|
| Metric? | No (asymmetric) | Yes | Yes |
| Finite for disjoint support? | No ($= \infty$) | Yes ($= 1$) | Yes |
| Respects geometry? | No | No | Yes |
| Useful gradients for GANs? | Often not | Often not | Yes |
| Computational cost | Cheap (if densities known) | Moderate | Expensive |
Applications in Machine Learning
WGANs: The critic (discriminator) in a WGAN approximates the 1-Lipschitz function achieving the supremum in the Kantorovich-Rubinstein dual. The generator minimizes the critic's estimate of $W_1(\mu_{\text{gen}}, \mu_{\text{data}})$. The Lipschitz constraint is enforced by weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
Distributional robustness: Instead of optimizing the expected loss under a single distribution $P_0$, optimize the worst-case expected loss over all distributions within a Wasserstein ball around $P_0$:

$$\min_\theta \; \sup_{P \,:\, W_1(P, P_0) \le \rho} \mathbb{E}_{x \sim P}[\ell(\theta; x)]$$
This gives robustness guarantees against distribution shift.
Fairness: Wasserstein distance can measure how different the model's output distributions are across demographic groups, providing a smooth, geometrically meaningful fairness metric.
Computational Aspects
For discrete distributions with $n$ points each, computing $W_p$ requires solving a linear program with $n^2$ variables. The Sinkhorn algorithm provides an efficient approximation by adding an entropic regularization term:

$$W_\epsilon(\mu, \nu) = \min_{\gamma \in \Pi(\mu, \nu)} \langle \gamma, C \rangle + \epsilon \sum_{i,j} \gamma_{ij} \log \gamma_{ij}$$
Sinkhorn's algorithm solves this by iterated matrix scaling. The tight approximation-complexity bound is $\tilde{O}(n^2 / \epsilon^3)$ arithmetic operations for an additive $\epsilon$-approximation of $W_1$ (Altschuler, Weed, Rigollet, NeurIPS 2017, arXiv:1705.09634). Some earlier references state looser bounds depending on whether $\epsilon$ refers to the regularization parameter or the final approximation accuracy. Either way, Sinkhorn is practical for moderate-sized problems.
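The iterated matrix scaling can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation (libraries such as POT add log-domain stabilization for small $\epsilon$); the helper name `sinkhorn` and the example points are ours:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, iters=500):
    """Entropic-regularized OT by Sinkhorn's iterated matrix scaling.

    Returns the transport plan gamma (marginals ~mu, ~nu) and the
    transport cost <gamma, C>.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):        # alternate marginal corrections
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    gamma = u[:, None] * K * v[None, :]
    return gamma, np.sum(gamma * C)

# Point masses at {0, 1} vs {2, 3}, unit mass split evenly.
x = np.array([0.0, 1.0]); y = np.array([2.0, 3.0])
C = np.abs(x[:, None] - y[None, :])   # cost matrix C_ij = |x_i - y_j|
mu = nu = np.array([0.5, 0.5])
gamma, cost = sinkhorn(mu, nu, C)
print(round(cost, 3))                 # close to the exact W1 = 2.0
```

Smaller `eps` gives a tighter approximation to $W_1$ but slower, less stable scaling; that tradeoff is exactly the $\epsilon$ appearing in the complexity bound above.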
Common Confusions
Wasserstein distance is not a divergence
KL and JS are divergences (not true metrics). Wasserstein distances are true metrics: they are symmetric, satisfy the triangle inequality, and satisfy $W_p(\mu, \nu) = 0$ if and only if $\mu = \nu$. This makes them more mathematically well-behaved but also harder to compute.
The Lipschitz constraint in WGANs is approximate
The Kantorovich-Rubinstein dual requires an exact 1-Lipschitz function. In practice, neural network critics are only approximately Lipschitz (via gradient penalty or spectral normalization). The gap between the approximate and exact Wasserstein distance is an active research topic.
W1 and W2 have different properties
$W_1$ has the clean Kantorovich-Rubinstein dual. $W_2$ has nicer geometric properties: the space of distributions with finite second moments is a (formal) Riemannian manifold under $W_2$ (the Otto calculus). Different applications favor different $p$.
Summary
- Wasserstein distance measures the minimum cost of transporting one distribution to another
- $W_1$ has a dual form: the supremum of the expected difference over 1-Lipschitz functions
- Unlike KL/TV, Wasserstein respects the geometry of the underlying space
- Wasserstein stays well-behaved for distributions with disjoint support
- The dual form is the theoretical basis for WGANs
- Computational cost is higher than KL/TV but manageable with Sinkhorn
Exercises
Problem
Compute $W_1$ between $\delta_0$ and $\delta_a$ on $\mathbb{R}$ (two point masses at $0$ and $a$). What is the optimal coupling?
Problem
Let $\mu = \mathrm{Uniform}[0, 1]$ and $\nu = \mathrm{Uniform}[c, c + 1]$ for $c > 0$. Compute $W_1(\mu, \nu)$ using the Kantorovich-Rubinstein dual. What is the optimal Lipschitz function?
Problem
Why does the original GAN (using JS divergence) suffer from vanishing gradients when the generator distribution and data distribution have disjoint supports, while WGAN does not? Explain using the properties of JS divergence versus Wasserstein distance.
References
Canonical:
- Villani, Optimal Transport: Old and New (2009), Chapters 1-6
- Kantorovich, "On the Translocation of Masses" (1942)
Current:
- Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)
- Peyre & Cuturi, Computational Optimal Transport (2019)
- Altschuler, Weed, Rigollet, "Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration" (NeurIPS 2017, arXiv:1705.09634). Tight complexity bound for Sinkhorn-based approximation.
- Alvarez-Melis & Jaakkola, "Gromov-Wasserstein Alignment of Word Embedding Spaces" (EMNLP 2018). Uses Gromov-Wasserstein distance to align embedding spaces across languages without shared ground space.
- Genevay, Dulac-Arnold, Vert, "Differentiable Deep Clustering with Cluster Size Constraints" (2019). Recasts k-means as an entropic optimal transport problem for end-to-end differentiable clustering.
Planned Additions
This page currently develops the Kantorovich (coupling) formulation and the dual. Several closely-related pieces of optimal-transport theory are deferred to dedicated pages:
- Monge formulation: the original deterministic-transport-map problem $\min_T \int d(x, T(x))^p \, d\mu(x)$ subject to $T_\# \mu = \nu$. More restrictive than the Kantorovich formulation (a minimizer may not exist; relaxing to couplings is what makes the problem tractable).
- Brenier's theorem: for $W_2$ on $\mathbb{R}^d$ with $\mu$ absolutely continuous w.r.t. Lebesgue measure, the optimal transport map exists, is unique, and equals the gradient of a convex function (Brenier 1991, Comm. Pure Appl. Math.; Villani 2009, Theorem 9.4).
- Wasserstein barycenters: the Fréchet mean of a family of distributions under $W_2$, $\arg\min_\nu \sum_i \lambda_i W_2^2(\nu, \mu_i)$ (Agueh & Carlier 2011, SIAM J. Math. Anal.). Used in domain adaptation and shape interpolation.
- Sliced Wasserstein distance: averages one-dimensional Wasserstein distances along random projections, giving a computationally cheap proxy (Bonneel, Rabin, Peyré, Pfister 2015, J. Math. Imaging Vis.).
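Although the sliced Wasserstein distance is deferred to its own page, it is simple enough to sketch here, since the 1-D distances reduce to sorting. A Monte Carlo version under the assumption of equal-size samples, with the hypothetical helper name `sliced_w1`:

```python
import numpy as np

def sliced_w1(X, Y, n_proj=200, rng=None):
    """Monte Carlo sliced-W1: average 1-D W1 over random directions.

    X, Y: (n, d) arrays of equal-size samples. Each 1-D distance uses
    the sorted pairing, which is the optimal monotone coupling in 1-D.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)      # uniform direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(px - py))
    return total / n_proj

# Identical clouds are at distance 0; a shifted copy is not.
X = np.random.default_rng(1).normal(size=(500, 3))
print(sliced_w1(X, X))            # 0.0
print(sliced_w1(X, X + 1.0) > 0)  # True
```

The cost is dominated by sorting, $O(n_{\text{proj}} \cdot n \log n)$, versus the $n^2$-variable linear program for the full distance.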
Next Topics
The natural next steps from Wasserstein distances:
- WGAN theory: how the Wasserstein distance is used in generative modeling
- Distributional robustness: using Wasserstein balls for robust optimization
- Optimal transport: the full mathematical theory
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)