Modern Generalization
Wasserstein Distances
The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.
Why This Matters
Most distances between probability distributions that you already know, such as KL divergence and total variation, have a critical failure mode: they blow up or become uninformative when the distributions have non-overlapping support. This happens constantly in machine learning. A generative model that produces images slightly shifted from the real data distribution can have zero overlap with it in high-dimensional pixel space, making the KL divergence infinite.
The Wasserstein distance fixes this. It measures how much "work" it takes to reshape one distribution into another, and it remains well-behaved even when the distributions live on different low-dimensional manifolds. This property made it the foundation of WGANs and a key tool in distributional robustness.
Mental Model
Imagine two piles of sand with the same total mass but different shapes. The Wasserstein distance is the minimum total cost of shoveling sand from one pile to reshape it into the other, where cost is the amount of sand moved times the distance it travels.
For probability distributions: you are looking for the cheapest way to transport all the probability mass from distribution $\mu$ to distribution $\nu$.
Formal Setup
Let $(\mathcal{X}, d)$ be a metric space and let $\mu$ and $\nu$ be probability distributions on $\mathcal{X}$.
Coupling
A coupling of $\mu$ and $\nu$ is a joint distribution $\gamma$ on $\mathcal{X} \times \mathcal{X}$ whose marginals are $\mu$ and $\nu$:

$$\gamma(A \times \mathcal{X}) = \mu(A), \qquad \gamma(\mathcal{X} \times B) = \nu(B)$$

for all measurable sets $A, B \subseteq \mathcal{X}$. The set of all couplings is denoted $\Pi(\mu, \nu)$.
A coupling specifies a "transport plan": $\gamma(x, y)$ describes how much mass at location $x$ is sent to location $y$.
Wasserstein-p Distance
The Wasserstein-$p$ distance between $\mu$ and $\nu$ is:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$

for $p \ge 1$. The most common cases are $p = 1$ (earth mover's distance) and $p = 2$ (used in optimal transport theory).
Wasserstein-1 Distance (Earth Mover's Distance)
The special case $p = 1$:

$$W_1(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int d(x, y) \, d\gamma(x, y)$$
This is the total cost of the cheapest transport plan, where cost is mass times distance.
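For equal-weight samples on the real line, the optimal coupling is monotone (smallest point to smallest point), so $W_1$ reduces to an average of matched distances. A minimal sketch, with the hypothetical helper name `w1_sorted`:

```python
import numpy as np

def w1_sorted(x, y):
    """W1 between two equal-weight 1-D samples.

    In one dimension the optimal coupling pairs the i-th smallest
    point of x with the i-th smallest point of y.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes assumed"
    return np.mean(np.abs(x - y))  # average mass-times-distance cost

# Two "piles of sand": point masses at {0, 1} vs {2, 3}.
print(w1_sorted([0.0, 1.0], [2.0, 3.0]))  # each unit of mass moves 2.0, so W1 = 2.0
```

For general (unequal-weight, multi-dimensional) discrete distributions the coupling must be found by linear programming, as discussed under Computational Aspects below.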
Main Theorems
Kantorovich-Rubinstein Duality
Statement
The Wasserstein-1 distance has the dual representation:

$$W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \right)$$

where the supremum is over all 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$ (functions satisfying $|f(x) - f(y)| \le d(x, y)$ for all $x, y$). The identity requires $\mu$ and $\nu$ to have finite first moments so that both sides are finite (Villani 2009, Theorem 5.10 and Definition 6.4).
Intuition
The primal formulation asks: what is the cheapest transport plan? The dual formulation asks: what is the largest difference in expected value of a "smooth" (Lipschitz) test function between the two distributions? The primal and dual are equivalent via linear programming duality.
Proof Sketch
The primal is a linear program: minimize a linear objective (transport cost) subject to linear constraints (the marginals of $\gamma$ match $\mu$ and $\nu$). The dual of this linear program introduces a function for each marginal constraint: $f$ for $\mu$ and $g$ for $\nu$, subject to $f(x) + g(y) \le d(x, y)$ for all $x, y$. At the optimum one can take $g = -f$, and the resulting constraint $f(x) - f(y) \le d(x, y)$ is exactly the 1-Lipschitz condition. Strong duality holds because the primal is feasible (the product coupling $\mu \otimes \nu$ always exists).
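In the finite-support case the primal and dual of the sketch above are concrete linear programs. With cost matrix $C_{ij} = d(x_i, y_j)$:

```latex
\begin{aligned}
\text{primal:}\quad & \min_{\gamma \ge 0} \; \sum_{i,j} C_{ij}\,\gamma_{ij}
  \quad \text{s.t.}\quad \sum_j \gamma_{ij} = \mu_i,\;\; \sum_i \gamma_{ij} = \nu_j \\
\text{dual:}\quad & \max_{f,\,g} \; \sum_i f_i\,\mu_i + \sum_j g_j\,\nu_j
  \quad \text{s.t.}\quad f_i + g_j \le C_{ij} \;\text{ for all } i, j
\end{aligned}
```

Setting $g = -f$ turns the dual constraint into $f_i - f_j \le C_{ij}$, the discrete 1-Lipschitz condition.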
Why It Matters
The dual form is computationally useful: instead of optimizing over the infinite-dimensional space of couplings, you optimize over Lipschitz functions. This is exactly what the WGAN critic does: it parameterizes $f$ as a neural network and enforces (approximately) the Lipschitz constraint.
Failure Mode
The duality is exact for $p = 1$ only. For $W_p$ with $p > 1$, the dual formulation is more complex and involves $c$-conjugate functions rather than simple Lipschitz constraints.
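A quick numeric check of the duality, under the assumption that $\nu$ is a pure shift of $\mu$ in one dimension (in which case $f(x) = x$ is a 1-Lipschitz function attaining the supremum):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=10_000)   # samples from mu
c = 0.7
nu = mu + c                    # nu is mu shifted by c, so W1(mu, nu) = c

# Primal: in 1-D the optimal coupling pairs sorted samples.
primal = np.mean(np.abs(np.sort(mu) - np.sort(nu)))

# Dual: f(x) = x is 1-Lipschitz and attains the supremum for a shift.
dual = abs(np.mean(mu) - np.mean(nu))

print(round(primal, 6), round(dual, 6))  # both equal the shift c = 0.7
```

The primal and dual values agree (up to floating-point error), as Kantorovich-Rubinstein duality promises.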
Why Wasserstein Beats KL and TV
Wasserstein Metrizes Weak Convergence
Statement
On a compact metric space, convergence in $W_p$ is equivalent to weak convergence (convergence in distribution). In particular:
- If $\mu_n \to \mu$ weakly, then $W_p(\mu_n, \mu) \to 0$.
- If $W_p(\mu_n, \mu) \to 0$, then $\mu_n \to \mu$ weakly.
This does not hold for KL divergence or total variation in general.
Intuition
KL divergence and TV distance can be maximally large between distributions that are "close" in an intuitive sense. Consider $\delta_0$ and $\delta_\epsilon$: two point masses that are $\epsilon$ apart. The TV distance is 1 (maximally far) and the KL divergence is infinite. But $W_1(\delta_0, \delta_\epsilon) = \epsilon$, which correctly captures that the distributions are close. Wasserstein respects the geometry of the underlying space.
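The point-mass computation is a one-liner, since two Dirac measures admit exactly one coupling:

```latex
W_1(\delta_0, \delta_\epsilon)
  = \inf_{\gamma \in \Pi(\delta_0,\, \delta_\epsilon)} \int |x - y| \, d\gamma(x, y)
  = |0 - \epsilon| \cdot 1
  = \epsilon
```

The only coupling is $\gamma = \delta_{(0, \epsilon)}$: all the mass at $0$ is transported to $\epsilon$.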
Proof Sketch
Use the dual formulation. If $\mu_n \to \mu$ weakly, then $\mathbb{E}_{\mu_n}[f] \to \mathbb{E}_{\mu}[f]$ for all continuous bounded $f$. Lipschitz functions are continuous and bounded (on a compact space), so the supremum over 1-Lipschitz functions converges. The reverse direction follows because Lipschitz functions are dense in the continuous functions on compact spaces.
Why It Matters
This is the reason WGANs work better than the original GAN loss. The original GAN uses the JS divergence (related to the cross-entropy loss), which gives no useful gradient when the generator distribution and the real distribution have disjoint supports (which happens almost always in high dimensions). The Wasserstein distance provides meaningful gradients that guide the generator toward the data distribution.
Failure Mode
On non-compact spaces, Wasserstein convergence is strictly stronger than weak convergence. You additionally need moment conditions (e.g., $\int d(x, x_0)^p \, d\mu_n(x)$ is uniformly bounded).
Comparison of Probability Metrics
| Property | KL Divergence | Total Variation | Wasserstein-1 |
|---|---|---|---|
| Metric? | No (asymmetric) | Yes | Yes |
| Finite for disjoint support? | No ($= \infty$) | Yes ($= 1$) | Yes |
| Respects geometry? | No | No | Yes |
| Useful gradients for GANs? | Often not | Often not | Yes |
| Computational cost | Cheap (if densities known) | Moderate | Expensive |
Applications in Machine Learning
WGANs: The critic (discriminator) in a WGAN approximates the 1-Lipschitz function achieving the supremum in the Kantorovich-Rubinstein dual. The generator minimizes the critic's estimate of $W_1(\mu_{\text{gen}}, \mu_{\text{data}})$. The Lipschitz constraint is enforced by weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
Distributional robustness: Instead of optimizing the expected loss under a single distribution $P_0$, optimize the worst-case expected loss over all distributions within a Wasserstein ball around $P_0$:

$$\min_\theta \; \sup_{P \,:\, W_1(P, P_0) \le \rho} \mathbb{E}_{x \sim P}[\ell(\theta; x)]$$
This gives robustness guarantees against distribution shift.
Fairness: Wasserstein distance can measure how different the model's output distributions are across demographic groups, providing a smooth, geometrically meaningful fairness metric.
Computational Aspects
For discrete distributions with $n$ points each, computing $W_p$ requires solving a linear program with $n^2$ variables. The Sinkhorn algorithm provides an efficient approximation by adding an entropic regularization term:

$$W_\epsilon(\mu, \nu) = \min_{\gamma \in \Pi(\mu, \nu)} \langle \gamma, C \rangle + \epsilon \sum_{i,j} \gamma_{ij} \log \gamma_{ij}$$
Sinkhorn's algorithm solves this by iterated matrix scaling. The tight approximation-complexity bound is $\tilde{O}(n^2 / \epsilon^3)$ arithmetic operations for an additive $\epsilon$-approximation of $W_1$ (Altschuler, Weed, Rigollet, NeurIPS 2017, arXiv:1705.09634). Some earlier references state looser bounds depending on whether $\epsilon$ refers to the regularization parameter or the final approximation accuracy. Either way, Sinkhorn is practical for moderate-sized problems.
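The iterated matrix scaling can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation (libraries such as POT add log-domain stabilization for small $\epsilon$); the helper name `sinkhorn` and the example points are ours:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, iters=500):
    """Entropic-regularized OT by Sinkhorn's iterated matrix scaling.

    Returns the transport plan gamma (marginals ~mu, ~nu) and the
    transport cost <gamma, C>.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):        # alternate marginal corrections
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    gamma = u[:, None] * K * v[None, :]
    return gamma, np.sum(gamma * C)

# Point masses at {0, 1} vs {2, 3}, unit mass split evenly.
x = np.array([0.0, 1.0]); y = np.array([2.0, 3.0])
C = np.abs(x[:, None] - y[None, :])   # cost matrix C_ij = |x_i - y_j|
mu = nu = np.array([0.5, 0.5])
gamma, cost = sinkhorn(mu, nu, C)
print(round(cost, 3))                 # close to the exact W1 = 2.0
```

Smaller `eps` gives a tighter approximation to $W_1$ but slower, less stable scaling; that tradeoff is exactly the $\epsilon$ appearing in the complexity bound above.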
Common Confusions
Wasserstein distance is not a divergence
KL and JS are divergences (not true metrics). Wasserstein distances are true metrics: they are symmetric, satisfy the triangle inequality, and satisfy $W_p(\mu, \nu) = 0$ if and only if $\mu = \nu$. This makes them more mathematically well-behaved but also harder to compute.
The Lipschitz constraint in WGANs is approximate
The Kantorovich-Rubinstein dual requires an exact 1-Lipschitz function. In practice, neural network critics are only approximately Lipschitz (via gradient penalty or spectral normalization). The gap between the approximate and exact Wasserstein distance is an active research topic.
W1 and W2 have different properties
$W_1$ has the clean Kantorovich-Rubinstein dual. $W_2$ has nicer geometric properties: the space of distributions with finite second moments is a (formal) Riemannian manifold under $W_2$ (the Otto calculus). Different applications favor different $p$.
Summary
- Wasserstein distance measures the minimum cost of transporting one distribution to another
- $W_1$ has a dual form: the supremum of the expected difference over 1-Lipschitz functions
- Unlike KL/TV, Wasserstein respects the geometry of the underlying space
- Wasserstein stays well-behaved for distributions with disjoint support
- The dual form is the theoretical basis for WGANs
- Computational cost is higher than KL/TV but manageable with Sinkhorn
Exercises
Problem
Compute $W_1$ between $\delta_0$ and $\delta_a$ on $\mathbb{R}$ (two point masses at $0$ and $a$). What is the optimal coupling?
Problem
Let $\mu = \mathrm{Uniform}[0, 1]$ and $\nu = \mathrm{Uniform}[c, c + 1]$ for $c > 0$. Compute $W_1(\mu, \nu)$ using the Kantorovich-Rubinstein dual. What is the optimal Lipschitz function?
Problem
Why does the original GAN (using JS divergence) suffer from vanishing gradients when the generator distribution and data distribution have disjoint supports, while WGAN does not? Explain using the properties of JS divergence versus Wasserstein distance.
References
Canonical:
- Villani, Optimal Transport: Old and New (2009), Chapters 1-6
- Kantorovich, "On the Translocation of Masses" (1942)
Current:
- Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)
- Peyre & Cuturi, Computational Optimal Transport (2019)
- Altschuler, Weed, Rigollet, "Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration" (NeurIPS 2017, arXiv:1705.09634). Tight complexity bound for Sinkhorn-based approximation.
- Alvarez-Melis & Jaakkola, "Gromov-Wasserstein Alignment of Word Embedding Spaces" (EMNLP 2018). Uses Gromov-Wasserstein distance to align embedding spaces across languages without shared ground space.
- Genevay, Dulac-Arnold, Vert, "Differentiable Deep Clustering with Cluster Size Constraints" (2019). Recasts k-means as an entropic optimal transport problem for end-to-end differentiable clustering.
Planned Additions
This page currently develops the Kantorovich (coupling) formulation and the dual. Several closely-related pieces of optimal-transport theory are deferred to dedicated pages:
- Monge formulation: the original deterministic-transport-map problem $\min_T \int d(x, T(x))^p \, d\mu(x)$ subject to $T_\# \mu = \nu$. More restrictive than the Kantorovich formulation (a minimizer may not exist; relaxing to couplings is what makes the problem tractable).
- Brenier's theorem: for $W_2$ on $\mathbb{R}^d$ with $\mu$ absolutely continuous w.r.t. Lebesgue measure, the optimal transport map exists, is unique, and equals the gradient of a convex function (Brenier 1991, Comm. Pure Appl. Math.; Villani 2009, Theorem 9.4).
- Wasserstein barycenters: the Fréchet mean of a family of distributions under $W_2$, $\arg\min_\nu \sum_i \lambda_i W_2^2(\nu, \mu_i)$ (Agueh & Carlier 2011, SIAM J. Math. Anal.). Used in domain adaptation and shape interpolation.
- Sliced Wasserstein distance: averages one-dimensional Wasserstein distances along random projections, giving a computationally cheap proxy (Bonneel, Rabin, Peyré, Pfister 2015, J. Math. Imaging Vis.).
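Although the sliced Wasserstein distance is deferred to its own page, it is simple enough to sketch here, since the 1-D distances reduce to sorting. A Monte Carlo version under the assumption of equal-size samples, with the hypothetical helper name `sliced_w1`:

```python
import numpy as np

def sliced_w1(X, Y, n_proj=200, rng=None):
    """Monte Carlo sliced-W1: average 1-D W1 over random directions.

    X, Y: (n, d) arrays of equal-size samples. Each 1-D distance uses
    the sorted pairing, which is the optimal monotone coupling in 1-D.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)      # uniform direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(px - py))
    return total / n_proj

# Identical clouds are at distance 0; a shifted copy is not.
X = np.random.default_rng(1).normal(size=(500, 3))
print(sliced_w1(X, X))            # 0.0
print(sliced_w1(X, X + 1.0) > 0)  # True
```

The cost is dominated by sorting, $O(n_{\text{proj}} \cdot n \log n)$, versus the $n^2$-variable linear program for the full distance.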
Next Topics
The natural next steps from Wasserstein distances:
- WGAN theory: how the Wasserstein distance is used in generative modeling
- Distributional robustness: using Wasserstein balls for robust optimization
- Optimal transport: the full mathematical theory
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)