Mathematical Infrastructure
Distance Metrics Compared
Side-by-side reference for the distance and divergence functions that appear in ML: Lp norms, cosine, Mahalanobis, Hamming, Jaccard, edit distance, KL, Jensen-Shannon, Wasserstein, MMD. When each is appropriate and what assumptions it implicitly makes.
Prerequisites
Why This Matters
Picking the wrong distance silently bakes a wrong assumption into the model. Cosine similarity discards magnitude, which is correct for normalized text embeddings and wrong for raw counts. Euclidean distance assumes isotropy, which is wrong for features on different scales. KL divergence is asymmetric and infinite when supports disagree, which is fatal for sample-based GAN training. Wasserstein fixes the support problem but requires solving an optimal-transport program. Each choice has a regime where it is the right tool and a regime where it ruins the experiment.
This page is a comparison reference, not a derivation. For mathematical foundations see metric-spaces-convergence-completeness, wasserstein-distances, kl-divergence, and non-euclidean-and-hyperbolic-geometry.
Vector-on-Vector
| Name | Formula | Metric? | Use it for | Avoid when |
|---|---|---|---|---|
| Euclidean ($\ell_2$) | $\sqrt{\sum_i (x_i - y_i)^2}$ | yes | Geometry-respecting features, after standardization | Features on heterogeneous scales |
| Manhattan ($\ell_1$) | $\sum_i \lvert x_i - y_i \rvert$ | yes | Sparse, robust regression, Lasso geometry | Smooth-gradient optimization |
| Chebyshev ($\ell_\infty$) | $\max_i \lvert x_i - y_i \rvert$ | yes | Worst-case error, grid-step distance | Average-case behaviour matters |
| Mahalanobis | $\sqrt{(x-y)^\top \Sigma^{-1} (x-y)}$ | yes if $\Sigma \succ 0$ | Correlated features with known covariance | $\Sigma$ ill-conditioned or unknown |
| Cosine distance | $1 - \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}$ | not a metric | L2-normalized embeddings, text similarity | Magnitude carries signal |
| Angular distance | $\frac{1}{\pi} \arccos\frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}$ | yes | Want a true metric on the sphere | Comparing magnitudes |
| Minkowski-$p$ | $\left(\sum_i \lvert x_i - y_i \rvert^p\right)^{1/p}$ | yes for $p \ge 1$ | Tuning robustness vs smoothness | $p < 1$ (not a metric) |
Notes:
- $\ell_p$ for $p < 1$ is not a metric (the triangle inequality fails).
- Cosine "distance" violates the triangle inequality; use angular distance when you need a true metric.
- Mahalanobis equals Euclidean after whitening by $\Sigma^{-1/2}$. The two are the same operation in different bases.
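The whitening equivalence in the last note can be checked numerically. A minimal NumPy sketch, with an illustrative positive-definite $\Sigma$ and arbitrary points:

```python
import numpy as np

# Illustrative covariance and points; Sigma must be positive definite.
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Mahalanobis distance computed directly.
d = x - y
mahal = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

# Whiten with the symmetric inverse square root Sigma^{-1/2}, then take
# the plain Euclidean distance between the whitened points.
vals, vecs = np.linalg.eigh(Sigma)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T
euclid_after_whitening = np.linalg.norm(W @ x - W @ y)

assert np.isclose(mahal, euclid_after_whitening)
```

The eigendecomposition route is used here for clarity; in practice a Cholesky factor of $\Sigma^{-1}$ is the cheaper whitening transform.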
Set, Sequence, and String
| Name | Formula | Metric? | Use it for |
|---|---|---|---|
| Hamming | $\#\{i : x_i \neq y_i\}$ | yes | Equal-length binary or categorical strings |
| Jaccard | $1 - \lvert A \cap B \rvert / \lvert A \cup B \rvert$ | yes | Set similarity, document shingles |
| Tanimoto (binary fingerprints) | same as Jaccard on bit sets | yes | Cheminformatics fingerprint matching |
| Levenshtein (edit) | min insert + delete + substitute | yes | Sequence comparison: spell-check, DNA |
| Longest common subsequence | $\lvert s \rvert + \lvert t \rvert - 2\,\lvert \mathrm{LCS}(s, t) \rvert$ | yes | Diff tools, weak edit alignment |
| Tree edit distance | min tree-rewriting operations | yes | Parse tree comparison, structured data |
The string and tree metrics all satisfy the triangle inequality but are not embeddable into any low-dimensional Euclidean space without distortion. For high-throughput nearest-neighbour search on these, locality-sensitive hashing or MinHash are the standard tools.
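As a concrete reference point for the table, here is a minimal two-row Wagner–Fischer sketch of Levenshtein distance, $O(\lvert s \rvert \lvert t \rvert)$ time and $O(\lvert t \rvert)$ memory:

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(t) + 1))          # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to the empty string
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute (free on match)
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
# Triangle inequality sanity check on an arbitrary triple:
a, b, c = "cat", "cart", "card"
assert levenshtein(a, c) <= levenshtein(a, b) + levenshtein(b, c)
```

For production workloads, C-backed libraries (e.g. `python-Levenshtein` or `rapidfuzz`) implement the same recurrence far faster.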
Distribution-on-Distribution
| Name | Formula | Symmetric? | Triangle? | Use it for | Trap |
|---|---|---|---|---|---|
| KL divergence | $\sum_x p(x) \log\frac{p(x)}{q(x)}$ | no | no | MLE, variational inference, info-theoretic bounds | Infinite if $q(x) = 0$ where $p(x) > 0$ |
| Reverse KL | $D_{\mathrm{KL}}(q \| p)$ | no | no | Variational lower bounds, mode-seeking objectives | Different mode behaviour than forward KL |
| Jensen-Shannon | $\tfrac{1}{2} D_{\mathrm{KL}}(p \| m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \| m)$, $m = \tfrac{1}{2}(p + q)$ | yes | $\sqrt{\mathrm{JS}}$ is a metric | Symmetric KL surrogate, GAN-style losses | Same support sensitivity as KL |
| Total variation | $\sup_A \lvert P(A) - Q(A) \rvert$ | yes | yes | Worst-case event-probability gap | Often very loose vs Wasserstein |
| Wasserstein-$p$ | $\left(\inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\, d(x, y)^p\right)^{1/p}$ | yes | yes | Distribution alignment with disjoint supports, OT, GAN training | Exact solvers scale poorly (roughly cubic); use Sinkhorn for scale |
| Maximum Mean Discrepancy (MMD) | $\lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}$ | yes | yes | Kernel-based two-sample tests, kernel-MMD GAN | Kernel choice and bandwidth matter a lot |
| Hellinger | $\tfrac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ | yes | yes | Robust two-sample comparison, L2-on-sqrt-densities | Less common in ML papers |
| Rényi-$\alpha$ | $\tfrac{1}{\alpha - 1} \log \sum_x p(x)^\alpha q(x)^{1 - \alpha}$ | no in general | no | Privacy accounting (DP), info bounds | Requires shared support |
KL is the dominant choice when you have parametric models and access to log-densities; Wasserstein dominates when distributions live on different parts of the space (real vs generated samples in early GAN training); MMD dominates when you only have samples and want a closed-form unbiased estimator with kernels.
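The sample-only regime where MMD shines can be illustrated with its unbiased estimator. A minimal sketch with an RBF kernel; the bandwidth `gamma` and sample sizes are illustrative choices:

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def mmd2_unbiased(x, y, gamma=0.5):
    """Unbiased estimator of squared MMD (within-sample diagonals excluded)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, gamma)
    kyy = rbf_kernel(y, y, gamma)
    kxy = rbf_kernel(x, y, gamma)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2.0 * kxy.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
same = mmd2_unbiased(x, rng.normal(size=(200, 2)))           # near 0
shifted = mmd2_unbiased(x, rng.normal(2.0, 1.0, size=(200, 2)))  # clearly positive
print(f"same distribution: {same:.4f}   shifted distribution: {shifted:.4f}")
```

Because the estimator is unbiased, `same` can come out slightly negative; that is expected, not a bug. Bandwidth is the main tuning knob, and the median-heuristic choice of `gamma` is a common default.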
Worked Example: Same Vectors, Different Verdicts
Let $x = (1, 1)$ and $y = 10x = (10, 10)$. Then:
- Euclidean: $\lVert x - y \rVert_2 = 9\sqrt{2} \approx 12.73$.
- Manhattan: $\lVert x - y \rVert_1 = 18$.
- Cosine distance: $1 - \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} = 0$.
By cosine the vectors are identical (same direction); by Euclidean they are far apart. Both are correct given their assumptions: cosine treats $x$ and $10x$ as the same "concept" at different magnitudes; Euclidean treats them as different positions. The choice encodes whether magnitude carries signal.
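A quick NumPy check of the three verdicts, using the illustrative pair $x$ and $10x$:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = 10 * x  # same direction, ten times the magnitude

euclid = np.linalg.norm(x - y)        # 9 * sqrt(2) ~ 12.73
manhattan = np.abs(x - y).sum()       # 18.0
cosine_dist = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 0 up to rounding

print(euclid, manhattan, cosine_dist)
```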
Common Confusions
KL is not a distance
$D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general, and neither direction satisfies the triangle inequality. Forward KL ($D_{\mathrm{KL}}(p \,\|\, q)$ with data $p$ and model $q$) is mode-covering; reverse KL is mode-seeking. Using "KL distance" colloquially is fine; using it as a metric in proofs is not.
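The asymmetry is easy to exhibit on a toy discrete example. A minimal sketch with a hypothetical bimodal "data" distribution and a unimodal "model" that covers only one mode:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q), with 0 * log 0 treated as 0."""
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

# Hypothetical bimodal "data" p vs. a unimodal "model" q over 4 outcomes.
p = np.array([0.45, 0.05, 0.45, 0.05])
q = np.array([0.80, 0.10, 0.05, 0.05])

print(f"forward  D(p||q) = {kl(p, q):.3f}")  # punishes p-mass the model misses
print(f"reverse  D(q||p) = {kl(q, p):.3f}")  # punishes model mass outside p's modes
```

The two directions give visibly different numbers on the same pair, which is why "minimize KL" is an underspecified objective until the direction is fixed.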
Cosine similarity is not invariant to centering
Cosine treats $x$ and $cx$ ($c > 0$) as identical, but treats $x$ and $x + c\mathbf{1}$ as different. Many practitioners "center then cosine," which is equivalent to Pearson correlation, not cosine. The two have different invariances and pick different nearest neighbours.
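Both invariance claims can be verified directly. A minimal sketch on arbitrary illustrative vectors:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Cosine of mean-centered vectors is exactly Pearson correlation.
centered_cos = cosine(a - a.mean(), b - b.mean())
pearson = np.corrcoef(a, b)[0, 1]
assert np.isclose(centered_cos, pearson)   # both 0.6 for this pair

# Plain cosine is scale-invariant but NOT shift-invariant.
assert np.isclose(cosine(a, b), cosine(3 * a, b))
assert not np.isclose(cosine(a, b), cosine(a + 5, b))
```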
Wasserstein is not just a smoothed KL
Wasserstein incorporates the geometry of the underlying space (it depends on the ground cost $d(x, y)$); KL does not. Two distributions that are identical up to a translation have $D_{\mathrm{KL}} = \infty$ if the translation moves them off each other's support, but $W_1$ proportional to the translation distance. This is exactly why WGAN trains stably where vanilla GAN does not.
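The translation behaviour is visible even with the cheapest estimator. In 1-D the optimal coupling matches quantiles, so empirical $W_1$ is just the mean gap between sorted samples; a minimal sketch on illustrative Gaussian samples:

```python
import numpy as np

def w1_empirical(u, v):
    """Empirical Wasserstein-1 in 1-D: the optimal transport plan matches
    quantiles, so W1 is the mean absolute gap between sorted samples."""
    return np.abs(np.sort(u) - np.sort(v)).mean()

rng = np.random.default_rng(0)
base = rng.normal(size=2000)
for shift in (0.5, 1.0, 4.0):
    print(f"shift {shift}: W1 = {w1_empirical(base, base + shift):.3f}")
# W1 tracks the shift exactly, while a histogram KL between the same two
# sample sets diverges as soon as the supports stop overlapping.
```

For distributions on $\mathbb{R}^d$ with $d > 1$ this shortcut no longer applies and one falls back to OT solvers (or Sinkhorn iterations) as noted in the table above.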
Exercises
Problem
Verify that cosine "distance" violates the triangle inequality by exhibiting three vectors $a, b, c$ on the unit circle in $\mathbb{R}^2$ with $d_{\cos}(a, c) > d_{\cos}(a, b) + d_{\cos}(b, c)$.
Problem
Construct two probability distributions $p, q$ on $\mathbb{R}$ with disjoint supports such that $D_{\mathrm{KL}}(p \,\|\, q) = \infty$ but $W_1(p, q) < \infty$. Compute $W_1(p, q)$ explicitly.
Last reviewed: April 18, 2026