Mathematical Infrastructure

Distance Metrics Compared

Side-by-side reference for the distance and divergence functions that appear in ML: Lp norms, cosine, Mahalanobis, Hamming, Jaccard, edit distance, KL, Jensen-Shannon, Wasserstein, MMD. When each is appropriate and what assumptions it implicitly makes.


Why This Matters

Picking the wrong distance silently bakes a wrong assumption into the model. Cosine similarity discards magnitude, which is correct for normalized text embeddings and wrong for raw counts. Euclidean distance assumes isotropy, which is wrong for features on different scales. KL divergence is asymmetric and infinite when supports disagree, which is fatal for sample-based GAN training. Wasserstein fixes the support problem but requires solving an optimal-transport program. Each choice has a regime where it is the right tool and a regime where it ruins the experiment.

This page is a comparison reference, not a derivation. For mathematical foundations see metric-spaces-convergence-completeness, wasserstein-distances, kl-divergence, and non-euclidean-and-hyperbolic-geometry.

Vector-on-Vector

| Name | Formula | Metric? | Use it for | Avoid when |
| --- | --- | --- | --- | --- |
| Euclidean ($L_2$) | $\sqrt{\sum_i (x_i - y_i)^2}$ | yes | Geometry-respecting features, after standardization | Features on heterogeneous scales |
| Manhattan ($L_1$) | $\sum_i \lvert x_i - y_i \rvert$ | yes | Sparse data, robust regression, Lasso geometry | Smooth-gradient optimization |
| Chebyshev ($L_\infty$) | $\max_i \lvert x_i - y_i \rvert$ | yes | Worst-case error, grid-step distance | Average-case behaviour matters |
| Mahalanobis | $\sqrt{(x - y)^T \Sigma^{-1} (x - y)}$ | yes if $\Sigma \succ 0$ | Correlated features with known covariance | $\Sigma$ ill-conditioned or unknown |
| Cosine distance | $1 - \frac{x^T y}{\lVert x \rVert \lVert y \rVert}$ | not a metric | L2-normalized embeddings, text similarity | Magnitude carries signal |
| Angular distance | $\frac{1}{\pi} \arccos\!\big(\frac{x^T y}{\lVert x \rVert \lVert y \rVert}\big)$ | yes | Want a true metric on the sphere | Comparing magnitudes |
| Minkowski-$p$ | $\big(\sum_i \lvert x_i - y_i \rvert^p\big)^{1/p}$ | yes for $p \geq 1$ | Tuning robustness vs smoothness | $p < 1$ (not a metric) |

Notes:

  • $L_p$ for $p < 1$ is not a metric (the triangle inequality fails).
  • Cosine "distance" $1 - \cos\theta$ violates the triangle inequality; use angular distance when you need a true metric.
  • Mahalanobis equals Euclidean after whitening by $\Sigma^{-1/2}$. The two are the same operation in different bases.
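The whitening identity in the last note can be checked numerically. A minimal sketch with illustrative random data (the covariance here is an arbitrary positive-definite matrix, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# An arbitrary symmetric positive-definite covariance (illustrative).
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)

# Mahalanobis distance straight from the definition.
d = x - y
mahalanobis = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

# Whiten with Sigma^{-1/2} (symmetric square root via eigendecomposition),
# then take plain Euclidean distance in the whitened basis.
w, V = np.linalg.eigh(Sigma)
whiten = V @ np.diag(w ** -0.5) @ V.T
euclid_whitened = np.linalg.norm(whiten @ x - whiten @ y)

print(np.isclose(mahalanobis, euclid_whitened))   # True
```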

Set, Sequence, and String

| Name | Formula | Metric? | Use it for |
| --- | --- | --- | --- |
| Hamming | $\#\{i : x_i \neq y_i\}$ | yes | Equal-length binary or categorical strings |
| Jaccard | $1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | yes | Set similarity, document shingles |
| Tanimoto (binary fingerprints) | same as Jaccard on bit sets | yes | Cheminformatics fingerprint matching |
| Levenshtein (edit) | min # of insertions, deletions, substitutions | yes | Sequence comparison: spell-check, DNA |
| Longest common subsequence | $\lvert x \rvert + \lvert y \rvert - 2 \cdot \mathrm{LCS}(x, y)$ | yes | Diff tools, weak edit alignment |
| Tree edit distance | min # of tree-rewriting operations | yes | Parse tree comparison, structured data |

The string and tree metrics all satisfy the triangle inequality but are not embeddable into any low-dimensional Euclidean space without distortion. For high-throughput nearest-neighbour search on these, locality-sensitive hashing or MinHash are the standard tools.
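Two of these are short enough to write from scratch. A reference sketch using only the standard library (function names are mine):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free on match)
        prev = curr
    return prev[-1]


def jaccard_distance(A: set, B: set) -> float:
    """1 - |intersection| / |union|; two empty sets are taken to be at distance 0."""
    if not A and not B:
        return 0.0
    return 1 - len(A & B) / len(A | B)


print(levenshtein("kitten", "sitting"))          # 3
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```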

Distribution-on-Distribution

| Name | Formula | Symmetric? | Triangle? | Use it for | Trap |
| --- | --- | --- | --- | --- | --- |
| KL divergence | $\sum_i p_i \log(p_i / q_i)$ | no | no | MLE, variational inference, info-theoretic bounds | Infinite if $\mathrm{supp}(p) \not\subseteq \mathrm{supp}(q)$ |
| Reverse KL | $\sum_i q_i \log(q_i / p_i)$ | no | no | Variational lower bounds, mode-seeking objectives | Different mode behaviour than forward KL |
| Jensen-Shannon | $\tfrac{1}{2} \mathrm{KL}(p \,\Vert\, m) + \tfrac{1}{2} \mathrm{KL}(q \,\Vert\, m)$ with $m = \tfrac{1}{2}(p + q)$ | yes | $\sqrt{\mathrm{JS}}$ is a metric | Symmetric KL surrogate, GAN-style losses | Saturates at $\log 2$ on disjoint supports (vanishing gradients) |
| Total variation | $\tfrac{1}{2} \sum_i \lvert p_i - q_i \rvert$ | yes | yes | Worst-case event-probability gap | Ignores geometry; equals 1 for any disjoint supports |
| Wasserstein-$p$ | $\inf_{\pi \in \Pi(p, q)} \mathbb{E}_\pi[\lVert X - Y \rVert^p]^{1/p}$ | yes | yes | Distribution alignment under disjoint supports, OT, GAN training | Exact solvers scale roughly cubically; use Sinkhorn at scale |
| Maximum Mean Discrepancy (MMD) | $\sup_{\lVert f \rVert_{\mathcal{H}} \leq 1} \big(\mathbb{E}_p[f] - \mathbb{E}_q[f]\big)$ | yes | yes | Kernel-based two-sample tests, MMD GANs | Kernel choice and bandwidth matter a lot |
| Hellinger | $\tfrac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ | yes | yes | Robust two-sample comparison; $L_2$ on square-root densities | Less common in ML papers |
| Rényi-$\alpha$ | $\tfrac{1}{\alpha - 1} \log \sum_i p_i^\alpha q_i^{1 - \alpha}$ | no in general | no | Privacy accounting (DP), info bounds | Requires shared support |

KL is the dominant choice when you have parametric models and access to log-densities; Wasserstein dominates when distributions live on different parts of the space (real vs generated samples in early GAN training); MMD dominates when you only have samples and want a closed-form unbiased estimator with kernels.
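The support issue is easiest to see on a concrete pair of discrete distributions. A small sketch (helper names are mine; the 1-D $W_1$ uses the closed-form integral of the CDF gap):

```python
import numpy as np

def kl(p, q):
    """Discrete KL with the convention 0 log 0 = 0; +inf on support mismatch."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    return 0.5 * float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

def w1_1d(p, q, support):
    """W_1 for distributions on a shared sorted 1-D grid: integral of |F_p - F_q|."""
    gaps = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(gaps * np.diff(support)))

support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.5, 0.5, 0.0, 0.0])   # mass on {0, 1}
q = np.array([0.0, 0.0, 0.5, 0.5])   # mass on {2, 3}: disjoint support

print(kl(p, q))              # inf  -- support mismatch kills KL
print(js(p, q))              # 0.6931... = log 2, its saturation value
print(tv(p, q))              # 1.0  -- maximal, blind to how far apart the mass is
print(w1_1d(p, q, support))  # 2.0  -- exactly the translation distance
```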

Worked Example: Same Vectors, Different Verdicts

Let $x = (1, 0)$ and $y = (10, 0)$. Then:

  • Euclidean: $\|x - y\| = 9$.
  • Manhattan: $|1 - 10| + 0 = 9$.
  • Cosine distance: $1 - \frac{1 \cdot 10 + 0}{\sqrt{1} \cdot \sqrt{100}} = 1 - 1 = 0$.

By cosine the vectors are identical (same direction); by Euclidean they are far apart. Both are correct given their assumptions: cosine treats $x$ and $y$ as the same "concept" at different magnitudes; Euclidean treats them as different positions. The choice encodes whether magnitude carries signal.
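The three numbers above can be reproduced directly:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([10.0, 0.0])

euclidean = np.linalg.norm(x - y)                                # 9.0
manhattan = np.abs(x - y).sum()                                  # 9.0
cosine = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))   # 0.0

print(euclidean, manhattan, cosine)
```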

Common Confusions

Watch Out

KL is not a distance

$\mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p)$ in general, and neither satisfies the triangle inequality. Forward KL ($\mathrm{KL}(p \| q)$ with $p$ data and $q$ model) is mode-covering; reverse KL is mode-seeking. Using "KL distance" colloquially is fine; using it as a metric in proofs is not.
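A two-point example makes the asymmetry concrete (the probability values are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.5])   # "data"
q = np.array([0.9, 0.1])   # "model"

kl_pq = float(np.sum(p * np.log(p / q)))   # forward KL, about 0.511
kl_qp = float(np.sum(q * np.log(q / p)))   # reverse KL, about 0.368

print(kl_pq, kl_qp)   # the two directions disagree
```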

Watch Out

Cosine similarity is not invariant to centering

Cosine treats $x$ and $\alpha x$ ($\alpha > 0$) as identical, but treats $x$ and $x + c$ as different. Many practitioners "center then cosine," which is equivalent to Pearson correlation, not cosine. The two have different invariances and pick different nearest neighbours.
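Both claims are easy to verify (the vectors are arbitrary illustrative values):

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# "Center then cosine" coincides with Pearson correlation.
centered = cosine_sim(x - x.mean(), y - y.mean())
pearson = np.corrcoef(x, y)[0, 1]
print(np.isclose(centered, pearson))   # True

# Plain cosine: invariant to positive scaling, NOT to shifts.
print(np.isclose(cosine_sim(x, y), cosine_sim(3 * x, y)))   # True
print(np.isclose(cosine_sim(x, y), cosine_sim(x + 5, y)))   # False
```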

Watch Out

Wasserstein is not just a smoothed KL

Wasserstein incorporates the geometry of the underlying space (it depends on the cost $\|x - y\|^p$); KL does not. Two distributions that are identical up to a translation have $\mathrm{KL} = \infty$ as soon as the translation moves them off each other's support, but $W_p$ grows in proportion to the translation distance. This is exactly why WGAN trains stably where vanilla GAN does not.
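A minimal Sinkhorn sketch illustrates the translation behaviour (the regularization strength and iteration count are illustrative, not tuned; for real workloads a library such as POT is the usual choice):

```python
import numpy as np

def sinkhorn_cost(p, q, C, eps=0.05, iters=500):
    """Entropic-regularized OT via Sinkhorn iterations; returns approximate transport cost."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)   # rescale columns to match marginal q
        u = p / (K @ v)     # rescale rows to match marginal p
    plan = u[:, None] * K * v[None, :]   # approximate optimal coupling
    return float(np.sum(plan * C))

# Two uniform distributions on {0, 1} and {2, 3}: same shape, translated by 2.
xs = np.array([0.0, 1.0])
ys = np.array([2.0, 3.0])
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
C = np.abs(xs[:, None] - ys[None, :])   # |x - y| ground cost, giving W_1

print(sinkhorn_cost(p, q, C))   # ~2.0, the translation distance; KL between these
                                # distributions on R is +inf (disjoint supports)
```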

Exercises

Exercise (Core)

Problem

Verify that cosine "distance" $d_{\cos}(x, y) = 1 - \frac{x^T y}{\|x\| \|y\|}$ violates the triangle inequality by exhibiting three vectors $x, y, z$ on the unit circle in $\mathbb{R}^2$ with $d_{\cos}(x, z) > d_{\cos}(x, y) + d_{\cos}(y, z)$.

Exercise (Advanced)

Problem

Construct two probability distributions $p, q$ on $\mathbb{R}$ with disjoint supports such that $\mathrm{KL}(p \| q) = +\infty$ but $W_1(p, q) < \infty$. Compute $W_1(p, q)$ explicitly.

Last reviewed: April 18, 2026
