Mathematical Infrastructure

Distance Metrics Compared

Side-by-side reference for the distance and divergence functions that appear in ML: Lp norms, cosine, Mahalanobis, Hamming, Jaccard, edit distance, KL, Jensen-Shannon, Wasserstein, MMD. When each is appropriate and what assumptions it implicitly makes.


Why This Matters

Picking the wrong distance silently bakes a wrong assumption into the model. Cosine similarity discards magnitude, which is correct for normalized text embeddings and wrong for raw counts. Euclidean distance assumes isotropy, which is wrong for features on different scales. KL divergence is asymmetric and infinite when supports disagree, which is fatal for sample-based GAN training. Wasserstein fixes the support problem but requires solving an optimal-transport program. Each choice has a regime where it is the right tool and a regime where it ruins the experiment.

This page is a comparison reference, not a derivation. For mathematical foundations see metric-spaces-convergence-completeness, wasserstein-distances, kl-divergence, and non-euclidean-and-hyperbolic-geometry.

Vector-on-Vector

| Name | Formula | Metric? | Use it for | Avoid when |
| --- | --- | --- | --- | --- |
| Euclidean ($L_2$) | $\sqrt{\sum_i (x_i - y_i)^2}$ | yes | Geometry-respecting features, after standardization | Features on heterogeneous scales |
| Manhattan ($L_1$) | $\sum_i \lvert x_i - y_i \rvert$ | yes | Sparse data, robust regression, Lasso geometry | Smooth-gradient optimization |
| Chebyshev ($L_\infty$) | $\max_i \lvert x_i - y_i \rvert$ | yes | Worst-case error, grid-step distance | Average-case behaviour matters |
| Mahalanobis | $\sqrt{(x - y)^T \Sigma^{-1} (x - y)}$ | yes if $\Sigma \succ 0$ | Correlated features with known covariance | $\Sigma$ ill-conditioned or unknown |
| Cosine distance | $1 - \frac{x^T y}{\lVert x \rVert \lVert y \rVert}$ | not a metric | L2-normalized embeddings, text similarity | Magnitude carries signal |
| Angular distance | $\frac{1}{\pi} \arccos\!\big(\frac{x^T y}{\lVert x \rVert \lVert y \rVert}\big)$ | yes | Want a true metric on the sphere | Comparing magnitudes |
| Minkowski-$p$ | $\big(\sum_i \lvert x_i - y_i \rvert^p\big)^{1/p}$ | yes for $p \geq 1$ | Tuning robustness vs smoothness | $p < 1$ (not a metric) |

Notes:

  • $L_p$ for $p < 1$ is not a metric (the triangle inequality fails).
  • Cosine "distance" $1 - \cos\theta$ violates the triangle inequality; use angular distance when you need a true metric.
  • Mahalanobis equals Euclidean after whitening by $\Sigma^{-1/2}$. The two are the same operation in different bases.
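The whitening identity in the last note can be checked numerically. A minimal sketch with illustrative random data (the covariance here is an arbitrary positive-definite matrix, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# An arbitrary symmetric positive-definite covariance (illustrative).
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)

# Mahalanobis distance straight from the definition.
d = x - y
mahalanobis = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

# Whiten with Sigma^{-1/2} (symmetric square root via eigendecomposition),
# then take plain Euclidean distance in the whitened basis.
w, V = np.linalg.eigh(Sigma)
whiten = V @ np.diag(w ** -0.5) @ V.T
euclid_whitened = np.linalg.norm(whiten @ x - whiten @ y)

print(np.isclose(mahalanobis, euclid_whitened))   # True
```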

Set, Sequence, and String

| Name | Formula | Metric? | Use it for |
| --- | --- | --- | --- |
| Hamming | $\#\{i : x_i \neq y_i\}$ | yes | Equal-length binary or categorical strings |
| Jaccard | $1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | yes | Set similarity, document shingles |
| Tanimoto (binary fingerprints) | same as Jaccard on bit sets | yes | Cheminformatics fingerprint matching |
| Levenshtein (edit) | min # of insertions, deletions, substitutions | yes | Sequence comparison: spell-check, DNA |
| Longest common subsequence | $\lvert x \rvert + \lvert y \rvert - 2 \cdot \mathrm{LCS}(x, y)$ | yes | Diff tools, weak edit alignment |
| Tree edit distance | min # of tree-rewriting operations | yes | Parse tree comparison, structured data |

The string and tree metrics all satisfy the triangle inequality but are not embeddable into any low-dimensional Euclidean space without distortion. For high-throughput nearest-neighbour search on these, locality-sensitive hashing or MinHash are the standard tools.
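Two of these are short enough to write from scratch. A reference sketch using only the standard library (function names are mine):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free on match)
        prev = curr
    return prev[-1]


def jaccard_distance(A: set, B: set) -> float:
    """1 - |intersection| / |union|; two empty sets are taken to be at distance 0."""
    if not A and not B:
        return 0.0
    return 1 - len(A & B) / len(A | B)


print(levenshtein("kitten", "sitting"))          # 3
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))    # 0.5
```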

Distribution-on-Distribution

| Name | Formula | Symmetric? | Triangle? | Use it for | Trap |
| --- | --- | --- | --- | --- | --- |
| KL divergence | $\sum_i p_i \log(p_i / q_i)$ | no | no | MLE, variational inference, info-theoretic bounds | Infinite if $\mathrm{supp}(p) \not\subseteq \mathrm{supp}(q)$ |
| Reverse KL | $\sum_i q_i \log(q_i / p_i)$ | no | no | Variational lower bounds, mode-seeking objectives | Different mode behaviour than forward KL |
| Jensen-Shannon | $\tfrac{1}{2} \mathrm{KL}(p \,\Vert\, m) + \tfrac{1}{2} \mathrm{KL}(q \,\Vert\, m)$ with $m = \tfrac{1}{2}(p + q)$ | yes | $\sqrt{\mathrm{JS}}$ is a metric | Symmetric KL surrogate, GAN-style losses | Saturates at $\log 2$ on disjoint supports (vanishing gradients) |
| Total variation | $\tfrac{1}{2} \sum_i \lvert p_i - q_i \rvert$ | yes | yes | Worst-case event-probability gap | Ignores geometry; equals 1 for any disjoint supports |
| Wasserstein-$p$ | $\inf_{\pi \in \Pi(p, q)} \mathbb{E}_\pi[\lVert X - Y \rVert^p]^{1/p}$ | yes | yes | Distribution alignment under disjoint supports, OT, GAN training | Exact solvers scale roughly cubically; use Sinkhorn at scale |
| Maximum Mean Discrepancy (MMD) | $\sup_{\lVert f \rVert_{\mathcal{H}} \leq 1} \big(\mathbb{E}_p[f] - \mathbb{E}_q[f]\big)$ | yes | yes | Kernel-based two-sample tests, MMD GANs | Kernel choice and bandwidth matter a lot |
| Hellinger | $\tfrac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ | yes | yes | Robust two-sample comparison; $L_2$ on square-root densities | Less common in ML papers |
| Rényi-$\alpha$ | $\tfrac{1}{\alpha - 1} \log \sum_i p_i^\alpha q_i^{1 - \alpha}$ | no in general | no | Privacy accounting (DP), info bounds | Requires shared support |

KL is the dominant choice when you have parametric models and access to log-densities; Wasserstein dominates when distributions live on different parts of the space (real vs generated samples in early GAN training); MMD dominates when you only have samples and want a closed-form unbiased estimator with kernels.
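The support issue is easiest to see on a concrete pair of discrete distributions. A small sketch (helper names are mine; the 1-D $W_1$ uses the closed-form integral of the CDF gap):

```python
import numpy as np

def kl(p, q):
    """Discrete KL with the convention 0 log 0 = 0; +inf on support mismatch."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    return 0.5 * float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

def w1_1d(p, q, support):
    """W_1 for distributions on a shared sorted 1-D grid: integral of |F_p - F_q|."""
    gaps = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(gaps * np.diff(support)))

support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.5, 0.5, 0.0, 0.0])   # mass on {0, 1}
q = np.array([0.0, 0.0, 0.5, 0.5])   # mass on {2, 3}: disjoint support

print(kl(p, q))              # inf  -- support mismatch kills KL
print(js(p, q))              # 0.6931... = log 2, its saturation value
print(tv(p, q))              # 1.0  -- maximal, blind to how far apart the mass is
print(w1_1d(p, q, support))  # 2.0  -- exactly the translation distance
```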

Worked Example: Same Vectors, Different Verdicts

Let $x = (1, 0)$ and $y = (10, 0)$. Then:

  • Euclidean: $\|x - y\| = 9$.
  • Manhattan: $|1 - 10| + 0 = 9$.
  • Cosine distance: $1 - \frac{1 \cdot 10 + 0}{\sqrt{1} \cdot \sqrt{100}} = 1 - 1 = 0$.

By cosine the vectors are identical (same direction); by Euclidean they are far apart. Both are correct given their assumptions: cosine treats $x$ and $y$ as the same "concept" at different magnitudes; Euclidean treats them as different positions. The choice encodes whether magnitude carries signal.
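The three numbers above can be reproduced directly:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([10.0, 0.0])

euclidean = np.linalg.norm(x - y)                                # 9.0
manhattan = np.abs(x - y).sum()                                  # 9.0
cosine = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))   # 0.0

print(euclidean, manhattan, cosine)
```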

Common Confusions

Watch Out

KL is not a distance

$\mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p)$ in general, and neither satisfies the triangle inequality. Forward KL ($\mathrm{KL}(p \| q)$ with $p$ data and $q$ model) is mode-covering; reverse KL is mode-seeking. Using "KL distance" colloquially is fine; using it as a metric in proofs is not.
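A two-point example makes the asymmetry concrete (the probability values are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.5])   # "data"
q = np.array([0.9, 0.1])   # "model"

kl_pq = float(np.sum(p * np.log(p / q)))   # forward KL, about 0.511
kl_qp = float(np.sum(q * np.log(q / p)))   # reverse KL, about 0.368

print(kl_pq, kl_qp)   # the two directions disagree
```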

Watch Out

Cosine similarity is not invariant to centering

Cosine treats $x$ and $\alpha x$ ($\alpha > 0$) as identical, but treats $x$ and $x + c$ as different. Many practitioners "center then cosine," which is equivalent to Pearson correlation, not cosine. The two have different invariances and pick different nearest neighbours.
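Both claims are easy to verify (the vectors are arbitrary illustrative values):

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# "Center then cosine" coincides with Pearson correlation.
centered = cosine_sim(x - x.mean(), y - y.mean())
pearson = np.corrcoef(x, y)[0, 1]
print(np.isclose(centered, pearson))   # True

# Plain cosine: invariant to positive scaling, NOT to shifts.
print(np.isclose(cosine_sim(x, y), cosine_sim(3 * x, y)))   # True
print(np.isclose(cosine_sim(x, y), cosine_sim(x + 5, y)))   # False
```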

Watch Out

Wasserstein is not just a smoothed KL

Wasserstein incorporates the geometry of the underlying space (it depends on the cost $\|x - y\|^p$); KL does not. Two distributions that are identical up to a translation have $\mathrm{KL} = \infty$ as soon as the translation moves them off each other's support, but $W_p$ grows in proportion to the translation distance. This is exactly why WGAN trains stably where vanilla GAN does not.
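A minimal Sinkhorn sketch illustrates the translation behaviour (the regularization strength and iteration count are illustrative, not tuned; for real workloads a library such as POT is the usual choice):

```python
import numpy as np

def sinkhorn_cost(p, q, C, eps=0.05, iters=500):
    """Entropic-regularized OT via Sinkhorn iterations; returns approximate transport cost."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)   # rescale columns to match marginal q
        u = p / (K @ v)     # rescale rows to match marginal p
    plan = u[:, None] * K * v[None, :]   # approximate optimal coupling
    return float(np.sum(plan * C))

# Two uniform distributions on {0, 1} and {2, 3}: same shape, translated by 2.
xs = np.array([0.0, 1.0])
ys = np.array([2.0, 3.0])
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
C = np.abs(xs[:, None] - ys[None, :])   # |x - y| ground cost, giving W_1

print(sinkhorn_cost(p, q, C))   # ~2.0, the translation distance; KL between these
                                # distributions on R is +inf (disjoint supports)
```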

Exercises

Exercise (Core)

Problem

Verify that cosine "distance" $d_{\cos}(x, y) = 1 - \frac{x^T y}{\|x\| \|y\|}$ violates the triangle inequality by exhibiting three vectors $x, y, z$ on the unit circle in $\mathbb{R}^2$ with $d_{\cos}(x, z) > d_{\cos}(x, y) + d_{\cos}(y, z)$.

Exercise (Advanced)

Problem

Construct two probability distributions $p, q$ on $\mathbb{R}$ with disjoint supports such that $\mathrm{KL}(p \| q) = +\infty$ but $W_1(p, q) < \infty$. Compute $W_1(p, q)$ explicitly.

Last reviewed: April 18, 2026
