Foundations

Total Variation Distance

Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the L1 gap, one minus the overlap mass, or the minimum disagreement probability under a coupling.


Why This Matters

Total variation distance is the cleanest notion of worst-case distributional discrepancy. If you give two models the freedom to disagree on any measurable event, total variation asks: what is the biggest probability gap they can produce?

That single quantity shows up everywhere:

  • in coupling arguments, where TV is the disagreement probability under the best joint construction
  • in KL divergence, via Pinsker's inequality
  • in Wasserstein distance, as the contrast case where geometry is ignored
  • in MCMC and mixing-time theory, where convergence is often stated directly in TV

Total variation has three equivalent readings, and each teaches a different intuition

Total variation is the unmatched probability mass after overlap is removed: line up the two densities, keep the shared part \min(p,q), and count what is left over on either side.

Mental Model

Three equivalent readings are worth holding at once:

  1. Overlap view. Remove the shared mass \min(p,q); what remains is TV.
  2. Event view. Let an adversary pick the measurable set A where the two distributions disagree most. The biggest gap |P(A)-Q(A)| is TV.
  3. Coupling view. Draw X \sim P and Y \sim Q on the same probability space in the smartest possible way. The smallest possible probability that X \neq Y is TV.

Each view is mathematically equivalent, but each teaches a different instinct.

Core Definitions

Definition

Total Variation Distance

For two probability measures P and Q on the same measurable space,

\mathrm{TV}(P,Q) := \sup_{A} |P(A)-Q(A)|.

The supremum runs over all measurable sets A.

When densities p and q exist with respect to a common dominating measure, this becomes

\mathrm{TV}(P,Q) = \frac12 \int |p(x)-q(x)|\,dx = 1-\int \min(p,q)\,dx.

So TV is literally half the L^1 distance, or equivalently one minus the overlap mass.
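As a sanity check, the two density formulas can be compared directly on a small discrete example (the four-point distributions below are made up for illustration):

```python
import numpy as np

# Hypothetical discrete distributions on a 4-point space.
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.2, 0.3, 0.2])

tv_l1 = 0.5 * np.abs(p - q).sum()          # half the L1 distance
tv_overlap = 1.0 - np.minimum(p, q).sum()  # one minus the overlap mass

print(tv_l1, tv_overlap)  # both equal 0.2
```

Both expressions give the same number, as the pointwise identity |p-q| = p+q-2\min(p,q) guarantees.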

Main Theorems

Theorem

Equivalent Forms of Total Variation

Statement

The following are equivalent:

\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|=\frac12\int |p-q|\,dx=1-\int \min(p,q)\,dx.

If p and q are densities, the extremizing event is

A^*=\{x: p(x)>q(x)\}.

Intuition

The signed difference p-q has positive and negative regions. If you want the biggest possible probability gap, you keep exactly the region where p exceeds q and ignore the region where it falls below. That converts the absolute-value integral into the measurable-event supremum.

Proof Sketch

Split the space into A^*=\{p>q\} and its complement. On A^*, the difference p-q is positive; on the complement, it is nonpositive. Therefore

\int |p-q|\,dx = \int_{A^*}(p-q)\,dx + \int_{(A^*)^c}(q-p)\,dx = 2(P(A^*)-Q(A^*)).

This gives \mathrm{TV}(P,Q)=P(A^*)-Q(A^*)=\tfrac12\int|p-q|\,dx. The overlap identity follows from the pointwise equality |p-q|=p+q-2\min(p,q) together with \int p=\int q=1.
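On a finite support the supremum can be brute-forced over all 2^n events and compared against the set \{p>q\}. A quick check, using the same kind of made-up four-point distributions as above (note the maximizer need not be unique, since points where p=q can be added to A^* without changing the gap):

```python
import itertools
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.2, 0.3, 0.2])
n = len(p)

# Brute-force the supremum over all 2^n measurable events A.
best_gap = max(
    abs(p[list(A)].sum() - q[list(A)].sum())
    for r in range(n + 1)
    for A in itertools.combinations(range(n), r)
)

# The extremizing event A* = {x : p(x) > q(x)} attains it.
a_star = [i for i in range(n) if p[i] > q[i]]
gap_at_a_star = p[a_star].sum() - q[a_star].sum()

print(best_gap, gap_at_a_star)  # both equal 0.2
```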

Why It Matters

This theorem is why TV is so interpretable. It is simultaneously an adversarial event gap, a density-overlap deficit, and an L^1 discrepancy. Different subfields pick different forms, but they are all the same object.

Failure Mode

TV ignores geometry. Two point masses at nearby locations can have \mathrm{TV}=1 even if the ambient metric says they are extremely close. That is exactly why Wasserstein distance exists.

Theorem

Coupling Characterization of Total Variation

Statement

Among all couplings (X,Y) with marginals P and Q,

\mathrm{TV}(P,Q)=\inf_{(X,Y)} \mathbb P[X\neq Y].

Moreover, there exists a maximal coupling attaining equality.

Intuition

The common overlap mass can be coupled to agree exactly. Only the leftover unmatched mass must disagree. So the best possible disagreement probability is precisely the amount of unmatched mass, which is TV.

Proof Sketch

Write the shared mass as \mu(dx)=\min(p(x),q(x))\,dx. With probability \mu(\Omega), sample from the normalized overlap \mu/\mu(\Omega) and set X=Y there. Otherwise, sample X and Y independently from the normalized residuals of P and Q, which live on the non-overlap part. Agreement happens exactly on the shared-mass component, whose total weight is \int \min(p,q)\,dx = 1-\mathrm{TV}(P,Q). Hence the disagreement probability is \mathrm{TV}(P,Q).
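The construction above can be sketched for discrete distributions; this is a minimal simulation, not a general-purpose implementation, and the two distributions are again illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.2, 0.3, 0.2])

overlap = np.minimum(p, q)
w = overlap.sum()   # shared mass = 1 - TV
tv = 1.0 - w

def sample_pair():
    """Draw (X, Y) from the maximal coupling of p and q."""
    if rng.random() < w:
        # Shared-mass component: draw from the normalized overlap, set X = Y.
        x = rng.choice(len(p), p=overlap / w)
        return x, x
    # Residual component: X and Y drawn independently from the leftovers,
    # whose supports are disjoint, so X != Y here.
    x = rng.choice(len(p), p=(p - overlap) / tv)
    y = rng.choice(len(q), p=(q - overlap) / tv)
    return x, y

n = 100_000
disagree = sum(x != y for x, y in (sample_pair() for _ in range(n))) / n
print(tv, disagree)  # empirical disagreement rate is close to TV = 0.2
```

The residual densities p-\min(p,q) and q-\min(p,q) can never both be positive at the same point, which is why the disagreement probability is exactly the residual mass.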

Why It Matters

This is the bridge from abstract probability metrics to concrete stochastic processes. In mixing-time proofs, you often build a coupling and show the two chains have met by time t with high probability; the theorem converts that meeting event directly into a TV bound.

Failure Mode

Not every naive coupling is maximal. A bad coupling can make the disagreement probability much larger than TV. The theorem says TV is the best possible disagreement rate, not the rate produced by an arbitrary coupling.

Pinsker's Inequality

TV and KL are linked by the classical bound

TV(P,Q)12DKL(PQ).\mathrm{TV}(P,Q)\le \sqrt{\frac12 D_{\mathrm{KL}}(P\|Q)}.

This is useful in lower bounds, concentration arguments, and asymptotic statistics because KL often tensorizes more easily than TV. But it is only a one-way control: small KL implies small TV, while the reverse is false without extra assumptions.
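The bound is easy to probe numerically. A small check on random distribution pairs (drawn from a Dirichlet purely for convenience, so all entries are strictly positive and KL is finite):

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # Assumes q > 0 wherever p > 0; Dirichlet draws below are strictly positive.
    return float((p * np.log(p / q)).sum())

# Check Pinsker's bound TV <= sqrt(KL/2) on random distribution pairs.
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert tv(p, q) <= np.sqrt(0.5 * kl(p, q)) + 1e-12

print("Pinsker's bound held on 1000 random pairs")
```

The reverse direction fails: distributions with tiny TV can have huge (even infinite) KL when one places mass where the other has almost none.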

Why TV and Wasserstein Feel Different

TV is a support-sensitive metric: if two distributions place mass on disjoint sets, TV is already maximal. Wasserstein is a geometry-sensitive metric: if those disjoint sets are close in the ambient space, Wasserstein can still be small.

This difference explains the common dichotomy:

  • TV is natural for hypothesis testing, coupling, and mixing.
  • Wasserstein is natural for transport, generative modeling, and robustness with geometric structure.
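The dichotomy is starkest for two point masses a tiny distance apart. A toy computation (the gap eps is arbitrary; for two deltas, W_1 is simply the distance between their locations):

```python
import numpy as np

# Point masses at x = 0 and x = eps, represented on the two-point support {0, eps}.
eps = 1e-3
p = np.array([1.0, 0.0])   # delta at x = 0
q = np.array([0.0, 1.0])   # delta at x = eps

tv = 0.5 * np.abs(p - q).sum()   # disjoint supports: TV is maximal
w1 = eps                         # W1 moves unit mass a distance eps

print(tv, w1)  # 1.0 versus 0.001
```

TV saturates at 1 no matter how small eps is, while Wasserstein shrinks linearly with it.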

Common Confusions

Watch Out

TV is a metric, but it is not geometric

TV satisfies symmetry and the triangle inequality, so it is a genuine metric. But it does not know about ambient distances between outcomes. It only sees how much mass fails to overlap.

Watch Out

The factor 1/2 is convention, not substance

Some ML papers define TV as \|P-Q\|_1 without the \frac12. Probability texts almost always include the \frac12, giving range [0,1]. The two conventions differ by exactly a factor of two. Always check which one a paper uses before comparing constants.
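In code, the two conventions are one multiplication apart (the distributions below are just illustrative):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.3, 0.2, 0.3, 0.2])

l1 = np.abs(p - q).sum()   # ML-style ||P - Q||_1 convention, range [0, 2]
tv = 0.5 * l1              # probability-text convention, range [0, 1]

print(l1, tv)  # 0.4 and 0.2
```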

Summary

  • \mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|
  • with densities, \mathrm{TV}(P,Q)=\frac12\int |p-q|\,dx = 1-\int \min(p,q)\,dx
  • TV is the smallest disagreement probability over all couplings
  • TV is sensitive to support mismatch but blind to geometry
  • Pinsker links TV to KL: \mathrm{TV}\le \sqrt{D_{\mathrm{KL}}/2}

Exercise


Problem

Let P=\mathrm{Bernoulli}(0.7) and Q=\mathrm{Bernoulli}(0.4). Compute \mathrm{TV}(P,Q) using both the event-gap definition and the \frac12\|p-q\|_1 formula.
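If you want to check your answer numerically afterwards, exact rational arithmetic keeps both routes transparent:

```python
from fractions import Fraction

# P = Bernoulli(0.7), Q = Bernoulli(0.4) on outcomes {0, 1}.
p = [Fraction(3, 10), Fraction(7, 10)]   # P(0), P(1)
q = [Fraction(6, 10), Fraction(4, 10)]   # Q(0), Q(1)

# Event-gap form: the only nontrivial events are {0} and {1}.
event_gap = max(abs(p[0] - q[0]), abs(p[1] - q[1]))

# Half-L1 form.
half_l1 = Fraction(1, 2) * (abs(p[0] - q[0]) + abs(p[1] - q[1]))

print(event_gap, half_l1)  # both 3/10
```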

References

Canonical:

  • Levin, Peres, and Wilmer, Markov Chains and Mixing Times (2009), Chapter 4
  • Villani, Optimal Transport: Old and New (2009), Chapter 6 for the contrast with Wasserstein

Current / standard texts:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), sections on coupling and total variation
  • van der Vaart, Asymptotic Statistics (1998), Appendix and Chapter 7 for TV-KL relations

Last reviewed: April 20, 2026
