Foundations
Total Variation Distance
Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the $L^1$ gap between densities, one minus the overlap mass, or the minimum disagreement probability under a coupling.
Why This Matters
Total variation distance is the cleanest notion of worst-case distributional discrepancy. If you give two models the freedom to disagree on any measurable event, total variation asks: what is the biggest probability gap they can produce?
That single quantity shows up everywhere:
- in coupling arguments, where TV is the disagreement probability under the best joint construction
- in KL divergence, via Pinsker's inequality
- in Wasserstein distance, as the contrast case where geometry is ignored
- in MCMC and mixing-time theory, where convergence is often stated directly in TV
Total variation has three equivalent readings, and each teaches a different intuition.
Total variation is the unmatched probability mass after overlap is removed.
Mental Model
Three equivalent readings are worth holding at once:
- Overlap view. Remove the shared mass $\int \min(p, q) \, d\mu$; what remains is TV.
- Event view. Let an adversary pick the measurable set $A$ where the two distributions disagree most. The biggest gap $|P(A) - Q(A)|$ is TV.
- Coupling view. Draw $X \sim P$ and $Y \sim Q$ on the same probability space in the smartest possible way. The smallest possible probability that $X \neq Y$ is TV.
Each view is mathematically equivalent, but each teaches a different instinct.
Core Definitions
Total Variation Distance
For two probability measures $P$ and $Q$ on the same measurable space $(\Omega, \mathcal{F})$,

$$\mathrm{TV}(P, Q) \;=\; \sup_{A \in \mathcal{F}} \big| P(A) - Q(A) \big|.$$

The supremum runs over all measurable sets $A$.
When densities $p$ and $q$ exist with respect to a common dominating measure $\mu$, this becomes

$$\mathrm{TV}(P, Q) \;=\; \frac{1}{2} \int |p - q| \, d\mu \;=\; 1 - \int \min(p, q) \, d\mu.$$

So TV is literally half the $L^1$ distance, or equivalently one minus the overlap mass.
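The three formulas are easy to check numerically on a finite support. A minimal sketch, using two illustrative mass functions of my own choosing (not from the text):

```python
# Three equivalent computations of total variation distance for two
# discrete distributions on the same finite support.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# Half the L1 distance between the mass functions.
tv_l1 = 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# One minus the overlap mass.
tv_overlap = 1.0 - sum(min(a, b) for a, b in zip(p, q))

# Event view: the extremizing event keeps exactly the points where p > q.
tv_event = sum(a - b for a, b in zip(p, q) if a > b)

print(tv_l1, tv_overlap, tv_event)  # all three agree: 0.3
```

All three expressions return the same number, which is the content of the equivalence theorem below.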
Main Theorems
Equivalent Forms of Total Variation
Statement
The following are equivalent:

$$\mathrm{TV}(P, Q) \;=\; \sup_{A} \big| P(A) - Q(A) \big| \;=\; \frac{1}{2} \int |p - q| \, d\mu \;=\; 1 - \int \min(p, q) \, d\mu.$$

If $p$ and $q$ are densities, the extremizing event is $A^* = \{ p > q \}$.
Intuition
The signed difference $p - q$ has positive and negative regions. If you want the biggest possible probability gap, you keep exactly the region where $p$ exceeds $q$ and ignore the region where it falls below. That converts the absolute-value integral into the measurable-event supremum.
Proof Sketch
Split the space into $A^* = \{p > q\}$ and its complement. On $A^*$, the difference $p - q$ is positive; on the complement, it is nonpositive. Therefore

$$\int |p - q| \, d\mu \;=\; \big( P(A^*) - Q(A^*) \big) + \big( Q\big((A^*)^c\big) - P\big((A^*)^c\big) \big) \;=\; 2 \big( P(A^*) - Q(A^*) \big).$$

This gives $\sup_A |P(A) - Q(A)| = P(A^*) - Q(A^*) = \frac{1}{2} \int |p - q| \, d\mu$. The overlap identity follows from the pointwise equality $|p - q| = p + q - 2 \min(p, q)$ and the fact that $\int p \, d\mu = \int q \, d\mu = 1$.
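On a small support, the claim that $A^* = \{p > q\}$ attains the supremum can be verified by brute force over every event. A sketch with illustrative distributions of my own choosing:

```python
from itertools import chain, combinations

# Brute-force check on a four-point support: the supremum of |P(A) - Q(A)|
# over all 2^4 events A is attained at A* = {x : p(x) > q(x)} and equals
# half the L1 distance.
support = [0, 1, 2, 3]
p = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
q = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

def prob(mass, event):
    return sum(mass[x] for x in event)

# Enumerate every subset of the support and take the largest gap.
events = chain.from_iterable(
    combinations(support, r) for r in range(len(support) + 1)
)
best = max(abs(prob(p, a) - prob(q, a)) for a in events)

a_star = [x for x in support if p[x] > q[x]]          # the event {p > q}
gap_at_a_star = prob(p, a_star) - prob(q, a_star)
tv_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in support)

print(best, gap_at_a_star, tv_l1)  # all three equal 0.4
```

The exhaustive maximum, the gap at $\{p > q\}$, and half the $L^1$ distance coincide, as the proof sketch predicts.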
Why It Matters
This theorem is why TV is so interpretable. It is simultaneously an adversarial event gap, a density-overlap deficit, and an $L^1$ discrepancy. Different subfields pick different forms, but they are all the same object.
Failure Mode
TV ignores geometry. Two point masses at nearby locations can have $\mathrm{TV} = 1$ even if the ambient metric says they are extremely close. That is exactly why Wasserstein distance exists.
Coupling Characterization of Total Variation
Statement
Among all couplings $(X, Y)$ with marginals $X \sim P$ and $Y \sim Q$,

$$\Pr(X \neq Y) \;\ge\; \mathrm{TV}(P, Q).$$

Moreover, there exists a maximal coupling attaining equality, so $\mathrm{TV}(P, Q) = \min_{\text{couplings}} \Pr(X \neq Y)$.
Intuition
The common overlap mass can be coupled to agree exactly. Only the leftover unmatched mass must disagree. So the best possible disagreement probability is precisely the amount of unmatched mass, which is TV.
Proof Sketch
Write the shared mass as $m = \int \min(p, q) \, d\mu$. With probability $m$, sample a single point from the normalized overlap $\min(p, q)/m$ and set $X = Y$ there. With probability $1 - m$, sample $X$ and $Y$ independently from the normalized residuals $(p - \min(p, q))/(1 - m)$ and $(q - \min(p, q))/(1 - m)$, which have disjoint supports. Agreement happens exactly on the shared-mass component, whose total weight is $m = 1 - \mathrm{TV}(P, Q)$. Hence the disagreement probability is $\mathrm{TV}(P, Q)$.
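The construction above can be simulated directly. A Monte Carlo sketch, using illustrative discrete distributions of my own choosing, that estimates the maximal coupling's disagreement rate:

```python
import random

# Maximal coupling simulation: with probability m draw a shared value from
# the normalized overlap, otherwise draw X and Y from the two residuals.
support = [0, 1, 2]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

overlap = [min(a, b) for a, b in zip(p, q)]
m = sum(overlap)                                # shared mass = 1 - TV
res_p = [a - o for a, o in zip(p, overlap)]     # leftover mass of P
res_q = [b - o for b, o in zip(q, overlap)]     # leftover mass of Q

rng = random.Random(0)

def sample_from(weights):
    # Inverse-CDF sampling from a (possibly sub-normalized) weight vector.
    u = rng.random() * sum(weights)
    acc = 0.0
    for x, w in zip(support, weights):
        acc += w
        if u <= acc:
            return x
    return support[-1]

n = 100_000
disagree = 0
for _ in range(n):
    if rng.random() < m:
        x = y = sample_from(overlap)   # agree on the shared-mass component
    else:
        x = sample_from(res_p)         # the residuals have disjoint supports,
        y = sample_from(res_q)         # so x != y on this branch
    disagree += (x != y)

tv = 1 - m
print(disagree / n, tv)  # empirical disagreement rate is close to TV = 0.3
```

The empirical disagreement frequency concentrates around $1 - m = \mathrm{TV}(P, Q)$, which is the content of the theorem.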
Why It Matters
This is the bridge from abstract probability metrics to concrete stochastic processes. In mixing-time proofs, you often build a coupling and show the two chains have met by time $t$ with high probability; the theorem converts that meeting event directly into a TV bound.
Failure Mode
Not every naive coupling is maximal. A bad coupling can make the disagreement probability much larger than TV. The theorem says TV is the best possible disagreement rate, not the rate produced by an arbitrary coupling.
Pinsker's Inequality
TV and KL are linked by the classical bound

$$\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2} \, \mathrm{KL}(P \,\|\, Q)}.$$
This is useful in lower bounds, concentration arguments, and asymptotic statistics because KL often tensorizes more easily than TV. But it is only a one-way control: small KL implies small TV, while the reverse is false without extra assumptions.
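A quick numeric check of the bound, for a pair of Bernoulli distributions chosen here purely for illustration:

```python
import math

# Pinsker's inequality check: TV <= sqrt(KL / 2) for Bernoulli(a) vs
# Bernoulli(b). For Bernoullis, TV is simply |a - b|.
def kl_bernoulli(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

a, b = 0.5, 0.8
tv = abs(a - b)
bound = math.sqrt(kl_bernoulli(a, b) / 2)
print(tv, bound)  # tv = 0.3 is below the Pinsker bound
```

Note the one-way nature of the control: the bound can be loose, and making $\mathrm{KL}$ small forces $\mathrm{TV}$ small, never the other way around.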
Why TV and Wasserstein Feel Different
TV is a support-sensitive metric: if two distributions place mass on disjoint sets, TV is already maximal. Wasserstein is a geometry-sensitive metric: if those disjoint sets are close in the ambient space, Wasserstein can still be small.
This difference explains the common dichotomy:
- TV is natural for hypothesis testing, coupling, and mixing.
- Wasserstein is natural for transport, generative modeling, and robustness with geometric structure.
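The dichotomy is starkest for two nearby point masses. A minimal sketch (for unit point masses, the 1-Wasserstein distance reduces to the ambient gap $|x - y|$, so no transport solver is needed):

```python
# Two unit point masses only 0.001 apart, put on a common two-point grid.
# TV saturates because the supports are disjoint; W1 just reports the cost
# of moving all the mass across the ambient gap.
x, y = 0.0, 0.001
p = {x: 1.0, y: 0.0}   # delta at x
q = {x: 0.0, y: 1.0}   # delta at y

tv = 1 - sum(min(p[z], q[z]) for z in (x, y))  # overlap is zero, so TV = 1
w1 = abs(x - y)                                # optimal plan moves everything x -> y

print(tv, w1)  # 1.0 vs 0.001
```

Shrinking the gap drives $W_1$ to zero while TV stays pinned at its maximum.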
Common Confusions
TV is a metric, but it is not geometric
TV satisfies symmetry and the triangle inequality, so it is a genuine metric. But it does not know about ambient distances between outcomes. It only sees how much mass fails to overlap.
The factor 1/2 is convention, not substance
Some ML papers define TV as $\int |p - q| \, d\mu$, without the $\frac{1}{2}$. Probability texts almost always include the $\frac{1}{2}$, giving range $[0, 1]$. The two conventions differ by exactly a factor of two. Always check which one a paper uses before comparing constants.
Summary
- $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$; with densities, $\mathrm{TV}(P, Q) = \frac{1}{2} \int |p - q| \, d\mu = 1 - \int \min(p, q) \, d\mu$
- TV is the smallest disagreement probability $\Pr(X \neq Y)$ over all couplings of $P$ and $Q$
- TV is sensitive to support mismatch but blind to geometry
- Pinsker links TV to KL: $\mathrm{TV}(P, Q) \le \sqrt{\frac{1}{2} \mathrm{KL}(P \,\|\, Q)}$
Exercise
Problem
Let $P$ and $Q$ be two probability distributions on a small finite set. Compute $\mathrm{TV}(P, Q)$ using both the event-gap definition and the $\frac{1}{2} \int |p - q|$ formula, and verify that the two answers agree.
References
Canonical:
- Levin, Peres, and Wilmer, Markov Chains and Mixing Times (2009), Chapter 4
- Villani, Optimal Transport: Old and New (2009), Chapter 6 for the contrast with Wasserstein
Current / standard texts:
- Durrett, Probability: Theory and Examples (5th ed., 2019), sections on coupling and total variation
- van der Vaart, Asymptotic Statistics (1998), Appendix and Chapter 7 for TV-KL relations
Last reviewed: April 20, 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Measure-Theoretic Probability (Layer 0B)