

Coupling Arguments and Mixing Time

Coupling constructs two Markov chains on the same probability space so that they eventually meet, which bounds the total variation distance to stationarity and hence the mixing time. The spectral gap and the coupling inequality are the main tools for proving how fast MCMC converges.


Why This Matters

When you run MCMC, the central question is: how many steps before the chain is close to the stationary distribution? If you stop too early, your samples are biased by the initial state. If you run too long, you waste computation. Mixing time gives a rigorous answer.

Coupling arguments are the sharpest general tool for bounding mixing time. They convert the abstract question "how close is the chain's distribution to stationarity?" into a concrete one: "how long until two coupled chains meet?"

Total Variation Distance

Definition

Total Variation Distance

The total variation distance between two probability distributions $\mu$ and $\nu$ on a countable space $\Omega$ is:

$$d_{\text{TV}}(\mu, \nu) = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)| = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|$$

It ranges from 0 (identical distributions) to 1 (mutually singular distributions).
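The definition translates directly into code. A minimal sketch (the function name and the example pmfs below are illustrative, not from the text):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance: half the L1 distance between two pmfs."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    return 0.5 * np.abs(mu - nu).sum()

# Two example distributions on a 3-point space.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_distance(mu, nu))  # = max over A of |mu(A) - nu(A)|, here A = {0}
```

The equivalent max-over-events form is attained by the set $A = \{x : \mu(x) > \nu(x)\}$, which is why summing the positive parts of $\mu - \nu$ gives the same number.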

Definition

Mixing Time

The mixing time of an ergodic Markov chain with stationary distribution $\pi$ is:

$$t_{\text{mix}}(\epsilon) = \min\left\{t : \max_{x_0} d_{\text{TV}}(P^t(x_0, \cdot), \pi) \leq \epsilon\right\}$$

where $P^t(x_0, \cdot)$ is the distribution after $t$ steps starting from $x_0$. By convention, $t_{\text{mix}} = t_{\text{mix}}(1/4)$. The choice of $1/4$ is arbitrary; for any $\epsilon$, $t_{\text{mix}}(\epsilon) \leq t_{\text{mix}} \cdot \lceil \log_2(1/\epsilon) \rceil$.
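For small finite state spaces, the definition can be checked by brute force: power the transition matrix and track the worst-start TV distance. The `mixing_time` helper and the 3-state lazy chain below are illustrative choices, not from the text:

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, t_max=10_000):
    """Smallest t with max-over-starts TV(P^t(x, .), pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # Row x of Pt is the distribution P^t(x, .); take the worst start.
        worst = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if worst <= eps:
            return t
    return None  # did not mix within t_max steps

# Lazy walk on a 3-cycle: stay w.p. 1/2, move to each neighbor w.p. 1/4.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.ones(3) / 3  # uniform is stationary by symmetry
print(mixing_time(P, pi))
```

This brute-force approach costs one matrix multiplication per step, so it only scales to state spaces where $|\Omega| \times |\Omega|$ matrices fit in memory; the coupling and spectral tools below exist precisely because this is infeasible for interesting chains.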

Coupling

Definition

Coupling

A coupling of two distributions $\mu$ and $\nu$ on $\Omega$ is a joint distribution $(X, Y)$ on $\Omega \times \Omega$ such that $X \sim \mu$ and $Y \sim \nu$ marginally. Any two distributions can be coupled in many ways; the art is choosing a coupling that makes $X$ and $Y$ meet quickly.

For Markov chains: a coupling of two copies $(X_t)$ and $(Y_t)$ of a chain with transition kernel $P$ is a joint process on the same probability space such that each marginal is a valid copy of the chain. The coupling time is $\tau = \min\{t : X_t = Y_t\}$.

A faithful coupling ensures that once $X_t = Y_t$, the chains stay together: $X_s = Y_s$ for all $s \geq \tau$.

Theorem

Coupling Inequality

Statement

For any coupling $(X, Y)$ of distributions $\mu$ and $\nu$:

$$d_{\text{TV}}(\mu, \nu) \leq P(X \neq Y)$$

Moreover, there exists an optimal coupling that achieves equality.

Intuition

If $X$ and $Y$ are coupled so they agree with high probability, then $\mu$ and $\nu$ must be close. The coupling inequality makes this precise: the probability of disagreement upper bounds the total variation distance. You do not need to compute the distance directly; you just need to show the coupled chains meet quickly.

Proof Sketch

For any event $A$: $\mu(A) - \nu(A) = P(X \in A) - P(Y \in A) \leq P(X \in A, X \neq Y) \leq P(X \neq Y)$. Taking the max over $A$ gives $d_{\text{TV}} \leq P(X \neq Y)$. The optimal coupling (which can be constructed explicitly) places mass $\min(\mu(x), \nu(x))$ on the diagonal and distributes the remainder optimally.
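The explicit construction can be written out for finite distributions: put the overlap $\min(\mu(x), \nu(x))$ on the diagonal, then pair up the (mutually singular) residuals. The function name and example values below are illustrative:

```python
import numpy as np

def optimal_coupling(mu, nu):
    """Joint pmf J with row marginals mu, column marginals nu, maximizing P(X = Y)."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    overlap = np.minimum(mu, nu)      # diagonal mass min(mu(x), nu(x))
    d_tv = 1.0 - overlap.sum()        # leftover mass on each side equals d_TV
    J = np.diag(overlap)
    if d_tv > 0:
        # The residuals mu - overlap and nu - overlap live on disjoint supports;
        # pairing them independently distributes the remaining off-diagonal mass.
        J += np.outer(mu - overlap, nu - overlap) / d_tv
    return J

mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
J = optimal_coupling(mu, nu)
p_neq = 1.0 - np.trace(J)  # P(X != Y); equals d_TV(mu, nu) for this coupling
```

Checking the row and column sums of `J` confirms the marginals, and `p_neq` matches the TV distance, which is the equality case of the coupling inequality.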

Why It Matters

This is the fundamental bridge between coupling and mixing. To bound the mixing time, start one chain from any state $x_0$ and the other from stationarity $\pi$. Construct a coupling where they meet by time $t$. Then $d_{\text{TV}}(P^t(x_0, \cdot), \pi) \leq P(\tau > t)$. Bounding the coupling time bounds the mixing time.

Failure Mode

The coupling inequality gives an upper bound. A bad coupling (one where the chains rarely meet) gives a loose bound, not a proof of slow mixing. To prove lower bounds on mixing time, you need different techniques (e.g., bottleneck ratio, conductance). The inequality also requires constructing an explicit coupling, which can be non-trivial for complex chains.

Spectral Gap and Mixing

Definition

Spectral Gap

For a reversible Markov chain with transition matrix $P$ and stationary distribution $\pi$, the spectral gap is:

$$\gamma = 1 - \lambda_2$$

where $\lambda_2$ is the second largest eigenvalue of $P$. (Strictly, the absolute spectral gap uses $\max(|\lambda_2|, |\lambda_n|)$, but for lazy chains all eigenvalues are nonnegative, so $\gamma = 1 - \lambda_2$.) A larger spectral gap means faster mixing.

Theorem

Spectral Gap and Mixing Time

Statement

For a reversible, irreducible Markov chain on a finite state space with spectral gap $\gamma$ and minimum stationary probability $\pi_{\min}$:

$$\frac{1}{\gamma}\left(1 - \frac{1}{2e}\right) \leq t_{\text{mix}} \leq \frac{1}{\gamma} \log\left(\frac{1}{\pi_{\min}}\right)$$
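A minimal numerical sketch of the theorem on a two-state chain (the chain and its parameter values are illustrative assumptions, chosen so the gap is known in closed form):

```python
import numpy as np

# Two-state chain: flip 0 -> 1 w.p. p, flip 1 -> 0 w.p. q.
p, q = 0.3, 0.1
P = np.array([[1 - p, p],
              [q, 1 - q]])
pi = np.array([q, p]) / (p + q)  # stationary distribution (detailed balance holds)

# Conjugating by sqrt(pi) symmetrizes a reversible P, so eigvalsh applies.
D = np.diag(np.sqrt(pi))
A = D @ P @ np.linalg.inv(D)
eigs = np.sort(np.linalg.eigvalsh(A))

gamma = 1.0 - eigs[-2]  # spectral gap; for this chain gamma = p + q exactly
upper = np.log(1.0 / pi.min()) / gamma  # upper bound on t_mix from the theorem
print(gamma, upper)
```

For this chain the second eigenvalue is $1 - p - q$, so the computed `gamma` can be verified against the closed form, and `upper` gives a concrete (if loose, for a two-state chain) mixing-time bound.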

Intuition

The spectral gap controls the rate of exponential convergence. The upper bound says mixing happens in roughly $\frac{1}{\gamma} \log(1/\pi_{\min})$ steps. The $\log(1/\pi_{\min})$ factor accounts for the worst-case starting state: if $\pi$ puts very small mass on some states, starting there requires more time to "diffuse" into the bulk of the distribution.

Proof Sketch

For the upper bound: after $t$ steps, $d_{\text{TV}}(P^t(x, \cdot), \pi) \leq \sqrt{(1-\gamma)^{2t} / \pi(x)}$ by the spectral decomposition of $P$. Setting this to $1/4$ and solving gives $t \leq \frac{1}{2\gamma}\log(4/\pi_{\min})$. The lower bound uses the variational characterization of $\gamma$ and the fact that certain test functions must converge at rate $1 - \gamma$.

Why It Matters

The spectral gap gives a quantitative connection between the chain's linear algebra (eigenvalues) and its statistical behavior (mixing). For random walks on graphs, the spectral gap of the graph Laplacian determines mixing: well-connected graphs mix fast, poorly connected ones mix slowly.

Failure Mode

Computing the spectral gap exactly requires knowing all eigenvalues, which is feasible only for small state spaces or chains with special structure (e.g., random walks on known graphs). For continuous state spaces, the spectral gap is replaced by the Poincaré constant, and bounding it requires functional inequalities (log-Sobolev, isoperimetry) that can be difficult to establish.

Common Confusions

Watch Out

Coupling time is not mixing time

The coupling time $\tau$ is a random variable for a specific coupling construction. The mixing time is a deterministic quantity. The coupling inequality says $t_{\text{mix}}(\epsilon) \leq \min\{t : \max_{x_0, y_0} P(\tau > t) \leq \epsilon\}$. A good coupling gives a tight bound, but a bad coupling can overestimate the mixing time by an arbitrary amount.

Watch Out

Spectral gap bounds are for reversible chains

The spectral gap analysis assumes reversibility (detailed balance). Many practical MCMC algorithms (Metropolis-Hastings, Gibbs sampling) produce reversible chains. Non-reversible chains (e.g., Hamiltonian Monte Carlo, lifted Markov chains) can mix faster than any reversible chain on the same state space, and their analysis requires different tools.

Canonical Examples

Example

Coupling for lazy random walk on the hypercube

Consider a lazy random walk on $\{0, 1\}^n$: at each step, pick a coordinate $i$ uniformly and with probability $1/2$ re-randomize $x_i$; otherwise stay. Coupling: run two copies $(X_t, Y_t)$ using the same coordinate choice $i$ and the same coin flip. Once $X_{t,i} = Y_{t,i}$ for some coordinate $i$, that coordinate stays coupled. The coupling time is the time until all $n$ coordinates have been re-randomized, which is a coupon collector problem with expected time $O(n \log n)$. By the coupling inequality, $t_{\text{mix}} = O(n \log n)$.
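This coupling can be simulated directly. The sketch below (function name and parameters are illustrative) starts the two chains at opposite corners, the worst case, and estimates the coupling time, which should concentrate near the coupon-collector value $2 n H_n$:

```python
import random

def hypercube_coupling_time(n, rng):
    """Identity coupling of two lazy walks on {0,1}^n from opposite corners."""
    X, Y = [0] * n, [1] * n
    t = 0
    while X != Y:
        t += 1
        i = rng.randrange(n)       # both chains pick the same coordinate...
        if rng.random() < 0.5:     # ...and share the same lazy coin
            X[i] = Y[i] = rng.randint(0, 1)  # re-randomize: coordinate i couples
    return t

n = 10
times = [hypercube_coupling_time(n, random.Random(seed)) for seed in range(200)]
avg = sum(times) / len(times)
print(avg)  # near 2 * n * H_n, about 58.6 for n = 10
```

Note that every run satisfies $\tau \geq n$: the chains start disagreeing in all $n$ coordinates and each step couples at most one of them.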

Exercises

ExerciseCore

Problem

Two coins have bias $p = 0.6$ and $q = 0.4$. Construct an explicit optimal coupling $(X, Y)$ where $X \sim \text{Bernoulli}(0.6)$ and $Y \sim \text{Bernoulli}(0.4)$, and compute $P(X \neq Y)$. Verify this equals $d_{\text{TV}}$.

ExerciseAdvanced

Problem

A lazy random walk on the cycle $\mathbb{Z}_n$ stays put with probability $1/2$ and moves left or right each with probability $1/4$. Use the spectral gap to bound the mixing time. The eigenvalues of the transition matrix are $\lambda_k = \frac{1}{2}(1 + \cos(2\pi k / n))$ for $k = 0, 1, \ldots, n-1$.

References

Canonical:

  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2009), Chapters 4-5, 12-13
  • Lindvall, Lectures on the Coupling Method (2002), Chapters 1-3

Current:

  • Montenegro & Tetali, "Mathematical Aspects of Mixing Times in Markov Chains" (2006)
  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Last reviewed: April 2026