

Coupling Arguments and Mixing Time

Coupling constructs two Markov chains on the same probability space so that they eventually meet, which bounds the total variation distance to stationarity and hence the mixing time. The spectral gap and the coupling inequality are the main tools for proving how fast MCMC converges.


Why This Matters

When you run MCMC, the central question is: how many steps before the chain is close to the stationary distribution? If you stop too early, your samples are biased by the initial state. If you run too long, you waste computation. Mixing time gives a rigorous answer.

Coupling arguments are the sharpest general tool for bounding mixing time. They convert the abstract question "how close is the chain's distribution to stationarity?" into a concrete one: "how long until two coupled chains meet?"

Total Variation Distance

Definition

Total Variation Distance

The total variation distance between two probability distributions $\mu$ and $\nu$ on a countable space $\Omega$ is:

$$d_{\text{TV}}(\mu, \nu) = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)| = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|$$

It ranges from 0 (identical distributions) to 1 (mutually singular distributions).
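The definition translates directly into code. A minimal sketch (the function name and the example pmfs below are illustrative, not from the text):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance: half the L1 distance between two pmfs."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    return 0.5 * np.abs(mu - nu).sum()

# Two example distributions on a 3-point space.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_distance(mu, nu))  # = max over A of |mu(A) - nu(A)|, here A = {0}
```

The equivalent max-over-events form is attained by the set $A = \{x : \mu(x) > \nu(x)\}$, which is why summing the positive parts of $\mu - \nu$ gives the same number.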

Definition

Mixing Time

The mixing time of an ergodic Markov chain with stationary distribution $\pi$ is:

$$t_{\text{mix}}(\epsilon) = \min\left\{t : \max_{x_0} d_{\text{TV}}(P^t(x_0, \cdot), \pi) \leq \epsilon\right\}$$

where $P^t(x_0, \cdot)$ is the distribution after $t$ steps starting from $x_0$. By convention, $t_{\text{mix}} = t_{\text{mix}}(1/4)$. The choice of $1/4$ is arbitrary; for any $\epsilon$, $t_{\text{mix}}(\epsilon) \leq t_{\text{mix}} \cdot \lceil \log_2(1/\epsilon) \rceil$.
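For small finite state spaces, the definition can be checked by brute force: power the transition matrix and track the worst-start TV distance. The `mixing_time` helper and the 3-state lazy chain below are illustrative choices, not from the text:

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, t_max=10_000):
    """Smallest t with max-over-starts TV(P^t(x, .), pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # Row x of Pt is the distribution P^t(x, .); take the worst start.
        worst = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if worst <= eps:
            return t
    return None  # did not mix within t_max steps

# Lazy walk on a 3-cycle: stay w.p. 1/2, move to each neighbor w.p. 1/4.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.ones(3) / 3  # uniform is stationary by symmetry
print(mixing_time(P, pi))
```

This brute-force approach costs one matrix multiplication per step, so it only scales to state spaces where $|\Omega| \times |\Omega|$ matrices fit in memory; the coupling and spectral tools below exist precisely because this is infeasible for interesting chains.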

Coupling

Definition

Coupling

A coupling of two distributions $\mu$ and $\nu$ on $\Omega$ is a joint distribution $(X, Y)$ on $\Omega \times \Omega$ such that $X \sim \mu$ and $Y \sim \nu$ marginally. Any two distributions can be coupled in many ways; the art is choosing a coupling that makes $X$ and $Y$ meet quickly.

For Markov chains: a coupling of two copies $(X_t)$ and $(Y_t)$ of a chain with transition kernel $P$ is a joint process on the same probability space such that each marginal is a valid copy of the chain. The coupling time is $\tau = \min\{t : X_t = Y_t\}$.

A faithful coupling ensures that once $X_t = Y_t$, the chains stay together: $X_s = Y_s$ for all $s \geq \tau$.

Theorem

Coupling Inequality

Statement

For any coupling $(X, Y)$ of distributions $\mu$ and $\nu$:

$$d_{\text{TV}}(\mu, \nu) \leq P(X \neq Y)$$

Moreover, there exists an optimal coupling that achieves equality.

Intuition

If $X$ and $Y$ are coupled so they agree with high probability, then $\mu$ and $\nu$ must be close. The coupling inequality makes this precise: the probability of disagreement upper bounds the total variation distance. You do not need to compute the distance directly; you just need to show the coupled chains meet quickly.

Proof Sketch

For any event $A$: $\mu(A) - \nu(A) = P(X \in A) - P(Y \in A) \leq P(X \in A, X \neq Y) \leq P(X \neq Y)$. Taking the max over $A$ gives $d_{\text{TV}} \leq P(X \neq Y)$. The optimal coupling (which can be constructed explicitly) places mass $\min(\mu(x), \nu(x))$ on the diagonal and distributes the remainder optimally.
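The explicit construction can be written out for finite distributions: put the overlap $\min(\mu(x), \nu(x))$ on the diagonal, then pair up the (mutually singular) residuals. The function name and example values below are illustrative:

```python
import numpy as np

def optimal_coupling(mu, nu):
    """Joint pmf J with row marginals mu, column marginals nu, maximizing P(X = Y)."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    overlap = np.minimum(mu, nu)      # diagonal mass min(mu(x), nu(x))
    d_tv = 1.0 - overlap.sum()        # leftover mass on each side equals d_TV
    J = np.diag(overlap)
    if d_tv > 0:
        # The residuals mu - overlap and nu - overlap live on disjoint supports;
        # pairing them independently distributes the remaining off-diagonal mass.
        J += np.outer(mu - overlap, nu - overlap) / d_tv
    return J

mu, nu = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
J = optimal_coupling(mu, nu)
p_neq = 1.0 - np.trace(J)  # P(X != Y); equals d_TV(mu, nu) for this coupling
```

Checking the row and column sums of `J` confirms the marginals, and `p_neq` matches the TV distance, which is the equality case of the coupling inequality.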

Why It Matters

This is the fundamental bridge between coupling and mixing. To bound the mixing time, start one chain from any state $x_0$ and the other from stationarity $\pi$. Construct a coupling where they meet by time $t$. Then $d_{\text{TV}}(P^t(x_0, \cdot), \pi) \leq P(\tau > t)$. Bounding the coupling time bounds the mixing time.

Failure Mode

The coupling inequality gives an upper bound. A bad coupling (one where the chains rarely meet) gives a loose bound, not a proof of slow mixing. To prove lower bounds on mixing time, you need different techniques (e.g., bottleneck ratio, conductance). The inequality also requires constructing an explicit coupling, which can be non-trivial for complex chains.

Spectral Gap and Mixing

Definition

Spectral Gap

For a reversible Markov chain with transition matrix $P$ and stationary distribution $\pi$, the spectral gap is:

$$\gamma = 1 - \lambda_2$$

where $\lambda_2$ is the second largest eigenvalue of $P$. (Strictly, the absolute spectral gap uses $\max(|\lambda_2|, |\lambda_n|)$, but for lazy chains all eigenvalues are nonnegative, so $\gamma = 1 - \lambda_2$.) A larger spectral gap means faster mixing.

Theorem

Spectral Gap and Mixing Time

Statement

For a reversible, irreducible Markov chain on a finite state space with spectral gap $\gamma$ and minimum stationary probability $\pi_{\min}$:

$$\frac{1}{\gamma}\left(1 - \frac{1}{2e}\right) \leq t_{\text{mix}} \leq \frac{1}{\gamma} \log\left(\frac{1}{\pi_{\min}}\right)$$
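A minimal numerical sketch of the theorem on a two-state chain (the chain and its parameter values are illustrative assumptions, chosen so the gap is known in closed form):

```python
import numpy as np

# Two-state chain: flip 0 -> 1 w.p. p, flip 1 -> 0 w.p. q.
p, q = 0.3, 0.1
P = np.array([[1 - p, p],
              [q, 1 - q]])
pi = np.array([q, p]) / (p + q)  # stationary distribution (detailed balance holds)

# Conjugating by sqrt(pi) symmetrizes a reversible P, so eigvalsh applies.
D = np.diag(np.sqrt(pi))
A = D @ P @ np.linalg.inv(D)
eigs = np.sort(np.linalg.eigvalsh(A))

gamma = 1.0 - eigs[-2]  # spectral gap; for this chain gamma = p + q exactly
upper = np.log(1.0 / pi.min()) / gamma  # upper bound on t_mix from the theorem
print(gamma, upper)
```

For this chain the second eigenvalue is $1 - p - q$, so the computed `gamma` can be verified against the closed form, and `upper` gives a concrete (if loose, for a two-state chain) mixing-time bound.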

Intuition

The spectral gap controls the rate of exponential convergence. The upper bound says mixing happens in roughly $\frac{1}{\gamma} \log(1/\pi_{\min})$ steps. The $\log(1/\pi_{\min})$ factor accounts for the worst-case starting state: if $\pi$ puts very small mass on some states, starting there requires more time to "diffuse" into the bulk of the distribution.

Proof Sketch

For the upper bound: after $t$ steps, $d_{\text{TV}}(P^t(x, \cdot), \pi) \leq \sqrt{(1-\gamma)^{2t} / \pi(x)}$ by the spectral decomposition of $P$. Setting this to $1/4$ and solving gives $t \leq \frac{1}{2\gamma}\log(4/\pi_{\min})$. The lower bound uses the variational characterization of $\gamma$ and the fact that certain test functions must converge at rate $1 - \gamma$.

Why It Matters

The spectral gap gives a quantitative connection between the chain's linear algebra (eigenvalues) and its statistical behavior (mixing). For random walks on graphs, the spectral gap of the graph Laplacian determines mixing: well-connected graphs mix fast, poorly connected ones mix slowly.

Failure Mode

Computing the spectral gap exactly requires knowing all eigenvalues, which is feasible only for small state spaces or chains with special structure (e.g., random walks on known graphs). For continuous state spaces, the spectral gap is replaced by the Poincaré constant, and bounding it requires functional inequalities (log-Sobolev, isoperimetry) that can be difficult to establish.

Common Confusions

Watch Out

Coupling time is not mixing time

The coupling time $\tau$ is a random variable for a specific coupling construction. The mixing time is a deterministic quantity. The coupling inequality says $t_{\text{mix}}(\epsilon) \leq \min\{t : \max_{x_0, y_0} P(\tau > t) \leq \epsilon\}$. A good coupling gives a tight bound, but a bad coupling can overestimate the mixing time by an arbitrary amount.

Watch Out

Spectral gap bounds are for reversible chains

The spectral gap analysis assumes reversibility (detailed balance). Many practical MCMC algorithms (Metropolis-Hastings, Gibbs sampling) produce reversible chains. Non-reversible chains (e.g., Hamiltonian Monte Carlo, lifted Markov chains) can mix faster than any reversible chain on the same state space, and their analysis requires different tools.

Canonical Examples

Example

Coupling for lazy random walk on the hypercube

Consider a lazy random walk on $\{0, 1\}^n$: at each step, pick a coordinate $i$ uniformly and with probability $1/2$ re-randomize $x_i$; otherwise stay. Coupling: run two copies $(X_t, Y_t)$ using the same coordinate choice $i$ and the same coin flip. Once $X_{t,i} = Y_{t,i}$ for some coordinate $i$, that coordinate stays coupled. The coupling time is the time until all $n$ coordinates have been re-randomized, which is a coupon collector problem with expected time $O(n \log n)$. By the coupling inequality, $t_{\text{mix}} = O(n \log n)$.
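This coupling can be simulated directly. The sketch below (function name and parameters are illustrative) starts the two chains at opposite corners, the worst case, and estimates the coupling time, which should concentrate near the coupon-collector value $2 n H_n$:

```python
import random

def hypercube_coupling_time(n, rng):
    """Identity coupling of two lazy walks on {0,1}^n from opposite corners."""
    X, Y = [0] * n, [1] * n
    t = 0
    while X != Y:
        t += 1
        i = rng.randrange(n)       # both chains pick the same coordinate...
        if rng.random() < 0.5:     # ...and share the same lazy coin
            X[i] = Y[i] = rng.randint(0, 1)  # re-randomize: coordinate i couples
    return t

n = 10
times = [hypercube_coupling_time(n, random.Random(seed)) for seed in range(200)]
avg = sum(times) / len(times)
print(avg)  # near 2 * n * H_n, about 58.6 for n = 10
```

Note that every run satisfies $\tau \geq n$: the chains start disagreeing in all $n$ coordinates and each step couples at most one of them.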

Exercises

ExerciseCore

Problem

Two coins have bias $p = 0.6$ and $q = 0.4$. Construct an explicit optimal coupling $(X, Y)$ where $X \sim \text{Bernoulli}(0.6)$ and $Y \sim \text{Bernoulli}(0.4)$, and compute $P(X \neq Y)$. Verify this equals $d_{\text{TV}}$.

ExerciseAdvanced

Problem

A lazy random walk on the cycle $\mathbb{Z}_n$ stays put with probability $1/2$ and moves left or right each with probability $1/4$. Use the spectral gap to bound the mixing time. The eigenvalues of the transition matrix are $\lambda_k = \frac{1}{2}(1 + \cos(2\pi k / n))$ for $k = 0, 1, \ldots, n-1$.

References

Canonical:

  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2009), Chapters 4-5, 12-13
  • Lindvall, Lectures on the Coupling Method (2002), Chapters 1-3

Current:

  • Montenegro & Tetali, "Mathematical Aspects of Mixing Times in Markov Chains" (2006)
  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Last reviewed: April 2026