

Burn-in and Convergence Diagnostics

Knowing when an MCMC chain has reached stationarity and when to trust its samples. Burn-in, Gelman-Rubin R-hat, effective sample size, trace plots, and autocorrelation.



Why This Matters

MCMC gives you samples from a target distribution, but only eventually. The chain starts from some arbitrary initial state and needs time to "forget" where it started. If you use samples from before the chain has converged, your estimates will be biased toward the initial conditions.

Convergence diagnostics answer the most important practical question in MCMC: are my samples trustworthy? There is no perfect answer, but there are reliable heuristics that catch most problems.

Mental Model

Imagine dropping a ball into a complex bowl. Initially, the ball rolls around chaotically depending on where you dropped it. After enough time, it settles into a pattern that depends only on the shape of the bowl, not on the starting position. Burn-in is the period of chaotic rolling that you throw away. Convergence diagnostics help you judge when the ball has "settled."

The hard truth: you can never prove a chain has converged. You can only detect certain kinds of non-convergence.

Formal Setup and Notation

Let \{X_t\}_{t=0}^{T} be a Markov chain with stationary distribution \pi. We want to estimate \mathbb{E}_\pi[f(X)] using the ergodic average:

\hat{\mu}_T = \frac{1}{T - B} \sum_{t=B+1}^{T} f(X_t)

where B is the burn-in period (number of initial samples discarded).
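The estimator is easy to sketch in code. Here a hypothetical AR(1) process stands in for an MCMC trace: its stationary mean is 0, but the chain is started far away, so the plain average over all samples is biased toward the starting point while the burned-in average is not.

```python
import numpy as np

def ergodic_average(samples, burn_in):
    """mu_hat = (1 / (T - B)) * sum_{t = B+1}^{T} f(X_t), with f the identity."""
    return np.mean(samples[burn_in:])

rng = np.random.default_rng(0)
# Hypothetical AR(1) stand-in for an MCMC trace: stationary mean 0,
# but started far away at x_0 = 500.
x = np.empty(10_000)
x[0] = 500.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

print(np.mean(x))               # biased toward the starting point
print(ergodic_average(x, 500))  # discarding burn-in removes the bias
```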

Definition

Burn-in

The burn-in period is the initial segment of the Markov chain that is discarded before computing estimates. The purpose is to reduce bias from the initial state X_0. After burn-in, the chain should be approximately sampling from the stationary distribution \pi.

There is no universal formula for how long burn-in should be. It depends on how far X_0 is from the typical set of \pi, the mixing rate of the chain, and the geometry of \pi.

Definition

Mixing Time

The mixing time \tau_{\text{mix}}(\epsilon) is the smallest t such that the chain's distribution is within total variation distance \epsilon of \pi, regardless of the starting state:

\tau_{\text{mix}}(\epsilon) = \min\{t : \sup_{X_0} \|P^t(X_0, \cdot) - \pi\|_{\text{TV}} \leq \epsilon\}

This is a property of the chain, not the initial state. A chain with small mixing time forgets its initial state quickly.
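For a chain small enough to enumerate, the mixing time can be computed directly from matrix powers. A sketch with a hypothetical two-state chain (the transition matrix and tolerance are chosen for illustration, not taken from the text):

```python
import numpy as np

# Hypothetical two-state chain; its stationary distribution is pi = (2/3, 1/3)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])

def mixing_time(P, pi, eps=0.01):
    """Smallest t with sup_{x0} TV(P^t(x0, .), pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst case over x0
        if tv <= eps:
            return t
    return None

print(mixing_time(P, pi))  # this chain forgets its start in about a dozen steps
```

Note that the mixing time shrinks as the tolerance \epsilon loosens, but it never depends on which state the chain starts in: the definition takes the worst case over X_0.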

Definition

Effective Sample Size (ESS)

MCMC samples are autocorrelated, so T samples carry less information than T independent samples. The effective sample size is:

\text{ESS} = \frac{T}{1 + 2\sum_{k=1}^{\infty} \rho_k}

where \rho_k = \mathrm{Corr}(f(X_t), f(X_{t+k})) is the lag-k autocorrelation. If consecutive samples are highly correlated, ESS can be much smaller than T. The standard error of the MCMC estimate is approximately \sigma / \sqrt{\text{ESS}}.
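A minimal ESS estimator truncates the autocorrelation sum at the first non-positive term (a simplified version of Geyer's initial-sequence rule). The AR(1) chain used to exercise it is a hypothetical example with a known answer: for \rho_k = \phi^k, ESS = T(1-\phi)/(1+\phi).

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = T / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    if max_lag is None:
        max_lag = T // 2
    x = x - x.mean()
    var = np.dot(x, x) / T
    acf_sum = 0.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / (T * var)  # lag-k autocorrelation
        if rho_k <= 0:
            break
        acf_sum += rho_k
    return T / (1.0 + 2.0 * acf_sum)

# Hypothetical AR(1) chain with phi = 0.9: theoretical ESS = T/19
rng = np.random.default_rng(0)
phi, T = 0.9, 50_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()
print(effective_sample_size(x))  # roughly T/19, far below T
```

Production samplers use more careful truncation rules (e.g. pairwise initial positive sequences), but the T/(1 + 2\sum \rho_k) structure is the same.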

Core Definitions

Trace plots show f(X_t) vs t. A well-mixing chain looks like white noise: it moves rapidly through the support of \pi with no visible trends or long excursions. A poorly mixing chain shows long plateaus (the chain is stuck) or slow drifts (the chain has not reached stationarity).

Autocorrelation plots show \rho_k vs lag k. Ideally, autocorrelations decay rapidly to zero. If they remain high at large lags, the chain is mixing slowly and you need many more samples (or a better sampler).

Running mean plots show the cumulative mean of f(X_t) vs t. If the chain has converged, the running mean stabilizes. If it drifts, the chain has not mixed.
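The quantities behind these plots are easy to compute without any plotting library. A sketch, using white noise as a stand-in for a well-mixing trace and a random walk as a stand-in for a chain that has not reached stationarity (both hypothetical):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelations rho_1 .. rho_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    var = np.dot(x, x) / T
    return np.array([np.dot(x[:-k], x[k:]) / (T * var)
                     for k in range(1, max_lag + 1)])

def running_mean(x):
    """Cumulative mean of f(X_t) vs t."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(1)
good = rng.normal(size=5_000)             # white-noise-like trace
slow = np.cumsum(rng.normal(size=5_000))  # random walk: has not mixed

print(autocorrelation(good, 5))  # near zero at every lag
print(autocorrelation(slow, 5))  # near one: mixing has failed
```

Feeding these arrays to any plotting routine reproduces the trace, autocorrelation, and running-mean diagnostics described above.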

Main Theorems

Proposition

Gelman-Rubin R-hat Diagnostic

Statement

Run M \geq 2 chains, each of length N (after burn-in), from overdispersed starting points. Let \bar{\theta}_m be the mean of chain m and \bar{\theta} the grand mean. Define:

Between-chain variance: B = \frac{N}{M-1}\sum_{m=1}^{M}(\bar{\theta}_m - \bar{\theta})^2

Within-chain variance: W = \frac{1}{M}\sum_{m=1}^{M} s_m^2, where s_m^2 is the sample variance of chain m

The potential scale reduction factor is:

\hat{R} = \sqrt{\frac{\frac{N-1}{N}W + \frac{1}{N}B}{W}}

If the chains have converged, \hat{R} \approx 1. Values of \hat{R} > 1.01 suggest non-convergence.
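The statement translates directly into code. A minimal sketch (the synthetic "chains" here are hypothetical arrays of draws, not the output of a real sampler):

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """R-hat for an (M, N) array: M chains of length N, post burn-in."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (N - 1) / N * W + B / N        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
# Converged case: 4 chains all sampling the same N(0, 1)
converged = rng.normal(size=(4, 2_000))
# Stuck case: chains centered on different modes
stuck = rng.normal(size=(4, 2_000)) + np.array([-3.0, -3.0, 3.0, 3.0])[:, None]

print(gelman_rubin_rhat(converged))  # approximately 1
print(gelman_rubin_rhat(stuck))      # well above 1
```

Note that `chains.mean(axis=1).var(ddof=1)` is exactly \frac{1}{M-1}\sum_m(\bar{\theta}_m - \bar{\theta})^2, so multiplying by N gives B as defined above.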

Intuition

If all chains have converged to \pi, they should have similar means and variances. The between-chain variance B captures disagreement among chains. The within-chain variance W captures variation within each chain. If B \gg W, the chains disagree, suggesting they have not converged. \hat{R} measures the ratio of total estimated variance to within-chain variance.

Proof Sketch

The numerator \frac{N-1}{N}W + \frac{1}{N}B is an overestimate of \mathrm{Var}_\pi(f) (because B includes both true variance and between-chain disagreement), and W is an underestimate (because finite chains have not explored the full distribution). Their ratio exceeds 1 when the chains have not mixed. As N \to \infty, both converge to \mathrm{Var}_\pi(f) and \hat{R} \to 1.

Why It Matters

\hat{R} is the most widely used convergence diagnostic. Every major Bayesian software package (Stan, PyMC, JAGS) reports it. The rule \hat{R} < 1.01 (or the older threshold 1.1) is standard practice. It catches the most common failure mode: chains stuck in different modes of a multimodal distribution.

Failure Mode

\hat{R} \approx 1 does not guarantee convergence. All chains could be stuck in the same mode while missing others. If the starting points are not overdispersed enough, the diagnostic has no power. Also, \hat{R} is defined for scalar quantities; for multivariate targets, you need to check \hat{R} for each marginal and for derived quantities.

Canonical Examples

Example

Bimodal target

Suppose \pi is a mixture of two well-separated Gaussians. A single Metropolis-Hastings chain may get stuck in one mode for the entire run. The trace plot shows a flat line, and the running mean converges to that mode's mean rather than the mixture mean. Running 4 chains from different starting points reveals the problem: chains in different modes have different means, so \hat{R} \gg 1.
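This failure is easy to reproduce. A sketch using a hypothetical mixture of N(-5, 1) and N(5, 1) and a small-step random-walk Metropolis sampler; all the numbers here are illustrative choices, not from the text:

```python
import numpy as np

def log_target(x):
    """Mixture of N(-5, 1) and N(+5, 1): two well-separated modes."""
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

def metropolis(x0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis with Gaussian proposals of scale `step`."""
    if rng is None:
        rng = np.random.default_rng()
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop  # accept
        xs[t] = x
    return xs

rng = np.random.default_rng(3)
# Four chains from overdispersed starting points: two per mode
chains = np.stack([metropolis(x0, 5_000, rng=rng)
                   for x0 in (-5.0, -5.0, 5.0, 5.0)])

# Gelman-Rubin R-hat from the chain means and variances
M, N = chains.shape
W = chains.var(axis=1, ddof=1).mean()
B = N * chains.mean(axis=1).var(ddof=1)
rhat = np.sqrt(((N - 1) / N * W + B / N) / W)
print(chains.mean(axis=1))  # near -5 or +5, never the mixture mean 0
print(rhat)                 # far above 1: the chains disagree
```

With proposal scale 0.5 the chains essentially never cross the low-density barrier near 0, so each one looks perfectly converged in isolation; only the multi-chain comparison exposes the problem.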

Example

Well-mixing chain

For a 2D Gaussian target with moderate correlation, a well-tuned random-walk Metropolis chain with acceptance rate around 0.234 mixes rapidly. The trace plot looks like white noise, autocorrelations decay to zero by lag 20, and \hat{R} from 4 chains falls below 1.01 after a few hundred iterations.

Common Confusions

Watch Out

Burn-in is not a fixed fraction

Some textbooks suggest discarding the first 50% of samples as burn-in. This is wasteful for well-mixing chains and insufficient for slowly-mixing chains. Use diagnostics to judge when the chain has converged, then discard the pre-convergence samples. There is no one-size-fits-all burn-in length.

Watch Out

Low R-hat does not mean the chain has converged

\hat{R} < 1.01 is necessary but not sufficient for convergence. It is possible for all chains to be stuck in the same local mode, giving \hat{R} \approx 1 while the chain has not explored the full target. Always supplement \hat{R} with trace plots, autocorrelation analysis, and domain knowledge.

Watch Out

More samples is not always the answer

If the chain is not mixing, running it longer just gives you more samples from the wrong distribution. Fix the sampler first (better proposal, HMC, reparameterization), then run longer.

Summary

  • Burn-in: discard initial samples before the chain reaches stationarity
  • There is no universal burn-in formula; use diagnostics
  • Gelman-Rubin \hat{R}: run multiple chains, compare within- vs between-chain variance; want \hat{R} < 1.01
  • Effective sample size: \text{ESS} = T / (1 + 2\sum_k \rho_k); accounts for autocorrelation
  • Trace plots, autocorrelation plots, and running means are essential visual diagnostics
  • You can never prove convergence; you can only detect non-convergence

Exercises

ExerciseCore

Problem

You run 4 MCMC chains for 10,000 iterations each. The within-chain variance is W = 2.0 and the between-chain variance is B = 20.0. Compute \hat{R} and interpret.

ExerciseAdvanced

Problem

A chain of 10,000 samples has lag-1 autocorrelation \rho_1 = 0.95, and autocorrelations decay geometrically: \rho_k = 0.95^k. What is the effective sample size?

References

Canonical:

  • Gelman & Rubin, "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science (1992)
  • Geyer, "Practical Markov Chain Monte Carlo," Statistical Science (1992)

Current:

  • Vehtari, Gelman, Simpson, Carpenter & Bürkner, "Rank-Normalization, Folding, and Localization: An Improved \hat{R} for Assessing Convergence of MCMC," Bayesian Analysis (2021)

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5


Last reviewed: April 2026
