Burn-in and Convergence Diagnostics
Knowing when an MCMC chain has reached stationarity and when to trust its samples. Burn-in, Gelman-Rubin R-hat, effective sample size, trace plots, and autocorrelation.
Why This Matters
MCMC gives you samples from a target distribution, but only eventually. The chain starts from some arbitrary initial state and needs time to "forget" where it started. If you use samples from before the chain has converged, your estimates will be biased toward the initial conditions.
Convergence diagnostics answer the most important practical question in MCMC: are my samples trustworthy? There is no perfect answer, but there are reliable heuristics that catch most problems.
Mental Model
Imagine dropping a ball into a complex bowl. Initially, the ball rolls around chaotically depending on where you dropped it. After enough time, it settles into a pattern that depends only on the shape of the bowl, not on the starting position. Burn-in is the period of chaotic rolling that you throw away. Convergence diagnostics help you judge when the ball has "settled."
The hard truth: you can never prove a chain has converged. You can only detect certain kinds of non-convergence.
Formal Setup and Notation
Let $X_1, X_2, \ldots$ be a Markov chain with stationary distribution $\pi$. We want to estimate $\mathbb{E}_\pi[f(X)]$ using the ergodic average:

$$\hat{f} = \frac{1}{N - B} \sum_{t=B+1}^{N} f(X_t)$$

where $B$ is the burn-in period (number of initial samples discarded) and $N$ is the total chain length.
Burn-in
The burn-in period is the initial segment of the Markov chain that is discarded before computing estimates. The purpose is to reduce bias from the initial state $X_0$. After burn-in, the chain should be approximately sampling from the stationary distribution $\pi$.
There is no universal formula for how long burn-in should be. It depends on how far $X_0$ is from the typical set of $\pi$, the mixing rate of the chain, and the geometry of $\pi$.
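To see burn-in bias concretely, here is a small NumPy sketch. The AR(1) chain, its coefficient, the starting point, and the burn-in length are all illustrative choices, not canonical values:

```python
import numpy as np

# Illustrative AR(1) chain started far from stationarity; the stationary
# mean is 0, so any bias in the ergodic average comes from the start.
rng = np.random.default_rng(0)
phi, N, B = 0.95, 20_000, 500
x = np.empty(N)
x[0] = 500.0                      # deliberately bad starting point
for t in range(1, N):
    x[t] = phi * x[t - 1] + rng.normal()

mean_all = x.mean()               # biased toward the initial state
mean_post = x[B:].mean()          # burn-in discarded; much closer to 0
print(f"mean keeping burn-in:    {mean_all:.3f}")
print(f"mean discarding burn-in: {mean_post:.3f}")
```

Because the initial transient decays geometrically, even a modest burn-in removes almost all of the bias here; the remaining error is ordinary Monte Carlo noise.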
Mixing Time
The mixing time $\tau_{\mathrm{mix}}(\varepsilon)$ is the smallest $t$ such that the chain's distribution is within total variation distance $\varepsilon$ of $\pi$, regardless of the starting state:

$$\tau_{\mathrm{mix}}(\varepsilon) = \min\left\{ t : \max_{x_0} \left\| P^t(x_0, \cdot) - \pi \right\|_{\mathrm{TV}} \le \varepsilon \right\}$$
This is a property of the chain, not the initial state. A chain with small mixing time forgets its initial state quickly.
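For a chain small enough to enumerate, the mixing time can be computed directly from the definition. A two-state sketch (the transition matrix is an arbitrary illustrative choice):

```python
import numpy as np

# Two-state chain; rows of P sum to 1. Its stationary distribution
# solves pi P = pi, giving pi = (2/3, 1/3) for these numbers.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])

def tv_to_pi(start, t):
    """TV distance between the t-step distribution from `start` and pi."""
    dist = np.linalg.matrix_power(P, t)[start]
    return 0.5 * np.abs(dist - pi).sum()

# Mixing time: first t where the WORST-CASE starting state is within eps.
eps = 0.01
t = 0
while max(tv_to_pi(0, t), tv_to_pi(1, t)) > eps:
    t += 1
print("mixing time at eps = 0.01:", t)
```

The `max` over starting states is the point of the definition: mixing time is a property of the chain itself, so the slowest-forgetting start sets the clock.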
Effective Sample Size (ESS)
MCMC samples are autocorrelated, so $N$ samples carry less information than $N$ independent samples. The effective sample size is:

$$N_{\mathrm{eff}} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k}$$

where $\rho_k$ is the lag-$k$ autocorrelation. If consecutive samples are highly correlated, ESS can be much smaller than $N$. The standard error of the MCMC estimate is approximately $\sqrt{\mathrm{Var}_\pi(f) / N_{\mathrm{eff}}}$.
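A minimal ESS estimator, truncating the autocorrelation sum at the first non-positive term (one common truncation rule among several; the function names are ours, not from any library):

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

def ess(x, max_lag=1000):
    """N / (1 + 2 * sum of rho_k), truncated at the first rho_k <= 0."""
    rho = autocorr(x, max_lag)
    s = 0.0
    for k in range(1, len(rho)):
        if rho[k] <= 0:
            break
        s += rho[k]
    return len(x) / (1.0 + 2.0 * s)

rng = np.random.default_rng(1)
iid = rng.normal(size=10_000)          # independent draws: ESS near N
ar = np.empty(10_000)                  # AR(1), phi = 0.9: ESS far below N
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print("ESS of iid samples: ", round(ess(iid)))
print("ESS of AR(1) samples:", round(ess(ar)))
```

For the AR(1) chain the theoretical ESS fraction is $(1-\phi)/(1+\phi) \approx 0.053$, so roughly 500 effective samples out of 10,000.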
Core Definitions
Trace plots show $f(X_t)$ vs $t$. A well-mixing chain looks like white noise: it moves rapidly through the support of $\pi$ with no visible trends or long excursions. A poorly mixing chain shows long plateaus (the chain is stuck) or slow drifts (the chain has not reached stationarity).
Autocorrelation plots show $\rho_k$ vs lag $k$. Ideally, autocorrelations decay rapidly to zero. If they remain high at large lags, the chain is mixing slowly and you need many more samples (or a better sampler).
Running mean plots show the cumulative mean of $f(X_t)$ vs $t$. If the chain has converged, the running mean stabilizes. If it drifts, the chain has not mixed.
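Running means can be monitored numerically as well as visually. A sketch with a hypothetical drift statistic of our own devising (how much the running mean moves over the second half of the chain):

```python
import numpy as np

def running_mean(x):
    """Cumulative mean of the chain, as shown in a running-mean plot."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)

def second_half_drift(x):
    """Change in the running mean over the second half of the chain."""
    rm = running_mean(x)
    return abs(rm[-1] - rm[len(rm) // 2])

rng = np.random.default_rng(2)
stationary = rng.normal(size=5_000)                           # mixed chain
drifting = rng.normal(size=5_000) + np.linspace(0, 3, 5_000)  # still moving

print("stationary drift:", round(second_half_drift(stationary), 3))
print("drifting drift:  ", round(second_half_drift(drifting), 3))
```

A stabilized chain gives a drift near zero; the chain with a trend gives a large one. Any fixed cutoff between the two is a judgment call, which is why this is usually done by eye.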
Main Theorems
Gelman-Rubin R-hat Diagnostic
Statement
Run $m$ chains, each of length $n$ (after burn-in), from overdispersed starting points. Let $\bar{\theta}_j$ be the mean of chain $j$ and $\bar{\theta}$ the grand mean. Define:

Between-chain variance: $B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\theta}_j - \bar{\theta})^2$

Within-chain variance: $W = \frac{1}{m} \sum_{j=1}^{m} s_j^2$, where $s_j^2$ is the sample variance of chain $j$

The potential scale reduction factor is:

$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}, \qquad \hat{V} = \frac{n-1}{n} W + \frac{1}{n} B$$

If the chains have converged, $\hat{R} \approx 1$. Values of $\hat{R} > 1.01$ suggest non-convergence.
Intuition
If all chains have converged to $\pi$, they should have similar means and variances. The between-chain variance $B$ captures disagreement among chains. The within-chain variance $W$ captures variation within each chain. If $B \gg W$, the chains disagree, suggesting they have not converged. $\hat{R}$ measures the ratio of total estimated variance to within-chain variance.
Proof Sketch
The numerator $\hat{V}$ is an overestimate of $\mathrm{Var}_\pi(\theta)$ (because $B$ includes both true variance and between-chain disagreement), and $W$ is an underestimate (because finite chains have not explored the full distribution). Their ratio exceeds 1 when chains have not mixed. As $n \to \infty$, both converge to $\mathrm{Var}_\pi(\theta)$ and $\hat{R} \to 1$.
Why It Matters
$\hat{R}$ is the most widely used convergence diagnostic. Every major Bayesian software package (Stan, PyMC, JAGS) reports it. The $\hat{R} < 1.01$ rule (or the older threshold of $1.1$) is standard practice. It catches the most common failure mode: chains stuck in different modes of a multimodal distribution.
Failure Mode
$\hat{R} \approx 1$ does not guarantee convergence. All chains could be stuck in the same mode while missing others. If the starting points are not overdispersed enough, the diagnostic has no power. Also, $\hat{R}$ is defined for scalar quantities; for multivariate targets, you need to check $\hat{R}$ for each marginal and for derived quantities.
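The statement above translates directly into a few lines of NumPy. This is a minimal single-parameter sketch of the original (non-split) diagnostic, not the rank-normalized version modern packages report:

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for `chains` of shape (m, n): m chains of length n (post burn-in)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    V_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(3)
# Four chains sampling the same target vs. one chain stuck elsewhere.
converged = rng.normal(size=(4, 1000))
stuck = rng.normal(size=(4, 1000)) + np.array([0, 0, 0, 5])[:, None]
print("R-hat, agreeing chains:", round(gelman_rubin(converged), 3))
print("R-hat, one stuck chain:", round(gelman_rubin(stuck), 3))
```

The agreeing chains give $\hat{R}$ very close to 1; the displaced chain inflates $B$ and pushes $\hat{R}$ well above 1.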
Canonical Examples
Bimodal target
Suppose $\pi$ is a mixture of two well-separated Gaussians. A single Metropolis-Hastings chain may get stuck in one mode for the entire run. The trace plot shows a flat line, and the running mean converges to the mode's mean rather than the mixture mean. Running 4 chains from different starting points reveals the problem: chains in different modes have different means, so $\hat{R} \gg 1$.
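A sketch of this failure, assuming a random-walk Metropolis sampler with a small step size on an equal-weight mixture of $\mathcal{N}(-10, 1)$ and $\mathcal{N}(10, 1)$ (all settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_target(x):
    # Unnormalized log density of the two-component Gaussian mixture.
    return np.logaddexp(-0.5 * (x + 10) ** 2, -0.5 * (x - 10) ** 2)

def mh_chain(x0, n=5_000, step=1.0):
    """Random-walk Metropolis; `step` is far too small to cross modes."""
    x, out = x0, np.empty(n)
    for t in range(n):
        prop = x + step * rng.normal()
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        out[t] = x
    return out

starts = [-12.0, -8.0, 8.0, 12.0]      # overdispersed starting points
chains = np.array([mh_chain(s) for s in starts])
print("per-chain means:", chains.mean(axis=1).round(2))
# Chains starting near -10 vs +10 settle on different means, so any
# R-hat computed across them will be far above 1.
```

Each chain's trace looks perfectly healthy on its own; only the disagreement between chains exposes the missing mode.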
Well-mixing chain
For a 2D Gaussian target with moderate correlation, a well-tuned random-walk Metropolis chain with acceptance rate around 0.234 shows rapid mixing. The trace plot looks like white noise, autocorrelations decay to zero by lag 20, and $\hat{R}$ from 4 chains is below 1.01 after a few hundred iterations.
Common Confusions
Burn-in is not a fixed fraction
Some textbooks suggest discarding the first 50% of samples as burn-in. This is wasteful for well-mixing chains and insufficient for slowly mixing ones. Use diagnostics to judge when the chain has converged, then discard the pre-convergence samples. There is no one-size-fits-all burn-in length.
Low R-hat does not mean the chain has converged
$\hat{R} \approx 1$ is necessary but not sufficient for convergence. It is possible for all chains to be stuck in the same local mode, giving $\hat{R} \approx 1$ while the chain has not explored the full target. Always supplement with trace plots, autocorrelation analysis, and domain knowledge.
More samples is not always the answer
If the chain is not mixing, running it longer just gives you more samples from the wrong distribution. Fix the sampler first (better proposal, HMC, reparameterization), then run longer.
Summary
- Burn-in: discard initial samples before the chain reaches stationarity
- There is no universal burn-in formula; use diagnostics
- Gelman-Rubin $\hat{R}$: run multiple chains, compare within- vs between-chain variance; want $\hat{R} < 1.01$
- Effective sample size: $N_{\mathrm{eff}} = N / (1 + 2\sum_{k \ge 1} \rho_k)$; accounts for autocorrelation
- Trace plots, autocorrelation plots, and running means are essential visual diagnostics
- You can never prove convergence; you can only detect non-convergence
Exercises
Problem
You run 4 MCMC chains for 10,000 iterations each and compute the within-chain variance $W$ and the between-chain variance $B$. Compute $\hat{R}$ and interpret the result.
Problem
A chain of 10,000 samples has lag-1 autocorrelation $\rho_1 = \rho$, and the autocorrelations decay geometrically: $\rho_k = \rho^k$. What is the effective sample size as a function of $\rho$?
References
Canonical:
- Gelman & Rubin, "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science (1992)
- Geyer, "Practical Markov Chain Monte Carlo," Statistical Science (1992)
Current:
- Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-Normalization, Folding, and Localization: An Improved R-hat" (2021)
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
- Brooks et al., Handbook of Markov Chain Monte Carlo (2011), Chapters 1-5
Next Topics
The natural next steps from burn-in and convergence diagnostics:
- Variance reduction techniques: getting more from your MCMC samples
- Hamiltonian Monte Carlo: a sampler that mixes much faster
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Metropolis-Hastings Algorithm (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)