

Burn-in and Convergence Diagnostics

Knowing when an MCMC chain has reached stationarity and when to trust its samples. Burn-in, Gelman-Rubin R-hat, effective sample size, trace plots, and autocorrelation.



Why This Matters

MCMC gives you samples from a target distribution, but only eventually. The chain starts from some arbitrary initial state and needs time to "forget" where it started. If you use samples from before the chain has converged, your estimates will be biased toward the initial conditions.

Convergence diagnostics answer the most important practical question in MCMC: are my samples trustworthy? There is no perfect answer, but there are reliable heuristics that catch most problems.

Mental Model

Imagine dropping a ball into a complex bowl. Initially, the ball rolls around chaotically depending on where you dropped it. After enough time, it settles into a pattern that depends only on the shape of the bowl, not on the starting position. Burn-in is the period of chaotic rolling that you throw away. Convergence diagnostics help you judge when the ball has "settled."

The hard truth: you can never prove a chain has converged. You can only detect certain kinds of non-convergence.

Formal Setup and Notation

Let \{X_t\}_{t=0}^{T} be a Markov chain with stationary distribution \pi. We want to estimate \mathbb{E}_\pi[f(X)] using the ergodic average:

\hat{\mu}_T = \frac{1}{T - B} \sum_{t=B+1}^{T} f(X_t)

where B is the burn-in period (number of initial samples discarded).
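The estimator is easy to sketch in code. Here a hypothetical AR(1) process stands in for an MCMC trace: its stationary mean is 0, but the chain is started far away, so the plain average over all samples is biased toward the starting point while the burned-in average is not.

```python
import numpy as np

def ergodic_average(samples, burn_in):
    """mu_hat = (1 / (T - B)) * sum_{t = B+1}^{T} f(X_t), with f the identity."""
    return np.mean(samples[burn_in:])

rng = np.random.default_rng(0)
# Hypothetical AR(1) stand-in for an MCMC trace: stationary mean 0,
# but started far away at x_0 = 500.
x = np.empty(10_000)
x[0] = 500.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

print(np.mean(x))               # biased toward the starting point
print(ergodic_average(x, 500))  # discarding burn-in removes the bias
```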

Definition

Burn-in

The burn-in period is the initial segment of the Markov chain that is discarded before computing estimates. The purpose is to reduce bias from the initial state X_0. After burn-in, the chain should be approximately sampling from the stationary distribution \pi.

There is no universal formula for how long burn-in should be. It depends on how far X_0 is from the typical set of \pi, the mixing rate of the chain, and the geometry of \pi.

Definition

Mixing Time

The mixing time \tau_{\text{mix}}(\epsilon) is the smallest t such that the chain's distribution is within total variation distance \epsilon of \pi, regardless of the starting state:

\tau_{\text{mix}}(\epsilon) = \min\{t : \sup_{X_0} \|P^t(X_0, \cdot) - \pi\|_{\text{TV}} \leq \epsilon\}

This is a property of the chain, not the initial state. A chain with small mixing time forgets its initial state quickly.
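For a chain small enough to enumerate, the mixing time can be computed directly from matrix powers. A sketch with a hypothetical two-state chain (the transition matrix and tolerance are chosen for illustration, not taken from the text):

```python
import numpy as np

# Hypothetical two-state chain; its stationary distribution is pi = (2/3, 1/3)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])

def mixing_time(P, pi, eps=0.01):
    """Smallest t with sup_{x0} TV(P^t(x0, .), pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, 10_000):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst case over x0
        if tv <= eps:
            return t
    return None

print(mixing_time(P, pi))  # this chain forgets its start in about a dozen steps
```

Note that the mixing time shrinks as the tolerance \epsilon loosens, but it never depends on which state the chain starts in: the definition takes the worst case over X_0.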

Definition

Effective Sample Size (ESS)

MCMC samples are autocorrelated, so T samples carry less information than T independent samples. The effective sample size is:

\text{ESS} = \frac{T}{1 + 2\sum_{k=1}^{\infty} \rho_k}

where \rho_k = \mathrm{Corr}(f(X_t), f(X_{t+k})) is the lag-k autocorrelation. If consecutive samples are highly correlated, ESS can be much smaller than T. The standard error of the MCMC estimate is approximately \sigma / \sqrt{\text{ESS}}.
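A minimal ESS estimator truncates the autocorrelation sum at the first non-positive term (a simplified version of Geyer's initial-sequence rule). The AR(1) chain used to exercise it is a hypothetical example with a known answer: for \rho_k = \phi^k, ESS = T(1-\phi)/(1+\phi).

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = T / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    if max_lag is None:
        max_lag = T // 2
    x = x - x.mean()
    var = np.dot(x, x) / T
    acf_sum = 0.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / (T * var)  # lag-k autocorrelation
        if rho_k <= 0:
            break
        acf_sum += rho_k
    return T / (1.0 + 2.0 * acf_sum)

# Hypothetical AR(1) chain with phi = 0.9: theoretical ESS = T/19
rng = np.random.default_rng(0)
phi, T = 0.9, 50_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()
print(effective_sample_size(x))  # roughly T/19, far below T
```

Production samplers use more careful truncation rules (e.g. pairwise initial positive sequences), but the T/(1 + 2\sum \rho_k) structure is the same.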

Core Definitions

Trace plots show f(X_t) vs t. A well-mixing chain looks like white noise: it moves rapidly through the support of \pi with no visible trends or long excursions. A poorly mixing chain shows long plateaus (the chain is stuck) or slow drifts (the chain has not reached stationarity).

Autocorrelation plots show \rho_k vs lag k. Ideally, autocorrelations decay rapidly to zero. If they remain high at large lags, the chain is mixing slowly and you need many more samples (or a better sampler).

Running mean plots show the cumulative mean of f(X_t) vs t. If the chain has converged, the running mean stabilizes. If it drifts, the chain has not mixed.
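The quantities behind these plots are easy to compute without any plotting library. A sketch, using white noise as a stand-in for a well-mixing trace and a random walk as a stand-in for a chain that has not reached stationarity (both hypothetical):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelations rho_1 .. rho_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    var = np.dot(x, x) / T
    return np.array([np.dot(x[:-k], x[k:]) / (T * var)
                     for k in range(1, max_lag + 1)])

def running_mean(x):
    """Cumulative mean of f(X_t) vs t."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(1)
good = rng.normal(size=5_000)             # white-noise-like trace
slow = np.cumsum(rng.normal(size=5_000))  # random walk: has not mixed

print(autocorrelation(good, 5))  # near zero at every lag
print(autocorrelation(slow, 5))  # near one: mixing has failed
```

Feeding these arrays to any plotting routine reproduces the trace, autocorrelation, and running-mean diagnostics described above.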

Main Theorems

Proposition

Gelman-Rubin R-hat Diagnostic

Statement

Run M \geq 2 chains, each of length N (after burn-in), from overdispersed starting points. Let \bar{\theta}_m be the mean of chain m and \bar{\theta} the grand mean. Define:

Between-chain variance: B = \frac{N}{M-1}\sum_{m=1}^{M}(\bar{\theta}_m - \bar{\theta})^2

Within-chain variance: W = \frac{1}{M}\sum_{m=1}^{M} s_m^2, where s_m^2 is the sample variance of chain m

The potential scale reduction factor is:

\hat{R} = \sqrt{\frac{\frac{N-1}{N}W + \frac{1}{N}B}{W}}

If the chains have converged, \hat{R} \approx 1. Values of \hat{R} > 1.01 suggest non-convergence.
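The statement translates directly into code. A minimal sketch (the synthetic "chains" here are hypothetical arrays of draws, not the output of a real sampler):

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """R-hat for an (M, N) array: M chains of length N, post burn-in."""
    chains = np.asarray(chains, dtype=float)
    M, N = chains.shape
    B = N * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (N - 1) / N * W + B / N        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
# Converged case: 4 chains all sampling the same N(0, 1)
converged = rng.normal(size=(4, 2_000))
# Stuck case: chains centered on different modes
stuck = rng.normal(size=(4, 2_000)) + np.array([-3.0, -3.0, 3.0, 3.0])[:, None]

print(gelman_rubin_rhat(converged))  # approximately 1
print(gelman_rubin_rhat(stuck))      # well above 1
```

Note that `chains.mean(axis=1).var(ddof=1)` is exactly \frac{1}{M-1}\sum_m(\bar{\theta}_m - \bar{\theta})^2, so multiplying by N gives B as defined above.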

Intuition

If all chains have converged to \pi, they should have similar means and variances. The between-chain variance B captures disagreement among chains. The within-chain variance W captures variation within each chain. If B \gg W, the chains disagree, suggesting they have not converged. \hat{R} measures the ratio of total estimated variance to within-chain variance.

Proof Sketch

The numerator \frac{N-1}{N}W + \frac{1}{N}B is an overestimate of \mathrm{Var}_\pi(f) (because B includes both true variance and between-chain disagreement), and W is an underestimate (because finite chains have not explored the full distribution). Their ratio exceeds 1 when the chains have not mixed. As N \to \infty, both converge to \mathrm{Var}_\pi(f) and \hat{R} \to 1.

Why It Matters

\hat{R} is the most widely used convergence diagnostic. Every major Bayesian software package (Stan, PyMC, JAGS) reports it. The rule \hat{R} < 1.01 (or the older threshold 1.1) is standard practice. It catches the most common failure mode: chains stuck in different modes of a multimodal distribution.

Failure Mode

\hat{R} \approx 1 does not guarantee convergence. All chains could be stuck in the same mode while missing others. If the starting points are not overdispersed enough, the diagnostic has no power. Also, \hat{R} is defined for scalar quantities; for multivariate targets, you need to check \hat{R} for each marginal and for derived quantities.

Canonical Examples

Example

Bimodal target

Suppose \pi is a mixture of two well-separated Gaussians. A single Metropolis-Hastings chain may get stuck in one mode for the entire run. The trace plot shows a flat line, and the running mean converges to that mode's mean rather than the mixture mean. Running 4 chains from different starting points reveals the problem: chains in different modes have different means, so \hat{R} \gg 1.
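This failure is easy to reproduce. A sketch using a hypothetical mixture of N(-5, 1) and N(5, 1) and a small-step random-walk Metropolis sampler; all the numbers here are illustrative choices, not from the text:

```python
import numpy as np

def log_target(x):
    """Mixture of N(-5, 1) and N(+5, 1): two well-separated modes."""
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

def metropolis(x0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis with Gaussian proposals of scale `step`."""
    if rng is None:
        rng = np.random.default_rng()
    xs = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop  # accept
        xs[t] = x
    return xs

rng = np.random.default_rng(3)
# Four chains from overdispersed starting points: two per mode
chains = np.stack([metropolis(x0, 5_000, rng=rng)
                   for x0 in (-5.0, -5.0, 5.0, 5.0)])

# Gelman-Rubin R-hat from the chain means and variances
M, N = chains.shape
W = chains.var(axis=1, ddof=1).mean()
B = N * chains.mean(axis=1).var(ddof=1)
rhat = np.sqrt(((N - 1) / N * W + B / N) / W)
print(chains.mean(axis=1))  # near -5 or +5, never the mixture mean 0
print(rhat)                 # far above 1: the chains disagree
```

With proposal scale 0.5 the chains essentially never cross the low-density barrier near 0, so each one looks perfectly converged in isolation; only the multi-chain comparison exposes the problem.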

Example

Well-mixing chain

For a 2D Gaussian target with moderate correlation, a well-tuned random-walk Metropolis chain with acceptance rate around 0.234 mixes rapidly. The trace plot looks like white noise, autocorrelations decay to zero by lag 20, and \hat{R} from 4 chains falls below 1.01 after a few hundred iterations.

Common Confusions

Watch Out

Burn-in is not a fixed fraction

Some textbooks suggest discarding the first 50% of samples as burn-in. This is wasteful for well-mixing chains and insufficient for slowly-mixing chains. Use diagnostics to judge when the chain has converged, then discard the pre-convergence samples. There is no one-size-fits-all burn-in length.

Watch Out

Low R-hat does not mean the chain has converged

\hat{R} < 1.01 is necessary but not sufficient for convergence. It is possible for all chains to be stuck in the same local mode, giving \hat{R} \approx 1 while the chain has not explored the full target. Always supplement \hat{R} with trace plots, autocorrelation analysis, and domain knowledge.

Watch Out

More samples is not always the answer

If the chain is not mixing, running it longer just gives you more samples from the wrong distribution. Fix the sampler first (better proposal, HMC, reparameterization), then run longer.

Summary

  • Burn-in: discard initial samples before the chain reaches stationarity
  • There is no universal burn-in formula; use diagnostics
  • Gelman-Rubin \hat{R}: run multiple chains, compare within- vs between-chain variance; want \hat{R} < 1.01
  • Effective sample size: \text{ESS} = T / (1 + 2\sum_k \rho_k); accounts for autocorrelation
  • Trace plots, autocorrelation plots, and running means are essential visual diagnostics
  • You can never prove convergence; you can only detect non-convergence

Exercises

ExerciseCore

Problem

You run 4 MCMC chains for 10,000 iterations each. The within-chain variance is W = 2.0 and the between-chain variance is B = 20.0. Compute \hat{R} and interpret.

ExerciseAdvanced

Problem

A chain of 10,000 samples has lag-1 autocorrelation \rho_1 = 0.95, and autocorrelations decay geometrically: \rho_k = 0.95^k. What is the effective sample size?

References

Canonical:

  • Gelman & Rubin, "Inference from Iterative Simulation Using Multiple Sequences," Statistical Science (1992)
  • Geyer, "Practical Markov Chain Monte Carlo," Statistical Science (1992)

Current:

  • Vehtari, Gelman, Simpson, Carpenter & Bürkner, "Rank-Normalization, Folding, and Localization: An Improved \hat{R} for Assessing Convergence of MCMC," Bayesian Analysis (2021)

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5


Last reviewed: April 2026
