Markov Chain Monte Carlo
When you can evaluate a target up to a normalizing constant but cannot sample from it, build a Markov chain whose stationary distribution is the target. Detailed balance, ergodicity, mixing time, autocorrelation, and the variance penalty for non-iid samples.
Prerequisites
Why This Matters
Plain Monte Carlo requires that you can already sample from the target $\pi$. The typical Bayesian situation is the opposite: you know $\tilde\pi(x) \propto \pi(x)$ as a function of $x$, but the normalizing constant $Z = \int \tilde\pi(x)\,dx$ is exactly the intractable thing. Direct sampling is impossible.
MCMC is the workaround that breaks the impasse. Instead of producing iid samples, build a Markov chain whose stationary distribution is $\pi$. Run the chain long enough and the marginal distribution of $X_t$ approaches $\pi$. Once it has, time-averages along the trajectory estimate $\mathbb{E}_\pi[f]$ correctly. The crucial trick is that you only ever need $\pi$ up to a multiplicative constant, because constructions like Metropolis-Hastings ratios cancel the constant out.
That single idea unlocks Bayesian neural networks, Latent Dirichlet Allocation, posterior inference in probabilistic graphical models, lattice QCD, and most of statistical physics. The price you pay relative to iid Monte Carlo is correlation: successive samples are not independent, so the variance of the sample mean carries an autocorrelation factor, and chains need a burn-in period before their marginal is close to $\pi$.
Mental Model
Picture a walker on a landscape whose height is the unnormalized log density. From each point the walker takes a randomized step that biases toward higher density but occasionally moves downhill. Run the walker long enough and it visits each region in proportion to the true density. The recipe for designing those random steps is exactly what Metropolis-Hastings, Gibbs, and Hamiltonian Monte Carlo provide.
Three things have to be true for the walker's trajectory to be useful:
- The chain must have $\pi$ as a stationary distribution. This is the correctness requirement; detailed balance is the standard way to guarantee it.
- The chain must be irreducible and aperiodic, so the marginal at time $t$ converges to $\pi$ regardless of starting point. This is the ergodicity requirement.
- The convergence and mixing must be fast enough that the chain explores the target in a reasonable wall-clock budget. This is the practicality requirement, controlled by the spectral gap, geometric ergodicity, or stronger conditions like log-Sobolev inequalities.
The correctness requirement is usually easy to satisfy; the ergodicity requirement is usually free in continuous state spaces with positive density; the mixing requirement is where every honest discussion of MCMC ends up.
Formal Setup and Notation
Let $\pi$ be the target distribution on a measurable state space $\mathcal{X}$. We assume we can evaluate $\tilde\pi \propto \pi$ pointwise, but we cannot sample from $\pi$ directly. A Markov chain on $\mathcal{X}$ is specified by a transition kernel $P(x, A) = \Pr(X_{t+1} \in A \mid X_t = x)$.
Stationary Distribution
A distribution $\pi$ on $\mathcal{X}$ is stationary for a Markov chain with kernel $P$ if it is left invariant by $P$:

$$\pi(A) = \int_{\mathcal{X}} \pi(dx)\, P(x, A) \quad \text{for all measurable } A.$$

Equivalently, if $X_t \sim \pi$ and $X_{t+1} \sim P(X_t, \cdot)$, then $X_{t+1} \sim \pi$. The MCMC design problem is to construct $P$ so that the target $\pi$ is stationary.
Detailed Balance (Reversibility)
A kernel $P$ satisfies detailed balance with respect to $\pi$ if for every pair of measurable sets $A, B$:

$$\int_A \pi(dx)\, P(x, B) = \int_B \pi(dy)\, P(y, A).$$

In density form (assuming $\pi$ and $P$ have densities), $\pi(x)\, p(x, y) = \pi(y)\, p(y, x)$ for all $x, y$. A chain satisfying detailed balance is reversible: the time-reversed process has the same law.
Reversibility implies stationarity: integrate the detailed-balance equation over $x$ to get $\int_{\mathcal{X}} \pi(dx)\, P(x, A) = \pi(A)$. The converse is not true. Stationarity does not require reversibility, and important MCMC kernels (systematic-scan Gibbs, lifted-walk samplers, non-reversible HMC variants) are stationary without being reversible.
Irreducibility, Aperiodicity, Ergodicity
A chain is $\pi$-irreducible if for every state $x$ and every set $A$ with $\pi(A) > 0$, there exists $t$ with $P^t(x, A) > 0$. The chain can reach any positive-probability region from any starting point in finitely many steps.
A chain is aperiodic if the gcd of return times to any state (or $\pi$-positive set) is 1. Periodicity would force the marginal to oscillate rather than converge.
Together with the existence of a stationary distribution, these conditions make the chain ergodic: $\|P^t(x, \cdot) - \pi\|_{\mathrm{TV}} \to 0$ as $t \to \infty$ for every starting state $x$. On general state spaces the proper notion is Harris recurrence, which guarantees the marginal convergence under irreducibility plus aperiodicity.
Main Theorems
Detailed Balance Implies Stationarity
Statement
If a Markov chain with transition kernel $P$ satisfies detailed balance with respect to $\pi$, then $\pi$ is a stationary distribution for $P$. That is, $\pi P = \pi$.
Intuition
Detailed balance says forward and reverse flows match locally between every pair of states. The total inflow into any region equals the total outflow, so the mass distribution does not change. That is exactly what "stationary" means.
The converse is not true. Stationarity is a global balance condition; you can have local imbalances that cancel in aggregate. Non-reversible chains exploit exactly this slack to mix faster than reversible ones.
Proof Sketch
Integrate detailed balance with respect to $x$. For any measurable $A$:

$$(\pi P)(A) = \int_{\mathcal{X}} \pi(dx)\, P(x, A) = \int_A \pi(dy)\, P(y, \mathcal{X}) = \int_A \pi(dy) = \pi(A).$$

The first equality unpacks $\pi P$; the second uses detailed balance with the pair of sets $(\mathcal{X}, A)$; the third uses that $P(y, \cdot)$ is a probability measure, so $P(y, \mathcal{X}) = 1$.
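The argument is easy to check numerically. The sketch below builds a small Metropolis kernel that satisfies detailed balance by construction and verifies that stationarity follows; the 3-state target and uniform proposal are illustrative choices, not anything from the text.

```python
import numpy as np

# Hypothetical 3-state target and a symmetric (uniform) proposal.
pi = np.array([0.2, 0.3, 0.5])
Q = np.full((3, 3), 1.0 / 3.0)

# Metropolis kernel: accept a proposed move i -> j with prob min(1, pi_j/pi_i).
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = Q[i, j] * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()   # rejected proposals stay put

# Detailed balance: pi_i P_ij == pi_j P_ji for every pair (i, j).
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)

# Stationarity follows: pi P == pi.
assert np.allclose(pi @ P, pi)
```

Summing the pairwise flow identity over $i$ is exactly the integration step in the proof sketch, which is why the second assertion can never fail once the first passes.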
Why It Matters
This is the reason every textbook MCMC proof starts with verifying detailed balance. It is the cleanest sufficient condition for the chain to preserve . Metropolis-Hastings is engineered to satisfy detailed balance by construction. The acceptance ratio is exactly the correction that turns an arbitrary proposal kernel into a -reversible kernel.
Failure Mode
Failing detailed balance does not invalidate a sampler; many practical samplers are stationary but not reversible. But verifying stationarity without detailed balance is much harder, requiring a direct check of $\pi P = \pi$. For non-reversible designs, the standard tool is to write the kernel as a composition of reversible pieces and use that each piece preserves $\pi$.
MCMC Ergodic Theorem
Statement
Let $(X_t)_{t \ge 0}$ be a Markov chain with transition kernel $P$, and assume $P$ is $\pi$-irreducible, aperiodic, and has stationary distribution $\pi$. Then for any $f$ with $\mathbb{E}_\pi|f| < \infty$ and any starting state $x$:

$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \xrightarrow{\text{a.s.}} \mathbb{E}_\pi[f] \quad \text{as } T \to \infty.$$

Furthermore, the marginal converges to $\pi$ in total variation: $\|P^t(x, \cdot) - \pi\|_{\mathrm{TV}} \to 0$.
Intuition
This is the MCMC analog of the law of large numbers. For iid samples, sample averages converge to expectations. For ergodic Markov chains, time-averages along a trajectory converge to the -expectation. The chain explores the state space in proportion to , and the long-run frequency of visits to any region converges to its -mass.
The catch: this is an asymptotic statement. For finite $T$, the chain has not converged, the early samples are biased toward the starting state, and the variance of the time-average carries autocorrelation.
Proof Sketch
Standard ergodic theory for Harris-recurrent chains. The hard work is the zero-one law for tail events and the construction of regeneration times (or coupling) that show the marginal converges to $\pi$. Once that is in place, the sample-average convergence follows from the standard ergodic theorem applied to the shift-invariant process under its stationary law.
See Meyn & Tweedie (2009), Chapter 17 for the discrete-time treatment on general state spaces, and Roberts & Rosenthal (2004) for the MCMC-tailored version.
Why It Matters
This is what justifies running an MCMC chain and reporting time-averages as estimates of -expectations. It is the legal foundation for every posterior summary in Stan, PyMC, BUGS, JAGS, NUTS, and friends.
For finite-sample variance, one wants a Markov-chain CLT:

$$\sqrt{T}\left(\bar f_T - \mathbb{E}_\pi[f]\right) \Rightarrow \mathcal{N}\!\left(0, \sigma_f^2 \tau_f\right),$$

where $\sigma_f^2 = \operatorname{Var}_\pi(f)$ and $\tau_f$ is the integrated autocorrelation time. This requires geometric ergodicity (or an $L^2$ spectral gap) plus a moment condition, not just irreducibility plus aperiodicity. Without geometric ergodicity, the Markov-chain CLT can fail and confidence intervals based on the iid formula are unreliable.
Failure Mode
Reducible chains can be stuck in one mode and never see the rest of the state space; periodic chains oscillate without converging. On continuous state spaces with disjoint positive-density components and a local-proposal kernel, the chain can be reducible without obvious warning, and the time-average converges to the wrong answer. Multimodal posteriors in Bayesian models are the canonical practical risk: a chain trapped in one mode is consistent with all standard convergence diagnostics yet still wrong.
The Big Three MCMC Constructions
The standard MCMC algorithms differ in how they propose moves and how they correct for proposal-target mismatch. All three guarantee detailed balance (or $\pi$-invariance) by construction.
Metropolis-Hastings
Propose $x'$ from any conditional distribution $q(x' \mid x)$, then accept with probability

$$\alpha(x, x') = \min\left\{1, \frac{\tilde\pi(x')\, q(x \mid x')}{\tilde\pi(x)\, q(x' \mid x)}\right\}.$$

The key property: $\tilde\pi$ enters only as a ratio $\tilde\pi(x') / \tilde\pi(x)$, so the unknown normalizing constant cancels. This is the reason MCMC works on unnormalized targets. See Metropolis-Hastings for the full derivation, including detailed-balance verification and the random-walk and independence proposal special cases.
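As a minimal sketch of the cancellation, here is a random-walk Metropolis sampler that only ever evaluates an unnormalized log density; the target, step size, and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log density: standard normal up to an additive constant.
    return -0.5 * x**2

def rw_metropolis(log_target, x0, n_steps, step=1.0):
    """Random-walk Metropolis. Only log-density *differences* are used,
    so the normalizing constant never appears anywhere."""
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        # log acceptance ratio: log pi~(x') - log pi~(x); q is symmetric.
        if np.log(rng.random()) < log_target(x_prop) - log_target(x):
            x = x_prop
        samples[t] = x        # on rejection, the current state repeats
    return samples

chain = rw_metropolis(log_target, x0=0.0, n_steps=50_000, step=2.4)
post = chain[5_000:]          # crude burn-in discard
print(post.mean(), post.std())  # near 0 and 1 for the standard normal
```

Note that rejected proposals still append the current state: dropping them would bias the time-average.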
Gibbs Sampling
Update each variable from its full conditional in turn. Equivalent to MH with proposal equal to the full conditional, giving acceptance probability 1. Excellent when conditionals have closed form (conjugate priors, hierarchical models with Gaussian random effects); useless when they do not. See Gibbs Sampling for closed-form derivations of the bivariate Gaussian, normal-normal, and beta-binomial cases, and the canonical "100% acceptance does not mean fast mixing" example.
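A sketch of the bivariate-Gaussian case, using the exact full conditionals $x \mid y \sim \mathcal{N}(\rho y,\, 1 - \rho^2)$ (and symmetrically for $y$); the value $\rho = 0.9$ and the run length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.9   # correlation of the bivariate Gaussian target

def gibbs_bivariate_gaussian(n_steps):
    """Gibbs sampler for N(0, [[1, rho], [rho, 1]]): alternate exact
    draws from the two full conditionals, accepting every draw."""
    x, y = 0.0, 0.0
    out = np.empty((n_steps, 2))
    s = np.sqrt(1.0 - rho**2)
    for t in range(n_steps):
        x = rho * y + s * rng.standard_normal()   # draw x | y
        y = rho * x + s * rng.standard_normal()   # draw y | x
        out[t] = x, y
    return out

draws = gibbs_bivariate_gaussian(100_000)
corr = np.corrcoef(draws[1_000:].T)[0, 1]
print(corr)   # close to rho = 0.9
```

Every draw is accepted, yet successive states move in axis-aligned steps, so for $\rho$ near 1 the chain creeps along the narrow diagonal ridge: 100% acceptance with slow mixing.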
Hamiltonian Monte Carlo
Augment the state with a momentum variable, simulate Hamiltonian dynamics on the joint distribution, and Metropolis-correct for discretization error. The dynamics use the gradient to make distant proposals that still land in high-density regions, achieving $O(d^{1/4})$ scaling versus the $O(d)$ of random-walk Metropolis. See Hamiltonian Monte Carlo for the leapfrog integrator, the symplectic-geometry argument for volume preservation, and the No-U-Turn Sampler (NUTS) that auto-tunes trajectory length.
| Method | Cost per step | Tuning needed | Handles correlation | Multimodal targets |
|---|---|---|---|---|
| Random-walk MH | 1 density eval | proposal scale | Poor | Slow |
| Independence MH | 1 density eval | proposal $q$ | Depends on $q$ | If $q$ covers modes |
| Gibbs (full conditionals) | $d$ conditional draws | none | Poor without blocking | Slow |
| HMC / NUTS | $L$ gradient evals | step size, path length (auto in NUTS) | Good | Slow |
| Langevin MCMC (MALA) | 1 gradient eval | step size | Moderate | Slow |
Burn-In, Mixing Time, and Effective Sample Size
The asymptotic ergodic theorem hides three practical problems that every MCMC user faces.
Burn-In Period
The burn-in period is the initial portion of the chain that is discarded because the marginal has not yet converged to $\pi$. There is no universal "correct" burn-in length: it depends on the starting point, the chain's mixing time, and the geometry of $\pi$.
In practice, run multiple chains from overdispersed starting points and discard a number of samples chosen from convergence diagnostics (split-$\hat R$, trace plots, energy plots for HMC). For NUTS, modern Stan and PyMC default to discarding a per-run warmup segment that also adapts the step size and mass matrix; warmup therefore does more than burn-in alone.
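For concreteness, here is a sketch of the basic split-$\hat R$ computation in its simple variance-ratio form (Vehtari et al. 2021 add rank-normalization and folding on top); the chain values are synthetic.

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat: split each chain in half, then compare between-half
    and within-half variances. Values well above 1 signal non-convergence."""
    chains = np.asarray(chains)
    m, n = chains.shape                       # m chains, n draws each
    half = n // 2
    halves = chains[:, : 2 * half].reshape(2 * m, half)
    means = halves.mean(axis=1)
    W = halves.var(axis=1, ddof=1).mean()     # within-half variance
    B = half * means.var(ddof=1)              # between-half variance
    var_plus = (half - 1) / half * W + B / half
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
mixed = rng.standard_normal((4, 1000))        # 4 well-mixed chains
stuck = mixed + np.arange(4)[:, None]         # 4 chains stuck at shifted modes
print(split_rhat(mixed))   # close to 1.0
print(split_rhat(stuck))   # well above 1.1: flagged as not converged
```

Splitting each chain is what lets the statistic catch a single chain that drifts between its first and second halves, which the unsplit version misses.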
Mixing Time
For a chain on a finite (or compact) state space, the mixing time is

$$t_{\mathrm{mix}}(\varepsilon) = \min\left\{ t : \max_x \left\| P^t(x, \cdot) - \pi \right\|_{\mathrm{TV}} \le \varepsilon \right\}.$$

For reversible chains, $t_{\mathrm{mix}}(\varepsilon) \le \frac{1}{\gamma} \log \frac{1}{\varepsilon\, \pi_{\min}}$, where $\gamma = 1 - |\lambda_2|$ is the spectral gap (1 minus the second-largest eigenvalue in absolute value of $P$). Small spectral gap means slow mixing.
For continuous state spaces, the analog is the $L^2$ spectral gap or a logarithmic Sobolev inequality. Geometric ergodicity is the qualitative version: the marginal converges to $\pi$ at an exponential rate, with rate constant determined by the spectral gap.
Autocorrelation and Effective Sample Size
For a stationary chain $(X_t)$, the autocorrelation of $f$ at lag $k$ is $\rho_k = \operatorname{Corr}(f(X_t), f(X_{t+k}))$. The integrated autocorrelation time is

$$\tau_f = 1 + 2 \sum_{k=1}^{\infty} \rho_k.$$

For an iid sample, $\rho_k = 0$ for $k \ge 1$ and $\tau_f = 1$. For an autocorrelated chain, $\tau_f > 1$, often much larger.
The Markov-chain CLT gives $\operatorname{Var}(\bar f_T) \approx \tau_f \sigma_f^2 / T$. Compared to iid sampling with the same $T$, you need $\tau_f$ times more samples. The effective sample size is the iid-equivalent count $\mathrm{ESS} = T / \tau_f$: a chain of length $T$ carries information equivalent to $T / \tau_f$ iid draws. Always report $\mathrm{ESS}$, not $T$.
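One way to see these quantities concretely is to estimate $\tau$ on a chain whose autocorrelation is known exactly, such as an AR(1) process with $\rho_k = \phi^k$ and $\tau = (1+\phi)/(1-\phi)$. The truncate-at-first-negative rule below is a simple heuristic, not the windowed estimators production libraries use.

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=None):
    """Estimate tau = 1 + 2 * sum_k rho_k, truncating the sum at the
    first negative empirical autocorrelation (a crude but common rule)."""
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    if max_lag is None:
        max_lag = n // 3
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / (n * var)
        if rho_k < 0:          # beyond here, noise dominates the estimate
            break
        tau += 2.0 * rho_k
    return tau

# AR(1) chain: true tau = (1 + phi) / (1 - phi) = 19 for phi = 0.9.
rng = np.random.default_rng(2)
phi = 0.9
x = np.empty(200_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.standard_normal()

tau = integrated_autocorr_time(x)
ess = len(x) / tau
print(f"tau ~ {tau:.1f}, ESS ~ {ess:.0f}")   # tau near 19
```

The 200,000 correlated draws are worth roughly 200,000 / 19 ≈ 10,500 iid samples, which is the number that belongs in any error bar.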
The interaction of these three concepts is the practical subject of Burn-In and Convergence Diagnostics.
ML Applications: Where MCMC Earns Its Keep
Bayesian neural networks. Posterior over weights given training data is intractable; HMC and stochastic-gradient MCMC variants are the gold-standard tool. Practical scale is limited (hundreds of thousands of parameters, not billions), so variational inference often dominates at LLM scale, but BNN-MCMC remains the right call when calibrated uncertainty matters more than throughput.
Latent Dirichlet Allocation (LDA). The classic topic model has a collapsed-Gibbs sampler that integrates out topic-word distributions analytically, leaving only document-topic assignments. The closed-form conditionals make Gibbs sampling extremely effective.
Probabilistic graphical models. Once you have a directed or undirected graphical model with conjugate-or-near-conjugate structure, Gibbs and Metropolis-Hastings are the workhorses. PyMC, Stan, and Edward are built around this paradigm.
Statistical physics and lattice QCD. The original setting of HMC ("Hybrid Monte Carlo") in 1987 was lattice gauge theory, where the target is a high-dimensional, highly correlated, gradient-known density. The techniques transferred to ML almost intact.
Common Confusions
MCMC samples are not independent
Successive states in a Markov chain are correlated by construction. The ergodic theorem guarantees correct asymptotic time-averages, but the variance of those averages is inflated by $\tau_f$. Always report effective sample size, never raw chain length, when summarizing MCMC results. A chain of 10000 samples with $\tau = 100$ is worth 100 iid samples, not 10000.
Convergence diagnostics are necessary, not sufficient
$\hat R$, ESS, and trace plots can certify a chain is not converged but cannot certify that it is. A chain trapped in one mode of a multimodal posterior will pass every convergence diagnostic and still produce arbitrarily wrong answers. The only protections against this are running multiple chains from overdispersed starts, knowing the model well enough to anticipate multimodality, and using techniques designed for multimodal targets (parallel tempering, simulated annealing variants).
Detailed balance is sufficient, not necessary
Some important MCMC kernels are stationary without satisfying detailed balance. Systematic-scan Gibbs is the most common example: it is $\pi$-invariant as a composition of reversible component-wise updates, but the composition itself is not reversible. Non-reversible MCMC ("lifted" walks, generalized HMC) can mix strictly faster than reversible designs by exploiting the slack between stationarity and reversibility.
The normalizing constant cancels, but not always
Standard MH only needs $\pi$ up to a constant, because the acceptance ratio is a ratio of $\tilde\pi$ values. Methods that estimate the normalizing constant itself (marginal likelihood for Bayes factors, evidence in model selection) do require dealing with it: thermodynamic integration, bridge sampling, and stepping-stone sampling are the classical tools. They are not free.
Burn-in is not a fixed fraction
"Discard the first half" is folklore, not theory. The right burn-in length depends on the starting state's distance from typical-set regions of and on the chain's mixing time. For HMC with NUTS, the warmup phase is a single tuning interval (default 1000 iterations in Stan) where step size and mass matrix adapt; samples from warmup are not MCMC samples and must be discarded entirely, regardless of "convergence."
Canonical Examples
Two-state chain
State space $\{0, 1\}$ with transition matrix

$$P = \begin{pmatrix} 1 - a & a \\ b & 1 - b \end{pmatrix}, \qquad a, b \in (0, 1).$$

The stationary distribution is $\pi = \left( \tfrac{b}{a+b}, \tfrac{a}{a+b} \right)$. Detailed balance holds: $\pi_0 P_{01} = \pi_1 P_{10}$ becomes $\tfrac{b}{a+b} \cdot a = \tfrac{a}{a+b} \cdot b$.
Mixing: the second eigenvalue of $P$ is $\lambda_2 = 1 - a - b$, so the distance to stationarity decays like $|1 - a - b|^t$. The spectral gap is $\gamma = 1 - |1 - a - b|$; small $\gamma$ means slow mixing. This is the simplest setting in which to see all of stationarity, ergodicity, and mixing time concretely.
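All of these claims can be checked in a few lines; the values $a = 0.1$, $b = 0.3$ below are illustrative.

```python
import numpy as np

a, b = 0.1, 0.3                      # illustrative transition probabilities
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Stationary distribution (b/(a+b), a/(a+b)) and detailed balance.
pi = np.array([b, a]) / (a + b)
assert np.allclose(pi @ P, pi)                        # stationarity
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance

# The eigenvalues of P are 1 and 1 - a - b; the latter sets the mixing rate.
eigvals = np.sort(np.linalg.eigvals(P).real)
assert np.isclose(eigvals[0], 1 - a - b)
print("spectral gap:", 1 - abs(1 - a - b))   # equals a + b when a + b < 1
```

Shrinking $a$ and $b$ toward 0 shrinks the gap and makes the chain linger in whichever state it starts in, which is the two-state picture of slow mixing.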
Random walk MH on a Gaussian target
Target $\pi = \mathcal{N}(0, I_d)$ with proposal $x' = x + \sigma \xi$, $\xi \sim \mathcal{N}(0, I_d)$. The symmetric proposal cancels in the acceptance ratio; only the density ratio $\pi(x')/\pi(x)$ matters. Roberts-Gelman-Gilks (1997) shows the optimal step size scales as $\sigma \propto d^{-1/2}$, giving optimal acceptance rate $\approx 0.234$ in the diffusion limit. Random-walk MH on an iid Gaussian target needs $O(d)$ steps per effective sample, which is the canonical example of dimension-induced slow mixing.
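A quick empirical check of the 0.234 rule; the dimensions, step counts, and the common $\sigma = 2.38/\sqrt{d}$ choice are illustrative, and finite-$d$ rates only approach the diffusion-limit value.

```python
import numpy as np

rng = np.random.default_rng(3)

def rwm_acceptance(d, n_steps=20_000):
    """Random-walk Metropolis on N(0, I_d) with step size 2.38/sqrt(d);
    returns the empirical acceptance rate."""
    x = rng.standard_normal(d)              # start in the typical set
    step = 2.38 / np.sqrt(d)
    accepts = 0
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal(d)
        # log pi(x') - log pi(x) for the standard-normal target
        log_ratio = 0.5 * (x @ x - x_prop @ x_prop)
        if np.log(rng.random()) < log_ratio:
            x, accepts = x_prop, accepts + 1
    return accepts / n_steps

rates = {d: rwm_acceptance(d) for d in (10, 100)}
print(rates)   # acceptance rates in the vicinity of 0.234
```

Holding $\sigma$ fixed instead of scaling it as $d^{-1/2}$ drives the acceptance rate to zero as $d$ grows, which is one way to see the dimension penalty directly.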
Exercises
Problem
Consider a Metropolis-Hastings chain on $\mathbb{R}$ with target $\tilde\pi(x) = e^{-x^2/2}$ (unnormalized standard normal) and symmetric random-walk proposal $q(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2)$. Verify explicitly that this kernel satisfies detailed balance with respect to $\pi$.
Problem
A chain has integrated autocorrelation time $\tau$ and you draw $T$ samples. What is the effective sample size? Roughly how many iid samples would you need to match the same standard error for estimating a posterior mean?
Problem
Show that for a reversible chain with spectral gap $\gamma = 1 - \lambda_2$, the integrated autocorrelation time of the eigenfunction associated with the second-largest eigenvalue $\lambda_2$ is $\tau = \frac{1 + \lambda_2}{1 - \lambda_2} = \frac{2}{\gamma} - 1$. Conclude that small spectral gap implies large autocorrelation.
References
Canonical:
- Robert & Casella, Monte Carlo Statistical Methods (2nd ed., 2004), Chapters 6-12. The standard graduate-level reference, with detailed-balance derivations, geometric ergodicity, and CLTs.
- Brooks, Gelman, Jones, Meng (eds.), Handbook of Markov Chain Monte Carlo (2011). Survey volume covering MH, Gibbs, HMC, slice sampling, parallel tempering, sequential Monte Carlo. Chapter 1 (Geyer, "Introduction") and Chapter 5 (Neal, "MCMC Using Hamiltonian Dynamics") are the must-read entries.
- Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapters 11-12.
Theory:
- Meyn & Tweedie, Markov Chains and Stochastic Stability (2nd ed., 2009), Chapters 13-17. Drift conditions, geometric ergodicity, and the general-state-space ergodic theorem.
- Roberts & Rosenthal, "General state space Markov chains and MCMC algorithms," Probability Surveys 1 (2004), 20-71. Concise survey of ergodicity conditions tailored to MCMC.
- Levin, Peres, Wilmer, Markov Chains and Mixing Times (2nd ed., 2017). Spectral-gap and coupling techniques, mostly for finite chains but with continuous-state extensions.
Original papers:
- Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics 21 (1953), 1087-1092. The original Metropolis algorithm.
- Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika 57 (1970), 97-109. Generalization to non-symmetric proposals.
- Geman & Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE TPAMI 6 (1984), 721-741. Origin of the modern Gibbs sampler.
Practical:
- Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-normalization, folding, and localization: An improved $\hat R$ for assessing convergence of MCMC" (2021, arXiv:1903.08008). Modern definition used by Stan and PyMC.
Next Topics
- Metropolis-Hastings: the foundational MCMC construction. Proposal distributions, acceptance ratios, and detailed-balance verification in full.
- Gibbs Sampling: full-conditional updates, special case of MH with acceptance 1, conjugate models, and blocking.
- Hamiltonian Monte Carlo: gradient-based MCMC with $O(d^{1/4})$ scaling. Leapfrog, NUTS, and divergences.
- Burn-In and Convergence Diagnostics: $\hat R$, ESS, trace plots, and the practical workflow for trusting an MCMC run.
- Langevin Dynamics: continuous-time analog of MCMC; the link between SGD-with-noise and posterior sampling.
Last reviewed: May 6, 2026
Canonical graph
These links come from prerequisite edges in the curriculum graph.
Required prerequisites
- Monte Carlo Methods (layer 2 · tier 1)
- Markov Chains and Steady State (layer 1 · tier 2)
Derived topics
- Gibbs Sampling (layer 2 · tier 1)
- Metropolis-Hastings Algorithm (layer 2 · tier 1)
- Hamiltonian Monte Carlo (layer 3 · tier 2)