

Markov Chain Monte Carlo

When you can evaluate a target up to a normalizing constant but cannot sample from it, build a Markov chain whose stationary distribution is the target. Detailed balance, ergodicity, mixing time, autocorrelation, and the variance penalty for non-iid samples.


Why This Matters

Plain Monte Carlo requires that you can already sample from the target $\pi$. The typical Bayesian situation is the opposite: you know $\pi(\theta) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$ as a function of $\theta$, but the normalizing constant $\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ is exactly the intractable thing. Direct sampling is impossible.

MCMC is the workaround that breaks the impasse. Instead of producing iid samples, build a Markov chain $X_0, X_1, X_2, \ldots$ whose stationary distribution is $\pi$. Run the chain long enough and the marginal distribution of $X_t$ approaches $\pi$. Once it has, time-averages along the trajectory estimate $\mathbb{E}_\pi[f(X)]$ correctly. The crucial trick is that you only ever need $\pi$ up to a multiplicative constant, because constructions like the Metropolis-Hastings ratio cancel the constant out.

That single idea unlocks Bayesian neural networks, Latent Dirichlet Allocation, posterior inference in probabilistic graphical models, lattice QCD, and most of statistical physics. The price you pay relative to iid Monte Carlo is correlation: successive samples are not independent, so the variance of the sample mean carries an autocorrelation factor, and chains need a burn-in period before their marginal is close to $\pi$.

Mental Model

Picture a walker on a landscape whose height is the unnormalized log density. From each point the walker takes a randomized step that biases toward higher density but occasionally moves downhill. Run the walker long enough and it visits each region in proportion to the true density. The recipe for designing those random steps is exactly what Metropolis-Hastings, Gibbs, and Hamiltonian Monte Carlo provide.

Three things have to be true for the walker's trajectory to be useful:

  1. The chain must have $\pi$ as a stationary distribution. This is the correctness requirement; detailed balance is the standard way to guarantee it.
  2. The chain must be irreducible and aperiodic, so the marginal at time $t$ converges to $\pi$ regardless of starting point. This is the ergodicity requirement.
  3. The convergence and mixing must be fast enough that the chain explores the target in a reasonable wall-clock budget. This is the practicality requirement, controlled by the spectral gap, geometric ergodicity, or stronger conditions like log-Sobolev inequalities.

The correctness requirement is usually free; the ergodicity requirement is usually free in continuous state spaces with positive density; the mixing requirement is where every honest discussion of MCMC ends up.

Formal Setup and Notation

Let $\pi$ be the target distribution on a measurable state space $\mathcal{X}$. We assume we can evaluate $\tilde\pi(x) \propto \pi(x)$ pointwise, but we cannot sample from $\pi$ directly. A Markov chain on $\mathcal{X}$ is specified by a transition kernel $P(x, dy)$.

Definition

Stationary Distribution

A distribution $\pi$ on $\mathcal{X}$ is stationary for a Markov chain with kernel $P$ if it is left invariant by $P$:

$$\pi(A) = \int_\mathcal{X} P(x, A)\, \pi(dx) \quad \text{for every measurable } A.$$

Equivalently, if $X_0 \sim \pi$ and $X_1 \mid X_0 \sim P(X_0, \cdot)$, then $X_1 \sim \pi$. The MCMC design problem is to construct $P$ so that the target $\pi$ is stationary.
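For a finite state space, invariance is just a left-eigenvector equation, which is easy to check numerically. A minimal sketch (the $3 \times 3$ matrix is an arbitrary illustration, not taken from the text):

```python
import numpy as np

# Arbitrary 3-state transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# A stationary distribution is a left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

# Invariance: one step of the chain leaves pi unchanged.
assert np.allclose(pi @ P, pi)
assert np.all(pi > 0)
```

Because all entries of this $P$ are positive, the chain is irreducible and the stationary distribution is unique and strictly positive.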

Definition

Detailed Balance (Reversibility)

A kernel $P$ satisfies detailed balance with respect to $\pi$ if for every measurable $A, B$:

$$\int_A \int_B P(x, dy)\, \pi(dx) = \int_B \int_A P(y, dx)\, \pi(dy).$$

In density form (assuming $\pi$ and $P(x, \cdot)$ have densities), $\pi(x)\, p(x, y) = \pi(y)\, p(y, x)$ for all $x, y$. A chain satisfying detailed balance is reversible: the time-reversed process has the same law.

Reversibility implies stationarity: integrate the detailed-balance equation in $x$ to get $\pi P = \pi$. The converse is not true. Stationarity does not require reversibility, and important MCMC kernels (systematic-scan Gibbs, lifted-walk samplers, non-reversible HMC variants) are stationary without being reversible.
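The density-form condition is easy to verify numerically for a finite-state Metropolis kernel, where $\pi_x P(x, y)$ is a flow matrix that detailed balance forces to be symmetric. A sketch with illustrative target weights:

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative target on 4 states
n = len(pi)

# Metropolis kernel with uniform proposal q(y|x) = 1/n and
# acceptance probability min(1, pi_y / pi_x).
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            P[x, y] = (1 / n) * min(1, pi[y] / pi[x])
    P[x, x] = 1 - P[x].sum()      # rejected moves stay put

F = pi[:, None] * P               # flow matrix F_xy = pi_x P(x, y)
assert np.allclose(F, F.T)        # detailed balance: flows are symmetric
assert np.allclose(pi @ P, pi)    # hence pi is stationary
```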

Definition

Irreducibility, Aperiodicity, Ergodicity

A chain is $\pi$-irreducible if for every state $x$ and every set $A$ with $\pi(A) > 0$, there exists $n \geq 1$ with $P^n(x, A) > 0$. The chain can reach any positive-probability region from any starting point in finitely many steps.

A chain is aperiodic if the gcd of return times to any state (or $\pi$-positive set) is 1. Periodicity would force the marginal to oscillate rather than converge.

Together with the existence of a stationary distribution, these conditions make the chain ergodic: $\|P^n(x, \cdot) - \pi(\cdot)\|_{\text{TV}} \to 0$ as $n \to \infty$ for every starting state $x$. On general state spaces the proper notion is Harris recurrence, which guarantees the marginal convergence under irreducibility plus aperiodicity.

Main Theorems

Theorem

Detailed Balance Implies Stationarity

Statement

If a Markov chain with transition kernel $P$ satisfies detailed balance with respect to $\pi$, then $\pi$ is a stationary distribution for $P$. That is, $\pi P = \pi$.

Intuition

Detailed balance says forward and reverse flows match locally between every pair of states. The total inflow into any region $A$ equals the total outflow, so the mass distribution does not change. That is exactly what "stationary" means.

The converse is not true. Stationarity is a global balance condition; you can have local imbalances that cancel in aggregate. Non-reversible chains exploit exactly this slack to mix faster than reversible ones.

Proof Sketch

Integrate detailed balance with respect to $x$. For any measurable $A$:

$$\int_\mathcal{X} P(x, A)\, \pi(dx) = \int_\mathcal{X} \int_A P(x, dy)\, \pi(dx) = \int_A \int_\mathcal{X} P(y, dx)\, \pi(dy) = \int_A \pi(dy) = \pi(A).$$

The first equality unpacks $P(x, A)$; the second applies detailed balance with the pair of regions taken to be $\mathcal{X}$ and $A$; the third uses that $P(y, \cdot)$ is a probability measure, so $\int_\mathcal{X} P(y, dx) = 1$.

Why It Matters

This is the reason every textbook MCMC proof starts with verifying detailed balance. It is the cleanest sufficient condition for the chain to preserve $\pi$. Metropolis-Hastings is engineered to satisfy detailed balance by construction. The acceptance ratio is exactly the correction that turns an arbitrary proposal kernel into a $\pi$-reversible kernel.

Failure Mode

Failing detailed balance does not invalidate a sampler; many practical samplers are stationary but not reversible. But verifying stationarity without detailed balance is much harder, requiring a direct check of $\pi P = \pi$. For non-reversible designs, the standard tool is to write the kernel as a composition of reversible pieces and use the fact that each piece preserves $\pi$.
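The composition trick can be checked directly on a small example. Each factor below is a reversible Metropolis kernel for the same target (the weights and proposals are illustrative), and their product preserves $\pi$ even though nothing forces the product itself to satisfy detailed balance:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])    # illustrative 3-state target

def metropolis_kernel(pi, q):
    """Reversible Metropolis kernel for a symmetric proposal matrix q."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x:
                P[x, y] = q[x, y] * min(1, pi[y] / pi[x])
        P[x, x] = 1 - P[x].sum()
    return P

P1 = metropolis_kernel(pi, np.full((3, 3), 1 / 3))   # uniform proposal
P2 = metropolis_kernel(pi, (1 - np.eye(3)) / 2)      # propose one of the other two states

# Each factor preserves pi, so the composition does too:
# pi (P1 P2) = (pi P1) P2 = pi P2 = pi.
assert np.allclose(pi @ P1, pi) and np.allclose(pi @ P2, pi)
assert np.allclose(pi @ (P1 @ P2), pi)
```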

Theorem

MCMC Ergodic Theorem

Statement

Let $X_0, X_1, X_2, \ldots$ be a Markov chain with transition kernel $P$, and assume $P$ is $\pi$-irreducible, aperiodic, and has stationary distribution $\pi$. Then for any $f \in L^1(\pi)$ and any starting state $x$:

$$\frac{1}{n}\sum_{t=1}^n f(X_t) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)].$$

Furthermore, the marginal $P^n(x, \cdot)$ converges to $\pi$ in total variation: $\|P^n(x, \cdot) - \pi\|_{\text{TV}} \to 0$.

Intuition

This is the MCMC analog of the law of large numbers. For iid samples, sample averages converge to expectations. For ergodic Markov chains, time-averages along a trajectory converge to the $\pi$-expectation. The chain explores the state space in proportion to $\pi$, and the long-run frequency of visits to any region converges to its $\pi$-mass.

The catch: this is an asymptotic statement. For finite $n$, the chain has not converged, the early samples are biased toward the starting state, and the variance of the time-average carries autocorrelation.

Proof Sketch

Standard ergodic theory for Harris-recurrent chains. The hard work is the zero-one law for tail events and the construction of regeneration times (or coupling) that show the marginal converges to $\pi$. Once that is in place, the sample-average convergence follows from the standard ergodic theorem applied to the shift-invariant process $(X_t)_{t \geq 0}$ under its stationary law.

See Meyn & Tweedie (2009), Chapter 17 for the discrete-time treatment on general state spaces, and Roberts & Rosenthal (2004) for the MCMC-tailored version.

Why It Matters

This is what justifies running an MCMC chain and reporting time-averages as estimates of $\pi$-expectations. It is the legal foundation for every posterior summary in Stan, PyMC, BUGS, JAGS, NUTS, and friends.

For finite-sample variance, one wants a Markov-chain CLT: $\sqrt{n}\,(\hat\mu_n - \mu) \to \mathcal{N}(0, \sigma^2_{\text{MCMC}})$, where $\sigma^2_{\text{MCMC}} = \sigma^2(f) \cdot \tau_{\text{int}}$ and $\tau_{\text{int}}$ is the integrated autocorrelation time. This requires geometric ergodicity (or an $L^2$ spectral gap) plus a moment condition, not just irreducibility plus aperiodicity. Without geometric ergodicity, the Markov-chain CLT can fail and confidence intervals based on the iid formula are unreliable.

Failure Mode

Reducible chains can be stuck in one mode and never see the rest of the state space; periodic chains oscillate without converging. On continuous state spaces with disjoint positive-density components and a local-proposal kernel, the chain can be reducible without obvious warning, and the time-average converges to the wrong answer. Multimodal posteriors in Bayesian models are the canonical practical risk: a chain trapped in one mode is consistent with all standard convergence diagnostics yet still wrong.

The Big Three MCMC Constructions

The standard MCMC algorithms differ in how they propose moves and how they correct for proposal-target mismatch. All three guarantee detailed balance (or $\pi$-invariance) by construction.

Metropolis-Hastings

Propose $x' \sim q(\cdot \mid x)$ from any conditional distribution $q$, then accept with probability

$$\alpha(x, x') = \min\!\left(1, \frac{\tilde\pi(x')\, q(x \mid x')}{\tilde\pi(x)\, q(x' \mid x)}\right).$$

The key property: $\tilde\pi$ enters only as a ratio, so the unknown normalizing constant cancels. This is the reason MCMC works on unnormalized targets. See Metropolis-Hastings for the full derivation, including detailed-balance verification and the random-walk and independence proposal special cases.
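A single MH update touches $\tilde\pi$ only through a log-ratio, so the normalizer is never computed. A minimal random-walk sketch (function names, step size, and chain length are illustrative):

```python
import numpy as np

def mh_step(x, log_target, step, rng):
    """One random-walk Metropolis update; log_target may be unnormalized."""
    x_prop = x + step * rng.standard_normal(x.shape)
    # Symmetric proposal: q(x|x') = q(x'|x), so only the target ratio remains.
    log_alpha = log_target(x_prop) - log_target(x)
    if np.log(rng.uniform()) < log_alpha:
        return x_prop, True
    return x, False

# Unnormalized standard normal: dropping -0.5*d*log(2*pi) changes nothing.
log_target = lambda x: -0.5 * np.sum(x**2)

rng = np.random.default_rng(0)
x = np.zeros(2)
chain = []
for _ in range(5000):
    x, _ = mh_step(x, log_target, step=1.0, rng=rng)
    chain.append(x.copy())
chain = np.asarray(chain)

# Long-run averages approach the target's moments (loose check).
assert abs(chain[1000:].mean()) < 0.2
```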

Gibbs Sampling

Update each variable $x_i$ from its full conditional $\pi(x_i \mid x_{-i})$ in turn. Equivalent to MH with proposal equal to the full conditional, giving acceptance probability 1. Excellent when conditionals have closed form (conjugate priors, hierarchical models with Gaussian random effects); useless when they do not. See Gibbs Sampling for closed-form derivations of the bivariate Gaussian, normal-normal, and beta-binomial cases, and the canonical "100% acceptance does not mean fast mixing" example.
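For a bivariate Gaussian with correlation $\rho$, both full conditionals are Gaussian, so each sweep is two exact draws. A sketch (the correlation value and chain length are illustrative):

```python
import numpy as np

rho = 0.8                       # illustrative correlation
rng = np.random.default_rng(1)

# Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
x, y = 0.0, 0.0
xs = []
for _ in range(20000):
    x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal()
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
    xs.append(x)
xs = np.asarray(xs[2000:])      # drop burn-in

# Marginal of x is N(0, 1); time-averages should recover its moments.
assert abs(xs.mean()) < 0.1
assert abs(xs.var() - 1.0) < 0.1
```

Every proposal is "accepted", yet successive sweeps are still correlated (here with lag-1 autocorrelation $\rho^2$), which is exactly the "100% acceptance does not mean fast mixing" point.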

Hamiltonian Monte Carlo

Augment the state with a momentum variable $p$, simulate Hamiltonian dynamics on the joint distribution, and Metropolis-correct for discretization error. The dynamics use the gradient $\nabla \log \pi(q)$ to make distant proposals that still land in high-density regions, achieving $O(d^{5/4})$ scaling versus the $O(d^2)$ of random-walk Metropolis. See Hamiltonian Monte Carlo for the leapfrog integrator, the symplectic-geometry argument for volume preservation, and the No-U-Turn Sampler (NUTS) that auto-tunes trajectory length.
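The core of one HMC step is a leapfrog trajectory plus an MH accept on the change in total energy. A stripped-down sketch for a standard-normal target (step size, trajectory length, and the unit mass matrix are illustrative choices; no adaptation):

```python
import numpy as np

def leapfrog(q, p, grad_log_pi, eps, L):
    """L leapfrog steps of Hamiltonian dynamics with unit mass matrix."""
    q, p = q.copy(), p.copy()
    p += 0.5 * eps * grad_log_pi(q)       # initial half step in momentum
    for _ in range(L - 1):
        q += eps * p                      # full step in position
        p += eps * grad_log_pi(q)         # full step in momentum
    q += eps * p
    p += 0.5 * eps * grad_log_pi(q)       # final half step
    return q, p

def hmc_step(q, log_pi, grad_log_pi, eps, L, rng):
    p = rng.standard_normal(q.shape)      # resample momentum
    q_new, p_new = leapfrog(q, p, grad_log_pi, eps, L)
    # Accept on the change in H = -log pi(q) + |p|^2 / 2.
    dH = (-log_pi(q_new) + 0.5 * p_new @ p_new) - (-log_pi(q) + 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < -dH else q

log_pi = lambda q: -0.5 * q @ q           # standard normal, unnormalized
grad_log_pi = lambda q: -q

rng = np.random.default_rng(2)
q = np.zeros(5)
draws = []
for _ in range(3000):
    q = hmc_step(q, log_pi, grad_log_pi, eps=0.2, L=10, rng=rng)
    draws.append(q[0])
assert abs(np.mean(draws[500:])) < 0.2
```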

| Method | Cost per step | Tuning needed | Handles correlation | Multimodal targets |
| --- | --- | --- | --- | --- |
| Random-walk MH | $O(1)$ density | proposal scale | Poor | Slow |
| Independence MH | $O(1)$ density | proposal $q$ | Depends on $q$ | If $q$ covers modes |
| Gibbs (full conditionals) | $O(d)$ conditional draws | none | Poor without blocking | Slow |
| HMC / NUTS | $O(L)$ gradients | $\epsilon, L$ (auto in NUTS) | Good | Slow |
| Langevin MCMC (MALA) | $O(1)$ gradient | step size | Moderate | Slow |

Burn-In, Mixing Time, and Effective Sample Size

The asymptotic ergodic theorem hides three practical problems that every MCMC user faces.

Definition

Burn-In Period

The burn-in period is the initial portion of the chain that is discarded because the marginal $P^t(x_0, \cdot)$ has not yet converged to $\pi$. There is no universal "correct" burn-in length: it depends on the starting point, the chain's mixing time, and the geometry of $\pi$.

In practice, run multiple chains from overdispersed starting points and discard an initial segment whose length is chosen from convergence diagnostics (split-$\hat R$, trace plots, energy plots for HMC). For NUTS, modern Stan and PyMC default to discarding a per-run warmup segment that also adapts the step size and mass matrix; the discarded warmup is wider than just burn-in.

Definition

Mixing Time

For a chain on a finite (or compact) state space, the mixing time is

$$t_{\text{mix}}(\epsilon) = \min\{ t \geq 0 : \max_x \|P^t(x, \cdot) - \pi\|_{\text{TV}} \leq \epsilon \}.$$

For reversible chains, $t_{\text{mix}}(\epsilon) \asymp \frac{1}{\gamma} \log \frac{1}{\epsilon\, \pi_{\min}}$, where $\gamma$ is the spectral gap (1 minus the second-largest eigenvalue of $P$ in absolute value). Small spectral gap means slow mixing.

For continuous state spaces, the analog is the $L^2(\pi)$ spectral gap or a logarithmic Sobolev inequality. Geometric ergodicity is the qualitative version: the marginal converges to $\pi$ at an exponential rate, with rate constant determined by the spectral gap.

Definition

Autocorrelation and Effective Sample Size

For a stationary chain $X_t$, the autocorrelation of $f(X_t)$ at lag $k$ is $\rho_k = \text{Corr}(f(X_t), f(X_{t+k}))$. The integrated autocorrelation time is

$$\tau_{\text{int}} = 1 + 2 \sum_{k=1}^\infty \rho_k.$$

For an iid sample, $\rho_k = 0$ for $k \geq 1$ and $\tau_{\text{int}} = 1$. For an autocorrelated chain, $\tau_{\text{int}} > 1$, often much larger.

The Markov-chain CLT gives $\text{Var}[\hat\mu_n] \approx \tau_{\text{int}} \cdot \sigma^2(f) / n$. Compared to iid sampling with the same $\sigma^2$, you need $\tau_{\text{int}}$ times more samples. The effective sample size $N_{\text{eff}} = n / \tau_{\text{int}}$ is the iid-equivalent count: a chain of length $n$ carries information equivalent to $N_{\text{eff}}$ iid draws. Always report $N_{\text{eff}}$, not $n$.
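A common way to estimate $\tau_{\text{int}}$ in practice is to sum empirical autocorrelations with a truncation rule and report $N_{\text{eff}} = n / \tau_{\text{int}}$. A sketch on an AR(1) chain, whose true value is $\tau_{\text{int}} = (1+\phi)/(1-\phi)$ (the crude cutoff rule here is illustrative, not Stan's estimator):

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.9, 200_000
# AR(1): x_t = phi x_{t-1} + noise, so rho_k = phi^k and
# tau_int = (1 + phi) / (1 - phi) = 19.
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

def tau_int(x, max_lag=1000):
    """Naive tau_int estimate: sum rho_k until the estimate goes negative."""
    x = x - x.mean()
    var = x @ x / len(x)
    tau = 1.0
    for k in range(1, max_lag):
        rho = (x[:-k] @ x[k:]) / (len(x) * var)
        if rho < 0.0:           # crude truncation at first negative estimate
            break
        tau += 2 * rho
    return tau

tau = tau_int(x)
n_eff = n / tau
assert 14 < tau < 25            # true value is 19
assert n_eff < n / 10           # far fewer effective than raw samples
```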

The interaction of these three concepts is the practical subject of Burn-In and Convergence Diagnostics.

ML Applications: Where MCMC Earns Its Keep

Bayesian neural networks. Posterior over weights given training data is intractable; HMC and stochastic-gradient MCMC variants are the gold-standard tool. Practical scale is limited (hundreds of thousands of parameters, not billions), so variational inference often dominates at LLM scale, but BNN-MCMC remains the right call when calibrated uncertainty matters more than throughput.

Latent Dirichlet Allocation (LDA). The classic topic model has a collapsed-Gibbs sampler that integrates out topic-word distributions analytically, leaving only document-topic assignments. The closed-form conditionals make Gibbs sampling extremely effective.

Probabilistic graphical models. Once you have a directed or undirected graphical model with conjugate-or-near-conjugate structure, Gibbs and Metropolis-Hastings are the workhorses. PyMC, Stan, and Edward are built around this paradigm.

Statistical physics and lattice QCD. The original setting of HMC ("Hybrid Monte Carlo") in 1987 was lattice gauge theory, where the target is a high-dimensional, highly correlated, gradient-known density. The techniques transferred to ML almost intact.

Common Confusions

Watch Out

MCMC samples are not independent

Successive states in a Markov chain are correlated by construction. The ergodic theorem guarantees correct asymptotic time-averages, but the variance of those averages is inflated by $\tau_{\text{int}}$. Always report effective sample size, never raw chain length, when summarizing MCMC results. A chain of 10000 samples with $\tau_{\text{int}} = 100$ is worth 100 iid samples, not 10000.

Watch Out

Convergence diagnostics are necessary, not sufficient

$\hat R$, ESS, and trace plots can certify a chain is not converged but cannot certify that it is. A chain trapped in one mode of a multimodal posterior will pass every convergence diagnostic and still produce arbitrarily wrong answers. The only protections against this are running multiple chains from overdispersed starts, knowing the model well enough to anticipate multimodality, and using techniques designed for multimodal targets (parallel tempering, simulated annealing variants).

Watch Out

Detailed balance is sufficient, not necessary

Some important MCMC kernels are stationary without satisfying detailed balance. Systematic-scan Gibbs is the most common example: it is $\pi$-invariant as a composition of reversible component-wise updates, but the composition itself is not reversible. Non-reversible MCMC ("lifted" walks, generalized HMC) can mix strictly faster than reversible designs by exploiting the slack between stationarity and reversibility.

Watch Out

The normalizing constant cancels, but not always

Standard MH only needs $\pi$ up to a constant, because the acceptance ratio is a ratio of $\pi$ values. Methods that estimate the normalizing constant itself (marginal likelihood for Bayes factors, evidence in model selection) do require dealing with it: thermodynamic integration, bridge sampling, and stepping-stone sampling are the classical tools. They are not free.

Watch Out

Burn-in is not a fixed fraction

"Discard the first half" is folklore, not theory. The right burn-in length depends on the starting state's distance from the typical-set regions of $\pi$ and on the chain's mixing time. For HMC with NUTS, the warmup phase is a single tuning interval (default 1000 iterations in Stan) where step size and mass matrix adapt; samples from warmup are not MCMC samples and must be discarded entirely, regardless of "convergence."

Canonical Examples

Example

Two-state chain

$\mathcal{X} = \{0, 1\}$ with transition matrix $P = \begin{pmatrix} 1 - a & a \\ b & 1 - b \end{pmatrix}$ for $a, b \in (0, 1)$. The stationary distribution is $\pi = (b/(a+b),\; a/(a+b))$. Detailed balance holds: $\pi_0 P_{01} = \pi_1 P_{10}$ becomes $b/(a+b) \cdot a = a/(a+b) \cdot b$.

Mixing time: $\|P^t(0, \cdot) - \pi\|_{\text{TV}} = (1-a-b)^t \cdot a/(a+b)$. The spectral gap is $a + b$; small $a, b$ means slow mixing. This is the simplest setting in which to see all of stationarity, ergodicity, and mixing time concretely.
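All three properties of the two-state chain can be verified numerically (the values of $a, b$ are illustrative):

```python
import numpy as np

a, b = 0.1, 0.3
P = np.array([[1 - a, a],
              [b, 1 - b]])
pi = np.array([b, a]) / (a + b)

assert np.allclose(pi @ P, pi)                        # stationarity
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance

# TV distance from state 0 decays geometrically at rate 1 - a - b.
for t in [1, 5, 20]:
    Pt = np.linalg.matrix_power(P, t)
    tv = 0.5 * np.abs(Pt[0] - pi).sum()
    assert np.isclose(tv, (1 - a - b) ** t * a / (a + b))
```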

Example

Random walk MH on a Gaussian target

$\pi = \mathcal{N}(0, I_d)$ with proposal $q(\cdot \mid x) = \mathcal{N}(x, \sigma^2 I_d)$. The symmetric proposal cancels in the acceptance ratio; only $\pi(x')/\pi(x)$ matters. Roberts-Gelman-Gilks (1997) shows the optimal step size is $\sigma_* = 2.38 / \sqrt{d}$, giving optimal acceptance rate $0.234$ in the diffusion limit. Random-walk MH on an iid Gaussian target needs $O(d)$ steps, hence $O(d^2)$ total work, per effective sample, which is the canonical example of dimension-induced slow mixing.
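The scaling result is easy to probe empirically: run random-walk MH on $\mathcal{N}(0, I_d)$ with the $2.38/\sqrt{d}$ step and watch the acceptance rate settle near the predicted value. A rough sketch (dimension, chain length, and the assertion tolerance are illustrative):

```python
import numpy as np

d, n = 30, 20000
sigma = 2.38 / np.sqrt(d)       # Roberts-Gelman-Gilks optimal scale
rng = np.random.default_rng(4)

x = np.zeros(d)
accepts = 0
for _ in range(n):
    x_prop = x + sigma * rng.standard_normal(d)
    # Symmetric proposal: the acceptance ratio is pi(x') / pi(x) only.
    log_alpha = 0.5 * (x @ x - x_prop @ x_prop)
    if np.log(rng.uniform()) < log_alpha:
        x = x_prop
        accepts += 1

rate = accepts / n
# The diffusion-limit prediction is 0.234; finite d lands nearby.
assert 0.1 < rate < 0.4
```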

Exercises

ExerciseCore

Problem

Consider a Metropolis-Hastings chain on $\mathbb{R}$ with target $\tilde\pi(x) = e^{-x^2/2}$ (unnormalized standard normal) and symmetric random-walk proposal $q(x' \mid x) = \mathcal{N}(x, \sigma^2)$. Verify explicitly that this kernel satisfies detailed balance with respect to $\pi(x) = (2\pi)^{-1/2} e^{-x^2/2}$.

ExerciseCore

Problem

A chain has integrated autocorrelation time $\tau_{\text{int}} = 50$ and you draw $n = 10000$ samples. What is the effective sample size? Roughly how many iid samples would you need to match the same standard error for estimating a posterior mean?

ExerciseAdvanced

Problem

Show that for a reversible chain with spectral gap $\gamma$, the integrated autocorrelation time of the eigenfunction $\psi_2$ associated with the second-largest eigenvalue $\lambda_2 = 1 - \gamma$ is $\tau_{\text{int}}(\psi_2) = (1 + \lambda_2) / (1 - \lambda_2) = (2 - \gamma)/\gamma$. Conclude that a small spectral gap implies large autocorrelation.

References

Canonical:

  • Robert & Casella, Monte Carlo Statistical Methods (2nd ed., 2004), Chapters 6-12. The standard graduate-level reference, with detailed-balance derivations, geometric ergodicity, and CLTs.
  • Brooks, Gelman, Jones, Meng (eds.), Handbook of Markov Chain Monte Carlo (2011). Survey volume covering MH, Gibbs, HMC, slice sampling, parallel tempering, sequential Monte Carlo. Chapter 1 (Geyer, "Introduction") and Chapter 5 (Neal, "MCMC Using Hamiltonian Dynamics") are the must-read entries.
  • Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapters 11-12.

Theory:

  • Meyn & Tweedie, Markov Chains and Stochastic Stability (2nd ed., 2009), Chapters 13-17. Drift conditions, geometric ergodicity, and the general-state-space ergodic theorem.
  • Roberts & Rosenthal, "General state space Markov chains and MCMC algorithms," Probability Surveys 1 (2004), 20-71. Concise survey of ergodicity conditions tailored to MCMC.
  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2nd ed., 2017). Spectral-gap and coupling techniques, mostly for finite chains but with continuous-state extensions.

Original papers:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics 21 (1953), 1087-1092. The original Metropolis algorithm.
  • Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika 57 (1970), 97-109. Generalization to non-symmetric proposals.
  • Geman & Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE TPAMI 6 (1984), 721-741. Origin of the modern Gibbs sampler.

Practical:

  • Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-normalization, folding, and localization: An improved $\hat R$ for assessing convergence of MCMC" (2021, arXiv:1903.08008). Modern $\hat R$ definition used by Stan and PyMC.

Next Topics

  • Metropolis-Hastings: the foundational MCMC construction. Proposal distributions, acceptance ratios, and detailed-balance verification in full.
  • Gibbs Sampling: full-conditional updates, special case of MH with acceptance 1, conjugate models, and blocking.
  • Hamiltonian Monte Carlo: gradient-based MCMC with $O(d^{5/4})$ scaling. Leapfrog, NUTS, and divergences.
  • Burn-In and Convergence Diagnostics: $\hat R$, ESS, trace plots, and the practical workflow for trusting an MCMC run.
  • Langevin Dynamics: continuous-time analog of MCMC; the link between SGD-with-noise and posterior sampling.

Last reviewed: May 6, 2026
