

Markov Chain Monte Carlo

When you can evaluate a target up to a normalizing constant but cannot sample from it, build a Markov chain whose stationary distribution is the target. Detailed balance, ergodicity, mixing time, autocorrelation, and the variance penalty for non-iid samples.


Why This Matters

Plain Monte Carlo requires that you can already sample from the target $\pi$. The typical Bayesian situation is the opposite: you know $\pi(\theta) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$ as a function of $\theta$, but the normalizing constant $\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ is exactly the intractable thing. Direct sampling is impossible.

MCMC is the workaround that breaks the impasse. Instead of producing iid samples, build a Markov chain $X_0, X_1, X_2, \ldots$ whose stationary distribution is $\pi$. Run the chain long enough and the marginal distribution of $X_t$ approaches $\pi$. Once it has, time-averages along the trajectory estimate $\mathbb{E}_\pi[f(X)]$ correctly. The crucial trick is that you only ever need $\pi$ up to a multiplicative constant, because constructions like the Metropolis-Hastings ratio cancel the constant out.

That single idea unlocks Bayesian neural networks, Latent Dirichlet Allocation, posterior inference in probabilistic graphical models, lattice QCD, and most of statistical physics. The price you pay relative to iid Monte Carlo is correlation: successive samples are not independent, so the variance of the sample mean carries an autocorrelation factor, and chains need a burn-in period before their marginal is close to $\pi$.

Mental Model

Picture a walker on a landscape whose height is the unnormalized log density. From each point the walker takes a randomized step that biases toward higher density but occasionally moves downhill. Run the walker long enough and it visits each region in proportion to the true density. The recipe for designing those random steps is exactly what Metropolis-Hastings, Gibbs, and Hamiltonian Monte Carlo provide.

Three things have to be true for the walker's trajectory to be useful:

  1. The chain must have $\pi$ as a stationary distribution. This is the correctness requirement; detailed balance is the standard way to guarantee it.
  2. The chain must be irreducible and aperiodic, so the marginal at time $t$ converges to $\pi$ regardless of starting point. This is the ergodicity requirement.
  3. The convergence and mixing must be fast enough that the chain explores the target in a reasonable wall-clock budget. This is the practicality requirement, controlled by the spectral gap, geometric ergodicity, or stronger conditions like log-Sobolev inequalities.

The correctness requirement is usually free; the ergodicity requirement is usually free in continuous state spaces with positive density; the mixing requirement is where every honest discussion of MCMC ends up.

Formal Setup and Notation

Let $\pi$ be the target distribution on a measurable state space $\mathcal{X}$. We assume we can evaluate $\tilde\pi(x) \propto \pi(x)$ pointwise, but we cannot sample from $\pi$ directly. A Markov chain on $\mathcal{X}$ is specified by a transition kernel $P(x, dy)$.

Definition

Stationary Distribution

A distribution $\pi$ on $\mathcal{X}$ is stationary for a Markov chain with kernel $P$ if it is left invariant by $P$:

$$\pi(A) = \int_\mathcal{X} P(x, A)\, \pi(dx) \quad \text{for every measurable } A.$$

Equivalently, if $X_0 \sim \pi$ and $X_1 \mid X_0 \sim P(X_0, \cdot)$, then $X_1 \sim \pi$. The MCMC design problem is to construct $P$ so that the target $\pi$ is stationary.
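For a finite state space, invariance is just a left-eigenvector equation, which is easy to check numerically. A minimal sketch (the $3 \times 3$ matrix is an arbitrary illustration, not taken from the text):

```python
import numpy as np

# Arbitrary 3-state transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# A stationary distribution is a left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

# Invariance: one step of the chain leaves pi unchanged.
assert np.allclose(pi @ P, pi)
assert np.all(pi > 0)
```

Because all entries of this $P$ are positive, the chain is irreducible and the stationary distribution is unique and strictly positive.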

Definition

Detailed Balance (Reversibility)

A kernel $P$ satisfies detailed balance with respect to $\pi$ if for every measurable $A, B$:

$$\int_A \int_B P(x, dy)\, \pi(dx) = \int_B \int_A P(y, dx)\, \pi(dy).$$

In density form (assuming $\pi$ and $P(x, \cdot)$ have densities), $\pi(x)\, p(x, y) = \pi(y)\, p(y, x)$ for all $x, y$. A chain satisfying detailed balance is reversible: the time-reversed process has the same law.

Reversibility implies stationarity: integrate the detailed-balance equation in $x$ to get $\pi P = \pi$. The converse is not true. Stationarity does not require reversibility, and important MCMC kernels (systematic-scan Gibbs, lifted-walk samplers, non-reversible HMC variants) are stationary without being reversible.
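The density-form condition is easy to verify numerically for a finite-state Metropolis kernel, where $\pi_x P(x, y)$ is a flow matrix that detailed balance forces to be symmetric. A sketch with illustrative target weights:

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative target on 4 states
n = len(pi)

# Metropolis kernel with uniform proposal q(y|x) = 1/n and
# acceptance probability min(1, pi_y / pi_x).
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            P[x, y] = (1 / n) * min(1, pi[y] / pi[x])
    P[x, x] = 1 - P[x].sum()      # rejected moves stay put

F = pi[:, None] * P               # flow matrix F_xy = pi_x P(x, y)
assert np.allclose(F, F.T)        # detailed balance: flows are symmetric
assert np.allclose(pi @ P, pi)    # hence pi is stationary
```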

Definition

Irreducibility, Aperiodicity, Ergodicity

A chain is $\pi$-irreducible if for every state $x$ and every set $A$ with $\pi(A) > 0$, there exists $n \geq 1$ with $P^n(x, A) > 0$. The chain can reach any positive-probability region from any starting point in finitely many steps.

A chain is aperiodic if the gcd of return times to any state (or $\pi$-positive set) is 1. Periodicity would force the marginal to oscillate rather than converge.

Together with the existence of a stationary distribution, these conditions make the chain ergodic: $\|P^n(x, \cdot) - \pi(\cdot)\|_{\text{TV}} \to 0$ as $n \to \infty$ for every starting state $x$. On general state spaces the proper notion is Harris recurrence, which guarantees the marginal convergence under irreducibility plus aperiodicity.

Main Theorems

Theorem

Detailed Balance Implies Stationarity

Statement

If a Markov chain with transition kernel $P$ satisfies detailed balance with respect to $\pi$, then $\pi$ is a stationary distribution for $P$. That is, $\pi P = \pi$.

Intuition

Detailed balance says forward and reverse flows match locally between every pair of states. The total inflow into any region $A$ equals the total outflow, so the mass distribution does not change. That is exactly what "stationary" means.

The converse is not true. Stationarity is a global balance condition; you can have local imbalances that cancel in aggregate. Non-reversible chains exploit exactly this slack to mix faster than reversible ones.

Proof Sketch

Integrate detailed balance with respect to $x$. For any measurable $A$:

$$\int_\mathcal{X} P(x, A)\, \pi(dx) = \int_\mathcal{X} \int_A P(x, dy)\, \pi(dx) = \int_A \int_\mathcal{X} P(y, dx)\, \pi(dy) = \int_A \pi(dy) = \pi(A).$$

The first equality unpacks $P(x, A)$; the second applies detailed balance with the pair of regions taken to be $\mathcal{X}$ and $A$; the third uses that $P(y, \cdot)$ is a probability measure, so $\int_\mathcal{X} P(y, dx) = 1$.

Why It Matters

This is the reason every textbook MCMC proof starts with verifying detailed balance. It is the cleanest sufficient condition for the chain to preserve $\pi$. Metropolis-Hastings is engineered to satisfy detailed balance by construction. The acceptance ratio is exactly the correction that turns an arbitrary proposal kernel into a $\pi$-reversible kernel.

Failure Mode

Failing detailed balance does not invalidate a sampler; many practical samplers are stationary but not reversible. But verifying stationarity without detailed balance is much harder, requiring a direct check of $\pi P = \pi$. For non-reversible designs, the standard tool is to write the kernel as a composition of reversible pieces and use the fact that each piece preserves $\pi$.
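The composition trick can be checked directly on a small example. Each factor below is a reversible Metropolis kernel for the same target (the weights and proposals are illustrative), and their product preserves $\pi$ even though nothing forces the product itself to satisfy detailed balance:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])    # illustrative 3-state target

def metropolis_kernel(pi, q):
    """Reversible Metropolis kernel for a symmetric proposal matrix q."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x:
                P[x, y] = q[x, y] * min(1, pi[y] / pi[x])
        P[x, x] = 1 - P[x].sum()
    return P

P1 = metropolis_kernel(pi, np.full((3, 3), 1 / 3))   # uniform proposal
P2 = metropolis_kernel(pi, (1 - np.eye(3)) / 2)      # propose one of the other two states

# Each factor preserves pi, so the composition does too:
# pi (P1 P2) = (pi P1) P2 = pi P2 = pi.
assert np.allclose(pi @ P1, pi) and np.allclose(pi @ P2, pi)
assert np.allclose(pi @ (P1 @ P2), pi)
```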

Theorem

MCMC Ergodic Theorem

Statement

Let $X_0, X_1, X_2, \ldots$ be a Markov chain with transition kernel $P$, and assume $P$ is $\pi$-irreducible, aperiodic, and has stationary distribution $\pi$. Then for any $f \in L^1(\pi)$ and any starting state $x$:

$$\frac{1}{n}\sum_{t=1}^n f(X_t) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)].$$

Furthermore, the marginal $P^n(x, \cdot)$ converges to $\pi$ in total variation: $\|P^n(x, \cdot) - \pi\|_{\text{TV}} \to 0$.

Intuition

This is the MCMC analog of the law of large numbers. For iid samples, sample averages converge to expectations. For ergodic Markov chains, time-averages along a trajectory converge to the $\pi$-expectation. The chain explores the state space in proportion to $\pi$, and the long-run frequency of visits to any region converges to its $\pi$-mass.

The catch: this is an asymptotic statement. For finite $n$, the chain has not converged, the early samples are biased toward the starting state, and the variance of the time-average carries autocorrelation.

Proof Sketch

Standard ergodic theory for Harris-recurrent chains. The hard work is the zero-one law for tail events and the construction of regeneration times (or coupling) that show the marginal converges to $\pi$. Once that is in place, the sample-average convergence follows from the standard ergodic theorem applied to the shift-invariant process $(X_t)_{t \geq 0}$ under its stationary law.

See Meyn & Tweedie (2009), Chapter 17 for the discrete-time treatment on general state spaces, and Roberts & Rosenthal (2004) for the MCMC-tailored version.

Why It Matters

This is what justifies running an MCMC chain and reporting time-averages as estimates of $\pi$-expectations. It is the legal foundation for every posterior summary in Stan, PyMC, BUGS, JAGS, NUTS, and friends.

For finite-sample variance, one wants a Markov-chain CLT: $\sqrt{n}\,(\hat\mu_n - \mu) \to \mathcal{N}(0, \sigma^2_{\text{MCMC}})$, where $\sigma^2_{\text{MCMC}} = \sigma^2(f) \cdot \tau_{\text{int}}$ and $\tau_{\text{int}}$ is the integrated autocorrelation time. This requires geometric ergodicity (or an $L^2$ spectral gap) plus a moment condition, not just irreducibility plus aperiodicity. Without geometric ergodicity, the Markov-chain CLT can fail and confidence intervals based on the iid formula are unreliable.

Failure Mode

Reducible chains can be stuck in one mode and never see the rest of the state space; periodic chains oscillate without converging. On continuous state spaces with disjoint positive-density components and a local-proposal kernel, the chain can be reducible without obvious warning, and the time-average converges to the wrong answer. Multimodal posteriors in Bayesian models are the canonical practical risk: a chain trapped in one mode is consistent with all standard convergence diagnostics yet still wrong.

The Big Three MCMC Constructions

The standard MCMC algorithms differ in how they propose moves and how they correct for proposal-target mismatch. All three guarantee detailed balance (or $\pi$-invariance) by construction.

Metropolis-Hastings

Propose $x' \sim q(\cdot \mid x)$ from any conditional distribution $q$, then accept with probability

$$\alpha(x, x') = \min\!\left(1, \frac{\tilde\pi(x')\, q(x \mid x')}{\tilde\pi(x)\, q(x' \mid x)}\right).$$

The key property: $\tilde\pi$ enters only as a ratio, so the unknown normalizing constant cancels. This is the reason MCMC works on unnormalized targets. See Metropolis-Hastings for the full derivation, including detailed-balance verification and the random-walk and independence proposal special cases.
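A single MH update touches $\tilde\pi$ only through a log-ratio, so the normalizer is never computed. A minimal random-walk sketch (function names, step size, and chain length are illustrative):

```python
import numpy as np

def mh_step(x, log_target, step, rng):
    """One random-walk Metropolis update; log_target may be unnormalized."""
    x_prop = x + step * rng.standard_normal(x.shape)
    # Symmetric proposal: q(x|x') = q(x'|x), so only the target ratio remains.
    log_alpha = log_target(x_prop) - log_target(x)
    if np.log(rng.uniform()) < log_alpha:
        return x_prop, True
    return x, False

# Unnormalized standard normal: dropping -0.5*d*log(2*pi) changes nothing.
log_target = lambda x: -0.5 * np.sum(x**2)

rng = np.random.default_rng(0)
x = np.zeros(2)
chain = []
for _ in range(5000):
    x, _ = mh_step(x, log_target, step=1.0, rng=rng)
    chain.append(x.copy())
chain = np.asarray(chain)

# Long-run averages approach the target's moments (loose check).
assert abs(chain[1000:].mean()) < 0.2
```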

Gibbs Sampling

Update each variable $x_i$ from its full conditional $\pi(x_i \mid x_{-i})$ in turn. Equivalent to MH with proposal equal to the full conditional, giving acceptance probability 1. Excellent when conditionals have closed form (conjugate priors, hierarchical models with Gaussian random effects); useless when they do not. See Gibbs Sampling for closed-form derivations of the bivariate Gaussian, normal-normal, and beta-binomial cases, and the canonical "100% acceptance does not mean fast mixing" example.
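For a bivariate Gaussian with correlation $\rho$, both full conditionals are Gaussian, so each sweep is two exact draws. A sketch (the correlation value and chain length are illustrative):

```python
import numpy as np

rho = 0.8                       # illustrative correlation
rng = np.random.default_rng(1)

# Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
x, y = 0.0, 0.0
xs = []
for _ in range(20000):
    x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal()
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal()
    xs.append(x)
xs = np.asarray(xs[2000:])      # drop burn-in

# Marginal of x is N(0, 1); time-averages should recover its moments.
assert abs(xs.mean()) < 0.1
assert abs(xs.var() - 1.0) < 0.1
```

Every proposal is "accepted", yet successive sweeps are still correlated (here with lag-1 autocorrelation $\rho^2$), which is exactly the "100% acceptance does not mean fast mixing" point.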

Hamiltonian Monte Carlo

Augment the state with a momentum variable $p$, simulate Hamiltonian dynamics on the joint distribution, and Metropolis-correct for discretization error. The dynamics use the gradient $\nabla \log \pi(q)$ to make distant proposals that still land in high-density regions, achieving $O(d^{5/4})$ scaling versus the $O(d^2)$ of random-walk Metropolis. See Hamiltonian Monte Carlo for the leapfrog integrator, the symplectic-geometry argument for volume preservation, and the No-U-Turn Sampler (NUTS) that auto-tunes trajectory length.
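The core of one HMC step is a leapfrog trajectory plus an MH accept on the change in total energy. A stripped-down sketch for a standard-normal target (step size, trajectory length, and the unit mass matrix are illustrative choices; no adaptation):

```python
import numpy as np

def leapfrog(q, p, grad_log_pi, eps, L):
    """L leapfrog steps of Hamiltonian dynamics with unit mass matrix."""
    q, p = q.copy(), p.copy()
    p += 0.5 * eps * grad_log_pi(q)       # initial half step in momentum
    for _ in range(L - 1):
        q += eps * p                      # full step in position
        p += eps * grad_log_pi(q)         # full step in momentum
    q += eps * p
    p += 0.5 * eps * grad_log_pi(q)       # final half step
    return q, p

def hmc_step(q, log_pi, grad_log_pi, eps, L, rng):
    p = rng.standard_normal(q.shape)      # resample momentum
    q_new, p_new = leapfrog(q, p, grad_log_pi, eps, L)
    # Accept on the change in H = -log pi(q) + |p|^2 / 2.
    dH = (-log_pi(q_new) + 0.5 * p_new @ p_new) - (-log_pi(q) + 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < -dH else q

log_pi = lambda q: -0.5 * q @ q           # standard normal, unnormalized
grad_log_pi = lambda q: -q

rng = np.random.default_rng(2)
q = np.zeros(5)
draws = []
for _ in range(3000):
    q = hmc_step(q, log_pi, grad_log_pi, eps=0.2, L=10, rng=rng)
    draws.append(q[0])
assert abs(np.mean(draws[500:])) < 0.2
```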

| Method | Cost per step | Tuning needed | Handles correlation | Multimodal targets |
| --- | --- | --- | --- | --- |
| Random-walk MH | $O(1)$ density | proposal scale | Poor | Slow |
| Independence MH | $O(1)$ density | proposal $q$ | Depends on $q$ | If $q$ covers modes |
| Gibbs (full conditionals) | $O(d)$ conditional draws | none | Poor without blocking | Slow |
| HMC / NUTS | $O(L)$ gradients | $\epsilon, L$ (auto in NUTS) | Good | Slow |
| Langevin MCMC (MALA) | $O(1)$ gradient | step size | Moderate | Slow |

Burn-In, Mixing Time, and Effective Sample Size

The asymptotic ergodic theorem hides three practical problems that every MCMC user faces.

Definition

Burn-In Period

The burn-in period is the initial portion of the chain that is discarded because the marginal $P^t(x_0, \cdot)$ has not yet converged to $\pi$. There is no universal "correct" burn-in length: it depends on the starting point, the chain's mixing time, and the geometry of $\pi$.

In practice, run multiple chains from overdispersed starting points and discard an initial segment whose length is chosen from convergence diagnostics (split-$\hat R$, trace plots, energy plots for HMC). For NUTS, modern Stan and PyMC default to discarding a per-run warmup segment that also adapts the step size and mass matrix; the discarded warmup is wider than just burn-in.

Definition

Mixing Time

For a chain on a finite (or compact) state space, the mixing time is

$$t_{\text{mix}}(\epsilon) = \min\{ t \geq 0 : \max_x \|P^t(x, \cdot) - \pi\|_{\text{TV}} \leq \epsilon \}.$$

For reversible chains, $t_{\text{mix}}(\epsilon) \asymp \frac{1}{\gamma} \log \frac{1}{\epsilon\, \pi_{\min}}$, where $\gamma$ is the spectral gap (1 minus the second-largest eigenvalue of $P$ in absolute value). Small spectral gap means slow mixing.

For continuous state spaces, the analog is the $L^2(\pi)$ spectral gap or a logarithmic Sobolev inequality. Geometric ergodicity is the qualitative version: the marginal converges to $\pi$ at an exponential rate, with rate constant determined by the spectral gap.

Definition

Autocorrelation and Effective Sample Size

For a stationary chain $X_t$, the autocorrelation of $f(X_t)$ at lag $k$ is $\rho_k = \text{Corr}(f(X_t), f(X_{t+k}))$. The integrated autocorrelation time is

$$\tau_{\text{int}} = 1 + 2 \sum_{k=1}^\infty \rho_k.$$

For an iid sample, $\rho_k = 0$ for $k \geq 1$ and $\tau_{\text{int}} = 1$. For an autocorrelated chain, $\tau_{\text{int}} > 1$, often much larger.

The Markov-chain CLT gives $\text{Var}[\hat\mu_n] \approx \tau_{\text{int}} \cdot \sigma^2(f) / n$. Compared to iid sampling with the same $\sigma^2$, you need $\tau_{\text{int}}$ times more samples. The effective sample size $N_{\text{eff}} = n / \tau_{\text{int}}$ is the iid-equivalent count: a chain of length $n$ carries information equivalent to $N_{\text{eff}}$ iid draws. Always report $N_{\text{eff}}$, not $n$.
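A common way to estimate $\tau_{\text{int}}$ in practice is to sum empirical autocorrelations with a truncation rule and report $N_{\text{eff}} = n / \tau_{\text{int}}$. A sketch on an AR(1) chain, whose true value is $\tau_{\text{int}} = (1+\phi)/(1-\phi)$ (the crude cutoff rule here is illustrative, not Stan's estimator):

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.9, 200_000
# AR(1): x_t = phi x_{t-1} + noise, so rho_k = phi^k and
# tau_int = (1 + phi) / (1 - phi) = 19.
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

def tau_int(x, max_lag=1000):
    """Naive tau_int estimate: sum rho_k until the estimate goes negative."""
    x = x - x.mean()
    var = x @ x / len(x)
    tau = 1.0
    for k in range(1, max_lag):
        rho = (x[:-k] @ x[k:]) / (len(x) * var)
        if rho < 0.0:           # crude truncation at first negative estimate
            break
        tau += 2 * rho
    return tau

tau = tau_int(x)
n_eff = n / tau
assert 14 < tau < 25            # true value is 19
assert n_eff < n / 10           # far fewer effective than raw samples
```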

The interaction of these three concepts is the practical subject of Burn-In and Convergence Diagnostics.

ML Applications: Where MCMC Earns Its Keep

Bayesian neural networks. Posterior over weights given training data is intractable; HMC and stochastic-gradient MCMC variants are the gold-standard tool. Practical scale is limited (hundreds of thousands of parameters, not billions), so variational inference often dominates at LLM scale, but BNN-MCMC remains the right call when calibrated uncertainty matters more than throughput.

Latent Dirichlet Allocation (LDA). The classic topic model has a collapsed-Gibbs sampler that integrates out topic-word distributions analytically, leaving only document-topic assignments. The closed-form conditionals make Gibbs sampling extremely effective.

Probabilistic graphical models. Once you have a directed or undirected graphical model with conjugate-or-near-conjugate structure, Gibbs and Metropolis-Hastings are the workhorses. PyMC, Stan, and Edward are built around this paradigm.

Statistical physics and lattice QCD. The original setting of HMC ("Hybrid Monte Carlo") in 1987 was lattice gauge theory, where the target is a high-dimensional, highly correlated, gradient-known density. The techniques transferred to ML almost intact.

Common Confusions

Watch Out

MCMC samples are not independent

Successive states in a Markov chain are correlated by construction. The ergodic theorem guarantees correct asymptotic time-averages, but the variance of those averages is inflated by $\tau_{\text{int}}$. Always report effective sample size, never raw chain length, when summarizing MCMC results. A chain of 10000 samples with $\tau_{\text{int}} = 100$ is worth 100 iid samples, not 10000.

Watch Out

Convergence diagnostics are necessary, not sufficient

$\hat R$, ESS, and trace plots can certify a chain is not converged but cannot certify that it is. A chain trapped in one mode of a multimodal posterior will pass every convergence diagnostic and still produce arbitrarily wrong answers. The only protections against this are running multiple chains from overdispersed starts, knowing the model well enough to anticipate multimodality, and using techniques designed for multimodal targets (parallel tempering, simulated annealing variants).

Watch Out

Detailed balance is sufficient, not necessary

Some important MCMC kernels are stationary without satisfying detailed balance. Systematic-scan Gibbs is the most common example: it is $\pi$-invariant as a composition of reversible component-wise updates, but the composition itself is not reversible. Non-reversible MCMC ("lifted" walks, generalized HMC) can mix strictly faster than reversible designs by exploiting the slack between stationarity and reversibility.

Watch Out

The normalizing constant cancels, but not always

Standard MH only needs $\pi$ up to a constant, because the acceptance ratio is a ratio of $\pi$ values. Methods that estimate the normalizing constant itself (marginal likelihood for Bayes factors, evidence in model selection) do require dealing with it: thermodynamic integration, bridge sampling, and stepping-stone sampling are the classical tools. They are not free.

Watch Out

Burn-in is not a fixed fraction

"Discard the first half" is folklore, not theory. The right burn-in length depends on the starting state's distance from the typical-set regions of $\pi$ and on the chain's mixing time. For HMC with NUTS, the warmup phase is a single tuning interval (default 1000 iterations in Stan) where step size and mass matrix adapt; samples from warmup are not MCMC samples and must be discarded entirely, regardless of "convergence."

Canonical Examples

Example

Two-state chain

$\mathcal{X} = \{0, 1\}$ with transition matrix $P = \begin{pmatrix} 1 - a & a \\ b & 1 - b \end{pmatrix}$ for $a, b \in (0, 1)$. The stationary distribution is $\pi = (b/(a+b),\; a/(a+b))$. Detailed balance holds: $\pi_0 P_{01} = \pi_1 P_{10}$ becomes $b/(a+b) \cdot a = a/(a+b) \cdot b$.

Mixing time: $\|P^t(0, \cdot) - \pi\|_{\text{TV}} = (1-a-b)^t \cdot a/(a+b)$. The spectral gap is $a + b$; small $a, b$ means slow mixing. This is the simplest setting in which to see all of stationarity, ergodicity, and mixing time concretely.
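All three properties of the two-state chain can be verified numerically (the values of $a, b$ are illustrative):

```python
import numpy as np

a, b = 0.1, 0.3
P = np.array([[1 - a, a],
              [b, 1 - b]])
pi = np.array([b, a]) / (a + b)

assert np.allclose(pi @ P, pi)                        # stationarity
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance

# TV distance from state 0 decays geometrically at rate 1 - a - b.
for t in [1, 5, 20]:
    Pt = np.linalg.matrix_power(P, t)
    tv = 0.5 * np.abs(Pt[0] - pi).sum()
    assert np.isclose(tv, (1 - a - b) ** t * a / (a + b))
```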

Example

Random walk MH on a Gaussian target

$\pi = \mathcal{N}(0, I_d)$ with proposal $q(\cdot \mid x) = \mathcal{N}(x, \sigma^2 I_d)$. The symmetric proposal cancels in the acceptance ratio; only $\pi(x')/\pi(x)$ matters. Roberts-Gelman-Gilks (1997) shows the optimal step size is $\sigma_* = 2.38 / \sqrt{d}$, giving optimal acceptance rate $0.234$ in the diffusion limit. Random-walk MH on an iid Gaussian target needs $O(d)$ steps, hence $O(d^2)$ total work, per effective sample, which is the canonical example of dimension-induced slow mixing.
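The scaling result is easy to probe empirically: run random-walk MH on $\mathcal{N}(0, I_d)$ with the $2.38/\sqrt{d}$ step and watch the acceptance rate settle near the predicted value. A rough sketch (dimension, chain length, and the assertion tolerance are illustrative):

```python
import numpy as np

d, n = 30, 20000
sigma = 2.38 / np.sqrt(d)       # Roberts-Gelman-Gilks optimal scale
rng = np.random.default_rng(4)

x = np.zeros(d)
accepts = 0
for _ in range(n):
    x_prop = x + sigma * rng.standard_normal(d)
    # Symmetric proposal: the acceptance ratio is pi(x') / pi(x) only.
    log_alpha = 0.5 * (x @ x - x_prop @ x_prop)
    if np.log(rng.uniform()) < log_alpha:
        x = x_prop
        accepts += 1

rate = accepts / n
# The diffusion-limit prediction is 0.234; finite d lands nearby.
assert 0.1 < rate < 0.4
```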

Exercises

ExerciseCore

Problem

Consider a Metropolis-Hastings chain on $\mathbb{R}$ with target $\tilde\pi(x) = e^{-x^2/2}$ (unnormalized standard normal) and symmetric random-walk proposal $q(x' \mid x) = \mathcal{N}(x, \sigma^2)$. Verify explicitly that this kernel satisfies detailed balance with respect to $\pi(x) = (2\pi)^{-1/2} e^{-x^2/2}$.

ExerciseCore

Problem

A chain has integrated autocorrelation time $\tau_{\text{int}} = 50$ and you draw $n = 10000$ samples. What is the effective sample size? Roughly how many iid samples would you need to match the same standard error for estimating a posterior mean?

ExerciseAdvanced

Problem

Show that for a reversible chain with spectral gap $\gamma$, the integrated autocorrelation time of the eigenfunction $\psi_2$ associated with the second-largest eigenvalue $\lambda_2 = 1 - \gamma$ is $\tau_{\text{int}}(\psi_2) = (1 + \lambda_2) / (1 - \lambda_2) = (2 - \gamma)/\gamma$. Conclude that a small spectral gap implies large autocorrelation.

References

Canonical:

  • Robert & Casella, Monte Carlo Statistical Methods (2nd ed., 2004), Chapters 6-12. The standard graduate-level reference, with detailed-balance derivations, geometric ergodicity, and CLTs.
  • Brooks, Gelman, Jones, Meng (eds.), Handbook of Markov Chain Monte Carlo (2011). Survey volume covering MH, Gibbs, HMC, slice sampling, parallel tempering, sequential Monte Carlo. Chapter 1 (Geyer, "Introduction") and Chapter 5 (Neal, "MCMC Using Hamiltonian Dynamics") are the must-read entries.
  • Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, Bayesian Data Analysis (3rd ed., 2013), Chapters 11-12.

Theory:

  • Meyn & Tweedie, Markov Chains and Stochastic Stability (2nd ed., 2009), Chapters 13-17. Drift conditions, geometric ergodicity, and the general-state-space ergodic theorem.
  • Roberts & Rosenthal, "General state space Markov chains and MCMC algorithms," Probability Surveys 1 (2004), 20-71. Concise survey of ergodicity conditions tailored to MCMC.
  • Levin, Peres, Wilmer, Markov Chains and Mixing Times (2nd ed., 2017). Spectral-gap and coupling techniques, mostly for finite chains but with continuous-state extensions.

Original papers:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics 21 (1953), 1087-1092. The original Metropolis algorithm.
  • Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika 57 (1970), 97-109. Generalization to non-symmetric proposals.
  • Geman & Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE TPAMI 6 (1984), 721-741. Origin of the modern Gibbs sampler.

Practical:

  • Vehtari, Gelman, Simpson, Carpenter, Bürkner, "Rank-normalization, folding, and localization: An improved $\hat R$ for assessing convergence of MCMC" (2021, arXiv:1903.08008). Modern $\hat R$ definition used by Stan and PyMC.

Next Topics

  • Metropolis-Hastings: the foundational MCMC construction. Proposal distributions, acceptance ratios, and detailed-balance verification in full.
  • Gibbs Sampling: full-conditional updates, special case of MH with acceptance 1, conjugate models, and blocking.
  • Hamiltonian Monte Carlo: gradient-based MCMC with $O(d^{5/4})$ scaling. Leapfrog, NUTS, and divergences.
  • Burn-In and Convergence Diagnostics: $\hat R$, ESS, trace plots, and the practical workflow for trusting an MCMC run.
  • Langevin Dynamics: continuous-time analog of MCMC; the link between SGD-with-noise and posterior sampling.

Last reviewed: May 6, 2026
