

Metropolis-Hastings Algorithm

The foundational MCMC algorithm: construct a Markov chain whose stationary distribution is your target by accepting or rejecting proposed moves according to a carefully chosen ratio.


Why This Matters

Suppose you need to sample from a probability distribution that you can only evaluate up to a normalizing constant, a situation that describes nearly every posterior distribution in Bayesian statistics. The Metropolis-Hastings algorithm is the tool that makes this possible. It is the foundation on which Gibbs sampling, Hamiltonian Monte Carlo, and all of modern MCMC rest.

Before MH, Bayesian inference was largely restricted to conjugate models where closed-form posteriors exist. MH broke that barrier and made Bayesian computation general-purpose.

Figure: a chain trajectory between Mode 1 and Mode 2, with accepted and rejected proposals marked. The chain explores high-density regions; rejections prevent low-density drift.

Mental Model

You are exploring a landscape where the height at each point represents the (unnormalized) probability density of your target distribution. You stand at some point $x$. A friend proposes a new location $x'$ according to some rule. You evaluate how much more (or less) probable $x'$ is compared to $x$. If $x'$ is more probable, you always move there. If $x'$ is less probable, you move there with a probability proportional to how much less probable it is. Over time, you spend more time in high-probability regions, exactly in proportion to their probability.

Formal Setup and Notation

Let $\pi(x)$ be the target distribution we wish to sample from. We assume we can evaluate $\pi(x)$ up to a normalizing constant; that is, we can compute $\tilde{\pi}(x)$, where $\pi(x) = \tilde{\pi}(x)/Z$ for an unknown constant $Z$.

Let $q(x' \mid x)$ be a proposal distribution: a conditional density from which we can easily draw candidates.

Definition

Proposal Distribution

The proposal distribution $q(x' \mid x)$ is a conditional density that, given the current state $x$, generates a candidate next state $x'$. The proposal must be chosen so that it is easy to sample from and (ideally) explores the target distribution efficiently. Common choices include:

  • Random walk: $q(x' \mid x) = \mathcal{N}(x, \sigma^2 I)$, which proposes near the current state
  • Independence sampler: $q(x' \mid x) = q(x')$, which proposes independently of the current state
Definition

Acceptance Ratio

The acceptance ratio (or acceptance probability) for moving from state $x$ to proposed state $x'$ is:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)

When the proposal is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$, this simplifies to the original Metropolis ratio:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')}{\pi(x)}\right)

Definition

Metropolis-Hastings Algorithm

Given target $\pi$, proposal $q$, and initial state $x_0$:

  1. Propose: Draw $x' \sim q(\cdot \mid x_t)$
  2. Compute the acceptance ratio using the unnormalized target and proposal densities
  3. Accept/Reject: Draw $u \sim \text{Uniform}(0,1)$. If $u \leq \alpha$, set $x_{t+1} = x'$; otherwise set $x_{t+1} = x_t$
  4. Repeat from step 1

We use $\tilde{\pi}$ (unnormalized) because $Z$ cancels in the ratio.
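The loop above can be sketched in a few lines of Python. A minimal random-walk sampler, assuming for illustration a standard normal target supplied as an unnormalized log-density (the function and parameter names here are made up, not part of any library):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    log_target is the *unnormalized* log-density; the normalizing
    constant cancels in the acceptance ratio, so it is never needed.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step_size)          # 1. propose
        log_alpha = log_target(x_prop) - log_target(x)  # 2. log acceptance ratio
        # 3. accept/reject; comparing in log space avoids numerical underflow
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = x_prop
        samples.append(x)  # 4. on rejection, the current state is repeated
    return samples

# Illustration: standard normal target, log pi(x) = -x^2/2 + const.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
```

Because this proposal is symmetric, the proposal densities cancel and only the target ratio enters, matching the Metropolis ratio above.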

Why the Algorithm Works

The key insight is that MH constructs a Markov chain whose transition kernel satisfies detailed balance with respect to $\pi$. Detailed balance is a sufficient condition for $\pi$ to be the stationary distribution of the chain.

Definition

Detailed Balance

A Markov chain with transition kernel $K(x \to x')$ satisfies detailed balance with respect to $\pi$ if:

\pi(x)\, K(x \to x') = \pi(x')\, K(x' \to x)

for all states $x, x'$. This says: the probability flow from $x$ to $x'$ under $\pi$ equals the flow from $x'$ to $x$. Detailed balance implies $\pi$ is stationary (integrate both sides over $x$), but it is stronger: it says the chain is reversible.
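Detailed balance is easy to verify numerically on a small discrete state space. A sketch, assuming an arbitrary three-state target and a uniform proposal (both invented for illustration):

```python
# Verify pi(x) K(x -> x') = pi(x') K(x' -> x) for a discrete MH kernel.
pi = [0.2, 0.5, 0.3]      # invented target on states {0, 1, 2}
n = len(pi)
q = 1.0 / n               # uniform (hence symmetric) proposal

def K(x, xp):
    """MH transition probability from state x to state xp."""
    if x == xp:
        # probability of staying: 1 minus all outgoing accepted flow
        return 1.0 - sum(K(x, y) for y in range(n) if y != x)
    alpha = min(1.0, pi[xp] / pi[x])  # symmetric proposal: Metropolis ratio
    return q * alpha

for x in range(n):
    for xp in range(n):
        assert abs(pi[x] * K(x, xp) - pi[xp] * K(xp, x)) < 1e-12
```

Summing the identity over $x$ also confirms stationarity: pushing pi through K returns pi itself.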

Main Theorems

Theorem

MH Satisfies Detailed Balance

Statement

The Metropolis-Hastings transition kernel

K(x \to x') = q(x' \mid x)\,\alpha(x, x') \quad \text{for } x' \neq x

satisfies detailed balance with respect to $\pi$:

\pi(x)\, q(x' \mid x)\,\alpha(x, x') = \pi(x')\, q(x \mid x')\,\alpha(x', x)

Therefore $\pi$ is a stationary distribution of the MH chain.

Intuition

The acceptance ratio is specifically designed so that the "excess flow" from $x$ to $x'$ is throttled back to match the flow in the reverse direction. If $\pi(x')\,q(x \mid x') > \pi(x)\,q(x' \mid x)$, then the move from $x$ to $x'$ is accepted with probability 1, and the reverse move is accepted with probability less than 1, exactly the right amount less to balance the flows.

Proof Sketch

Without loss of generality, assume $\pi(x)\,q(x' \mid x) \leq \pi(x')\,q(x \mid x')$.

Then $\alpha(x, x') = 1$ and $\alpha(x', x) = \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')}$.

The left side of the detailed balance equation becomes: $\pi(x)\,q(x' \mid x) \cdot 1 = \pi(x)\,q(x' \mid x)$.

The right side becomes: $\pi(x')\,q(x \mid x') \cdot \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')} = \pi(x)\,q(x' \mid x)$.

The two sides are equal. The opposite case, with the inequality reversed, follows by exchanging the roles of $x$ and $x'$.

Why It Matters

This is the theoretical guarantee that MH actually samples from the correct distribution in the long run. Without detailed balance, you would have no reason to believe the chain converges to $\pi$.

Failure Mode

Detailed balance alone only guarantees that $\pi$ is a stationary distribution. For $\pi$ to be the unique stationary distribution, and for the chain to converge to it from any starting point, you also need the chain to be irreducible and aperiodic, which is the content of the ergodicity theorem.

Theorem

Ergodicity of the MH Chain

Statement

If the MH chain is irreducible and aperiodic, then for any initial distribution $\mu_0$:

\lim_{t \to \infty} \| K^t(\mu_0, \cdot) - \pi(\cdot) \|_{\text{TV}} = 0

Moreover, for any $\pi$-integrable function $f$:

\frac{1}{T}\sum_{t=1}^{T} f(x_t) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)]

Intuition

Irreducibility means the chain can reach any state from any other state. Aperiodicity means it does not get trapped in deterministic cycles. Together with detailed balance, these conditions ensure the chain "forgets" its starting point and converges to $\pi$. The ergodic theorem then says that time averages along the chain converge to expectations under $\pi$.

Proof Sketch

Detailed balance implies $\pi$-reversibility, which implies $\pi$ is stationary. Irreducibility and aperiodicity, together with the existence of a stationary distribution, imply convergence in total variation by the fundamental theorem of Markov chains. The ergodic theorem for Markov chains then gives the almost sure convergence of time averages.

Why It Matters

This theorem is what lets you use the output of an MH chain to compute expectations, posterior means, credible intervals, and any other quantity that can be expressed as $\mathbb{E}_\pi[f(X)]$. It is the theoretical justification for all of MCMC-based Bayesian inference.

Failure Mode

If the proposal is too narrow, the chain explores slowly and may not effectively reach all regions of high probability within a practical number of iterations. The chain is still ergodic in theory, but convergence may be so slow that your finite sample is useless. Diagnosing this requires convergence diagnostics (trace plots, $\hat{R}$, effective sample size).

Random Walk MH vs Independence Sampler

Definition

Random Walk Metropolis

In random walk MH, the proposal is centered at the current state:

q(x' \mid x) = g(x' - x)

for some symmetric density $g$. A common choice is $g = \mathcal{N}(0, \sigma^2 I)$. The acceptance ratio simplifies to $\alpha = \min(1, \pi(x')/\pi(x))$ because $g$ is symmetric.

The step size $\sigma$ controls a tradeoff: too small means slow exploration, too large means most proposals are rejected. The optimal acceptance rate for random walk MH in high dimensions is approximately 0.234 (Roberts, Gelman, and Gilks, 1997).
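This tradeoff is visible directly in the acceptance rate. A sketch that tracks the accepted fraction for several step sizes, assuming a standard normal target (all names are illustrative):

```python
import math
import random

def acceptance_rate(step_size, n_steps=20000, seed=1):
    """Accepted fraction for random-walk MH on a N(0,1) target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step_size)
        log_alpha = 0.5 * x * x - 0.5 * x_prop * x_prop  # log pi(x') - log pi(x)
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = x_prop
            accepted += 1
    return accepted / n_steps

# Tiny steps are almost always accepted; huge steps are almost always rejected.
rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
```

Note that a high acceptance rate alone is not a sign of good mixing: at step size 0.1 almost every move is accepted, yet the chain crawls.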

Definition

Independence Sampler

In the independence sampler, the proposal ignores the current state:

q(x' \mid x) = q(x')

The acceptance ratio becomes $\alpha = \min(1, \pi(x')\,q(x)/(\pi(x)\,q(x')))$, which involves the ratio of importance weights. This works well only if $q$ is a good approximation to $\pi$ with heavier tails.
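A minimal independence sampler sketch, assuming a standard normal target and a wider $\mathcal{N}(0, 2^2)$ proposal so that the importance-weight ratio $\pi/q$ stays bounded (all names are illustrative):

```python
import math
import random

def independence_mh(log_target, log_proposal, draw_proposal, n_steps, seed=5):
    """Independence sampler: candidates are drawn ignoring the current state."""
    rng = random.Random(seed)
    x = draw_proposal(rng)
    out = []
    for _ in range(n_steps):
        xp = draw_proposal(rng)
        # alpha = min(1, pi(x') q(x) / (pi(x) q(x'))): a ratio of importance
        # weights w(x) = pi(x) / q(x), computed here in log space.
        log_alpha = (log_target(xp) - log_proposal(xp)) \
                    - (log_target(x) - log_proposal(x))
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = xp
        out.append(x)
    return out

# Target N(0,1); proposal N(0, 2^2) is wider, so pi/q is bounded.
chain = independence_mh(
    log_target=lambda x: -0.5 * x * x,
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    draw_proposal=lambda rng: rng.gauss(0.0, 2.0),
    n_steps=20000,
)
```

Normalizing constants of both $\pi$ and $q$ cancel in the ratio, so unnormalized log-densities suffice here too.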

Burn-in

The chain's initial samples are influenced by the starting point $x_0$ and do not yet represent draws from $\pi$. The burn-in period is the initial segment of the chain that is discarded before collecting samples for inference. Choosing the burn-in length is an art informed by convergence diagnostics.

Formally, burn-in discards samples $x_1, \ldots, x_B$ and uses only $x_{B+1}, \ldots, x_T$ for estimation. There is no universal formula for $B$; it depends on the mixing rate of the chain.
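The effect of burn-in can be seen by starting a chain far from the target's mass. A sketch, assuming a standard normal target and a deliberately bad starting point (all values are invented for the demonstration):

```python
import math
import random

def rw_mh(log_target, x0, n_steps, step=1.0, seed=2):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n_steps):
        xp = x + rng.gauss(0.0, step)
        log_alpha = log_target(xp) - log_target(x)
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = xp
        out.append(x)
    return out

# N(0,1) target, but the chain starts at x0 = 50, far from any mass.
chain = rw_mh(lambda x: -0.5 * x * x, x0=50.0, n_steps=20000)

B = 1000            # burn-in length, picked generously by eye here
kept = chain[B:]    # discard x_1..x_B, keep x_{B+1}..x_T
```

The early samples drift in from 50 and would bias any estimate; after discarding them, the retained mean is close to the target mean of 0.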

Canonical Examples

Example

Sampling from a mixture of Gaussians

Target: $\pi(x) = 0.3\,\mathcal{N}(-3, 1) + 0.7\,\mathcal{N}(3, 1)$.

Using random walk MH with proposal $q(x' \mid x) = \mathcal{N}(x, \sigma^2)$:

  • If $\sigma = 0.1$: most proposals are accepted but the chain moves slowly and may get stuck in one mode for long stretches
  • If $\sigma = 10$: the chain proposes large jumps but most are rejected because they land in low-probability regions
  • If $\sigma \approx 2$: the chain efficiently explores both modes

At each step, compute $\alpha = \min(1, \pi(x')/\pi(x))$ (since the proposal is symmetric). If the chain is at $x = -3$ and proposes $x' = 3$, then $\alpha = \min(1, 0.7/0.3) = 1$: the move is always accepted. If the chain is at $x = 3$ and proposes $x' = -3$, then $\alpha = \min(1, 0.3/0.7) \approx 0.43$.
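The two acceptance probabilities quoted above can be checked numerically. A sketch evaluating the mixture density at the two modes (function names are made up):

```python
import math

def phi(x, mu, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def target(x):
    """Mixture target 0.3 N(-3, 1) + 0.7 N(3, 1)."""
    return 0.3 * phi(x, -3.0) + 0.7 * phi(x, 3.0)

# Symmetric proposal, so alpha = min(1, pi(x') / pi(x)).
alpha_up = min(1.0, target(3.0) / target(-3.0))    # move -3 -> 3
alpha_down = min(1.0, target(-3.0) / target(3.0))  # move 3 -> -3
```

At $\pm 3$ the other mixture component contributes almost nothing, so the ratio is essentially the ratio of the weights, 0.7/0.3.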

Example

Bayesian inference for a normal mean

Prior: $\mu \sim \mathcal{N}(0, \tau^2)$. Likelihood: $x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, n$.

The posterior is:

\pi(\mu \mid x) \propto \exp\!\left(-\frac{\mu^2}{2\tau^2} - \frac{\sum (x_i - \mu)^2}{2\sigma^2}\right)

This is a case where the posterior is available in closed form (it is normal), so we can verify that MH gives the right answer. Using random walk MH, the chain samples converge to the known posterior normal distribution with a precision-weighted mean.
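This check can be carried out in a few lines. A sketch, assuming illustrative values $\tau^2 = 4$, $\sigma^2 = 1$, and synthetic data (everything here is invented for the demonstration); the MH estimate of the posterior mean should agree with the closed-form precision-weighted mean:

```python
import math
import random

rng = random.Random(3)
tau2, sigma2 = 4.0, 1.0                          # prior and noise variances
data = [rng.gauss(2.0, math.sqrt(sigma2)) for _ in range(50)]
n, s, ss = len(data), sum(data), sum(x * x for x in data)

# Closed-form conjugate posterior: precision-weighted mean (prior mean 0).
post_prec = n / sigma2 + 1.0 / tau2
post_mean = (s / sigma2) / post_prec

def log_post(mu):
    """Unnormalized log posterior: log prior + log likelihood,
    using sum (x_i - mu)^2 = ss - 2 mu s + n mu^2."""
    return -mu * mu / (2.0 * tau2) - (ss - 2.0 * mu * s + n * mu * mu) / (2.0 * sigma2)

# Random-walk MH over mu.
mu, chain = 0.0, []
for _ in range(30000):
    prop = mu + rng.gauss(0.0, 0.5)
    log_alpha = log_post(prop) - log_post(mu)
    if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
        mu = prop
    chain.append(mu)

mh_mean = sum(chain[2000:]) / len(chain[2000:])  # discard burn-in
```

With 50 observations the posterior mean sits very close to the sample mean, only slightly shrunk toward the prior mean of 0.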

Common Confusions

Watch Out

MH does NOT require knowing the normalizing constant

This is the single most important practical feature of MH. Because the acceptance ratio involves $\pi(x')/\pi(x)$, any normalizing constant $Z$ cancels:

\frac{\pi(x')}{\pi(x)} = \frac{\tilde{\pi}(x')/Z}{\tilde{\pi}(x)/Z} = \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}

You only need to evaluate the unnormalized target density. In Bayesian inference, this means you only need the prior times the likelihood; you never need to compute the evidence (marginal likelihood) $p(\text{data})$.
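The cancellation is worth seeing concretely. A tiny sketch, assuming an arbitrary stand-in constant $Z$ (its value is deliberately meaningless, since any constant cancels):

```python
import math

Z = 123.456  # stand-in for an unknown normalizing constant (any value cancels)

def unnorm(x):
    """Unnormalized density tilde-pi(x) = exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)

def norm(x):
    """The 'true' normalized density pi(x) = tilde-pi(x) / Z."""
    return unnorm(x) / Z

x, xp = 0.5, 1.5
ratio_unnorm = unnorm(xp) / unnorm(x)  # what MH actually computes
ratio_norm = norm(xp) / norm(x)        # what MH is 'supposed' to compute
```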

Watch Out

Rejected proposals are NOT wasted

When a proposal is rejected and the chain stays at $x_t$, this is not a failure. It is part of the algorithm working correctly. Repeated copies of $x_t$ in the chain are needed to correctly represent the probability mass at that point. An acceptance rate of 100% would mean the chain is not properly weighting different regions of the state space.

Watch Out

MH samples are NOT independent

Consecutive samples $x_t, x_{t+1}$ are correlated because each depends on the previous. This autocorrelation reduces the effective sample size. If you need approximately independent samples, you can thin the chain (keep every $k$-th sample), though it is generally more efficient to simply run the chain longer and report the effective sample size.
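Autocorrelation can be measured directly from the chain. A sketch, assuming a deliberately small step size so the correlation is pronounced (names illustrative):

```python
import math
import random

def autocorr(xs, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag)) / n
    return cov / var

# Deliberately sticky chain: tiny steps on a N(0,1) target.
rng = random.Random(4)
x, chain = 0.0, []
for _ in range(5000):
    xp = x + rng.gauss(0.0, 0.2)
    log_alpha = 0.5 * x * x - 0.5 * xp * xp
    if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
        x = xp
    chain.append(x)

rho1 = autocorr(chain, 1)    # consecutive samples: strongly correlated
rho50 = autocorr(chain, 50)  # correlation decays as the lag grows
```

The decay of the autocorrelation with lag is what effective-sample-size estimators summarize into a single number.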

Summary

  • MH constructs a Markov chain with target $\pi$ as its stationary distribution
  • Acceptance ratio: $\alpha = \min(1, \pi(x')\,q(x \mid x')/(\pi(x)\,q(x' \mid x)))$
  • Only the unnormalized target density is needed; normalizing constants cancel
  • Detailed balance is the core theoretical guarantee
  • Ergodicity (irreducibility + aperiodicity) ensures convergence from any start
  • Random walk MH: optimal acceptance rate $\approx 0.234$ in high dimensions
  • Burn-in period must be discarded before using samples for inference

Exercises

Exercise (Core)

Problem

Suppose the proposal distribution is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$ for all $x, x'$. Derive the acceptance ratio and show that it depends only on the ratio $\pi(x')/\pi(x)$.

Exercise (Core)

Problem

Verify the detailed balance condition for MH directly. That is, show:

\pi(x)\,q(x' \mid x)\,\alpha(x, x') = \pi(x')\,q(x \mid x')\,\alpha(x', x)

for the general (asymmetric) case.

Exercise (Advanced)

Problem

Consider an independence sampler with proposal $q(x') = \mathcal{N}(0, 1)$ targeting $\pi(x) \propto e^{-|x|}$ (a Laplace distribution). Write down the acceptance ratio. Will this sampler work well? Why or why not?

References

Canonical:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953), "Equation of State Calculations by Fast Computing Machines"
  • Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications"

Current:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 6-7
  • Brooks, Gelman, Jones, Meng, Handbook of Markov Chain Monte Carlo (2011), Chapter 1

Next Topics

The natural next steps from Metropolis-Hastings:

  • Gibbs sampling: a special case of MH where the acceptance rate is always 1
  • Hamiltonian Monte Carlo: using gradient information for efficient proposals in high dimensions

Last reviewed: April 2026
