Metropolis-Hastings Algorithm
The foundational MCMC algorithm: construct a Markov chain whose stationary distribution is your target by accepting or rejecting proposed moves according to a carefully chosen ratio.
Why This Matters
Suppose you need to sample from a probability distribution that you can only evaluate up to a normalizing constant, which describes nearly every posterior distribution in Bayesian statistics. The Metropolis-Hastings algorithm is the tool that makes this possible. It is the foundation on which Gibbs sampling, Hamiltonian Monte Carlo, and all of modern MCMC rest.
Before MH, Bayesian inference was largely restricted to conjugate models where closed-form posteriors exist. MH broke that barrier and made Bayesian computation general-purpose.
Mental Model
You are exploring a landscape where the height at each point represents the (unnormalized) probability density of your target distribution. You stand at some point $x$. A friend proposes a new location $x'$ according to some rule. You evaluate how much more (or less) probable $x'$ is compared to $x$. If $x'$ is more probable, you always move there. If $x'$ is less probable, you move there with a probability proportional to how much less probable it is. Over time, you spend more time in high-probability regions, exactly in proportion to their probability.
Formal Setup and Notation
Let $\pi$ be the target distribution we wish to sample from. We assume we can evaluate $\pi$ up to a normalizing constant; that is, we can compute $\tilde\pi(x)$ where $\pi(x) = \tilde\pi(x)/Z$ for an unknown constant $Z > 0$.
Let $q(x' \mid x)$ be a proposal distribution: a conditional density from which we can easily draw candidates.
Proposal Distribution
The proposal distribution $q(x' \mid x)$ is a conditional density that, given the current state $x$, generates a candidate next state $x'$. The proposal must be chosen so that it is easy to sample from and (ideally) explores the target distribution efficiently. Common choices include:
- Random walk: $q(x' \mid x) = g(x' - x)$ for some symmetric density $g$; propose near the current state
- Independence sampler: $q(x' \mid x) = q(x')$; propose independently of the current state
Acceptance Ratio
The acceptance ratio (or acceptance probability) for moving from state $x$ to proposed state $x'$ is:
$$\alpha(x, x') = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
When the proposal is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$, this simplifies to the original Metropolis ratio:
$$\alpha(x, x') = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)$$
Metropolis-Hastings Algorithm
Given target $\tilde\pi$, proposal $q$, and initial state $x_0$:
1. Propose: draw $x' \sim q(\cdot \mid x_t)$
2. Compute the acceptance ratio $\alpha(x_t, x')$ using the unnormalized target and proposal densities
3. Accept/Reject: draw $u \sim \mathrm{Uniform}(0, 1)$. If $u \leq \alpha(x_t, x')$, set $x_{t+1} = x'$; otherwise set $x_{t+1} = x_t$
4. Repeat from step 1
We use $\tilde\pi$ (unnormalized) because the constant $Z$ cancels in the ratio.
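The steps above can be sketched in code. This is a minimal illustration rather than a reference implementation: the names (`metropolis_hastings`, `log_pi_tilde`) and the standard-normal test target are our own choices, and the sketch works on log densities for numerical stability.

```python
import math
import random

def metropolis_hastings(log_pi_tilde, propose, log_q, x0, n_steps, seed=0):
    """Generic MH sampler on log densities.

    log_pi_tilde: log of the unnormalized target density
    propose:      draws a candidate x' given the current state x
    log_q:        log q(x_to | x_from), the proposal density
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose(x, rng)
        # log alpha = log pi~(x') + log q(x | x') - log pi~(x) - log q(x' | x)
        log_alpha = (log_pi_tilde(x_new) + log_q(x, x_new)
                     - log_pi_tilde(x) - log_q(x_new, x))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = x_new                  # accept the proposal
        samples.append(x)              # a rejection repeats the current state
    return samples

# Sample a standard normal; the constant sqrt(2*pi) is never needed.
samples = metropolis_hastings(
    log_pi_tilde=lambda x: -0.5 * x * x,             # unnormalized log N(0, 1)
    propose=lambda x, rng: x + rng.gauss(0.0, 1.0),  # random walk proposal
    log_q=lambda x_to, x_from: 0.0,                  # symmetric, so it cancels
    x0=0.0, n_steps=20000)
```

With a symmetric proposal the `log_q` terms cancel, recovering the original Metropolis ratio.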
Why the Algorithm Works
The key insight is that MH constructs a Markov chain whose transition kernel satisfies detailed balance with respect to $\pi$. Detailed balance is a sufficient condition for $\pi$ to be the stationary distribution of the chain.
Detailed Balance
A Markov chain with transition kernel $P$ satisfies detailed balance with respect to $\pi$ if:
$$\pi(x)\, P(x, x') = \pi(x')\, P(x', x)$$
for all states $x, x'$. This says: the probability flow from $x$ to $x'$ under $\pi$ equals the flow from $x'$ to $x$. Detailed balance implies $\pi$ is stationary (integrate both sides over $x$), but is stronger: it says the chain is reversible.
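On a finite state space, detailed balance can be checked numerically. The sketch below uses a hypothetical 3-state target with a symmetric proposal (pick one of the other two states uniformly), builds the MH transition matrix, and measures the largest violation of the identity above.

```python
# Hypothetical target distribution on a 3-state space
pi = [0.2, 0.3, 0.5]
n = len(pi)

# Symmetric proposal: from state i, propose each other state with prob 1/2
q = 0.5

# Build the MH transition matrix P[i][j]
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            alpha = min(1.0, pi[j] / pi[i])   # Metropolis ratio (q symmetric)
            P[i][j] = q * alpha
    P[i][i] = 1.0 - sum(P[i])                 # rejected proposals stay at i

# Detailed balance: pi[i] P[i][j] == pi[j] P[j][i] for all pairs
max_violation = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                    for i in range(n) for j in range(n))
```

Summing the detailed balance identity over $i$ also confirms stationarity: $\sum_i \pi_i P_{ij} = \pi_j$.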
Main Theorems
MH Satisfies Detailed Balance
Statement
The Metropolis-Hastings transition kernel, which for $x' \neq x$ is
$$P(x, x') = q(x' \mid x)\, \alpha(x, x'),$$
satisfies detailed balance with respect to $\pi$:
$$\pi(x)\, q(x' \mid x)\, \alpha(x, x') = \pi(x')\, q(x \mid x')\, \alpha(x', x)$$
Therefore $\pi$ is a stationary distribution of the MH chain.
Intuition
The acceptance ratio is specifically designed so that the "excess flow" from $x$ to $x'$ is throttled back to match the flow in the reverse direction. If $\pi(x')\, q(x \mid x') > \pi(x)\, q(x' \mid x)$, then the move from $x$ to $x'$ is accepted with probability 1, and the reverse move is accepted with a probability less than 1, exactly the right amount less to balance the flows.
Proof Sketch
Without loss of generality, assume $\pi(x')\, q(x \mid x') \geq \pi(x)\, q(x' \mid x)$.
Then $\alpha(x, x') = 1$ and $\alpha(x', x) = \dfrac{\pi(x)\, q(x' \mid x)}{\pi(x')\, q(x \mid x')}$.
The left side of the detailed balance equation becomes: $\pi(x)\, q(x' \mid x)\, \alpha(x, x') = \pi(x)\, q(x' \mid x)$.
The right side becomes: $\pi(x')\, q(x \mid x')\, \alpha(x', x) = \pi(x')\, q(x \mid x') \cdot \dfrac{\pi(x)\, q(x' \mid x)}{\pi(x')\, q(x \mid x')} = \pi(x)\, q(x' \mid x)$.
The two sides are equal. The case with the reverse inequality follows identically by symmetry.
Why It Matters
This is the theoretical guarantee that MH actually samples from the correct distribution in the long run. Without detailed balance, you would have no reason to believe the chain converges to $\pi$.
Failure Mode
Detailed balance alone only guarantees that $\pi$ is a stationary distribution. For $\pi$ to be the unique stationary distribution and for the chain to converge to it from any starting point, you also need the chain to be irreducible and aperiodic, which is the content of the ergodicity theorem.
Ergodicity of the MH Chain
Statement
If the MH chain is irreducible and aperiodic, then for any initial distribution $\mu_0$:
$$\lVert \mu_0 P^t - \pi \rVert_{\mathrm{TV}} \to 0 \quad \text{as } t \to \infty$$
Moreover, for any $\pi$-integrable function $f$:
$$\frac{1}{T} \sum_{t=1}^{T} f(x_t) \to \mathbb{E}_\pi[f(X)] \quad \text{almost surely as } T \to \infty$$
Intuition
Irreducibility means the chain can reach any state from any other state. Aperiodicity means it does not get trapped in deterministic cycles. Together with detailed balance, these conditions ensure the chain "forgets" its starting point and converges to $\pi$. The ergodic theorem then says that time averages along the chain converge to expectations under $\pi$.
Proof Sketch
Detailed balance implies $\pi$-reversibility, which implies $\pi$ is stationary. Irreducibility and aperiodicity together with the existence of a stationary distribution imply convergence in total variation by the fundamental theorem of Markov chains. The ergodic theorem for Markov chains then gives the almost sure convergence of time averages.
Why It Matters
This theorem is what lets you use the output of an MH chain to compute expectations, posterior means, credible intervals, and any other quantity that can be expressed as an expectation $\mathbb{E}_\pi[f(X)]$. It is the theoretical justification for all of MCMC-based Bayesian inference.
Failure Mode
If the proposal is too narrow, the chain explores slowly and may not effectively reach all regions of high probability within a practical number of iterations. The chain is still ergodic in theory, but convergence may be so slow that your finite sample is useless. Diagnosing this requires convergence diagnostics (trace plots, $\hat{R}$, effective sample size).
Random Walk MH vs Independence Sampler
Random Walk Metropolis
In random walk MH, the proposal is centered at the current state:
$$q(x' \mid x) = g(x' - x)$$
for some symmetric density $g$. A common choice is $g = \mathcal{N}(0, \sigma^2 I)$, so that $x' = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. The acceptance ratio simplifies to $\min(1, \pi(x')/\pi(x))$ because $q$ is symmetric.
The step size $\sigma$ controls a tradeoff: too small means slow exploration; too large means most proposals are rejected. The optimal acceptance rate for random walk MH in high dimensions is approximately 0.234 (Roberts, Gelman, and Gilks, 1997).
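The step-size tradeoff is easy to see empirically. The sketch below, with an illustrative standard normal target and deliberately extreme step sizes, estimates the acceptance rate of random walk Metropolis at both ends of the tradeoff.

```python
import math
import random

def acceptance_rate(sigma, n_steps=20000, seed=1):
    """Empirical acceptance rate of random-walk MH on a standard normal target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, sigma)
        log_alpha = -0.5 * (x_new ** 2 - x ** 2)   # log pi(x') - log pi(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x, accepted = x_new, accepted + 1
    return accepted / n_steps

rate_small = acceptance_rate(0.1)    # tiny steps: almost everything accepted
rate_large = acceptance_rate(25.0)   # huge steps: almost everything rejected
```

Neither extreme explores efficiently: tiny steps accept often but move little, while huge steps mostly land in low-probability regions.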
Independence Sampler
In the independence sampler, the proposal ignores the current state:
$$q(x' \mid x) = q(x')$$
The acceptance ratio becomes
$$\alpha(x, x') = \min\left(1, \frac{\pi(x')\, q(x)}{\pi(x)\, q(x')}\right) = \min\left(1, \frac{w(x')}{w(x)}\right), \qquad w(x) = \frac{\pi(x)}{q(x)},$$
which involves the ratio of importance weights. This works well only if $q$ is a good approximation to $\pi$ with heavier tails.
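A quick sketch of an independence sampler, with an illustrative $\mathcal{N}(0, 1)$ target and a deliberately wider $\mathcal{N}(0, 2^2)$ proposal so the importance-weight ratio stays well behaved; both densities are used unnormalized, since their constants cancel in the ratio.

```python
import math
import random

rng = random.Random(2)
s = 2.0                                  # proposal std, wider than the target's
log_pi = lambda x: -0.5 * x * x          # unnormalized log N(0, 1) target
log_q = lambda x: -0.5 * (x / s) ** 2    # unnormalized log N(0, s^2) proposal

x, samples = 0.0, []
for _ in range(20000):
    x_new = rng.gauss(0.0, s)            # proposal ignores the current state
    # log of min(1, [pi(x')/q(x')] / [pi(x)/q(x)]): ratio of importance weights
    log_alpha = (log_pi(x_new) - log_q(x_new)) - (log_pi(x) - log_q(x))
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = x_new
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / len(samples)
```

Reversing the roles (a narrow proposal under a wide target) would make the weights $\pi/q$ unbounded in the tails, and the sampler could get stuck for long stretches.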
Burn-in
The chain's initial samples are influenced by the starting point and do not yet represent draws from $\pi$. The burn-in period is the initial segment of the chain that is discarded before collecting samples for inference. Choosing the burn-in length is an art informed by convergence diagnostics.
Formally, burn-in discards samples $x_1, \dots, x_B$ and uses only $x_{B+1}, \dots, x_T$ for estimation. There is no universal formula for $B$; it depends on the mixing rate of the chain.
Canonical Examples
Sampling from a mixture of Gaussians
Target: an equal-weight mixture of two unit-variance Gaussians with well-separated means $\mu_1 < \mu_2$: $\pi(x) = \tfrac{1}{2}\,\mathcal{N}(x; \mu_1, 1) + \tfrac{1}{2}\,\mathcal{N}(x; \mu_2, 1)$.
Using random walk MH with proposal $x' = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
- If $\sigma$ is much smaller than the separation $\mu_2 - \mu_1$: most proposals are accepted, but the chain moves slowly and may get stuck in one mode for long stretches
- If $\sigma$ is much larger than the separation: the chain proposes large jumps, but most are rejected because they land in low-probability regions
- If $\sigma$ is comparable to the separation: the chain efficiently explores both modes
At each step, compute $\alpha = \min(1, \pi(x')/\pi(x))$ (since the proposal is symmetric). If the chain proposes a point of equal or higher density, then $\alpha = 1$ and the move is always accepted. If it proposes a point of lower density, then $\alpha = \pi(x')/\pi(x) < 1$ and the move is accepted only that fraction of the time.
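A concrete run of this example, with illustrative modes at $\mu_1 = -3$ and $\mu_2 = 3$ (our own choice) and a step size comparable to the mode separation; a well-mixing chain should spend roughly half its time in each mode.

```python
import math
import random

MU1, MU2 = -3.0, 3.0    # illustrative, well-separated mode locations

def log_pi(x):
    """Log density (up to a constant) of 0.5*N(MU1, 1) + 0.5*N(MU2, 1)."""
    a = -0.5 * (x - MU1) ** 2
    b = -0.5 * (x - MU2) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))   # log-sum-exp

rng = random.Random(3)
sigma = 4.0             # step size comparable to the mode separation
x, samples = 0.0, []
for _ in range(50000):
    x_new = x + rng.gauss(0.0, sigma)
    log_alpha = log_pi(x_new) - log_pi(x)      # symmetric proposal
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = x_new
    samples.append(x)

# Fraction of time spent in the right-hand mode; should be near 1/2
frac_right = sum(1 for v in samples if v > 0.0) / len(samples)
```

Rerunning with `sigma = 0.2` instead makes `frac_right` drift far from 1/2 over runs of this length: the chain lingers in whichever mode it entered first.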
Bayesian inference for a normal mean
Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$. Likelihood: $y_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \dots, n$, with $\sigma^2$ known.
The posterior is:
$$\pi(\mu \mid y) \propto \exp\!\left(-\frac{(\mu - \mu_0)^2}{2\tau^2}\right) \prod_{i=1}^{n} \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)$$
This is a case where the posterior is available in closed form (it is normal), so we can verify that MH gives the right answer. Using random walk MH, the chain's samples converge to the known normal posterior, whose mean is the precision-weighted average $\dfrac{\mu_0/\tau^2 + n\bar{y}/\sigma^2}{1/\tau^2 + n/\sigma^2}$.
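A sketch of this verification, using hypothetical data and illustrative prior values (all our own choices): run random walk MH on the unnormalized log posterior and compare the chain's mean against the closed-form precision-weighted posterior mean.

```python
import math
import random

rng = random.Random(4)
mu0, tau2 = 0.0, 10.0              # prior N(mu0, tau2); illustrative values
sigma2 = 1.0                       # known observation variance
y = [2.1, 1.7, 2.5, 1.9, 2.3]      # hypothetical data
n, ybar = len(y), sum(y) / len(y)

# Closed-form posterior mean: precision-weighted combination of prior and data
post_prec = 1.0 / tau2 + n / sigma2
post_mean = (mu0 / tau2 + n * ybar / sigma2) / post_prec

def log_post(mu):
    """Log prior plus log likelihood, up to additive constants."""
    return (-0.5 * (mu - mu0) ** 2 / tau2
            - 0.5 * sum((yi - mu) ** 2 for yi in y) / sigma2)

mu, samples = 0.0, []
for t in range(30000):
    mu_new = mu + rng.gauss(0.0, 1.0)              # random walk proposal
    log_alpha = log_post(mu_new) - log_post(mu)    # symmetric proposal
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        mu = mu_new
    if t >= 2000:                                  # discard burn-in
        samples.append(mu)
mcmc_mean = sum(samples) / len(samples)
```

The MH estimate should land within Monte Carlo error of `post_mean`, confirming the sampler targets the correct posterior.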
Common Confusions
MH does NOT require knowing the normalizing constant
This is the single most important practical feature of MH. Because the acceptance ratio involves only the ratio $\pi(x')/\pi(x)$, any normalizing constant cancels:
$$\frac{\pi(x')}{\pi(x)} = \frac{\tilde\pi(x')/Z}{\tilde\pi(x)/Z} = \frac{\tilde\pi(x')}{\tilde\pi(x)}$$
You only need to evaluate the unnormalized target density. In Bayesian inference, this means you only need the prior times the likelihood; you never need to compute the evidence (marginal likelihood) $p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$.
Rejected proposals are NOT wasted
When a proposal is rejected and the chain stays at $x_t$, this is not a failure. It is part of the algorithm working correctly. Repeated copies of $x_t$ in the chain are needed to correctly represent the probability mass at that point. An acceptance rate of 100% would mean the chain is not properly weighting different regions of the state space.
MH samples are NOT independent
Consecutive samples are correlated because each depends on the previous one. This autocorrelation reduces the effective sample size. If you need approximately independent samples, you can thin the chain (keep every $k$-th sample), though it is generally more efficient to simply run the chain longer and report the effective sample size.
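The autocorrelation and a rough effective sample size can be estimated directly from a chain. The sketch below deliberately uses a smallish step size on an illustrative standard normal target so the correlation is visible; the AR(1)-style ESS formula from the lag-1 autocorrelation is a crude heuristic, not the estimator used by modern diagnostics.

```python
import math
import random

# Run a deliberately sticky chain: small steps on a standard normal target
rng = random.Random(5)
x, chain = 0.0, []
for _ in range(20000):
    x_new = x + rng.gauss(0.0, 0.5)
    log_alpha = -0.5 * (x_new ** 2 - x ** 2)
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = x_new
    chain.append(x)

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((v - m) ** 2 for v in xs) / n
    cov = sum((xs[t] - m) * (xs[t + lag] - m) for t in range(n - lag)) / n
    return cov / var

rho1 = autocorr(chain, 1)
# Crude effective sample size from the lag-1 autocorrelation (AR(1) approximation)
ess = len(chain) * (1 - rho1) / (1 + rho1)
```

Even with 20000 iterations, the effective sample size is far smaller: the chain carries much less information than the raw sample count suggests.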
Summary
- MH constructs a Markov chain with the target $\pi$ as its stationary distribution
- Acceptance ratio: $\alpha(x, x') = \min\left(1, \dfrac{\tilde\pi(x')\, q(x \mid x')}{\tilde\pi(x)\, q(x' \mid x)}\right)$
- Only the unnormalized target density is needed; normalizing constants cancel
- Detailed balance is the core theoretical guarantee
- Ergodicity (irreducibility + aperiodicity) ensures convergence from any start
- Random walk MH: optimal acceptance rate is approximately 0.234 in high dimensions
- Burn-in period must be discarded before using samples for inference
Exercises
Problem
Suppose the proposal distribution is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$ for all $x, x'$. Derive the acceptance ratio and show that it depends only on the ratio $\pi(x')/\pi(x)$.
Problem
Verify the detailed balance condition for MH directly. That is, show:
$$\pi(x)\, q(x' \mid x)\, \alpha(x, x') = \pi(x')\, q(x \mid x')\, \alpha(x', x)$$
for the general (asymmetric) case.
Problem
Consider an independence sampler with a standard normal proposal $q(x') = \mathcal{N}(x'; 0, 1)$ targeting $\pi(x) \propto e^{-|x|}$ (a Laplace distribution). Write down the acceptance ratio. Will this sampler work well? Why or why not?
References
Canonical:
- Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953), "Equation of State Calculations by Fast Computing Machines"
- Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications"
Current:
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 6-7
- Brooks, Gelman, Jones, Meng, Handbook of Markov Chain Monte Carlo (2011), Chapter 1
Next Topics
The natural next steps from Metropolis-Hastings:
- Gibbs sampling: a special case of MH where the acceptance probability is always 1
- Hamiltonian Monte Carlo: using gradient information for efficient proposals in high dimensions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Burn-in and Convergence Diagnostics (Layer 2)
- Coupling Arguments and Mixing Time (Layer 3)
- Gibbs Sampling (Layer 2)
- Hamiltonian Monte Carlo (Layer 3)
- Particle Filters (Layer 3)
- Perfect Sampling (Layer 3)
- Reversible Jump MCMC (Layer 3)
- Slice Sampling (Layer 2)