

Metropolis-Hastings Algorithm

The foundational MCMC algorithm: construct a Markov chain whose stationary distribution is your target by accepting or rejecting proposed moves according to a carefully chosen ratio.


Why This Matters

Suppose you need to sample from a probability distribution that you can only evaluate up to a normalizing constant, a situation that describes nearly every posterior distribution in Bayesian statistics. The Metropolis-Hastings algorithm is the tool that makes this possible. It is the foundation on which Gibbs sampling, Hamiltonian Monte Carlo, and all of modern MCMC rest.

Before MH, Bayesian inference was largely restricted to conjugate models where closed-form posteriors exist. MH broke that barrier and made Bayesian computation general-purpose.

Figure: a chain trajectory between Mode 1 and Mode 2, with accepted and rejected proposals marked. The chain explores high-density regions; rejections prevent low-density drift.

Mental Model

You are exploring a landscape where the height at each point represents the (unnormalized) probability density of your target distribution. You stand at some point $x$. A friend proposes a new location $x'$ according to some rule. You evaluate how much more (or less) probable $x'$ is compared to $x$. If $x'$ is more probable, you always move there. If $x'$ is less probable, you move there with a probability proportional to how much less probable it is. Over time, you spend more time in high-probability regions, exactly in proportion to their probability.

Formal Setup and Notation

Let $\pi(x)$ be the target distribution we wish to sample from. We assume we can evaluate $\pi(x)$ up to a normalizing constant; that is, we can compute $\tilde{\pi}(x)$, where $\pi(x) = \tilde{\pi}(x)/Z$ for an unknown constant $Z$.

Let $q(x' \mid x)$ be a proposal distribution: a conditional density from which we can easily draw candidates.

Definition

Proposal Distribution

The proposal distribution $q(x' \mid x)$ is a conditional density that, given the current state $x$, generates a candidate next state $x'$. The proposal must be chosen so that it is easy to sample from and (ideally) explores the target distribution efficiently. Common choices include:

  • Random walk: $q(x' \mid x) = \mathcal{N}(x, \sigma^2 I)$, which proposes near the current state
  • Independence sampler: $q(x' \mid x) = q(x')$, which proposes independently of the current state
Definition

Acceptance Ratio

The acceptance ratio (or acceptance probability) for moving from state $x$ to proposed state $x'$ is:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)

When the proposal is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$, this simplifies to the original Metropolis ratio:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')}{\pi(x)}\right)

Definition

Metropolis-Hastings Algorithm

Given target $\pi$, proposal $q$, and initial state $x_0$:

  1. Propose: Draw $x' \sim q(\cdot \mid x_t)$
  2. Compute the acceptance ratio using the unnormalized target and proposal densities
  3. Accept/Reject: Draw $u \sim \text{Uniform}(0,1)$. If $u \leq \alpha$, set $x_{t+1} = x'$; otherwise set $x_{t+1} = x_t$
  4. Repeat from step 1

We use $\tilde{\pi}$ (unnormalized) because $Z$ cancels in the ratio.
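The loop above can be sketched in a few lines of Python. A minimal random-walk sampler, assuming for illustration a standard normal target supplied as an unnormalized log-density (the function and parameter names here are made up, not part of any library):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    log_target is the *unnormalized* log-density; the normalizing
    constant cancels in the acceptance ratio, so it is never needed.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step_size)          # 1. propose
        log_alpha = log_target(x_prop) - log_target(x)  # 2. log acceptance ratio
        # 3. accept/reject; comparing in log space avoids numerical underflow
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = x_prop
        samples.append(x)  # 4. on rejection, the current state is repeated
    return samples

# Illustration: standard normal target, log pi(x) = -x^2/2 + const.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
```

Because this proposal is symmetric, the proposal densities cancel and only the target ratio enters, matching the Metropolis ratio above.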

Why the Algorithm Works

The key insight is that MH constructs a Markov chain whose transition kernel satisfies detailed balance with respect to $\pi$. Detailed balance is a sufficient condition for $\pi$ to be the stationary distribution of the chain.

Definition

Detailed Balance

A Markov chain with transition kernel $K(x \to x')$ satisfies detailed balance with respect to $\pi$ if:

\pi(x)\, K(x \to x') = \pi(x')\, K(x' \to x)

for all states $x, x'$. This says: the probability flow from $x$ to $x'$ under $\pi$ equals the flow from $x'$ to $x$. Detailed balance implies $\pi$ is stationary (integrate both sides over $x$), but it is stronger: it says the chain is reversible.
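Detailed balance is easy to verify numerically on a small discrete state space. A sketch, assuming an arbitrary three-state target and a uniform proposal (both invented for illustration):

```python
# Verify pi(x) K(x -> x') = pi(x') K(x' -> x) for a discrete MH kernel.
pi = [0.2, 0.5, 0.3]      # invented target on states {0, 1, 2}
n = len(pi)
q = 1.0 / n               # uniform (hence symmetric) proposal

def K(x, xp):
    """MH transition probability from state x to state xp."""
    if x == xp:
        # probability of staying: 1 minus all outgoing accepted flow
        return 1.0 - sum(K(x, y) for y in range(n) if y != x)
    alpha = min(1.0, pi[xp] / pi[x])  # symmetric proposal: Metropolis ratio
    return q * alpha

for x in range(n):
    for xp in range(n):
        assert abs(pi[x] * K(x, xp) - pi[xp] * K(xp, x)) < 1e-12
```

Summing the identity over $x$ also confirms stationarity: pushing pi through K returns pi itself.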

Main Theorems

Theorem

MH Satisfies Detailed Balance

Statement

The Metropolis-Hastings transition kernel

K(x \to x') = q(x' \mid x)\,\alpha(x, x') \quad \text{for } x' \neq x

satisfies detailed balance with respect to $\pi$:

\pi(x)\, q(x' \mid x)\,\alpha(x, x') = \pi(x')\, q(x \mid x')\,\alpha(x', x)

Therefore $\pi$ is a stationary distribution of the MH chain.

Intuition

The acceptance ratio is specifically designed so that the "excess flow" from $x$ to $x'$ is throttled back to match the flow in the reverse direction. If $\pi(x')\,q(x \mid x') > \pi(x)\,q(x' \mid x)$, then the move from $x$ to $x'$ is accepted with probability 1, and the reverse move is accepted with probability less than 1, exactly the right amount less to balance the flows.

Proof Sketch

Without loss of generality, assume $\pi(x)\,q(x' \mid x) \leq \pi(x')\,q(x \mid x')$.

Then $\alpha(x, x') = 1$ and $\alpha(x', x) = \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')}$.

The left side of the detailed balance equation becomes: $\pi(x)\,q(x' \mid x) \cdot 1 = \pi(x)\,q(x' \mid x)$.

The right side becomes: $\pi(x')\,q(x \mid x') \cdot \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')} = \pi(x)\,q(x' \mid x)$.

The two sides are equal. The opposite case, with the inequality reversed, follows by exchanging the roles of $x$ and $x'$.

Why It Matters

This is the theoretical guarantee that MH actually samples from the correct distribution in the long run. Without detailed balance, you would have no reason to believe the chain converges to $\pi$.

Failure Mode

Detailed balance alone only guarantees that $\pi$ is a stationary distribution. For $\pi$ to be the unique stationary distribution, and for the chain to converge to it from any starting point, you also need the chain to be irreducible and aperiodic, which is the content of the ergodicity theorem.

Theorem

Ergodicity of the MH Chain

Statement

If the MH chain is irreducible and aperiodic, then for any initial distribution $\mu_0$:

\lim_{t \to \infty} \| K^t(\mu_0, \cdot) - \pi(\cdot) \|_{\text{TV}} = 0

Moreover, for any $\pi$-integrable function $f$:

\frac{1}{T}\sum_{t=1}^{T} f(x_t) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)]

Intuition

Irreducibility means the chain can reach any state from any other state. Aperiodicity means it does not get trapped in deterministic cycles. Together with detailed balance, these conditions ensure the chain "forgets" its starting point and converges to $\pi$. The ergodic theorem then says that time averages along the chain converge to expectations under $\pi$.

Proof Sketch

Detailed balance implies $\pi$-reversibility, which implies $\pi$ is stationary. Irreducibility and aperiodicity, together with the existence of a stationary distribution, imply convergence in total variation by the fundamental theorem of Markov chains. The ergodic theorem for Markov chains then gives the almost sure convergence of time averages.

Why It Matters

This theorem is what lets you use the output of an MH chain to compute expectations, posterior means, credible intervals, and any other quantity that can be expressed as $\mathbb{E}_\pi[f(X)]$. It is the theoretical justification for all of MCMC-based Bayesian inference.

Failure Mode

If the proposal is too narrow, the chain explores slowly and may not effectively reach all regions of high probability within a practical number of iterations. The chain is still ergodic in theory, but convergence may be so slow that your finite sample is useless. Diagnosing this requires convergence diagnostics (trace plots, $\hat{R}$, effective sample size).

Random Walk MH vs Independence Sampler

Definition

Random Walk Metropolis

In random walk MH, the proposal is centered at the current state:

q(x' \mid x) = g(x' - x)

for some symmetric density $g$. A common choice is $g = \mathcal{N}(0, \sigma^2 I)$. The acceptance ratio simplifies to $\alpha = \min(1, \pi(x')/\pi(x))$ because $g$ is symmetric.

The step size $\sigma$ controls a tradeoff: too small means slow exploration, too large means most proposals are rejected. The optimal acceptance rate for random walk MH in high dimensions is approximately 0.234 (Roberts, Gelman, and Gilks, 1997).
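This tradeoff is visible directly in the acceptance rate. A sketch that tracks the accepted fraction for several step sizes, assuming a standard normal target (all names are illustrative):

```python
import math
import random

def acceptance_rate(step_size, n_steps=20000, seed=1):
    """Accepted fraction for random-walk MH on a N(0,1) target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step_size)
        log_alpha = 0.5 * x * x - 0.5 * x_prop * x_prop  # log pi(x') - log pi(x)
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = x_prop
            accepted += 1
    return accepted / n_steps

# Tiny steps are almost always accepted; huge steps are almost always rejected.
rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
```

Note that a high acceptance rate alone is not a sign of good mixing: at step size 0.1 almost every move is accepted, yet the chain crawls.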

Definition

Independence Sampler

In the independence sampler, the proposal ignores the current state:

q(x' \mid x) = q(x')

The acceptance ratio becomes $\alpha = \min(1, \pi(x')\,q(x)/(\pi(x)\,q(x')))$, which involves the ratio of importance weights. This works well only if $q$ is a good approximation to $\pi$ with heavier tails.
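A minimal independence sampler sketch, assuming a standard normal target and a wider $\mathcal{N}(0, 2^2)$ proposal so that the importance-weight ratio $\pi/q$ stays bounded (all names are illustrative):

```python
import math
import random

def independence_mh(log_target, log_proposal, draw_proposal, n_steps, seed=5):
    """Independence sampler: candidates are drawn ignoring the current state."""
    rng = random.Random(seed)
    x = draw_proposal(rng)
    out = []
    for _ in range(n_steps):
        xp = draw_proposal(rng)
        # alpha = min(1, pi(x') q(x) / (pi(x) q(x'))): a ratio of importance
        # weights w(x) = pi(x) / q(x), computed here in log space.
        log_alpha = (log_target(xp) - log_proposal(xp)) \
                    - (log_target(x) - log_proposal(x))
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = xp
        out.append(x)
    return out

# Target N(0,1); proposal N(0, 2^2) is wider, so pi/q is bounded.
chain = independence_mh(
    log_target=lambda x: -0.5 * x * x,
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    draw_proposal=lambda rng: rng.gauss(0.0, 2.0),
    n_steps=20000,
)
```

Normalizing constants of both $\pi$ and $q$ cancel in the ratio, so unnormalized log-densities suffice here too.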

Burn-in

The chain's initial samples are influenced by the starting point $x_0$ and do not yet represent draws from $\pi$. The burn-in period is the initial segment of the chain that is discarded before collecting samples for inference. Choosing the burn-in length is an art informed by convergence diagnostics.

Formally, burn-in discards samples $x_1, \ldots, x_B$ and uses only $x_{B+1}, \ldots, x_T$ for estimation. There is no universal formula for $B$; it depends on the mixing rate of the chain.
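The effect of burn-in can be seen by starting a chain far from the target's mass. A sketch, assuming a standard normal target and a deliberately bad starting point (all values are invented for the demonstration):

```python
import math
import random

def rw_mh(log_target, x0, n_steps, step=1.0, seed=2):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n_steps):
        xp = x + rng.gauss(0.0, step)
        log_alpha = log_target(xp) - log_target(x)
        if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
            x = xp
        out.append(x)
    return out

# N(0,1) target, but the chain starts at x0 = 50, far from any mass.
chain = rw_mh(lambda x: -0.5 * x * x, x0=50.0, n_steps=20000)

B = 1000            # burn-in length, picked generously by eye here
kept = chain[B:]    # discard x_1..x_B, keep x_{B+1}..x_T
```

The early samples drift in from 50 and would bias any estimate; after discarding them, the retained mean is close to the target mean of 0.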

Canonical Examples

Example

Sampling from a mixture of Gaussians

Target: $\pi(x) = 0.3\,\mathcal{N}(-3, 1) + 0.7\,\mathcal{N}(3, 1)$.

Using random walk MH with proposal $q(x' \mid x) = \mathcal{N}(x, \sigma^2)$:

  • If $\sigma = 0.1$: most proposals are accepted but the chain moves slowly and may get stuck in one mode for long stretches
  • If $\sigma = 10$: the chain proposes large jumps but most are rejected because they land in low-probability regions
  • If $\sigma \approx 2$: the chain efficiently explores both modes

At each step, compute $\alpha = \min(1, \pi(x')/\pi(x))$ (since the proposal is symmetric). If the chain is at $x = -3$ and proposes $x' = 3$, then $\alpha = \min(1, 0.7/0.3) = 1$: the move is always accepted. If the chain is at $x = 3$ and proposes $x' = -3$, then $\alpha = \min(1, 0.3/0.7) \approx 0.43$.
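The two acceptance probabilities quoted above can be checked numerically. A sketch evaluating the mixture density at the two modes (function names are made up):

```python
import math

def phi(x, mu, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def target(x):
    """Mixture target 0.3 N(-3, 1) + 0.7 N(3, 1)."""
    return 0.3 * phi(x, -3.0) + 0.7 * phi(x, 3.0)

# Symmetric proposal, so alpha = min(1, pi(x') / pi(x)).
alpha_up = min(1.0, target(3.0) / target(-3.0))    # move -3 -> 3
alpha_down = min(1.0, target(-3.0) / target(3.0))  # move 3 -> -3
```

At $\pm 3$ the other mixture component contributes almost nothing, so the ratio is essentially the ratio of the weights, 0.7/0.3.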

Example

Bayesian inference for a normal mean

Prior: $\mu \sim \mathcal{N}(0, \tau^2)$. Likelihood: $x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, n$.

The posterior is:

\pi(\mu \mid x) \propto \exp\!\left(-\frac{\mu^2}{2\tau^2} - \frac{\sum (x_i - \mu)^2}{2\sigma^2}\right)

This is a case where the posterior is available in closed form (it is normal), so we can verify that MH gives the right answer. Using random walk MH, the chain samples converge to the known posterior normal distribution with a precision-weighted mean.
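This check can be carried out in a few lines. A sketch, assuming illustrative values $\tau^2 = 4$, $\sigma^2 = 1$, and synthetic data (everything here is invented for the demonstration); the MH estimate of the posterior mean should agree with the closed-form precision-weighted mean:

```python
import math
import random

rng = random.Random(3)
tau2, sigma2 = 4.0, 1.0                          # prior and noise variances
data = [rng.gauss(2.0, math.sqrt(sigma2)) for _ in range(50)]
n, s, ss = len(data), sum(data), sum(x * x for x in data)

# Closed-form conjugate posterior: precision-weighted mean (prior mean 0).
post_prec = n / sigma2 + 1.0 / tau2
post_mean = (s / sigma2) / post_prec

def log_post(mu):
    """Unnormalized log posterior: log prior + log likelihood,
    using sum (x_i - mu)^2 = ss - 2 mu s + n mu^2."""
    return -mu * mu / (2.0 * tau2) - (ss - 2.0 * mu * s + n * mu * mu) / (2.0 * sigma2)

# Random-walk MH over mu.
mu, chain = 0.0, []
for _ in range(30000):
    prop = mu + rng.gauss(0.0, 0.5)
    log_alpha = log_post(prop) - log_post(mu)
    if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
        mu = prop
    chain.append(mu)

mh_mean = sum(chain[2000:]) / len(chain[2000:])  # discard burn-in
```

With 50 observations the posterior mean sits very close to the sample mean, only slightly shrunk toward the prior mean of 0.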

Common Confusions

Watch Out

MH does NOT require knowing the normalizing constant

This is the single most important practical feature of MH. Because the acceptance ratio involves $\pi(x')/\pi(x)$, any normalizing constant $Z$ cancels:

\frac{\pi(x')}{\pi(x)} = \frac{\tilde{\pi}(x')/Z}{\tilde{\pi}(x)/Z} = \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}

You only need to evaluate the unnormalized target density. In Bayesian inference, this means you only need the prior times the likelihood; you never need to compute the evidence (marginal likelihood) $p(\text{data})$.
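The cancellation is worth seeing concretely. A tiny sketch, assuming an arbitrary stand-in constant $Z$ (its value is deliberately meaningless, since any constant cancels):

```python
import math

Z = 123.456  # stand-in for an unknown normalizing constant (any value cancels)

def unnorm(x):
    """Unnormalized density tilde-pi(x) = exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)

def norm(x):
    """The 'true' normalized density pi(x) = tilde-pi(x) / Z."""
    return unnorm(x) / Z

x, xp = 0.5, 1.5
ratio_unnorm = unnorm(xp) / unnorm(x)  # what MH actually computes
ratio_norm = norm(xp) / norm(x)        # what MH is 'supposed' to compute
```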

Watch Out

Rejected proposals are NOT wasted

When a proposal is rejected and the chain stays at $x_t$, this is not a failure. It is part of the algorithm working correctly. Repeated copies of $x_t$ in the chain are needed to correctly represent the probability mass at that point. An acceptance rate of 100% would mean the chain is not properly weighting different regions of the state space.

Watch Out

MH samples are NOT independent

Consecutive samples $x_t, x_{t+1}$ are correlated because each depends on the previous. This autocorrelation reduces the effective sample size. If you need approximately independent samples, you can thin the chain (keep every $k$-th sample), though it is generally more efficient to simply run the chain longer and report the effective sample size.
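Autocorrelation can be measured directly from the chain. A sketch, assuming a deliberately small step size so the correlation is pronounced (names illustrative):

```python
import math
import random

def autocorr(xs, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag)) / n
    return cov / var

# Deliberately sticky chain: tiny steps on a N(0,1) target.
rng = random.Random(4)
x, chain = 0.0, []
for _ in range(5000):
    xp = x + rng.gauss(0.0, 0.2)
    log_alpha = 0.5 * x * x - 0.5 * xp * xp
    if log_alpha >= 0 or math.log(rng.random()) <= log_alpha:
        x = xp
    chain.append(x)

rho1 = autocorr(chain, 1)    # consecutive samples: strongly correlated
rho50 = autocorr(chain, 50)  # correlation decays as the lag grows
```

The decay of the autocorrelation with lag is what effective-sample-size estimators summarize into a single number.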

Summary

  • MH constructs a Markov chain with target $\pi$ as its stationary distribution
  • Acceptance ratio: $\alpha = \min(1, \pi(x')\,q(x \mid x')/(\pi(x)\,q(x' \mid x)))$
  • Only the unnormalized target density is needed; normalizing constants cancel
  • Detailed balance is the core theoretical guarantee
  • Ergodicity (irreducibility + aperiodicity) ensures convergence from any start
  • Random walk MH: optimal acceptance rate $\approx 0.234$ in high dimensions
  • Burn-in period must be discarded before using samples for inference

Exercises

Exercise (Core)

Problem

Suppose the proposal distribution is symmetric, i.e., $q(x' \mid x) = q(x \mid x')$ for all $x, x'$. Derive the acceptance ratio and show that it depends only on the ratio $\pi(x')/\pi(x)$.

Exercise (Core)

Problem

Verify the detailed balance condition for MH directly. That is, show:

\pi(x)\,q(x' \mid x)\,\alpha(x, x') = \pi(x')\,q(x \mid x')\,\alpha(x', x)

for the general (asymmetric) case.

Exercise (Advanced)

Problem

Consider an independence sampler with proposal $q(x') = \mathcal{N}(0, 1)$ targeting $\pi(x) \propto e^{-|x|}$ (a Laplace distribution). Write down the acceptance ratio. Will this sampler work well? Why or why not?

References

Canonical:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953), "Equation of State Calculations by Fast Computing Machines"
  • Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications"

Current:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 6-7
  • Brooks, Gelman, Jones, Meng, Handbook of Markov Chain Monte Carlo (2011), Chapter 1

Next Topics

The natural next steps from Metropolis-Hastings:

  • Gibbs sampling: a special case of MH where the acceptance rate is always 1
  • Hamiltonian Monte Carlo: using gradient information for efficient proposals in high dimensions

Last reviewed: April 2026
