Importance Sampling

Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.

Why This Matters

[Figure: importance sampling schematic. Sample from $q(x)$, reweight by $p(x)/q(x)$. Samples are drawn with size proportional to the weight $p/q$; weights are largest where $p$ is large and $q$ is small.]

Importance sampling is one of the most fundamental ideas in computational statistics. It underlies particle filtering (sequential Monte Carlo), off-policy evaluation in reinforcement learning, rare event simulation, and variational inference diagnostics (PSIS-LOO). Understanding importance sampling, and especially when and why it fails, is essential for anyone working with Monte Carlo methods.

The core idea is simple: if you cannot sample from $p$, sample from something else $q$ and correct for the mismatch using weights. But this simplicity is deceptive. In high dimensions, importance sampling can fail spectacularly, and understanding why is a gateway to understanding the curse of dimensionality.

Mental Model

You want to compute the average height of people in Country A, but you can only survey people in Country B. If you know the ratio of population densities (how much more or less likely each "type" of person is in A vs B), you can reweight each measurement: people who are over-represented in B get lower weight, and people who are under-represented get higher weight. The reweighted average estimates the Country A average.

This is importance sampling: sample from $q$ (Country B), reweight by $p/q$ (the density ratio), and compute weighted averages.

Formal Setup and Notation

Let $p(x)$ be the target distribution and $f(x)$ a function whose expectation we wish to compute:

$$I = \mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$$

We cannot sample from $p$ directly (or it is inefficient to do so), but we can sample from a proposal distribution $q(x)$.

Definition

Importance Sampling Estimator

The importance sampling estimator of $\mathbb{E}_p[f(X)]$ is:

$$\hat{I}_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, w(x_i), \quad x_i \sim q$$

where the importance weights are:

$$w(x) = \frac{p(x)}{q(x)}$$

This works because $\mathbb{E}_q[f(X)\,w(X)] = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx = \int f(x)\,p(x)\,dx = \mathbb{E}_p[f(X)]$.
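A minimal NumPy sketch of this identity. The target $\mathcal{N}(2,1)$, proposal $\mathcal{N}(0,4)$, and test function $f(x) = x^2$ are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: target p = N(2, 1), proposal q = N(0, 2^2),
# test function f(x) = x^2, so E_p[f(X)] = Var + mean^2 = 1 + 4 = 5.
def log_p(x):
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x):
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)

N = 100_000
x = rng.normal(0.0, 2.0, size=N)        # x_i ~ q
w = np.exp(log_p(x) - log_q(x))         # importance weights w = p/q
estimate = np.mean(x**2 * w)            # unbiased IS estimate of E_p[X^2]
print(estimate)                         # close to 5
```

Computing weights as `exp(log_p - log_q)` rather than as a ratio of densities avoids underflow when both densities are tiny.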

Definition

Importance Weight

The importance weight $w(x) = p(x)/q(x)$ corrects for the mismatch between proposal $q$ and target $p$. Where $q$ over-samples relative to $p$ (i.e., $q(x) > p(x)$), the weight $w(x) < 1$ down-weights. Where $q$ under-samples (i.e., $q(x) < p(x)$), the weight $w(x) > 1$ up-weights.

The support condition is critical: $q(x) > 0$ whenever $p(x)f(x) \neq 0$. Otherwise the estimator is biased.
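The bias from a support violation is easy to demonstrate numerically; a sketch with a deliberately broken (hypothetical) proposal that only covers the positive half-line:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p = N(0, 1); we try to estimate E_p[X] = 0.
# Broken proposal: half-normal on x > 0 with density q(x) = 2 * phi(x).
# The support condition fails: q(x) = 0 for x < 0 where p(x) * f(x) != 0.
N = 200_000
x = np.abs(rng.standard_normal(N))   # x_i ~ half-normal
w = np.full(N, 0.5)                  # p(x)/q(x) = phi(x) / (2 phi(x)) = 1/2 on x > 0
estimate = np.mean(x * w)
# The negative half is never sampled, so the estimate converges to
# E[X; X > 0] = 1/sqrt(2*pi) ~ 0.399 instead of the true value 0.
print(estimate)
```

No amount of extra samples fixes this: the estimator converges, but to the wrong number.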

Definition

Self-Normalized Importance Sampling

When $p(x)$ is known only up to a normalizing constant, i.e., we can evaluate $\tilde{p}(x)$ where $p(x) = \tilde{p}(x)/Z$, we use unnormalized weights $\tilde{w}(x) = \tilde{p}(x)/q(x)$ and the self-normalized estimator:

$$\hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^{N} f(x_i)\,\tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}$$

This estimator is biased (a ratio of two unbiased estimators is not itself unbiased) but consistent, and often has lower variance than the unnormalized version even when $Z$ is known.
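A short sketch of the self-normalized estimator, using an illustrative unnormalized target (not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target p~(x) = exp(-(x - 1.5)^2), a N(1.5, 1/2) kernel whose
# normalizing constant Z we never compute. Proposal q = N(0, 1).
def log_p_tilde(x):
    return -(x - 1.5) ** 2

def log_q(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

N = 200_000
x = rng.standard_normal(N)                  # x_i ~ q
log_w = log_p_tilde(x) - log_q(x)           # unnormalized log-weights
w = np.exp(log_w - log_w.max())             # shift by max for numerical stability
snis = np.sum(w * x) / np.sum(w)            # self-normalized estimate of E_p[X]
print(snis)                                 # close to the true mean 1.5
```

Shifting the log-weights by their maximum changes nothing (SNIS is invariant to rescaling the weights) but prevents overflow.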

Definition

Effective Sample Size

The effective sample size measures how many independent samples from $p$ your weighted samples are "worth":

$$\text{ESS} = \frac{1}{\sum_{i=1}^{N} \bar{w}_i^2}$$

where $\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j$ are the normalized weights. The ESS satisfies $1 \leq \text{ESS} \leq N$. When all weights are equal, $\text{ESS} = N$. When one weight dominates, $\text{ESS} \approx 1$.

An ESS much smaller than $N$ signals that the proposal is a poor match for the target, and the estimate is unreliable.
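The ESS formula is a one-liner; a sketch that works from log-weights for numerical stability:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum of squared normalized weights, computed from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # stable: shift before exponentiating
    w_bar = w / np.sum(w)               # normalized weights
    return 1.0 / np.sum(w_bar**2)

print(effective_sample_size(np.zeros(1000)))               # equal weights -> 1000
print(effective_sample_size([0.0, -50.0, -50.0, -50.0]))   # one dominant -> ~1
```

The two calls check the boundary cases from the definition: equal weights give $\text{ESS} = N$, one dominant weight gives $\text{ESS} \approx 1$.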

Main Theorems

Theorem

Unbiasedness of Importance Sampling

Statement

The importance sampling estimator $\hat{I}_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,w(x_i)$ with $x_i \stackrel{iid}{\sim} q$ is an unbiased estimator of $\mathbb{E}_p[f(X)]$:

$$\mathbb{E}_q[\hat{I}_{\text{IS}}] = \mathbb{E}_p[f(X)]$$

Its variance is:

$$\text{Var}_q(\hat{I}_{\text{IS}}) = \frac{1}{N}\text{Var}_q(f(X)\,w(X)) = \frac{1}{N}\left(\mathbb{E}_q[f(X)^2\,w(X)^2] - I^2\right)$$

Intuition

The reweighting by $w(x) = p(x)/q(x)$ exactly corrects for sampling from the "wrong" distribution: multiplying by $p/q$ inside the $q$-expectation converts it to a $p$-expectation. The variance depends on how variable the product $f(x)w(x)$ is under $q$. If the weights $w(x)$ are highly variable, the variance can be enormous.

Proof Sketch

$$\mathbb{E}_q[\hat{I}_{\text{IS}}] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_q[f(x_i)w(x_i)] = \mathbb{E}_q\!\left[f(X)\frac{p(X)}{q(X)}\right] = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = \int f(x)p(x)\,dx = I$$

The $q(x)$ in the denominator of $w$ cancels with the $q(x)$ in the integral. This is a direct application of the change-of-measure identity.

Why It Matters

Unbiasedness means the estimator is correct on average, regardless of the proposal choice (as long as the support condition holds). This is a fundamental guarantee that allows importance sampling to be used in a wide variety of settings. However, unbiasedness says nothing about variance. The estimator can be unbiased but have variance so large that individual estimates are useless.

Failure Mode

If $q(x) = 0$ for some $x$ where $p(x)f(x) \neq 0$, the estimator is biased: those regions are never sampled. Even when the support condition holds, if $q$ has lighter tails than $p$, the weights $w(x) = p(x)/q(x)$ can be unbounded, and $\mathbb{E}_q[f(X)^2 w(X)^2]$ may be infinite, giving the estimator infinite variance.

Theorem

Optimal Proposal and Variance Lower Bound

Statement

The variance of the importance sampling estimator is minimized when the proposal is:

$$q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(x)|\, p(x)\, dx} = \frac{|f(x)|\, p(x)}{\mathbb{E}_p[|f(X)|]}$$

For $f(x) \geq 0$, the resulting minimum variance is:

$$\text{Var}_{q^*}(\hat{I}_{\text{IS}}) = \frac{1}{N}\left((\mathbb{E}_p[f(X)])^2 - (\mathbb{E}_p[f(X)])^2\right) = 0$$

This gives zero variance! Under the optimal proposal, the product $f(x)w(x) = \mathbb{E}_p[f(X)]$ is constant for all $x$, meaning every sample gives exactly the right answer. But constructing $q^*$ requires knowing $\mathbb{E}_p[f(X)]$, the very quantity we are trying to estimate.

Intuition

The optimal proposal concentrates samples where $|f(x)|p(x)$ is large, exactly the regions that contribute most to the integral. Regions where $f(x)$ is small or $p(x)$ is small are sampled less. This is the "importance" in importance sampling: sample where it matters most.

While the exact optimal proposal is impractical (it requires the answer), it guides proposal design: a good proposal should be roughly proportional to $|f(x)|p(x)$.
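To see this guidance numerically (an illustrative comparison, not from the text): for $f(x) = x^2$ and $p = \mathcal{N}(0,1)$, the optimal proposal $q^* \propto x^2 e^{-x^2/2}$ places mass away from the origin, so a wider Gaussian proposal should beat sampling from $p$ itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate E_p[X^2] = 1 for p = N(0, 1). Compare the per-sample variance of
# f(x) * w(x) under q = p (naive MC) and under the wider proposal q = N(0, 3),
# whose shape is closer to q* ∝ x^2 exp(-x^2 / 2).
N = 200_000

# Naive: q = p, all weights equal 1.
x_naive = rng.standard_normal(N)
var_naive = np.var(x_naive**2)               # theoretical value: 2

# Wider proposal: q = N(0, s^2) with s = sqrt(3).
s = np.sqrt(3.0)
x = rng.normal(0.0, s, size=N)
log_w = -0.5 * x**2 - (-0.5 * (x / s) ** 2 - np.log(s))   # log p - log q
var_wide = np.var(x**2 * np.exp(log_w))      # theoretical value: ~0.45
print(var_naive, var_wide)
```

The wider proposal cuts the per-sample variance roughly fourfold, even though it is only a crude Gaussian approximation to $q^*$.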

Proof Sketch

Write $\text{Var}_q(fw) = \mathbb{E}_q[(fw)^2] - I^2$. Minimize $\mathbb{E}_q[(fw)^2] = \int \frac{f(x)^2 p(x)^2}{q(x)}dx$ subject to $\int q(x)\,dx = 1$. By Cauchy-Schwarz (or Lagrange multipliers):

$$\int \frac{f(x)^2 p(x)^2}{q(x)}dx \geq \left(\int |f(x)|\, p(x)\, dx\right)^2$$

with equality when $q(x) \propto |f(x)|\, p(x)$.

Why It Matters

This result provides the theoretical benchmark for proposal design. In practice, you approximate $q^*$ by choosing proposals that emphasize the same regions as $|f(x)|p(x)$. It also explains why importance sampling for rare events is hard: $|f(x)|p(x)$ is concentrated in a tiny region, and finding a good proposal for that region requires problem-specific knowledge.

Failure Mode

In practice, you cannot use $q^*$ because computing it requires the very integral you are trying to estimate. Practical proposals are always approximations. If your approximation has lighter tails than the target, the variance can be infinite even though an optimal zero-variance proposal exists.

Weight Degeneracy in High Dimensions

This is the most important practical limitation of importance sampling.

In $d$ dimensions, even if $q$ is a reasonable approximation to $p$, the importance weights tend to become extremely variable. Consider the case where $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(0, \sigma^2 I_d)$. The log-weight is:

$$\log w(x) = \log p(x) - \log q(x) = \frac{d}{2}\log\sigma^2 + \frac{(1 - \sigma^2)}{2\sigma^2}\|x\|^2$$

Since $\|x\|^2 \approx d\sigma^2$ under $q$, the log-weight has mean $O(d)$ and standard deviation $O(\sqrt{d})$. By a CLT argument, the weights are approximately log-normal with log-variance growing linearly in $d$. This means:

  • A few samples get enormous weights
  • The vast majority get negligible weights
  • The ESS collapses to $O(1)$ as $d$ grows

This is why importance sampling in its basic form does not scale to high dimensions. Particle filtering, which applies importance sampling sequentially, partially mitigates this through resampling steps.
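A sketch of this collapse, using the exact log-weight $\log w(x) = \frac{d}{2}\log\sigma^2 + \frac{1-\sigma^2}{2\sigma^2}\|x\|^2$ for $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(0, \sigma^2 I_d)$; the mismatch $\sigma = 1.2$ is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# p = N(0, I_d), q = N(0, sigma^2 I_d) with the mild mismatch sigma = 1.2.
# Even so, the effective sample size collapses as d grows.
def ess_in_dim(d, sigma=1.2, N=10_000):
    x = rng.normal(0.0, sigma, size=(N, d))          # N draws from q in d dims
    log_w = (0.5 * d * np.log(sigma**2)
             + (1.0 - sigma**2) / (2.0 * sigma**2) * np.sum(x**2, axis=1))
    w = np.exp(log_w - log_w.max())                  # stable exponentiation
    w_bar = w / w.sum()
    return 1.0 / np.sum(w_bar**2)

for d in [1, 10, 50, 200]:
    print(d, round(ess_in_dim(d)))   # ESS shrinks from ~N toward O(1)
```

At $d = 1$ nearly all of the 10,000 samples are effective; by $d = 200$ a handful of samples carry essentially all the weight.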

Canonical Examples

Example

Estimating a tail probability

Goal: estimate $P(X > 4)$ where $X \sim \mathcal{N}(0, 1)$.

Naive Monte Carlo: draw $x_i \sim \mathcal{N}(0,1)$ and compute $\hat{p} = \frac{1}{N}\sum \mathbf{1}(x_i > 4)$. Since $P(X > 4) \approx 3.2 \times 10^{-5}$, you need roughly $N \approx 10^6$ samples to see even a few hits.

Importance sampling: use $q = \mathcal{N}(4, 1)$ (shifted to the tail). Weights: $w(x) = \frac{\phi(x)}{\phi(x-4)} = \exp(-4x + 8)$. Estimator: $\hat{p}_{\text{IS}} = \frac{1}{N}\sum \mathbf{1}(x_i > 4)\,w(x_i)$.

Now most samples fall in the region of interest, and the weights correct for the shifted proposal. With $N = 1000$ importance samples, you get a reliable estimate that would require $10^6$ naive samples.
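A sketch of this computation (the seed and sample size are arbitrary):

```python
import math
import numpy as np

rng = np.random.default_rng(5)

# Estimate P(X > 4) for X ~ N(0, 1) using the shifted proposal q = N(4, 1).
N = 1000
x = rng.normal(4.0, 1.0, size=N)            # x_i ~ q: most land near the tail
w = np.exp(-4.0 * x + 8.0)                  # phi(x) / phi(x - 4) = exp(-4x + 8)
p_hat = np.mean((x > 4.0) * w)              # IS estimate of the tail probability

p_true = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # exact value, ~3.17e-5
print(p_hat, p_true)
```

With only 1000 samples the relative error is typically a few percent, whereas naive Monte Carlo with the same budget would almost surely return 0.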

Example

Bayesian posterior estimation

Prior: $\theta \sim \mathcal{N}(0, 1)$. Likelihood: $x \mid \theta \sim \mathcal{N}(\theta, 1)$. Observe $x = 3$.

Posterior: $p(\theta \mid x=3) \propto e^{-\theta^2/2}\, e^{-(3-\theta)^2/2} \propto e^{-(\theta - 1.5)^2}$.

Using proposal $q(\theta) = \mathcal{N}(0, 1)$ (the prior):

$$w(\theta) = \frac{p(\theta \mid x=3)}{q(\theta)} \propto \frac{e^{-\theta^2/2}\, e^{-(3-\theta)^2/2}}{e^{-\theta^2/2}} = e^{-(3-\theta)^2/2}$$

The weights emphasize $\theta$ values near 3 (where the likelihood is high). The self-normalized estimator with these weights estimates posterior expectations. However, the prior proposal is centered at 0 while the posterior is centered at 1.5, so the ESS will be moderate.
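This example is easy to run end to end; a sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample theta from the prior N(0, 1) and weight by the likelihood.
N = 100_000
theta = rng.standard_normal(N)
log_w = -0.5 * (3.0 - theta) ** 2          # log-likelihood, up to a constant
w = np.exp(log_w - log_w.max())            # stable unnormalized weights

post_mean = np.sum(w * theta) / np.sum(w)  # SNIS estimate; true value is 1.5
w_bar = w / w.sum()
ess = 1.0 / np.sum(w_bar**2)
print(post_mean, ess / N)                  # mean ~1.5, ESS a modest fraction of N
```

The ESS comes out to roughly a fifth of $N$: usable, but a preview of how a prior proposal degrades as data pull the posterior further away.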

Common Confusions

Watch Out

Importance sampling can have INFINITE variance

If the proposal $q$ has lighter tails than the target $p$, the second moment $\mathbb{E}_q[f(X)^2 w(X)^2]$ can diverge, giving the estimator infinite variance. For example, using a Gaussian proposal for a Cauchy target gives weights $w(x) = p(x)/q(x) \propto (1+x^2)^{-1}\, e^{x^2/2}$, which grow exponentially in the tails. With infinite variance, the CLT does not apply, sample averages converge slowly and erratically, and the estimator gives no warning that it is failing. This is the single most dangerous failure mode of importance sampling.

Rule of thumb: the proposal should have tails at least as heavy as $|f(x)|p(x)$.

Watch Out

Self-normalized IS is biased but often better

The self-normalized estimator $\hat{I}_{\text{SNIS}}$ is biased because it is a ratio of two random quantities. However, it has two advantages: (1) it only requires unnormalized weights, so you do not need to know the normalizing constant of $p$; (2) it is invariant to rescaling of the weights, which can reduce variance. In many practical settings, SNIS has lower MSE than the unnormalized estimator, especially when the weights are highly variable.

Watch Out

High ESS does not guarantee correctness

The ESS measures weight uniformity, not whether the proposal covers the important regions of the target. You can have high ESS with a proposal that completely misses a mode of $p$. The weights in the sampled region are uniform, but the missing mode is never represented. Always combine ESS diagnostics with visual checks when possible.

Summary

  • IS formula: $\mathbb{E}_p[f(X)] = \mathbb{E}_q[f(X)\,p(X)/q(X)]$
  • Importance weights $w(x) = p(x)/q(x)$ correct for sampling mismatch
  • Self-normalized IS handles unknown normalizing constants
  • ESS $= 1/\sum \bar{w}_i^2$ measures the effective number of samples
  • Optimal proposal: $q^*(x) \propto |f(x)|p(x)$ (impractical but guides design)
  • Weight degeneracy in high dimensions: ESS collapses as $d$ grows
  • Proposal must have heavier tails than target to avoid infinite variance

Exercises

ExerciseCore

Problem

You want to estimate $I = \int_0^1 e^x\,dx$ using importance sampling with proposal $q(x) = 1$ (uniform on $[0,1]$) and target $p(x) = e^x/(e-1)$.

Note that this is the same as naive Monte Carlo. Now use the proposal $q(x) = 2x$ on $[0,1]$. Compute the importance weights and the IS estimator. Which proposal gives lower variance?

ExerciseCore

Problem

Derive the effective sample size formula. Start with $N$ samples with normalized weights $\bar{w}_i = \tilde{w}_i/\sum_j \tilde{w}_j$. The ESS should equal $N$ when all weights are equal and 1 when one weight dominates. Show that $\text{ESS} = 1/\sum_i \bar{w}_i^2$ satisfies these properties.

ExerciseAdvanced

Problem

Show that if $p(x) \propto e^{-x^2/2}$ (standard normal) and $q(x) \propto (1 + x^2)^{-1}$ (standard Cauchy), then the importance sampling estimator of $\mathbb{E}_p[X^2]$ has finite variance. What if the roles are reversed, with $p$ Cauchy and $q$ Gaussian?

References

Canonical:

  • Kahn & Marshall (1953), "Methods of Reducing Sample Size in Monte Carlo Computations"
  • Geweke (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration"

Current:

  • Owen, Monte Carlo Theory, Methods and Examples (2013), Chapters 9-10

  • Vehtari, Gelman, Gabry (2017), "Pareto Smoothed Importance Sampling"

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5


Last reviewed: April 2026
