Importance Sampling
Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.
Why This Matters
Importance sampling: sample from q(x), reweight by p(x)/q(x)
Importance sampling is one of the most fundamental ideas in computational statistics. It underlies particle filtering (sequential Monte Carlo), off-policy evaluation in reinforcement learning, rare event simulation, and variational inference diagnostics (PSIS-LOO). Understanding importance sampling, and especially when and why it fails, is essential for anyone working with Monte Carlo methods.
The core idea is simple: if you cannot sample from $p(x)$, sample from something else and correct for the mismatch using weights. But this simplicity is deceptive. In high dimensions, importance sampling can fail spectacularly, and understanding why is a gateway to understanding the curse of dimensionality.
Mental Model
You want to compute the average height of people in Country A, but you can only survey people in Country B. If you know the ratio of population densities (how much more or less likely each "type" of person is in A vs B), you can reweight each measurement: people who are over-represented in B get lower weight, and people who are under-represented get higher weight. The reweighted average estimates the Country A average.
This is importance sampling: sample from $q(x)$ (Country B), reweight by $w(x) = p(x)/q(x)$ (the density ratio), and compute weighted averages.
Formal Setup and Notation
Let $p(x)$ be the target distribution and $f(x)$ a function whose expectation we wish to compute:

$$\mu = \mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$$

We cannot sample from $p$ directly (or it is inefficient to do so), but we can sample from a proposal distribution $q(x)$.
Importance Sampling Estimator
The importance sampling estimator of $\mu$ is:

$$\hat{\mu}_{IS} = \frac{1}{N} \sum_{i=1}^{N} w(x_i)\, f(x_i), \qquad x_i \sim q$$

where the importance weights are:

$$w(x_i) = \frac{p(x_i)}{q(x_i)}$$

This works because $\mathbb{E}_q[w(X) f(X)] = \int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx = \int f(x)\, p(x)\, dx = \mu$.
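As a concrete sketch (NumPy; the Gaussian target and the heavier-tailed Gaussian proposal are illustrative choices, not part of the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def importance_sampling(f, log_p, log_q, x):
    """Unnormalized IS estimate of E_p[f(X)] from samples x ~ q."""
    w = np.exp(log_p(x) - log_q(x))   # importance weights w(x) = p(x)/q(x)
    return np.mean(w * f(x))

# Illustrative target p = N(2, 1); proposal q = N(0, 2^2) has heavier tails.
x = rng.normal(0.0, 2.0, size=200_000)
est = importance_sampling(
    f=lambda t: t,
    log_p=lambda t: norm_logpdf(t, 2.0, 1.0),
    log_q=lambda t: norm_logpdf(t, 0.0, 2.0),
    x=x,
)
# est should be close to 2.0, the mean of the target
```

Working with log-densities and exponentiating the difference avoids underflow when the densities themselves are tiny.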
Importance Weight
The importance weight $w(x) = p(x)/q(x)$ corrects for the mismatch between proposal $q$ and target $p$. Where $q$ over-samples relative to $p$ (i.e., $q(x) > p(x)$), the weight $w(x) < 1$ down-weights. Where $q$ under-samples (i.e., $q(x) < p(x)$), the weight $w(x) > 1$ up-weights.

The support condition is critical: $q(x) > 0$ whenever $p(x) f(x) \neq 0$. Otherwise the estimator is biased.
Self-Normalized Importance Sampling
When $p$ is known only up to a normalizing constant, i.e., we can evaluate $\tilde{p}(x) = Z\, p(x)$ where $Z$ is unknown, we use unnormalized weights $\tilde{w}(x_i) = \tilde{p}(x_i)/q(x_i)$ and the self-normalized estimator:

$$\hat{\mu}_{SNIS} = \frac{\sum_{i=1}^{N} \tilde{w}(x_i)\, f(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}$$

This estimator is biased (the ratio of two unbiased estimators is not unbiased) but consistent, and often has lower variance than the unnormalized version even when $Z$ is known.
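A minimal SNIS sketch (NumPy; the target and proposal are invented for the example), working in log-space so that the unknown normalizing constant, and indeed any constant shift of the log-weights, cancels in the ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def snis(f, log_p_tilde, log_q, x):
    """Self-normalized IS: needs only the unnormalized log-density of p."""
    log_w = log_p_tilde(x) - log_q(x)
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return np.sum(w * f(x)) / np.sum(w)

# Unnormalized target: exp(-x^2/2) is N(0,1) up to Z = sqrt(2*pi).
x = rng.normal(0.0, 3.0, size=100_000)        # proposal q = N(0, 3^2)
est = snis(
    f=lambda t: t**2,
    log_p_tilde=lambda t: -0.5 * t**2,        # normalizing constant omitted
    log_q=lambda t: -0.5 * (t / 3.0) ** 2,    # its constant also cancels in SNIS
    x=x,
)
# est should be close to E_p[X^2] = 1
```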
Effective Sample Size
The effective sample size measures how many independent samples your $N$ weighted samples are "worth":

$$\mathrm{ESS} = \frac{1}{\sum_{i=1}^{N} \bar{w}_i^2}$$

where $\bar{w}_i = \tilde{w}(x_i) / \sum_j \tilde{w}(x_j)$ are the normalized weights. The ESS satisfies $1 \leq \mathrm{ESS} \leq N$. When all weights are equal, $\mathrm{ESS} = N$. When one weight dominates, $\mathrm{ESS} \to 1$.

An ESS much smaller than $N$ signals that the proposal is a poor match for the target, and the estimate is unreliable.
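The ESS is a one-liner given the log-weights; a small sketch:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum(w_bar_i^2), computed stably via a max-shift in log-space."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # shift cancels in normalization
    w_bar = w / w.sum()               # normalized weights
    return 1.0 / np.sum(w_bar**2)

ess_equal = effective_sample_size(np.zeros(1000))            # all equal -> 1000
ess_degenerate = effective_sample_size([0.0, -50.0, -50.0])  # one dominates -> ~1
```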
Main Theorems
Unbiasedness of Importance Sampling
Statement
The importance sampling estimator with $w(x) = p(x)/q(x)$ is an unbiased estimator of $\mu$:

$$\mathbb{E}_q[\hat{\mu}_{IS}] = \mu$$

Its variance is:

$$\mathrm{Var}(\hat{\mu}_{IS}) = \frac{1}{N} \mathrm{Var}_q[w(X) f(X)] = \frac{1}{N} \left( \mathbb{E}_q[w(X)^2 f(X)^2] - \mu^2 \right)$$
Intuition
The reweighting by $w(x)$ exactly corrects for sampling from the "wrong" distribution. Multiplying by $p(x)/q(x)$ inside the $q$-expectation converts it to a $p$-expectation. The variance depends on how variable the product $w(X) f(X)$ is under $q$. If the weights are highly variable, the variance can be enormous.
Proof Sketch
The $q(x)$ in the denominator of $w(x)$ cancels with the $q(x)$ density in the integral:

$$\mathbb{E}_q[w(X) f(X)] = \int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx = \int f(x)\, p(x)\, dx = \mu$$

This is a direct application of the change-of-measure identity.
Why It Matters
Unbiasedness means the estimator is correct on average, regardless of the proposal choice (as long as the support condition holds). This is a fundamental guarantee that allows importance sampling to be used in a wide variety of settings. However, unbiasedness says nothing about variance. The estimator can be unbiased but have variance so large that individual estimates are useless.
Failure Mode
If $q(x) = 0$ for some $x$ where $p(x) f(x) \neq 0$, the estimator is biased: those regions are never sampled. Even when the support condition holds, if $q$ has lighter tails than $p$, the weights are unbounded, and $\mathbb{E}_q[w(X)^2 f(X)^2]$ may diverge, giving the estimator infinite variance.
Optimal Proposal and Variance Lower Bound
Statement
The variance of the importance sampling estimator is minimized when the proposal is:

$$q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(y)|\, p(y)\, dy}$$

For $f(x) \geq 0$, the resulting minimum variance is:

$$\mathrm{Var}(\hat{\mu}_{IS}) = 0$$

Wait: this gives zero variance! The optimal proposal $q^*(x) = f(x)\, p(x) / \mu$ produces a constant product $w(x) f(x) = \mu$ for all $x$, meaning every sample gives exactly the right answer. But this requires knowing $\mu$, the very quantity we are trying to estimate.
Intuition
The optimal proposal concentrates samples where $|f(x)|\, p(x)$ is large, exactly the regions that contribute most to the integral. Regions where $p$ is small or $|f|$ is small are sampled less. This is the "importance" in importance sampling: sample where it matters most.

While the exact optimal proposal is impractical (it requires the answer), it guides proposal design: a good proposal should be roughly proportional to $|f(x)|\, p(x)$.
Proof Sketch
Write $\mathbb{E}_q[w(X)^2 f(X)^2] = \int \frac{f(x)^2 p(x)^2}{q(x)}\, dx$. Minimize over $q$ subject to $\int q(x)\, dx = 1$. By Cauchy-Schwarz (or Lagrange multipliers):

$$\left( \int |f(x)|\, p(x)\, dx \right)^2 = \left( \int \frac{|f(x)|\, p(x)}{\sqrt{q(x)}} \sqrt{q(x)}\, dx \right)^2 \leq \int \frac{f(x)^2 p(x)^2}{q(x)}\, dx$$

with equality when $q(x) \propto |f(x)|\, p(x)$.
Why It Matters
This result provides the theoretical benchmark for proposal design. In practice, you approximate $q^*$ by choosing proposals that emphasize the same regions as $|f(x)|\, p(x)$. It also explains why importance sampling for rare events is hard: $|f(x)|\, p(x)$ is concentrated in a tiny region, and finding a good proposal for that region requires problem-specific knowledge.
Failure Mode
In practice, you cannot use $q^*$ because computing its normalizing constant requires the very integral you are trying to estimate. Practical proposals are always approximations. If your approximation has lighter tails than the target, the variance can be infinite even though an optimal zero-variance proposal exists.
Weight Degeneracy in High Dimensions
This is the most important practical limitation of importance sampling.
In $d$ dimensions, even if $q$ is a reasonable approximation to $p$, the importance weights tend to become extremely variable. Consider the case where $p(x) = \prod_{j=1}^{d} p_j(x_j)$ and $q(x) = \prod_{j=1}^{d} q_j(x_j)$. The log-weight is:

$$\log w(x) = \sum_{j=1}^{d} \log \frac{p_j(x_j)}{q_j(x_j)}$$

Since the terms are independent under $q$, the log-weight has mean $-\sum_j \mathrm{KL}(q_j \,\|\, p_j)$ and standard deviation growing like $\sqrt{d}$. By a CLT argument, the weights are approximately log-normal with log-variance growing linearly in $d$. This means:
- A few samples get enormous weights
- The vast majority get negligible weights
- The ESS collapses to $1$ as $d$ grows
This is why importance sampling in its basic form does not scale to high dimensions. Particle filtering, which applies importance sampling sequentially, partially mitigates this through resampling steps.
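The collapse is easy to see numerically. The sketch below (an illustrative standard-normal product target with a slightly wider Gaussian product proposal; both chosen for this demonstration) tracks the ESS as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def ess(log_w):
    """Effective sample size from log-weights."""
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return 1.0 / np.sum(w**2)

n = 10_000
ess_by_dim = {}
for d in [1, 10, 100]:
    # Proposal q = N(0, 1.5^2 I_d), target p = N(0, I_d): a mild mismatch
    # per coordinate, but the per-coordinate log-weights add up across d.
    x = rng.normal(0.0, 1.5, size=(n, d))
    log_w = (-0.5 * x**2 + 0.5 * (x / 1.5) ** 2 + np.log(1.5)).sum(axis=1)
    ess_by_dim[d] = ess(log_w)
# ess_by_dim[1] stays near n; ess_by_dim[100] collapses to a handful
```

The per-coordinate mismatch is tiny, yet by $d = 100$ a few samples carry nearly all the weight.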
Canonical Examples
Estimating a tail probability
Goal: estimate $\mu = P(X > 4)$ where $X \sim \mathcal{N}(0, 1)$.

Naive Monte Carlo: draw $x_i \sim \mathcal{N}(0, 1)$ and compute $\hat{\mu} = \frac{1}{N} \sum_i \mathbb{1}[x_i > 4]$. Since $P(X > 4) \approx 3.2 \times 10^{-5}$, you need roughly $10^5$ samples to see even a few hits.

Importance sampling: use $q = \mathcal{N}(4, 1)$ (shifted to the tail). Weights: $w(x) = \phi(x) / \phi(x - 4)$, where $\phi$ is the standard normal density. Estimator: $\hat{\mu}_{IS} = \frac{1}{N} \sum_i w(x_i)\, \mathbb{1}[x_i > 4]$.

Now most samples fall in the region of interest, and the weights correct for the shifted proposal. With a few thousand importance samples, you get a reliable estimate that would require millions of naive samples.
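A sketch of this tail estimate, taking the rare event to be $P(X > 4)$ under a standard normal (true value $\approx 3.17 \times 10^{-5}$) with the proposal shifted to $\mathcal{N}(4, 1)$:

```python
import numpy as np

rng = np.random.default_rng(3)

def norm_logpdf(x, mu=0.0):
    # Unit-variance normal log-density; the shared constant cancels in the ratio
    return -0.5 * (x - mu) ** 2

n = 10_000
x = rng.normal(4.0, 1.0, size=n)                   # proposal q = N(4, 1)
w = np.exp(norm_logpdf(x) - norm_logpdf(x, mu=4.0))
est = np.mean(w * (x > 4.0))
# est should be close to P(X > 4) ~ 3.17e-5
```

Naive Monte Carlo with the same $n = 10{,}000$ would typically see zero hits; the shifted proposal lands about half its samples past the threshold.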
Bayesian posterior estimation
Prior: $\theta \sim \mathcal{N}(0, 1)$. Likelihood: $y \mid \theta \sim \mathcal{N}(\theta, 1)$. Observe $y = 3$.

Posterior: $\theta \mid y \sim \mathcal{N}(1.5, 0.5)$.

Using proposal $q(\theta) = \mathcal{N}(0, 1)$ (the prior), the unnormalized weights are the likelihood values:

$$\tilde{w}(\theta_i) \propto \exp\left( -\tfrac{1}{2} (3 - \theta_i)^2 \right)$$

The weights emphasize values near 3 (where the likelihood is high). The self-normalized estimator with these weights estimates posterior expectations. However, the prior proposal is centered at 0 while the posterior is centered at 1.5, so the ESS will be moderate.
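A sketch of this prior-as-proposal computation, assuming the conjugate setup implied above (prior $\mathcal{N}(0,1)$, unit-variance Gaussian likelihood, observation $y = 3$, hence posterior $\mathcal{N}(1.5, 0.5)$):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50_000
theta = rng.normal(0.0, 1.0, size=n)          # proposal = prior N(0, 1)
y = 3.0
log_w = -0.5 * (y - theta) ** 2               # log-likelihood up to a constant
w = np.exp(log_w - log_w.max())               # SNIS: constant shift cancels

post_mean = np.sum(w * theta) / np.sum(w)     # should be close to 1.5
ess = 1.0 / np.sum((w / w.sum()) ** 2)        # well below n: prior != posterior
```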
Common Confusions
Importance sampling can have INFINITE variance
If the proposal $q$ has lighter tails than the target $p$, the second moment $\mathbb{E}_q[w(X)^2 f(X)^2]$ can diverge, giving the estimator infinite variance. For example, using a Gaussian proposal for a Cauchy target: the weight $w(x) = p(x)/q(x)$ grows exponentially in the tails. With infinite variance, the CLT does not apply, sample averages converge erratically and are dominated by rare enormous weights, and the estimator gives no warning that it is failing. This is the single most dangerous failure mode of importance sampling.

Rule of thumb: the proposal should have tails at least as heavy as the target's.
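A quick numeric illustration of the tail rule, comparing the standard normal and standard Cauchy densities on a grid:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1001)
log_normal = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # log N(0,1) density
log_cauchy = -np.log(np.pi * (1.0 + x**2))          # log standard Cauchy density

# Safe: normal target, Cauchy proposal -> weight ratio stays bounded.
w_safe = np.exp(log_normal - log_cauchy)
# Dangerous: Cauchy target, normal proposal -> ratio explodes in the tail.
w_bad = np.exp(log_cauchy - log_normal)
# w_safe.max() is about 1.5; w_bad already exceeds 1e19 by x = 10
```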
Self-normalized IS is biased but often better
The self-normalized estimator is biased because it is a ratio of two random quantities. However, it has two advantages: (1) it only requires unnormalized weights, so you do not need to know the normalizing constant of ; (2) it is invariant to the normalization of the weights, which can reduce variance. In many practical settings, SNIS has lower MSE than the unnormalized estimator, especially when the weights are highly variable.
High ESS does not guarantee correctness
The ESS measures weight uniformity, not whether the proposal covers the important regions of the target. You can have high ESS with a proposal that completely misses a mode of . The weights in the sampled region are uniform, but the missing mode is never represented. Always combine ESS diagnostics with visual checks when possible.
Summary
- IS formula: $\hat{\mu}_{IS} = \frac{1}{N} \sum_i \frac{p(x_i)}{q(x_i)} f(x_i)$ with $x_i \sim q$
- Importance weights $w(x) = p(x)/q(x)$ correct for sampling mismatch
- Self-normalized IS handles unknown normalizing constants
- ESS $= 1/\sum_i \bar{w}_i^2$ measures effective number of samples
- Optimal proposal: $q^*(x) \propto |f(x)|\, p(x)$ (impractical but guides design)
- Weight degeneracy in high dimensions: ESS collapses as $d$ grows
- Proposal must have tails at least as heavy as the target's to avoid infinite variance
Exercises
Problem
You want to estimate $\mu = \int_0^1 x^2\, dx$ using importance sampling with proposal $q(x) = 1$ (uniform on $[0, 1]$) and target $p(x) = 1$ on $[0, 1]$, with $f(x) = x^2$.

Wait: this is the same as naive Monte Carlo (all weights equal 1). Now use the proposal $q(x) = 2x$ on $[0, 1]$. Compute the importance weights and the IS estimator. Which proposal gives lower variance?
Problem
Derive the effective sample size formula. Start with $N$ samples with normalized weights $\bar{w}_1, \ldots, \bar{w}_N$. The ESS should equal $N$ when all weights are equal and 1 when one weight dominates. Show that $\mathrm{ESS} = 1/\sum_i \bar{w}_i^2$ satisfies these properties.
Problem
Show that if $p$ is the standard normal and $q$ is the standard Cauchy, then the importance sampling estimator of $\mathbb{E}_p[f(X)]$ has finite variance for bounded $f$. What if the roles are reversed: $p$ is Cauchy and $q$ is Gaussian?
References
Canonical:
- Kahn & Marshall (1953), "Methods of Reducing Sample Size in Monte Carlo Computations"
- Geweke (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration"
Current:
- Owen, Monte Carlo Theory, Methods and Examples (2013), Chapters 9-10
- Vehtari, Gelman, Gabry (2017), "Pareto Smoothed Importance Sampling"
- Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
- Brooks et al., Handbook of MCMC (2011), Chapters 1-5
Next Topics
The natural next steps from importance sampling:
- Variance reduction techniques: control variates, antithetic variables, and stratification
- Rao-Blackwellization: reducing variance by analytically integrating out some variables
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Particle Filters (Layer 3)
- Rao-Blackwellization (Layer 2)
- Variance Reduction Techniques (Layer 2)