Importance Sampling

Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.

Why This Matters

[Figure: importance sampling schematic. Sample from $q(x)$, reweight by $p(x)/q(x)$. Samples are drawn with size proportional to the weight $p/q$; weights are largest where $p$ is large and $q$ is small.]

Importance sampling is one of the most fundamental ideas in computational statistics. It underlies particle filtering (sequential Monte Carlo), off-policy evaluation in reinforcement learning, rare event simulation, and variational inference diagnostics (PSIS-LOO). Understanding importance sampling, and especially when and why it fails, is essential for anyone working with Monte Carlo methods.

The core idea is simple: if you cannot sample from $p$, sample from something else $q$ and correct for the mismatch using weights. But this simplicity is deceptive. In high dimensions, importance sampling can fail spectacularly, and understanding why is a gateway to understanding the curse of dimensionality.

Mental Model

You want to compute the average height of people in Country A, but you can only survey people in Country B. If you know the ratio of population densities (how much more or less likely each "type" of person is in A vs B), you can reweight each measurement: people who are over-represented in B get lower weight, and people who are under-represented get higher weight. The reweighted average estimates the Country A average.

This is importance sampling: sample from $q$ (Country B), reweight by $p/q$ (the density ratio), and compute weighted averages.

Formal Setup and Notation

Let $p(x)$ be the target distribution and $f(x)$ a function whose expectation we wish to compute:

$$I = \mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$$

We cannot sample from $p$ directly (or it is inefficient to do so), but we can sample from a proposal distribution $q(x)$.

Definition

Importance Sampling Estimator

The importance sampling estimator of $\mathbb{E}_p[f(X)]$ is:

$$\hat{I}_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, w(x_i), \quad x_i \sim q$$

where the importance weights are:

$$w(x) = \frac{p(x)}{q(x)}$$

This works because $\mathbb{E}_q[f(X)\,w(X)] = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx = \int f(x)\,p(x)\,dx = \mathbb{E}_p[f(X)]$.
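A minimal NumPy sketch of this identity. The target $\mathcal{N}(2,1)$, proposal $\mathcal{N}(0,4)$, and test function $f(x) = x^2$ are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: target p = N(2, 1), proposal q = N(0, 2^2),
# test function f(x) = x^2, so E_p[f(X)] = Var + mean^2 = 1 + 4 = 5.
def log_p(x):
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x):
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)

N = 100_000
x = rng.normal(0.0, 2.0, size=N)        # x_i ~ q
w = np.exp(log_p(x) - log_q(x))         # importance weights w = p/q
estimate = np.mean(x**2 * w)            # unbiased IS estimate of E_p[X^2]
print(estimate)                         # close to 5
```

Computing weights as `exp(log_p - log_q)` rather than as a ratio of densities avoids underflow when both densities are tiny.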

Definition

Importance Weight

The importance weight $w(x) = p(x)/q(x)$ corrects for the mismatch between proposal $q$ and target $p$. Where $q$ over-samples relative to $p$ (i.e., $q(x) > p(x)$), the weight $w(x) < 1$ down-weights. Where $q$ under-samples (i.e., $q(x) < p(x)$), the weight $w(x) > 1$ up-weights.

The support condition is critical: $q(x) > 0$ whenever $p(x)f(x) \neq 0$. Otherwise the estimator is biased.
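The bias from a support violation is easy to demonstrate numerically; a sketch with a deliberately broken (hypothetical) proposal that only covers the positive half-line:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p = N(0, 1); we try to estimate E_p[X] = 0.
# Broken proposal: half-normal on x > 0 with density q(x) = 2 * phi(x).
# The support condition fails: q(x) = 0 for x < 0 where p(x) * f(x) != 0.
N = 200_000
x = np.abs(rng.standard_normal(N))   # x_i ~ half-normal
w = np.full(N, 0.5)                  # p(x)/q(x) = phi(x) / (2 phi(x)) = 1/2 on x > 0
estimate = np.mean(x * w)
# The negative half is never sampled, so the estimate converges to
# E[X; X > 0] = 1/sqrt(2*pi) ~ 0.399 instead of the true value 0.
print(estimate)
```

No amount of extra samples fixes this: the estimator converges, but to the wrong number.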

Definition

Self-Normalized Importance Sampling

When $p(x)$ is known only up to a normalizing constant, i.e., we can evaluate $\tilde{p}(x)$ where $p(x) = \tilde{p}(x)/Z$, we use unnormalized weights $\tilde{w}(x) = \tilde{p}(x)/q(x)$ and the self-normalized estimator:

$$\hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^{N} f(x_i)\,\tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}$$

This estimator is biased (a ratio of two unbiased estimators is not itself unbiased) but consistent, and often has lower variance than the unnormalized version even when $Z$ is known.
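A short sketch of the self-normalized estimator, using an illustrative unnormalized target (not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target p~(x) = exp(-(x - 1.5)^2), a N(1.5, 1/2) kernel whose
# normalizing constant Z we never compute. Proposal q = N(0, 1).
def log_p_tilde(x):
    return -(x - 1.5) ** 2

def log_q(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

N = 200_000
x = rng.standard_normal(N)                  # x_i ~ q
log_w = log_p_tilde(x) - log_q(x)           # unnormalized log-weights
w = np.exp(log_w - log_w.max())             # shift by max for numerical stability
snis = np.sum(w * x) / np.sum(w)            # self-normalized estimate of E_p[X]
print(snis)                                 # close to the true mean 1.5
```

Shifting the log-weights by their maximum changes nothing (SNIS is invariant to rescaling the weights) but prevents overflow.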

Definition

Effective Sample Size

The effective sample size measures how many independent samples from $p$ your weighted samples are "worth":

$$\text{ESS} = \frac{1}{\sum_{i=1}^{N} \bar{w}_i^2}$$

where $\bar{w}_i = \tilde{w}_i / \sum_j \tilde{w}_j$ are the normalized weights. The ESS satisfies $1 \leq \text{ESS} \leq N$. When all weights are equal, $\text{ESS} = N$. When one weight dominates, $\text{ESS} \approx 1$.

An ESS much smaller than $N$ signals that the proposal is a poor match for the target, and the estimate is unreliable.
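The ESS formula is a one-liner; a sketch that works from log-weights for numerical stability:

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = 1 / sum of squared normalized weights, computed from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # stable: shift before exponentiating
    w_bar = w / np.sum(w)               # normalized weights
    return 1.0 / np.sum(w_bar**2)

print(effective_sample_size(np.zeros(1000)))               # equal weights -> 1000
print(effective_sample_size([0.0, -50.0, -50.0, -50.0]))   # one dominant -> ~1
```

The two calls check the boundary cases from the definition: equal weights give $\text{ESS} = N$, one dominant weight gives $\text{ESS} \approx 1$.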

Main Theorems

Theorem

Unbiasedness of Importance Sampling

Statement

The importance sampling estimator $\hat{I}_{\text{IS}} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,w(x_i)$ with $x_i \stackrel{iid}{\sim} q$ is an unbiased estimator of $\mathbb{E}_p[f(X)]$:

$$\mathbb{E}_q[\hat{I}_{\text{IS}}] = \mathbb{E}_p[f(X)]$$

Its variance is:

$$\text{Var}_q(\hat{I}_{\text{IS}}) = \frac{1}{N}\text{Var}_q(f(X)\,w(X)) = \frac{1}{N}\left(\mathbb{E}_q[f(X)^2\,w(X)^2] - I^2\right)$$

Intuition

The reweighting by $w(x) = p(x)/q(x)$ exactly corrects for sampling from the "wrong" distribution: multiplying by $p/q$ inside the $q$-expectation converts it to a $p$-expectation. The variance depends on how variable the product $f(x)w(x)$ is under $q$. If the weights $w(x)$ are highly variable, the variance can be enormous.

Proof Sketch

$$\mathbb{E}_q[\hat{I}_{\text{IS}}] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_q[f(x_i)w(x_i)] = \mathbb{E}_q\!\left[f(X)\frac{p(X)}{q(X)}\right] = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx = \int f(x)p(x)\,dx = I$$

The $q(x)$ in the denominator of $w$ cancels with the $q(x)$ in the integral. This is a direct application of the change-of-measure identity.

Why It Matters

Unbiasedness means the estimator is correct on average, regardless of the proposal choice (as long as the support condition holds). This is a fundamental guarantee that allows importance sampling to be used in a wide variety of settings. However, unbiasedness says nothing about variance. The estimator can be unbiased but have variance so large that individual estimates are useless.

Failure Mode

If $q(x) = 0$ for some $x$ where $p(x)f(x) \neq 0$, the estimator is biased: those regions are never sampled. Even when the support condition holds, if $q$ has lighter tails than $p$, the weights $w(x) = p(x)/q(x)$ can be unbounded, and $\mathbb{E}_q[f(X)^2 w(X)^2]$ may be infinite, giving the estimator infinite variance.

Theorem

Optimal Proposal and Variance Lower Bound

Statement

The variance of the importance sampling estimator is minimized when the proposal is:

$$q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(x)|\, p(x)\, dx} = \frac{|f(x)|\, p(x)}{\mathbb{E}_p[|f(X)|]}$$

For $f(x) \geq 0$, the resulting minimum variance is:

$$\text{Var}_{q^*}(\hat{I}_{\text{IS}}) = \frac{1}{N}\left((\mathbb{E}_p[f(X)])^2 - (\mathbb{E}_p[f(X)])^2\right) = 0$$

This gives zero variance! Under the optimal proposal, the product $f(x)w(x) = \mathbb{E}_p[f(X)]$ is constant for all $x$, meaning every sample gives exactly the right answer. But constructing $q^*$ requires knowing $\mathbb{E}_p[f(X)]$, the very quantity we are trying to estimate.

Intuition

The optimal proposal concentrates samples where $|f(x)|p(x)$ is large, exactly the regions that contribute most to the integral. Regions where $f(x)$ is small or $p(x)$ is small are sampled less. This is the "importance" in importance sampling: sample where it matters most.

While the exact optimal proposal is impractical (it requires the answer), it guides proposal design: a good proposal should be roughly proportional to $|f(x)|p(x)$.
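To see this guidance numerically (an illustrative comparison, not from the text): for $f(x) = x^2$ and $p = \mathcal{N}(0,1)$, the optimal proposal $q^* \propto x^2 e^{-x^2/2}$ places mass away from the origin, so a wider Gaussian proposal should beat sampling from $p$ itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate E_p[X^2] = 1 for p = N(0, 1). Compare the per-sample variance of
# f(x) * w(x) under q = p (naive MC) and under the wider proposal q = N(0, 3),
# whose shape is closer to q* ∝ x^2 exp(-x^2 / 2).
N = 200_000

# Naive: q = p, all weights equal 1.
x_naive = rng.standard_normal(N)
var_naive = np.var(x_naive**2)               # theoretical value: 2

# Wider proposal: q = N(0, s^2) with s = sqrt(3).
s = np.sqrt(3.0)
x = rng.normal(0.0, s, size=N)
log_w = -0.5 * x**2 - (-0.5 * (x / s) ** 2 - np.log(s))   # log p - log q
var_wide = np.var(x**2 * np.exp(log_w))      # theoretical value: ~0.45
print(var_naive, var_wide)
```

The wider proposal cuts the per-sample variance roughly fourfold, even though it is only a crude Gaussian approximation to $q^*$.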

Proof Sketch

Write $\text{Var}_q(fw) = \mathbb{E}_q[(fw)^2] - I^2$. Minimize $\mathbb{E}_q[(fw)^2] = \int \frac{f(x)^2 p(x)^2}{q(x)}dx$ subject to $\int q(x)\,dx = 1$. By Cauchy-Schwarz (or Lagrange multipliers):

$$\int \frac{f(x)^2 p(x)^2}{q(x)}dx \geq \left(\int |f(x)|\, p(x)\, dx\right)^2$$

with equality when $q(x) \propto |f(x)|\, p(x)$.

Why It Matters

This result provides the theoretical benchmark for proposal design. In practice, you approximate $q^*$ by choosing proposals that emphasize the same regions as $|f(x)|p(x)$. It also explains why importance sampling for rare events is hard: $|f(x)|p(x)$ is concentrated in a tiny region, and finding a good proposal for that region requires problem-specific knowledge.

Failure Mode

In practice, you cannot use $q^*$ because computing it requires the very integral you are trying to estimate. Practical proposals are always approximations. If your approximation has lighter tails than the target, the variance can be infinite even though an optimal zero-variance proposal exists.

Weight Degeneracy in High Dimensions

This is the most important practical limitation of importance sampling.

In $d$ dimensions, even if $q$ is a reasonable approximation to $p$, the importance weights tend to become extremely variable. Consider the case where $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(0, \sigma^2 I_d)$. The log-weight is:

$$\log w(x) = \log p(x) - \log q(x) = \frac{d}{2}\log\sigma^2 + \frac{(1 - \sigma^2)}{2\sigma^2}\|x\|^2$$

Since $\|x\|^2 \approx d\sigma^2$ under $q$, the log-weight has mean $O(d)$ and standard deviation $O(\sqrt{d})$. By a CLT argument, the weights are approximately log-normal with log-variance growing linearly in $d$. This means:

  • A few samples get enormous weights
  • The vast majority get negligible weights
  • The ESS collapses to $O(1)$ as $d$ grows

This is why importance sampling in its basic form does not scale to high dimensions. Particle filtering, which applies importance sampling sequentially, partially mitigates this through resampling steps.
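A sketch of this collapse, using the exact log-weight $\log w(x) = \frac{d}{2}\log\sigma^2 + \frac{1-\sigma^2}{2\sigma^2}\|x\|^2$ for $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(0, \sigma^2 I_d)$; the mismatch $\sigma = 1.2$ is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# p = N(0, I_d), q = N(0, sigma^2 I_d) with the mild mismatch sigma = 1.2.
# Even so, the effective sample size collapses as d grows.
def ess_in_dim(d, sigma=1.2, N=10_000):
    x = rng.normal(0.0, sigma, size=(N, d))          # N draws from q in d dims
    log_w = (0.5 * d * np.log(sigma**2)
             + (1.0 - sigma**2) / (2.0 * sigma**2) * np.sum(x**2, axis=1))
    w = np.exp(log_w - log_w.max())                  # stable exponentiation
    w_bar = w / w.sum()
    return 1.0 / np.sum(w_bar**2)

for d in [1, 10, 50, 200]:
    print(d, round(ess_in_dim(d)))   # ESS shrinks from ~N toward O(1)
```

At $d = 1$ nearly all of the 10,000 samples are effective; by $d = 200$ a handful of samples carry essentially all the weight.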

Canonical Examples

Example

Estimating a tail probability

Goal: estimate $P(X > 4)$ where $X \sim \mathcal{N}(0, 1)$.

Naive Monte Carlo: draw $x_i \sim \mathcal{N}(0,1)$ and compute $\hat{p} = \frac{1}{N}\sum \mathbf{1}(x_i > 4)$. Since $P(X > 4) \approx 3.2 \times 10^{-5}$, you need roughly $N \approx 10^6$ samples to see even a few hits.

Importance sampling: use $q = \mathcal{N}(4, 1)$ (shifted to the tail). Weights: $w(x) = \frac{\phi(x)}{\phi(x-4)} = \exp(-4x + 8)$. Estimator: $\hat{p}_{\text{IS}} = \frac{1}{N}\sum \mathbf{1}(x_i > 4)\,w(x_i)$.

Now most samples fall in the region of interest, and the weights correct for the shifted proposal. With $N = 1000$ importance samples, you get a reliable estimate that would require $10^6$ naive samples.
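A sketch of this computation (the seed and sample size are arbitrary):

```python
import math
import numpy as np

rng = np.random.default_rng(5)

# Estimate P(X > 4) for X ~ N(0, 1) using the shifted proposal q = N(4, 1).
N = 1000
x = rng.normal(4.0, 1.0, size=N)            # x_i ~ q: most land near the tail
w = np.exp(-4.0 * x + 8.0)                  # phi(x) / phi(x - 4) = exp(-4x + 8)
p_hat = np.mean((x > 4.0) * w)              # IS estimate of the tail probability

p_true = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # exact value, ~3.17e-5
print(p_hat, p_true)
```

With only 1000 samples the relative error is typically a few percent, whereas naive Monte Carlo with the same budget would almost surely return 0.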

Example

Bayesian posterior estimation

Prior: $\theta \sim \mathcal{N}(0, 1)$. Likelihood: $x \mid \theta \sim \mathcal{N}(\theta, 1)$. Observe $x = 3$.

Posterior: $p(\theta \mid x=3) \propto e^{-\theta^2/2}\, e^{-(3-\theta)^2/2} \propto e^{-(\theta - 1.5)^2}$.

Using proposal $q(\theta) = \mathcal{N}(0, 1)$ (the prior):

$$w(\theta) = \frac{p(\theta \mid x=3)}{q(\theta)} \propto \frac{e^{-\theta^2/2}\, e^{-(3-\theta)^2/2}}{e^{-\theta^2/2}} = e^{-(3-\theta)^2/2}$$

The weights emphasize $\theta$ values near 3 (where the likelihood is high). The self-normalized estimator with these weights estimates posterior expectations. However, the prior proposal is centered at 0 while the posterior is centered at 1.5, so the ESS will be moderate.
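This example is easy to run end to end; a sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample theta from the prior N(0, 1) and weight by the likelihood.
N = 100_000
theta = rng.standard_normal(N)
log_w = -0.5 * (3.0 - theta) ** 2          # log-likelihood, up to a constant
w = np.exp(log_w - log_w.max())            # stable unnormalized weights

post_mean = np.sum(w * theta) / np.sum(w)  # SNIS estimate; true value is 1.5
w_bar = w / w.sum()
ess = 1.0 / np.sum(w_bar**2)
print(post_mean, ess / N)                  # mean ~1.5, ESS a modest fraction of N
```

The ESS comes out to roughly a fifth of $N$: usable, but a preview of how a prior proposal degrades as data pull the posterior further away.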

Common Confusions

Watch Out

Importance sampling can have INFINITE variance

If the proposal $q$ has lighter tails than the target $p$, the second moment $\mathbb{E}_q[f(X)^2 w(X)^2]$ can diverge, giving the estimator infinite variance. For example, using a Gaussian proposal for a Cauchy target gives weights $w(x) = p(x)/q(x) \propto (1+x^2)^{-1}\, e^{x^2/2}$, which grow exponentially in the tails. With infinite variance, the CLT does not apply, sample averages converge slowly and erratically, and the estimator gives no warning that it is failing. This is the single most dangerous failure mode of importance sampling.

Rule of thumb: the proposal should have tails at least as heavy as $|f(x)|p(x)$.

Watch Out

Self-normalized IS is biased but often better

The self-normalized estimator $\hat{I}_{\text{SNIS}}$ is biased because it is a ratio of two random quantities. However, it has two advantages: (1) it only requires unnormalized weights, so you do not need to know the normalizing constant of $p$; (2) it is invariant to rescaling of the weights, which can reduce variance. In many practical settings, SNIS has lower MSE than the unnormalized estimator, especially when the weights are highly variable.

Watch Out

High ESS does not guarantee correctness

The ESS measures weight uniformity, not whether the proposal covers the important regions of the target. You can have high ESS with a proposal that completely misses a mode of $p$. The weights in the sampled region are uniform, but the missing mode is never represented. Always combine ESS diagnostics with visual checks when possible.

Summary

  • IS formula: $\mathbb{E}_p[f(X)] = \mathbb{E}_q[f(X)\,p(X)/q(X)]$
  • Importance weights $w(x) = p(x)/q(x)$ correct for sampling mismatch
  • Self-normalized IS handles unknown normalizing constants
  • ESS $= 1/\sum \bar{w}_i^2$ measures the effective number of samples
  • Optimal proposal: $q^*(x) \propto |f(x)|p(x)$ (impractical but guides design)
  • Weight degeneracy in high dimensions: ESS collapses as $d$ grows
  • Proposal must have heavier tails than target to avoid infinite variance

Exercises

ExerciseCore

Problem

You want to estimate $I = \int_0^1 e^x\,dx$ using importance sampling with proposal $q(x) = 1$ (uniform on $[0,1]$) and target $p(x) = e^x/(e-1)$.

Note that this is the same as naive Monte Carlo. Now use the proposal $q(x) = 2x$ on $[0,1]$. Compute the importance weights and the IS estimator. Which proposal gives lower variance?

ExerciseCore

Problem

Derive the effective sample size formula. Start with $N$ samples with normalized weights $\bar{w}_i = \tilde{w}_i/\sum_j \tilde{w}_j$. The ESS should equal $N$ when all weights are equal and 1 when one weight dominates. Show that $\text{ESS} = 1/\sum_i \bar{w}_i^2$ satisfies these properties.

ExerciseAdvanced

Problem

Show that if $p(x) \propto e^{-x^2/2}$ (standard normal) and $q(x) \propto (1 + x^2)^{-1}$ (standard Cauchy), then the importance sampling estimator of $\mathbb{E}_p[X^2]$ has finite variance. What if the roles are reversed, with $p$ Cauchy and $q$ Gaussian?

References

Canonical:

  • Kahn & Marshall (1953), "Methods of Reducing Sample Size in Monte Carlo Computations"
  • Geweke (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration"

Current:

  • Owen, Monte Carlo Theory, Methods and Examples (2013), Chapters 9-10

  • Vehtari, Gelman, Gabry (2017), "Pareto Smoothed Importance Sampling"

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7

  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5


Last reviewed: April 2026
