
Variance Reduction Techniques

Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.


Why This Matters

Monte Carlo estimation has a fundamental limitation: the standard error of the mean decreases as $O(1/\sqrt{n})$. To halve the error, you need four times as many samples. Variance reduction techniques break this bottleneck by using structure in the problem to get more information per sample. In Bayesian inference, reinforcement learning, and simulation, these techniques can reduce computation by orders of magnitude.

Mental Model

Imagine estimating the average height of people in a city by random sampling. Naive: pick people at random. Smarter: sample equal numbers from each neighborhood (stratification). Even smarter: if you know the average income of each neighborhood and income correlates with height, use that information to adjust your estimate (control variates). The idea is always the same: use what you already know to reduce uncertainty in what you do not.

Formal Setup and Notation

We want to estimate $\mu = \mathbb{E}[f(X)]$ where $X \sim p$. The naive Monte Carlo estimator is:

$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \quad X_i \sim p \text{ i.i.d.}$$

with variance $\mathrm{Var}(\hat{\mu}_n) = \sigma^2 / n$ where $\sigma^2 = \mathrm{Var}(f(X))$. Variance reduction constructs an alternative estimator $\hat{\mu}_n'$ with $\mathrm{Var}(\hat{\mu}_n') < \mathrm{Var}(\hat{\mu}_n)$ while keeping $\mathbb{E}[\hat{\mu}_n'] = \mu$.

Definition

Antithetic Variates

Generate pairs $(X_i, X_i')$ that are negatively correlated but each marginally distributed as $p$. The estimator:

$$\hat{\mu}_n^{\text{AV}} = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i) + f(X_i')}{2}$$

has variance $\frac{1}{n}\left[\frac{\sigma^2}{2} + \frac{1}{2}\mathrm{Cov}(f(X), f(X'))\right]$. When $\mathrm{Cov}(f(X), f(X')) < 0$, this variance is less than $\sigma^2/n$.

For example, if $X \sim \text{Uniform}(0,1)$, set $X' = 1 - X$. Then $X'$ is also $\text{Uniform}(0,1)$ but negatively correlated with $X$. If $f$ is monotone, $f(X)$ and $f(1-X)$ are negatively correlated, so the variance drops.

Definition

Control Variates

Let $g(X)$ be a function whose expectation $\mathbb{E}[g(X)] = \mu_g$ is known. The control variate estimator is:

$$\hat{\mu}_n^{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} \left[f(X_i) - c\,(g(X_i) - \mu_g)\right]$$

for some coefficient $c$. This is unbiased for any $c$ because $\mathbb{E}[g(X_i) - \mu_g] = 0$. Its variance is:

$$\mathrm{Var}(\hat{\mu}_n^{\text{CV}}) = \frac{1}{n}\left[\sigma^2 - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)\right]$$
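A minimal sketch of this estimator (my own toy example, not from the source): estimate $\mathbb{E}[e^U] = e - 1$ for $U \sim \text{Uniform}(0,1)$, using $g(U) = U$ with known mean $\mu_g = 1/2$ as the control, and the plug-in coefficient $\mathrm{Cov}(f,g)/\mathrm{Var}(g)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
u = rng.uniform(size=n)

f = np.exp(u)   # target values; E[e^U] = e - 1 ≈ 1.71828
g = u           # control variate with known mean mu_g = 1/2
mu_g = 0.5

# Plug-in coefficient Cov(f, g) / Var(g) -- the optimal choice derived later
c = np.cov(f, g)[0, 1] / np.var(g)

naive = f.mean()
cv = (f - c * (g - mu_g)).mean()

print(naive, cv)                      # both unbiased for e - 1
print(np.var(f), np.var(f - c * g))   # per-sample variance: CV is far smaller
```

On $[0,1]$ the correlation between $e^U$ and $U$ is above 0.99, so the control removes almost all of the variance.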

Definition

Stratified Sampling

Partition the sample space into disjoint strata $\Omega_1, \ldots, \Omega_K$ with $P(\Omega_k) = w_k$. Sample $n_k$ points from each stratum (the conditional distribution $p(\cdot \mid \Omega_k)$) and combine:

$$\hat{\mu}^{\text{strat}} = \sum_{k=1}^{K} w_k \hat{\mu}_k, \quad \hat{\mu}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} f(X_i^{(k)})$$

With proportional allocation ($n_k \propto w_k$), this is always at least as good as naive Monte Carlo, and strictly better whenever the stratum means differ.
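As an illustrative sketch (assumptions: equal-width strata on $[0,1]$ with proportional allocation, and the toy integral $\int_0^1 e^x\,dx$), the combined estimator looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_k = 10, 1_000                    # 10 equal-width strata, 1000 draws each
edges = np.linspace(0.0, 1.0, K + 1)  # stratum boundaries; w_k = 1/K for all k

strat_means = np.empty(K)
for k in range(K):
    # Sample the conditional distribution p(. | Omega_k): uniform on the stratum
    u = rng.uniform(edges[k], edges[k + 1], size=n_k)
    strat_means[k] = np.exp(u).mean()

mu_strat = strat_means.mean()         # sum_k w_k * mu_hat_k with w_k = 1/K
print(mu_strat)                       # close to e - 1 ≈ 1.71828
```

Within each narrow stratum, $e^x$ varies little, so the within-stratum variances are tiny and the combined estimator is far tighter than naive sampling with the same budget.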

Definition

Rao-Blackwellization

If $X = (U, V)$ and you can compute $h(u) = \mathbb{E}[f(U,V) \mid U = u]$ analytically, then replace $f(X_i)$ with $h(U_i)$:

$$\hat{\mu}_n^{\text{RB}} = \frac{1}{n} \sum_{i=1}^{n} h(U_i)$$

By the law of total variance, $\mathrm{Var}(h(U)) = \mathrm{Var}(f(X)) - \mathbb{E}[\mathrm{Var}(f(X) \mid U)] \leq \mathrm{Var}(f(X))$. Conditioning out part of the randomness never increases the variance.
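A hedged sketch with a toy example of my own: estimate $P(U + V > 0) = 1/2$ for independent standard normals. Conditioning on $U$ gives $h(u) = P(V > -u) = \Phi(u)$ in closed form, and since $\Phi(U) \sim \text{Uniform}(0,1)$, the Rao-Blackwellized estimator has per-sample variance $1/12$ instead of $1/4$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 50_000
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Crude estimator: f(U, V) = 1{U + V > 0}; true value is 1/2 by symmetry
crude = (u + v > 0).astype(float)

# Rao-Blackwellized: h(u) = E[1{u + V > 0} | U = u] = P(V > -u) = Phi(u),
# computed via the standard normal CDF Phi(t) = (1 + erf(t / sqrt(2))) / 2
phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
rb = phi(u)

print(crude.mean(), rb.mean())   # both estimate 1/2
print(crude.var(), rb.var())     # ~1/4 vs ~1/12: conditioning cuts variance 3x
```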

Main Theorems

Proposition

Optimal Control Variate Coefficient

Statement

The variance of the control variate estimator $\hat{\mu}^{\text{CV}} = f(X) - c\,(g(X) - \mu_g)$ is minimized by:

$$c^* = \frac{\mathrm{Cov}(f(X), g(X))}{\mathrm{Var}(g(X))}$$

The minimum variance is:

$$\mathrm{Var}(f - c^* g) = \mathrm{Var}(f)\,(1 - \rho_{fg}^2)$$

where $\rho_{fg}$ is the correlation between $f(X)$ and $g(X)$.

Intuition

The optimal $c^*$ is the regression coefficient of $f$ on $g$. The variance reduction factor is $1 - \rho_{fg}^2$: if $f$ and $g$ are highly correlated ($|\rho_{fg}| \approx 1$), almost all variance is eliminated. The control variate is like subtracting the "explained" part of $f$.

Proof Sketch

$\mathrm{Var}(f - cg) = \mathrm{Var}(f) - 2c\,\mathrm{Cov}(f,g) + c^2\,\mathrm{Var}(g)$. This is a quadratic in $c$ with minimum at $c^* = \mathrm{Cov}(f,g)/\mathrm{Var}(g)$. Substituting back gives $\mathrm{Var}(f)(1 - \rho_{fg}^2)$.
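Both identities in the proof sketch can be checked numerically. The following is my own sketch with an arbitrary correlated pair ($f = e^X$, $g = X$ for standard normal $X$): the plug-in $c^*$ should match an OLS regression slope, and the achieved variance ratio should match $1 - \rho^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)
f, g = np.exp(x), x                      # an arbitrary correlated pair

c_star = np.cov(f, g)[0, 1] / np.var(g)  # Cov(f, g) / Var(g)
slope = np.polyfit(g, f, 1)[0]           # OLS regression slope of f on g
rho = np.corrcoef(f, g)[0, 1]

print(c_star, slope)                                     # essentially identical
print(np.var(f - c_star * g) / np.var(f), 1 - rho**2)    # both ≈ 1 - rho^2
```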

Why It Matters

This tells you exactly how powerful a control variate will be: it depends entirely on the correlation $\rho_{fg}$. With $|\rho_{fg}| = 0.99$, the remaining variance is $1 - 0.99^2 \approx 2\%$ of the original, equivalent to using roughly $50\times$ more samples. In practice, you estimate $c^*$ from data, which adds small overhead.

Failure Mode

If $\rho_{fg} \approx 0$, the control variate does nothing. Also, estimating $c^*$ from the same samples introduces a small bias in finite samples (the product of two estimated quantities). With large $n$, this bias is negligible.

Theorem

Rao-Blackwell Theorem (Variance Reduction)

Statement

Let $h(U) = \mathbb{E}[f(U,V) \mid U]$. Then:

$$\mathbb{E}[h(U)] = \mathbb{E}[f(X)], \qquad \mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$$

with equality if and only if $f(X) = h(U)$ almost surely (i.e., $f$ does not actually depend on $V$).

Intuition

By conditioning on UU, you analytically average out the randomness in VV. This removes the component of variance due to VV, leaving only the variance due to UU. It is like having infinitely many samples of VV for each value of UU.

Proof Sketch

By the law of total variance: $\mathrm{Var}(f(X)) = \mathrm{Var}(\mathbb{E}[f \mid U]) + \mathbb{E}[\mathrm{Var}(f \mid U)] = \mathrm{Var}(h(U)) + \mathbb{E}[\mathrm{Var}(f \mid U)]$. Since $\mathbb{E}[\mathrm{Var}(f \mid U)] \geq 0$, we get $\mathrm{Var}(h(U)) \leq \mathrm{Var}(f(X))$.

Why It Matters

Rao-Blackwellization is the most principled variance reduction technique: it is guaranteed to help and never hurts. In Gibbs sampling, if you can compute conditional expectations for some variables analytically, always do so. It is free variance reduction.

Failure Mode

The technique requires being able to compute $\mathbb{E}[f \mid U]$ analytically, which is often intractable. It also requires choosing the partition $(U,V)$ wisely. If $V$ contributes little variance, the reduction is small.

Canonical Examples

Example

Control variate for option pricing

Estimating $\mathbb{E}[\max(S_T - K, 0)]$ for a call option. Use $g(X) = S_T$ (the terminal stock price) as a control variate. Under risk-neutral pricing, $\mathbb{E}[S_T] = S_0 e^{rT}$ is known. Since the payoff is highly correlated with $S_T$, this dramatically reduces variance.
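A sketch of this setup under geometric Brownian motion; the parameters ($S_0 = 100$, $K = 100$, $r = 0.05$, $\sigma = 0.2$, $T = 1$) are illustrative choices of mine, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0   # illustrative parameters

# Risk-neutral GBM terminal price: S_T = S0 exp((r - sigma^2/2) T + sigma sqrt(T) Z)
z = rng.standard_normal(n)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)

payoff = np.maximum(ST - K, 0.0)   # f(X): undiscounted call payoff
mu_g = S0 * np.exp(r * T)          # known E[S_T] under the risk-neutral measure

c = np.cov(payoff, ST)[0, 1] / np.var(ST)      # plug-in coefficient
cv_est = (payoff - c * (ST - mu_g)).mean()

print(payoff.mean(), cv_est)                    # same target, tighter estimate
print(np.var(payoff), np.var(payoff - c * ST))  # variance drops substantially
```

For an at-the-money call the payoff-price correlation is high, so the control variate shrinks the per-sample variance by a large factor at essentially no extra cost.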

Example

Antithetic variates for integral estimation

Estimate $\int_0^1 e^x \, dx$. Generate $U_i \sim \text{Uniform}(0,1)$ and use pairs $(U_i, 1-U_i)$. Since $e^x$ is increasing, $e^{U_i}$ and $e^{1-U_i}$ are negatively correlated. The estimator $\frac{1}{2}(e^{U_i} + e^{1-U_i})$ has lower variance than $e^{U_i}$ alone.
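The example above, sketched in code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
u = rng.uniform(size=n)

naive = np.exp(u)                          # plain estimator values
anti = 0.5 * (np.exp(u) + np.exp(1 - u))   # antithetic pair averages

print(naive.mean(), anti.mean())           # both approximate e - 1 ≈ 1.71828
print(naive.var(), anti.var())             # antithetic variance is far smaller
```

Note that each antithetic term spends two function evaluations, so the fair comparison is against $2n$ naive samples; the reduction here (roughly $60\times$ per pair for this integrand) still wins by a wide margin.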

Common Confusions

Watch Out

Variance reduction does not change the rate

All these techniques reduce the constant in $\mathrm{Var} = C/n$ but keep the $1/n$ rate. You still need $4\times$ the samples to halve the error. The improvement is in the constant $C$, which can be enormous in practice but does not change the asymptotic rate.

Watch Out

Control variates require known expectations

The control variate $g$ must have a known mean $\mu_g$. If you have to estimate $\mu_g$, it is no longer a control variate; it becomes a regression-adjustment problem whose extra estimation error must be accounted for.

Summary

  • Antithetic variates: use negative correlation between sample pairs
  • Control variates: subtract a known-mean quantity correlated with the target; the optimal coefficient is $\mathrm{Cov}(f,g)/\mathrm{Var}(g)$
  • Stratification: partition the space and sample within strata; with proportional allocation it never hurts
  • Rao-Blackwellization: condition out part of the randomness analytically; guaranteed by the law of total variance not to increase variance
  • Variance reduction changes the constant, not the $O(1/\sqrt{n})$ rate

Exercises

ExerciseCore

Problem

You want to estimate $\mathbb{E}[e^X]$ where $X \sim N(0,1)$. Explain how to use antithetic variates and why it reduces variance.

ExerciseAdvanced

Problem

Derive the optimal control variate coefficient $c^*$ and show that the minimum variance is $(1 - \rho^2) \cdot \mathrm{Var}(f)$, where $\rho$ is the correlation between $f(X)$ and $g(X)$.

References

Canonical:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapter 4
  • Ross, Simulation (2012), Chapter 9

Current:

  • Owen, Monte Carlo Theory, Methods, and Examples (2013), Chapters 8-9
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 10-12
  • Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Last reviewed: April 2026
