Statistical Foundations

Survey Sampling Methods

The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.


Why This Matters

Most ML datasets are not random samples from well-defined populations. But many important datasets are: government surveys, clinical trials, epidemiological studies. If you use data from the Current Population Survey or the American Community Survey without accounting for the sampling design, your standard errors will be wrong and your point estimates may be biased.

Understanding sampling design is also the foundation for understanding when and why observational data fails. Selection bias is a sampling problem.

Mental Model

You want to estimate a population quantity (a mean, a total, a proportion) but you cannot observe the entire population. You select a subset according to a known probabilistic rule. The rule you choose determines the precision, cost, and bias of your estimator.

Different rules trade off variance against cost. Stratified sampling reduces variance by ensuring representation. Cluster sampling reduces cost by sampling groups instead of individuals. The choice is never free: what reduces cost often increases variance.

Core Definitions

Definition

Probability Sampling Design

A probability sampling design assigns a known, positive probability $p(S)$ to every possible sample $S$ that could be drawn from the finite population $U = \{1, 2, \ldots, N\}$. The first-order inclusion probability of unit $i$ is $\pi_i = \Pr(i \in S) = \sum_{S \ni i} p(S)$.

The requirement $\pi_i > 0$ for all $i$ is non-negotiable. If any unit has zero inclusion probability, the design cannot produce unbiased estimates of population quantities that depend on that unit.

Definition

Simple Random Sampling (SRS)

Draw $n$ units from a population of $N$ without replacement, with each of the $\binom{N}{n}$ possible samples equally likely. Every unit has inclusion probability $\pi_i = n/N$. The sample mean $\bar{y}_S = \frac{1}{n}\sum_{i \in S} y_i$ is unbiased for the population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.

The variance of the sample mean under SRS is:

$$\text{Var}(\bar{y}_S) = \frac{S_y^2}{n}\left(1 - \frac{n}{N}\right)$$

where $S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{Y})^2$ and the factor $(1 - n/N)$ is the finite population correction (FPC).
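The formula is easy to verify numerically. A minimal Python sketch (the population and seed below are synthetic, purely illustrative) checks the design variance against a Monte Carlo estimate over repeated samples:

```python
import random

def srs_variance(S2, n, N):
    """Design variance of the SRS sample mean: (S^2 / n) * (1 - n / N)."""
    return (S2 / n) * (1 - n / N)

# Synthetic population, purely illustrative.
random.seed(0)
population = [random.gauss(50, 5) for _ in range(10_000)]
N, n = len(population), 400

Ybar = sum(population) / N
S2 = sum((y - Ybar) ** 2 for y in population) / (N - 1)
theoretical = srs_variance(S2, n, N)

# Monte Carlo: empirical variance of the SRS mean should match the formula.
means = []
for _ in range(2_000):
    sample = random.sample(population, n)  # SRS without replacement
    means.append(sum(sample) / n)
mbar = sum(means) / len(means)
empirical = sum((x - mbar) ** 2 for x in means) / (len(means) - 1)
```

Note that with $n/N = 0.04$ the FPC shrinks the variance by only 4 percent; the correction matters only when the sample is a sizable fraction of the population.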

Definition

Stratified Sampling

Partition the population into $H$ non-overlapping strata $U_1, U_2, \ldots, U_H$ with $\bigcup_h U_h = U$. Draw an independent SRS of size $n_h$ from stratum $h$. The stratified estimator of the population mean is:

$$\bar{y}_{\text{st}} = \sum_{h=1}^{H} W_h \bar{y}_h$$

where $W_h = N_h / N$ is the stratum weight and $\bar{y}_h$ is the sample mean in stratum $h$.

The variance is $\text{Var}(\bar{y}_{\text{st}}) = \sum_{h=1}^{H} W_h^2 \frac{S_h^2}{n_h}(1 - n_h/N_h)$, which is smaller than the SRS variance when strata are internally homogeneous.
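Both formulas translate directly into code. A minimal sketch, assuming an independent SRS within each stratum (the function names and example numbers are invented for illustration):

```python
def stratified_mean(strata_samples, strata_sizes):
    """Stratified estimator of the population mean: sum_h W_h * ybar_h."""
    N = sum(strata_sizes)
    return sum(
        (N_h / N) * (sum(s) / len(s))
        for s, N_h in zip(strata_samples, strata_sizes)
    )

def stratified_variance(strata_S2, strata_n, strata_sizes):
    """Design variance: sum_h W_h^2 * (S_h^2 / n_h) * (1 - n_h / N_h)."""
    N = sum(strata_sizes)
    return sum(
        (N_h / N) ** 2 * (S2_h / n_h) * (1 - n_h / N_h)
        for S2_h, n_h, N_h in zip(strata_S2, strata_n, strata_sizes)
    )

# Two hypothetical strata: weights 0.8 and 0.2, stratum means 3 and 15.
est = stratified_mean([[2, 4], [10, 20]], [80, 20])  # 0.8*3 + 0.2*15 = 5.4
```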

Definition

Cluster Sampling

Divide the population into $M$ clusters (groups of units). Sample $m$ clusters, then observe all units within the selected clusters. If each cluster has $B$ units and clusters are chosen by SRS, the estimator of the population mean is:

$$\bar{y}_{\text{cl}} = \frac{1}{mB}\sum_{i \in \text{selected clusters}} y_i$$

Cluster sampling is cheaper (you only need to visit $m$ locations) but has higher variance than SRS when units within clusters are similar to each other. The intraclass correlation $\rho$ quantifies this: higher $\rho$ means worse precision for a given total sample size.

Definition

Design Effect (DEFF)

The design effect is the ratio of the variance under the actual sampling design to the variance under SRS with the same total sample size $n$:

$$\text{DEFF} = \frac{\text{Var}_{\text{design}}(\hat{\theta})}{\text{Var}_{\text{SRS}}(\hat{\theta})}$$

For cluster sampling with equal cluster sizes $B$ and intraclass correlation $\rho$: $\text{DEFF} \approx 1 + (B - 1)\rho$. A DEFF of 2 means you need twice the sample size to achieve the same precision as SRS.
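The approximation gives a quick back-of-envelope tool. A short sketch (the cluster size, $\rho$, and sample size below are hypothetical):

```python
def design_effect(B, rho):
    """Approximate DEFF for equal-size clusters: 1 + (B - 1) * rho."""
    return 1 + (B - 1) * rho

def effective_sample_size(n, B, rho):
    """SRS-equivalent size of n observations collected in clusters of B."""
    return n / design_effect(B, rho)

# Hypothetical survey: 1,000 respondents in clusters of 20 with rho = 0.05.
deff = design_effect(20, 0.05)                  # 1.95
n_eff = effective_sample_size(1000, 20, 0.05)   # roughly 513 SRS-equivalent
```

Even a modest $\rho$ of 0.05 nearly halves the effective sample size once clusters reach 20 units.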

Systematic, Multi-Stage, and Multi-Phase Designs

Systematic sampling: select a random start $k$ between 1 and the sampling interval $K = N/n$ (assume for simplicity that $K$ is an integer), then take every $K$-th unit: $k, k+K, k+2K, \ldots$. Simple to implement. Variance depends on the ordering of the population: if there is a periodic pattern with period $K$, systematic sampling can be catastrophically bad.
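The periodicity hazard is easy to demonstrate. In this sketch the population values repeat with exactly the sampling interval's period, so every systematic sample is constant (all names and numbers are illustrative):

```python
import random

def systematic_sample(N, K, start=None):
    """1-based indices k, k+K, k+2K, ... for a random start k in 1..K."""
    if start is None:
        start = random.randint(1, K)
    return list(range(start, N + 1, K))

# Adversarial ordering: population values are periodic with period K = 10.
N, K = 100, 10
population = {i: i % K for i in range(1, N + 1)}

idx = systematic_sample(N, K, start=3)
values = [population[i] for i in idx]
# All sampled indices are congruent mod K, so every sampled value is 3,
# while the true population mean of i % K over 1..100 is 4.5.
```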

Multi-stage sampling: first sample clusters (primary sampling units), then sample units within selected clusters (secondary sampling units). Can have more than two stages. The U.S. Census Bureau uses multi-stage designs for most of its surveys. The variance has contributions from each stage.

Multi-phase sampling: collect cheap information (e.g., a short screening questionnaire) on a large sample, then collect expensive information (e.g., a blood test) on a subsample selected based on the first-phase data. This is cost-efficient when the expensive measurement is highly correlated with the cheap one.

Main Theorems

Theorem

Horvitz-Thompson Estimator

Statement

For any probability sampling design with inclusion probabilities $\pi_i > 0$ for all $i$, the Horvitz-Thompson estimator of the population total $T = \sum_{i=1}^{N} y_i$ is:

$$\hat{T}_{\text{HT}} = \sum_{i \in S} \frac{y_i}{\pi_i}$$

This estimator is unbiased: $\mathbb{E}[\hat{T}_{\text{HT}}] = T$. Its variance is:

$$\text{Var}(\hat{T}_{\text{HT}}) = \sum_{i=1}^{N}\sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j)\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}$$

where $\pi_{ij} = \Pr(i \in S \text{ and } j \in S)$ is the joint inclusion probability.

Intuition

Each sampled unit $i$ "represents" $1/\pi_i$ population units. If a unit has inclusion probability 0.01, it represents 100 units in the population. The estimator weights each observed value by the inverse of its selection probability, which exactly corrects for the unequal probabilities of selection.

Proof Sketch

Define the indicator $Z_i = \mathbf{1}(i \in S)$. Then $\hat{T}_{\text{HT}} = \sum_{i=1}^{N} Z_i y_i / \pi_i$. Taking expectations: $\mathbb{E}[\hat{T}_{\text{HT}}] = \sum_{i=1}^{N} \mathbb{E}[Z_i] y_i / \pi_i = \sum_{i=1}^{N} \pi_i y_i / \pi_i = T$. The variance expression follows from $\text{Cov}(Z_i, Z_j) = \pi_{ij} - \pi_i \pi_j$.
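The unbiasedness argument can be checked by simulation. The sketch below uses Poisson sampling, where each unit enters the sample independently with probability $\pi_i$; this design is chosen here only because it is easy to simulate, and the population and probabilities are synthetic:

```python
import random

def horvitz_thompson_total(sample_y, sample_pi):
    """HT estimator of the population total: sum over the sample of y_i / pi_i."""
    return sum(y / pi for y, pi in zip(sample_y, sample_pi))

# Synthetic population with unequal but known inclusion probabilities.
random.seed(1)
y = [random.uniform(1, 100) for _ in range(500)]
pi = [min(1.0, 0.1 + yi / 200) for yi in y]
true_total = sum(y)

# Poisson sampling: each unit included independently with probability pi_i.
estimates = []
for _ in range(2_000):
    included = [(yi, pii) for yi, pii in zip(y, pi) if random.random() < pii]
    s_y = [yi for yi, _ in included]
    s_pi = [pii for _, pii in included]
    estimates.append(horvitz_thompson_total(s_y, s_pi))

avg = sum(estimates) / len(estimates)  # close to true_total on average
```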

Why It Matters

The Horvitz-Thompson estimator is the workhorse of survey statistics. It is the unique linear unbiased estimator that depends only on inclusion probabilities. It handles any probability sampling design: SRS, stratified, cluster, multi-stage, or anything else with known $\pi_i$.

Failure Mode

If inclusion probabilities are unknown or zero for some units, the estimator is undefined or biased. In practice, this happens with non-probability samples (convenience samples, web panels, voluntary surveys). If $\pi_i$ is very small for some unit $i$ and $y_i$ is large, the term $y_i / \pi_i$ can dominate the estimate and cause high variance. Trimming or capping weights is common but introduces bias.
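Weight capping itself is a one-liner. A minimal sketch of the mechanics (the threshold and weights here are arbitrary illustrations, not a recommended rule):

```python
def cap_weights(weights, cap):
    """Cap design weights w_i = 1 / pi_i at a threshold.

    Tames the variance contribution of extreme weights, at the cost of
    bias: capped units no longer fully represent their population share.
    """
    return [min(w, cap) for w in weights]

# Illustrative weights, e.g. 1/pi_i for pi_i = 0.5, 0.1, 0.001.
weights = [2.0, 10.0, 1000.0]
capped = cap_weights(weights, 100.0)  # [2.0, 10.0, 100.0]
```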

Common Confusions

Watch Out

Cluster sampling is not stratified sampling

Stratification divides the population into groups and samples within every group. Cluster sampling divides the population into groups and samples entire groups. Stratification reduces variance (by ensuring representation). Cluster sampling typically increases variance (because units within clusters are similar). They are opposite strategies.

Watch Out

Larger samples are not always better

Doubling the sample size under a bad design can be worse than halving it under a good design. If you cluster-sample with high intraclass correlation, adding more units within the same clusters barely helps. You would be better off sampling more clusters with fewer units per cluster.
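Using the DEFF approximation, this tradeoff reduces to simple arithmetic. A sketch comparing two allocations of the same total sample size (the cluster counts and $\rho$ are hypothetical):

```python
def relative_variance(m, B, rho):
    """Variance of the mean relative to one SRS draw: DEFF / n, with n = m * B."""
    return (1 + (B - 1) * rho) / (m * B)

rho = 0.05
few_big = relative_variance(50, 40, rho)      # n = 2000 in 50 clusters of 40
many_small = relative_variance(100, 20, rho)  # n = 2000 in 100 clusters of 20
# Same total n, but more clusters with fewer units gives lower variance.
```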

Watch Out

Random sampling does not mean haphazard sampling

Probability sampling requires a well-defined sampling frame and a randomization mechanism with known probabilities. Grabbing the first 100 people you encounter is not random sampling. It is a convenience sample with unknown inclusion probabilities, and no design-based inference is valid.

Summary

  • Every probability sampling design assigns known, positive inclusion probabilities to all units
  • SRS is the baseline; stratified sampling beats SRS when strata are homogeneous
  • Cluster sampling saves cost but increases variance, quantified by the design effect
  • The Horvitz-Thompson estimator is unbiased for any probability design with known $\pi_i$
  • Multi-stage and multi-phase designs are the practical workhorses for large-scale surveys
  • Ignoring the sampling design when analyzing survey data gives wrong standard errors

Exercises

ExerciseCore

Problem

A population of $N = 10{,}000$ has variance $S_y^2 = 25$. You take an SRS of $n = 400$. Compute the variance of the sample mean, with and without the finite population correction. How much does the FPC matter here?

ExerciseAdvanced

Problem

A population has two strata. Stratum 1 has $N_1 = 8000$ units with $S_1^2 = 10$. Stratum 2 has $N_2 = 2000$ units with $S_2^2 = 100$. You have budget for $n = 500$ total samples. Compare proportional allocation ($n_h \propto N_h$) with Neyman allocation ($n_h \propto N_h S_h$). Which gives smaller variance for the stratified mean?

References

Canonical:

  • Cochran, Sampling Techniques (1977), Chapters 2-5, 9-11
  • Särndal, Swensson, Wretman, Model Assisted Survey Sampling (1992), Chapters 2-4

Current:

  • Lohr, Sampling: Design and Analysis (2021), Chapters 1-6

  • Fuller, Sampling Statistics (2009), Chapters 1-3

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
