Survey Sampling Methods
The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.
Why This Matters
Most ML datasets are not random samples from well-defined populations. But many important datasets are: government surveys, clinical trials, epidemiological studies. If you use data from the Current Population Survey or the American Community Survey without accounting for the sampling design, your standard errors will be wrong and your point estimates may be biased.
Understanding sampling design is also the foundation for understanding when and why observational data fails. Selection bias is a sampling problem.
Mental Model
You want to estimate a population quantity (a mean, a total, a proportion) but you cannot observe the entire population. You select a subset according to a known probabilistic rule. The rule you choose determines the precision, cost, and bias of your estimator.
Different rules trade off variance against cost. Stratified sampling reduces variance by ensuring representation. Cluster sampling reduces cost by sampling groups instead of individuals. The choice is never free: what reduces cost often increases variance.
Core Definitions
Probability Sampling Design
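A probability sampling design assigns a known probability $p(s)$ to every possible sample $s$ that could be drawn from the finite population $U = \{1, \ldots, N\}$, in such a way that every unit has a positive chance of selection. The first-order inclusion probability of unit $i$ is $\pi_i = P(i \in s)$, the total probability of all samples that contain unit $i$.
The requirement $\pi_i > 0$ for all $i \in U$ is non-negotiable. If any unit has zero inclusion probability, the design cannot produce unbiased estimates of population quantities that depend on that unit.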
Simple Random Sampling (SRS)
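Draw $n$ units from a population of $N$ without replacement, with each of the $\binom{N}{n}$ possible samples equally likely. Every unit has inclusion probability $\pi_i = n/N$. The sample mean $\bar{y} = \frac{1}{n}\sum_{i \in s} y_i$ is unbiased for the population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.
The variance of the sample mean under SRS is:
$$\mathrm{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$$
where $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{Y})^2$ and the factor $(1 - n/N)$ is the finite population correction.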
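The SRS variance formula can be checked by brute force on a tiny population. The sketch below enumerates every equally likely sample and compares the exact variance of the sample mean with $(1 - n/N)\,S^2/n$. The population values and function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

def srs_mean_variance_exact(population, n):
    """Variance of the sample mean over all equally likely SRS samples."""
    means = [mean(s) for s in combinations(population, n)]
    # Every sample has the same probability, so the sampling variance is
    # the plain (divide-by-count) variance of the list of sample means.
    return pvariance(means)

def srs_mean_variance_formula(population, n):
    """(1 - n/N) * S^2 / n, where S^2 uses the N - 1 divisor."""
    N = len(population)
    S2 = variance(population)  # statistics.variance divides by N - 1
    return (1 - n / N) * S2 / n

population = [2, 4, 4, 7, 9, 10]  # hypothetical values, N = 6
exact = srs_mean_variance_exact(population, 3)
formula = srs_mean_variance_formula(population, 3)
print(exact, formula)  # the two numbers should agree up to rounding
```

Enumeration is only feasible for very small $N$; it is used here because it checks the formula exactly rather than by simulation.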
Stratified Sampling
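Partition the population into $H$ non-overlapping strata of sizes $N_1, \ldots, N_H$ with $\sum_{h=1}^{H} N_h = N$. Draw an independent SRS of size $n_h$ from stratum $h$. The stratified estimator of the population mean is:
$$\bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h$$
where $W_h = N_h/N$ is the stratum weight and $\bar{y}_h$ is the sample mean in stratum $h$.
The variance is $\mathrm{Var}(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}$, where $S_h^2$ is the variance within stratum $h$. This is smaller than the SRS variance for the same total sample size when strata are internally homogeneous, because only within-stratum variation contributes.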
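A minimal sketch of the stratified estimator and a plug-in estimate of its variance follows. It assumes the within-stratum sample variance $s_h^2$ is substituted for the unknown $S_h^2$, which is a standard choice but an assumption here. The data, stratum sizes, and function names are hypothetical.

```python
from statistics import mean, variance

def stratified_mean(samples, stratum_sizes):
    """Weighted sum of stratum sample means with weights W_h = N_h / N."""
    N = sum(stratum_sizes)
    return sum(N_h / N * mean(y_h) for y_h, N_h in zip(samples, stratum_sizes))

def stratified_variance_estimate(samples, stratum_sizes):
    """Sum over strata of W_h^2 * (1 - n_h/N_h) * s_h^2 / n_h."""
    N = sum(stratum_sizes)
    total = 0.0
    for y_h, N_h in zip(samples, stratum_sizes):
        n_h = len(y_h)
        W_h = N_h / N
        s2_h = variance(y_h)  # sample variance, used in place of S_h^2
        total += W_h**2 * (1 - n_h / N_h) * s2_h / n_h
    return total

# Hypothetical data: two strata with population sizes 400 and 600.
samples = [[10, 12, 11, 13], [50, 54, 52, 48, 46]]
stratum_sizes = [400, 600]
estimate = stratified_mean(samples, stratum_sizes)
var_hat = stratified_variance_estimate(samples, stratum_sizes)
print(estimate, var_hat)
```

Each stratum needs at least two sampled units for `variance` to be defined.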
Cluster Sampling
Divide the population into $M$ clusters (groups of units). Sample $m$ clusters, then observe all units within selected clusters. If each cluster has $B$ units and we use SRS of clusters, the estimator of the population mean is:
$$\bar{y}_{cl} = \frac{1}{mB}\sum_{j \in s}\sum_{k=1}^{B} y_{jk}$$
where $y_{jk}$ is the value of unit $k$ in sampled cluster $j$.
Cluster sampling is cheaper (you only need to visit $m$ locations) but has higher variance than SRS when units within clusters are similar to each other. The intraclass correlation $\rho$ quantifies this: higher $\rho$ means worse precision for a given total sample size.
Design Effect (DEFF)
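The design effect is the ratio of the variance under the actual sampling design to the variance under SRS with the same total sample size $n$:
$$\mathrm{DEFF} = \frac{\mathrm{Var}_{\text{design}}(\hat{\theta})}{\mathrm{Var}_{\text{SRS}}(\hat{\theta})}$$
For cluster sampling with equal cluster sizes $B$ and intraclass correlation $\rho$: $\mathrm{DEFF} \approx 1 + (B - 1)\rho$. A DEFF of 2 means you need twice the sample size to achieve the same precision as SRS.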
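As a small illustration, the approximation $\mathrm{DEFF} \approx 1 + (B - 1)\rho$ for equal-sized clusters can be turned into an "effective sample size", the SRS sample size that would give the same precision. The function names and example values are hypothetical.

```python
def design_effect(cluster_size, rho):
    """Approximate DEFF for equal-sized clusters: 1 + (B - 1) * rho."""
    return 1 + (cluster_size - 1) * rho

def effective_sample_size(n, cluster_size, rho):
    """SRS sample size giving the same variance as n clustered units."""
    return n / design_effect(cluster_size, rho)

# Hypothetical survey: 1000 units in clusters of 20, intraclass correlation 0.05.
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(deff, n_eff)
```

Even a modest $\rho$ of 0.05 roughly halves the information in the sample when clusters hold 20 units each.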
Systematic, Multi-Stage, and Multi-Phase Designs
Systematic sampling: select a random start $r$ between 1 and the sampling interval $k = N/n$, then take every $k$-th unit: $r, r + k, r + 2k, \ldots$. Simple to implement. Variance depends on the ordering of the population. If there is a periodic pattern with period $k$, systematic sampling can be catastrophically bad, because every selected unit falls at the same point in the cycle.
Multi-stage sampling: first sample clusters (primary sampling units), then sample units within selected clusters (secondary sampling units). Can have more than two stages. The U.S. Census Bureau uses multi-stage designs for most of its surveys. The variance has contributions from each stage.
Multi-phase sampling: collect cheap information (e.g., a short screening questionnaire) on a large sample, then collect expensive information (e.g., a blood test) on a subsample selected based on the first-phase data. This is cost-efficient when the expensive measurement is highly correlated with the cheap one.
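The periodicity warning for systematic sampling can be made concrete with a toy population. The sketch below builds a hypothetical list with a spike every 7th unit, then compares all possible systematic samples for interval 7 (matching the period) and interval 5 (not matching). All values and names are illustrative assumptions.

```python
from statistics import mean

def systematic_sample(population, k, start):
    """Take every k-th unit, beginning at 1-based position `start`."""
    return population[start - 1::k]

# Hypothetical population of 70 units with a spike at every 7th position.
population = [100 if i % 7 == 6 else 10 for i in range(70)]
true_mean = mean(population)

# Interval equal to the period: each start sees only one phase of the cycle.
estimates_k7 = [mean(systematic_sample(population, 7, r)) for r in range(1, 8)]

# Interval that does not share a factor with the period: phases are mixed.
estimates_k5 = [mean(systematic_sample(population, 5, r)) for r in range(1, 6)]

print(true_mean)
print(estimates_k7)
print(estimates_k5)
```

With interval 7, six of the seven possible samples contain no spikes at all and the seventh contains only spikes, so any single sample is badly off even though the estimator is unbiased on average over starts. With interval 5, every sample mixes the phases of the cycle.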
Main Theorems
Horvitz-Thompson Estimator
Statement
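For any probability sampling design with inclusion probabilities $\pi_i > 0$ for all units $i$ in the population $U$, the Horvitz-Thompson estimator of the population total $T = \sum_{i \in U} y_i$ is:
$$\hat{T}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}$$
This estimator is unbiased: $E[\hat{T}_{HT}] = T$. Its variance is:
$$\mathrm{Var}(\hat{T}_{HT}) = \sum_{i \in U}\sum_{j \in U} (\pi_{ij} - \pi_i \pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j}$$
where $\pi_{ij} = P(i \in s,\, j \in s)$ is the joint inclusion probability, with $\pi_{ii} = \pi_i$.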
Intuition
Each sampled unit "represents" $1/\pi_i$ population units. If a unit has inclusion probability 0.01, it represents 100 units in the population. The estimator weights each observed value by the inverse of its selection probability, which exactly corrects for the unequal probabilities of selection.
Proof Sketch
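Define the indicator $I_i = \mathbf{1}\{i \in s\}$, so that $E[I_i] = \pi_i$. Then $\hat{T}_{HT} = \sum_{i \in U} I_i\, y_i / \pi_i$. Taking expectations: $E[\hat{T}_{HT}] = \sum_{i \in U} E[I_i]\, y_i / \pi_i = \sum_{i \in U} y_i = T$. The variance follows from $\mathrm{Cov}(I_i, I_j) = \pi_{ij} - \pi_i \pi_j$.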
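The unbiasedness argument can be checked numerically on a small example. The sketch below assumes Poisson sampling (each unit included independently with its own probability), a design not discussed above but chosen because all $2^N$ samples are easy to enumerate. Under that assumption the joint inclusion probability for two distinct units is the product of their individual probabilities, so the variance formula reduces to $\sum_i (1 - \pi_i)\, y_i^2 / \pi_i$. Values and names are hypothetical.

```python
from itertools import product

def horvitz_thompson_total(sample_values, sample_probs):
    """Sum of y_i / pi_i over the sampled units."""
    return sum(y / p for y, p in zip(sample_values, sample_probs))

def poisson_design(probs):
    """Yield (indicators, probability) for every sample under Poisson sampling."""
    for indicators in product([0, 1], repeat=len(probs)):
        p = 1.0
        for included, pi_i in zip(indicators, probs):
            p *= pi_i if included else (1 - pi_i)
        yield indicators, p

y = [3.0, 8.0, 20.0, 50.0]   # hypothetical population values
pi = [0.2, 0.4, 0.5, 0.9]    # hypothetical inclusion probabilities

expected = 0.0
second_moment = 0.0
for indicators, p in poisson_design(pi):
    values = [y_i for y_i, inc in zip(y, indicators) if inc]
    probs = [pi_i for pi_i, inc in zip(pi, indicators) if inc]
    estimate = horvitz_thompson_total(values, probs)
    expected += p * estimate
    second_moment += p * estimate**2

exact_variance = second_moment - expected**2
formula_variance = sum((1 - p) / p * v**2 for v, p in zip(y, pi))
print(expected, sum(y))                  # expectation vs true total
print(exact_variance, formula_variance)  # enumerated vs formula variance
```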
Why It Matters
The Horvitz-Thompson estimator is the workhorse of survey statistics. Among linear estimators $\sum_{i \in s} w_i y_i$ whose weight $w_i$ is fixed for each unit regardless of which other units are sampled, it is the unique unbiased one, with $w_i = 1/\pi_i$. It handles any probability sampling design: SRS, stratified, cluster, multi-stage, or anything else with known $\pi_i$.
Failure Mode
If inclusion probabilities are unknown or zero for some units, the estimator is undefined or biased. In practice, this happens with non-probability samples (convenience samples, web panels, voluntary surveys). If $\pi_i$ is very small for some unit and $y_i$ is large, the term $y_i/\pi_i$ can dominate the estimate and cause high variance. Trimming or capping weights is common but introduces bias.
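A rough sketch of the simplest form of weight capping follows, to show how one extreme weight can dominate and what capping does to the estimate. The cap value, the data, and the function names are hypothetical, and practical trimming procedures usually also redistribute the removed weight, which this sketch does not attempt.

```python
def capped_weights(probs, cap):
    """Inverse-probability weights 1/pi_i, truncated at `cap`."""
    return [min(1 / p, cap) for p in probs]

def weighted_total(values, weights):
    """Weighted sum of observed values."""
    return sum(w * y for w, y in zip(weights, values))

values = [5.0, 7.0, 6.0, 90.0]    # hypothetical sampled values
probs = [0.5, 0.5, 0.5, 0.005]    # last unit has a tiny inclusion probability

ht_weights = [1 / p for p in probs]
ht_total = weighted_total(values, ht_weights)
capped_total = weighted_total(values, capped_weights(probs, cap=20))
print(ht_total, capped_total)
```

The last unit contributes almost all of the uncapped estimate. Capping removes that instability, but the capped estimator no longer weights by $1/\pi_i$ and is therefore biased.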
Common Confusions
Cluster sampling is not stratified sampling
Stratification divides the population into groups and samples within every group. Cluster sampling divides the population into groups and samples entire groups. Stratification reduces variance (by ensuring representation). Cluster sampling typically increases variance (because units within clusters are similar). They are opposite strategies.
Larger samples are not always better
Doubling the sample size under a bad design can be worse than halving it under a good design. If you cluster-sample with high intraclass correlation, adding more units within the same clusters barely helps. You would be better off sampling more clusters with fewer units per cluster.
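A rough worked comparison, using the approximation $\mathrm{DEFF} \approx 1 + (B - 1)\rho$ for clusters of size $B$ and the effective sample size $n_{\text{eff}} = n / \mathrm{DEFF}$ (the SRS size with the same precision). The value $\rho = 0.2$ and both designs are hypothetical, chosen only for illustration:

```latex
\begin{aligned}
\text{Design A: } 10 \text{ clusters} \times 100 \text{ units},\; n = 1000:
  &\quad \mathrm{DEFF} \approx 1 + 99(0.2) = 20.8,
  &\quad n_{\text{eff}} = 1000 / 20.8 \approx 48 \\
\text{Design B: } 50 \text{ clusters} \times 10 \text{ units},\; n = 500:
  &\quad \mathrm{DEFF} \approx 1 + 9(0.2) = 2.8,
  &\quad n_{\text{eff}} = 500 / 2.8 \approx 179
\end{aligned}
```

Under these assumptions, Design B uses half as many units yet carries more than three times the effective information.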
Random sampling does not mean haphazard sampling
Probability sampling requires a well-defined sampling frame and a randomization mechanism with known probabilities. Grabbing the first 100 people you encounter is not random sampling. It is a convenience sample with unknown inclusion probabilities, and no design-based inference is valid.
Summary
- Every probability sampling design assigns known, positive inclusion probabilities to all units
- SRS is the baseline; stratified sampling beats SRS when strata are homogeneous
- Cluster sampling saves cost but increases variance, quantified by the design effect
- The Horvitz-Thompson estimator is unbiased for any probability design with known, positive $\pi_i$
- Multi-stage and multi-phase designs are the practical workhorses for large-scale surveys
- Ignoring the sampling design when analyzing survey data gives wrong standard errors
Exercises
Problem
A population of size $N$ has variance $S^2$. You take an SRS of size $n$. Compute the variance of the sample mean, with and without the finite population correction. How much does the FPC matter here?
Problem
A population has two strata. Stratum 1 has $N_1$ units with standard deviation $S_1$. Stratum 2 has $N_2$ units with standard deviation $S_2$. You have budget for $n$ total samples. Compare proportional allocation ($n_h \propto N_h$) with Neyman allocation ($n_h \propto N_h S_h$). Which gives smaller variance for the stratified mean?
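For readers who want to experiment with the allocation comparison, here is a minimal sketch. The stratum sizes, standard deviations, and budget below are hypothetical stand-ins, not the exercise's values, and the function names are assumptions. Allocations are left fractional rather than rounded to whole units.

```python
def proportional_allocation(n, sizes):
    """n_h proportional to stratum size N_h."""
    N = sum(sizes)
    return [n * N_h / N for N_h in sizes]

def neyman_allocation(n, sizes, sds):
    """n_h proportional to N_h * S_h."""
    products = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]

def stratified_variance(allocation, sizes, sds):
    """Sum over strata of W_h^2 * (1 - n_h/N_h) * S_h^2 / n_h."""
    N = sum(sizes)
    return sum(
        (N_h / N) ** 2 * (1 - n_h / N_h) * S_h**2 / n_h
        for n_h, N_h, S_h in zip(allocation, sizes, sds)
    )

sizes = [800, 200]   # hypothetical N_1, N_2
sds = [5.0, 30.0]    # hypothetical S_1, S_2
n = 100              # hypothetical budget

prop = proportional_allocation(n, sizes)
ney = neyman_allocation(n, sizes, sds)
var_prop = stratified_variance(prop, sizes, sds)
var_ney = stratified_variance(ney, sizes, sds)
print(prop, var_prop)
print(ney, var_ney)
```

Neyman allocation shifts sample toward the small but highly variable stratum, which is where an extra observation reduces the variance most.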
References
Canonical:
- Cochran, Sampling Techniques (1977), Chapters 2-5, 9-11
- Särndal, Swensson, Wretman, Model Assisted Survey Sampling (1992), Chapters 2-4
Current:
- Lohr, Sampling: Design and Analysis (2021), Chapters 1-6
- Fuller, Sampling Statistics (2009), Chapters 1-3
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
- Sample size determination: how many units to sample under different designs
- Design-based vs model-based inference: two philosophies of survey inference
- Nonresponse and missing data: what to do when sampled units do not respond
Last reviewed: April 2026