Statistical Foundations

Survey Sampling Methods

The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.


Why This Matters

Most ML datasets are not random samples from well-defined populations. But many important datasets are: government surveys, clinical trials, epidemiological studies. If you use data from the Current Population Survey or the American Community Survey without accounting for the sampling design, your standard errors will be wrong and your point estimates may be biased.

Understanding sampling design is also the foundation for understanding when and why observational data fails. Selection bias is a sampling problem.

Mental Model

You want to estimate a population quantity (a mean, a total, a proportion) but you cannot observe the entire population. You select a subset according to a known probabilistic rule. The rule you choose determines the precision, cost, and bias of your estimator.

Different rules trade off variance against cost. Stratified sampling reduces variance by ensuring representation. Cluster sampling reduces cost by sampling groups instead of individuals. The choice is never free: what reduces cost often increases variance.

Core Definitions

Definition

Probability Sampling Design

A probability sampling design assigns a known, positive probability $p(S)$ to every possible sample $S$ that could be drawn from the finite population $U = \{1, 2, \ldots, N\}$. The first-order inclusion probability of unit $i$ is $\pi_i = \Pr(i \in S) = \sum_{S \ni i} p(S)$.

The requirement $\pi_i > 0$ for all $i$ is non-negotiable. If any unit has zero inclusion probability, the design cannot produce unbiased estimates of population quantities that depend on that unit.

Definition

Simple Random Sampling (SRS)

Draw $n$ units from a population of $N$ without replacement, with each of the $\binom{N}{n}$ possible samples equally likely. Every unit has inclusion probability $\pi_i = n/N$. The sample mean $\bar{y}_S = \frac{1}{n}\sum_{i \in S} y_i$ is unbiased for the population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.

The variance of the sample mean under SRS is:

$$\text{Var}(\bar{y}_S) = \frac{S_y^2}{n}\left(1 - \frac{n}{N}\right)$$

where $S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{Y})^2$ and the factor $(1 - n/N)$ is the finite population correction (FPC).
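The formula is easy to verify numerically. A minimal Python sketch (the population and seed below are synthetic, purely illustrative) checks the design variance against a Monte Carlo estimate over repeated samples:

```python
import random

def srs_variance(S2, n, N):
    """Design variance of the SRS sample mean: (S^2 / n) * (1 - n / N)."""
    return (S2 / n) * (1 - n / N)

# Synthetic population, purely illustrative.
random.seed(0)
population = [random.gauss(50, 5) for _ in range(10_000)]
N, n = len(population), 400

Ybar = sum(population) / N
S2 = sum((y - Ybar) ** 2 for y in population) / (N - 1)
theoretical = srs_variance(S2, n, N)

# Monte Carlo: empirical variance of the SRS mean should match the formula.
means = []
for _ in range(2_000):
    sample = random.sample(population, n)  # SRS without replacement
    means.append(sum(sample) / n)
mbar = sum(means) / len(means)
empirical = sum((x - mbar) ** 2 for x in means) / (len(means) - 1)
```

Note that with $n/N = 0.04$ the FPC shrinks the variance by only 4 percent; the correction matters only when the sample is a sizable fraction of the population.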

Definition

Stratified Sampling

Partition the population into $H$ non-overlapping strata $U_1, U_2, \ldots, U_H$ with $\bigcup_h U_h = U$. Draw an independent SRS of size $n_h$ from stratum $h$. The stratified estimator of the population mean is:

$$\bar{y}_{\text{st}} = \sum_{h=1}^{H} W_h \bar{y}_h$$

where $W_h = N_h / N$ is the stratum weight and $\bar{y}_h$ is the sample mean in stratum $h$.

The variance is $\text{Var}(\bar{y}_{\text{st}}) = \sum_{h=1}^{H} W_h^2 \frac{S_h^2}{n_h}(1 - n_h/N_h)$, which is smaller than the SRS variance when strata are internally homogeneous.
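Both formulas translate directly into code. A minimal sketch, assuming an independent SRS within each stratum (the function names and example numbers are invented for illustration):

```python
def stratified_mean(strata_samples, strata_sizes):
    """Stratified estimator of the population mean: sum_h W_h * ybar_h."""
    N = sum(strata_sizes)
    return sum(
        (N_h / N) * (sum(s) / len(s))
        for s, N_h in zip(strata_samples, strata_sizes)
    )

def stratified_variance(strata_S2, strata_n, strata_sizes):
    """Design variance: sum_h W_h^2 * (S_h^2 / n_h) * (1 - n_h / N_h)."""
    N = sum(strata_sizes)
    return sum(
        (N_h / N) ** 2 * (S2_h / n_h) * (1 - n_h / N_h)
        for S2_h, n_h, N_h in zip(strata_S2, strata_n, strata_sizes)
    )

# Two hypothetical strata: weights 0.8 and 0.2, stratum means 3 and 15.
est = stratified_mean([[2, 4], [10, 20]], [80, 20])  # 0.8*3 + 0.2*15 = 5.4
```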

Definition

Cluster Sampling

Divide the population into $M$ clusters (groups of units). Sample $m$ clusters, then observe all units within the selected clusters. If each cluster has $B$ units and clusters are chosen by SRS, the estimator of the population mean is:

$$\bar{y}_{\text{cl}} = \frac{1}{mB}\sum_{i \in \text{selected clusters}} y_i$$

Cluster sampling is cheaper (you only need to visit $m$ locations) but has higher variance than SRS when units within clusters are similar to each other. The intraclass correlation $\rho$ quantifies this: higher $\rho$ means worse precision for a given total sample size.

Definition

Design Effect (DEFF)

The design effect is the ratio of the variance under the actual sampling design to the variance under SRS with the same total sample size $n$:

$$\text{DEFF} = \frac{\text{Var}_{\text{design}}(\hat{\theta})}{\text{Var}_{\text{SRS}}(\hat{\theta})}$$

For cluster sampling with equal cluster sizes $B$ and intraclass correlation $\rho$: $\text{DEFF} \approx 1 + (B - 1)\rho$. A DEFF of 2 means you need twice the sample size to achieve the same precision as SRS.
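The approximation gives a quick back-of-envelope tool. A short sketch (the cluster size, $\rho$, and sample size below are hypothetical):

```python
def design_effect(B, rho):
    """Approximate DEFF for equal-size clusters: 1 + (B - 1) * rho."""
    return 1 + (B - 1) * rho

def effective_sample_size(n, B, rho):
    """SRS-equivalent size of n observations collected in clusters of B."""
    return n / design_effect(B, rho)

# Hypothetical survey: 1,000 respondents in clusters of 20 with rho = 0.05.
deff = design_effect(20, 0.05)                  # 1.95
n_eff = effective_sample_size(1000, 20, 0.05)   # roughly 513 SRS-equivalent
```

Even a modest $\rho$ of 0.05 nearly halves the effective sample size once clusters reach 20 units.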

Systematic, Multi-Stage, and Multi-Phase Designs

Systematic sampling: select a random start $k$ between 1 and the sampling interval $K = N/n$ (assume for simplicity that $K$ is an integer), then take every $K$-th unit: $k, k+K, k+2K, \ldots$. Simple to implement. Variance depends on the ordering of the population: if there is a periodic pattern with period $K$, systematic sampling can be catastrophically bad.
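The periodicity hazard is easy to demonstrate. In this sketch the population values repeat with exactly the sampling interval's period, so every systematic sample is constant (all names and numbers are illustrative):

```python
import random

def systematic_sample(N, K, start=None):
    """1-based indices k, k+K, k+2K, ... for a random start k in 1..K."""
    if start is None:
        start = random.randint(1, K)
    return list(range(start, N + 1, K))

# Adversarial ordering: population values are periodic with period K = 10.
N, K = 100, 10
population = {i: i % K for i in range(1, N + 1)}

idx = systematic_sample(N, K, start=3)
values = [population[i] for i in idx]
# All sampled indices are congruent mod K, so every sampled value is 3,
# while the true population mean of i % K over 1..100 is 4.5.
```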

Multi-stage sampling: first sample clusters (primary sampling units), then sample units within selected clusters (secondary sampling units). Can have more than two stages. The U.S. Census Bureau uses multi-stage designs for most of its surveys. The variance has contributions from each stage.

Multi-phase sampling: collect cheap information (e.g., a short screening questionnaire) on a large sample, then collect expensive information (e.g., a blood test) on a subsample selected based on the first-phase data. This is cost-efficient when the expensive measurement is highly correlated with the cheap one.

Main Theorems

Theorem

Horvitz-Thompson Estimator

Statement

For any probability sampling design with inclusion probabilities $\pi_i > 0$ for all $i$, the Horvitz-Thompson estimator of the population total $T = \sum_{i=1}^{N} y_i$ is:

$$\hat{T}_{\text{HT}} = \sum_{i \in S} \frac{y_i}{\pi_i}$$

This estimator is unbiased: $\mathbb{E}[\hat{T}_{\text{HT}}] = T$. Its variance is:

$$\text{Var}(\hat{T}_{\text{HT}}) = \sum_{i=1}^{N}\sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j)\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}$$

where $\pi_{ij} = \Pr(i \in S \text{ and } j \in S)$ is the joint inclusion probability.

Intuition

Each sampled unit $i$ "represents" $1/\pi_i$ population units. If a unit has inclusion probability 0.01, it represents 100 units in the population. The estimator weights each observed value by the inverse of its selection probability, which exactly corrects for the unequal probabilities of selection.

Proof Sketch

Define the indicator $Z_i = \mathbf{1}(i \in S)$. Then $\hat{T}_{\text{HT}} = \sum_{i=1}^{N} Z_i y_i / \pi_i$. Taking expectations: $\mathbb{E}[\hat{T}_{\text{HT}}] = \sum_{i=1}^{N} \mathbb{E}[Z_i] y_i / \pi_i = \sum_{i=1}^{N} \pi_i y_i / \pi_i = T$. The variance expression follows from $\text{Cov}(Z_i, Z_j) = \pi_{ij} - \pi_i \pi_j$.
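The unbiasedness argument can be checked by simulation. The sketch below uses Poisson sampling, where each unit enters the sample independently with probability $\pi_i$; this design is chosen here only because it is easy to simulate, and the population and probabilities are synthetic:

```python
import random

def horvitz_thompson_total(sample_y, sample_pi):
    """HT estimator of the population total: sum over the sample of y_i / pi_i."""
    return sum(y / pi for y, pi in zip(sample_y, sample_pi))

# Synthetic population with unequal but known inclusion probabilities.
random.seed(1)
y = [random.uniform(1, 100) for _ in range(500)]
pi = [min(1.0, 0.1 + yi / 200) for yi in y]
true_total = sum(y)

# Poisson sampling: each unit included independently with probability pi_i.
estimates = []
for _ in range(2_000):
    included = [(yi, pii) for yi, pii in zip(y, pi) if random.random() < pii]
    s_y = [yi for yi, _ in included]
    s_pi = [pii for _, pii in included]
    estimates.append(horvitz_thompson_total(s_y, s_pi))

avg = sum(estimates) / len(estimates)  # close to true_total on average
```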

Why It Matters

The Horvitz-Thompson estimator is the workhorse of survey statistics. It is the unique linear unbiased estimator that depends only on inclusion probabilities. It handles any probability sampling design: SRS, stratified, cluster, multi-stage, or anything else with known $\pi_i$.

Failure Mode

If inclusion probabilities are unknown or zero for some units, the estimator is undefined or biased. In practice, this happens with non-probability samples (convenience samples, web panels, voluntary surveys). If $\pi_i$ is very small for some unit $i$ and $y_i$ is large, the term $y_i / \pi_i$ can dominate the estimate and cause high variance. Trimming or capping weights is common but introduces bias.
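Weight capping itself is a one-liner. A minimal sketch of the mechanics (the threshold and weights here are arbitrary illustrations, not a recommended rule):

```python
def cap_weights(weights, cap):
    """Cap design weights w_i = 1 / pi_i at a threshold.

    Tames the variance contribution of extreme weights, at the cost of
    bias: capped units no longer fully represent their population share.
    """
    return [min(w, cap) for w in weights]

# Illustrative weights, e.g. 1/pi_i for pi_i = 0.5, 0.1, 0.001.
weights = [2.0, 10.0, 1000.0]
capped = cap_weights(weights, 100.0)  # [2.0, 10.0, 100.0]
```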

Common Confusions

Watch Out

Cluster sampling is not stratified sampling

Stratification divides the population into groups and samples within every group. Cluster sampling divides the population into groups and samples entire groups. Stratification reduces variance (by ensuring representation). Cluster sampling typically increases variance (because units within clusters are similar). They are opposite strategies.

Watch Out

Larger samples are not always better

Doubling the sample size under a bad design can be worse than halving it under a good design. If you cluster-sample with high intraclass correlation, adding more units within the same clusters barely helps. You would be better off sampling more clusters with fewer units per cluster.
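Using the DEFF approximation, this tradeoff reduces to simple arithmetic. A sketch comparing two allocations of the same total sample size (the cluster counts and $\rho$ are hypothetical):

```python
def relative_variance(m, B, rho):
    """Variance of the mean relative to one SRS draw: DEFF / n, with n = m * B."""
    return (1 + (B - 1) * rho) / (m * B)

rho = 0.05
few_big = relative_variance(50, 40, rho)      # n = 2000 in 50 clusters of 40
many_small = relative_variance(100, 20, rho)  # n = 2000 in 100 clusters of 20
# Same total n, but more clusters with fewer units gives lower variance.
```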

Watch Out

Random sampling does not mean haphazard sampling

Probability sampling requires a well-defined sampling frame and a randomization mechanism with known probabilities. Grabbing the first 100 people you encounter is not random sampling. It is a convenience sample with unknown inclusion probabilities, and no design-based inference is valid.

Summary

  • Every probability sampling design assigns known, positive inclusion probabilities to all units
  • SRS is the baseline; stratified sampling beats SRS when strata are homogeneous
  • Cluster sampling saves cost but increases variance, quantified by the design effect
  • The Horvitz-Thompson estimator is unbiased for any probability design with known $\pi_i$
  • Multi-stage and multi-phase designs are the practical workhorses for large-scale surveys
  • Ignoring the sampling design when analyzing survey data gives wrong standard errors

Exercises

ExerciseCore

Problem

A population of $N = 10{,}000$ has variance $S_y^2 = 25$. You take an SRS of $n = 400$. Compute the variance of the sample mean, with and without the finite population correction. How much does the FPC matter here?

ExerciseAdvanced

Problem

A population has two strata. Stratum 1 has $N_1 = 8000$ units with $S_1^2 = 10$. Stratum 2 has $N_2 = 2000$ units with $S_2^2 = 100$. You have budget for $n = 500$ total samples. Compare proportional allocation ($n_h \propto N_h$) with Neyman allocation ($n_h \propto N_h S_h$). Which gives smaller variance for the stratified mean?

References

Canonical:

  • Cochran, Sampling Techniques (1977), Chapters 2-5, 9-11
  • Särndal, Swensson, Wretman, Model Assisted Survey Sampling (1992), Chapters 2-4

Current:

  • Lohr, Sampling: Design and Analysis (2021), Chapters 1-6

  • Fuller, Sampling Statistics (2009), Chapters 1-3

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Last reviewed: April 2026
