Statistical Foundations
Design-Based vs. Model-Based Inference
Two philosophies of statistical inference from survey data: design-based inference, where randomness comes from the sampling design, and model-based inference, where randomness comes from a statistical model, plus the model-assisted hybrid of the two.
Why This Matters
Survey statisticians and ML researchers think about inference differently, and this difference leads to real mistakes when the two communities interact. Survey statisticians typically use design-based inference: the randomness comes from the sampling design, not from a model. ML researchers typically use model-based inference: they assume a data-generating process and fit a model.
Neither approach is universally correct. Design-based inference is robust (it requires no model assumptions) but can be inefficient. Model-based inference is efficient but depends on the model being correct. Understanding both is necessary for anyone who works with survey data or builds models using survey-derived datasets.
Mental Model
Imagine a finite population of people. Each person has a fixed, non-random value (their income, their height, their vote). There is no randomness in the population itself.
Design-based view: the only randomness is in which people we select for the sample. The sampling design defines a probability distribution over all possible samples. Inference is about the properties of estimators under this distribution.
Model-based view: the population values $y_1, \dots, y_N$ are themselves random draws from a probability model $\xi$. Inference is about the parameters of $\xi$ or about predictions from the model.
These are structurally different conceptions of what "random" means in the problem.
Design-Based Inference
In design-based (also called randomization-based) inference:
- The population values are fixed, unknown constants
- The only source of randomness is the sampling mechanism
- Expectations, variances, and probabilities are computed over repeated sampling from the same fixed population
- The target of inference is a finite population quantity: the mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ or the total $T = \sum_{i=1}^{N} y_i$
No distributional assumptions are needed. The validity of inference depends entirely on the sampling design.
The workhorse estimator is the Horvitz-Thompson (HT) estimator of the total,

$$\hat{T}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i},$$

where $\pi_i$ is the inclusion probability of unit $i$. Its unbiasedness holds for any set of population values $y_1, \dots, y_N$, because the expectation is over the sampling design $p(s)$, not over any model for $y$.
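To make design-unbiasedness concrete, here is a minimal sketch (with made-up population values) that enumerates every possible SRS and checks that the HT estimates average to the true mean:

```python
import itertools
import numpy as np

# Made-up fixed finite population; nothing about these values is random.
y = np.array([3.0, 7.0, 8.0, 12.0, 20.0])
N, n = len(y), 2
pi = n / N  # under SRS without replacement, every unit has inclusion prob n/N

def ht_mean(s):
    """Horvitz-Thompson estimator of the population mean for sample s."""
    return sum(y[i] / pi for i in s) / N

# Enumerate all C(5, 2) = 10 equally likely samples.
estimates = [ht_mean(s) for s in itertools.combinations(range(N), n)]

# The average over the randomization distribution equals the true mean.
print(np.mean(estimates), y.mean())  # both are 10.0
```

Under SRS the HT estimator reduces to the sample mean; the exercise at the end of this section asks for the same verification by hand.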
Strengths of Design-Based Inference
- Robustness: no model assumptions. If the design is a probability sample, the HT estimator is unbiased regardless of the true relationship between $y$ and any covariates $x$.
- Transparency: the properties of estimators depend only on the known sampling design.
- Legal standing: official statistics agencies use design-based methods because their validity does not depend on contestable modeling choices.
Weaknesses of Design-Based Inference
- Inefficiency: the HT estimator ignores auxiliary information (covariates) that could improve precision.
- Requires probability sampling: if the sample is not a probability sample, design-based inference is undefined.
- Small domains: design-based estimates for small subpopulations have large variance because they use only the data from that subpopulation.
Model-Based (Superpopulation) Inference
In model-based inference:
- The population values $y_1, \dots, y_N$ are realizations of random variables $Y_1, \dots, Y_N$ from a model $\xi$ (the "superpopulation")
- Inference is about the model parameters or about predictions for unobserved units
- Expectations and variances are computed under the model, not under the sampling design
- The sampling design is relevant only insofar as it creates selection bias under the model
Under the model, the population mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$ is itself a random variable. The target might be $\mu = E_\xi[Y]$ (the expectation under the model $\xi$) or $\bar{Y}$ itself.
Strengths of Model-Based Inference
- Efficiency: uses auxiliary information (covariates) to improve estimates.
- Small domains: can produce estimates for any subpopulation, even those with zero sample size, by extrapolating from the model.
- Non-probability samples: can (in principle) be used with any sample, as long as the model correctly specifies the relationship between $y$ and the covariates $x$.
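A minimal sketch of the model-based recipe, under an assumed (and here correctly specified) linear superpopulation model: fit the model on the sample, predict the unsampled units, and combine observed and predicted values. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed superpopulation model: y = 2 + 3x + noise. The covariate x is
# known for every unit; y is observed only for sampled units.
N = 1_000
x = rng.uniform(0, 10, N)
y = 2 + 3 * x + rng.normal(0, 1, N)

in_sample = np.zeros(N, dtype=bool)
in_sample[rng.choice(N, size=50, replace=False)] = True

# Fit the working model on the sample (ordinary least squares).
slope, intercept = np.polyfit(x[in_sample], y[in_sample], 1)

# Predict y for unsampled units; observed + predicted gives the estimate.
y_hat = intercept + slope * x[~in_sample]
model_based_mean = (y[in_sample].sum() + y_hat.sum()) / N

print(model_based_mean, y.mean())  # close, because the model is correct
```

If the working model were misspecified (say, the true relationship were nonlinear), the same recipe would extrapolate the error to all $N - n$ unsampled units, which is exactly the model-dependence weakness discussed next.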
Weaknesses of Model-Based Inference
- Model dependence: if the model is wrong, the estimator can be biased. There is no "free lunch" guarantee like the HT estimator's design-unbiasedness.
- Difficult to verify: model diagnostics help but cannot definitively confirm the model is correct.
- Controversial in official statistics: agencies are reluctant to publish estimates that depend on modeling choices that could be disputed.
Model-Assisted Inference
Model-assisted inference uses a model to improve efficiency while retaining the design-based validity guarantee. The idea: use the model to construct a good estimator, but evaluate its properties (bias, variance) under the sampling design, not the model.
The generalized regression (GREG) estimator is the standard model-assisted estimator:

$$\hat{\bar{Y}}_{GREG} = \hat{\bar{Y}}_{HT} + (\bar{X} - \hat{\bar{X}}_{HT})^{\top} \hat{B},$$

where $\hat{B}$ is a regression coefficient vector estimated from the sample. This is the HT estimator plus a correction term that exploits the known population mean $\bar{X}$ of the auxiliary variables $x$.
The GREG estimator is approximately design-unbiased regardless of whether the regression model is correct. If the model is correct, it is also efficient. If the model is wrong, the correction term is small (because the HT part already does the heavy lifting).
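A simulation sketch of GREG under SRS with a single auxiliary variable, using a made-up population in which the known mean of $x$ tightens the estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up population; x is strongly related to y, and the mean of x is
# "known" from an external source such as a register.
N = 10_000
x = rng.gamma(2.0, 2.0, N)
y = 1 + 2 * x + rng.normal(0, 2, N)
x_bar_U = x.mean()  # known population mean of the auxiliary variable

n = 200
s = rng.choice(N, size=n, replace=False)
w = N / n  # design weight under SRS (1 / pi_i with pi_i = n / N)

# HT estimators of the means of y and x
y_ht = np.sum(w * y[s]) / N
x_ht = np.sum(w * x[s]) / N

# Regression slope estimated from the sample
B_hat = np.sum((x[s] - x_ht) * (y[s] - y_ht)) / np.sum((x[s] - x_ht) ** 2)

# GREG: HT plus a correction using the known population mean of x
greg = y_ht + (x_bar_U - x_ht) * B_hat

print(y.mean(), y_ht, greg)  # GREG is typically much closer to the truth
```

When the sample happens to overdraw large-$x$ units, $\hat{\bar{X}}_{HT} > \bar{X}$ and the correction pulls the estimate back down; that is the efficiency gain, purchased without giving up design-based validity.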
Main Theorems
Design Consistency of the Horvitz-Thompson Estimator
Statement
Let $\hat{\bar{Y}}_{HT} = \frac{1}{N}\sum_{i \in s} y_i / \pi_i$ be the HT estimator of the population mean $\bar{Y}$. Under a sequence of finite populations and sampling designs where $n \to \infty$ (and $N \to \infty$) and regularity conditions on the inclusion probabilities $\pi_i$ hold:

$$\hat{\bar{Y}}_{HT} - \bar{Y} \xrightarrow{p} 0 \quad \text{and} \quad \frac{\hat{\bar{Y}}_{HT} - \bar{Y}}{\sqrt{\operatorname{Var}_p(\hat{\bar{Y}}_{HT})}} \xrightarrow{d} N(0, 1).$$
The convergence is in probability and distribution under the sampling design $p$, for any fixed set of population values $y_1, \dots, y_N$.
Intuition
As the sample gets large relative to the population, the HT estimator converges to the population mean and is approximately normal. This is a law of large numbers and central limit theorem under the randomization distribution, not under any model. The result holds for any population values because the randomness is in the sampling, not the data.
Proof Sketch
Write $\hat{\bar{Y}}_{HT} - \bar{Y} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{I_i}{\pi_i} - 1\right) y_i$, where $I_i$ is the indicator that unit $i$ is sampled (so $E_p[I_i] = \pi_i$ and each term has mean zero). Under regularity conditions (Hájek conditions on the design), this is a sum of weakly dependent random variables with mean zero. A CLT for finite-population sampling (Hájek, 1960) gives asymptotic normality.
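A quick Monte Carlo check of the randomization CLT: hold a skewed population fixed, draw many SRS samples, and check the coverage of the usual 95% interval. All population and sample sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# A skewed finite population, generated once and then treated as FIXED.
N, n = 5_000, 400
y = rng.lognormal(mean=0.0, sigma=1.0, size=N)
Y_bar = y.mean()  # the fixed target

reps, covered = 2_000, 0
for _ in range(reps):
    s = rng.choice(N, size=n, replace=False)  # the only source of randomness
    ybar = y[s].mean()                        # HT estimator under SRS
    # SRS variance estimate with the finite population correction
    se = np.sqrt((1 - n / N) * y[s].var(ddof=1) / n)
    if abs(ybar - Y_bar) <= 1.96 * se:
        covered += 1

coverage = covered / reps
print(coverage)  # close to 0.95
```

Shrinking $n$ or making the population more skewed degrades the coverage, which previews the failure mode below.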
Why It Matters
This theorem is the theoretical backbone of design-based survey inference. It guarantees that HT-based confidence intervals have correct coverage in large samples, with no model assumptions. This is why statistical agencies trust design-based methods: the theory requires only a probability sampling design and regularity conditions.
Failure Mode
The convergence requires the sample size to grow. For small samples (small areas, rare domains), the normal approximation is poor and confidence intervals may have incorrect coverage. The regularity conditions on inclusion probabilities can fail with highly unequal probability designs where some $\pi_i$ are very small. In such cases, the HT estimator has high variance and the CLT approximation is poor.
The Key Philosophical Difference
In design-based inference, the population quantity $\bar{Y}$ is a fixed number. There is nothing random about it. The probability statement "95% confidence interval" means: if we repeated the sampling procedure many times from the same population, 95% of the resulting intervals would contain $\bar{Y}$.
In model-based inference, $\bar{Y}$ is a random variable. A "95% credible interval" (Bayesian) or "95% prediction interval" means something different: it reflects uncertainty about the data-generating process, not just the sampling.
These are not just philosophical niceties. They determine what your confidence intervals actually mean and when they have correct coverage.
When to Use Which
| Situation | Recommended Approach |
|---|---|
| Official statistics, legal requirements | Design-based or model-assisted |
| Probability sample, large domains | Design-based (possibly model-assisted) |
| Small areas, rare subpopulations | Model-based (SAE) |
| Non-probability sample (web panel, convenience) | Model-based (with caution) |
| ML model training on survey data | Use survey weights (design-based perspective) |
Common Confusions
Design-based inference does not mean you cannot use models
Model-assisted inference uses regression models to improve efficiency while retaining design-based validity. The distinction is about what justifies the inference (the design vs. the model), not about whether models appear in the estimation procedure.
ML prediction is not the same as survey estimation
In ML, you fit a model to predict from and evaluate on held-out data. The goal is prediction. In survey estimation, the goal is to estimate a population quantity (a mean, a total). Prediction error and estimation error are different things. A good predictive model can still produce biased population estimates if used without survey weights.
Ignoring survey weights in ML does not always help prediction
Some ML practitioners discard survey weights because they think weights only matter for inference, not prediction. This is wrong when the sampling design is informative: when the probability of inclusion depends on $y$ beyond what the model's covariates explain. In that case, the unweighted sample overrepresents certain $y$ values, and a model trained without weights learns a distorted relationship.
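A sketch of an informative design, with the inclusion probability deliberately made proportional to $y$ itself (an extreme made-up assumption): the unweighted sample mean is biased upward, while the HT estimator with design weights recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up population; units with larger y are more likely to be sampled
# (Poisson sampling with pi_i proportional to y_i).
N = 100_000
y = rng.normal(50, 10, N)
pi = np.clip(5_000 * y / y.sum(), 1e-6, 1.0)

sampled = rng.random(N) < pi
y_s, pi_s = y[sampled], pi[sampled]

unweighted = y_s.mean()            # biased upward: large y overrepresented
weighted = np.sum(y_s / pi_s) / N  # HT estimator, uses the design weights

print(y.mean(), unweighted, weighted)
```

The same distortion carries over to any model fit without weights on such a sample: the fitted relationship reflects the tilted sampling distribution, not the population.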
Summary
- Design-based: randomness from sampling, population values are fixed constants
- Model-based: randomness from the data-generating model, population values are random
- Design-based is robust but can be inefficient; model-based is efficient but model-dependent
- Model-assisted = use models for efficiency, evaluate properties under the design
- The GREG estimator is the standard model-assisted estimator
- HT estimator is design-consistent for any population under regularity conditions
- Use survey weights when training ML models on survey data
Exercises
Problem
A small population of $N$ units has known values $y_1, \dots, y_N$ (pick, say, $N = 4$ with values of your choosing). You take an SRS of size $n = 2$. List all $\binom{N}{n}$ possible samples, compute $\hat{\bar{Y}}_{HT}$ for each, and verify that the average of $\hat{\bar{Y}}_{HT}$ over all possible samples equals $\bar{Y}$.
Problem
A researcher trains a logistic regression on CPS microdata to predict labor force participation. They do not use the survey weights. A survey statistician says the model is "wrong." The ML researcher says prediction accuracy on held-out CPS data is 89%. Who is right, and why might they both have a point?
References
Canonical:
- Särndal, Swensson & Wretman, Model Assisted Survey Sampling (1992), Chapters 1-2, 6-7
- Cochran, Sampling Techniques (1977), Chapters 1-2
Current:
- Lumley, Complex Surveys: A Guide to Analysis Using R (2010), Chapters 1-2
- Buelens et al., "Comparing Inference Methods for Non-Probability Samples" (2018), ISR
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
- Small area estimation: where model-based methods are essential
- Nonresponse and missing data: a challenge for both paradigms
Last reviewed: April 2026