Non-Probability Sampling
Convenience and opt-in samples do not give probability-of-inclusion guarantees. The data-defect identity (Meng 2018) shows why a massive convenience sample can produce a confidently wrong answer. Repair methods: calibration, sampling-score weighting, mass imputation, doubly robust integration with a probability sample, and sensitivity analysis.
Why This Matters
A probability sample is one where every population unit has a known, positive probability of being selected and the design is recorded in the data. Classical survey inference is a story about probability samples: known inclusion probabilities give unbiased Horvitz-Thompson estimators and design-based variances. The math works because selection is a controlled random experiment.
A non-probability sample is anything else: web-panel respondents, app telemetry, social-media posts, opt-in surveys, convenience samples, scraped records, voluntary-response polls, log data, and most large datasets used in machine learning. Inclusion probabilities are unknown, may be zero for some units, and depend on covariates and outcomes in ways that are not recorded.
This distinction is increasingly the central problem in applied statistics and causal ML. Three things are simultaneously true:
- Probability samples are expensive, slow, and shrinking; response rates on classical household surveys have fallen for two decades.
- Non-probability data is cheap, fast, and growing. Web logs, app telemetry, and voluntary panels are the operational substrate of modern evidence.
- Population inference from non-probability data without correction becomes more dangerous as $n$ grows, not less. This is Meng's big data paradox (formal statement below): the absolute error of a sample mean from a biased non-probability sample scales with $\sqrt{(1-f)/f}$ for fixed selection-bias correlation, so making the sample larger while the sampling fraction $f = n/N$ stays small only sharpens the wrong answer.
The methodological response is data integration: combine non-probability data with probability-sample information or known population totals to recover defensible inference. This is the natural home of double/debiased machine learning, calibration estimators (GREG), and sampling-score weighting.
Three Sampling Regimes
| Regime | What is known about selection | Main strength | Main failure mode |
|---|---|---|---|
| Probability sample | Inclusion probability for every unit, design fully recorded | Design-based, distribution-free inference (Horvitz-Thompson) | Cost, declining response, frame coverage gaps |
| Non-probability sample | Records exist; selection mechanism unknown and unrecorded | Cheap, fast, large | Unknown selection bias; population estimands generally not identified without auxiliary information |
| Integrated sample | Non-probability data plus a probability sample or known population totals | Efficiency of large $n$ with correctness from auxiliary information | Requires identification assumptions; only as good as the auxiliary data |
Probability sampling is the design-based foundation; see survey sampling methods for the standard designs (SRS, stratified, cluster, multi-stage) and the GREG estimator for the model-assisted calibration form.
The Statistical Problem
Let $\{1, \dots, N\}$ be a finite population with values $Y_1, \dots, Y_N$. Define the inclusion indicator $R_i \in \{0, 1\}$ with $R_i = 1$ when unit $i$ is in the observed sample. The sample size is $n = \sum_{i=1}^{N} R_i$ and the sampling fraction is $f = n/N$.
The sample mean is
$$\bar{Y}_n = \frac{\sum_{i=1}^{N} R_i Y_i}{\sum_{i=1}^{N} R_i}.$$
The population mean is $\bar{Y}_N = \frac{1}{N} \sum_{i=1}^{N} Y_i$. The bias of the sample mean is
$$\mathbb{E}[\bar{Y}_n] - \bar{Y}_N \approx \mathbb{E}[Y \mid R = 1] - \mathbb{E}[Y],$$
where the expectations are over a unit drawn from the population together with the selection mechanism.
This expression is the source of all the trouble. When selection is independent of the outcome (probability sampling with constant inclusion probability, or simple random sampling on a representative frame), the conditional and marginal expectations agree and the bias is zero. When selection depends on $Y$, directly or indirectly through covariates correlated with $Y$, the bias is non-zero and does not vanish as $n \to \infty$.
This is structurally different from sampling-error variance, which shrinks like $1/n$. Selection bias is a bias, not a variance, and collecting more biased data does not fix it.
The Data-Defect Identity (Meng 2018)
The clean identity that makes the big-data paradox sharp.
Statement
For a finite population of size $N$ with values $Y_1, \dots, Y_N$, observed sample size $n$, and inclusion indicator $R_i$,
$$\bar{Y}_n - \bar{Y}_N \;=\; \hat{\rho}_{R,Y} \times \sqrt{\frac{1-f}{f}} \times \sigma_Y,$$
where $f = n/N$ is the sampling fraction, $\sigma_Y$ is the population standard deviation of $Y$, and
$$\hat{\rho}_{R,Y} = \mathrm{Corr}_N(R, Y)$$
is the data-defect correlation between the inclusion indicator and the outcome.
Intuition
The estimation error decomposes into three factors: how biased the selection is ($\hat{\rho}_{R,Y}$, the correlation between "are you in the sample" and "what is your outcome"), how variable the outcome is in the population ($\sigma_Y$), and a sampling-fraction multiplier $\sqrt{(1-f)/f}$.
The sampling-fraction multiplier is the surprising part. For fixed $\hat{\rho}_{R,Y}$ and small $f$, error scales like $\sqrt{N/n}$. A massive convenience sample (large $n$ but with $f$ still small because $N$ is huge) produces a more confidently wrong answer than a smaller biased sample at the same $\hat{\rho}_{R,Y}$. The variance of $\bar{Y}_n$ around its expectation goes to zero with $n$, sharpening point estimation around the biased target $\mathbb{E}[\bar{Y}_n]$, not around $\bar{Y}_N$.
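A minimal numeric sketch of the multiplier, with illustrative values (not taken from Meng's paper): the same defect correlation costs more absolute error in a huge sample with a tiny sampling fraction than in a small sample with a modest one.

```python
import math

def ddi_error(rho, n, N, sigma):
    """Absolute error |Ybar_n - Ybar_N| implied by the data-defect identity."""
    f = n / N
    return abs(rho) * math.sqrt((1 - f) / f) * sigma

# Illustrative numbers: the same defect correlation rho = 0.005 throughout.
sigma = 1.0
big = ddi_error(0.005, n=2_000_000, N=200_000_000, sigma=sigma)  # f = 0.01
small = ddi_error(0.005, n=2_000, N=20_000, sigma=sigma)         # f = 0.10
print(f"huge biased sample (n = 2,000,000): error ~ {big:.4f} sd")    # ~0.0497
print(f"small biased sample (n = 2,000):    error ~ {small:.4f} sd")  # ~0.0150
```

The thousand-fold larger sample lands roughly three times further from the truth, because its sampling fraction is ten times smaller.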
Proof Sketch
The covariance form is mechanical: writing $\mathbb{E}_N$ for the population average,
$$\bar{Y}_n = \frac{\mathbb{E}_N[R Y]}{\mathbb{E}_N[R]}, \qquad \bar{Y}_n - \bar{Y}_N = \frac{\mathrm{Cov}_N(R, Y)}{\mathbb{E}_N[R]} = \frac{\hat{\rho}_{R,Y}\, \sigma_R\, \sigma_Y}{f}.$$
Substituting $\sigma_R = \sqrt{f(1-f)}$ (the standard deviation of a 0/1 indicator with mean $f$) gives the identity. Full derivation in Meng (2018), Annals of Applied Statistics.
Why It Matters
The identity quantifies what "selection bias" costs in a single, operational number. For the 2016 US presidential election, Meng's analysis showed that voter-intention polls had effective sample sizes orders of magnitude below their nominal $n$: doubling the sample without fixing the selection mechanism barely moves the error while making the nominal confidence intervals more misleadingly precise. The same identity is now standard in COVID-19 prevalence-survey audits and in modern web-panel reweighting evaluation.
The qualitative payoff is: estimating a population mean from non-probability data requires either a small bias correlation (a structural argument that selection is roughly independent of $Y$) or an explicit correction (calibration, weighting, imputation, or DR). "Large $n$" is not on its own a defense, and is in fact a hazard.
Failure Mode
The identity is for the simple sample mean of a single $Y$. For regression coefficients, ratio estimators, ML predictions, or causal estimands, analogous decompositions exist, but the data-defect "correlation" is replaced by a richer object (a function of the score, the propensity, and the outcome model). The intuition (selection bias persists and can sharpen as the sample grows) carries over uniformly. The exact constant does not.
Repair Methods
Five standard methods, ordered roughly by how much auxiliary information they require. None is universally best; each buys correctness at the price of a different identification assumption.
| Method | What you need | What you assume | Output |
|---|---|---|---|
| Calibration / raking / GREG | Known population totals of auxiliary variables $x$ (counts, marginals) | Outcome model linear in the calibration auxiliaries, or close to it | Adjusted weights so the weighted sample matches population totals of $x$ |
| Sampling-score weighting | Probability reference sample with the same $x$ | A model for $\pi(x) = P(R = 1 \mid x)$ is correctly specified, with positivity | Inverse-probability weighted estimator |
| Mass imputation | Probability reference sample with $y$ observed there, or a population frame with $x$ | An outcome model $m(x) = \mathbb{E}[Y \mid x]$ is correctly specified | Predict $\hat{m}(x_i)$ for every population unit and average |
| Doubly robust | Both of the above ingredients | At least one of the two models (sampling-score or outcome) is correctly specified | Consistent estimator under either; product-rate bias when both are misspecified but converge |
| Sensitivity analysis | Anything above, plus a parametric bound on residual selection on unmeasured variables | A worst-case bound on the unmeasured-confounding parameter | Coverage-degrades-gracefully interval |
Calibration / GREG
Calibration adjusts the sample weights to match known population totals on auxiliary variables. Solve:
$$\min_{w} \sum_{i \in S} G(w_i, d_i) \quad \text{subject to} \quad \sum_{i \in S} w_i x_i = t_x,$$
where $d_i$ is a starting weight (e.g., $1/\pi_i$ from the design, or just $1$ for non-probability data), $t_x$ is the known population total of $x$, and $G$ is a distance function (chi-squared, raking, logit). The GREG estimator is the special case where $G$ is the chi-squared distance and the auxiliary model is linear.
When $x$ predicts both $R$ and $Y$, calibration removes the part of the selection bias that operates through $x$. It cannot remove selection that operates through unmeasured variables. That residual is the substance of sensitivity analysis below.
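A minimal raking sketch (iterative proportional fitting over 0/1 category indicators), on synthetic data; real analyses would use a dedicated survey package such as R's survey (Lumley 2010) rather than this toy loop.

```python
import numpy as np

def rake(X, pop_totals, d=None, iters=100, tol=1e-10):
    """Multiplicatively adjust starting weights d so the weighted totals of
    the 0/1 indicator columns of X match the known population totals."""
    n, p = X.shape
    w = np.ones(n) if d is None else d.astype(float).copy()
    for _ in range(iters):
        max_gap = 0.0
        for j in range(p):
            cur = w @ X[:, j]
            if cur > 0:
                ratio = pop_totals[j] / cur
                w[X[:, j] == 1] *= ratio
                max_gap = max(max_gap, abs(ratio - 1))
        if max_gap < tol:
            break
    return w

# Toy example: an opt-in sample that over-represents one group.
rng = np.random.default_rng(0)
male = rng.random(1_000) < 0.7                   # 70% male in the sample
X = np.column_stack([male, ~male]).astype(float)
pop_totals = np.array([0.49, 0.51]) * 100_000    # known population counts
w = rake(X, pop_totals)
print(w @ X / w.sum())                           # weighted shares ~ [0.49, 0.51]
```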
Sampling-score weighting
Treat the non-probability sample like a missing-data problem: every unit has a latent inclusion probability $\pi(x_i) = P(R_i = 1 \mid x_i)$. Fit this from a probability reference sample with the same $x$, and weight the non-probability units by $1/\hat{\pi}(x_i)$. The estimator is
$$\hat{\bar{Y}}_{\mathrm{IPW}} = \frac{\sum_{i \in S_{np}} Y_i / \hat{\pi}(x_i)}{\sum_{i \in S_{np}} 1 / \hat{\pi}(x_i)}.$$
This requires positivity: $\pi(x) > 0$ for all $x$ in the target population. Web-panel and opt-in samples often have $\pi(x) = 0$ on sub-populations (people who do not use the panel platform at all), which voids the construction for those sub-populations no matter how clever the weighting. Wiśniowski et al. (2020, Journal of Survey Statistics and Methodology) is the standard treatment of this issue.
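A hedged sketch of the construction on synthetic data. The sampling score is estimated by logistic regression contrasting the non-probability sample against a design-weighted reference sample, reading the fitted odds as $\hat{\pi}(x)$ — a common practical shortcut in the spirit of Elliott and Valliant (2017), not the exact pseudo-likelihood of Chen, Li, Wu (2020).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic population: selection depends on x, and so (indirectly) on y.
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)
pi_true = 1 / (1 + np.exp(-(-3.0 + 1.2 * x)))
in_np = rng.random(N) < pi_true                  # biased opt-in sample
x_np, y_np = x[in_np], y[in_np]

# Small SRS reference sample carrying x only, with design weights N / n_ref.
ref = rng.choice(N, size=2_000, replace=False)
d_ref = np.full(len(ref), N / len(ref))

# Stack np sample (label 1) against the weighted reference (label 0):
# with the reference weighted up to the population, the fitted odds
# P(z=1|x) / P(z=0|x) estimate pi(x).
Xs = np.concatenate([x_np, x[ref]]).reshape(-1, 1)
zs = np.concatenate([np.ones(len(x_np)), np.zeros(len(ref))])
ws = np.concatenate([np.ones(len(x_np)), d_ref])
p = LogisticRegression().fit(Xs, zs, sample_weight=ws).predict_proba(
    x_np.reshape(-1, 1))[:, 1]
pi_hat = np.clip(p / (1 - p), 1e-6, 1.0)

w = 1 / pi_hat
print("naive np mean:", y_np.mean())                 # biased upward
print("IPW estimate :", (w * y_np).sum() / w.sum())  # near the truth
print("true pop mean:", y.mean())                    # ~2.0
```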
Mass imputation
If a probability reference sample carries $(x_i, y_i)$, fit $\hat{m}(x) \approx \mathbb{E}[Y \mid x]$ on it. Then for every population unit (or every unit in the non-probability sample), predict $\hat{m}(x_i)$ and average:
$$\hat{\bar{Y}}_{\mathrm{MI}} = \frac{1}{N} \sum_{i=1}^{N} \hat{m}(x_i).$$
This is unbiased when the outcome model is correctly specified; it fails when $\hat{m}$ is misspecified, in roughly the way ordinary regression-based extrapolation fails.
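A minimal mass-imputation sketch in the direction this page describes: the outcome model is fit on a probability reference sample that carries $(x, y)$, then averaged over the full frame. The linear model and synthetic data are assumptions; a misspecified $\hat{m}$ breaks this in the usual extrapolation way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic population with x known for every unit on the frame.
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

# SRS reference sample with (x, y) observed; fit m(x) = E[Y | x] on it.
ref = rng.choice(N, size=1_000, replace=False)
m = LinearRegression().fit(x[ref].reshape(-1, 1), y[ref])

# Predict for every frame unit and average (design weights are constant
# under SRS, so the unweighted mean of predictions suffices here).
print("mass-imputation estimate:", m.predict(x.reshape(-1, 1)).mean())
print("true population mean    :", y.mean())
```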
Doubly robust
Combine sampling-score weighting and mass imputation. The standard form, adapting AIPW from causal inference:
$$\hat{\bar{Y}}_{\mathrm{DR}} = \frac{1}{N} \sum_{i \in S_{np}} \frac{Y_i - \hat{m}(x_i)}{\hat{\pi}(x_i)} \;+\; \frac{1}{N} \sum_{i \in S_{p}} d_i\, \hat{m}(x_i),$$
where $S_{np}$ is the non-probability sample, $S_p$ the probability reference sample, and $d_i$ its design weights.
This is consistent when either $\hat{\pi}$ is consistent for $\pi$ or $\hat{m}$ is consistent for $m$, under positivity and standard regularity. The bias is the product of the two estimation errors, so doubly robust estimation pairs naturally with double/debiased machine learning: the orthogonal-score formulation gives $\sqrt{n}$-rate inference under the same product-rate condition with cross-fitted nuisance estimators. Chen, Li, Wu (2020) and the JRSS-A / Journal of Official Statistics literature develop this in the survey-sampling setting.
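A hedged sketch of the DR combination, reusing the two nuisance constructions above on synthetic data. Here $\hat{m}$ is fit on the non-probability sample, one common choice in the Chen, Li, Wu (2020) setup, and the sampling score comes from the same logistic contrast as before.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)

N = 100_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)
pi_true = 1 / (1 + np.exp(-(-3.0 + 1.2 * x)))
in_np = rng.random(N) < pi_true                  # biased opt-in sample
x_np, y_np = x[in_np], y[in_np]

ref = rng.choice(N, size=2_000, replace=False)   # SRS reference, x only
d_ref = np.full(len(ref), N / len(ref))          # design weights

# Outcome model fit on the non-probability sample.
m = LinearRegression().fit(x_np.reshape(-1, 1), y_np)

# Sampling score via the np-vs-weighted-reference logistic contrast.
Xs = np.concatenate([x_np, x[ref]]).reshape(-1, 1)
zs = np.concatenate([np.ones(len(x_np)), np.zeros(len(ref))])
ws = np.concatenate([np.ones(len(x_np)), d_ref])
p = LogisticRegression().fit(Xs, zs, sample_weight=ws).predict_proba(
    x_np.reshape(-1, 1))[:, 1]
pi_hat = np.clip(p / (1 - p), 1e-6, 1.0)

# DR estimator: weighted residuals on the np sample plus design-weighted
# model predictions on the reference sample.
resid_term = ((y_np - m.predict(x_np.reshape(-1, 1))) / pi_hat).sum() / N
model_term = (d_ref * m.predict(x[ref].reshape(-1, 1))).sum() / N
print("DR estimate  :", resid_term + model_term)
print("true pop mean:", y.mean())
```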
Sensitivity analysis
Even after the four corrections above, residual selection on unmeasured variables may remain. A sensitivity analysis parameterizes this residual (e.g., Rosenbaum's $\Gamma$ for hidden confounding, or a bound on the omitted-variable correlation $\hat{\rho}_{R,Y}$) and reports estimates that degrade gracefully as the parameter grows. This gives an honest answer to "what if our adjustment leaves an unmeasured selection mechanism with strength $\Gamma$ behind?", without claiming a guarantee.
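A minimal sketch of the simplest such report, assuming the residual defect correlation is bounded by a chosen rho_max and plugging it into the data-defect identity for a worst-case interval (illustrative numbers from a hypothetical survey):

```python
import math

def ddi_worst_case(estimate, rho_max, n, N, sigma):
    """Worst-case interval from the data-defect identity under |rho| <= rho_max."""
    f = n / N
    half_width = rho_max * math.sqrt((1 - f) / f) * sigma
    return estimate - half_width, estimate + half_width

# Illustrative: a calibrated web-panel estimate of 0.52, with the residual
# defect correlation bounded at 0.01 by assumption.
lo, hi = ddi_worst_case(estimate=0.52, rho_max=0.01, n=50_000, N=10_000_000, sigma=0.5)
print(f"worst-case interval: [{lo:.3f}, {hi:.3f}]")
```

The interval widens smoothly as rho_max grows, which is the "degrades gracefully" behaviour described above.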
Connection to Convex Tinkering
Cheap, biased data is high-leverage for exploration and dangerous as final population evidence. This is the survey-statistics counterpart to convex tinkering: convenience data is a useful cheap-option generator under bounded downside, not a final estimator under unbounded downside.
The convex move:
- Use non-probability data to screen hypotheses and surface candidate patterns at low cost.
- Reserve probability sampling, randomized experiments, or calibration against trusted population totals for the candidates that might actually drive a decision.
- Treat the transition from a screening insight to a committed claim as the moment downside becomes unbounded. Calibration, weighting, or DR adjustment is what re-bounds it.
The failure mode is using non-probability data for both steps, treating exploratory pattern-mining as evidence by virtue of $n$ alone. Meng's data-defect identity is the warning label.
Common Confusions
Bigger n fixes selection bias
The data-defect identity makes this exactly wrong. For fixed $\hat{\rho}_{R,Y}$, the absolute estimation error scales as $\sqrt{(1-f)/f}\,\sigma_Y$. Only sampling fractions close to 1, not large $n$ relative to the bias correlation, control the error. A non-probability sample of $n$ units from a population with $N \gg n$ has $f \approx 0$ and error multiplier $\approx \sqrt{N/n}$, no better than a far smaller sample at the same $\hat{\rho}_{R,Y}$ and sampling fraction. Confidence intervals built from $\sqrt{n}$-style asymptotics will shrink, but they shrink around the biased target, not the population parameter.
Probability sample is automatically representative
A probability sample with declining response rate becomes a hybrid: the design probabilities are still known, but actual response is conditioned on covariates and outcomes. The effective analysis is non-probability with a known design weight; the data-defect identity applies to the response process. This is why even classical agencies that retain probability designs increasingly publish calibration-adjusted weights.
Calibration removes all selection bias
Calibration removes selection that operates through the observed auxiliary variables $x$. It does not touch selection that operates through unmeasured variables, residual variation in $Y$ at fixed $x$, or non-linearities the calibration model misses. The $x$-conditional residual is what sensitivity analysis bounds.
Doubly robust means twice as accurate
"Doubly robust" means consistent under either of two models, not "more accurate than singly robust." If both models are misspecified, DR can be worse than a well-specified single model. The product-bias shape is the right intuition: bias is small when at least one model is near the truth, and the rate of convergence to the truth is the product of the two estimation rates (slow slow can still be fast enough; see DML's product-rate condition).
Anonymous, opt-in, or scraped data is exempt from this
The data-defect identity does not care how the data was obtained. It applies to any sample with a non-trivial inclusion mechanism. Web scrapes, app telemetry, leaked records, opt-in panels, and clinical convenience cohorts all sit in the same regime. The only difference is how much auxiliary information is available to attempt a correction; none of them get the design-based guarantees of probability sampling.
Exercises
Problem
A population of size $N$ has standard deviation $\sigma_Y$. A non-probability sample of size $n$ (sampling fraction $f = n/N$) has data-defect correlation $\hat{\rho}_{R,Y}$. Express the absolute error of the sample mean as an estimator of the population mean in terms of these quantities.
Problem
For the same population and bias correlation as above, how large would the sampling fraction $f$ need to be for the bias to drop to $k$ population standard deviations of $Y$?
Problem
A web-panel sample is calibrated against the marginal distribution of age, gender, and education. After calibration, the residual data-defect correlation is bounded above by 0.01 under a sensitivity analysis. Given $n$, $N$, and $\sigma_Y$, give the worst-case absolute estimation error and explain how it differs from the unconditional version.
References
Canonical:
- Meng, X.-L. "Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election." Annals of Applied Statistics 12(2) (2018): 685-726. The data-defect identity and the "big data paradox" framing.
- Cochran, W. G. Sampling Techniques (3rd ed., Wiley 1977). The classical text for design-based survey inference.
- Lohr, S. L. Sampling: Design and Analysis (3rd ed., CRC 2021). The standard modern textbook covering both probability and non-probability designs.
- Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley 2010). Practical implementation, including calibration and weighting.
Integration of probability and non-probability samples:
- Wiśniowski, A., Sakshaug, J. W., Perez Ruiz, D. A., Blom, A. G. "Integrating probability and nonprobability samples for survey inference." Journal of Survey Statistics and Methodology 8(1) (2020): 120-147. Standard reference for the limits of inclusion-probability estimation when probabilities may be zero.
- Chen, Y., Li, P., Wu, C. "Doubly robust inference with non-probability survey samples." Journal of the American Statistical Association 115(532) (2020): 2011-2021. The doubly robust construction.
- Elliott, M. R., Valliant, R. "Inference for nonprobability samples." Statistical Science 32(2) (2017): 249-264. Survey-style review.
- Kim, J. K., Park, S., Chen, Y., Wu, C. "Combining non-probability and probability survey samples through mass imputation." Journal of the Royal Statistical Society Series A 184(3) (2021): 941-963. Mass imputation framework.
Calibration and design weighting:
- Deville, J.-C., Särndal, C.-E. "Calibration estimators in survey sampling." Journal of the American Statistical Association 87(418) (1992): 376-382. The original GREG / calibration paper.
- Särndal, C.-E., Lundström, S. Estimation in Surveys with Nonresponse (Wiley 2005). Calibration under nonresponse, a partially-non-probability regime.
Sensitivity analysis:
- Rosenbaum, P. R. Observational Studies (2nd ed., Springer 2002). The sensitivity model.
- Cinelli, C., Hazlett, C. "Making sense of sensitivity: Extending omitted variable bias." Journal of the Royal Statistical Society Series B 82(1) (2020): 39-67. Modern extensions to non-experimental data.
Educational and policy resources:
- AAPOR Task Force on Non-Probability Sampling. "Report of the AAPOR Task Force on Non-Probability Sampling." Journal of Survey Statistics and Methodology 1(2) (2013): 90-143. Practitioner-oriented guidance.
- OECD, "Quality Framework for OECD Statistics" (2011). Quality criteria that apply across data sources, including non-probability inputs.
Cross-Network Links
This page sits at the intersection of survey methodology, causal inference, and modern ML evaluation. Natural neighbours:
- Survey sampling methods is the parent page for the probability-design baseline.
- GREG estimator is the technical companion for calibration in the model-assisted form.
- Double/debiased machine learning shares the orthogonal-score machinery and the product-rate condition used by doubly robust non-probability estimators.
- Weighted conformal prediction generalizes the same idea (reweight by likelihood ratio under covariate shift) for prediction-set construction rather than mean estimation.
- Total variation distance bounds worst-case shift in expectation under bounded-loss families; one way to read the data-defect identity is as a TV-flavoured statement with the sampling fraction explicit.
- Convex tinkering is the methodological framing for using non-probability data as cheap exploration without treating it as final population evidence.
- Hypothesis testing for ML inherits the same selection issues when the test set is itself a convenience sample.
Next Topics
- GREG estimator for the calibration technical details.
- Double/debiased machine learning for the doubly robust orthogonal-score construction.
- Convex tinkering for the methodological framing of cheap exploration vs committed inference.
Last reviewed: April 28, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- Central Limit Theorem (layer 0B · tier 1)
- Law of Large Numbers (layer 0B · tier 1)
- Double/Debiased Machine Learning (layer 3 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.