
Applied Statistics

Non-Probability Sampling

Convenience and opt-in samples do not give probability-of-inclusion guarantees. The data-defect identity (Meng 2018) shows why a massive convenience sample can produce a confidently wrong answer. Repair methods: calibration, sampling-score weighting, mass imputation, doubly robust integration with a probability sample, and sensitivity analysis.

Advanced · Tier 1 · Current · Core spine · ~50 min

Why This Matters

A probability sample is one where every population unit has a known, positive probability of being selected and the design is recorded in the data. Classical survey inference is a story about probability samples: known inclusion probabilities give unbiased Horvitz-Thompson estimators and design-based variances. The math works because selection is a controlled random experiment.

A non-probability sample is anything else: web-panel respondents, app telemetry, social-media posts, opt-in surveys, convenience samples, scraped records, voluntary-response polls, log data, and most large datasets used in machine learning. Inclusion probabilities are unknown, may be zero for some units, and depend on covariates and outcomes in ways that are not recorded.

This distinction is increasingly the central problem in applied statistics and causal ML. Three things are simultaneously true:

  • Probability samples are expensive, slow, and shrinking; response rates on classical household surveys have fallen for two decades.
  • Non-probability data is cheap, fast, and growing. Web logs, app telemetry, and voluntary panels are the operational substrate of modern evidence.
  • Population inference from non-probability data without correction is more dangerous as N grows, not less. This is Meng's big data paradox (formal statement below): the absolute error of a sample mean from a biased non-probability sample scales with \sqrt{N/n} for fixed selection-bias correlation, so a larger n at a still-tiny sampling fraction only sharpens the wrong answer.

The methodological response is data integration: combine non-probability data with probability-sample information or known population totals to recover defensible inference. This is the natural home of double/debiased machine learning, calibration estimators (GREG), and sampling-score weighting.

Three Sampling Regimes

| Regime | What is known about selection | Main strength | Main failure mode |
| --- | --- | --- | --- |
| Probability sample | Inclusion probability \pi_i for every unit; design fully recorded | Design-based, distribution-free inference (Horvitz-Thompson) | Cost, declining response, frame coverage gaps |
| Non-probability sample | Records exist; selection mechanism unknown and unrecorded | Cheap, fast, large n | Unknown selection bias; population estimands generally not identified without auxiliary information |
| Integrated sample | Non-probability data plus a probability sample or known population totals | Efficiency of large n with correctness from auxiliary information | Requires identification assumptions; only as good as the auxiliary data |

Probability sampling is the design-based foundation; see survey sampling methods for the standard designs (SRS, stratified, cluster, multi-stage) and the GREG estimator for the model-assisted calibration form.

The Statistical Problem

Let \mathcal{P} = \{1, \ldots, N\} be a finite population with values \{y_1, \ldots, y_N\}. Define the inclusion indicator R_i \in \{0, 1\} with R_i = 1 when unit i is in the observed sample. The sample size is n = \sum_i R_i and the sampling fraction is f = n/N.

The sample mean is

\bar{Y}_n = \frac{1}{n} \sum_{i : R_i = 1} y_i = \frac{\sum_i R_i y_i}{\sum_i R_i}.

The population mean is \bar{Y}_N = \frac{1}{N} \sum_i y_i. The bias of the sample mean is

\bar{Y}_n - \bar{Y}_N = \mathbb{E}[Y \mid R = 1] - \mathbb{E}[Y],

where the expectations are taken over a uniform random draw of a unit from the population, so that \mathbb{E}[Y \mid R = 1] = \bar{Y}_n exactly.

This expression is the source of all the trouble. When selection R is independent of the outcome Y (probability sampling with constant inclusion probability, or simple random sampling on a representative frame), the conditional and marginal expectations agree and the bias is zero. When selection depends on Y, directly or indirectly through covariates correlated with Y, the bias is non-zero and does not vanish as n \to \infty.

This is structurally different from sampling-error variance, which shrinks like 1/n. Selection bias is a bias, not a variance, and collecting more biased data does not fix it.

The Data-Defect Identity (Meng 2018)

The clean identity that makes the big-data paradox sharp.

Theorem

Data-Defect Identity (Meng 2018)

Statement

For a finite population of size N with values \{y_i\}, observed sample size n, and inclusion indicator R_i,

\bar{Y}_n - \bar{Y}_N \;=\; \rho_{R,Y} \cdot \sigma_Y \cdot \sqrt{\frac{1 - f}{f}},

where f = n/N is the sampling fraction, \sigma_Y is the population standard deviation of Y, and

\rho_{R,Y} = \mathrm{Corr}(R, Y) = \frac{\mathrm{Cov}(R, Y)}{\sqrt{\mathrm{Var}(R)\, \mathrm{Var}(Y)}}

is the data-defect correlation between the inclusion indicator and the outcome.

Intuition

The estimation error decomposes into three factors: how biased the selection is (\rho_{R,Y}, the correlation between "are you in the sample" and "what is your outcome"), how variable the outcome is in the population (\sigma_Y), and a sampling-fraction multiplier \sqrt{(1-f)/f}.

The sampling-fraction multiplier is the surprising part. For fixed \rho_{R,Y} and small f, error scales like \sqrt{1/f} = \sqrt{N/n}. A massive convenience sample (large n but with f still small because N is huge) produces a more confidently wrong answer than a smaller biased sample at the same \rho_{R,Y}. The sampling variance of \bar{Y}_n goes to zero with n, sharpening point estimation around the biased target \mathbb{E}[Y \mid R = 1], not around \bar{Y}_N.

Proof Sketch

The covariance form is mechanical. With expectations over a uniform draw of a unit,

\mathrm{Cov}(R, Y) = \mathbb{E}[RY] - \mathbb{E}[R]\,\mathbb{E}[Y] = f \bar{Y}_n - f \bar{Y}_N = f (\bar{Y}_n - \bar{Y}_N).

Solving for the difference, and using \mathrm{Var}(R) = f(1-f):

\bar{Y}_n - \bar{Y}_N = \frac{\mathrm{Cov}(R, Y)}{f} = \frac{\rho_{R,Y} \sqrt{\mathrm{Var}(R)\, \mathrm{Var}(Y)}}{f} = \frac{\rho_{R,Y} \sqrt{f(1-f)}\, \sigma_Y}{f} = \rho_{R,Y} \sigma_Y \sqrt{\frac{1-f}{f}}.

Full derivation in Meng (2018), Annals of Applied Statistics.
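Since the identity is exact for any realized inclusion pattern, it is easy to check numerically. A minimal sketch; the population and selection mechanism below are illustrative assumptions, not from Meng's paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
y = rng.normal(size=N)                     # finite population of outcomes

# Biased selection: units with larger y are more likely to be included.
p = 1 / (1 + np.exp(-(y - 2)))             # illustrative inclusion propensity
R = (rng.random(N) < p).astype(float)      # realized inclusion indicator
f = R.mean()                               # sampling fraction n / N

lhs = y[R == 1].mean() - y.mean()          # actual error of the sample mean

# Finite-population moments: the identity is exact, not asymptotic.
rho = np.corrcoef(R, y)[0, 1]              # data-defect correlation
rhs = rho * y.std() * np.sqrt((1 - f) / f)

print(lhs, rhs)                            # agree up to floating-point error
```

Both sides coincide for any realized sample, not just on average, because the derivation above is pure algebra on finite-population moments.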

Why It Matters

The identity quantifies what "selection bias" costs in a single, operational number. For the 2016 US presidential election, Meng's analysis showed that large voter-intention datasets had effective sample sizes orders of magnitude below their nominal n: millions of self-selected responses carried roughly the information content of a few hundred simple-random-sample responses, and enlarging the sample without fixing the selection mechanism only made the estimates look more precise than they were. The same identity is now standard in COVID-19 prevalence-survey audits and in modern web-panel reweighting evaluation.

The qualitative payoff: estimating a population mean from non-probability data requires either a small bias correlation (a structural argument that selection is roughly independent of Y) or an explicit correction (calibration, weighting, imputation, or DR). Large n is not on its own a defense, and is in fact a hazard.

Failure Mode

The identity is for the simple sample mean of a single Y. For regression coefficients, ratio estimators, ML predictions, or causal estimands, analogous decompositions exist but the data-defect "correlation" is replaced by a richer object (a function of the score, the propensity, and the outcome model). The intuition (selection bias persists and can sharpen as the sample grows) carries over uniformly. The exact constant does not.

Repair Methods

Five standard methods, ordered roughly by how much auxiliary information they require. None is universally best; each buys correctness at the price of a different identification assumption.

| Method | What you need | What you assume | Output |
| --- | --- | --- | --- |
| Calibration / raking / GREG | Known population totals of auxiliary variables X (counts, marginals) | Outcome model linear in the calibration auxiliaries, or close to it | Adjusted weights so the weighted sample matches population totals on X |
| Sampling-score weighting | Probability reference sample with the same X | A model for \pi(x) = P(R = 1 \mid X = x) is correctly specified, with positivity | Inverse-probability-weighted estimator |
| Mass imputation | Probability reference sample with Y observed there, or population frame with X | An outcome model m(x) = \mathbb{E}[Y \mid X] is correctly specified | Predict \hat{Y}_i for every population unit and average |
| Doubly robust | Both of the above ingredients | At least one of the two models (sampling score or outcome) is correctly specified | Consistent under either; product-rate bias when both are misspecified but converge |
| Sensitivity analysis | Any of the above, plus a parametric bound on residual selection on unmeasured variables | A worst-case bound on the unmeasured-confounding parameter | Interval whose coverage degrades gracefully |

Calibration / GREG

Calibration adjusts the sample weights \{w_i\} to match known population totals on auxiliary variables. Solve:

\min_w \sum_{i \in s} G(w_i, d_i) \quad \text{subject to} \quad \sum_{i \in s} w_i x_i = T_x,

where d_i is a starting weight (e.g., 1/\pi_i from the design, or simply N/n for non-probability data), T_x is the known population total of x, and G is a distance function (chi-squared, raking, logit). The GREG estimator is the special case where G is the chi-squared distance and the auxiliary model is linear.

When X predicts both Y and R, calibration removes the part of the selection bias that operates through X. It cannot remove selection that operates through unmeasured variables. That residual is the substance of sensitivity analysis below.
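A minimal numeric sketch of the chi-squared (GREG) case with one auxiliary covariate plus an intercept; the population, the deliberately biased sample, and all names are illustrative assumptions, not from any survey package:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 500, 10_000

# Auxiliary variables: intercept + one covariate, with known population totals T_x.
x_pop = np.column_stack([np.ones(N), rng.normal(1.0, 1.0, N)])
T_x = x_pop.sum(axis=0)

# Badly biased sample: the n units with the largest covariate values.
x_s = x_pop[np.argsort(x_pop[:, 1])[-n:]]
d = np.full(n, N / n)                     # starting weights d_i

# Chi-squared-distance calibration: w_i = d_i (1 + x_i' lambda), where
# lambda solves (sum d_i x_i x_i') lambda = T_x - sum d_i x_i.
A = (d[:, None] * x_s).T @ x_s
lam = np.linalg.solve(A, T_x - d @ x_s)
w = d * (1 + x_s @ lam)

print(w @ x_s, T_x)                       # weighted sample totals now match T_x
```

The calibrated weights reproduce the known totals exactly; whether they also remove the bias in Y depends on how well the linear auxiliary model tracks Y.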

Sampling-score weighting

Treat the non-probability sample like a missing-data problem: every unit has a latent inclusion probability \pi(x) = P(R = 1 \mid X = x), fit this from a probability reference sample with the same X, and weight the non-probability units by 1/\hat{\pi}(x). The estimator is

\hat{\bar{Y}}_{\text{IPW}} = \frac{\sum_{i : R_i = 1} y_i / \hat{\pi}(x_i)}{\sum_{i : R_i = 1} 1 / \hat{\pi}(x_i)}.

This requires positivity: \pi(x) > 0 for all x in the target population. Web-panel and opt-in samples often have \pi(x) = 0 on sub-populations (people who do not use the panel platform at all), which voids the construction for those sub-populations no matter how clever the weighting. Wiśniowski et al. (2020, Journal of Survey Statistics and Methodology) is the standard treatment of this issue.
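A simulation sketch of the estimator. Here the true \pi(x) is known by construction, sidestepping the fitting step, so this only illustrates what correct weights buy; all numbers and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)          # outcome depends on x

# Selection depends on x (and hence on y through x). The true pi(x) is
# available here only because this is a simulation; in practice it is
# fitted against a probability reference sample sharing the same x.
pi = 1 / (1 + np.exp(-(x - 1)))
R = rng.random(N) < pi

naive = y[R].mean()                              # over-represents large-x units
ipw = np.sum(y[R] / pi[R]) / np.sum(1 / pi[R])   # Hajek-form IPW estimator

print(y.mean(), naive, ipw)
```

The naive mean lands well above the population mean; the weighted estimator recovers it, at the cost of extra variance from the large weights on rarely-included units.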

Mass imputation

If a probability reference sample carries Y, fit \hat{m}(x) = \hat{\mathbb{E}}[Y \mid X] on it. Then for every population unit (or every unit in the non-probability sample), predict \hat{Y}_i = \hat{m}(x_i) and average:

\hat{\bar{Y}}_{\text{MI}} = \frac{1}{N} \sum_{i \in \mathcal{P}} \hat{m}(x_i).

This is consistent when the outcome model is correctly specified; it fails when m is misspecified, in roughly the way ordinary regression-based extrapolation fails.
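A sketch under the stated assumption that the outcome model is correctly specified (linear here), with a simple-random reference sample carrying Y; the setup is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Probability reference sample (SRS) that carries Y.
ref = rng.choice(N, size=1_000, replace=False)

# Fit the outcome model m(x) = E[Y | X] on the reference sample via OLS.
X_ref = np.column_stack([np.ones(ref.size), x[ref]])
beta = np.linalg.lstsq(X_ref, y[ref], rcond=None)[0]

# Mass imputation: predict for every population unit and average.
y_mi = (np.column_stack([np.ones(N), x]) @ beta).mean()
print(y.mean(), y_mi)
```

With a misspecified m (say, omitting x entirely) the same machinery would simply reproduce the reference-sample mean, which is where the regression-extrapolation caveat bites.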

Doubly robust

Combine sampling-score weighting and mass imputation. The standard form, adapting AIPW from causal inference:

\hat{\bar{Y}}_{\text{DR}} = \frac{1}{N} \sum_{i \in \mathcal{P}} \hat{m}(x_i) + \frac{1}{N} \sum_{i : R_i = 1} \frac{y_i - \hat{m}(x_i)}{\hat{\pi}(x_i)}.

This is consistent when either \hat{\pi} is consistent for \pi or \hat{m} is consistent for m, under positivity and standard regularity. The bias is the product of the two estimation errors, so doubly robust estimation pairs naturally with double/debiased machine learning: the orthogonal-score formulation gives \sqrt{n}-rate inference under the same product-rate condition with cross-fitted nuisance estimators. Chen, Li, and Wu (2020) and the JRSSA / Journal of Official Statistics literature develop this in the survey-sampling setting.
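A simulation sketch of the double-robustness property: the outcome model below is deliberately misspecified (intercept only, fit on the biased sample), while the sampling score is correct, and the DR estimator still recovers the population mean. Setup and numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

pi = 1 / (1 + np.exp(-(x - 1)))       # true sampling score (known in simulation)
R = rng.random(N) < pi

# Deliberately misspecified outcome model: intercept only, fit on the
# biased sample, so it inherits the selection bias.
m_hat = np.full(N, y[R].mean())

mi = m_hat.mean()                                # mass imputation alone: biased
corr = np.sum((y[R] - m_hat[R]) / pi[R]) / N     # IPW correction of residuals
dr = mi + corr

print(y.mean(), mi, dr)
```

Swapping the roles (correct m, misspecified \hat{\pi}) gives the symmetric result; only when both are wrong does the product bias survive.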

Sensitivity analysis

Even after the four corrections above, residual selection on unmeasured variables may remain. A sensitivity analysis parameterizes this residual (e.g., Rosenbaum's \Gamma for hidden confounding, or a bound on the omitted-variable correlation \rho_{R,Y \mid X}) and reports estimates that degrade gracefully as the parameter grows. This gives an honest answer to "what if our adjustment leaves an unmeasured selection mechanism with strength \Gamma behind?", without claiming a guarantee.
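One simple form uses the data-defect identity itself as the sensitivity model: given an assumed bound rho_max on the post-adjustment defect correlation |\rho_{R,Y \mid X}|, report the worst-case residual bias around the adjusted estimate. The point estimate and bound below are hypothetical:

```python
import math

def worst_case_bias(rho_max, sigma_y, f):
    """Worst-case residual bias from the data-defect identity, given an
    assumed bound rho_max on the post-adjustment defect correlation."""
    return rho_max * sigma_y * math.sqrt((1 - f) / f)

est = 0.42                                   # hypothetical calibrated estimate
b = worst_case_bias(rho_max=0.01, sigma_y=1.0, f=1e-4)
print(est - b, est + b)                      # interval widens as rho_max grows
```

Reporting the interval as a function of rho_max, rather than a single number, is what "degrades gracefully" means in practice.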

Connection to Convex Tinkering

Cheap, biased data is high-leverage for exploration and dangerous as final population evidence. This is the survey-statistics counterpart to convex tinkering: convenience data is a useful cheap-option generator under bounded downside, not a final estimator under unbounded downside.

The convex move:

  1. Use non-probability data to screen hypotheses and surface candidate patterns at low cost.
  2. Reserve probability sampling, randomized experiments, or calibration against trusted population totals for the candidates that might actually drive a decision.
  3. Treat the transition from a screening insight to a committed claim as the moment downside becomes unbounded. Calibration, weighting, or DR adjustment is what re-bounds it.

The failure mode is using non-probability data for both steps, treating exploratory pattern-mining as evidence by virtue of n alone. Meng's data-defect identity is the warning label.

Common Confusions

Watch Out

Bigger n fixes selection bias

The data-defect identity makes this exactly wrong. For fixed \rho_{R,Y}, the absolute estimation error scales as \sqrt{(1-f)/f}. Only the sampling fraction f = n/N, not the raw count n, controls the error at a given bias correlation. A non-probability sample of n = 10^7 from a population of N = 10^9 has f = 0.01, no better than a smaller sample at the same \rho. Confidence intervals built from \sqrt{n}-style asymptotics will shrink, but they shrink around the biased target, not the population parameter.
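The scaling is easy to tabulate. At this Watch Out's bias correlation \rho = 0.05 and \sigma_Y = 1, the absolute error depends only on f:

```python
import math

rho, sigma = 0.05, 1.0
for f in (0.1, 0.01, 0.001, 0.0001):
    err = rho * sigma * math.sqrt((1 - f) / f)
    print(f"f = {f:<7} error = {err:.3f}")
```

Only shrinking the bias correlation or raising the sampling fraction moves the error; adding raw observations at a fixed f does nothing.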

Watch Out

Probability sample is automatically representative

A probability sample with a declining response rate becomes a hybrid: the design probabilities \pi_i are still known, but actual response \delta_i is conditioned on covariates and outcomes. The effective analysis is non-probability with a known design weight; the data-defect identity applies to the response process. This is why even classical agencies that retain probability designs increasingly publish calibration-adjusted weights.

Watch Out

Calibration removes all selection bias

Calibration removes selection that operates through observed auxiliary variables X. It does not touch selection that operates through unmeasured variables, residual variation in Y at fixed X, or non-linearities the calibration model misses. The X-conditional residual \rho_{R,Y \mid X} is what sensitivity analysis bounds.

Watch Out

Doubly robust means twice as accurate

"Doubly robust" means consistent under either of two models, not "more accurate than singly robust." If both models are misspecified, DR can be worse than a well-specified single model. The product-bias shape is the right intuition: bias is small when at least one model is near the truth, and the rate of convergence to the truth is the product of the two estimation rates (slow \times slow can still be fast enough; see DML's n^{-1/4} product-rate condition).

Watch Out

Anonymous, opt-in, or scraped data is exempt from this

The data-defect identity does not care how the data was obtained. It applies to any sample with a non-trivial inclusion mechanism. Web scrapes, app telemetry, leaked records, opt-in panels, and clinical convenience cohorts all sit in the same regime. The only difference is how much auxiliary information is available to attempt a correction; none of them get the design-based guarantees of probability sampling.

Exercises

Exercise · Core

Problem

A population of N = 10^9 has standard deviation \sigma_Y = 1. A non-probability sample of n = 10^6 (sampling fraction f = 10^{-3}) has data-defect correlation \rho_{R,Y} = 0.05. What is the absolute error of the sample mean as an estimator of the population mean?

Exercise · Core

Problem

For the same population and bias correlation as above, how large would the sampling fraction f need to be for the bias to drop to 0.05 standard deviations of Y?

Exercise · Advanced

Problem

A web-panel sample is calibrated against the marginal distribution of age, gender, and education. After calibration, the residual data-defect correlation \rho_{R,Y \mid X} is bounded above by 0.01 under a sensitivity analysis. With \sigma_Y = 1 and f = 10^{-4}, give the worst-case absolute estimation error and explain how it differs from the unconditional version.

References

Canonical:

  • Meng, X.-L. "Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election." Annals of Applied Statistics 12(2) (2018): 685-726. The data-defect identity and the "big data paradox" framing.
  • Cochran, W. G. Sampling Techniques (3rd ed., Wiley 1977). The classical text for design-based survey inference.
  • Lohr, S. L. Sampling: Design and Analysis (3rd ed., CRC 2021). The standard modern textbook covering both probability and non-probability designs.
  • Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley 2010). Practical implementation, including calibration and weighting.

Integration of probability and non-probability samples:

  • Wiśniowski, A., Sakshaug, J. W., Perez Ruiz, D. A., Blom, A. G. "Integrating probability and nonprobability samples for survey inference." Journal of Survey Statistics and Methodology 8(1) (2020): 120-147. Standard reference for the limits of inclusion-probability estimation when probabilities may be zero.
  • Chen, Y., Li, P., Wu, C. "Doubly robust inference with non-probability survey samples." Journal of the American Statistical Association 115(532) (2020): 2011-2021. The doubly robust construction.
  • Elliott, M. R., Valliant, R. "Inference for nonprobability samples." Statistical Science 32(2) (2017): 249-264. Survey-style review.
  • Kim, J. K., Park, S., Chen, Y., Wu, C. "Combining non-probability and probability survey samples through mass imputation." Journal of the Royal Statistical Society Series A 184(3) (2021): 941-963. Mass imputation framework.

Calibration and design weighting:

  • Deville, J.-C., Särndal, C.-E. "Calibration estimators in survey sampling." Journal of the American Statistical Association 87(418) (1992): 376-382. The original GREG / calibration paper.
  • Särndal, C.-E., Lundström, S. Estimation in Surveys with Nonresponse (Wiley 2005). Calibration under nonresponse, a partially-non-probability regime.

Sensitivity analysis:

  • Rosenbaum, P. R. Observational Studies (2nd ed., Springer 2002). The Γ\Gamma sensitivity model.
  • Cinelli, C., Hazlett, C. "Making sense of sensitivity: Extending omitted variable bias." Journal of the Royal Statistical Society Series B 82(1) (2020): 39-67. Modern extensions to non-experimental data.

Educational and policy resources:

  • AAPOR Task Force on Non-Probability Sampling. "Report of the AAPOR Task Force on Non-Probability Sampling." Journal of Survey Statistics and Methodology 1(2) (2013): 90-143. Practitioner-oriented guidance.
  • OECD, "Quality Framework for OECD Statistics" (2011). Quality criteria that apply across data sources, including non-probability inputs.

Cross-Network Links

This page sits at the intersection of survey methodology, causal inference, and modern ML evaluation. Natural neighbours:

  • Survey sampling methods is the parent page for the probability-design baseline.
  • GREG estimator is the technical companion for calibration in the model-assisted form.
  • Double/debiased machine learning shares the orthogonal-score machinery and the product-rate condition used by doubly robust non-probability estimators.
  • Weighted conformal prediction generalizes the same idea (reweight by likelihood ratio under covariate shift) for prediction-set construction rather than mean estimation.
  • Total variation distance bounds worst-case shift in expectation under bounded-loss families; one way to read the data-defect identity is as a TV-flavoured statement with the sampling fraction explicit.
  • Convex tinkering is the methodological framing for using non-probability data as cheap exploration without treating it as final population evidence.
  • Hypothesis testing for ML inherits the same selection issues when the test set is itself a convenience sample.


Last reviewed: April 28, 2026
