
Applied Statistics

Non-Probability Sampling

Convenience and opt-in samples do not give probability-of-inclusion guarantees. The data-defect identity (Meng 2018) shows why a massive convenience sample can produce a confidently wrong answer. Repair methods: calibration, sampling-score weighting, mass imputation, doubly robust integration with a probability sample, and sensitivity analysis.

Advanced · Tier 1 · Current · Core spine · ~50 min

Why This Matters

A probability sample is one where every population unit has a known, positive probability of being selected and the design is recorded in the data. Classical survey inference is a story about probability samples: known inclusion probabilities give unbiased Horvitz-Thompson estimators and design-based variances. The math works because selection is a controlled random experiment.

A non-probability sample is anything else: web-panel respondents, app telemetry, social-media posts, opt-in surveys, convenience samples, scraped records, voluntary-response polls, log data, and most large datasets used in machine learning. Inclusion probabilities are unknown, may be zero for some units, and depend on covariates and outcomes in ways that are not recorded.

This distinction is increasingly the central problem in applied statistics and causal ML. Three things are simultaneously true:

  • Probability samples are expensive, slow, and shrinking; response rates on classical household surveys have fallen for two decades.
  • Non-probability data is cheap, fast, and growing. Web logs, app telemetry, and voluntary panels are the operational substrate of modern evidence.
  • Population inference from non-probability data without correction is more dangerous as N grows, not less. This is Meng's big data paradox (formal statement below): the absolute error of a sample mean from a biased non-probability sample scales with \sqrt{N/n} for fixed selection-bias correlation, so a larger n at a still-tiny sampling fraction only sharpens the wrong answer.

The methodological response is data integration: combine non-probability data with probability-sample information or known population totals to recover defensible inference. This is the natural home of double/debiased machine learning, calibration estimators (GREG), and sampling-score weighting.

Three Sampling Regimes

| Regime | What is known about selection | Main strength | Main failure mode |
| --- | --- | --- | --- |
| Probability sample | Inclusion probability \pi_i for every unit; design fully recorded | Design-based, distribution-free inference (Horvitz-Thompson) | Cost, declining response, frame coverage gaps |
| Non-probability sample | Records exist; selection mechanism unknown and unrecorded | Cheap, fast, large n | Unknown selection bias; population estimands generally not identified without auxiliary information |
| Integrated sample | Non-probability data plus a probability sample or known population totals | Efficiency of large n with correctness from auxiliary information | Requires identification assumptions; only as good as the auxiliary data |

Probability sampling is the design-based foundation; see survey sampling methods for the standard designs (SRS, stratified, cluster, multi-stage) and the GREG estimator for the model-assisted calibration form.

The Statistical Problem

Let \mathcal{P} = \{1, \ldots, N\} be a finite population with values \{y_1, \ldots, y_N\}. Define the inclusion indicator R_i \in \{0, 1\} with R_i = 1 when unit i is in the observed sample. The sample size is n = \sum_i R_i and the sampling fraction is f = n/N.

The sample mean is

\bar{Y}_n = \frac{1}{n} \sum_{i : R_i = 1} y_i = \frac{\sum_i R_i y_i}{\sum_i R_i}.

The population mean is \bar{Y}_N = \frac{1}{N} \sum_i y_i. The bias of the sample mean is

\bar{Y}_n - \bar{Y}_N = \mathbb{E}[Y \mid R = 1] - \mathbb{E}[Y],

where the expectations are taken over a uniform random draw of a unit from the population, so that \mathbb{E}[Y \mid R = 1] = \bar{Y}_n exactly.

This expression is the source of all the trouble. When selection R is independent of the outcome Y (probability sampling with constant inclusion probability, or simple random sampling on a representative frame), the conditional and marginal expectations agree and the bias is zero. When selection depends on Y, directly or indirectly through covariates correlated with Y, the bias is non-zero and does not vanish as n \to \infty.

This is structurally different from sampling-error variance, which shrinks like 1/n. Selection bias is a bias, not a variance, and collecting more biased data does not fix it.

The Data-Defect Identity (Meng 2018)

The clean identity that makes the big-data paradox sharp.

Theorem

Data-Defect Identity (Meng 2018)

Statement

For a finite population of size N with values \{y_i\}, observed sample size n, and inclusion indicator R_i,

\bar{Y}_n - \bar{Y}_N \;=\; \rho_{R,Y} \cdot \sigma_Y \cdot \sqrt{\frac{1 - f}{f}},

where f = n/N is the sampling fraction, \sigma_Y is the population standard deviation of Y, and

\rho_{R,Y} = \mathrm{Corr}(R, Y) = \frac{\mathrm{Cov}(R, Y)}{\sqrt{\mathrm{Var}(R)\, \mathrm{Var}(Y)}}

is the data-defect correlation between the inclusion indicator and the outcome.

Intuition

The estimation error decomposes into three factors: how biased the selection is (\rho_{R,Y}, the correlation between "are you in the sample" and "what is your outcome"), how variable the outcome is in the population (\sigma_Y), and a sampling-fraction multiplier \sqrt{(1-f)/f}.

The sampling-fraction multiplier is the surprising part. For fixed \rho_{R,Y} and small f, error scales like \sqrt{1/f} = \sqrt{N/n}. A massive convenience sample (large n but with f still small because N is huge) produces a more confidently wrong answer than a smaller biased sample at the same \rho_{R,Y}. The sampling variance of \bar{Y}_n goes to zero with n, sharpening point estimation around the biased target \mathbb{E}[Y \mid R = 1], not around \bar{Y}_N.

Proof Sketch

The covariance form is mechanical. With expectations over a uniform draw of a unit,

\mathrm{Cov}(R, Y) = \mathbb{E}[RY] - \mathbb{E}[R]\,\mathbb{E}[Y] = f \bar{Y}_n - f \bar{Y}_N = f (\bar{Y}_n - \bar{Y}_N).

Solving for the difference, and using \mathrm{Var}(R) = f(1-f):

\bar{Y}_n - \bar{Y}_N = \frac{\mathrm{Cov}(R, Y)}{f} = \frac{\rho_{R,Y} \sqrt{\mathrm{Var}(R)\, \mathrm{Var}(Y)}}{f} = \frac{\rho_{R,Y} \sqrt{f(1-f)}\, \sigma_Y}{f} = \rho_{R,Y} \sigma_Y \sqrt{\frac{1-f}{f}}.

Full derivation in Meng (2018), Annals of Applied Statistics.
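Since the identity is exact for any realized inclusion pattern, it is easy to check numerically. A minimal sketch; the population and selection mechanism below are illustrative assumptions, not from Meng's paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
y = rng.normal(size=N)                     # finite population of outcomes

# Biased selection: units with larger y are more likely to be included.
p = 1 / (1 + np.exp(-(y - 2)))             # illustrative inclusion propensity
R = (rng.random(N) < p).astype(float)      # realized inclusion indicator
f = R.mean()                               # sampling fraction n / N

lhs = y[R == 1].mean() - y.mean()          # actual error of the sample mean

# Finite-population moments: the identity is exact, not asymptotic.
rho = np.corrcoef(R, y)[0, 1]              # data-defect correlation
rhs = rho * y.std() * np.sqrt((1 - f) / f)

print(lhs, rhs)                            # agree up to floating-point error
```

Both sides coincide for any realized sample, not just on average, because the derivation above is pure algebra on finite-population moments.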

Why It Matters

The identity quantifies what "selection bias" costs in a single, operational number. For the 2016 US presidential election, Meng's analysis showed that large voter-intention datasets had effective sample sizes orders of magnitude below their nominal n: millions of self-selected responses carried roughly the information content of a few hundred simple-random-sample responses, and enlarging the sample without fixing the selection mechanism only made the estimates look more precise than they were. The same identity is now standard in COVID-19 prevalence-survey audits and in modern web-panel reweighting evaluation.

The qualitative payoff: estimating a population mean from non-probability data requires either a small bias correlation (a structural argument that selection is roughly independent of Y) or an explicit correction (calibration, weighting, imputation, or DR). Large n is not on its own a defense, and is in fact a hazard.

Failure Mode

The identity is for the simple sample mean of a single Y. For regression coefficients, ratio estimators, ML predictions, or causal estimands, analogous decompositions exist but the data-defect "correlation" is replaced by a richer object (a function of the score, the propensity, and the outcome model). The intuition (selection bias persists and can sharpen as the sample grows) carries over uniformly. The exact constant does not.

Repair Methods

Five standard methods, ordered roughly by how much auxiliary information they require. None is universally best; each buys correctness at the price of a different identification assumption.

| Method | What you need | What you assume | Output |
| --- | --- | --- | --- |
| Calibration / raking / GREG | Known population totals of auxiliary variables X (counts, marginals) | Outcome model linear in the calibration auxiliaries, or close to it | Adjusted weights so the weighted sample matches population totals on X |
| Sampling-score weighting | Probability reference sample with the same X | A model for \pi(x) = P(R = 1 \mid X = x) is correctly specified, with positivity | Inverse-probability-weighted estimator |
| Mass imputation | Probability reference sample with Y observed there, or population frame with X | An outcome model m(x) = \mathbb{E}[Y \mid X] is correctly specified | Predict \hat{Y}_i for every population unit and average |
| Doubly robust | Both of the above ingredients | At least one of the two models (sampling score or outcome) is correctly specified | Consistent under either; product-rate bias when both are misspecified but converge |
| Sensitivity analysis | Any of the above, plus a parametric bound on residual selection on unmeasured variables | A worst-case bound on the unmeasured-confounding parameter | Interval whose coverage degrades gracefully |

Calibration / GREG

Calibration adjusts the sample weights \{w_i\} to match known population totals on auxiliary variables. Solve:

\min_w \sum_{i \in s} G(w_i, d_i) \quad \text{subject to} \quad \sum_{i \in s} w_i x_i = T_x,

where d_i is a starting weight (e.g., 1/\pi_i from the design, or simply N/n for non-probability data), T_x is the known population total of x, and G is a distance function (chi-squared, raking, logit). The GREG estimator is the special case where G is the chi-squared distance and the auxiliary model is linear.

When X predicts both Y and R, calibration removes the part of the selection bias that operates through X. It cannot remove selection that operates through unmeasured variables. That residual is the substance of sensitivity analysis below.
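A minimal numeric sketch of the chi-squared (GREG) case with one auxiliary covariate plus an intercept; the population, the deliberately biased sample, and all names are illustrative assumptions, not from any survey package:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 500, 10_000

# Auxiliary variables: intercept + one covariate, with known population totals T_x.
x_pop = np.column_stack([np.ones(N), rng.normal(1.0, 1.0, N)])
T_x = x_pop.sum(axis=0)

# Badly biased sample: the n units with the largest covariate values.
x_s = x_pop[np.argsort(x_pop[:, 1])[-n:]]
d = np.full(n, N / n)                     # starting weights d_i

# Chi-squared-distance calibration: w_i = d_i (1 + x_i' lambda), where
# lambda solves (sum d_i x_i x_i') lambda = T_x - sum d_i x_i.
A = (d[:, None] * x_s).T @ x_s
lam = np.linalg.solve(A, T_x - d @ x_s)
w = d * (1 + x_s @ lam)

print(w @ x_s, T_x)                       # weighted sample totals now match T_x
```

The calibrated weights reproduce the known totals exactly; whether they also remove the bias in Y depends on how well the linear auxiliary model tracks Y.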

Sampling-score weighting

Treat the non-probability sample like a missing-data problem: every unit has a latent inclusion probability \pi(x) = P(R = 1 \mid X = x), fit this from a probability reference sample with the same X, and weight the non-probability units by 1/\hat{\pi}(x). The estimator is

\hat{\bar{Y}}_{\text{IPW}} = \frac{\sum_{i : R_i = 1} y_i / \hat{\pi}(x_i)}{\sum_{i : R_i = 1} 1 / \hat{\pi}(x_i)}.

This requires positivity: \pi(x) > 0 for all x in the target population. Web-panel and opt-in samples often have \pi(x) = 0 on sub-populations (people who do not use the panel platform at all), which voids the construction for those sub-populations no matter how clever the weighting. Wiśniowski et al. (2020, Journal of Survey Statistics and Methodology) is the standard treatment of this issue.
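A simulation sketch of the estimator. Here the true \pi(x) is known by construction, sidestepping the fitting step, so this only illustrates what correct weights buy; all numbers and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)          # outcome depends on x

# Selection depends on x (and hence on y through x). The true pi(x) is
# available here only because this is a simulation; in practice it is
# fitted against a probability reference sample sharing the same x.
pi = 1 / (1 + np.exp(-(x - 1)))
R = rng.random(N) < pi

naive = y[R].mean()                              # over-represents large-x units
ipw = np.sum(y[R] / pi[R]) / np.sum(1 / pi[R])   # Hajek-form IPW estimator

print(y.mean(), naive, ipw)
```

The naive mean lands well above the population mean; the weighted estimator recovers it, at the cost of extra variance from the large weights on rarely-included units.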

Mass imputation

If a probability reference sample carries Y, fit \hat{m}(x) = \hat{\mathbb{E}}[Y \mid X] on it. Then for every population unit (or every unit in the non-probability sample), predict \hat{Y}_i = \hat{m}(x_i) and average:

\hat{\bar{Y}}_{\text{MI}} = \frac{1}{N} \sum_{i \in \mathcal{P}} \hat{m}(x_i).

This is consistent when the outcome model is correctly specified; it fails when m is misspecified, in roughly the way ordinary regression-based extrapolation fails.
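A sketch under the stated assumption that the outcome model is correctly specified (linear here), with a simple-random reference sample carrying Y; the setup is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Probability reference sample (SRS) that carries Y.
ref = rng.choice(N, size=1_000, replace=False)

# Fit the outcome model m(x) = E[Y | X] on the reference sample via OLS.
X_ref = np.column_stack([np.ones(ref.size), x[ref]])
beta = np.linalg.lstsq(X_ref, y[ref], rcond=None)[0]

# Mass imputation: predict for every population unit and average.
y_mi = (np.column_stack([np.ones(N), x]) @ beta).mean()
print(y.mean(), y_mi)
```

With a misspecified m (say, omitting x entirely) the same machinery would simply reproduce the reference-sample mean, which is where the regression-extrapolation caveat bites.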

Doubly robust

Combine sampling-score weighting and mass imputation. The standard form, adapting AIPW from causal inference:

\hat{\bar{Y}}_{\text{DR}} = \frac{1}{N} \sum_{i \in \mathcal{P}} \hat{m}(x_i) + \frac{1}{N} \sum_{i : R_i = 1} \frac{y_i - \hat{m}(x_i)}{\hat{\pi}(x_i)}.

This is consistent when either \hat{\pi} is consistent for \pi or \hat{m} is consistent for m, under positivity and standard regularity. The bias is the product of the two estimation errors, so doubly robust estimation pairs naturally with double/debiased machine learning: the orthogonal-score formulation gives \sqrt{n}-rate inference under the same product-rate condition with cross-fitted nuisance estimators. Chen, Li, and Wu (2020) and the JRSSA / Journal of Official Statistics literature develop this in the survey-sampling setting.
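A simulation sketch of the double-robustness property: the outcome model below is deliberately misspecified (intercept only, fit on the biased sample), while the sampling score is correct, and the DR estimator still recovers the population mean. Setup and numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

pi = 1 / (1 + np.exp(-(x - 1)))       # true sampling score (known in simulation)
R = rng.random(N) < pi

# Deliberately misspecified outcome model: intercept only, fit on the
# biased sample, so it inherits the selection bias.
m_hat = np.full(N, y[R].mean())

mi = m_hat.mean()                                # mass imputation alone: biased
corr = np.sum((y[R] - m_hat[R]) / pi[R]) / N     # IPW correction of residuals
dr = mi + corr

print(y.mean(), mi, dr)
```

Swapping the roles (correct m, misspecified \hat{\pi}) gives the symmetric result; only when both are wrong does the product bias survive.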

Sensitivity analysis

Even after the four corrections above, residual selection on unmeasured variables may remain. A sensitivity analysis parameterizes this residual (e.g., Rosenbaum's \Gamma for hidden confounding, or a bound on the omitted-variable correlation \rho_{R,Y \mid X}) and reports estimates that degrade gracefully as the parameter grows. This gives an honest answer to "what if our adjustment leaves an unmeasured selection mechanism with strength \Gamma behind?", without claiming a guarantee.
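One simple form uses the data-defect identity itself as the sensitivity model: given an assumed bound rho_max on the post-adjustment defect correlation |\rho_{R,Y \mid X}|, report the worst-case residual bias around the adjusted estimate. The point estimate and bound below are hypothetical:

```python
import math

def worst_case_bias(rho_max, sigma_y, f):
    """Worst-case residual bias from the data-defect identity, given an
    assumed bound rho_max on the post-adjustment defect correlation."""
    return rho_max * sigma_y * math.sqrt((1 - f) / f)

est = 0.42                                   # hypothetical calibrated estimate
b = worst_case_bias(rho_max=0.01, sigma_y=1.0, f=1e-4)
print(est - b, est + b)                      # interval widens as rho_max grows
```

Reporting the interval as a function of rho_max, rather than a single number, is what "degrades gracefully" means in practice.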

Connection to Convex Tinkering

Cheap, biased data is high-leverage for exploration and dangerous as final population evidence. This is the survey-statistics counterpart to convex tinkering: convenience data is a useful cheap-option generator under bounded downside, not a final estimator under unbounded downside.

The convex move:

  1. Use non-probability data to screen hypotheses and surface candidate patterns at low cost.
  2. Reserve probability sampling, randomized experiments, or calibration against trusted population totals for the candidates that might actually drive a decision.
  3. Treat the transition from a screening insight to a committed claim as the moment downside becomes unbounded. Calibration, weighting, or DR adjustment is what re-bounds it.

The failure mode is using non-probability data for both steps, treating exploratory pattern-mining as evidence by virtue of n alone. Meng's data-defect identity is the warning label.

Common Confusions

Watch Out

Bigger n fixes selection bias

The data-defect identity makes this exactly wrong. For fixed \rho_{R,Y}, the absolute estimation error scales as \sqrt{(1-f)/f}. Only the sampling fraction f = n/N, not the raw count n, controls the error at a given bias correlation. A non-probability sample of n = 10^7 from a population of N = 10^9 has f = 0.01, no better than a smaller sample at the same \rho. Confidence intervals built from \sqrt{n}-style asymptotics will shrink, but they shrink around the biased target, not the population parameter.
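The scaling is easy to tabulate. At this Watch Out's bias correlation \rho = 0.05 and \sigma_Y = 1, the absolute error depends only on f:

```python
import math

rho, sigma = 0.05, 1.0
for f in (0.1, 0.01, 0.001, 0.0001):
    err = rho * sigma * math.sqrt((1 - f) / f)
    print(f"f = {f:<7} error = {err:.3f}")
```

Only shrinking the bias correlation or raising the sampling fraction moves the error; adding raw observations at a fixed f does nothing.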

Watch Out

Probability sample is automatically representative

A probability sample with a declining response rate becomes a hybrid: the design probabilities \pi_i are still known, but actual response \delta_i is conditioned on covariates and outcomes. The effective analysis is non-probability with a known design weight; the data-defect identity applies to the response process. This is why even classical agencies that retain probability designs increasingly publish calibration-adjusted weights.

Watch Out

Calibration removes all selection bias

Calibration removes selection that operates through observed auxiliary variables X. It does not touch selection that operates through unmeasured variables, residual variation in Y at fixed X, or non-linearities the calibration model misses. The X-conditional residual \rho_{R,Y \mid X} is what sensitivity analysis bounds.

Watch Out

Doubly robust means twice as accurate

"Doubly robust" means consistent under either of two models, not "more accurate than singly robust." If both models are misspecified, DR can be worse than a well-specified single model. The product-bias shape is the right intuition: bias is small when at least one model is near the truth, and the rate of convergence to the truth is the product of the two estimation rates (slow \times slow can still be fast enough; see DML's n^{-1/4} product-rate condition).

Watch Out

Anonymous, opt-in, or scraped data is exempt from this

The data-defect identity does not care how the data was obtained. It applies to any sample with a non-trivial inclusion mechanism. Web scrapes, app telemetry, leaked records, opt-in panels, and clinical convenience cohorts all sit in the same regime. The only difference is how much auxiliary information is available to attempt a correction; none of them get the design-based guarantees of probability sampling.

Exercises

Exercise · Core

Problem

A population of N = 10^9 has standard deviation \sigma_Y = 1. A non-probability sample of n = 10^6 (sampling fraction f = 10^{-3}) has data-defect correlation \rho_{R,Y} = 0.05. What is the absolute error of the sample mean as an estimator of the population mean?

Exercise · Core

Problem

For the same population and bias correlation as above, how large would the sampling fraction f need to be for the bias to drop to 0.05 standard deviations of Y?

Exercise · Advanced

Problem

A web-panel sample is calibrated against the marginal distribution of age, gender, and education. After calibration, the residual data-defect correlation \rho_{R,Y \mid X} is bounded above by 0.01 under a sensitivity analysis. With \sigma_Y = 1 and f = 10^{-4}, give the worst-case absolute estimation error and explain how it differs from the unconditional version.

References

Canonical:

  • Meng, X.-L. "Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election." Annals of Applied Statistics 12(2) (2018): 685-726. The data-defect identity and the "big data paradox" framing.
  • Cochran, W. G. Sampling Techniques (3rd ed., Wiley 1977). The classical text for design-based survey inference.
  • Lohr, S. L. Sampling: Design and Analysis (3rd ed., CRC 2021). The standard modern textbook covering both probability and non-probability designs.
  • Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley 2010). Practical implementation, including calibration and weighting.

Integration of probability and non-probability samples:

  • Wiśniowski, A., Sakshaug, J. W., Perez Ruiz, D. A., Blom, A. G. "Integrating probability and nonprobability samples for survey inference." Journal of Survey Statistics and Methodology 8(1) (2020): 120-147. Standard reference for the limits of inclusion-probability estimation when probabilities may be zero.
  • Chen, Y., Li, P., Wu, C. "Doubly robust inference with non-probability survey samples." Journal of the American Statistical Association 115(532) (2020): 2011-2021. The doubly robust construction.
  • Elliott, M. R., Valliant, R. "Inference for nonprobability samples." Statistical Science 32(2) (2017): 249-264. Survey-style review.
  • Kim, J. K., Park, S., Chen, Y., Wu, C. "Combining non-probability and probability survey samples through mass imputation." Journal of the Royal Statistical Society Series A 184(3) (2021): 941-963. Mass imputation framework.

Calibration and design weighting:

  • Deville, J.-C., Särndal, C.-E. "Calibration estimators in survey sampling." Journal of the American Statistical Association 87(418) (1992): 376-382. The original GREG / calibration paper.
  • Särndal, C.-E., Lundström, S. Estimation in Surveys with Nonresponse (Wiley 2005). Calibration under nonresponse, a partially-non-probability regime.

Sensitivity analysis:

  • Rosenbaum, P. R. Observational Studies (2nd ed., Springer 2002). The Γ\Gamma sensitivity model.
  • Cinelli, C., Hazlett, C. "Making sense of sensitivity: Extending omitted variable bias." Journal of the Royal Statistical Society Series B 82(1) (2020): 39-67. Modern extensions to non-experimental data.

Educational and policy resources:

  • AAPOR Task Force on Non-Probability Sampling. "Report of the AAPOR Task Force on Non-Probability Sampling." Journal of Survey Statistics and Methodology 1(2) (2013): 90-143. Practitioner-oriented guidance.
  • OECD, "Quality Framework for OECD Statistics" (2011). Quality criteria that apply across data sources, including non-probability inputs.

Cross-Network Links

This page sits at the intersection of survey methodology, causal inference, and modern ML evaluation. Natural neighbours:

  • Survey sampling methods is the parent page for the probability-design baseline.
  • GREG estimator is the technical companion for calibration in the model-assisted form.
  • Double/debiased machine learning shares the orthogonal-score machinery and the product-rate condition used by doubly robust non-probability estimators.
  • Weighted conformal prediction generalizes the same idea (reweight by likelihood ratio under covariate shift) for prediction-set construction rather than mean estimation.
  • Total variation distance bounds worst-case shift in expectation under bounded-loss families; one way to read the data-defect identity is as a TV-flavoured statement with the sampling fraction explicit.
  • Convex tinkering is the methodological framing for using non-probability data as cheap exploration without treating it as final population evidence.
  • Hypothesis testing for ML inherits the same selection issues when the test set is itself a convenience sample.


Last reviewed: April 28, 2026
