
Methodology

Types of Bias in Statistics

A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.


Why This Matters

Bias is systematic error. Unlike variance, which you can reduce by collecting more data, bias persists no matter how large your sample. A biased estimate converges to the wrong value. No amount of data fixes this.

For ML practitioners, bias in training data becomes bias in model predictions. A language model trained on internet text inherits the selection bias of who writes on the internet. A medical model trained on hospital records inherits the selection bias of who visits hospitals. Understanding the taxonomy of biases is the first step toward diagnosing and mitigating them.

Mental Model

Think of bias as the difference between what your data represents and what you want it to represent. Your data is a sample from some actual distribution $P_{\text{data}}$. You want to make inferences about a target distribution $P_{\text{target}}$. Bias arises whenever $P_{\text{data}} \neq P_{\text{target}}$, and the discrepancy is systematic rather than random.

Taxonomy of Biases

Selection Bias

Definition

Selection Bias

Selection bias occurs when the mechanism that determines which units enter your sample is correlated with the outcome of interest. Formally, if $S$ is the selection indicator and $Y$ is the outcome, selection bias exists when $\mathbb{E}[Y \mid S = 1] \neq \mathbb{E}[Y]$.

Examples: hospital-based studies overrepresent severe cases. Surveys with voluntary participation overrepresent people with strong opinions. ML training sets overrepresent data that is easy to collect or label.
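A small simulation makes the mechanism concrete. This is an illustrative sketch (the logistic selection rule is an assumption chosen for the example, not a canonical model): units with higher outcomes are more likely to enter the sample, so the sample mean drifts above the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population outcome: Y ~ Normal(0, 1), so the population mean is 0.
y = rng.normal(0.0, 1.0, size=100_000)

# Selection mechanism correlated with the outcome: higher Y means a
# higher chance of entering the sample (e.g. severe cases are more
# likely to reach the hospital).
p_select = 1.0 / (1.0 + np.exp(-y))        # logistic in y
s = rng.random(y.size) < p_select

print(f"population mean: {y.mean():+.3f}")    # close to 0
print(f"selected mean:   {y[s].mean():+.3f}")  # systematically above 0
```

Collecting more units under the same selection rule would tighten the selected mean around the same wrong value, not pull it toward 0.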

Survivorship Bias

Definition

Survivorship Bias

Survivorship bias is a special case of selection bias where the sample consists only of units that "survived" some selection process, and the failures are invisible. Formally, you observe $Y \mid Y > c$ for some threshold $c$ but want to estimate properties of the unconditional distribution of $Y$.

Classic example: studying only successful mutual funds to evaluate investment strategies. The funds that lost money were closed and are absent from the data. The observed average return is biased upward.

In ML: evaluating model architectures only on papers that got accepted (the ones that worked). The architectures that failed are in unpublished experiments.
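The mutual-fund example can be simulated directly. The return distribution and liquidation threshold below are invented for illustration; the point is only that conditioning on $Y > c$ shifts the observed mean upward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Annual returns of every fund ever started: mean 5%, sd 10%.
returns = rng.normal(0.05, 0.10, size=50_000)

# Funds that fall below a survival threshold are liquidated and
# vanish from the database: we observe Y | Y > c, not Y.
c = -0.02
survivors = returns[returns > c]

print(f"true mean return (all funds): {returns.mean():.3f}")
print(f"observed mean (survivors):    {survivors.mean():.3f}")  # biased up
```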

Measurement Bias

Definition

Measurement Bias

Measurement bias (also called information bias) occurs when the measurement process systematically distorts the true value. If the true value is $Y$ and the measured value is $Y^* = Y + \epsilon$ where $\mathbb{E}[\epsilon] \neq 0$, the measurement is biased.

Examples: self-reported weight is systematically lower than actual weight. Sentiment labels from crowd workers are systematically biased by the labeling interface. A miscalibrated sensor adds a constant offset.
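The self-reported-weight example separates the two components of measurement error. The specific numbers (a 2 kg offset, 3 kg noise) are assumptions for the sketch: the random noise averages out with sample size, while the systematic offset survives in full.

```python
import numpy as np

rng = np.random.default_rng(2)

true_weight = rng.normal(75.0, 12.0, size=20_000)  # kg

# Self-reports: random noise plus a systematic -2 kg offset.
# The noise averages out as n grows; the offset never does.
noise = rng.normal(0.0, 3.0, size=true_weight.size)
reported = true_weight - 2.0 + noise

gap = true_weight.mean() - reported.mean()
print(f"mean true weight:     {true_weight.mean():.2f} kg")
print(f"mean reported weight: {reported.mean():.2f} kg")
print(f"gap:                  {gap:.2f} kg")  # close to the 2 kg offset
```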

Response Bias

Definition

Response Bias

Response bias occurs when subjects systematically give inaccurate answers. This includes social desirability bias (reporting what is socially acceptable rather than the truth), acquiescence bias (tendency to agree with statements), and recall bias (inaccurate memory of past events).

In ML: annotator bias in labeling tasks. Annotators may systematically disagree with each other or with the ground truth due to cultural background, fatigue, or ambiguous instructions.

Attrition Bias

Definition

Attrition Bias

Attrition bias occurs in longitudinal studies when participants drop out non-randomly. If sicker patients are more likely to leave a clinical trial, the remaining sample is healthier than the original, biasing treatment effect estimates.

In ML: users who dislike a recommendation system stop using it. The remaining users give positive feedback, creating a feedback loop that makes the system appear better than it is.

Confirmation Bias

Definition

Confirmation Bias

Confirmation bias is the tendency to search for, interpret, and remember information that confirms pre-existing beliefs. In research: choosing analyses that support your hypothesis and ignoring those that do not. In ML: tuning hyperparameters on the test set until the result looks good, then reporting it as the "final" result.

This is not a property of the data but of the analyst. It interacts with publication bias (below) to create systematic distortions in the literature.

Publication Bias

Definition

Publication Bias

Publication bias is the tendency for studies with statistically significant or positive results to be published more often than those with null or negative results. If only results with $p < 0.05$ are published, the published literature overestimates effect sizes.

This is sometimes called the "file drawer problem." For every published result, there may be multiple unpublished null results sitting in researchers' file drawers.

In ML: papers reporting SOTA results are published. Papers reporting that a method did not improve over the baseline are not. This inflates the apparent rate of progress.
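The inflation can be simulated with a toy model of the file drawer. The true effect size and per-study standard error below are assumed values; every study estimates the same effect, but only the "significant" estimates survive to publication, so the published average overshoots the truth.

```python
import numpy as np

rng = np.random.default_rng(3)

true_effect = 0.2
n_per_study = 50
se = 1.0 / np.sqrt(n_per_study)  # standard error of each estimate

# 10,000 independent studies, each estimating the same true effect.
estimates = rng.normal(true_effect, se, size=10_000)

# The "file drawer": only significant results (|z| > 1.96) are published.
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:           {true_effect:.3f}")
print(f"mean published effect: {published.mean():.3f}")  # inflated
```

The studies in the file drawer are exactly the ones that would have pulled the literature's average back toward the truth.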

Comparison Table of Bias Types

| Bias Type | Mechanism | Direction | Fixable by more data? | Fixable by design? | ML Example |
|---|---|---|---|---|---|
| Selection bias | Non-random inclusion in sample | Depends on selection mechanism | No | Yes: random sampling, stratification | Training on English-only web text underrepresents non-English speakers |
| Survivorship bias | Only "winners" observed | Upward (overestimates success) | No | Yes: track failures explicitly | Evaluating architectures only from published (successful) papers |
| Measurement bias | Systematic measurement error | Constant offset | No | Yes: calibrate instruments, validate labels | Crowd-sourced labels with ambiguous annotation guidelines |
| Response bias | Subjects give inaccurate answers | Toward socially desirable answers | No | Partially: anonymous surveys, randomized response | User satisfaction surveys overreport satisfaction |
| Attrition bias | Non-random dropout from study | Depends on who drops out | No | Partially: intention-to-treat analysis | Users who dislike recommendations stop using the app |
| Confirmation bias | Analyst seeks confirming evidence | Toward analyst's hypothesis | No | Yes: pre-registration, blinded analysis | Tuning hyperparameters on test set until results look good |
| Publication bias | Positive results published more | Overestimates effect sizes | No | Yes: pre-registration, publish null results | SOTA papers published; failed methods go unreported |

The critical distinction: sampling error shrinks with more data. None of these biases do. A biased dataset with $n = 10^6$ observations gives you a precise estimate of the wrong quantity.

Sampling Error vs. Non-Sampling Error

Definition

Sampling Error vs. Non-Sampling Error

Sampling error is the random variation due to observing a sample rather than the full population. It decreases as $n$ increases. It is quantified by the standard error.

Non-sampling error includes all other sources of error: selection bias, measurement bias, response bias, processing errors, coverage errors (some units are not on the sampling frame). Non-sampling error does not decrease with $n$. A larger biased sample gives you a more precise wrong answer.
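The contrast is easy to see numerically. In this sketch a fixed 0.5 offset (an assumed stand-in for any non-sampling error) is baked into the data-generating process: as $n$ grows, the standard error collapses while the estimate stays anchored at the wrong value.

```python
import numpy as np

rng = np.random.default_rng(5)

offset = 0.5  # systematic error baked into the collection process
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(0.0 + offset, 1.0, size=n)  # true mean is 0
    se = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:>9,}  estimate={sample.mean():+.4f}  SE={se:.4f}")
# The standard error shrinks like 1/sqrt(n); the 0.5 bias never moves.
```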

Main Theorems

Theorem

Selection Bias Decomposition

Statement

Let $Y$ be an outcome variable and $S \in \{0, 1\}$ a selection indicator. The bias of the selected-sample mean as an estimator of the population mean is:

$$\mathbb{E}[Y \mid S=1] - \mathbb{E}[Y] = \frac{\operatorname{Cov}(Y, S)}{\Pr(S=1)}$$

When $\operatorname{Cov}(Y, S) > 0$, the selected sample overestimates the population mean. When $\operatorname{Cov}(Y, S) < 0$, it underestimates.

Intuition

If the probability of being selected is positively correlated with $Y$, then units with high $Y$ values are overrepresented in the sample. The magnitude of the bias depends on how strong the correlation is and how selective the process is (lower $\Pr(S=1)$ amplifies the bias).

Proof Sketch

Write $\mathbb{E}[YS] = \mathbb{E}[Y \mid S=1]\Pr(S=1)$ and $\operatorname{Cov}(Y,S) = \mathbb{E}[YS] - \mathbb{E}[Y]\mathbb{E}[S]$. Substitute $\mathbb{E}[S] = \Pr(S=1)$ and solve for $\mathbb{E}[Y \mid S=1] - \mathbb{E}[Y]$.

Why It Matters

This decomposition makes selection bias quantifiable. Instead of vaguely saying "the sample is biased," you can ask: how large is $\operatorname{Cov}(Y, S)$ and what is $\Pr(S=1)$? It also shows that highly selective processes (small $\Pr(S=1)$) produce larger bias for the same covariance.
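The identity can be checked numerically. In this sketch (logistic selection on a standard normal outcome, chosen for illustration), the directly computed gap between selected and population means matches the covariance formula:

```python
import numpy as np

rng = np.random.default_rng(4)

y = rng.normal(0.0, 1.0, size=1_000_000)
# Selection probability increasing in y, so Cov(Y, S) > 0.
s = (rng.random(y.size) < 1.0 / (1.0 + np.exp(-y))).astype(float)

bias_direct = y[s == 1].mean() - y.mean()
bias_formula = np.cov(y, s)[0, 1] / s.mean()

print(f"E[Y|S=1] - E[Y]:  {bias_direct:.4f}")
print(f"Cov(Y,S)/Pr(S=1): {bias_formula:.4f}")  # same number
```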

Failure Mode

The formula assumes you can define and measure $S$. In many practical situations, the selection mechanism is unknown or unobservable. You may not know which units are missing from your sample. The formula also assumes a single selection step; real data often undergoes multiple layers of selection.

Common Confusions

Watch Out

Bias is not the same as unfairness

In statistics, bias means systematic error: $\mathbb{E}[\hat{\theta}] \neq \theta$. In the ML fairness literature, "bias" often means discriminatory outcomes. These are related but distinct concepts. An estimator can be statistically unbiased (converges to the right average) while producing unfair outcomes for subgroups. Conversely, a biased estimator might be preferable if the bias reduces variance (as in ridge regression).

Watch Out

Large samples do not fix bias

A common mistake is thinking that more data eliminates all problems. More data reduces variance (sampling error) but not bias (non-sampling error). If your data collection process systematically excludes a subpopulation, doubling the sample size gives you more data from the same biased source. The bias remains; only the standard error shrinks, making your confidence interval narrower around the wrong value.

Watch Out

Random sampling eliminates selection bias, not other biases

Probability sampling ensures $\mathbb{E}[Y \mid S=1] = \mathbb{E}[Y]$ (no selection bias). But measurement bias, response bias, and processing errors can still corrupt a perfectly designed random sample. Random sampling is necessary but not sufficient for unbiased inference.

Consequences for ML

Training data biases become model biases through a direct mechanism. If the training distribution $P_{\text{train}}$ differs from the deployment distribution $P_{\text{deploy}}$ due to selection bias, the model optimizes for the wrong objective. Specifically, the model minimizes $\mathbb{E}_{P_{\text{train}}}[\ell(h(x), y)]$ when you want it to minimize $\mathbb{E}_{P_{\text{deploy}}}[\ell(h(x), y)]$.

Domain adaptation and distribution shift methods attempt to correct for this, but they require assumptions about the relationship between $P_{\text{train}}$ and $P_{\text{deploy}}$ that are often untestable.
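One standard correction is importance weighting, which reweights training examples by the density ratio between the two distributions. The sketch below is a minimal illustration under a strong assumption that rarely holds in practice: both distributions are known Gaussians, so the ratio is available in closed form. In real deployments the ratio must be estimated, which is exactly where the untestable assumptions enter.

```python
import numpy as np

rng = np.random.default_rng(6)

# Deployment distribution: x ~ N(0, 1). Training data is selection-
# biased toward large x: x ~ N(1, 1). We want E_deploy[Y] for Y = 2x,
# which is 0, but the naive training average estimates 2.
x_train = rng.normal(1.0, 1.0, size=200_000)
y_train = 2.0 * x_train

naive = y_train.mean()

# Importance weights w(x) = p_deploy(x) / p_train(x). For these two
# unit-variance Gaussians the log-ratio has a closed form.
log_w = -0.5 * x_train**2 + 0.5 * (x_train - 1.0) ** 2
w = np.exp(log_w)

# Self-normalized importance-weighted estimate of the deployment mean.
weighted = np.sum(w * y_train) / np.sum(w)

print(f"naive (biased) estimate: {naive:+.3f}")
print(f"importance-weighted:     {weighted:+.3f}")
```

The price of the correction is variance: when the two distributions overlap poorly, a few enormous weights dominate the sum and the weighted estimate becomes unstable.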

Summary

  • Bias is systematic error that does not decrease with sample size
  • Selection bias: who enters your sample is correlated with the outcome
  • Survivorship bias: you only see the winners, not the losers
  • Measurement bias: the measurement process distorts the true value
  • Publication bias: the literature overrepresents positive results
  • Sampling error decreases with $n$; non-sampling error does not
  • In ML, training data bias becomes model prediction bias

Exercises

Exercise (Core)

Problem

A company surveys its customers by emailing a satisfaction questionnaire. 20% respond. The average satisfaction score among respondents is 8.2/10. Is this an unbiased estimate of satisfaction among all customers? Identify all biases that may be present.

Exercise (Advanced)

Problem

You are analyzing the returns of hedge funds using a database that only includes funds currently in operation. Funds that lost too much money were liquidated and removed from the database. If the true average annual return across all funds (including dead ones) is 5%, and 30% of funds have been liquidated with an average return of -8% before liquidation, what is the survivorship-biased average that you observe?

References

Canonical:

  • Cochran, Sampling Techniques (1977), Chapters 1, 13
  • Rothman, Greenland, Lash, Modern Epidemiology (2008), Chapters 9, 12

Current:

  • Hernán & Robins, Causal Inference: What If (2020), Chapters 8-9

  • Suresh & Guttag, "A Framework for Understanding Sources of Harm throughout the ML Life Cycle" (2021)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

Last reviewed: April 2026
