
Methodology

Types of Bias in Statistics

A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.


Why This Matters

Bias is systematic error. Unlike variance, which you can reduce by collecting more data, bias persists no matter how large your sample. A biased estimate converges to the wrong value. No amount of data fixes this.

For ML practitioners, bias in training data becomes bias in model predictions. A language model trained on internet text inherits the selection bias of who writes on the internet. A medical model trained on hospital records inherits the selection bias of who visits hospitals. Understanding the taxonomy of biases is the first step toward diagnosing and mitigating them.

Mental Model

Think of bias as the difference between what your data represents and what you want it to represent. Your data is a sample from some actual distribution $P_{\text{data}}$. You want to make inferences about a target distribution $P_{\text{target}}$. Bias arises whenever $P_{\text{data}} \neq P_{\text{target}}$, and the discrepancy is systematic rather than random.

Taxonomy of Biases

Selection Bias

Definition

Selection Bias

Selection bias occurs when the mechanism that determines which units enter your sample is correlated with the outcome of interest. Formally, if $S$ is the selection indicator and $Y$ is the outcome, selection bias exists when $\mathbb{E}[Y \mid S = 1] \neq \mathbb{E}[Y]$.

Examples: hospital-based studies overrepresent severe cases. Surveys with voluntary participation overrepresent people with strong opinions. ML training sets overrepresent data that is easy to collect or label.
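A small simulation makes the mechanism concrete. This is an illustrative sketch (the logistic selection rule is an assumption chosen for the example, not a canonical model): units with higher outcomes are more likely to enter the sample, so the sample mean drifts above the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population outcome: Y ~ Normal(0, 1), so the population mean is 0.
y = rng.normal(0.0, 1.0, size=100_000)

# Selection mechanism correlated with the outcome: higher Y means a
# higher chance of entering the sample (e.g. severe cases are more
# likely to reach the hospital).
p_select = 1.0 / (1.0 + np.exp(-y))        # logistic in y
s = rng.random(y.size) < p_select

print(f"population mean: {y.mean():+.3f}")    # close to 0
print(f"selected mean:   {y[s].mean():+.3f}")  # systematically above 0
```

Collecting more units under the same selection rule would tighten the selected mean around the same wrong value, not pull it toward 0.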

Survivorship Bias

Definition

Survivorship Bias

Survivorship bias is a special case of selection bias where the sample consists only of units that "survived" some selection process, and the failures are invisible. Formally, you observe $Y \mid Y > c$ for some threshold $c$ but want to estimate properties of the unconditional distribution of $Y$.

Classic example: studying only successful mutual funds to evaluate investment strategies. The funds that lost money were closed and are absent from the data. The observed average return is biased upward.

In ML: evaluating model architectures only on papers that got accepted (the ones that worked). The architectures that failed are in unpublished experiments.
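The mutual-fund example can be simulated directly. The return distribution and liquidation threshold below are invented for illustration; the point is only that conditioning on $Y > c$ shifts the observed mean upward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Annual returns of every fund ever started: mean 5%, sd 10%.
returns = rng.normal(0.05, 0.10, size=50_000)

# Funds that fall below a survival threshold are liquidated and
# vanish from the database: we observe Y | Y > c, not Y.
c = -0.02
survivors = returns[returns > c]

print(f"true mean return (all funds): {returns.mean():.3f}")
print(f"observed mean (survivors):    {survivors.mean():.3f}")  # biased up
```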

Measurement Bias

Definition

Measurement Bias

Measurement bias (also called information bias) occurs when the measurement process systematically distorts the true value. If the true value is $Y$ and the measured value is $Y^* = Y + \epsilon$ where $\mathbb{E}[\epsilon] \neq 0$, the measurement is biased.

Examples: self-reported weight is systematically lower than actual weight. Sentiment labels from crowd workers are systematically biased by the labeling interface. A miscalibrated sensor adds a constant offset.
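The self-reported-weight example separates the two components of measurement error. The specific numbers (a 2 kg offset, 3 kg noise) are assumptions for the sketch: the random noise averages out with sample size, while the systematic offset survives in full.

```python
import numpy as np

rng = np.random.default_rng(2)

true_weight = rng.normal(75.0, 12.0, size=20_000)  # kg

# Self-reports: random noise plus a systematic -2 kg offset.
# The noise averages out as n grows; the offset never does.
noise = rng.normal(0.0, 3.0, size=true_weight.size)
reported = true_weight - 2.0 + noise

gap = true_weight.mean() - reported.mean()
print(f"mean true weight:     {true_weight.mean():.2f} kg")
print(f"mean reported weight: {reported.mean():.2f} kg")
print(f"gap:                  {gap:.2f} kg")  # close to the 2 kg offset
```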

Response Bias

Definition

Response Bias

Response bias occurs when subjects systematically give inaccurate answers. This includes social desirability bias (reporting what is socially acceptable rather than the truth), acquiescence bias (tendency to agree with statements), and recall bias (inaccurate memory of past events).

In ML: annotator bias in labeling tasks. Annotators may systematically disagree with each other or with the ground truth due to cultural background, fatigue, or ambiguous instructions.

Attrition Bias

Definition

Attrition Bias

Attrition bias occurs in longitudinal studies when participants drop out non-randomly. If sicker patients are more likely to leave a clinical trial, the remaining sample is healthier than the original, biasing treatment effect estimates.

In ML: users who dislike a recommendation system stop using it. The remaining users give positive feedback, creating a feedback loop that makes the system appear better than it is.

Confirmation Bias

Definition

Confirmation Bias

Confirmation bias is the tendency to search for, interpret, and remember information that confirms pre-existing beliefs. In research: choosing analyses that support your hypothesis and ignoring those that do not. In ML: tuning hyperparameters on the test set until the result looks good, then reporting it as the "final" result.

This is not a property of the data but of the analyst. It interacts with publication bias (below) to create systematic distortions in the literature.

Publication Bias

Definition

Publication Bias

Publication bias is the tendency for studies with statistically significant or positive results to be published more often than those with null or negative results. If only results with $p < 0.05$ are published, the published literature overestimates effect sizes.

This is sometimes called the "file drawer problem." For every published result, there may be multiple unpublished null results sitting in researchers' file drawers.

In ML: papers reporting SOTA results are published. Papers reporting that a method did not improve over the baseline are not. This inflates the apparent rate of progress.
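The inflation can be simulated with a toy model of the file drawer. The true effect size and per-study standard error below are assumed values; every study estimates the same effect, but only the "significant" estimates survive to publication, so the published average overshoots the truth.

```python
import numpy as np

rng = np.random.default_rng(3)

true_effect = 0.2
n_per_study = 50
se = 1.0 / np.sqrt(n_per_study)  # standard error of each estimate

# 10,000 independent studies, each estimating the same true effect.
estimates = rng.normal(true_effect, se, size=10_000)

# The "file drawer": only significant results (|z| > 1.96) are published.
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:           {true_effect:.3f}")
print(f"mean published effect: {published.mean():.3f}")  # inflated
```

The studies in the file drawer are exactly the ones that would have pulled the literature's average back toward the truth.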

Comparison Table of Bias Types

| Bias Type | Mechanism | Direction | Fixable by more data? | Fixable by design? | ML Example |
|---|---|---|---|---|---|
| Selection bias | Non-random inclusion in sample | Depends on selection mechanism | No | Yes: random sampling, stratification | Training on English-only web text underrepresents non-English speakers |
| Survivorship bias | Only "winners" observed | Upward (overestimates success) | No | Yes: track failures explicitly | Evaluating architectures only from published (successful) papers |
| Measurement bias | Systematic measurement error | Constant offset | No | Yes: calibrate instruments, validate labels | Crowd-sourced labels with ambiguous annotation guidelines |
| Response bias | Subjects give inaccurate answers | Toward socially desirable answers | No | Partially: anonymous surveys, randomized response | User satisfaction surveys overreport satisfaction |
| Attrition bias | Non-random dropout from study | Depends on who drops out | No | Partially: intention-to-treat analysis | Users who dislike recommendations stop using the app |
| Confirmation bias | Analyst seeks confirming evidence | Toward analyst's hypothesis | No | Yes: pre-registration, blinded analysis | Tuning hyperparameters on test set until results look good |
| Publication bias | Positive results published more | Overestimates effect sizes | No | Yes: pre-registration, publish null results | SOTA papers published; failed methods go unreported |

The critical distinction: sampling error shrinks with more data. None of these biases do. A biased dataset with $n = 10^6$ observations gives you a precise estimate of the wrong quantity.

Sampling Error vs. Non-Sampling Error

Definition

Sampling Error vs. Non-Sampling Error

Sampling error is the random variation due to observing a sample rather than the full population. It decreases as $n$ increases. It is quantified by the standard error.

Non-sampling error includes all other sources of error: selection bias, measurement bias, response bias, processing errors, coverage errors (some units are not on the sampling frame). Non-sampling error does not decrease with $n$. A larger biased sample gives you a more precise wrong answer.
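The contrast is easy to see numerically. In this sketch a fixed 0.5 offset (an assumed stand-in for any non-sampling error) is baked into the data-generating process: as $n$ grows, the standard error collapses while the estimate stays anchored at the wrong value.

```python
import numpy as np

rng = np.random.default_rng(5)

offset = 0.5  # systematic error baked into the collection process
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(0.0 + offset, 1.0, size=n)  # true mean is 0
    se = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:>9,}  estimate={sample.mean():+.4f}  SE={se:.4f}")
# The standard error shrinks like 1/sqrt(n); the 0.5 bias never moves.
```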

Main Theorems

Theorem

Selection Bias Decomposition

Statement

Let $Y$ be an outcome variable and $S \in \{0, 1\}$ a selection indicator. The bias of the selected-sample mean as an estimator of the population mean is:

$$\mathbb{E}[Y \mid S=1] - \mathbb{E}[Y] = \frac{\operatorname{Cov}(Y, S)}{\Pr(S=1)}$$

When $\operatorname{Cov}(Y, S) > 0$, the selected sample overestimates the population mean. When $\operatorname{Cov}(Y, S) < 0$, it underestimates.

Intuition

If the probability of being selected is positively correlated with $Y$, then units with high $Y$ values are overrepresented in the sample. The magnitude of the bias depends on how strong the correlation is and how selective the process is (lower $\Pr(S=1)$ amplifies the bias).

Proof Sketch

Write $\mathbb{E}[YS] = \mathbb{E}[Y \mid S=1]\Pr(S=1)$ and $\operatorname{Cov}(Y,S) = \mathbb{E}[YS] - \mathbb{E}[Y]\mathbb{E}[S]$. Substitute $\mathbb{E}[S] = \Pr(S=1)$ and solve for $\mathbb{E}[Y \mid S=1] - \mathbb{E}[Y]$.

Why It Matters

This decomposition makes selection bias quantifiable. Instead of vaguely saying "the sample is biased," you can ask: how large is $\operatorname{Cov}(Y, S)$ and what is $\Pr(S=1)$? It also shows that highly selective processes (small $\Pr(S=1)$) produce larger bias for the same covariance.
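The identity can be checked numerically. In this sketch (logistic selection on a standard normal outcome, chosen for illustration), the directly computed gap between selected and population means matches the covariance formula:

```python
import numpy as np

rng = np.random.default_rng(4)

y = rng.normal(0.0, 1.0, size=1_000_000)
# Selection probability increasing in y, so Cov(Y, S) > 0.
s = (rng.random(y.size) < 1.0 / (1.0 + np.exp(-y))).astype(float)

bias_direct = y[s == 1].mean() - y.mean()
bias_formula = np.cov(y, s)[0, 1] / s.mean()

print(f"E[Y|S=1] - E[Y]:  {bias_direct:.4f}")
print(f"Cov(Y,S)/Pr(S=1): {bias_formula:.4f}")  # same number
```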

Failure Mode

The formula assumes you can define and measure $S$. In many practical situations, the selection mechanism is unknown or unobservable. You may not know which units are missing from your sample. The formula also assumes a single selection step; real data often undergoes multiple layers of selection.

Common Confusions

Watch Out

Bias is not the same as unfairness

In statistics, bias means systematic error: $\mathbb{E}[\hat{\theta}] \neq \theta$. In the ML fairness literature, "bias" often means discriminatory outcomes. These are related but distinct concepts. An estimator can be statistically unbiased (converges to the right average) while producing unfair outcomes for subgroups. Conversely, a biased estimator might be preferable if the bias reduces variance (as in ridge regression).

Watch Out

Large samples do not fix bias

A common mistake is thinking that more data eliminates all problems. More data reduces variance (sampling error) but not bias (non-sampling error). If your data collection process systematically excludes a subpopulation, doubling the sample size gives you more data from the same biased source. The bias remains; only the standard error shrinks, making your confidence interval narrower around the wrong value.

Watch Out

Random sampling eliminates selection bias, not other biases

Probability sampling ensures $\mathbb{E}[Y \mid S=1] = \mathbb{E}[Y]$ (no selection bias). But measurement bias, response bias, and processing errors can still corrupt a perfectly designed random sample. Random sampling is necessary but not sufficient for unbiased inference.

Consequences for ML

Training data biases become model biases through a direct mechanism. If the training distribution $P_{\text{train}}$ differs from the deployment distribution $P_{\text{deploy}}$ due to selection bias, the model optimizes for the wrong objective. Specifically, the model minimizes $\mathbb{E}_{P_{\text{train}}}[\ell(h(x), y)]$ when you want it to minimize $\mathbb{E}_{P_{\text{deploy}}}[\ell(h(x), y)]$.

Domain adaptation and distribution shift methods attempt to correct for this, but they require assumptions about the relationship between $P_{\text{train}}$ and $P_{\text{deploy}}$ that are often untestable.
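One standard correction is importance weighting, which reweights training examples by the density ratio between the two distributions. The sketch below is a minimal illustration under a strong assumption that rarely holds in practice: both distributions are known Gaussians, so the ratio is available in closed form. In real deployments the ratio must be estimated, which is exactly where the untestable assumptions enter.

```python
import numpy as np

rng = np.random.default_rng(6)

# Deployment distribution: x ~ N(0, 1). Training data is selection-
# biased toward large x: x ~ N(1, 1). We want E_deploy[Y] for Y = 2x,
# which is 0, but the naive training average estimates 2.
x_train = rng.normal(1.0, 1.0, size=200_000)
y_train = 2.0 * x_train

naive = y_train.mean()

# Importance weights w(x) = p_deploy(x) / p_train(x). For these two
# unit-variance Gaussians the log-ratio has a closed form.
log_w = -0.5 * x_train**2 + 0.5 * (x_train - 1.0) ** 2
w = np.exp(log_w)

# Self-normalized importance-weighted estimate of the deployment mean.
weighted = np.sum(w * y_train) / np.sum(w)

print(f"naive (biased) estimate: {naive:+.3f}")
print(f"importance-weighted:     {weighted:+.3f}")
```

The price of the correction is variance: when the two distributions overlap poorly, a few enormous weights dominate the sum and the weighted estimate becomes unstable.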

Summary

  • Bias is systematic error that does not decrease with sample size
  • Selection bias: who enters your sample is correlated with the outcome
  • Survivorship bias: you only see the winners, not the losers
  • Measurement bias: the measurement process distorts the true value
  • Publication bias: the literature overrepresents positive results
  • Sampling error decreases with $n$; non-sampling error does not
  • In ML, training data bias becomes model prediction bias

Exercises

Exercise (Core)

Problem

A company surveys its customers by emailing a satisfaction questionnaire. 20% respond. The average satisfaction score among respondents is 8.2/10. Is this an unbiased estimate of satisfaction among all customers? Identify all biases that may be present.

Exercise (Advanced)

Problem

You are analyzing the returns of hedge funds using a database that only includes funds currently in operation. Funds that lost too much money were liquidated and removed from the database. If the true average annual return across all funds (including dead ones) is 5%, and 30% of funds have been liquidated with an average return of -8% before liquidation, what is the survivorship-biased average that you observe?

References

Canonical:

  • Cochran, Sampling Techniques (1977), Chapters 1, 13
  • Rothman, Greenland, Lash, Modern Epidemiology (2008), Chapters 9, 12

Current:

  • Hernán & Robins, Causal Inference: What If (2020), Chapters 8-9

  • Suresh & Guttag, "A Framework for Understanding Sources of Harm throughout the ML Life Cycle" (2021)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

Last reviewed: April 2026
