Methodology
Types of Bias in Statistics
A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
Why This Matters
Bias is systematic error. Unlike variance, which you can reduce by collecting more data, bias persists no matter how large your sample. A biased estimate converges to the wrong value. No amount of data fixes this.
For ML practitioners, bias in training data becomes bias in model predictions. A language model trained on internet text inherits the selection bias of who writes on the internet. A medical model trained on hospital records inherits the selection bias of who visits hospitals. Understanding the taxonomy of biases is the first step toward diagnosing and mitigating them.
Mental Model
Think of bias as the difference between what your data represents and what you want it to represent. Your data is a sample from some actual distribution $p_{\text{actual}}$. You want to make inferences about a target distribution $p_{\text{target}}$. Bias arises whenever $p_{\text{actual}} \neq p_{\text{target}}$, and the discrepancy is systematic rather than random.
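A tiny simulation makes the mental model concrete (the 50/50 group structure, group means, and selection probabilities below are all invented for illustration): a selection mechanism that favors one group makes the sample mean converge, precisely, to the wrong value.

```python
import random

random.seed(0)

# Target population: a 50/50 mix of group A (mean 0) and group B (mean 2),
# so the target mean is 1.0. Selection: group B is four times as likely
# to enter the sample, so the sampled distribution != target distribution.
def draw_sample(n):
    sample = []
    while len(sample) < n:
        group_b = random.random() < 0.5
        value = random.gauss(2.0 if group_b else 0.0, 1.0)
        keep_prob = 0.8 if group_b else 0.2  # the selection mechanism
        if random.random() < keep_prob:
            sample.append(value)
    return sample

for n in (1_000, 100_000):
    s = draw_sample(n)
    # P(group B | selected) = 0.8, so the sample mean converges to 1.6, not 1.0
    print(n, round(sum(s) / len(s), 2))
```

Growing the sample only tightens the estimate around 1.6; it never moves it toward the target mean of 1.0.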
Taxonomy of Biases
Selection Bias
Selection bias occurs when the mechanism that determines which units enter your sample is correlated with the outcome of interest. Formally, if $S$ is the selection indicator and $Y$ is the outcome, selection bias exists when $E[Y \mid S = 1] \neq E[Y]$, equivalently $\mathrm{Cov}(Y, S) \neq 0$.
Examples: hospital-based studies overrepresent severe cases. Surveys with voluntary participation overrepresent people with strong opinions. ML training sets overrepresent data that is easy to collect or label.
Survivorship Bias
Survivorship bias is a special case of selection bias where the sample consists only of units that "survived" some selection process, and the failures are invisible. Formally, you observe $Y \mid Y > c$ for some threshold $c$ but want to estimate properties of the unconditional distribution of $Y$.
Classic example: studying only successful mutual funds to evaluate investment strategies. The funds that lost money were closed and are absent from the data. The observed average return is biased upward.
In ML: evaluating model architectures only on papers that got accepted (the ones that worked). The architectures that failed are in unpublished experiments.
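A sketch of the mutual-fund example, with an invented return distribution and liquidation threshold: conditioning on survival ($Y > c$) shifts the observed mean upward.

```python
import random

random.seed(1)

# Hypothetical fund universe: annual returns ~ N(5%, 10%). Funds that
# return less than -5% (the threshold c) are liquidated and disappear
# from the database; only survivors are observable.
returns = [random.gauss(0.05, 0.10) for _ in range(100_000)]
survivors = [r for r in returns if r > -0.05]

true_mean = sum(returns) / len(returns)
observed_mean = sum(survivors) / len(survivors)
print(f"true mean:     {true_mean:.3f}")      # close to 0.050
print(f"survivor mean: {observed_mean:.3f}")  # biased upward
```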
Measurement Bias
Measurement bias (also called information bias) occurs when the measurement process systematically distorts the true value. If the true value is $X$ and the measured value is $X^* = X + \epsilon$ where $E[\epsilon] \neq 0$, the measurement is biased.
Examples: self-reported weight is systematically lower than actual weight. Sentiment labels from crowd workers are systematically biased by the labeling interface. A miscalibrated sensor adds a constant offset.
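A sketch of the self-reported-weight example (the 2 kg under-report and the noise scale are assumptions): the symmetric noise averages away with more respondents, the systematic offset does not.

```python
import random

random.seed(2)

# Self-reported weight: true weight minus a systematic 2 kg under-report,
# plus symmetric noise. More respondents average away the noise, but the
# 2 kg offset (E[eps] != 0) never shrinks.
def reported(true_w):
    return true_w - 2.0 + random.gauss(0.0, 1.5)

true_weights = [random.gauss(75.0, 10.0) for _ in range(50_000)]
measured = [reported(w) for w in true_weights]

bias = sum(measured) / len(measured) - sum(true_weights) / len(true_weights)
print(round(bias, 1))  # close to -2.0 at any sample size
```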
Response Bias
Response bias occurs when subjects systematically give inaccurate answers. This includes social desirability bias (reporting what is socially acceptable rather than the truth), acquiescence bias (tendency to agree with statements), and recall bias (inaccurate memory of past events).
In ML: annotator bias in labeling tasks. Annotators may systematically disagree with each other or with the ground truth due to cultural background, fatigue, or ambiguous instructions.
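One classic design-level mitigation for sensitive questions is randomized response (the simple "forced yes" variant is sketched here, with an assumed 30% true prevalence): each individual answer is deniable, yet the population rate is recoverable.

```python
import random

random.seed(3)

# Randomized response: each respondent privately flips a fair coin.
# Heads -> answer truthfully; tails -> answer "yes" regardless.
# Then P(yes) = 0.5 * true_rate + 0.5, so true_rate = 2 * P(yes) - 1.
true_rate = 0.30  # assumed prevalence of the sensitive attribute

def respond(has_attribute):
    if random.random() < 0.5:
        return has_attribute  # truthful branch
    return True               # forced-"yes" branch

answers = [respond(random.random() < true_rate) for _ in range(200_000)]
p_yes = sum(answers) / len(answers)
estimate = 2.0 * p_yes - 1.0
print(round(estimate, 2))  # recovers roughly 0.30 without exposing anyone
```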
Attrition Bias
Attrition bias occurs in longitudinal studies when participants drop out non-randomly. If sicker patients are more likely to leave a clinical trial, the remaining sample is healthier than the original, biasing treatment effect estimates.
In ML: users who dislike a recommendation system stop using it. The remaining users give positive feedback, creating a feedback loop that makes the system appear better than it is.
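A sketch of the clinical-trial example, with invented health scores and a dropout rule under which sicker patients leave first: the completers' mean overstates the cohort's health.

```python
import random

random.seed(4)

# Clinical-trial sketch: each patient has a health score; the dropout
# probability rises as the score falls, so sicker patients leave first.
patients = [random.gauss(50.0, 10.0) for _ in range(100_000)]

def stays(score):
    p_drop = max(0.0, min(1.0, (70.0 - score) / 60.0))  # assumed dropout rule
    return random.random() > p_drop

completers = [s for s in patients if stays(s)]

enrolled_mean = sum(patients) / len(patients)
completer_mean = sum(completers) / len(completers)
print(round(enrolled_mean, 1))   # ~50.0: the cohort that enrolled
print(round(completer_mean, 1))  # noticeably higher: only the healthy remain
```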
Confirmation Bias
Confirmation bias is the tendency to search for, interpret, and remember information that confirms pre-existing beliefs. In research: choosing analyses that support your hypothesis and ignoring those that do not. In ML: tuning hyperparameters on the test set until the result looks good, then reporting it as the "final" result.
This is not a property of the data but of the analyst. It interacts with publication bias (below) to create systematic distortions in the literature.
Publication Bias
Publication bias is the tendency for studies with statistically significant or positive results to be published more often than those with null or negative results. If only results with $p < 0.05$ are published, the published literature overestimates effect sizes.
This is sometimes called the "file drawer problem." For every published result, there may be multiple unpublished null results sitting in researchers' file drawers.
In ML: papers reporting SOTA results are published. Papers reporting that a method did not improve over the baseline are not. This inflates the apparent rate of progress.
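A simulation of the file drawer problem (the true effect, standard error, and the $|z| > 1.96$ publication filter are all stylized assumptions): averaging only the "significant" studies overstates the effect.

```python
import random

random.seed(5)

# Many small studies of a true effect d = 0.2, each estimated with
# standard error 0.3. Journals only accept "significant" results
# (|estimate / se| > 1.96); the published average is inflated.
true_effect, se = 0.2, 0.3
estimates = [random.gauss(true_effect, se) for _ in range(50_000)]
published = [e for e in estimates if abs(e / se) > 1.96]

all_mean = sum(estimates) / len(estimates)
pub_mean = sum(published) / len(published)
print(round(all_mean, 2))  # ~0.20: the true effect
print(round(pub_mean, 2))  # several times larger
```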
Comparison Table of Bias Types
| Bias Type | Mechanism | Direction | Fixable by more data? | Fixable by design? | ML Example |
|---|---|---|---|---|---|
| Selection bias | Non-random inclusion in sample | Depends on selection mechanism | No | Yes: random sampling, stratification | Training on English-only web text underrepresents non-English speakers |
| Survivorship bias | Only "winners" observed | Upward (overestimates success) | No | Yes: track failures explicitly | Evaluating architectures only from published (successful) papers |
| Measurement bias | Systematic measurement error | Constant offset | No | Yes: calibrate instruments, validate labels | Crowd-sourced labels with ambiguous annotation guidelines |
| Response bias | Subjects give inaccurate answers | Toward socially desirable answers | No | Partially: anonymous surveys, randomized response | User satisfaction surveys overreport satisfaction |
| Attrition bias | Non-random dropout from study | Depends on who drops out | No | Partially: intention-to-treat analysis | Users who dislike recommendations stop using the app |
| Confirmation bias | Analyst seeks confirming evidence | Toward analyst's hypothesis | No | Yes: pre-registration, blinded analysis | Tuning hyperparameters on test set until results look good |
| Publication bias | Positive results published more | Overestimates effect sizes | No | Yes: pre-registration, publish null results | SOTA papers published; failed methods go unreported |
The critical distinction: sampling error shrinks with more data; none of these biases does. A biased dataset with arbitrarily large $n$ gives you a precise estimate of the wrong quantity.
Sampling Error vs. Non-Sampling Error
Sampling error is the random variation due to observing a sample rather than the full population. It decreases as $n$ increases. It is quantified by the standard error.
Non-sampling error includes all other sources of error: selection bias, measurement bias, response bias, processing errors, and coverage errors (some units are not on the sampling frame). Non-sampling error does not decrease with $n$. A larger biased sample gives you a more precise wrong answer.
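A sketch of the distinction, using an estimator with an assumed +0.5 systematic offset: replicating it at growing $n$ shrinks the spread of the estimates like $1/\sqrt{n}$, but leaves their center 0.5 away from the truth.

```python
import random

random.seed(6)

# An estimator with a systematic offset of +0.5 (the bias) on top of
# unit-variance noise (the sampling error). Repeat it at two sample
# sizes: the spread shrinks, the center does not move toward 0.
def biased_mean(n):
    return sum(random.gauss(0.5, 1.0) for _ in range(n)) / n

results = {}
for n in (100, 10_000):
    reps = [biased_mean(n) for _ in range(200)]
    center = sum(reps) / len(reps)
    spread = (sum((r - center) ** 2 for r in reps) / len(reps)) ** 0.5
    results[n] = (center, spread)
    print(n, round(center, 2), round(spread, 3))
```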
Main Theorems
Selection Bias Decomposition
Statement
Let $Y$ be an outcome variable and $S \in \{0, 1\}$ a selection indicator. The bias of the selected-sample mean as an estimator of the population mean is:

$$E[Y \mid S = 1] - E[Y] = \frac{\mathrm{Cov}(Y, S)}{P(S = 1)}$$

When $\mathrm{Cov}(Y, S) > 0$, the selected sample overestimates the population mean. When $\mathrm{Cov}(Y, S) < 0$, it underestimates.
Intuition
If the probability of being selected is positively correlated with $Y$, then units with high values are overrepresented in the sample. The magnitude of the bias depends on how strong the correlation is and how selective the process is (a lower $P(S = 1)$ amplifies the bias).
Proof Sketch
Write $E[YS] = \mathrm{Cov}(Y, S) + E[Y]\,E[S]$ and $E[Y \mid S = 1] = E[YS] / P(S = 1)$. Substitute the first identity into the second and solve for $E[Y \mid S = 1] - E[Y]$.
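A Monte Carlo check of the decomposition, with an assumed logistic selection mechanism in which $P(S = 1 \mid Y)$ increases in $Y$; note that the identity also holds exactly for the sample analogues, not just in expectation.

```python
import math
import random

random.seed(7)

# Verify E[Y | S=1] - E[Y] = Cov(Y, S) / P(S=1) on simulated data where
# the probability of selection rises with the outcome Y (logistic link).
n = 500_000
ys, ss = [], []
for _ in range(n):
    y = random.gauss(0.0, 1.0)
    s = 1 if random.random() < 1.0 / (1.0 + math.exp(-y)) else 0
    ys.append(y)
    ss.append(s)

mean_y = sum(ys) / n
p_s = sum(ss) / n
cov = sum((y - mean_y) * (s - p_s) for y, s in zip(ys, ss)) / n
sel_mean = sum(y for y, s in zip(ys, ss) if s == 1) / sum(ss)

print(round(sel_mean - mean_y, 4))  # bias of the selected-sample mean
print(round(cov / p_s, 4))          # matches the decomposition
```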
Why It Matters
This decomposition makes selection bias quantifiable. Instead of vaguely saying "the sample is biased," you can ask: how large is $\mathrm{Cov}(Y, S)$ and what is $P(S = 1)$? It also shows that highly selective processes (small $P(S = 1)$) produce larger bias for the same covariance.
Failure Mode
The formula assumes you can define and measure $S$. In many practical situations, the selection mechanism is unknown or unobservable. You may not know which units are missing from your sample. The decomposition also assumes a single selection step; real data often undergoes multiple layers of selection.
Common Confusions
Bias is not the same as unfairness
In statistics, bias means systematic error: $E[\hat{\theta}] \neq \theta$. In the ML fairness literature, "bias" often means discriminatory outcomes. These are related but distinct concepts. An estimator can be statistically unbiased (converges to the right average) while producing unfair outcomes for subgroups. Conversely, a biased estimator might be preferable if the bias reduces variance (as in ridge regression).
Large samples do not fix bias
A common mistake is thinking that more data eliminates all problems. More data reduces variance (sampling error) but not bias (non-sampling error). If your data collection process systematically excludes a subpopulation, doubling the sample size gives you more data from the same biased source. The bias remains; only the standard error shrinks, making your confidence interval narrower around the wrong value.
Random sampling eliminates selection bias, not other biases
Probability sampling ensures $E[Y \mid S = 1] = E[Y]$ (no selection bias). But measurement bias, response bias, and processing errors can still corrupt a perfectly designed random sample. Random sampling is necessary but not sufficient for unbiased inference.
Consequences for ML
Training data biases become model biases through a direct mechanism. If the training distribution differs from the deployment distribution due to selection bias, the model optimizes for the wrong objective. Specifically, the model minimizes $E_{p_{\text{train}}}[\ell(f(X), Y)]$ when you want it to minimize $E_{p_{\text{deploy}}}[\ell(f(X), Y)]$.
Domain adaptation and distribution shift methods attempt to correct for this, but they require assumptions about the relationship between $p_{\text{train}}$ and $p_{\text{deploy}}$ that are often untestable.
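Under the covariate-shift assumption (the conditional $Y \mid X$ is shared and only the distribution of $X$ moves), one standard correction is importance weighting with $w(x) = p_{\text{deploy}}(x) / p_{\text{train}}(x)$. A sketch in which both densities are known Gaussians, which sidesteps the hard part: in practice $w$ must be estimated, and that estimation rests on exactly the untestable assumptions mentioned above.

```python
import math
import random

random.seed(8)

def norm_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def loss(x):
    return (x - 1.0) ** 2  # toy loss that depends only on x

# p_train = N(0, 1), p_deploy = N(1, 1): the model sees shifted inputs.
train = [random.gauss(0.0, 1.0) for _ in range(200_000)]
deploy = [random.gauss(1.0, 1.0) for _ in range(200_000)]

plain = sum(loss(x) for x in train) / len(train)
weighted = sum(loss(x) * norm_pdf(x, 1.0) / norm_pdf(x, 0.0)
               for x in train) / len(train)
target = sum(loss(x) for x in deploy) / len(deploy)

print(round(plain, 2))     # naive training-set estimate (~2.0)
print(round(weighted, 2))  # importance-weighted estimate (~1.0)
print(round(target, 2))    # actual deployment loss (~1.0)
```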
Summary
- Bias is systematic error that does not decrease with sample size
- Selection bias: who enters your sample is correlated with the outcome
- Survivorship bias: you only see the winners, not the losers
- Measurement bias: the measurement process distorts the true value
- Publication bias: the literature overrepresents positive results
- Sampling error decreases with $n$; non-sampling error does not
- In ML, training data bias becomes model prediction bias
Exercises
Problem
A company surveys its customers by emailing a satisfaction questionnaire. 20% respond. The average satisfaction score among respondents is 8.2/10. Is this an unbiased estimate of satisfaction among all customers? Identify all biases that may be present.
Problem
You are analyzing the returns of hedge funds using a database that only includes funds currently in operation. Funds that lost too much money were liquidated and removed from the database. If the true average annual return across all funds (including dead ones) is 5%, and 30% of funds have been liquidated with an average return of -8% before liquidation, what is the survivorship-biased average that you observe?
References
Canonical:
- Cochran, Sampling Techniques (1977), Chapters 1, 13
- Rothman, Greenland, Lash, Modern Epidemiology (2008), Chapters 9, 12
Current:
- Hernan & Robins, Causal Inference: What If (2020), Chapters 8-9
- Suresh & Guttag, "A Framework for Understanding Sources of Harm throughout the ML Life Cycle" (2021)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Survey sampling methods: how probability sampling eliminates selection bias
- Nonresponse and missing data: techniques for handling non-random missingness
Last reviewed: April 2026