Base Rate Fallacy
Ignoring the prior probability (base rate) when interpreting test results. A 99% accurate test for a 1% prevalence disease gives only 50% positive predictive value.
Why This Matters
The base rate fallacy is the single most common error in interpreting classifier outputs. When a classifier says "positive," people assume this means "probably truly positive." But the positive predictive value depends on the base rate (prevalence) of the condition, not just the test accuracy. In ML, the same error occurs with imbalanced classes: a model can achieve 99% accuracy on a dataset where 99% of examples are negative simply by always predicting negative, while being useless as a detector.
Setup
A disease affects 1% of the population. A test for the disease has 99% sensitivity (true positive rate) and 99% specificity (true negative rate). You test positive. What is the probability you actually have the disease?
Most people answer "99%." The correct answer is about 50%.
Base Rate
The base rate (or prior probability, or prevalence) is the unconditional probability of the condition before any test is administered. In the example above, $P(D) = 0.01$.
Positive Predictive Value
The positive predictive value is $P(D \mid +)$: the probability of having the disease given a positive test result. This is what you actually want to know after testing positive.
Main Theorems
Positive Predictive Value via Bayes' Theorem
Statement
Given prevalence $p = P(D)$, sensitivity $s = P(+ \mid D)$, and specificity $t = P(- \mid \neg D)$, the positive predictive value is:

$$\mathrm{PPV} = P(D \mid +) = \frac{s\,p}{s\,p + (1-t)(1-p)}$$

For $p = 0.01$, $s = 0.99$, $t = 0.99$:

$$\mathrm{PPV} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 0.5$$
Intuition
Out of 10,000 people, 100 have the disease and 9,900 do not. The test correctly identifies 99 of the 100 sick people (sensitivity = 99%). But it also falsely flags 99 of the 9,900 healthy people (false positive rate = 1%). So there are 99 true positives and 99 false positives among the 198 positive results: exactly 50%.
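The counting argument above can be checked directly. A minimal Python sketch, using the values from the running example:

```python
# Frequency-count version of the 10,000-person argument.
population = 10_000
prevalence = 0.01
sensitivity = 0.99
specificity = 0.99

sick = population * prevalence                  # 100 people have the disease
healthy = population - sick                     # 9,900 do not
true_positives = sick * sensitivity             # 99 sick people flagged correctly
false_positives = healthy * (1 - specificity)   # 99 healthy people flagged incorrectly

ppv = true_positives / (true_positives + false_positives)
print(f"PPV = {ppv:.2%}")  # → 50.00%
```

Counting expected frequencies like this (rather than manipulating conditional probabilities) is the representation Gigerenzer recommends for avoiding the fallacy in the first place.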
Proof Sketch
Direct application of Bayes' theorem:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}$$

Substituting $P(+ \mid D) = 0.99$, $P(D) = 0.01$, $P(+ \mid \neg D) = 0.01$:

$$P(D \mid +) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.5$$
Why It Matters
This formula shows that PPV depends on three quantities, not just test accuracy. When prevalence is low, even highly accurate tests produce many false positives relative to true positives. This is the core reason why screening tests for rare conditions require confirmation with a second, more specific test.
Failure Mode
The formula assumes test performance is constant across the population. In practice, sensitivity and specificity can vary by subgroup (age, genetics, disease severity). The formula also breaks down when tests are applied to selected populations rather than the general population, because the effective prevalence changes.
Connection to ML: Precision and Class Imbalance
In ML terminology, PPV is precision. Sensitivity is recall. The base rate fallacy explains why precision drops when classes are imbalanced:
- Precision = $\frac{TP}{TP + FP}$: same as PPV
- Recall = $\frac{TP}{TP + FN}$: same as sensitivity
A classifier with 99% accuracy on a 1% positive rate dataset can achieve this by predicting "negative" for every example. It has 99% accuracy, 0% recall, and undefined (0/0) precision on the positive class. Accuracy alone hides the failure.
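The always-negative classifier can be demonstrated in a few lines. A sketch in plain Python (the 10,000-example dataset is illustrative):

```python
# Why accuracy hides the failure: an always-negative "classifier"
# on a dataset with a 1% positive rate.
n = 10_000
labels = [1] * 100 + [0] * 9_900   # 1% positives
preds = [0] * n                    # predict negative for everything

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

accuracy = (tp + tn) / n                                   # 0.99
recall = tp / (tp + fn)                                    # 0.0
precision = tp / (tp + fp) if (tp + fp) else float("nan")  # 0/0: undefined
print(accuracy, recall, precision)
```

Note that precision requires a guard for the 0/0 case; libraries differ on whether they report 0, NaN, or a warning when no positive predictions are made.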
Common Confusions
Test accuracy equals probability of disease given positive test
A test that is "99% accurate" does not mean a positive result has a 99% chance of being correct. The 99% refers to $P(+ \mid D)$ (sensitivity) and $P(- \mid \neg D)$ (specificity), not to $P(D \mid +)$. These are different quantities. The confusion is between $P(+ \mid D)$ and $P(D \mid +)$.
High accuracy means a good classifier
On imbalanced datasets, accuracy is dominated by the majority class. A spam filter with 99.9% accuracy that never flags any email as spam (because only 0.1% of emails are spam) is useless. Use precision, recall, and F1 instead.
Repeated testing fixes the problem
A common suggestion is "just test again." If the second test is independent given disease status, the math does work: a second positive raises the posterior significantly. But in practice, the same test on the same patient often has correlated errors, reducing the benefit of retesting.
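Under the independence assumption, the retesting math is just two Bayes updates in sequence, where the posterior after the first positive becomes the prior for the second. A sketch with the example's numbers:

```python
# Posterior after one vs. two positive results, assuming the second
# test's errors are fully independent given disease status.
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    """One Bayes update for a positive test result."""
    numerator = sensitivity * prior
    return numerator / (numerator + (1 - specificity) * (1 - prior))

p1 = posterior(0.01, 0.99, 0.99)  # after the first positive: 0.5
p2 = posterior(p1, 0.99, 0.99)    # after an independent second positive: 0.99
print(p1, p2)
```

If the tests' errors are correlated (the usual situation when rerunning the same assay on the same sample), the second update is too optimistic and the true posterior lies somewhere between p1 and p2.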
Canonical Examples
Disease screening with different prevalences
Fix sensitivity = 99%, specificity = 99%.
| Prevalence | PPV |
|---|---|
| 50% | 99% |
| 10% | 91.7% |
| 1% | 50% |
| 0.1% | 9.0% |
At 0.1% prevalence, a positive result means only a 9% chance of disease. The same test goes from nearly definitive to nearly useless as prevalence drops.
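The table can be reproduced from the PPV formula. A small sketch at fixed sensitivity = specificity = 99%:

```python
# PPV as a function of prevalence, sensitivity = specificity = 99%.
def ppv(prevalence: float, sensitivity: float = 0.99, specificity: float = 0.99) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.10, 0.01, 0.001):
    print(f"prevalence {prev:>6.1%}: PPV = {ppv(prev):.1%}")
```

Sweeping prevalence like this is a quick sanity check worth running before deploying any screening test or rare-event classifier.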
Exercises
Problem
A classifier has 95% recall and 90% specificity on a binary task where 5% of examples are positive. What is the precision?
Problem
What specificity is needed to achieve 95% precision when prevalence is 1% and sensitivity is 99%?
References
Canonical:
- Kahneman, Slovic, Tversky, Judgment Under Uncertainty (1982), Chapter on base rates
- Gigerenzer, "Calculated Risks" (2002), Chapters 3-4
Current:
- Saito & Rehmsmeier, "The Precision-Recall Plot Is More Informative than the ROC Plot", PLOS ONE (2015)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7
Next Topics
- Confusion matrices and classification metrics: the full framework for evaluating classifiers
- Simpson's paradox: another case where aggregation produces misleading results
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)