Methodology
Simpson's Paradox
A trend present in every subgroup can reverse when the subgroups are combined. This happens when a confounding variable determines both group membership and outcome.
Why This Matters
Simpson's paradox is the most important example of why you cannot blindly aggregate data. It shows that a statistical relationship can reverse direction when you move between subgroup analysis and aggregate analysis. This is not a curiosity. It determines whether a medical treatment is judged helpful or harmful, whether a university is judged biased or fair, and whether an ML feature appears positively or negatively correlated with the target.
The paradox has no purely statistical resolution. Deciding which analysis is "correct" (subgroup or aggregate) requires causal reasoning. This makes Simpson's paradox the bridge between statistics and causal inference.
The Paradox
Simpson's Paradox
A statistical relationship between two variables X and Y that holds in every subgroup defined by a third variable Z reverses when the subgroups are combined. Formally: it is possible that P(Y = 1 | X = 1, Z = z) > P(Y = 1 | X = 0, Z = z) for every value z, yet P(Y = 1 | X = 1) < P(Y = 1 | X = 0) in the aggregate.
The Classic Example: Berkeley Admissions
In 1973, UC Berkeley graduate admissions appeared to discriminate against women: 44% of male applicants were admitted versus 35% of female applicants. But when broken down by department, women had equal or higher admission rates in most departments.
The confounder: women disproportionately applied to competitive departments with low admission rates. Men disproportionately applied to less competitive departments. The variable "department applied to" confounded the relationship between gender and admission.
Formal Setup
Consider binary treatment X ∈ {0, 1}, binary outcome Y ∈ {0, 1}, and a confounding variable Z taking values in {1, …, K}.
Let p_z = P(Y = 1 | X = 1, Z = z) and q_z = P(Y = 1 | X = 0, Z = z).
Let n_z be the number of treated units in stratum z, and m_z the number of control units.
Main Theorems
Simpson's Reversal Condition
Statement
It is possible that p_z > q_z for all z (treatment helps in every stratum) while the aggregate rates satisfy:
Σ_z (n_z / n) p_z < Σ_z (m_z / m) q_z, where n = Σ_z n_z and m = Σ_z m_z.
This occurs when the treated group is disproportionately represented in strata with lower baseline rates. The weights n_z / n and m_z / m differ across treatment groups, and this differential weighting can flip the overall comparison.
Intuition
The aggregate rate is a weighted average of the stratum-specific rates. If treated and control groups use different weights (because they are distributed differently across strata), the weighted averages can reverse even when every stratum-specific comparison goes the same way.
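A minimal sketch of this intuition: the same per-stratum rates aggregate to very different overall rates depending on how a group is distributed across strata. The function name `aggregate_rate` and the specific counts are illustrative, not from the text.

```python
# Sketch: the aggregate success rate is a weighted average of the
# stratum-specific rates, with weights given by each group's own
# distribution over strata.

def aggregate_rate(stratum_rates, stratum_counts):
    """Count-weighted average of per-stratum success rates."""
    total = sum(stratum_counts)
    return sum(r * n for r, n in zip(stratum_rates, stratum_counts)) / total

# Identical per-stratum rates, different distributions over strata:
rates = [0.8, 0.4]                      # stratum A, stratum B
print(aggregate_rate(rates, [90, 10]))  # mostly in stratum A -> 0.76
print(aggregate_rate(rates, [10, 90]))  # mostly in stratum B -> 0.44
```

Two groups with the exact same within-stratum performance can thus show a 32-point aggregate gap purely from composition.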
Proof Sketch
Construct a concrete example. Stratum A: treatment succeeds 80/100, control succeeds 70/100. Stratum B: treatment succeeds 30/50, control succeeds 20/50. Within each stratum, treatment wins (80% vs 70%, 60% vs 40%). Aggregate: treatment succeeds 110/150 ≈ 73.3%, control succeeds 90/150 = 60%. No reversal yet. Now change the distribution: put most treated units in stratum B (the low-baseline stratum) and most control units in stratum A (the high-baseline stratum). Stratum A: treatment 8/10 (80%), control 63/90 (70%). Stratum B: treatment 36/90 (40%), control 3/10 (30%). Within each stratum, treatment still wins. Aggregate: treatment (8 + 36)/100 = 44%, control (63 + 3)/100 = 66%. Reversal achieved.
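The redistribution step can be checked numerically. The counts below follow the same recipe (treated units concentrated in the low-baseline stratum, control units in the high-baseline one) and are illustrative:

```python
# Verify a Simpson reversal: treatment beats control in every
# stratum, yet loses in the aggregate. Entries are (successes, trials).
data = {
    "treated": {"A": (8, 10),  "B": (36, 90)},
    "control": {"A": (63, 90), "B": (3, 10)},
}

def rate(successes, trials):
    return successes / trials

# Within each stratum, treatment wins.
for stratum in ("A", "B"):
    t = rate(*data["treated"][stratum])
    c = rate(*data["control"][stratum])
    assert t > c, stratum
    print(f"stratum {stratum}: treated {t:.0%} vs control {c:.0%}")

# In the aggregate, treatment loses.
agg = {g: rate(sum(s for s, _ in d.values()), sum(n for _, n in d.values()))
       for g, d in data.items()}
print(f"aggregate: treated {agg['treated']:.0%} vs control {agg['control']:.0%}")
assert agg["treated"] < agg["control"]  # reversal
```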
Why It Matters
Any time you compare two groups using aggregate statistics, you risk Simpson's paradox if the groups have different compositions over a confounding variable. This applies to A/B testing (if user segments differ), model comparisons (if evaluation data mixes differ), and observational studies generally.
Failure Mode
The paradox does not tell you which analysis is correct. If Z is a confounder (a common cause of X and Y), you should condition on Z. If Z is a mediator (caused by X and affecting Y), you should not condition on it. Only a causal model, not the data alone, resolves the ambiguity.
Connection to Causal Inference
Pearl's causal framework resolves Simpson's paradox by asking: what is the causal graph? If Z is a common cause of X and Y, the correct analysis adjusts for Z (the subgroup analysis is right). If Z is caused by X and itself causes Y (a mediator), adjusting for Z blocks part of the effect you want to measure, and the aggregate analysis is right.
The formula for the causal effect when Z is a confounder is:
P(Y = 1 | do(X = x)) = Σ_z P(Y = 1 | X = x, Z = z) P(Z = z)
Note the crucial detail: the weights P(Z = z) come from the marginal distribution of Z, not the conditional distribution P(Z = z | X = x).
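A sketch of the adjustment formula on illustrative counts where treatment wins in each stratum but loses in the naive aggregate (the helper names `p_y_given` and `do_effect` are mine):

```python
# Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z),
# using the MARGINAL P(z), not the conditional P(z | x).
counts = {  # (successes, trials) per (treatment, stratum)
    (1, "A"): (8, 10),  (1, "B"): (36, 90),   # treated
    (0, "A"): (63, 90), (0, "B"): (3, 10),    # control
}
strata = ("A", "B")
n_total = sum(n for _, n in counts.values())  # 200 units overall
p_z = {z: sum(counts[(x, z)][1] for x in (0, 1)) / n_total for z in strata}

def p_y_given(x, z):
    s, n = counts[(x, z)]
    return s / n

def do_effect(x):
    # Weight stratum rates by the marginal P(Z = z).
    return sum(p_y_given(x, z) * p_z[z] for z in strata)

print("P(Y=1 | do(X=1)) =", do_effect(1))  # 0.6
print("P(Y=1 | do(X=0)) =", do_effect(0))  # 0.5
```

The adjusted rates (0.60 vs 0.50) agree in direction with the stratum-level comparisons, even though the naive aggregate rates (0.44 vs 0.66) point the other way.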
Common Confusions
The paradox means one analysis is wrong
Both analyses are arithmetically correct. The question is which one answers the question you are asking. If you want the causal effect of treatment, you must know the causal structure. If Z is a confounder, stratify. If Z is a collider or a mediator, do not.
More data resolves the paradox
No. The paradox is not a small-sample artifact. It persists at any sample size because it is a structural property of how the confounding variable relates to treatment and outcome.
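This can be demonstrated by scaling every cell count: all rates, and hence the reversal, are unchanged at any sample size. The counts are illustrative:

```python
# The reversal is structural: multiplying every (successes, trials)
# cell by the same factor leaves all rates unchanged, so the
# reversal persists at any sample size.
base = {"treated": {"A": (8, 10),  "B": (36, 90)},
        "control": {"A": (63, 90), "B": (3, 10)}}

for scale in (1, 100, 10_000):
    agg = {}
    for group, cells in base.items():
        successes = sum(scale * s for s, _ in cells.values())
        trials = sum(scale * n for _, n in cells.values())
        agg[group] = successes / trials
    assert agg["treated"] < agg["control"]  # reversal at every scale
    print(scale, agg)
```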
Canonical Examples
Numerical construction
Two departments, A and B. Department A admits 40% of applicants. Department B admits 80%.
Men: 20 apply to A (8 admitted), 80 apply to B (64 admitted). Total: 72/100 = 72%. Women: 80 apply to A (32 admitted), 20 apply to B (16 admitted). Total: 48/100 = 48%.
Within department A: women 40%, men 40%. Within department B: women 80%, men 80%. No gender bias in either department. But the aggregate shows men at 72% vs women at 48%. The "bias" is entirely driven by women applying more to the selective department.
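The numerical construction above can be verified directly; the dictionary layout is my own:

```python
# Check the two-department construction: equal admission rates by
# gender within each department, but a large gap in the aggregate.
applicants = {  # (admitted, applied) per (gender, department)
    ("men", "A"): (8, 20),    ("men", "B"): (64, 80),
    ("women", "A"): (32, 80), ("women", "B"): (16, 20),
}

for dept in ("A", "B"):
    rates = {g: applicants[(g, dept)][0] / applicants[(g, dept)][1]
             for g in ("men", "women")}
    assert rates["men"] == rates["women"]  # no within-department gap
    print(f"dept {dept}: both genders admitted at {rates['men']:.0%}")

for g in ("men", "women"):
    admitted = sum(applicants[(g, d)][0] for d in ("A", "B"))
    applied = sum(applicants[(g, d)][1] for d in ("A", "B"))
    print(f"{g}: {admitted}/{applied} = {admitted / applied:.0%}")
```

The script confirms 40% and 80% within departments for both genders, yet 72% vs 48% in aggregate.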
Exercises
Problem
Drug A cures 30/100 young patients and 50/100 old patients. Drug B cures 5/20 young patients and 40/80 old patients. Which drug has a higher cure rate in each age group? Which drug has a higher aggregate cure rate? Is this a Simpson's paradox?
Problem
Prove that Simpson's paradox cannot occur when the treatment variable X is independent of the confounding variable Z. That is, if P(Z = z | X = 1) = P(Z = z | X = 0) for all z, the within-stratum and aggregate comparisons must agree in direction.
References
Canonical:
- Simpson, "The Interpretation of Interaction in Contingency Tables", JRSS-B (1951)
- Pearl, Causality (2009), Chapter 6
Current:
- Pearl, Glymour, Jewell, Causal Inference in Statistics: A Primer (2016), Chapter 1
- Bickel, Hammel, O'Connell, "Sex Bias in Graduate Admissions", Science (1975)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Base-rate fallacy: another case where marginal and conditional reasoning diverge
- Statistical paradoxes collection: a survey of related paradoxes in statistics
Last reviewed: April 2026