
Methodology

Simpson's Paradox

A trend present in every subgroup can reverse when the subgroups are combined. This happens when a confounding variable influences both group membership and the outcome.


Why This Matters

Simpson's paradox is the most important example of why you cannot blindly aggregate data. It shows that a statistical relationship can reverse direction when you move between subgroup analysis and aggregate analysis. This is not a curiosity. It determines whether a medical treatment is judged helpful or harmful, whether a university is judged biased or fair, and whether an ML feature appears positively or negatively correlated with the target.

The paradox has no purely statistical resolution. Deciding which analysis is "correct" (subgroup or aggregate) requires causal reasoning. This makes Simpson's paradox the bridge between statistics and causal inference.

The Paradox

Definition

Simpson's Paradox

A statistical relationship between two variables $X$ and $Y$ that holds in every subgroup defined by a third variable $Z$ reverses when the subgroups are combined. Formally: it is possible that $P(Y \mid X, Z = z) > P(Y \mid \neg X, Z = z)$ for every value $z$, yet $P(Y \mid X) < P(Y \mid \neg X)$ in the aggregate.

The Classic Example: Berkeley Admissions

In 1973, UC Berkeley graduate admissions appeared to discriminate against women: 44% of male applicants were admitted versus 35% of female applicants. But when broken down by department, women had equal or higher admission rates in most departments.

The confounder: women disproportionately applied to competitive departments with low admission rates. Men disproportionately applied to less competitive departments. The variable "department applied to" confounded the relationship between gender and admission.

Formal Setup

Consider binary treatment $X \in \{0, 1\}$, binary outcome $Y \in \{0, 1\}$, and a confounding variable $Z$ taking values in $\{1, \ldots, K\}$.

Let $p_k = P(Y = 1 \mid X = 1, Z = k)$ and $q_k = P(Y = 1 \mid X = 0, Z = k)$.

Let $n_k^{(1)}$ be the number of treated units in stratum $k$, and $n_k^{(0)}$ the number of control units.

Main Theorems

Theorem

Simpson's Reversal Condition

Statement

It is possible that $p_k > q_k$ for all $k$ (treatment helps in every stratum) while the aggregate rates satisfy:

$$\frac{\sum_k n_k^{(1)} p_k}{\sum_k n_k^{(1)}} < \frac{\sum_k n_k^{(0)} q_k}{\sum_k n_k^{(0)}}$$

This occurs when the treated group is disproportionately represented in strata with lower baseline rates. The weights $n_k^{(1)} / \sum_j n_j^{(1)}$ and $n_k^{(0)} / \sum_j n_j^{(0)}$ differ across treatment groups, and this differential weighting can flip the overall comparison.

Intuition

The aggregate rate is a weighted average of the stratum-specific rates. If treated and control groups use different weights (because they are distributed differently across strata), the weighted averages can reverse even when every stratum-specific comparison goes the same way.

Proof Sketch

Construct a concrete example. Start with a balanced allocation. Stratum A: treatment succeeds 80/100 (80%), control succeeds 70/100 (70%). Stratum B: treatment succeeds 30/50 (60%), control succeeds 20/50 (40%). Within each stratum, treatment wins. Aggregate: treatment 110/150 ≈ 73.3%, control 90/150 = 60%. No reversal here. Now skew the allocation: put most treated units in stratum B (low baseline) and most control units in stratum A (high baseline), preserving the within-stratum advantage. Stratum A: treatment 8/10 (80%), control 63/90 (70%). Stratum B: treatment 36/90 (40%), control 3/10 (30%). Treatment still wins within each stratum. Aggregate: treatment (8 + 36)/100 = 44%, control (63 + 3)/100 = 66%. Reversal achieved.
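The skewed allocation in the sketch can be checked mechanically. A minimal sketch in Python (the counts are illustrative constructions, not data from any real study):

```python
# Verify a Simpson's reversal: treatment beats control in every stratum,
# yet loses in the aggregate. All counts are illustrative.
strata = {
    # stratum: (treated_successes, treated_n, control_successes, control_n)
    "A": (8, 10, 63, 90),   # high-baseline stratum, mostly control units
    "B": (36, 90, 3, 10),   # low-baseline stratum, mostly treated units
}

for name, (ts, tn, cs, cn) in strata.items():
    assert ts / tn > cs / cn, f"treatment should win in stratum {name}"
    print(f"{name}: treatment {ts/tn:.0%} vs control {cs/cn:.0%}")

agg_t = sum(v[0] for v in strata.values()) / sum(v[1] for v in strata.values())
agg_c = sum(v[2] for v in strata.values()) / sum(v[3] for v in strata.values())
print(f"aggregate: treatment {agg_t:.0%} vs control {agg_c:.0%}")
assert agg_t < agg_c  # reversal: treatment loses overall
```

Changing the per-stratum counts while keeping each within-stratum comparison intact makes it easy to explore how extreme the allocation skew must be before the aggregate flips.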

Why It Matters

Any time you compare two groups using aggregate statistics, you risk Simpson's paradox if the groups have different compositions over a confounding variable. This applies to A/B testing (if user segments differ), model comparisons (if evaluation data mixes differ), and observational studies generally.

Failure Mode

The paradox does not tell you which analysis is correct. If $Z$ is a confounder (a common cause of $X$ and $Y$), you should condition on $Z$. If $Z$ is a mediator (caused by $X$ and affecting $Y$), you should not condition on it. Only a causal model, not the data alone, resolves the ambiguity.

Connection to Causal Inference

Pearl's causal framework resolves Simpson's paradox by asking: what is the causal graph? If $Z$ is a common cause of $X$ and $Y$, the correct analysis adjusts for $Z$ (the subgroup analysis is right). If $Z$ is a mediator, caused by $X$ and itself causing $Y$, adjusting for $Z$ blocks part of the causal effect, and the aggregate analysis is right.

The formula for the causal effect when $Z$ is a confounder is:

$$P(Y = 1 \mid do(X = 1)) = \sum_k P(Y = 1 \mid X = 1, Z = k) \cdot P(Z = k)$$

Note the crucial detail: the weights are $P(Z = k)$ from the marginal distribution of $Z$, not the conditional distribution given $X$.
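The difference between marginal and conditional weighting can be made concrete. A sketch with invented probabilities (the stratum rates and allocation are hypothetical, chosen so treated units concentrate in the low-baseline stratum):

```python
# Back-door adjustment: weight stratum-specific rates by the marginal P(Z=k),
# not by P(Z=k | X=1). All probabilities are invented for illustration.
p_y_given_xz = {  # P(Y=1 | X=x, Z=k)
    (1, "A"): 0.8, (1, "B"): 0.4,
    (0, "A"): 0.7, (0, "B"): 0.3,
}
p_z = {"A": 0.5, "B": 0.5}           # marginal distribution of Z
p_z_given_x1 = {"A": 0.1, "B": 0.9}  # treated units concentrate in stratum B

# Causal effect via adjustment: sum_k P(Y=1 | X=1, Z=k) * P(Z=k)
p_do = sum(p_y_given_xz[(1, k)] * p_z[k] for k in p_z)

# Naive observational rate: sum_k P(Y=1 | X=1, Z=k) * P(Z=k | X=1)
p_cond = sum(p_y_given_xz[(1, k)] * p_z_given_x1[k] for k in p_z)

print(f"P(Y=1 | do(X=1)) = {p_do:.2f}")   # 0.8*0.5 + 0.4*0.5 = 0.60
print(f"P(Y=1 | X=1)     = {p_cond:.2f}")  # 0.8*0.1 + 0.4*0.9 = 0.44
```

The naive rate understates the causal effect here precisely because the treated group is over-represented in the low-baseline stratum, which is the same mechanism that drives the reversal.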

Common Confusions

Watch Out

The paradox means one analysis is wrong

Both analyses are arithmetically correct. The question is which one answers the question you are asking. If you want the causal effect of treatment, you must know the causal structure. If $Z$ is a confounder, stratify. If $Z$ is a collider, do not.

Watch Out

More data resolves the paradox

No. The paradox is not a small-sample artifact. It persists at any sample size because it is a structural property of how the confounding variable relates to treatment and outcome.
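The scale-invariance claim can be checked directly: multiplying every cell count by a constant leaves all rates, and hence the reversal, unchanged. A quick sketch with illustrative counts:

```python
# Simpson's reversal is scale-invariant: scaling all counts by any constant
# leaves every rate unchanged, so the reversal persists. Counts illustrative.
def aggregate_rates(scale):
    treatment = [(8 * scale, 10 * scale), (36 * scale, 90 * scale)]
    control = [(63 * scale, 90 * scale), (3 * scale, 10 * scale)]
    agg = lambda g: sum(s for s, _ in g) / sum(n for _, n in g)
    return agg(treatment), agg(control)

for scale in (1, 1_000, 1_000_000):
    t_rate, c_rate = aggregate_rates(scale)
    assert t_rate < c_rate  # reversal persists at every sample size
    print(f"n x{scale}: treatment {t_rate:.0%} vs control {c_rate:.0%}")
```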

Canonical Examples

Example

Numerical construction

Two departments, A and B. Department A admits 40% of applicants. Department B admits 80%.

Men: 20 apply to A (8 admitted), 80 apply to B (64 admitted). Total: 72/100 = 72%. Women: 80 apply to A (32 admitted), 20 apply to B (16 admitted). Total: 48/100 = 48%.

Within department A: women 40%, men 40%. Within department B: women 80%, men 80%. No gender bias in either department. But the aggregate shows men at 72% vs women at 48%. The "bias" is entirely driven by women applying more to the selective department.
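The two-department construction can be verified in a few lines, using exactly the counts from the example:

```python
# Two-department admissions example: identical admission rates within each
# department, yet a large aggregate gap driven by where each group applies.
applicants = {
    # (group, dept): (admitted, applied)
    ("men", "A"): (8, 20), ("men", "B"): (64, 80),
    ("women", "A"): (32, 80), ("women", "B"): (16, 20),
}

for dept in ("A", "B"):
    m_adm, m_app = applicants[("men", dept)]
    w_adm, w_app = applicants[("women", dept)]
    assert m_adm / m_app == w_adm / w_app  # no within-department gap

for group in ("men", "women"):
    admitted = sum(v[0] for (g, _), v in applicants.items() if g == group)
    applied = sum(v[1] for (g, _), v in applicants.items() if g == group)
    print(f"{group}: {admitted}/{applied} = {admitted/applied:.0%}")
# men: 72/100 = 72%, women: 48/100 = 48%
```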

Exercises

ExerciseCore

Problem

Drug A cures 30/100 young patients and 50/100 old patients. Drug B cures 5/20 young patients and 40/80 old patients. Which drug has a higher cure rate in each age group? Which drug has a higher aggregate cure rate? Is this a Simpson's paradox?

ExerciseAdvanced

Problem

Prove that Simpson's paradox cannot occur when the treatment variable $X$ is independent of the confounding variable $Z$. That is, if $P(Z = k \mid X = 1) = P(Z = k \mid X = 0)$ for all $k$, the within-stratum and aggregate comparisons must agree in direction.

References

Canonical:

  • Simpson, "The Interpretation of Interaction in Contingency Tables", JRSS-B (1951)
  • Pearl, Causality (2009), Chapter 6

Current:

  • Pearl, Glymour, Jewell, Causal Inference in Statistics: A Primer (2016), Chapter 1

  • Bickel, Hammel, O'Connell, "Sex Bias in Graduate Admissions", Science (1975)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

  • Base-rate fallacy: another case where marginal and conditional reasoning diverge
  • Statistical paradoxes collection: a survey of related paradoxes in statistics

Last reviewed: April 2026
