
Methodology

Simpson's Paradox

A trend present in every subgroup can reverse when the subgroups are combined. This happens when a confounding variable influences both group membership and the outcome.


Why This Matters

Simpson's paradox is the most important example of why you cannot blindly aggregate data. It shows that a statistical relationship can reverse direction when you move between subgroup analysis and aggregate analysis. This is not a curiosity. It determines whether a medical treatment is judged helpful or harmful, whether a university is judged biased or fair, and whether an ML feature appears positively or negatively correlated with the target.

The paradox has no purely statistical resolution. Deciding which analysis is "correct" (subgroup or aggregate) requires causal reasoning. This makes Simpson's paradox the bridge between statistics and causal inference.

The Paradox

Definition

Simpson's Paradox

A statistical relationship between two variables $X$ and $Y$ that holds in every subgroup defined by a third variable $Z$ reverses when the subgroups are combined. Formally: it is possible that $P(Y \mid X, Z = z) > P(Y \mid \neg X, Z = z)$ for every value $z$, yet $P(Y \mid X) < P(Y \mid \neg X)$ in the aggregate.

The Classic Example: Berkeley Admissions

In 1973, UC Berkeley graduate admissions appeared to discriminate against women: 44% of male applicants were admitted versus 35% of female applicants. But when broken down by department, women had equal or higher admission rates in most departments.

The confounder: women disproportionately applied to competitive departments with low admission rates. Men disproportionately applied to less competitive departments. The variable "department applied to" confounded the relationship between gender and admission.

Formal Setup

Consider binary treatment $X \in \{0, 1\}$, binary outcome $Y \in \{0, 1\}$, and a confounding variable $Z$ taking values in $\{1, \ldots, K\}$.

Let $p_k = P(Y = 1 \mid X = 1, Z = k)$ and $q_k = P(Y = 1 \mid X = 0, Z = k)$.

Let $n_k^{(1)}$ be the number of treated units in stratum $k$, and $n_k^{(0)}$ the number of control units.

Main Theorems

Theorem

Simpson's Reversal Condition

Statement

It is possible that $p_k > q_k$ for all $k$ (treatment helps in every stratum) while the aggregate rates satisfy:

$$\frac{\sum_k n_k^{(1)} p_k}{\sum_k n_k^{(1)}} < \frac{\sum_k n_k^{(0)} q_k}{\sum_k n_k^{(0)}}$$

This occurs when the treated group is disproportionately represented in strata with lower baseline rates. The weights $n_k^{(1)} / \sum_j n_j^{(1)}$ and $n_k^{(0)} / \sum_j n_j^{(0)}$ differ across treatment groups, and this differential weighting can flip the overall comparison.

Intuition

The aggregate rate is a weighted average of the stratum-specific rates. If treated and control groups use different weights (because they are distributed differently across strata), the weighted averages can reverse even when every stratum-specific comparison goes the same way.

Proof Sketch

Construct a concrete example. Start with a balanced allocation. Stratum A: treatment succeeds 80/100 (80%), control succeeds 70/100 (70%). Stratum B: treatment succeeds 30/50 (60%), control succeeds 20/50 (40%). Within each stratum, treatment wins. Aggregate: treatment 110/150 ≈ 73.3%, control 90/150 = 60%. No reversal here. Now skew the allocation: put most treated units in stratum B (low baseline) and most control units in stratum A (high baseline), preserving the within-stratum advantage. Stratum A: treatment 8/10 (80%), control 63/90 (70%). Stratum B: treatment 36/90 (40%), control 3/10 (30%). Treatment still wins within each stratum. Aggregate: treatment (8 + 36)/100 = 44%, control (63 + 3)/100 = 66%. Reversal achieved.
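The skewed allocation in the sketch can be checked mechanically. A minimal sketch in Python (the counts are illustrative constructions, not data from any real study):

```python
# Verify a Simpson's reversal: treatment beats control in every stratum,
# yet loses in the aggregate. All counts are illustrative.
strata = {
    # stratum: (treated_successes, treated_n, control_successes, control_n)
    "A": (8, 10, 63, 90),   # high-baseline stratum, mostly control units
    "B": (36, 90, 3, 10),   # low-baseline stratum, mostly treated units
}

for name, (ts, tn, cs, cn) in strata.items():
    assert ts / tn > cs / cn, f"treatment should win in stratum {name}"
    print(f"{name}: treatment {ts/tn:.0%} vs control {cs/cn:.0%}")

agg_t = sum(v[0] for v in strata.values()) / sum(v[1] for v in strata.values())
agg_c = sum(v[2] for v in strata.values()) / sum(v[3] for v in strata.values())
print(f"aggregate: treatment {agg_t:.0%} vs control {agg_c:.0%}")
assert agg_t < agg_c  # reversal: treatment loses overall
```

Changing the per-stratum counts while keeping each within-stratum comparison intact makes it easy to explore how extreme the allocation skew must be before the aggregate flips.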

Why It Matters

Any time you compare two groups using aggregate statistics, you risk Simpson's paradox if the groups have different compositions over a confounding variable. This applies to A/B testing (if user segments differ), model comparisons (if evaluation data mixes differ), and observational studies generally.

Failure Mode

The paradox does not tell you which analysis is correct. If $Z$ is a confounder (a common cause of $X$ and $Y$), you should condition on $Z$. If $Z$ is a mediator (caused by $X$ and affecting $Y$), you should not condition on it. Only a causal model, not the data alone, resolves the ambiguity.

Connection to Causal Inference

Pearl's causal framework resolves Simpson's paradox by asking: what is the causal graph? If $Z$ is a common cause of $X$ and $Y$, the correct analysis adjusts for $Z$ (the subgroup analysis is right). If $Z$ is a mediator, caused by $X$ and itself causing $Y$, adjusting for $Z$ blocks part of the causal effect, and the aggregate analysis is right.

The formula for the causal effect when $Z$ is a confounder is:

$$P(Y = 1 \mid do(X = 1)) = \sum_k P(Y = 1 \mid X = 1, Z = k) \cdot P(Z = k)$$

Note the crucial detail: the weights are $P(Z = k)$ from the marginal distribution of $Z$, not the conditional distribution given $X$.
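The difference between marginal and conditional weighting can be made concrete. A sketch with invented probabilities (the stratum rates and allocation are hypothetical, chosen so treated units concentrate in the low-baseline stratum):

```python
# Back-door adjustment: weight stratum-specific rates by the marginal P(Z=k),
# not by P(Z=k | X=1). All probabilities are invented for illustration.
p_y_given_xz = {  # P(Y=1 | X=x, Z=k)
    (1, "A"): 0.8, (1, "B"): 0.4,
    (0, "A"): 0.7, (0, "B"): 0.3,
}
p_z = {"A": 0.5, "B": 0.5}           # marginal distribution of Z
p_z_given_x1 = {"A": 0.1, "B": 0.9}  # treated units concentrate in stratum B

# Causal effect via adjustment: sum_k P(Y=1 | X=1, Z=k) * P(Z=k)
p_do = sum(p_y_given_xz[(1, k)] * p_z[k] for k in p_z)

# Naive observational rate: sum_k P(Y=1 | X=1, Z=k) * P(Z=k | X=1)
p_cond = sum(p_y_given_xz[(1, k)] * p_z_given_x1[k] for k in p_z)

print(f"P(Y=1 | do(X=1)) = {p_do:.2f}")   # 0.8*0.5 + 0.4*0.5 = 0.60
print(f"P(Y=1 | X=1)     = {p_cond:.2f}")  # 0.8*0.1 + 0.4*0.9 = 0.44
```

The naive rate understates the causal effect here precisely because the treated group is over-represented in the low-baseline stratum, which is the same mechanism that drives the reversal.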

Common Confusions

Watch Out

The paradox means one analysis is wrong

Both analyses are arithmetically correct. The question is which one answers the question you are asking. If you want the causal effect of treatment, you must know the causal structure. If $Z$ is a confounder, stratify. If $Z$ is a collider, do not.

Watch Out

More data resolves the paradox

No. The paradox is not a small-sample artifact. It persists at any sample size because it is a structural property of how the confounding variable relates to treatment and outcome.
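The scale-invariance claim can be checked directly: multiplying every cell count by a constant leaves all rates, and hence the reversal, unchanged. A quick sketch with illustrative counts:

```python
# Simpson's reversal is scale-invariant: scaling all counts by any constant
# leaves every rate unchanged, so the reversal persists. Counts illustrative.
def aggregate_rates(scale):
    treatment = [(8 * scale, 10 * scale), (36 * scale, 90 * scale)]
    control = [(63 * scale, 90 * scale), (3 * scale, 10 * scale)]
    agg = lambda g: sum(s for s, _ in g) / sum(n for _, n in g)
    return agg(treatment), agg(control)

for scale in (1, 1_000, 1_000_000):
    t_rate, c_rate = aggregate_rates(scale)
    assert t_rate < c_rate  # reversal persists at every sample size
    print(f"n x{scale}: treatment {t_rate:.0%} vs control {c_rate:.0%}")
```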

Canonical Examples

Example

Numerical construction

Two departments, A and B. Department A admits 40% of applicants. Department B admits 80%.

Men: 20 apply to A (8 admitted), 80 apply to B (64 admitted). Total: 72/100 = 72%. Women: 80 apply to A (32 admitted), 20 apply to B (16 admitted). Total: 48/100 = 48%.

Within department A: women 40%, men 40%. Within department B: women 80%, men 80%. No gender bias in either department. But the aggregate shows men at 72% vs women at 48%. The "bias" is entirely driven by women applying more to the selective department.
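The two-department construction can be verified in a few lines, using exactly the counts from the example:

```python
# Two-department admissions example: identical admission rates within each
# department, yet a large aggregate gap driven by where each group applies.
applicants = {
    # (group, dept): (admitted, applied)
    ("men", "A"): (8, 20), ("men", "B"): (64, 80),
    ("women", "A"): (32, 80), ("women", "B"): (16, 20),
}

for dept in ("A", "B"):
    m_adm, m_app = applicants[("men", dept)]
    w_adm, w_app = applicants[("women", dept)]
    assert m_adm / m_app == w_adm / w_app  # no within-department gap

for group in ("men", "women"):
    admitted = sum(v[0] for (g, _), v in applicants.items() if g == group)
    applied = sum(v[1] for (g, _), v in applicants.items() if g == group)
    print(f"{group}: {admitted}/{applied} = {admitted/applied:.0%}")
# men: 72/100 = 72%, women: 48/100 = 48%
```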

Exercises

ExerciseCore

Problem

Drug A cures 30/100 young patients and 50/100 old patients. Drug B cures 5/20 young patients and 40/80 old patients. Which drug has a higher cure rate in each age group? Which drug has a higher aggregate cure rate? Is this a Simpson's paradox?

ExerciseAdvanced

Problem

Prove that Simpson's paradox cannot occur when the treatment variable $X$ is independent of the confounding variable $Z$. That is, if $P(Z = k \mid X = 1) = P(Z = k \mid X = 0)$ for all $k$, the within-stratum and aggregate comparisons must agree in direction.

References

Canonical:

  • Simpson, "The Interpretation of Interaction in Contingency Tables", JRSS-B (1951)
  • Pearl, Causality (2009), Chapter 6

Current:

  • Pearl, Glymour, Jewell, Causal Inference in Statistics: A Primer (2016), Chapter 1

  • Bickel, Hammel, O'Connell, "Sex Bias in Graduate Admissions", Science (1975)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

  • Base-rate fallacy: another case where marginal and conditional reasoning diverge
  • Statistical paradoxes collection: a survey of related paradoxes in statistics

Last reviewed: April 2026
