Statistical Paradoxes Collection
A curated collection of statistical paradoxes that trip up practitioners: Lindley's paradox, Lord's paradox, Freedman's paradox, Hand's paradox, and the low birth weight paradox. Each with statement, mechanism, and lesson.
Why This Matters
Statistical paradoxes are not mere curiosities. Each one reveals a specific failure mode in statistical reasoning that recurs in practice. Knowing these paradoxes equips you to spot the failure before it causes harm. Several of these paradoxes have direct analogues in ML: overfitting with many features (Freedman), conflating individual and population behavior (Hand), and conditioning on post-treatment variables (low birth weight).
This page collects five paradoxes beyond the "big two" (Simpson's paradox and the base-rate fallacy, which have their own dedicated pages).
Lindley's Paradox
A situation where a frequentist hypothesis test rejects the null hypothesis at a given significance level, while a Bayesian analysis with a reasonable prior assigns high posterior probability to the null. The two frameworks reach opposite conclusions from the same data.
Statement
Consider testing $H_0: \theta = 0$ vs $H_1: \theta \neq 0$ from $X_1, \dots, X_n \sim N(\theta, \sigma^2)$ with $\sigma^2$ known. As $n \to \infty$, there exist sample means $\bar{x}$ that are statistically significant at any fixed level $\alpha$ (i.e., $p < \alpha$) while the Bayesian posterior probability of $H_0$ approaches 1.
Concretely, if $\bar{x} = c/\sqrt{n}$ for some constant $c > z_{\alpha/2}\,\sigma$, the p-value stays below $\alpha$ for all $n$, but the Bayes factor in favor of $H_0$ grows as $\sqrt{n}$.
Intuition
The frequentist test asks: "is $\bar{x}$ far from 0 relative to its standard error $\sigma/\sqrt{n}$?" When $\bar{x} = c/\sqrt{n}$, the z-score is $z = \bar{x}/(\sigma/\sqrt{n}) = c/\sigma$, a constant, which stays significant. The Bayesian analysis asks: "is the data more likely under $H_0$ or under $H_1$?" The observed mean $c/\sqrt{n}$ is very close to 0 in absolute terms (approaching 0 as $n$ grows), so $H_0$ looks increasingly plausible from the Bayesian perspective.
Proof Sketch
Under $H_0$, $\bar{x} \sim N(0, \sigma^2/n)$, so the density at $\bar{x} = c/\sqrt{n}$ is $\frac{\sqrt{n}}{\sigma\sqrt{2\pi}} e^{-c^2/(2\sigma^2)}$, which grows like $\sqrt{n}$. Under a diffuse prior $\theta \sim N(0, \tau^2)$ for $H_1$, the marginal density of $\bar{x}$ near 0 is approximately the constant $\frac{1}{\tau\sqrt{2\pi}}$. The Bayes factor $B_{01}$ therefore scales as $\sqrt{n}$, so the posterior probability of $H_0$ approaches 1 even though the p-value remains below $\alpha$.
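Both quantities have closed forms in the normal model, which makes the divergence easy to demonstrate numerically. A minimal sketch (assuming $\sigma = 1$ and an illustrative $N(0, \tau^2)$ prior with $\tau = 1$; the function name `lindley` is ours) that holds the z-score fixed at 2.5 while $n$ grows:

```python
import math

def lindley(n, z=2.5, sigma=1.0, tau=1.0):
    """For a fixed z-score z, return (p-value, BF01) at sample size n.

    Assumes x1..xn ~ N(theta, sigma^2), H0: theta = 0, and an assumed
    N(0, tau^2) prior on theta under H1 (tau is an illustrative choice).
    """
    xbar = z * sigma / math.sqrt(n)          # sample mean with fixed z-score
    p_value = math.erfc(z / math.sqrt(2))    # two-sided normal p-value
    v0 = sigma**2 / n                        # Var(xbar) under H0
    v1 = tau**2 + sigma**2 / n               # marginal Var(xbar) under H1
    # BF01 = N(xbar; 0, v0) / N(xbar; 0, v1)
    bf01 = math.sqrt(v1 / v0) * math.exp(-0.5 * xbar**2 * (1 / v0 - 1 / v1))
    return p_value, bf01

for n in [10, 1000, 100000]:
    p, bf = lindley(n)
    print(f"n={n:>6}  p={p:.4f}  BF01={bf:.1f}")
```

The p-value is identical at every $n$ (about 0.012, "significant" at the 5% level), while the Bayes factor in favor of $H_0$ climbs without bound: the same data that reject the null for the frequentist increasingly support it for the Bayesian.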
Why It Matters
This paradox reveals a deep disagreement between frequentist and Bayesian hypothesis testing. In large samples, p-values can be "significant" for effect sizes that are practically negligible. The Bayesian framework, by requiring specification of alternatives, naturally penalizes vague alternative hypotheses. This is directly relevant to ML experiments where large datasets can produce tiny but "significant" improvements.
Failure Mode
The paradox requires a point null hypothesis with positive prior mass. If the Bayesian uses a continuous prior over the null region (testing $H_0: |\theta| \le \epsilon$ instead of $H_0: \theta = 0$), the paradox is reduced. The choice of prior on $\theta$ under $H_1$ also matters: a concentrated prior (small $\tau$) reduces the disagreement.
Lesson: Statistical significance is not the same as practical significance, and the two frameworks can formally disagree. Always report effect sizes alongside p-values.
Lord's Paradox
Two statisticians analyze the same data about the effect of a treatment (e.g., diet type) on an outcome (e.g., final weight), both using valid statistical methods, and reach opposite conclusions. One uses ANCOVA (adjusting for baseline), the other compares group means of change scores.
Lord's original formulation: two dining halls serve different diets. Freshman weights are measured at the start and end of the year. Statistician A compares average weight change between halls and finds no difference. Statistician B runs ANCOVA regressing final weight on diet type, controlling for initial weight, and finds a significant diet effect.
Why it happens: The paradox arises because the two analyses answer different causal questions. The change-score analysis estimates the average causal effect under different identifying assumptions than ANCOVA. When the groups differ in their baseline distributions (e.g., one hall starts heavier), the two methods can disagree because they handle the confounding differently.
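The disagreement is easy to reproduce in simulation. The sketch below uses an assumed data-generating process (final weight regresses halfway toward each hall's own mean, so the true mean change in both halls is zero by construction); all specific numbers are illustrative, not from Lord's paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # students per dining hall

# Assumed DGP: final = intercept_hall + 0.5 * initial + noise, with
# intercepts chosen so each hall's expected mean change is exactly zero.
init_a = rng.normal(150, 10, n)
init_b = rng.normal(170, 10, n)
final_a = 75 + 0.5 * init_a + rng.normal(0, 5, n)   # E[final] = 150
final_b = 85 + 0.5 * init_b + rng.normal(0, 5, n)   # E[final] = 170

# Statistician A: difference in mean change scores between halls
change_diff = (final_b - init_b).mean() - (final_a - init_a).mean()

# Statistician B: ANCOVA, final ~ 1 + diet + initial (OLS via lstsq)
initial = np.concatenate([init_a, init_b])
final = np.concatenate([final_a, final_b])
diet = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), diet, initial])
beta, *_ = np.linalg.lstsq(X, final, rcond=None)

print(f"change-score diet effect: {change_diff:+.2f}")   # near 0
print(f"ANCOVA diet coefficient:  {beta[1]:+.2f}")       # near +10
```

Both answers are correct computations on the same data: the change-score contrast is near zero while the ANCOVA coefficient recovers the +10 intercept gap between halls. Which one answers your causal question depends on the assumed causal model, not on the arithmetic.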
Lesson: Choosing between change scores and ANCOVA is a causal question, not a statistical one. You must specify your causal model (the DAG) before choosing the analysis method.
Freedman's Paradox
Statement
Generate $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ with i.i.d. $N(0,1)$ entries, with $y$ independent of $X$. Fit a linear regression of $y$ on $X$, select the variables whose coefficients are significant at the 5% level, and refit using only those variables. When $p/n$ is substantial (e.g., $p = 50$, $n = 100$), the refitted model will typically show multiple "significant" coefficients, a high $R^2$, and a significant F-test, despite there being no true signal.
Intuition
With $p$ predictors and $n$ observations, each predictor has about a 5% chance of appearing significant by chance (at the 5% level). If $p = 50$ and $n = 100$, about 2-3 predictors pass the screen. But the significance levels in the refitted model are invalid because the same data was used for selection and inference. The refitted model "locks in" the noise that happened to align with $y$, producing inflated t-statistics and $R^2$.
Proof Sketch
The expected number of variables selected in the first stage is approximately $0.05\,p$ (under the null). After refitting on only the selected variables, the effective degrees of freedom are understated because the selection step is ignored. The standard errors from the refitted model assume the variables were chosen a priori, leading to anti-conservative inference. Freedman (1983) showed via simulation that the F-test rejects the null (all coefficients zero) far more than 5% of the time.
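The simulation is short enough to replicate directly. A sketch (screening at the 5% level, with Freedman-style sizes $n = 100$, $p = 50$; the helper `t_stats` is ours), averaged over replications:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, reps = 100, 50, 200   # Freedman-style sizes; screen at the 5% level

def t_stats(X, y):
    """OLS t-statistics for design X (an intercept column is prepended)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]   # drop the intercept's t-statistic

n_selected, n_refit_sig = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                       # pure noise, independent of X
    keep = np.abs(t_stats(X, y)) > 1.96          # stage 1: screen
    n_selected.append(keep.sum())
    if keep.any():                               # stage 2: refit on survivors
        n_refit_sig.append((np.abs(t_stats(X[:, keep], y)) > 1.96).sum())

print(f"avg predictors passing the screen: {np.mean(n_selected):.2f}")
print(f"avg 'significant' coefficients after refit: {np.mean(n_refit_sig):.2f}")
```

On average, 2-3 pure-noise predictors pass the screen, and most of them remain "significant" in the refitted model, whose printed t-statistics carry no warning that a selection step preceded them.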
Why It Matters
This is the statistical version of overfitting. Any time you use the data to select features and then use the same data to assess significance, your inference is invalid. This is the core motivation for sample splitting, cross-validation, and post-selection inference procedures.
Failure Mode
The paradox is strongest when $p/n$ is large. When $p$ is very small relative to $n$, the effect is negligible. The paradox also assumes that variable selection and inference use the same data; honest sample splitting (select on one half, infer on the other) eliminates the problem.
Lesson: Never trust significance tests on variables that were selected using the same data. Split your data or use methods designed for post-selection inference.
Hand's Paradox
Each individual in a population prefers option A to option B, yet at the population level, option B appears superior. This differs from Simpson's paradox in that no subgroup aggregation is involved. Instead, the paradox arises because different individuals use different criteria, and the population-level summary combines incompatible scales.
Example: Suppose each customer rates products on different personal scales. Customer 1 rates product A as 8/10 and product B as 7/10 (prefers A). Customer 2 rates product A as 6/10 and product B as 5/10 (prefers A). Customer 1 is a generous rater (high scores overall), yet the average for A is $(8+6)/2 = 7$ and the average for B is $(7+5)/2 = 6$. No reversal here. But if the metric each person uses is different (e.g., one rates satisfaction, another rates quality), averaging can produce a reversal because the scales are incommensurable.
Why it happens: Aggregating individual preferences into a population preference requires that individual measurements be on a common scale. When they are not, the aggregate can reverse the individual-level ordering. This is related to Arrow's impossibility theorem in social choice theory.
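A toy numeric illustration (all numbers assumed): each rater prefers A to B on their own scale, but the scales differ and each product happens to be rated by a different rater, so the pooled raw averages compare incommensurable numbers and reverse the unanimous individual ordering:

```python
# Assumed toy data: rater 1 scores on a 0-100 scale, rater 2 on a 0-10 scale.
# On their own scales, BOTH raters prefer product A to product B.
prefs = {"rater1 (0-100 scale)": {"A": 80, "B": 70},
         "rater2 (0-10 scale)":  {"A": 8,  "B": 7}}

for rater, scores in prefs.items():
    print(f"{rater}: A={scores['A']} > B={scores['B']}  -> prefers A")

# Pooled ratings with unbalanced coverage: A was rated only by rater 2,
# B only by rater 1 -- so the averages mix the two incompatible scales.
ratings = {"A": [prefs["rater2 (0-10 scale)"]["A"]],
           "B": [prefs["rater1 (0-100 scale)"]["B"]]}
avg = {k: sum(v) / len(v) for k, v in ratings.items()}
print(f"pooled averages: A={avg['A']:.0f}, B={avg['B']:.0f}  -> B 'wins'")
```

The individual orderings are unanimous, yet the population summary flips them: the failure is not in the arithmetic but in averaging numbers that were never on a common scale.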
Lesson: Population-level averages can misrepresent individual-level preferences when measurement scales differ across individuals. In ML, this appears when aggregating metrics across heterogeneous evaluation sets.
Low Birth Weight Paradox
Among low birth weight infants, babies of smoking mothers have lower mortality than babies of non-smoking mothers. This appears to suggest that maternal smoking is protective for low birth weight babies, contradicting the known harmful effect of smoking.
Why it happens: Smoking causes low birth weight. But so do other, more severe conditions (e.g., congenital defects). Among low birth weight babies, the smoking-exposed group is "low weight for a less severe reason" (smoking alone) compared to the non-smoking group (low weight due to more dangerous causes). Conditioning on the intermediate variable "birth weight" introduces collider bias: birth weight is caused by both smoking and by other health conditions, and selecting on the collider creates a spurious association between smoking and the other conditions.
In causal DAG terms: Smoking → Low Birth Weight ← Severe Defects → Mortality. Conditioning on Low Birth Weight opens the path Smoking → Low Birth Weight ← Severe Defects → Mortality, creating a spurious negative association between smoking and mortality within the low birth weight stratum.
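The collider mechanism can be verified with a few lines of simulation. All probabilities below are assumed toy values chosen only to make the effect visible; smoking's direct harmful effect on mortality is deliberately omitted so the spurious protective association stands out:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500_000

# Assumed toy probabilities (illustrative only).
S = rng.random(N) < 0.30                     # smoking
D = rng.random(N) < 0.02                     # severe defects, independent of S
W = rng.random(N) < 0.05 + 0.25 * S + 0.60 * D   # low birth weight: S or D cause it
M = rng.random(N) < 0.01 + 0.30 * D              # mortality driven by defects

# Condition on the collider W (restrict to low birth weight babies)
mort_smokers = M[W & S].mean()
mort_nonsmokers = M[W & ~S].mean()
print(f"P(death | LBW, smoker)     = {mort_smokers:.3f}")
print(f"P(death | LBW, non-smoker) = {mort_nonsmokers:.3f}")
# Among LBW babies, smokers look *safer*: their low weight is already
# explained by smoking, so they are less likely to have severe defects.
```

Unconditionally, smoking and defects are independent in this simulation; conditioning on the collider `W` makes defects rarer among the smoking-exposed stratum, which is exactly the spurious "protective" pattern in the real data.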
Lesson: Conditioning on a variable that is caused by the treatment (a post-treatment variable or mediator) can create bias. This is collider bias or selection bias. In ML, this appears when you evaluate a model on a subset selected by a variable that the model influences.
Unifying Themes
All five paradoxes share a common structure: a statistical analysis that seems reasonable produces a conclusion that is wrong or misleading.
| Paradox | Root Cause | Lesson |
|---|---|---|
| Lindley's | Point null + diffuse alternative | P-values and Bayes factors measure different things |
| Lord's | Different causal assumptions | The causal question determines the method |
| Freedman's | Data reuse for selection and inference | Split data or use post-selection inference |
| Hand's | Incompatible scales across individuals | Do not average incommensurable measurements |
| Low birth weight | Conditioning on a collider | Do not condition on post-treatment variables |
Common Confusions
Paradoxes mean statistics is unreliable
These paradoxes do not show that statistics is broken. They show that applying standard methods without understanding their assumptions leads to wrong answers. Each paradox has a well-understood resolution once you identify the correct causal or inferential framework.
Paradoxes only matter in small samples
Lindley's paradox is worse in large samples. Freedman's paradox persists at any sample size when $p/n$ is substantial. These are structural problems, not finite-sample artifacts.
Exercises
Problem
In Freedman's paradox, you have $n = 100$ observations and $p = 50$ pure-noise predictors. Approximately how many predictors do you expect to have p-values below 0.05 in the initial regression? What is the expected $R^2$ of the initial regression even though there is no signal?
Problem
Draw the causal DAG for the low birth weight paradox with variables: Smoking (S), Severe Defects (D), Birth Weight (W), Mortality (M). Explain why conditioning on $W$ creates a spurious association between $S$ and $D$ even though $S$ and $D$ are marginally independent.
References
Canonical:
- Lindley, "A Statistical Paradox", Biometrika (1957)
- Lord, "A Paradox in the Interpretation of Group Comparisons", Psychological Bulletin (1967)
- Freedman, "A Note on Screening Regression Equations", The American Statistician (1983)
- Hand, "Deconstructing Statistical Questions", JRSS-A (1994)
- Hernández-Díaz, Schisterman, et al., "The Birth Weight Paradox Revisited", Epidemiology (2006)
Current:
- Pearl, The Book of Why (2018), Chapters 6-7
- Hernán & Robins, Causal Inference: What If (2020), Chapters 8-9
Next Topics
See the individual pages on Simpson's paradox and the base-rate fallacy for deeper treatment of those two paradoxes.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Simpson's Paradox (Layer 1)
- Base Rate Fallacy (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)