

Statistical Paradoxes Collection

A curated collection of statistical paradoxes that trip up practitioners: Lindley's paradox, Lord's paradox, Freedman's paradox, Hand's paradox, and the low birth weight paradox. Each with statement, mechanism, and lesson.


Why This Matters

Statistical paradoxes are not mere curiosities. Each one reveals a specific failure mode in statistical reasoning that recurs in practice. Knowing these paradoxes equips you to spot the failure before it causes harm. Several of these paradoxes have direct analogues in ML: overfitting with many features (Freedman), conflating individual and population behavior (Hand), and conditioning on post-treatment variables (low birth weight).

This page collects five paradoxes beyond the "big two" (Simpson's paradox and the base-rate fallacy, which have their own dedicated pages).

Lindley's Paradox

Definition

Lindley's Paradox

A situation where a frequentist hypothesis test rejects the null hypothesis at a given significance level, while a Bayesian analysis with a reasonable prior assigns high posterior probability to the null. The two frameworks reach opposite conclusions from the same data.

Proposition

Lindley's Paradox

Statement

Consider testing $H_0: \mu = 0$ vs $H_1: \mu \neq 0$ with $X_1, \ldots, X_n \sim N(\mu, 1)$. As $n \to \infty$, there exist sample means $\bar{X}_n$ that are statistically significant at any fixed level $\alpha$ (i.e., $|\bar{X}_n| > z_{\alpha/2}/\sqrt{n}$) while the Bayesian posterior probability $P(H_0 \mid \text{data})$ approaches 1.

Concretely, if $\bar{X}_n = c/\sqrt{n}$ for some constant $c > z_{\alpha/2}$, the p-value stays below $\alpha$ for all $n$, but the Bayes factor in favor of $H_0$ grows as $\sqrt{n}$.

Intuition

The frequentist test asks: "is $\bar{X}_n$ far from 0 relative to its standard error $1/\sqrt{n}$?" When $\bar{X}_n = c/\sqrt{n}$, the z-score is $c$, which stays significant. The Bayesian analysis asks: "is the data more likely under $H_0$ or $H_1$?" The data point $\bar{X}_n = c/\sqrt{n}$ is very close to 0 in absolute terms (approaching 0 as $n$ grows), so $H_0$ looks increasingly plausible from the Bayesian perspective.

Proof Sketch

Under $H_0$, $\bar{X}_n \sim N(0, 1/n)$, so the density at $\bar{X}_n = c/\sqrt{n}$ is $\sqrt{n/(2\pi)} \exp(-c^2/2)$. Under a diffuse prior $\mu \sim N(0, \tau^2)$ for $H_1$, the marginal density of $\bar{X}_n$ is $\sqrt{n/(2\pi(1 + n\tau^2))} \exp(-c^2/(2(1 + n\tau^2)))$. The Bayes factor $BF_{01}$ scales as $\sqrt{n\tau^2} \to \infty$, so the posterior probability of $H_0$ approaches 1 even though the p-value remains below $\alpha$.
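The divergence is easy to check numerically. A minimal sketch using only the standard library; the constants $c = 2.5$, $\tau^2 = 1$, and the 50/50 prior on the hypotheses are illustrative choices, not values from the text:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lindley(c=2.5, tau2=1.0, n=100, prior_h0=0.5):
    """Hold the z-score fixed at c (xbar = c/sqrt(n)), so the p-value is constant in n."""
    p_value = 2.0 * (1.0 - normal_cdf(c))
    # BF01 = N(xbar; 0, 1/n) / N(xbar; 0, tau2 + 1/n)
    #      = sqrt(1 + n*tau2) * exp(-(c^2/2) * (1 - 1/(1 + n*tau2)))
    bf01 = math.sqrt(1.0 + n * tau2) * math.exp(-0.5 * c**2 * (1.0 - 1.0 / (1.0 + n * tau2)))
    post_h0 = prior_h0 * bf01 / (prior_h0 * bf01 + (1.0 - prior_h0))
    return p_value, bf01, post_h0

for n in (100, 10_000, 1_000_000):
    p, bf, post = lindley(n=n)
    print(f"n={n:>9}  p-value={p:.4f}  BF01={bf:8.1f}  P(H0|data)={post:.3f}")
```

The p-value stays at about 0.012 for every $n$, while the Bayes factor in favor of $H_0$ and the posterior probability of $H_0$ grow without bound.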

Why It Matters

This paradox reveals a deep disagreement between frequentist and Bayesian hypothesis testing. In large samples, p-values can be "significant" for effect sizes that are practically negligible. The Bayesian framework, by requiring specification of alternatives, naturally penalizes vague alternative hypotheses. This is directly relevant to ML experiments where large datasets can produce tiny but "significant" improvements.

Failure Mode

The paradox requires a point null hypothesis with positive prior mass. If the Bayesian uses a continuous prior over the null region (testing $|\mu| < \epsilon$ instead of $\mu = 0$), the paradox is reduced. The choice of prior on $H_1$ also matters: a concentrated prior (small $\tau^2$) reduces the disagreement.

Lesson: Statistical significance is not the same as practical significance, and the two frameworks can formally disagree. Always report effect sizes alongside p-values.

Lord's Paradox

Definition

Lords Paradox

Two statisticians analyze the same data about the effect of a treatment (e.g., diet type) on an outcome (e.g., final weight), both using valid statistical methods, and reach opposite conclusions. One uses ANCOVA (adjusting for baseline), the other compares group means of change scores.

Lord's original formulation: two dining halls serve different diets. Freshman weights are measured at the start and end of the year. Statistician A compares average weight change between halls and finds no difference. Statistician B runs ANCOVA regressing final weight on diet type, controlling for initial weight, and finds a significant diet effect.

Why it happens: The paradox arises because the two analyses answer different causal questions. The change-score analysis and ANCOVA identify the treatment effect under different assumptions about how baseline weight relates to the outcome. When the groups differ in their baseline distributions (e.g., one hall starts heavier), the two methods handle that baseline difference differently and can disagree.
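A small simulation can reproduce the disagreement. The sketch below assumes a hypothetical data-generating process with no diet effect at all, in which final weight regresses toward each hall's own mean (slope rho < 1); the hall means, spreads, and noise levels are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000                       # students per dining hall
rho = 0.5                        # within-hall regression toward the hall mean
mu = {0: 60.0, 1: 70.0}          # hypothetical hall baseline means (kg)

rows = []
for g in (0, 1):
    baseline = rng.normal(mu[g], 5.0, n)
    # No diet effect: final weight regresses toward the hall's own mean.
    final = mu[g] + rho * (baseline - mu[g]) + rng.normal(0.0, 3.0, n)
    rows.append((g, baseline, final))

# Statistician A: mean weight change per hall (both ~0, no difference).
changes = [np.mean(f - b) for _, b, f in rows]

# Statistician B: ANCOVA, final ~ baseline + hall, via least squares.
B = np.concatenate([r[1] for r in rows])
F = np.concatenate([r[2] for r in rows])
G = np.concatenate([np.full(n, float(r[0])) for r in rows])
X = np.column_stack([np.ones_like(B), B, G])
coef, *_ = np.linalg.lstsq(X, F, rcond=None)

print("mean change per hall:", [round(c, 2) for c in changes])
print("ANCOVA 'diet effect':", round(coef[2], 2))  # about (1 - rho) * (70 - 60) = 5
```

Both analyses are run on exactly the same data: the change scores show nothing, while ANCOVA reports a sizable "effect" of roughly $(1-\rho)(\mu_1 - \mu_0)$, because comparing students at the same baseline is a different causal question.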

Lesson: Choosing between change scores and ANCOVA is a causal question, not a statistical one. You must specify your causal model (the DAG) before choosing the analysis method.

Freedman's Paradox

Proposition

Freedman's Paradox

Statement

Generate $y \sim N(0, I_n)$ and $X$ as an $n \times p$ matrix of i.i.d. $N(0, 1)$ entries, with $y$ independent of $X$. Fit a linear regression of $y$ on $X$, select the variables with $|t| > 2$, and refit using only those variables. When $p/n$ is substantial (e.g., $p = 0.4n$), the refitted model will typically show multiple "significant" coefficients, a high $R^2$, and a significant F-test, despite there being no true signal.

Intuition

With $p$ predictors and $n$ observations, each predictor has about a 5% chance of appearing significant by chance (at the 5% level). If $p = 50$ and $n = 125$, about 2-3 predictors pass the screen. But the significance levels in the refitted model are invalid because the same data was used for selection and inference. The refitted model "locks in" the noise that happened to align with $y$, producing inflated t-statistics and $R^2$.

Proof Sketch

The expected number of variables selected in the first stage is approximately $0.05p$ (under the null). After refitting on only the selected variables, the effective degrees of freedom are understated because the selection step is ignored. The standard errors from the refitted model assume the variables were chosen a priori, leading to anti-conservative inference. Freedman (1983) showed via simulation that the F-test rejects the null (all coefficients zero) far more than 5% of the time.
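The two-stage procedure is easy to reproduce in outline. The sketch below is not Freedman's original code: the seed, sizes, and OLS helper are illustrative. It screens pure-noise predictors at $|t| > 2$ and then refits on the survivors:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 125, 50                           # substantial p/n, as in the text

y = rng.standard_normal(n)               # pure noise: y is independent of X
X = rng.standard_normal((n, p))

def ols_t_and_r2(X, y):
    """OLS with intercept; return t-statistics (excluding intercept) and R^2."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    t = beta / np.sqrt(np.diag(cov))
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return t[1:], r2

# Stage 1: screen on the full noise matrix.
t_full, r2_full = ols_t_and_r2(X, y)
keep = np.abs(t_full) > 2

# Stage 2: refit on the survivors only, the classic mistake.
t_refit, r2_refit = ols_t_and_r2(X[:, keep], y)

print(f"stage 1: R^2 = {r2_full:.2f}, {keep.sum()} of {p} predictors pass |t| > 2")
print(f"stage 2 refit t-stats: {np.round(t_refit, 2)}")
```

Even with zero signal, the stage-1 $R^2$ comes out near $p/n \approx 0.4$, and the refitted t-statistics look "significant" because the selection step is hidden from the second fit.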

Why It Matters

This is the statistical version of overfitting. Any time you use the data to select features and then use the same data to assess significance, your inference is invalid. This is the core motivation for sample splitting, cross-validation, and post-selection inference procedures.

Failure Mode

The paradox is strongest when $p/n$ is large. When $p$ is very small relative to $n$, the effect is negligible. The paradox also assumes that variable selection and inference use the same data; honest sample splitting (select on one half, infer on the other) eliminates the problem.

Lesson: Never trust significance tests on variables that were selected using the same data. Split your data or use methods designed for post-selection inference.

Hand's Paradox

Definition

Hand's Paradox

Each individual in a population prefers option A to option B, yet at the population level, option B appears superior. This differs from Simpson's paradox in that no subgroup aggregation is involved. Instead, the paradox arises because different individuals use different criteria, and the population-level summary combines incompatible scales.

Example: Customer 1 rates product A as 8/10 and product B as 7/10 (prefers A); Customer 2 rates product A as 6/10 and product B as 5/10 (also prefers A). With complete ratings on a common scale, no reversal is possible: the averages are $A = (8+6)/2 = 7.0$ and $B = (7+5)/2 = 6.0$, and whenever every individual scores A above B, the mean for A must exceed the mean for B. The reversal appears when the summaries combine incommensurable measurements, for example when each product's average is computed from different raters using different personal criteria (one rates satisfaction, another rates quality): if only Customer 1's rating of B and only Customer 2's rating of A are observed, the averages become $B = 7.0$ versus $A = 6.0$, the opposite of the unanimous individual preference.

Why it happens: Aggregating individual preferences into a population preference requires that individual measurements be on a common scale. When they are not, the aggregate can reverse the individual-level ordering. This is related to Arrow's impossibility theorem in social choice theory.
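One way the reversal can arise is shown in a few lines. This is a toy sketch with two hypothetical customers (c1, c2): each prefers A, but each product's observed average happens to come from a different rater with a different personal scale:

```python
# Each customer scores both products on their own scale and prefers A.
ratings = {
    "c1": {"A": 8, "B": 7},   # generous rater
    "c2": {"A": 6, "B": 5},   # harsh rater
}
assert all(r["A"] > r["B"] for r in ratings.values())  # unanimous: A preferred

# But suppose each product's average is computed from whoever happened to
# rate it: only c1 reviewed B, and only c2 reviewed A.
observed = {"A": [ratings["c2"]["A"]], "B": [ratings["c1"]["B"]]}
avg = {k: sum(v) / len(v) for k, v in observed.items()}
print(avg)  # {'A': 6.0, 'B': 7.0}: B "wins" despite unanimous preference for A
```

The population summary reverses the individual ordering not because anyone's preference changed, but because the numbers being averaged were never on a common scale.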

Lesson: Population-level averages can misrepresent individual-level preferences when measurement scales differ across individuals. In ML, this appears when aggregating metrics across heterogeneous evaluation sets.

Low Birth Weight Paradox

Definition

Low Birth Weight Paradox

Among low birth weight infants, babies of smoking mothers have lower mortality than babies of non-smoking mothers. This appears to suggest that maternal smoking is protective for low birth weight babies, contradicting the known harmful effect of smoking.

Why it happens: Smoking causes low birth weight. But so do other, more severe conditions (e.g., congenital defects). Among low birth weight babies, the smoking-exposed group is "low weight for a less severe reason" (smoking alone) compared to the non-smoking group (low weight due to more dangerous causes). Conditioning on the intermediate variable "birth weight" introduces collider bias: birth weight is caused by both smoking and by other health conditions, and selecting on the collider creates a spurious association between smoking and the other conditions.

In causal DAG terms: Smoking → Low Birth Weight ← Severe Defects → Mortality. Birth weight is a collider on this path; conditioning on it opens the path, creating a spurious negative association between smoking and mortality within the low birth weight stratum.
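The collider mechanism can be demonstrated by simulation. The probabilities below are made up for illustration: smoking and severe defects are independent, both lower birth weight, smoking is mildly harmful directly, and defects are far more lethal:

```python
import random

random.seed(0)
N = 200_000

low_stratum = {0: [0, 0], 1: [0, 0]}   # smoker -> [deaths, births] among low weight
overall = {0: [0, 0], 1: [0, 0]}       # smoker -> [deaths, births] overall

for _ in range(N):
    s = 1 if random.random() < 0.30 else 0        # smoking
    d = 1 if random.random() < 0.05 else 0        # severe defect, independent of S
    p_low = min(1.0, 0.05 + 0.5 * s + 0.9 * d)    # both causes lower birth weight
    w_low = random.random() < p_low
    p_die = 0.01 + 0.05 * s + 0.30 * d            # smoking mildly harmful, defects very
    m = 1 if random.random() < p_die else 0
    overall[s][0] += m
    overall[s][1] += 1
    if w_low:
        low_stratum[s][0] += m
        low_stratum[s][1] += 1

rate = lambda c: c[0] / c[1]
print(f"overall mortality:       smokers {rate(overall[1]):.3f}  non-smokers {rate(overall[0]):.3f}")
print(f"within low birth weight: smokers {rate(low_stratum[1]):.3f}  non-smokers {rate(low_stratum[0]):.3f}")
```

Overall, smokers' babies die more often, as the causal model dictates. Within the low birth weight stratum the ordering flips, because low-weight babies of non-smokers are disproportionately low weight for the more dangerous reason.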

Lesson: Conditioning on a variable that is caused by the treatment (a post-treatment variable or mediator) can create bias. This is collider bias or selection bias. In ML, this appears when you evaluate a model on a subset selected by a variable that the model influences.

Unifying Themes

All five paradoxes share a common structure: a statistical analysis that seems reasonable produces a conclusion that is wrong or misleading.

| Paradox | Root Cause | Lesson |
| --- | --- | --- |
| Lindley's | Point null + diffuse alternative | P-values and Bayes factors measure different things |
| Lord's | Different causal assumptions | The causal question determines the method |
| Freedman's | Data reuse for selection and inference | Split data or use post-selection inference |
| Hand's | Incompatible scales across individuals | Do not average incommensurable measurements |
| Low birth weight | Conditioning on a collider | Do not condition on post-treatment variables |

Common Confusions

Watch Out

Paradoxes mean statistics is unreliable

These paradoxes do not show that statistics is broken. They show that applying standard methods without understanding their assumptions leads to wrong answers. Each paradox has a well-understood resolution once you identify the correct causal or inferential framework.

Watch Out

Paradoxes only matter in small samples

Lindley's paradox is worse in large samples. Freedman's paradox persists at any sample size when $p/n$ is substantial. These are structural problems, not finite-sample artifacts.

Exercises

ExerciseCore

Problem

In Freedman's paradox, you have $n = 100$ observations and $p = 40$ pure-noise predictors. Approximately how many predictors do you expect to have $|t| > 2$ in the initial regression? What is the expected $R^2$ of the initial regression even though there is no signal?

ExerciseAdvanced

Problem

Draw the causal DAG for the low birth weight paradox with variables: Smoking (S), Severe Defects (D), Birth Weight (W), Mortality (M). Explain why conditioning on $W$ creates a spurious association between $S$ and $M$ even though $S$ and $D$ are independent.

References

Canonical:

  • Lindley, "A Statistical Paradox", Biometrika (1957)
  • Lord, "A Paradox in the Interpretation of Group Comparisons", Psychological Bulletin (1967)
  • Freedman, "A Note on Screening Regression Equations", The American Statistician (1983)
  • Hand, "Deconstructing Statistical Questions", JRSS-A (1994)
  • Hernández-Díaz, Schisterman & Hernán, "The Birth Weight 'Paradox' Uncovered?", American Journal of Epidemiology (2006)

Current:

  • Pearl, The Book of Why (2018), Chapters 6-7
  • Hernán & Robins, Causal Inference: What If (2020), Chapters 8-9

Next Topics

See the individual pages on Simpson's paradox and the base-rate fallacy for deeper treatment of those two paradoxes.

Last reviewed: April 2026
