Statistical Paradoxes Collection
A curated collection of statistical paradoxes that trip up practitioners: Lindley's paradox, Lord's paradox, Freedman's paradox, Hand's paradox, and the low birth weight paradox. Each with statement, mechanism, and lesson.
Why This Matters
Statistical paradoxes are not mere curiosities. Each one reveals a specific failure mode in statistical reasoning that recurs in practice. Knowing these paradoxes equips you to spot the failure before it causes harm. Several of these paradoxes have direct analogues in ML: overfitting with many features (Freedman), conflating individual and population behavior (Hand), and conditioning on post-treatment variables (low birth weight).
This page collects five paradoxes beyond the "big two" (Simpson's paradox and the base-rate fallacy, which have their own dedicated pages).
Lindley's Paradox
A situation where a frequentist hypothesis test rejects the null hypothesis at a given significance level, while a Bayesian analysis with a reasonable prior assigns high posterior probability to the null. The two frameworks reach opposite conclusions from the same data.
Statement
Consider testing $H_0: \theta = 0$ vs $H_1: \theta \neq 0$ from $X_1, \dots, X_n \sim N(\theta, \sigma^2)$ with $\sigma^2$ known. As $n \to \infty$, there exist sample means $\bar{x}$ that are statistically significant at any fixed level $\alpha$ (i.e., $p < \alpha$) while the Bayesian posterior probability of $H_0$ approaches 1.
Concretely, if $\bar{x} = c/\sqrt{n}$ for some constant $c > z_{\alpha/2}\,\sigma$, the p-value stays below $\alpha$ for all $n$, but the Bayes factor in favor of $H_0$ grows as $\sqrt{n}$.
Intuition
The frequentist test asks: "is $\bar{x}$ far from 0 relative to its standard error $\sigma/\sqrt{n}$?" When $\bar{x} = c/\sqrt{n}$, the z-score is $z = \bar{x}/(\sigma/\sqrt{n}) = c/\sigma$, a constant, which stays significant. The Bayesian analysis asks: "is the data more likely under $H_0$ or under $H_1$?" The observed mean $c/\sqrt{n}$ is very close to 0 in absolute terms (approaching 0 as $n$ grows), so $H_0$ looks increasingly plausible from the Bayesian perspective.
Proof Sketch
Under $H_0$, $\bar{x} \sim N(0, \sigma^2/n)$, so the density at $\bar{x} = c/\sqrt{n}$ is $\frac{\sqrt{n}}{\sigma\sqrt{2\pi}} e^{-c^2/(2\sigma^2)}$, which grows like $\sqrt{n}$. Under a diffuse prior $\theta \sim N(0, \tau^2)$ for $H_1$, the marginal density of $\bar{x}$ near 0 is approximately the constant $\frac{1}{\tau\sqrt{2\pi}}$. The Bayes factor $B_{01}$ therefore scales as $\sqrt{n}$, so the posterior probability of $H_0$ approaches 1 even though the p-value remains below $\alpha$.
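Both quantities have closed forms in the normal model, which makes the divergence easy to demonstrate numerically. A minimal sketch (assuming $\sigma = 1$ and an illustrative $N(0, \tau^2)$ prior with $\tau = 1$; the function name `lindley` is ours) that holds the z-score fixed at 2.5 while $n$ grows:

```python
import math

def lindley(n, z=2.5, sigma=1.0, tau=1.0):
    """For a fixed z-score z, return (p-value, BF01) at sample size n.

    Assumes x1..xn ~ N(theta, sigma^2), H0: theta = 0, and an assumed
    N(0, tau^2) prior on theta under H1 (tau is an illustrative choice).
    """
    xbar = z * sigma / math.sqrt(n)          # sample mean with fixed z-score
    p_value = math.erfc(z / math.sqrt(2))    # two-sided normal p-value
    v0 = sigma**2 / n                        # Var(xbar) under H0
    v1 = tau**2 + sigma**2 / n               # marginal Var(xbar) under H1
    # BF01 = N(xbar; 0, v0) / N(xbar; 0, v1)
    bf01 = math.sqrt(v1 / v0) * math.exp(-0.5 * xbar**2 * (1 / v0 - 1 / v1))
    return p_value, bf01

for n in [10, 1000, 100000]:
    p, bf = lindley(n)
    print(f"n={n:>6}  p={p:.4f}  BF01={bf:.1f}")
```

The p-value is identical at every $n$ (about 0.012, "significant" at the 5% level), while the Bayes factor in favor of $H_0$ climbs without bound: the same data that reject the null for the frequentist increasingly support it for the Bayesian.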
Why It Matters
This paradox reveals a deep disagreement between frequentist and Bayesian hypothesis testing. In large samples, p-values can be "significant" for effect sizes that are practically negligible. The Bayesian framework, by requiring specification of alternatives, naturally penalizes vague alternative hypotheses. This is directly relevant to ML experiments where large datasets can produce tiny but "significant" improvements.
Failure Mode
The paradox requires a point null hypothesis with positive prior mass. If the Bayesian uses a continuous prior over the null region (testing $H_0: |\theta| \le \epsilon$ instead of $H_0: \theta = 0$), the paradox is reduced. The choice of prior on $\theta$ under $H_1$ also matters: a concentrated prior (small $\tau$) reduces the disagreement.
Lesson: Statistical significance is not the same as practical significance, and the two frameworks can formally disagree. Always report effect sizes alongside p-values.
Lord's Paradox
Two statisticians analyze the same data about the effect of a treatment (e.g., diet type) on an outcome (e.g., final weight), both using valid statistical methods, and reach opposite conclusions. One uses ANCOVA (adjusting for baseline), the other compares group means of change scores.
Lord's original formulation: two dining halls serve different diets. Freshman weights are measured at the start and end of the year. Statistician A compares average weight change between halls and finds no difference. Statistician B runs ANCOVA regressing final weight on diet type, controlling for initial weight, and finds a significant diet effect.
Why it happens: The paradox arises because the two analyses answer different causal questions. The change-score analysis estimates the average causal effect under different identifying assumptions than ANCOVA. When the groups differ in their baseline distributions (e.g., one hall starts heavier), the two methods can disagree because they handle the confounding differently.
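The disagreement is easy to reproduce in simulation. The sketch below uses an assumed data-generating process (final weight regresses halfway toward each hall's own mean, so the true mean change in both halls is zero by construction); all specific numbers are illustrative, not from Lord's paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # students per dining hall

# Assumed DGP: final = intercept_hall + 0.5 * initial + noise, with
# intercepts chosen so each hall's expected mean change is exactly zero.
init_a = rng.normal(150, 10, n)
init_b = rng.normal(170, 10, n)
final_a = 75 + 0.5 * init_a + rng.normal(0, 5, n)   # E[final] = 150
final_b = 85 + 0.5 * init_b + rng.normal(0, 5, n)   # E[final] = 170

# Statistician A: difference in mean change scores between halls
change_diff = (final_b - init_b).mean() - (final_a - init_a).mean()

# Statistician B: ANCOVA, final ~ 1 + diet + initial (OLS via lstsq)
initial = np.concatenate([init_a, init_b])
final = np.concatenate([final_a, final_b])
diet = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), diet, initial])
beta, *_ = np.linalg.lstsq(X, final, rcond=None)

print(f"change-score diet effect: {change_diff:+.2f}")   # near 0
print(f"ANCOVA diet coefficient:  {beta[1]:+.2f}")       # near +10
```

Both answers are correct computations on the same data: the change-score contrast is near zero while the ANCOVA coefficient recovers the +10 intercept gap between halls. Which one answers your causal question depends on the assumed causal model, not on the arithmetic.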
Lesson: Choosing between change scores and ANCOVA is a causal question, not a statistical one. You must specify your causal model (the DAG) before choosing the analysis method.
Freedman's Paradox
Statement
Generate $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ with i.i.d. $N(0,1)$ entries, with $y$ independent of $X$. Fit a linear regression of $y$ on $X$, select the variables whose coefficients are significant at the 5% level, and refit using only those variables. When $p/n$ is substantial (e.g., $p = 50$, $n = 100$), the refitted model will typically show multiple "significant" coefficients, a high $R^2$, and a significant F-test, despite there being no true signal.
Intuition
With $p$ predictors and $n$ observations, each predictor has about a 5% chance of appearing significant by chance (at the 5% level). If $p = 50$ and $n = 100$, about 2-3 predictors pass the screen. But the significance levels in the refitted model are invalid because the same data was used for selection and inference. The refitted model "locks in" the noise that happened to align with $y$, producing inflated t-statistics and $R^2$.
Proof Sketch
The expected number of variables selected in the first stage is approximately $0.05\,p$ (under the null). After refitting on only the selected variables, the effective degrees of freedom are understated because the selection step is ignored. The standard errors from the refitted model assume the variables were chosen a priori, leading to anti-conservative inference. Freedman (1983) showed via simulation that the F-test rejects the null (all coefficients zero) far more than 5% of the time.
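The simulation is short enough to replicate directly. A sketch (screening at the 5% level, with Freedman-style sizes $n = 100$, $p = 50$; the helper `t_stats` is ours), averaged over replications:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, reps = 100, 50, 200   # Freedman-style sizes; screen at the 5% level

def t_stats(X, y):
    """OLS t-statistics for design X (an intercept column is prepended)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return (beta / se)[1:]   # drop the intercept's t-statistic

n_selected, n_refit_sig = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                       # pure noise, independent of X
    keep = np.abs(t_stats(X, y)) > 1.96          # stage 1: screen
    n_selected.append(keep.sum())
    if keep.any():                               # stage 2: refit on survivors
        n_refit_sig.append((np.abs(t_stats(X[:, keep], y)) > 1.96).sum())

print(f"avg predictors passing the screen: {np.mean(n_selected):.2f}")
print(f"avg 'significant' coefficients after refit: {np.mean(n_refit_sig):.2f}")
```

On average, 2-3 pure-noise predictors pass the screen, and most of them remain "significant" in the refitted model, whose printed t-statistics carry no warning that a selection step preceded them.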
Why It Matters
This is the statistical version of overfitting. Any time you use the data to select features and then use the same data to assess significance, your inference is invalid. This is the core motivation for sample splitting, cross-validation, and post-selection inference procedures.
Failure Mode
The paradox is strongest when $p/n$ is large. When $p$ is very small relative to $n$, the effect is negligible. The paradox also assumes that variable selection and inference use the same data; honest sample splitting (select on one half, infer on the other) eliminates the problem.
Lesson: Never trust significance tests on variables that were selected using the same data. Split your data or use methods designed for post-selection inference.
Hand's Paradox
Each individual in a population prefers option A to option B, yet at the population level, option B appears superior. This differs from Simpson's paradox in that no subgroup aggregation is involved. Instead, the paradox arises because different individuals use different criteria, and the population-level summary combines incompatible scales.
Example: Suppose each customer rates products on different personal scales. Customer 1 rates product A as 8/10 and product B as 7/10 (prefers A). Customer 2 rates product A as 6/10 and product B as 5/10 (prefers A). Customer 1 is a generous rater (high scores overall), yet the average for A is $(8+6)/2 = 7$ and the average for B is $(7+5)/2 = 6$. No reversal here. But if the metric each person uses is different (e.g., one rates satisfaction, another rates quality), averaging can produce a reversal because the scales are incommensurable.
Why it happens: Aggregating individual preferences into a population preference requires that individual measurements be on a common scale. When they are not, the aggregate can reverse the individual-level ordering. This is related to Arrow's impossibility theorem in social choice theory.
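A toy numeric illustration (all numbers assumed): each rater prefers A to B on their own scale, but the scales differ and each product happens to be rated by a different rater, so the pooled raw averages compare incommensurable numbers and reverse the unanimous individual ordering:

```python
# Assumed toy data: rater 1 scores on a 0-100 scale, rater 2 on a 0-10 scale.
# On their own scales, BOTH raters prefer product A to product B.
prefs = {"rater1 (0-100 scale)": {"A": 80, "B": 70},
         "rater2 (0-10 scale)":  {"A": 8,  "B": 7}}

for rater, scores in prefs.items():
    print(f"{rater}: A={scores['A']} > B={scores['B']}  -> prefers A")

# Pooled ratings with unbalanced coverage: A was rated only by rater 2,
# B only by rater 1 -- so the averages mix the two incompatible scales.
ratings = {"A": [prefs["rater2 (0-10 scale)"]["A"]],
           "B": [prefs["rater1 (0-100 scale)"]["B"]]}
avg = {k: sum(v) / len(v) for k, v in ratings.items()}
print(f"pooled averages: A={avg['A']:.0f}, B={avg['B']:.0f}  -> B 'wins'")
```

The individual orderings are unanimous, yet the population summary flips them: the failure is not in the arithmetic but in averaging numbers that were never on a common scale.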
Lesson: Population-level averages can misrepresent individual-level preferences when measurement scales differ across individuals. In ML, this appears when aggregating metrics across heterogeneous evaluation sets.
Low Birth Weight Paradox
Among low birth weight infants, babies of smoking mothers have lower mortality than babies of non-smoking mothers. This appears to suggest that maternal smoking is protective for low birth weight babies, contradicting the known harmful effect of smoking.
Why it happens: Smoking causes low birth weight. But so do other, more severe conditions (e.g., congenital defects). Among low birth weight babies, the smoking-exposed group is "low weight for a less severe reason" (smoking alone) compared to the non-smoking group (low weight due to more dangerous causes). Conditioning on the intermediate variable "birth weight" introduces collider bias: birth weight is caused by both smoking and by other health conditions, and selecting on the collider creates a spurious association between smoking and the other conditions.
In causal DAG terms: Smoking → Low Birth Weight ← Severe Defects → Mortality. Conditioning on Low Birth Weight opens the path Smoking → Low Birth Weight ← Severe Defects → Mortality, creating a spurious negative association between smoking and mortality within the low birth weight stratum.
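The collider mechanism can be verified with a few lines of simulation. All probabilities below are assumed toy values chosen only to make the effect visible; smoking's direct harmful effect on mortality is deliberately omitted so the spurious protective association stands out:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500_000

# Assumed toy probabilities (illustrative only).
S = rng.random(N) < 0.30                     # smoking
D = rng.random(N) < 0.02                     # severe defects, independent of S
W = rng.random(N) < 0.05 + 0.25 * S + 0.60 * D   # low birth weight: S or D cause it
M = rng.random(N) < 0.01 + 0.30 * D              # mortality driven by defects

# Condition on the collider W (restrict to low birth weight babies)
mort_smokers = M[W & S].mean()
mort_nonsmokers = M[W & ~S].mean()
print(f"P(death | LBW, smoker)     = {mort_smokers:.3f}")
print(f"P(death | LBW, non-smoker) = {mort_nonsmokers:.3f}")
# Among LBW babies, smokers look *safer*: their low weight is already
# explained by smoking, so they are less likely to have severe defects.
```

Unconditionally, smoking and defects are independent in this simulation; conditioning on the collider `W` makes defects rarer among the smoking-exposed stratum, which is exactly the spurious "protective" pattern in the real data.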
Lesson: Conditioning on a variable that is caused by the treatment (a post-treatment variable or mediator) can create bias. This is collider bias or selection bias. In ML, this appears when you evaluate a model on a subset selected by a variable that the model influences.
Unifying Themes
All five paradoxes share a common structure: a statistical analysis that seems reasonable produces a conclusion that is wrong or misleading.
| Paradox | Root Cause | Lesson |
|---|---|---|
| Lindley's | Point null + diffuse alternative | P-values and Bayes factors measure different things |
| Lord's | Different causal assumptions | The causal question determines the method |
| Freedman's | Data reuse for selection and inference | Split data or use post-selection inference |
| Hand's | Incompatible scales across individuals | Do not average incommensurable measurements |
| Low birth weight | Conditioning on a collider | Do not condition on post-treatment variables |
Common Confusions
Paradoxes mean statistics is unreliable
These paradoxes do not show that statistics is broken. They show that applying standard methods without understanding their assumptions leads to wrong answers. Each paradox has a well-understood resolution once you identify the correct causal or inferential framework.
Paradoxes only matter in small samples
Lindley's paradox is worse in large samples. Freedman's paradox persists at any sample size when $p/n$ is substantial. These are structural problems, not finite-sample artifacts.
Exercises
Problem
In Freedman's paradox, you have $n = 100$ observations and $p = 50$ pure-noise predictors. Approximately how many predictors do you expect to have p-values below 0.05 in the initial regression? What is the expected $R^2$ of the initial regression even though there is no signal?
Problem
Draw the causal DAG for the low birth weight paradox with variables: Smoking (S), Severe Defects (D), Birth Weight (W), Mortality (M). Explain why conditioning on $W$ creates a spurious association between $S$ and $D$ even though $S$ and $D$ are marginally independent.
References
Canonical:
- Lindley, "A Statistical Paradox", Biometrika (1957)
- Lord, "A Paradox in the Interpretation of Group Comparisons", Psychological Bulletin (1967)
- Freedman, "A Note on Screening Regression Equations", The American Statistician (1983)
- Hand, "Deconstructing Statistical Questions", JRSS-A (1994)
- Hernández-Díaz, Schisterman, et al., "The Birth Weight Paradox Revisited", Epidemiology (2006)
Current:
- Pearl, The Book of Why (2018), Chapters 6-7
- Hernán & Robins, Causal Inference: What If (2020), Chapters 8-9
Next Topics
See the individual pages on Simpson's paradox and the base-rate fallacy for deeper treatment of those two paradoxes.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Simpson's Paradox (Layer 1)
- Base Rate Fallacy (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)