Methodology
Statistical Significance and Multiple Comparisons
This section covers p-values, significance levels, confidence intervals, and the multiple comparisons problem: the Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
Why This Matters
You train five models, tune three hyperparameters each, evaluate on four metrics, and report the best combination. How many implicit hypothesis tests did you just run? Roughly $5 \times 3 \times 4 = 60$. Without correction, the probability of at least one spurious "significant" result exceeds 95%. This is the multiple comparisons problem, and it is the single largest source of false claims in ML evaluation.
Formal Setup
p-Value
The p-value for a test statistic $T$ is the probability of observing a value at least as extreme as the observed $t_{\text{obs}}$, assuming the null hypothesis $H_0$ is true:

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

A small $p$ means the data is unlikely under $H_0$. The p-value is not $P(H_0 \mid \text{data})$, the probability that $H_0$ is true.
Significance Level
The significance level $\alpha$ is the threshold for rejecting $H_0$: reject if $p \leq \alpha$. The standard choice $\alpha = 0.05$ means you accept a 5% false positive rate for any single test.
Confidence Interval
A $1 - \alpha$ confidence interval for a parameter $\theta$ is a random interval $[L, U]$ such that $P(L \leq \theta \leq U) \geq 1 - \alpha$ over repeated sampling. For the difference in model performance $\Delta = \text{acc}_A - \text{acc}_B$:

$$\Delta \pm z_{1 - \alpha/2} \cdot \text{SE}(\Delta)$$

where $\text{SE}(\Delta)$ is the standard error of the difference. If the interval excludes zero, the difference is significant at level $\alpha$.
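As a concrete sketch of this interval (the accuracies and test-set size below are illustrative, and an unpaired normal approximation is assumed):

```python
import math

# Illustrative numbers (assumed): two models evaluated on the same
# n-example test set, unpaired normal approximation for the difference.
n = 2000
acc_a, acc_b = 0.871, 0.855
delta = acc_a - acc_b
se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
lo, hi = delta - 1.96 * se, delta + 1.96 * se  # z_{0.975} = 1.96
print(f"95% CI for delta: [{lo:.4f}, {hi:.4f}]")
print(lo < 0 < hi)  # interval contains zero -> not significant at alpha = 0.05
```

A paired analysis of per-example agreement is often tighter when the two models' errors are correlated; the unpaired interval above is the conservative default.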
The Multiple Comparisons Problem
When testing $m$ hypotheses simultaneously at level $\alpha$ each, the probability of at least one false positive grows rapidly with $m$.
Family-Wise Error Rate
The FWER is the probability of making at least one Type I error across all $m$ tests:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

when the tests are independent. For $m = 20$ and $\alpha = 0.05$: $\text{FWER} = 1 - 0.95^{20} \approx 0.64$.
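This growth is easy to verify numerically; a minimal sketch:

```python
# FWER for m independent tests, each run at level alpha: 1 - (1 - alpha)^m.
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(fwer(1), 2))    # 0.05: a single test
print(round(fwer(20), 2))   # 0.64: the example above
print(round(fwer(60), 2))   # 0.95: the 5 x 3 x 4 grid from the introduction
```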
Bonferroni Correction
Bonferroni Correction
Statement
To control the family-wise error rate at level $\alpha$ across $m$ tests, reject hypothesis $H_i$ only if $p_i \leq \alpha / m$. Then:

$$\text{FWER} \leq \frac{m_0}{m} \alpha \leq \alpha$$

where $m_0$ is the number of true null hypotheses.
Intuition
If you test 20 hypotheses at $\alpha = 0.05$, Bonferroni requires $p_i \leq 0.0025$ for each individual test. You divide your error budget equally among all $m$ tests. The union bound guarantees this works regardless of whether the tests are correlated.
Proof Sketch
By the union bound: $P\big(\bigcup_{i \in \text{true nulls}} \{p_i \leq \alpha/m\}\big) \leq \sum_{i \in \text{true nulls}} P(p_i \leq \alpha/m) = m_0 \cdot \alpha/m \leq \alpha$. The key step is that for a true null, $p_i$ is uniform on $[0, 1]$, so $P(p_i \leq \alpha/m) = \alpha/m$.
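Both the uniform-null fact and the resulting bound can be checked by simulation (a sketch; the trial count and seed are arbitrary choices):

```python
import numpy as np

# Monte Carlo check: m = 20 true nulls, so each p-value is Uniform[0, 1].
# Uncorrected testing fires very often; Bonferroni keeps FWER at or below alpha.
rng = np.random.default_rng(1)
m, alpha, trials = 20, 0.05, 20_000
p = rng.uniform(size=(trials, m))
fwer_uncorrected = np.mean((p <= alpha).any(axis=1))
fwer_bonferroni = np.mean((p <= alpha / m).any(axis=1))
print(round(fwer_uncorrected, 2))  # close to 1 - 0.95**20, about 0.64
print(round(fwer_bonferroni, 3))   # at or below alpha = 0.05
```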
Why It Matters
Bonferroni is the simplest multiple testing correction. It requires no assumptions about the dependence structure among tests. In ML, use it when comparing a small number of models (say, 5 or fewer) and you need a strong guarantee that no comparison is a false positive.
Failure Mode
Bonferroni is very conservative when $m$ is large: each test must clear $p_i \leq \alpha/m$, so with $m = 1000$ tests at $\alpha = 0.05$ you need $p_i \leq 5 \times 10^{-5}$ per test. Many true effects will be missed. When you care about controlling the proportion of false discoveries rather than preventing any single one, use Benjamini-Hochberg instead.
Benjamini-Hochberg FDR Control
False Discovery Rate
The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

$$\text{FDR} = E\left[\frac{V}{\max(R, 1)}\right]$$

where $V$ is the number of false rejections, $R$ is the total number of rejections, and $\max(R, 1)$ avoids division by zero.
Benjamini-Hochberg Procedure
Statement
Order the p-values $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$. Then $\text{FDR} \leq \frac{m_0}{m}\alpha \leq \alpha$.
Intuition
BH draws a line with slope $\alpha/m$ through the origin over the ordered p-values. The largest p-value below this line determines the cutoff; all hypotheses with smaller or equal p-values are rejected. The procedure is adaptive: if many p-values are small (many true effects), the threshold is more lenient than Bonferroni's fixed $\alpha/m$.
Proof Sketch
The original proof by Benjamini and Hochberg (1995) proceeds by induction on the number of true nulls $m_0$. The key insight: under independence, a true null satisfies $P(p_i \leq \frac{k}{m}\alpha) = \frac{k}{m}\alpha$, so the expected number of false rejections at the $k$-th threshold is $m_0 \cdot \frac{k}{m}\alpha$. The proof extends to positive regression dependence (PRDS) by the work of Benjamini and Yekutieli (2001).
Why It Matters
BH is the standard correction for large-scale testing in ML. When comparing models across many datasets, metrics, or hyperparameter settings, BH controls the fraction of false discoveries rather than the probability of any false discovery. This is almost always the right notion for ML practitioners.
Failure Mode
BH requires independence or positive dependence (PRDS) among p-values. If tests are negatively correlated (rare in practice but possible), FDR may exceed $\alpha$. Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces $\alpha$ with $\alpha \big/ \sum_{i=1}^{m} \frac{1}{i}$.
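The size of the Benjamini-Yekutieli penalty is worth seeing numerically (a sketch, using $m = 20$ as an illustrative test count):

```python
# Benjamini-Yekutieli: under arbitrary dependence, run BH at the shrunken
# level alpha / c(m), where c(m) is the m-th harmonic number.
m, alpha = 20, 0.05
c_m = sum(1 / i for i in range(1, m + 1))  # roughly ln(m) + 0.577
print(round(c_m, 3))          # about 3.598 for m = 20
print(round(alpha / c_m, 4))  # effective level, roughly alpha / 3.6
```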
Why This Matters for ML
Model Selection
When you compare $k$ models, you are implicitly running $\binom{k}{2}$ pairwise tests. With 10 models, that is 45 tests. Reporting the "best" model without correction inflates the false positive rate.
Hyperparameter Tuning
Each hyperparameter configuration evaluated on a validation set is an implicit hypothesis test. Random search over 100 configurations at $\alpha = 0.05$ yields an expected 5 false positives. Cross-validation helps but does not eliminate this problem.
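A quick simulation illustrates the danger (an assumed setup: 200 configurations with identical true accuracy, each scored on a 1000-example validation set):

```python
import numpy as np

# All 200 configs have the SAME true accuracy 0.80; any differences in
# measured validation accuracy are pure noise from the finite validation set.
rng = np.random.default_rng(0)
n_val, n_configs = 1000, 200
val_acc = rng.binomial(n_val, 0.80, size=n_configs) / n_val
gap = val_acc.max() - np.median(val_acc)
print(f"best config 'beats' the median by {gap:.3f} from noise alone")
```

The "winning" configuration looks several accuracy points better than a typical one purely by selection, which is exactly the effect a 0.5% reported gain can hide.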
Benchmark Comparisons
Papers comparing on multiple benchmarks (GLUE has 9 tasks, for example) should correct for multiple comparisons when claiming improvements "across the board."
Bonferroni vs BH on benchmark evaluation
You compare your model against a baseline on 20 benchmark tasks. The sorted p-values are: 0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06, ... (remaining above 0.05).
Bonferroni ($\alpha = 0.05$): the threshold is $0.05 / 20 = 0.0025$. Only the first p-value (0.001) passes. You claim significance on 1 task.
BH ($\alpha = 0.05$): check $p_{(k)} \leq \frac{k}{20} \cdot 0.05 = 0.0025k$. The thresholds are 0.0025, 0.005, 0.0075, 0.01, 0.0125, ... The first two p-values fall below their thresholds (0.001 < 0.0025 and 0.003 < 0.005), but the third does not (0.008 > 0.0075), and no later p-value recovers. So BH rejects the first 2 hypotheses. You claim significance on 2 tasks.
BH discovers more true effects while controlling the false discovery proportion.
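The worked example can be reproduced with a short implementation (a sketch; the 13 unreported non-significant p-values are padded with an arbitrary 0.10):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m. Controls FWER."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up. Controls FDR under independence/PRDS."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)                              # ranks, ascending
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()                 # largest passing rank
        reject[order[: k + 1]] = True                  # reject everything up to it
    return reject

# The 20-task example from the text (trailing p-values assumed as 0.10).
pvals = [0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06] + [0.10] * 13
print(bonferroni_reject(pvals).sum())  # 1 task significant under Bonferroni
print(bh_reject(pvals).sum())          # 2 tasks significant under BH
```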
Common Confusions
p-hacking is implicit multiple testing
If you try many analysis strategies (different features, different preprocessing, different splits) and report only the one that gives , you have performed many implicit tests without correction. This is p-hacking. The fix is to preregister your analysis plan or apply multiple testing correction to all analyses you tried.
Bonferroni and BH control different error rates
Bonferroni controls the probability of any false positive (FWER). BH controls the proportion of false positives (FDR). These are different quantities. FWER is appropriate when any false positive is costly (e.g., clinical trials). FDR is appropriate when you expect some false positives and want to control their rate (e.g., screening many benchmark tasks).
Post-hoc correction does not fix bad experimental design
If your experiment has data leakage or other methodological flaws, no multiple testing correction will save you. Corrections adjust p-value thresholds; they do not fix biased test statistics. Always ensure the individual tests are valid before applying corrections.
Summary
- The p-value is $P(T \geq t_{\text{obs}} \mid H_0)$, not $P(H_0 \mid \text{data})$
- Testing $m$ hypotheses at level $\alpha$ each gives $\text{FWER} = 1 - (1 - \alpha)^m$ under independence
- Bonferroni: reject if $p_i \leq \alpha/m$. Controls FWER. Conservative for large $m$
- BH: order the p-values, find the largest $k$ with $p_{(k)} \leq \frac{k}{m}\alpha$. Controls FDR
- ML model selection, hyperparameter tuning, and benchmark evaluation all involve implicit multiple testing
- Confidence intervals convey both significance and effect size
Exercises
Problem
You evaluate a model on 8 benchmark datasets and obtain p-values (vs. baseline) of 0.004, 0.01, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25. Apply both Bonferroni and BH at level 0.05. How many datasets show significant improvement under each?
Problem
You run a random hyperparameter search with 200 configurations. The best configuration has validation accuracy 0.5% higher than the second best. You report this as your final result. Explain why this is problematic from a multiple comparisons perspective, and propose a fix.
References
Canonical:
- Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
- Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006)
Current:
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?" (ICML 2019)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Bootstrap methods: nonparametric inference for any test statistic
- Cross-validation theory: proper evaluation under limited data
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)