Methodology
P-Hacking and Multiple Testing
How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.
Why This Matters
A researcher tests 20 independent hypotheses at significance level $\alpha = 0.05$. Even if every null hypothesis is true, the probability of at least one false positive is $1 - (1 - 0.05)^{20} \approx 0.64$. More than half the time, something will look significant by chance.
This is not a theoretical curiosity. In ML, every time you try a new hyperparameter configuration and check validation accuracy, you are running a hypothesis test. Every time you evaluate your model on a new benchmark, you are performing another comparison. Without correction, the reported "best" result is systematically inflated.
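A quick Monte Carlo check makes the inflation concrete. This is a minimal sketch (NumPy, simulated uniform p-values under true nulls), not a definitive experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def any_false_positive(n_tests, alpha=0.05, n_sims=20_000):
    """Fraction of simulations in which at least one of n_tests
    true-null tests comes out 'significant' at level alpha."""
    # Under a true null, a valid p-value is Uniform(0, 1).
    p = rng.uniform(size=(n_sims, n_tests))
    return np.mean((p < alpha).any(axis=1))

print(round(float(any_false_positive(1)), 3))    # close to 0.05
print(round(float(any_false_positive(20)), 3))   # close to 1 - 0.95**20 ≈ 0.64
```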
P-Hacking
The practice of manipulating the analysis pipeline until a p-value below $0.05$ is obtained. Common forms include: trying multiple outcome variables, adding or removing covariates, excluding data points, stopping data collection as soon as $p < 0.05$, and switching between one-tailed and two-tailed tests.
P-hacking is often unintentional. The researcher does not plan to cheat; they make "reasonable" analysis choices at each step. But each choice represents a degree of freedom. Simmons, Nelson, and Simonsohn (2011) showed that with four common researcher degrees of freedom (flexible sample size, flexible covariates, flexible outcome variable, optional data exclusion), the effective false positive rate rises from $5\%$ to over $60\%$.
Garden of forking paths (Gelman & Loken, 2013): even without conscious p-hacking, the analysis path depends on the data. The researcher would have made different "reasonable" choices if the data had come out differently. The number of implicit comparisons is much larger than the number of tests explicitly reported.
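Optional stopping, one of the degrees of freedom listed above, can be simulated directly. A minimal sketch using a known-variance z-test under a true null:

```python
import numpy as np

rng = np.random.default_rng(4)

def peeking_false_positive_rate(n_looks=10, batch=20, n_sims=5_000):
    """Collect a batch, run a known-variance z-test, stop at the first
    |z| > 1.96. The null (mean 0, variance 1) is true throughout, so
    every rejection is a false positive."""
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += rng.standard_normal(batch).sum()
            n += batch
            if abs(total) / np.sqrt(n) > 1.96:   # nominal two-sided 0.05 test
                hits += 1
                break
    return hits / n_sims

print(round(peeking_false_positive_rate(n_looks=1), 3))   # one planned test: near 0.05
print(round(peeking_false_positive_rate(n_looks=10), 3))  # ten peeks: far above 0.05
```

Each extra "peek" at the accumulating data is another implicit hypothesis test, which is exactly the forking-paths point.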
Multiple Testing Problem
Family-Wise Error Rate
The probability of making at least one Type I error across a family of $m$ hypothesis tests:

$$\mathrm{FWER} = P(\text{at least one false rejection}) = P(V \ge 1),$$

where $V$ is the number of false rejections. Controlling FWER at level $\alpha$ means $\mathrm{FWER} \le \alpha$.
False Discovery Rate
The expected proportion of false discoveries among all rejections:

$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right],$$

where $V$ is the number of false rejections and $R$ is the total number of rejections. The $\max(R, 1)$ avoids division by zero when $R = 0$.
FWER is conservative: it controls the probability of any false positive. FDR is less conservative: it allows some false positives as long as their proportion among discoveries is controlled.
Main Theorems
Bonferroni Correction
Statement
Reject null hypothesis $H_i$ if $p_i \le \alpha/m$, where $m$ is the number of tests. Then $\mathrm{FWER} \le \alpha$.
Intuition
By the union bound, the probability that any one of the $m$ tests falsely rejects is at most $m \cdot (\alpha/m) = \alpha$. This works regardless of dependence between tests.
Proof Sketch
Let $I_0$ be the set of true null hypotheses, with $|I_0| = m_0 \le m$. For each $i \in I_0$, $P(p_i \le \alpha/m) \le \alpha/m$ by validity of the p-value. By the union bound: $\mathrm{FWER} = P\big(\bigcup_{i \in I_0} \{p_i \le \alpha/m\}\big) \le m_0 \cdot \frac{\alpha}{m} \le \alpha$.
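The union-bound guarantee is easy to verify numerically. A minimal sketch, assuming all nulls are true so p-values are uniform:

```python
import numpy as np

rng = np.random.default_rng(1)
m, alpha, n_sims = 50, 0.05, 20_000

# All m nulls true, so p-values are Uniform(0, 1); reject at alpha / m.
p = rng.uniform(size=(n_sims, m))
fwer = np.mean((p <= alpha / m).any(axis=1))

print(round(float(fwer), 4))   # about 1 - (1 - alpha/m)**m ≈ 0.049, below alpha
```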
Why It Matters
Bonferroni is the simplest multiple testing correction. It requires no assumptions about dependence structure and works for any collection of valid p-values. Its simplicity makes it widely applicable as a quick conservative check.
Failure Mode
Bonferroni is highly conservative when $m$ is large. With $m = 10^6$ tests (common in genomics or neural architecture search), the threshold $\alpha/m = 5 \times 10^{-8}$ rejects almost nothing. Power drops dramatically. When many tests are correlated (e.g., nearby genomic loci), Bonferroni wastes statistical power because the effective number of independent tests is much smaller than $m$.
Benjamini-Hochberg Procedure
Statement
Order the p-values: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. Find the largest $k$ such that $p_{(k)} \le \frac{k}{m}\alpha$. Reject all hypotheses $H_{(1)}, \dots, H_{(k)}$. Then $\mathrm{FDR} \le \frac{m_0}{m}\alpha \le \alpha$.
Intuition
Instead of dividing $\alpha$ equally among all $m$ tests (Bonferroni), BH uses a step-up procedure: the threshold for the $k$-th smallest p-value is $\frac{k}{m}\alpha$, increasing linearly in $k$. This allows more rejections when many small p-values exist, while still controlling the expected false discovery proportion.
Proof Sketch
The proof (Benjamini & Hochberg, 1995) proceeds by conditioning on the number of true null hypotheses $m_0$. Under independence, the p-values of true nulls are uniform on $[0, 1]$, and the expected number of false rejections at threshold $t$ is $m_0 t$. The step-up structure ensures that this expectation, divided by the total number of rejections, is at most $\alpha$.
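The step-up rule is short to implement. A minimal NumPy sketch, applied to hypothetical p-values chosen for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH: find the largest k with p_(k) <= (k/m) * alpha
    and reject the k smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # index of the largest passing p-value
        reject[order[: k + 1]] = True      # step-up: reject all smaller ones too
    return reject

# Hypothetical p-values for illustration:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals))           # rejects only the two smallest here
```

Note that $0.039 > \frac{3}{8} \cdot 0.05 \approx 0.019$, so the step-up stops at $k = 2$ for these values.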
Why It Matters
BH is the standard procedure when you want to discover as many true effects as possible while limiting the fraction of false discoveries. It is far more powerful than Bonferroni when many hypotheses are tested and a meaningful fraction of them are truly non-null.
Failure Mode
Under strong negative dependence between test statistics, the FDR guarantee can fail. The PRDS condition (which includes independence and many forms of positive dependence) is required. When the dependence structure is unknown, the Benjamini-Yekutieli (2001) procedure provides FDR control at the cost of a factor of $\sum_{i=1}^{m} 1/i \approx \ln m$.
The ML Connection
Hyperparameter tuning is multiple testing. If you try 100 hyperparameter configurations and report the best validation accuracy, you have implicitly tested 100 hypotheses. The best configuration's validation performance is an optimistic estimate of its true performance. This is why the gap between validation and test accuracy increases with the number of configurations tried. This applies to all forms of tuning: gradient descent learning rate schedules, Bayesian optimization of hyperparameters, and random search alike.
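The inflation from selecting the best of many configurations can be seen directly. A toy sketch in which all 100 hypothetical configurations have identical true accuracy, so any spread in validation scores is pure noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 hypothetical configurations with IDENTICAL true accuracy 0.80,
# each scored on a 1,000-example validation set (binomial noise only).
n_configs, true_acc, n_val = 100, 0.80, 1_000
val_acc = rng.binomial(n_val, true_acc, size=n_configs) / n_val

print(round(float(val_acc.mean()), 3))   # honest average: close to 0.80
print(round(float(val_acc.max()), 3))    # reported "best" config: inflated by selection
```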
Benchmark shopping is p-hacking. Evaluating a model on many benchmarks and reporting the ones where it performs best is structurally identical to testing many hypotheses and reporting the significant ones. The "state-of-the-art" claim is inflated by the number of unreported benchmarks where the model performed poorly.
Data leakage through repeated evaluation compounds this problem. Each time you evaluate on a test set and use the result to guide further development, the test set becomes part of the training process. See model evaluation best practices for how to structure evaluation protocols that resist this.
Comparison of Correction Methods
The following table summarizes when to use each correction approach.
| Method | Controls | Threshold for test $k$ | Assumptions | Best for |
|---|---|---|---|---|
| No correction | Nothing | $\alpha$ | N/A | Single pre-specified test. No multiple comparisons. |
| Bonferroni | FWER | $\alpha/m$ | Valid p-values, any dependence | Small $m$, or when any false positive is unacceptable (e.g., clinical trials) |
| Holm-Bonferroni | FWER | $\alpha/(m-k+1)$ for $k$-th smallest | Valid p-values, any dependence | Same as Bonferroni but uniformly more powerful. No reason not to use it over Bonferroni. |
| Benjamini-Hochberg | FDR | $\frac{k}{m}\alpha$ for $k$-th smallest | Independence or PRDS | Large $m$ where you expect many true effects (e.g., feature selection, genomics) |
| Benjamini-Yekutieli | FDR | $\frac{k}{m \sum_{i=1}^{m} 1/i}\alpha$ for $k$-th smallest | Any dependence | Large $m$ with unknown or negative dependence structure |
| Permutation test | FWER | Data-dependent | Exchangeability under null | When parametric assumptions fail. Computationally expensive but exact. |
P-Hacking in Neural Architecture Search
Neural architecture search (NAS) is one of the most aggressive forms of multiple testing in modern ML. A typical NAS procedure evaluates hundreds to thousands of architectures on a validation set. The best architecture's validation performance is an optimistic estimate of its true performance because it was selected from many candidates.
The magnitude of the inflation depends on the number of architectures evaluated and the variance of performance across architectures. For architectures that are structurally similar (e.g., varying depth and width of a feedforward network), the correlation between their performances is high, and the effective number of independent tests is smaller than the raw count. For architectures that are structurally diverse (e.g., comparing convolutional, recurrent, and transformer architectures), the correlation is lower and the inflation is larger.
The standard mitigation is to hold out a separate test set that is never used during the search process. The searched architecture is evaluated once on this test set, and that single number is reported. This is the same principle as separating validation from test in standard ML, but it is more commonly violated in NAS because the search is expensive and researchers are tempted to peek at test performance to decide when to stop searching.
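A toy simulation of this protocol, with hypothetical numbers (all architectures equally good), illustrates why the single held-out evaluation is trustworthy while the selected validation score is not:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy search: 500 hypothetical architectures, all with true accuracy 0.75.
n_arch, true_acc, n_val, n_test = 500, 0.75, 2_000, 2_000
val = rng.binomial(n_val, true_acc, size=n_arch) / n_val
best = int(val.argmax())                          # select on validation only
test = rng.binomial(n_test, true_acc) / n_test    # one evaluation of the winner

print(round(float(val[best]), 3))   # selected validation score: inflated above 0.75
print(round(float(test), 3))        # held-out test score: unbiased, near 0.75
```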
Common Confusions
BH controls FDR, not FWER
Benjamini-Hochberg does not control the probability of any false positive. It controls the expected proportion of false positives among rejections. If you reject 100 hypotheses at FDR level 0.05, you expect about 5 of them to be false discoveries. But the probability of having at least one false discovery can be much higher than 0.05.
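A simulation makes the distinction concrete. The sketch below assumes 80 true nulls and 20 strong real effects (hypothetical Beta-distributed alternative p-values):

```python
import numpy as np

rng = np.random.default_rng(3)
m, m0, alpha, n_sims = 100, 80, 0.05, 2_000   # 80 true nulls, 20 real effects

fdp, any_false = [], []
for _ in range(n_sims):
    p_null = rng.uniform(size=m0)               # true nulls: Uniform(0, 1)
    p_alt = rng.beta(0.05, 1.0, size=m - m0)    # hypothetical strong effects
    p = np.concatenate([p_null, p_alt])
    # Benjamini-Hochberg step-up at level alpha:
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    false_rej = reject[:m0].sum()               # nulls occupy the first m0 slots
    fdp.append(false_rej / max(reject.sum(), 1))
    any_false.append(false_rej > 0)

print(round(float(np.mean(fdp)), 3))        # FDR: controlled at alpha
print(round(float(np.mean(any_false)), 3))  # FWER: far above 0.05
```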
Bonferroni is not always too conservative
When the number of tests is small or when you truly need to avoid any false positive, Bonferroni is appropriate. The criticism of conservatism applies mainly to large-$m$ settings where FDR control is more natural.
Pre-registration does not eliminate all bias
Pre-registration prevents p-hacking by fixing the analysis plan before seeing the data. However, it does not address publication bias (negative results not being published), specification bias (the pre-registered analysis may be suboptimal), or issues with the underlying statistical framework.
Summary
- Testing $m$ hypotheses at level $\alpha$ each gives FWER up to $1 - (1 - \alpha)^m$, not $\alpha$
- Bonferroni: reject if $p_i \le \alpha/m$; controls FWER; conservative for large $m$
- Benjamini-Hochberg: step-up procedure with thresholds $\frac{k}{m}\alpha$; controls FDR; more powerful
- Hyperparameter tuning = multiple testing; benchmark shopping = p-hacking
- The more comparisons you make, the more likely your best result is a false positive
Exercises
Problem
You test 50 hypotheses, each at level $\alpha = 0.05$, and all null hypotheses are true. What is the expected number of false rejections under Bonferroni? Under no correction?
Problem
You have p-values $p_1, \dots, p_8$ from 8 tests. Apply the Benjamini-Hochberg procedure at FDR level $\alpha = 0.05$. Which hypotheses are rejected?
References
Canonical:
- Benjamini & Hochberg, Controlling the False Discovery Rate (1995)
- Bonferroni, Teoria statistica delle classi (1936)
Current:
- Simmons, Nelson, & Simonsohn, False-Positive Psychology (2011)
- Gelman & Loken, The Garden of Forking Paths (2013)
- Recht et al., Do ImageNet Classifiers Generalize to ImageNet? (2019), an ML-specific treatment of multiple testing through benchmark reuse
- Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
Next Topics
The natural continuation is understanding how these corrections apply specifically to ML evaluation and experimental design.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)