
Methodology

P-Hacking and Multiple Testing

How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.


Why This Matters

A researcher tests 20 independent hypotheses at significance level $\alpha = 0.05$. Even if every null hypothesis is true, the probability of at least one false positive is $1 - (1 - 0.05)^{20} \approx 0.64$. More than half the time, something will look significant by chance.
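This arithmetic is easy to check directly. A minimal sketch (the function name `fwer_independent` is ours, not from any library):

```python
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha, when every null hypothesis is true.
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

print(f"{fwer_independent(20):.3f}")  # roughly 0.64 for m = 20
print(f"{fwer_independent(100):.3f}")  # near certainty for m = 100
```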

This is not a theoretical curiosity. In ML, every time you try a new hyperparameter configuration and check validation accuracy, you are running a hypothesis test. Every time you evaluate your model on a new benchmark, you are performing another comparison. Without correction, the reported "best" result is systematically inflated.

P-Hacking

Definition

P-Hacking

The practice of manipulating the analysis pipeline until a p-value below $\alpha$ is obtained. Common forms include: trying multiple outcome variables, adding or removing covariates, excluding data points, stopping data collection when $p < 0.05$, and switching between one-tailed and two-tailed tests.

P-hacking is often unintentional. The researcher does not plan to cheat; they make "reasonable" analysis choices at each step. But each choice represents a degree of freedom. Simmons, Nelson, and Simonsohn (2011) showed that with four common researcher degrees of freedom (flexible sample size, flexible covariates, flexible outcome variable, optional data exclusion), the effective false positive rate rises from $5\%$ to over $60\%$.

Garden of forking paths (Gelman & Loken, 2013): even without conscious p-hacking, the analysis path depends on the data. The researcher would have made different "reasonable" choices if the data had come out differently. The number of implicit comparisons is much larger than the number of tests explicitly reported.

Multiple Testing Problem

Definition

Family-Wise Error Rate

The probability of making at least one Type I error across a family of $m$ hypothesis tests:

$$\text{FWER} = P(\text{at least one false rejection})$$

Controlling FWER at level $\alpha$ means $\text{FWER} \leq \alpha$.

Definition

False Discovery Rate

The expected proportion of false discoveries among all rejections:

$$\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right]$$

where $V$ is the number of false rejections and $R$ is the total number of rejections. The $\vee\, 1$ avoids division by zero when $R = 0$.

FWER is conservative: it controls the probability of any false positive. FDR is less conservative: it allows some false positives as long as their proportion among discoveries is controlled.

Main Theorems

Proposition

Bonferroni Correction

Statement

Reject null hypothesis $H_i$ if $p_i \leq \alpha / m$. Then $\text{FWER} \leq \alpha$.

Intuition

By the union bound, the probability that any one of $m$ tests falsely rejects is at most $m \times (\alpha / m) = \alpha$. This works regardless of dependence between tests.

Proof Sketch

Let $I_0 \subseteq \{1, \ldots, m\}$ be the set of true null hypotheses. For each $i \in I_0$, $P(p_i \leq \alpha/m) \leq \alpha/m$ by validity of the p-value. By the union bound: $\text{FWER} = P(\exists\, i \in I_0 : p_i \leq \alpha/m) \leq |I_0| \cdot \alpha/m \leq \alpha$.

Why It Matters

Bonferroni is the simplest multiple testing correction. It requires no assumptions about dependence structure and works for any collection of valid p-values. Its simplicity makes it widely applicable as a quick conservative check.

Failure Mode

Bonferroni is highly conservative when $m$ is large. With $m = 10{,}000$ tests (common in genomics or neural architecture search), the threshold $\alpha / m = 5 \times 10^{-6}$ rejects almost nothing. Power drops dramatically. When many tests are correlated (e.g., nearby genomic loci), Bonferroni wastes statistical power because the effective number of independent tests is much smaller than $m$.
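The correction itself is a one-liner. A minimal sketch (the helper name `bonferroni_reject` is ours, not a library function):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: reject H_i iff p_i <= alpha / m. Controls FWER at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three tests: the per-test threshold drops from 0.05 to 0.05/3 ~ 0.0167,
# so 0.02 and 0.04 no longer count as discoveries.
print(bonferroni_reject([0.001, 0.02, 0.04]))  # [True, False, False]
```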

Theorem

Benjamini-Hochberg Procedure

Statement

Order the p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$. Find the largest $k$ such that $p_{(k)} \leq k\alpha / m$. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$. Then $\text{FDR} \leq \alpha$.

Intuition

Instead of dividing $\alpha$ equally among all tests (Bonferroni), BH uses a step-up procedure: the threshold for the $k$-th smallest p-value is $k\alpha/m$, linearly increasing. This allows more rejections when many small p-values exist, while still controlling the expected false discovery proportion.

Proof Sketch

The proof (Benjamini & Hochberg, 1995) proceeds by conditioning on the number of true null hypotheses $m_0$. Under independence, the p-values of true nulls are uniform on $[0,1]$, and the expected number of false rejections at threshold $t$ is $m_0 t$. The step-up structure ensures that this expectation, divided by the total number of rejections, is at most $m_0 \alpha / m \leq \alpha$.

Why It Matters

BH is the standard procedure when you want to discover as many true effects as possible while limiting the fraction of false discoveries. It is far more powerful than Bonferroni when many hypotheses are tested and a meaningful fraction of them are truly non-null.

Failure Mode

Under strong negative dependence between test statistics, the FDR guarantee can fail. The PRDS condition (which includes independence and many forms of positive dependence) is required. When the dependence structure is unknown, the Benjamini-Yekutieli (2001) procedure provides FDR control at the cost of a $\log(m)$ factor.
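The step-up procedure can be sketched in a few lines; `benjamini_hochberg` is a hypothetical helper for illustration, not a library function. Note the step-up subtlety: the largest $k$ passing its threshold wins, even if some smaller rank failed its own.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure. Returns reject/accept booleans in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest k (1-indexed) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.001, 0.02, 0.03, 0.5]))  # [True, True, True, False]
```

With $m = 4$ the thresholds are $0.0125, 0.025, 0.0375, 0.05$; Bonferroni at the flat threshold $0.0125$ would reject only the first hypothesis.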

The ML Connection

Hyperparameter tuning is multiple testing. If you try 100 hyperparameter configurations and report the best validation accuracy, you have implicitly tested 100 hypotheses. The best configuration's validation performance is an optimistic estimate of its true performance. This is why the gap between validation and test accuracy increases with the number of configurations tried. This applies to all forms of tuning: gradient descent learning rate schedules, Bayesian optimization of hyperparameters, and random search alike.
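A toy simulation makes the inflation concrete. All numbers here are hypothetical: 100 configurations with identical true accuracy 0.80, each scored on a noisy validation set of 1,000 examples. The best validation score overstates the true performance purely through selection:

```python
import random

random.seed(0)

n_configs, n_val, true_acc = 100, 1000, 0.80

def noisy_val_acc():
    # Fraction of n_val Bernoulli(true_acc) "correct" predictions.
    return sum(random.random() < true_acc for _ in range(n_val)) / n_val

# Every configuration is equally good, yet the max of 100 noisy
# estimates is systematically above 0.80.
scores = [noisy_val_acc() for _ in range(n_configs)]
best = max(scores)
print(f"best validation accuracy: {best:.3f}  (true accuracy: {true_acc:.2f})")
```

The binomial noise has standard deviation $\sqrt{0.8 \cdot 0.2 / 1000} \approx 0.013$, so the maximum over 100 configurations typically lands two to three standard deviations above the truth, around 0.83. Reporting `best` as the model's accuracy is exactly the uncorrected multiple-testing optimism described above.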

Benchmark shopping is p-hacking. Evaluating a model on many benchmarks and reporting the ones where it performs best is structurally identical to testing many hypotheses and reporting the significant ones. The "state-of-the-art" claim is inflated by the number of unreported benchmarks where the model performed poorly.

Data leakage through repeated evaluation compounds this problem. Each time you evaluate on a test set and use the result to guide further development, the test set becomes part of the training process. See model evaluation best practices for how to structure evaluation protocols that resist this.

Comparison of Correction Methods

The following table summarizes when to use each correction approach.

| Method | Controls | Threshold for test $i$ | Assumptions | Best for |
|---|---|---|---|---|
| No correction | Nothing | $\alpha$ | N/A | Single pre-specified test. No multiple comparisons. |
| Bonferroni | FWER $\leq \alpha$ | $\alpha / m$ | Valid p-values, any dependence | Small $m$ ($\leq 20$), or when any false positive is unacceptable (e.g., clinical trials) |
| Holm-Bonferroni | FWER $\leq \alpha$ | $\alpha / (m - k + 1)$ for $k$-th smallest | Valid p-values, any dependence | Same as Bonferroni but uniformly more powerful. No reason not to use it over Bonferroni. |
| Benjamini-Hochberg | FDR $\leq \alpha$ | $k\alpha / m$ for $k$-th smallest | Independence or PRDS | Large $m$ where you expect many true effects (e.g., feature selection, genomics) |
| Benjamini-Yekutieli | FDR $\leq \alpha$ | $k\alpha / (m \sum_{j=1}^m 1/j)$ | Any dependence | Large $m$ with unknown or negative dependence structure |
| Permutation test | FWER $\leq \alpha$ | Data-dependent | Exchangeability under null | When parametric assumptions fail. Computationally expensive but exact. |
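The Holm step-down variant from the table is barely more code than Bonferroni. A sketch (the name `holm_reject` is ours): walk the sorted p-values against thresholds $\alpha/(m-k+1)$ and stop at the first failure.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha/(m-k+1);
    stop at the first failure. Controls FWER under any dependence."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha / (m - rank + 1):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject

print(holm_reject([0.01, 0.04, 0.03]))  # [True, False, False]
```

For three tests the thresholds are $0.05/3, 0.05/2, 0.05$; Holm rejects everything Bonferroni rejects and sometimes more, which is why the table recommends it outright.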

P-Hacking in Neural Architecture Search

Neural architecture search (NAS) is one of the most aggressive forms of multiple testing in modern ML. A typical NAS procedure evaluates hundreds to thousands of architectures on a validation set. The best architecture's validation performance is an optimistic estimate of its true performance because it was selected from many candidates.

The magnitude of the inflation depends on the number of architectures evaluated and the variance of performance across architectures. For architectures that are structurally similar (e.g., varying depth and width of a feedforward network), the correlation between their performances is high, and the effective number of independent tests is smaller than the raw count. For architectures that are structurally diverse (e.g., comparing convolutional, recurrent, and transformer architectures), the correlation is lower and the inflation is larger.

The standard mitigation is to hold out a separate test set that is never used during the search process. The searched architecture is evaluated once on this test set, and that single number is reported. This is the same principle as separating validation from test in standard ML, but it is more commonly violated in NAS because the search is expensive and researchers are tempted to peek at test performance to decide when to stop searching.

Common Confusions

Watch Out

BH controls FDR, not FWER

Benjamini-Hochberg does not control the probability of any false positive. It controls the expected proportion of false positives among rejections. If you reject 100 hypotheses at FDR level 0.05, you expect about 5 of them to be false discoveries. But the probability of having at least one false discovery can be much higher than 0.05.

Watch Out

Bonferroni is not always too conservative

When the number of tests is small ($m \leq 10$) or when you truly need to avoid any false positive, Bonferroni is appropriate. The criticism of conservatism applies mainly to large-$m$ settings where FDR control is more natural.

Watch Out

Pre-registration does not eliminate all bias

Pre-registration prevents p-hacking by fixing the analysis plan before seeing the data. However, it does not address publication bias (negative results not being published), specification bias (the pre-registered analysis may be suboptimal), or issues with the underlying statistical framework.

Summary

  • Testing $m$ hypotheses at level $\alpha$ each gives FWER up to $1 - (1-\alpha)^m$, not $\alpha$
  • Bonferroni: reject if $p_i \leq \alpha/m$; controls FWER; conservative for large $m$
  • Benjamini-Hochberg: step-up procedure with thresholds $k\alpha/m$; controls FDR; more powerful
  • Hyperparameter tuning = multiple testing; benchmark shopping = p-hacking
  • The more comparisons you make, the more likely your best result is a false positive

Exercises

ExerciseCore

Problem

You test 50 hypotheses, each at level $\alpha = 0.05$, and all null hypotheses are true. What is the expected number of false rejections under Bonferroni? Under no correction?

ExerciseAdvanced

Problem

You have p-values $\{0.001, 0.008, 0.039, 0.041, 0.051, 0.10, 0.32, 0.67\}$ from 8 tests. Apply the Benjamini-Hochberg procedure at FDR level $\alpha = 0.05$. Which hypotheses are rejected?

References

Canonical:

  • Benjamini & Hochberg, Controlling the False Discovery Rate (1995)
  • Bonferroni, Teoria statistica delle classi (1936)

Current:

  • Simmons, Nelson, Simonsohn, False-Positive Psychology (2011)

  • Gelman & Loken, The Garden of Forking Paths (2013)

  • Recht et al., Do ImageNet Classifiers Generalize to ImageNet? (2019), an ML-specific treatment of multiple testing through benchmark reuse

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

Next Topics

The natural continuation is understanding how these corrections apply specifically to ML evaluation and experimental design.

Last reviewed: April 2026
