Methodology
P-Hacking and Multiple Testing
How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.
Why This Matters
A researcher tests 20 independent hypotheses at significance level $\alpha = 0.05$. Even if every null hypothesis is true, the probability of at least one false positive is $1 - (1 - 0.05)^{20} \approx 0.64$. More than half the time, something will look significant by chance.
This is not a theoretical curiosity. In ML, every time you try a new hyperparameter configuration and check validation accuracy, you are running a hypothesis test. Every time you evaluate your model on a new benchmark, you are performing another comparison. Without correction, the reported "best" result is systematically inflated.
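A quick Monte Carlo check makes the inflation concrete. This is a minimal sketch (NumPy, simulated uniform p-values under true nulls), not a definitive experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def any_false_positive(n_tests, alpha=0.05, n_sims=20_000):
    """Fraction of simulations in which at least one of n_tests
    true-null tests comes out 'significant' at level alpha."""
    # Under a true null, a valid p-value is Uniform(0, 1).
    p = rng.uniform(size=(n_sims, n_tests))
    return np.mean((p < alpha).any(axis=1))

print(round(float(any_false_positive(1)), 3))    # close to 0.05
print(round(float(any_false_positive(20)), 3))   # close to 1 - 0.95**20 ≈ 0.64
```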
P-Hacking
The practice of manipulating the analysis pipeline until a p-value below $0.05$ is obtained. Common forms include: trying multiple outcome variables, adding or removing covariates, excluding data points, stopping data collection as soon as $p < 0.05$, and switching between one-tailed and two-tailed tests.
P-hacking is often unintentional. The researcher does not plan to cheat; they make "reasonable" analysis choices at each step. But each choice represents a degree of freedom. Simmons, Nelson, and Simonsohn (2011) showed that with four common researcher degrees of freedom (flexible sample size, flexible covariates, flexible outcome variable, optional data exclusion), the effective false positive rate rises from $5\%$ to over $60\%$.
Garden of forking paths (Gelman & Loken, 2013): even without conscious p-hacking, the analysis path depends on the data. The researcher would have made different "reasonable" choices if the data had come out differently. The number of implicit comparisons is much larger than the number of tests explicitly reported.
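Optional stopping, one of the degrees of freedom listed above, can be simulated directly. A minimal sketch using a known-variance z-test under a true null:

```python
import numpy as np

rng = np.random.default_rng(4)

def peeking_false_positive_rate(n_looks=10, batch=20, n_sims=5_000):
    """Collect a batch, run a known-variance z-test, stop at the first
    |z| > 1.96. The null (mean 0, variance 1) is true throughout, so
    every rejection is a false positive."""
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += rng.standard_normal(batch).sum()
            n += batch
            if abs(total) / np.sqrt(n) > 1.96:   # nominal two-sided 0.05 test
                hits += 1
                break
    return hits / n_sims

print(round(peeking_false_positive_rate(n_looks=1), 3))   # one planned test: near 0.05
print(round(peeking_false_positive_rate(n_looks=10), 3))  # ten peeks: far above 0.05
```

Each extra "peek" at the accumulating data is another implicit hypothesis test, which is exactly the forking-paths point.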
Multiple Testing Problem
Family-Wise Error Rate
The probability of making at least one Type I error across a family of $m$ hypothesis tests:

$$\mathrm{FWER} = P(\text{at least one false rejection}) = P(V \ge 1),$$

where $V$ is the number of false rejections. Controlling FWER at level $\alpha$ means $\mathrm{FWER} \le \alpha$.
False Discovery Rate
The expected proportion of false discoveries among all rejections:

$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right],$$

where $V$ is the number of false rejections and $R$ is the total number of rejections. The $\max(R, 1)$ avoids division by zero when $R = 0$.
FWER is conservative: it controls the probability of any false positive. FDR is less conservative: it allows some false positives as long as their proportion among discoveries is controlled.
Main Theorems
Bonferroni Correction
Statement
Reject null hypothesis $H_i$ if $p_i \le \alpha/m$, where $m$ is the number of tests. Then $\mathrm{FWER} \le \alpha$.
Intuition
By the union bound, the probability that any one of the $m$ tests falsely rejects is at most $m \cdot (\alpha/m) = \alpha$. This works regardless of dependence between tests.
Proof Sketch
Let $I_0$ be the set of true null hypotheses, with $|I_0| = m_0 \le m$. For each $i \in I_0$, $P(p_i \le \alpha/m) \le \alpha/m$ by validity of the p-value. By the union bound: $\mathrm{FWER} = P\big(\bigcup_{i \in I_0} \{p_i \le \alpha/m\}\big) \le m_0 \cdot \frac{\alpha}{m} \le \alpha$.
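The union-bound guarantee is easy to verify numerically. A minimal sketch, assuming all nulls are true so p-values are uniform:

```python
import numpy as np

rng = np.random.default_rng(1)
m, alpha, n_sims = 50, 0.05, 20_000

# All m nulls true, so p-values are Uniform(0, 1); reject at alpha / m.
p = rng.uniform(size=(n_sims, m))
fwer = np.mean((p <= alpha / m).any(axis=1))

print(round(float(fwer), 4))   # about 1 - (1 - alpha/m)**m ≈ 0.049, below alpha
```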
Why It Matters
Bonferroni is the simplest multiple testing correction. It requires no assumptions about dependence structure and works for any collection of valid p-values. Its simplicity makes it widely applicable as a quick conservative check.
Failure Mode
Bonferroni is highly conservative when $m$ is large. With $m = 10^6$ tests (common in genomics or neural architecture search), the threshold $\alpha/m = 5 \times 10^{-8}$ rejects almost nothing. Power drops dramatically. When many tests are correlated (e.g., nearby genomic loci), Bonferroni wastes statistical power because the effective number of independent tests is much smaller than $m$.
Benjamini-Hochberg Procedure
Statement
Order the p-values: $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. Find the largest $k$ such that $p_{(k)} \le \frac{k}{m}\alpha$. Reject all hypotheses $H_{(1)}, \dots, H_{(k)}$. Then $\mathrm{FDR} \le \frac{m_0}{m}\alpha \le \alpha$.
Intuition
Instead of dividing $\alpha$ equally among all $m$ tests (Bonferroni), BH uses a step-up procedure: the threshold for the $k$-th smallest p-value is $\frac{k}{m}\alpha$, increasing linearly in $k$. This allows more rejections when many small p-values exist, while still controlling the expected false discovery proportion.
Proof Sketch
The proof (Benjamini & Hochberg, 1995) proceeds by conditioning on the number of true null hypotheses $m_0$. Under independence, the p-values of true nulls are uniform on $[0, 1]$, and the expected number of false rejections at threshold $t$ is $m_0 t$. The step-up structure ensures that this expectation, divided by the total number of rejections, is at most $\alpha$.
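The step-up rule is short to implement. A minimal NumPy sketch, applied to hypothetical p-values chosen for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH: find the largest k with p_(k) <= (k/m) * alpha
    and reject the k smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # index of the largest passing p-value
        reject[order[: k + 1]] = True      # step-up: reject all smaller ones too
    return reject

# Hypothetical p-values for illustration:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals))           # rejects only the two smallest here
```

Note that $0.039 > \frac{3}{8} \cdot 0.05 \approx 0.019$, so the step-up stops at $k = 2$ for these values.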
Why It Matters
BH is the standard procedure when you want to discover as many true effects as possible while limiting the fraction of false discoveries. It is far more powerful than Bonferroni when many hypotheses are tested and a meaningful fraction of them are truly non-null.
Failure Mode
Under strong negative dependence between test statistics, the FDR guarantee can fail. The PRDS condition (which includes independence and many forms of positive dependence) is required. When the dependence structure is unknown, the Benjamini-Yekutieli (2001) procedure provides FDR control at the cost of a factor of $\sum_{i=1}^{m} 1/i \approx \ln m$.
The ML Connection
Hyperparameter tuning is multiple testing. If you try 100 hyperparameter configurations and report the best validation accuracy, you have implicitly tested 100 hypotheses. The best configuration's validation performance is an optimistic estimate of its true performance. This is why the gap between validation and test accuracy increases with the number of configurations tried. This applies to all forms of tuning: gradient descent learning rate schedules, Bayesian optimization of hyperparameters, and random search alike.
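The inflation from selecting the best of many configurations can be seen directly. A toy sketch in which all 100 hypothetical configurations have identical true accuracy, so any spread in validation scores is pure noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 hypothetical configurations with IDENTICAL true accuracy 0.80,
# each scored on a 1,000-example validation set (binomial noise only).
n_configs, true_acc, n_val = 100, 0.80, 1_000
val_acc = rng.binomial(n_val, true_acc, size=n_configs) / n_val

print(round(float(val_acc.mean()), 3))   # honest average: close to 0.80
print(round(float(val_acc.max()), 3))    # reported "best" config: inflated by selection
```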
Benchmark shopping is p-hacking. Evaluating a model on many benchmarks and reporting the ones where it performs best is structurally identical to testing many hypotheses and reporting the significant ones. The "state-of-the-art" claim is inflated by the number of unreported benchmarks where the model performed poorly.
Data leakage through repeated evaluation compounds this problem. Each time you evaluate on a test set and use the result to guide further development, the test set becomes part of the training process. See model evaluation best practices for how to structure evaluation protocols that resist this.
Comparison of Correction Methods
The following table summarizes when to use each correction approach.
| Method | Controls | Threshold for test $k$ | Assumptions | Best for |
|---|---|---|---|---|
| No correction | Nothing | $\alpha$ | N/A | Single pre-specified test. No multiple comparisons. |
| Bonferroni | FWER | $\alpha/m$ | Valid p-values, any dependence | Small $m$, or when any false positive is unacceptable (e.g., clinical trials) |
| Holm-Bonferroni | FWER | $\alpha/(m-k+1)$ for $k$-th smallest | Valid p-values, any dependence | Same as Bonferroni but uniformly more powerful. No reason not to use it over Bonferroni. |
| Benjamini-Hochberg | FDR | $\frac{k}{m}\alpha$ for $k$-th smallest | Independence or PRDS | Large $m$ where you expect many true effects (e.g., feature selection, genomics) |
| Benjamini-Yekutieli | FDR | $\frac{k}{m \sum_{i=1}^{m} 1/i}\alpha$ for $k$-th smallest | Any dependence | Large $m$ with unknown or negative dependence structure |
| Permutation test | FWER | Data-dependent | Exchangeability under null | When parametric assumptions fail. Computationally expensive but exact. |
P-Hacking in Neural Architecture Search
Neural architecture search (NAS) is one of the most aggressive forms of multiple testing in modern ML. A typical NAS procedure evaluates hundreds to thousands of architectures on a validation set. The best architecture's validation performance is an optimistic estimate of its true performance because it was selected from many candidates.
The magnitude of the inflation depends on the number of architectures evaluated and the variance of performance across architectures. For architectures that are structurally similar (e.g., varying depth and width of a feedforward network), the correlation between their performances is high, and the effective number of independent tests is smaller than the raw count. For architectures that are structurally diverse (e.g., comparing convolutional, recurrent, and transformer architectures), the correlation is lower and the inflation is larger.
The standard mitigation is to hold out a separate test set that is never used during the search process. The searched architecture is evaluated once on this test set, and that single number is reported. This is the same principle as separating validation from test in standard ML, but it is more commonly violated in NAS because the search is expensive and researchers are tempted to peek at test performance to decide when to stop searching.
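A toy simulation of this protocol, with hypothetical numbers (all architectures equally good), illustrates why the single held-out evaluation is trustworthy while the selected validation score is not:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy search: 500 hypothetical architectures, all with true accuracy 0.75.
n_arch, true_acc, n_val, n_test = 500, 0.75, 2_000, 2_000
val = rng.binomial(n_val, true_acc, size=n_arch) / n_val
best = int(val.argmax())                          # select on validation only
test = rng.binomial(n_test, true_acc) / n_test    # one evaluation of the winner

print(round(float(val[best]), 3))   # selected validation score: inflated above 0.75
print(round(float(test), 3))        # held-out test score: unbiased, near 0.75
```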
Common Confusions
BH controls FDR, not FWER
Benjamini-Hochberg does not control the probability of any false positive. It controls the expected proportion of false positives among rejections. If you reject 100 hypotheses at FDR level 0.05, you expect about 5 of them to be false discoveries. But the probability of having at least one false discovery can be much higher than 0.05.
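A simulation makes the distinction concrete. The sketch below assumes 80 true nulls and 20 strong real effects (hypothetical Beta-distributed alternative p-values):

```python
import numpy as np

rng = np.random.default_rng(3)
m, m0, alpha, n_sims = 100, 80, 0.05, 2_000   # 80 true nulls, 20 real effects

fdp, any_false = [], []
for _ in range(n_sims):
    p_null = rng.uniform(size=m0)               # true nulls: Uniform(0, 1)
    p_alt = rng.beta(0.05, 1.0, size=m - m0)    # hypothetical strong effects
    p = np.concatenate([p_null, p_alt])
    # Benjamini-Hochberg step-up at level alpha:
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    false_rej = reject[:m0].sum()               # nulls occupy the first m0 slots
    fdp.append(false_rej / max(reject.sum(), 1))
    any_false.append(false_rej > 0)

print(round(float(np.mean(fdp)), 3))        # FDR: controlled at alpha
print(round(float(np.mean(any_false)), 3))  # FWER: far above 0.05
```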
Bonferroni is not always too conservative
When the number of tests is small or when you truly need to avoid any false positive, Bonferroni is appropriate. The criticism of conservatism applies mainly to large-$m$ settings where FDR control is more natural.
Pre-registration does not eliminate all bias
Pre-registration prevents p-hacking by fixing the analysis plan before seeing the data. However, it does not address publication bias (negative results not being published), specification bias (the pre-registered analysis may be suboptimal), or issues with the underlying statistical framework.
Summary
- Testing $m$ hypotheses at level $\alpha$ each gives FWER up to $1 - (1 - \alpha)^m$, not $\alpha$
- Bonferroni: reject if $p_i \le \alpha/m$; controls FWER; conservative for large $m$
- Benjamini-Hochberg: step-up procedure with thresholds $\frac{k}{m}\alpha$; controls FDR; more powerful
- Hyperparameter tuning = multiple testing; benchmark shopping = p-hacking
- The more comparisons you make, the more likely your best result is a false positive
Exercises
Problem
You test 50 hypotheses, each at level $\alpha = 0.05$, and all null hypotheses are true. What is the expected number of false rejections under Bonferroni? Under no correction?
Problem
You have p-values $p_1, \dots, p_8$ from 8 tests. Apply the Benjamini-Hochberg procedure at FDR level $\alpha = 0.05$. Which hypotheses are rejected?
References
Canonical:
- Benjamini & Hochberg, Controlling the False Discovery Rate (1995)
- Bonferroni, Teoria statistica delle classi (1936)
Current:
- Simmons, Nelson, & Simonsohn, False-Positive Psychology (2011)
- Gelman & Loken, The Garden of Forking Paths (2013)
- Recht et al., Do ImageNet Classifiers Generalize to ImageNet? (2019), an ML-specific treatment of multiple testing through benchmark reuse
- Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
Next Topics
The natural continuation is understanding how these corrections apply specifically to ML evaluation and experimental design.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)