
Methodology

Statistical Significance and Multiple Comparisons

p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.

Core · Tier 2 · Stable · ~50 min

Why This Matters

You train five models, tune three hyperparameters each, evaluate on four metrics, and report the best combination. How many implicit hypothesis tests did you just run? Roughly 5 \times 3 \times 4 = 60. Without correction, the probability of at least one spurious "significant" result exceeds 95%. This is the multiple comparisons problem, and it is the single largest source of false claims in ML evaluation.

Formal Setup

Definition

p-Value

The p-value for a test statistic T is the probability of observing a value at least as extreme as the observed t_{\text{obs}}, assuming the null hypothesis H_0 is true:

p = P(T \geq t_{\text{obs}} \mid H_0)

A small p means the data is unlikely under H_0. The p-value is not the probability that H_0 is true.
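As a concrete illustration, the definition can be computed for a simple comparison of two model accuracies. This is a minimal sketch under assumptions not stated above: accuracies measured on n independent examples each, and a one-sided pooled-proportion z-test as the normal approximation (the function name and numbers are my own).

```python
import math

def z_test_p_value(acc_a, acc_b, n):
    """One-sided p-value for H0: model A is no more accurate than model B.

    Assumes both accuracies were measured on n independent examples each
    and uses a pooled-proportion z-test (normal approximation).
    """
    p_pool = (acc_a + acc_b) / 2
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    t_obs = (acc_a - acc_b) / se
    # p = P(T >= t_obs | H0), with T standard normal under the null
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

p = z_test_p_value(0.83, 0.80, n=1000)  # roughly 0.04
```

Note that a 3-point accuracy gap on 1000 examples is only borderline significant for a single test; after any multiple testing correction below, it would not survive.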

Definition

Significance Level

The significance level \alpha is the threshold for rejecting H_0. Reject H_0 if p \leq \alpha. The standard choice \alpha = 0.05 means you accept a 5% false positive rate for any single test.

Definition

Confidence Interval

A (1 - \alpha) confidence interval for a parameter \theta is a random interval [L, U] such that P(\theta \in [L, U]) \geq 1 - \alpha over repeated sampling. For the difference in model performance \delta = \mu_A - \mu_B:

\hat{\delta} \pm z_{\alpha/2} \cdot \text{SE}(\hat{\delta})

where \text{SE}(\hat{\delta}) is the standard error. If the interval excludes zero, the difference is significant at level \alpha.
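The interval is direct to compute. A minimal sketch, assuming accuracy-style metrics with a binomial standard error (the helper name and the numbers are illustrative, not from the text):

```python
import math

def diff_confidence_interval(acc_a, acc_b, n, z=1.96):
    """Normal-approximation (1 - alpha) CI for delta = mu_A - mu_B.

    Assumes each accuracy is estimated on n independent examples;
    z = 1.96 corresponds to alpha = 0.05.
    """
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    delta_hat = acc_a - acc_b
    return delta_hat - z * se, delta_hat + z * se

lo, hi = diff_confidence_interval(0.83, 0.80, n=2000)
# with n = 2000 the interval excludes zero: significant at alpha = 0.05
```

Unlike a bare p-value, the interval also communicates the plausible size of the gap, which is why the summary below recommends reporting it.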

The Multiple Comparisons Problem

When testing m hypotheses simultaneously at level \alpha each, the probability of at least one false positive grows rapidly.

Definition

Family-Wise Error Rate

The FWER is the probability of making at least one Type I error across all mm tests:

\text{FWER} = P(\text{at least one false rejection}) = 1 - (1 - \alpha)^m

when the tests are independent. For m = 20 and \alpha = 0.05: \text{FWER} = 1 - 0.95^{20} \approx 0.64.
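Under the independence assumption the growth is easy to tabulate; a minimal sketch:

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent tests at level alpha each."""
    return 1 - (1 - alpha) ** m

# FWER at alpha = 0.05 as the number of tests grows
table = {m: round(fwer(0.05, m), 3) for m in (1, 5, 20, 100)}
# {1: 0.05, 5: 0.226, 20: 0.642, 100: 0.994}
```

By m = 100 a false positive is nearly guaranteed, which is why the corrections below are needed.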

Bonferroni Correction

Theorem

Bonferroni Correction

Statement

To control the family-wise error rate at level \alpha across m tests, reject hypothesis i only if p_i \leq \alpha / m. Then:

\text{FWER} = P\left(\bigcup_{i \in \mathcal{H}_0} \{p_i \leq \alpha/m\}\right) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) \leq m_0 \cdot \frac{\alpha}{m} \leq \alpha

where \mathcal{H}_0 is the set of true null hypotheses and m_0 = |\mathcal{H}_0| \leq m is their number.

Intuition

If you test 20 hypotheses at \alpha = 0.05, Bonferroni requires p \leq 0.0025 for each individual test. You divide your error budget equally among all tests. The union bound guarantees this works regardless of whether the tests are correlated.

Proof Sketch

By the union bound: P(\exists i \in \mathcal{H}_0 : p_i \leq \alpha/m) \leq \sum_{i \in \mathcal{H}_0} P(p_i \leq \alpha/m) = m_0 \cdot (\alpha/m) \leq \alpha. The key step is that for a true null, p_i is uniform on [0, 1], so P(p_i \leq \alpha/m) = \alpha/m.

Why It Matters

Bonferroni is the simplest multiple testing correction. It requires no assumptions about the dependence structure among tests. In ML, use it when comparing a small number of models (say, 5 or fewer) and you need a strong guarantee that no comparison is a false positive.

Failure Mode

Bonferroni is very conservative when m is large. With m = 1000 tests, you need p \leq 0.00005 per test. Many true effects will be missed. When you care about controlling the proportion of false discoveries rather than their existence, use Benjamini-Hochberg instead.
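The correction itself is one line; a minimal sketch (the helper name is my own):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Per-test reject/keep decisions controlling FWER at level alpha.

    Implements the rule: reject hypothesis i iff p_i <= alpha / m.
    """
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# 20 tests at alpha = 0.05: the per-test threshold is 0.0025,
# so only the first p-value below survives
decisions = bonferroni_reject([0.001, 0.003, 0.008] + [0.2] * 17)
```
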

Benjamini-Hochberg FDR Control

Definition

False Discovery Rate

The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right]

where V is the number of false rejections, R is the total number of rejections, and R \vee 1 = \max(R, 1) avoids division by zero.

Theorem

Benjamini-Hochberg Procedure

Statement

Order the p-values p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}. Find the largest k such that p_{(k)} \leq \frac{k}{m} q. Reject all hypotheses H_{(1)}, \ldots, H_{(k)}. Then \text{FDR} \leq q.

Intuition

BH draws a line with slope q/m through the ordered p-values. The largest p-value under this line determines the cutoff rank k, and all k hypotheses at or below that rank are rejected, including any whose p-values sit above the line at smaller ranks. The procedure is adaptive: if many p-values are small (many true effects), the threshold is more lenient than Bonferroni's.

Proof Sketch

The original proof by Benjamini and Hochberg (1995) proceeds by induction on the number of true nulls m_0. The key insight: under independence, the expected proportion of false rejections among k rejections at the k-th threshold is m_0 \cdot (kq/m)/k = m_0 q/m \leq q. Benjamini and Yekutieli (2001) extended the guarantee to positive regression dependence (PRDS).

Why It Matters

BH is the standard correction for large-scale testing in ML. When comparing models across many datasets, metrics, or hyperparameter settings, BH controls the fraction of false discoveries rather than the probability of any false discovery. This is almost always the right notion for ML practitioners.

Failure Mode

BH requires independence or positive dependence among p-values. If tests are negatively correlated (rare in practice but possible), FDR may exceed q. Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces q with q / \sum_{i=1}^m 1/i \approx q / \ln(m).
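The step-up procedure translates to a few lines. A sketch (function name assumed, not from the text) that returns the original indices of the rejected hypotheses:

```python
def benjamini_hochberg(p_values, q=0.05):
    """BH step-up procedure: reject hypotheses so that FDR <= q.

    Valid under independence or positive dependence (PRDS).
    Returns the original indices of the rejected hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # largest k with p_(k) <= k * q / m
    return sorted(order[:k_max])
```

Note that the loop keeps scanning after a failure: a later p-value can still fall under its (larger) threshold, which is what makes BH a step-up rather than step-down procedure.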

Why This Matters for ML

Model Selection

When you compare k models, you are implicitly running \binom{k}{2} pairwise tests. With 10 models, that is 45 tests. Reporting the "best" model without correction inflates the false positive rate.

Hyperparameter Tuning

Each hyperparameter configuration evaluated on a validation set is an implicit hypothesis test. Random search over 100 configurations at \alpha = 0.05 yields an expected 5 false positives. Cross-validation helps but does not eliminate this problem.
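A quick simulation of that scenario (the seed and trial count are arbitrary choices of mine): when all 100 configurations are truly no better than the baseline, each null p-value is uniform on [0, 1], so about 5 per search dip below 0.05 by chance alone.

```python
import random

random.seed(0)

def spurious_wins(n_configs=100, alpha=0.05):
    """Count null configurations that look 'significant' in one random search."""
    # Under H0, each configuration's p-value is uniform on [0, 1].
    return sum(random.random() <= alpha for _ in range(n_configs))

# Average over many simulated searches: close to 100 * 0.05 = 5
avg = sum(spurious_wins() for _ in range(2000)) / 2000
```
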

Benchmark Comparisons

Papers comparing on multiple benchmarks (GLUE has 9 tasks, for example) should correct for multiple comparisons when claiming improvements "across the board."

Example

Bonferroni vs BH on benchmark evaluation

You compare your model against a baseline on 20 benchmark tasks. The sorted p-values are: 0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06, ... (remaining above 0.05).

Bonferroni (\alpha = 0.05): the threshold is 0.05/20 = 0.0025. Only the first p-value (0.001) passes. You claim significance on 1 task.

BH (q = 0.05): check p_{(k)} \leq k \cdot 0.05/20. The thresholds are 0.0025, 0.005, 0.0075, 0.01, 0.0125, ... The first two p-values pass (0.001 \leq 0.0025 and 0.003 \leq 0.005), but 0.008 > 0.0075 and every later p-value also exceeds its threshold, so the largest valid k is 2. BH rejects the first 2 hypotheses. You claim significance on 2 tasks.

BH discovers more true effects while controlling the false discovery proportion.
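The example's arithmetic can be checked in a few lines (values copied from above; the 13 unlisted p-values are stand-ins larger than 0.05):

```python
m, alpha = 20, 0.05
p_sorted = [0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06]  # rest > 0.05

# Bonferroni: fixed per-test threshold alpha / m = 0.0025
bonferroni_wins = sum(p <= alpha / m for p in p_sorted)

# BH: largest k with p_(k) <= k * q / m. P-values above 0.05 can never pass,
# since even the final threshold 20 * 0.05 / 20 equals 0.05.
bh_wins = max([k for k, p in enumerate(p_sorted, start=1) if p <= k * alpha / m],
              default=0)
```
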

Common Confusions

Watch Out

p-hacking is implicit multiple testing

If you try many analysis strategies (different features, different preprocessing, different splits) and report only the one that gives p < 0.05, you have performed many implicit tests without correction. This is p-hacking. The fix is to preregister your analysis plan or apply multiple testing correction to all analyses you tried.

Watch Out

Bonferroni and BH control different error rates

Bonferroni controls the probability of any false positive (FWER). BH controls the expected proportion of false positives among rejections (FDR). These are different quantities. FWER is appropriate when any false positive is costly (e.g., clinical trials). FDR is appropriate when you expect some false positives and want to control their rate (e.g., screening many benchmark tasks).

Watch Out

Post-hoc correction does not fix bad experimental design

If your experiment has data leakage or other methodological flaws, no multiple testing correction will save you. Corrections adjust p-value thresholds; they do not fix biased test statistics. Always ensure the individual tests are valid before applying corrections.

Summary

  • The p-value is P(\text{data} \mid H_0), not P(H_0 \mid \text{data})
  • Testing m hypotheses at level \alpha each gives \text{FWER} \approx 1 - (1 - \alpha)^m
  • Bonferroni: reject if p_i \leq \alpha/m. Controls FWER. Conservative for large m
  • BH: order the p-values, find the largest k with p_{(k)} \leq kq/m. Controls FDR
  • ML model selection, hyperparameter tuning, and benchmark evaluation all involve implicit multiple testing
  • Confidence intervals convey both significance and effect size

Exercises

Exercise (Core)

Problem

You evaluate a model on 8 benchmark datasets and obtain p-values (vs. baseline) of 0.004, 0.01, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25. Apply both Bonferroni and BH at level 0.05. How many datasets show significant improvement under each?

Exercise (Advanced)

Problem

You run a random hyperparameter search with 200 configurations. The best configuration has validation accuracy 0.5% higher than the second best. You report this as your final result. Explain why this is problematic from a multiple comparisons perspective, and propose a fix.

References

Canonical:

  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
  • Benjamini and Yekutieli, "The Control of the False Discovery Rate in Multiple Testing under Dependency" (2001)
  • Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets" (JMLR 2006)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
  • Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?" (ICML 2019)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14


Last reviewed: April 2026
