Methodology
Hypothesis Testing for ML
Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.
Why This Matters
Machine learning papers routinely claim that model A outperforms model B. But how do you know the difference is real and not an artifact of the random train/test split, initialization, or data ordering? Hypothesis testing provides the formal framework for answering this question. Without it, you cannot distinguish signal from noise in experimental comparisons. The multiple comparisons problem is especially critical in ML, where researchers often compare many models, hyperparameter settings, or metrics simultaneously.
Mental Model
You have two models. You run them on a test set and model A scores 0.85 while model B scores 0.83. Is A better than B, or did you just get lucky? Hypothesis testing formalizes this question: assume the models are equally good (null hypothesis), compute how unlikely the observed difference is under that assumption (p-value), and reject the null if the data is sufficiently surprising.
The Neyman-Pearson Framework
Null Hypothesis
The null hypothesis is the default assumption you are trying to disprove. In ML model comparison:
$$H_0: \mu_A = \mu_B$$
where $\mu_A$ and $\mu_B$ are the true expected performances of models A and B. The null says there is no real difference between the models.
Alternative Hypothesis
The alternative hypothesis $H_1$ is what you conclude if you reject $H_0$.
- Two-sided: $H_1: \mu_A \neq \mu_B$ (the models differ)
- One-sided: $H_1: \mu_A > \mu_B$ (model A is better)
In ML, two-sided tests are more conservative and generally preferred unless you have a strong prior reason to test only one direction.
p-Value
The p-value is the probability of observing a test statistic at least as extreme as the one computed from your data, assuming $H_0$ is true:
$$p = P(T \geq t_{\text{obs}} \mid H_0)$$
A small p-value means the observed data is unlikely under $H_0$, which is evidence against $H_0$.
The p-value is not the probability that $H_0$ is true. It is the probability of the data given $H_0$, not the probability of $H_0$ given the data.
Significance Level
The significance level $\alpha$ is the threshold for rejecting $H_0$. If $p \leq \alpha$, we reject $H_0$ and declare the result "statistically significant." The standard choice is $\alpha = 0.05$, meaning we accept a 5% chance of falsely rejecting a true null hypothesis.
p = 0.05 is a convention, not a theorem
The 0.05 threshold comes from Fisher (1925), stated as a rough guide, not a principled cutoff. The ASA's 2016 and 2019 statements on p-values warn against thresholded dichotomous decisions. The replication crisis in psychology and biomedicine (Open Science Collaboration 2015; Ioannidis 2005) showed that a large fraction of "significant" findings at $p < 0.05$ fail to replicate. In ML practice: test across multiple seeds, report effect sizes and confidence intervals, and treat $p < 0.05$ on a single run as weak evidence.
Error Types
Type I and Type II Errors
Statement
There are two types of errors in hypothesis testing:
|   | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power) |
| Do not reject $H_0$ | Correct | Type II error ($\beta$) |
- Type I error (false positive): rejecting $H_0$ when it is true. Probability = $\alpha$.
- Type II error (false negative): failing to reject $H_0$ when it is false. Probability = $\beta$.
- Power = $1 - \beta$: probability of correctly rejecting a false $H_0$.
Intuition
Type I error is a false alarm: you claim the models are different when they are not. Type II error is a missed detection: the models truly differ but your test fails to detect it. Decreasing $\alpha$ (being more conservative) increases $\beta$ (more missed detections) for fixed sample size.
Proof Sketch
By definition, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$, which is controlled directly by the choice of significance level. The power $1 - \beta$ depends on the true effect size $\delta$, the variance $\sigma^2$, and the sample size $n$: power increases with the noncentrality $\delta \sqrt{n} / \sigma$.
Why It Matters
In ML, Type I errors lead to published claims of improvement that do not replicate. Type II errors lead to dismissing genuinely better methods. Both are costly. Understanding the tradeoff is essential for designing experiments with adequate power.
Failure Mode
Most ML experiments are underpowered: they use too few random seeds or train/test splits to detect real but small improvements. An underpowered test has high $\beta$, meaning it frequently fails to detect true differences.
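To see this concretely, here is a minimal Monte Carlo sketch (synthetic numbers, assuming per-seed score differences are roughly normal with a hypothetical 1-point true improvement and 2-point seed noise) of how the power of a paired t-test across random seeds grows with the number of seeds:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_seeds, true_delta=0.01, noise_sd=0.02,
                    alpha=0.05, trials=1000):
    """Monte Carlo estimate of the power of a paired t-test
    on per-seed score differences (hypothetical noise model)."""
    rejections = 0
    for _ in range(trials):
        # each seed yields one score difference: true effect + seed noise
        diffs = true_delta + noise_sd * rng.standard_normal(n_seeds)
        _, p = stats.ttest_1samp(diffs, 0.0)  # H0: mean difference is 0
        rejections += p < alpha
    return rejections / trials

for n in (3, 5, 10, 30):
    print(f"{n:3d} seeds -> estimated power {estimated_power(n):.2f}")
```

Under these assumptions, 3 seeds almost never detects the improvement, while 30 seeds usually does.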
Confidence Intervals
Confidence Interval
A confidence interval for a parameter $\theta$ at level $1 - \alpha$ is a random interval $[L, U]$ computed from the data such that:
$$P(\theta \in [L, U]) \geq 1 - \alpha$$
For the difference in model performance $\delta = \mu_A - \mu_B$, a 95% CI provides a range of plausible values for the true difference. If the CI excludes zero, the difference is significant at $\alpha = 0.05$.
Confidence intervals are more informative than p-values alone because they communicate both statistical significance and effect size.
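As a sketch with synthetic, hypothetical correctness indicators, a 95% normal-approximation CI for the accuracy difference on a shared test set:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
# hypothetical per-example correctness indicators for two models
correct_a = (rng.random(n) < 0.85).astype(float)
correct_b = (rng.random(n) < 0.83).astype(float)

diffs = correct_a - correct_b          # per-example paired differences
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)    # standard error of the mean difference
z = stats.norm.ppf(0.975)              # ~1.96 for a 95% interval
lo, hi = mean_diff - z * se, mean_diff + z * se
print(f"difference = {mean_diff:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

Whether or not the interval excludes zero, its width communicates how precisely the difference is estimated, which a bare p-value does not.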
The Multiple Comparisons Problem
When you test many hypotheses simultaneously, the probability of at least one false positive grows rapidly.
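The growth is easy to compute for independent tests: the family-wise error rate is $1 - (1 - \alpha)^m$. A quick sketch:

```python
# FWER for m independent tests, each at level alpha, with all nulls true
alpha = 0.05
for m in (1, 5, 20, 45, 100):
    fwer_uncorrected = 1 - (1 - alpha) ** m
    fwer_bonferroni = 1 - (1 - alpha / m) ** m   # stays below alpha
    print(f"m={m:3d}  uncorrected={fwer_uncorrected:.3f}  "
          f"bonferroni={fwer_bonferroni:.3f}")
```

With 45 independent tests (all pairwise comparisons of 10 models), the uncorrected chance of at least one false positive is about 90%.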
Bonferroni Correction
Statement
If you perform $m$ simultaneous hypothesis tests and want to control the family-wise error rate (FWER) at level $\alpha$, reject each individual test only if $p_i \leq \alpha / m$.
This holds regardless of the dependence structure among the tests.
Intuition
If you flip a coin 20 times looking for "evidence of bias," you expect at least one surprising streak just by chance. Bonferroni compensates by making each individual test more stringent: if you test 20 hypotheses at $\alpha = 0.05$, you need $p \leq 0.0025$ for any single one.
Proof Sketch
By the union bound: $P\big(\bigcup_{i=1}^{m} \{p_i \leq \alpha/m\} \,\big|\, \text{all nulls true}\big) \leq \sum_{i=1}^{m} P(p_i \leq \alpha/m) = m \cdot \tfrac{\alpha}{m} = \alpha$.
Why It Matters
In ML, the multiple comparisons problem arises constantly: comparing many models, testing on multiple datasets, evaluating multiple metrics, running experiments with different hyperparameters. Without correction, the false positive rate can be far higher than the nominal .
Failure Mode
Bonferroni is very conservative: it controls the probability of any false positive, which can make it too strict when $m$ is large. Many true effects are missed. When you care about controlling the proportion of false positives rather than their existence, use FDR control instead.
FDR Control: Benjamini-Hochberg
False Discovery Rate
The false discovery rate is the expected proportion of false positives among all rejected hypotheses:
$$\text{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right]$$
where $V$ is the number of false rejections and $R$ is the total number of rejections. The Benjamini-Hochberg (BH) procedure controls FDR at level $\alpha$:
- Order the p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$
- Reject all hypotheses with $p_i \leq p_{(k)}$
BH is less conservative than Bonferroni and is preferred when testing many hypotheses (e.g., comparing models on 50 datasets).
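A minimal NumPy implementation of the procedure, applied to hypothetical p-values (in practice, `statsmodels.stats.multitest.multipletests` with `method='fdr_bh'` does the same job):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask: which hypotheses BH rejects at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # (k/m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest qualifying k (0-indexed)
        reject[order[: k + 1]] = True              # reject all p <= p_(k)
    return reject

pvals = [0.001, 0.010, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals))
```

On these six p-values BH rejects the first two, while Bonferroni (threshold $0.05/6 \approx 0.0083$) rejects only the first.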
Paired Tests for Model Comparison
When comparing two models on the same data, you should use paired tests that account for the correlation between the two models' errors on each example.
Paired t-Test for Model Comparison
Given $n$ test examples, let $d_i = \ell_A(x_i) - \ell_B(x_i)$ be the difference in loss between models A and B on example $i$. The paired t-test statistic is:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \quad s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2$$
Under $H_0: \mathbb{E}[d_i] = 0$, the statistic follows a $t$-distribution with $n - 1$ degrees of freedom (approximately, assuming the $d_i$ are i.i.d.).
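With SciPy, the paired test is one call; the per-example losses below are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
# synthetic per-example losses on a shared test set; A is slightly better
loss_b = rng.gamma(shape=2.0, scale=0.5, size=n)
loss_a = loss_b - 0.05 + 0.1 * rng.standard_normal(n)

t_stat, p_value = stats.ttest_rel(loss_a, loss_b)   # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```

`ttest_rel` computes exactly the statistic above; by hand it is `d = loss_a - loss_b; t = d.mean() / (d.std(ddof=1) / np.sqrt(n))`.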
Wilcoxon Signed-Rank Test
A non-parametric alternative to the paired t-test. It does not assume normality of the differences $d_i$. Instead, it ranks the absolute differences and compares the sum of ranks for positive vs. negative differences. Use this when the differences are not approximately normal (e.g., heavy-tailed or skewed distributions).
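A sketch with heavy-tailed synthetic differences, where the t-test's normality assumption is dubious:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100
# heavy-tailed loss differences (Student t with 2 degrees of freedom),
# shifted so that model A is genuinely better
diffs = -1.0 + rng.standard_t(df=2, size=n)

w_stat, p_value = stats.wilcoxon(diffs)   # H0: differences symmetric about 0
print(f"W = {w_stat}, p = {p_value:.2e}")
```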
Bootstrap Hypothesis Testing
Bootstrap Test
When the distribution of the test statistic is unknown or the sample is small:
- Compute the observed test statistic $\hat{\theta}$ (e.g., difference in accuracy)
- Generate $B$ resamples under $H_0$ (e.g., by permuting model labels)
- Compute $\hat{\theta}^{*(b)}$ for each resample $b$
- The p-value is the fraction of resampled statistics at least as extreme as the observed: $p = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\big[|\hat{\theta}^{*(b)}| \geq |\hat{\theta}|\big]$
Bootstrap tests are flexible: they work for any test statistic and make minimal distributional assumptions.
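A paired permutation sketch of this recipe: under $H_0$ the sign of each paired loss difference is exchangeable, so flipping signs at random simulates the null distribution (the losses here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150
# synthetic paired per-example loss differences (model A minus model B)
diffs = -0.05 + 0.1 * rng.standard_normal(n)
obs = diffs.mean()                             # observed test statistic

B = 10_000
signs = rng.choice([-1.0, 1.0], size=(B, n))   # random label swaps per pair
null_stats = (signs * diffs).mean(axis=1)      # statistic under H0

# two-sided p-value, with the usual +1 smoothing so p is never exactly 0
p_value = (1 + np.sum(np.abs(null_stats) >= abs(obs))) / (B + 1)
print(f"observed = {obs:.4f}, p = {p_value:.4f}")
```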
Common Confusions
p less than 0.05 does NOT mean 95% probability that H1 is true
This is the most common misinterpretation in all of statistics. The p-value is $P(\text{data} \mid H_0)$, not $P(H_1 \mid \text{data})$. To get the latter, you need Bayes' theorem and a prior on $H_1$. A p-value of 0.03 means: "if the models were truly equal, there is a 3% chance of seeing this large a difference." It does not mean "there is a 97% chance that model A is better."
Statistical significance is not practical significance
A difference can be statistically significant (small p-value) but practically meaningless (tiny effect size). With enough data, you can detect arbitrarily small differences. Always report effect sizes and confidence intervals alongside p-values. A 0.1% accuracy improvement that is "statistically significant" may not matter in practice.
Do not use unpaired tests for paired data
When comparing models on the same test set, the predictions are correlated (both models see the same examples). An unpaired t-test ignores this correlation and has lower power. Always use a paired test (paired t-test, Wilcoxon signed-rank, or bootstrap with pairing) when the data is paired.
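The power loss is easy to demonstrate with synthetic correlated losses, where a shared per-example difficulty term dominates the variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 100
difficulty = rng.standard_normal(n)              # shared per-example difficulty
loss_a = difficulty + 0.10 * rng.standard_normal(n)
loss_b = difficulty + 0.15 + 0.10 * rng.standard_normal(n)  # B worse by 0.15

_, p_unpaired = stats.ttest_ind(loss_a, loss_b)  # ignores the pairing
_, p_paired = stats.ttest_rel(loss_a, loss_b)    # subtracts difficulty out
print(f"unpaired p = {p_unpaired:.3f}, paired p = {p_paired:.2e}")
```

Pairing cancels the shared difficulty term, so the paired test sees far smaller variance; its p-value is typically many orders of magnitude smaller than the unpaired one on the same data.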
Summary
- The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$
- Type I error (false positive) rate is controlled by $\alpha$; Type II (false negative) rate depends on power
- Multiple comparisons inflate false positives: use Bonferroni (strict) or Benjamini-Hochberg (less strict)
- Use paired tests (paired t-test or Wilcoxon) when comparing models on the same data
- Confidence intervals are more informative than p-values alone
- Bootstrap tests work when you cannot derive the test statistic distribution
Exercises
Problem
You compare 10 models on a test set and report p-values for each pairwise comparison. How many comparisons are there? If you use $\alpha = 0.05$ without correction, what is the approximate probability of at least one false positive (assuming all nulls are true)?
Problem
Design a proper statistical test to compare two classifiers (a fine-tuned BERT model and a logistic regression baseline) on a binary classification task with 1000 test examples. Specify: the null hypothesis, the test statistic, the type of test, and how you would compute the p-value.
References
Canonical:
- Demšar, "Statistical Comparisons of Classifiers over Multiple Data Sets" (JMLR 2006)
- Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
Current:
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Benavoli et al., "Should We Really Use Post-Hoc Tests Based on Mean-Ranks?" (JMLR 2016)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Statistical significance and multiple comparisons: deeper treatment of FDR, permutation tests, and replication
- Bootstrap methods: the general bootstrap framework for inference
Last reviewed: April 2026
Builds on This
- Ablation Study Design (Layer 3)
- Causal Inference Basics (Layer 3)
- Data Contamination and Evaluation (Layer 5)
- Detection Theory (Layer 2)
- Goodness-of-Fit Tests (Layer 1)
- Meta-Analysis (Layer 2)
- P-Hacking and Multiple Testing (Layer 2)
- Sample Size Determination (Layer 2)
- Signal Detection Theory (Layer 2)
- Statistical Significance and Multiple Comparisons (Layer 2)