
Methodology

Hypothesis Testing for ML

Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.


Why This Matters

Machine learning papers routinely claim that model A outperforms model B. But how do you know the difference is real and not an artifact of the random train/test split, initialization, or data ordering? Hypothesis testing provides the formal framework for answering this question. Without it, you cannot distinguish signal from noise in experimental comparisons. The multiple comparisons problem is especially critical in ML, where researchers often compare many models, hyperparameter settings, or metrics simultaneously.

Mental Model

You have two models. You run them on a test set and model A scores 0.85 while model B scores 0.83. Is A better than B, or did you just get lucky? Hypothesis testing formalizes this question: assume the models are equally good (null hypothesis), compute how unlikely the observed difference is under that assumption (p-value), and reject the null if the data is sufficiently surprising.

The Neyman-Pearson Framework

Definition

Null Hypothesis

The null hypothesis $H_0$ is the default assumption you are trying to disprove. In ML model comparison:

$$H_0: \mu_A = \mu_B$$

where $\mu_A$ and $\mu_B$ are the true expected performances of models A and B. The null says there is no real difference between the models.

Definition

Alternative Hypothesis

The alternative hypothesis $H_1$ is what you conclude if you reject $H_0$.

  • Two-sided: $H_1: \mu_A \neq \mu_B$ (the models differ)
  • One-sided: $H_1: \mu_A > \mu_B$ (model A is better)

In ML, two-sided tests are more conservative and generally preferred unless you have a strong prior reason to test only one direction.

Definition

p-Value

The p-value is the probability of observing a test statistic at least as extreme as the one computed from your data, assuming $H_0$ is true:

$$p = P(\text{test statistic} \geq t_{\text{obs}} \mid H_0)$$

A small p-value means the observed data is unlikely under $H_0$, which is evidence against $H_0$.

The p-value is not the probability that $H_0$ is true. It is the probability of the data given $H_0$, not the probability of $H_0$ given the data.

Definition

Significance Level

The significance level $\alpha$ is the threshold for rejecting $H_0$. If $p \leq \alpha$, we reject $H_0$ and declare the result "statistically significant." The standard choice is $\alpha = 0.05$, meaning we accept a 5% chance of falsely rejecting a true null hypothesis.

Watch Out

p = 0.05 is a convention, not a theorem

The 0.05 threshold comes from Fisher (1925), stated as a rough guide, not a principled cutoff. The ASA's 2016 and 2019 statements on p-values warn against thresholded dichotomous decisions. The replication crisis in psychology and biomedicine (Open Science Collaboration 2015; Ioannidis 2005) showed that a large fraction of "significant" findings at $p < 0.05$ fail to replicate. In ML practice: test across multiple seeds, report effect sizes and confidence intervals, and treat $p < 0.05$ on a single run as weak evidence.

Error Types

Proposition

Type I and Type II Errors

Statement

There are two types of errors in hypothesis testing:

|  | $H_0$ true | $H_0$ false |
| --- | --- | --- |
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power) |
| Do not reject $H_0$ | Correct | Type II error ($\beta$) |

  • Type I error (false positive): rejecting $H_0$ when it is true. Probability = $\alpha$.
  • Type II error (false negative): failing to reject $H_0$ when it is false. Probability = $\beta$.
  • Power = $1 - \beta$: probability of correctly rejecting a false $H_0$.

Intuition

Type I error is a false alarm: you claim the models are different when they are not. Type II error is a missed detection: the models truly differ but your test fails to detect it. Decreasing $\alpha$ (being more conservative) increases $\beta$ (more missed detections) for a fixed sample size.

Proof Sketch

By definition, $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$, which is controlled directly by the choice of significance level. The power depends on the true effect size $\delta = |\mu_A - \mu_B|$, the variance, and the sample size $n$:

$$\text{power} \approx \Phi\left(\frac{\delta\sqrt{n}}{\sigma} - z_{\alpha/2}\right)$$
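This approximation is easy to evaluate numerically. A minimal sketch in pure Python (normal approximation; the effect size and per-run standard deviation below are illustrative numbers, not taken from any real experiment):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(delta: float, sigma: float, n: int) -> float:
    """Approximate power of a two-sided test at alpha = 0.05:
    Phi(delta * sqrt(n) / sigma - z_{alpha/2})."""
    z_crit = 1.96  # z_{0.025}; hard-coded since the stdlib has no inverse normal CDF
    return normal_cdf(delta * math.sqrt(n) / sigma - z_crit)

# Detecting a 1-point accuracy gap (delta = 0.01) with per-seed std 0.02:
print(approx_power(0.01, 0.02, 5))   # few seeds: low power
print(approx_power(0.01, 0.02, 50))  # many seeds: high power
```

The comparison shows why "we ran 3 seeds" is usually not enough to detect small but real improvements.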

Why It Matters

In ML, Type I errors lead to published claims of improvement that do not replicate. Type II errors lead to dismissing genuinely better methods. Both are costly. Understanding the tradeoff is essential for designing experiments with adequate power.

Failure Mode

Most ML experiments are underpowered: they use too few random seeds or train/test splits to detect real but small improvements. An underpowered test has high $\beta$, meaning it frequently fails to detect true differences.

Confidence Intervals

Definition

Confidence Interval

A $(1-\alpha)$ confidence interval for a parameter $\theta$ is a random interval $[L, U]$ such that:

$$P(\theta \in [L, U]) = 1 - \alpha$$

For the difference in model performance $\delta = \mu_A - \mu_B$, a 95% CI provides a range of plausible values for the true difference. If the CI excludes zero, the difference is significant at $\alpha = 0.05$.

Confidence intervals are more informative than p-values alone because they communicate both statistical significance and effect size.
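A minimal sketch of such an interval, using simulated per-example loss differences and the large-sample normal approximation (the data below is synthetic, invented purely for illustration):

```python
import math
import random

random.seed(0)
# Hypothetical per-example loss differences loss(A) - loss(B) on 200 test points.
n = 200
d = [random.gauss(0.02, 0.1) for _ in range(n)]

mean = sum(d) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
se = sd / math.sqrt(n)
# 95% CI via the normal approximation (reasonable at n = 200).
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for delta: [{lo:.4f}, {hi:.4f}]")
```

If the printed interval excludes zero, the observed difference is significant at $\alpha = 0.05$; either way, its width tells you how precisely the difference is estimated.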

The Multiple Comparisons Problem

When you test many hypotheses simultaneously, the probability of at least one false positive grows rapidly.
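To see how fast this grows: if the $m$ tests are independent and each has false-positive rate $\alpha$, the chance of at least one false positive is $1 - (1-\alpha)^m$. A quick sketch:

```python
alpha = 0.05
for m in (1, 5, 20, 100):
    # Independence assumed here; the Bonferroni bound below needs no such assumption.
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:3d} tests -> P(at least one false positive) = {fwer:.3f}")
```

Already at 20 independent tests, the family-wise error rate exceeds 60%.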

Theorem

Bonferroni Correction

Statement

If you perform $m$ simultaneous hypothesis tests and want to control the family-wise error rate (FWER) at level $\alpha$, reject each individual test $i$ only if $p_i \leq \alpha/m$. Then:

$$\text{FWER} = P(\text{at least one false rejection}) \leq \alpha$$

This holds regardless of the dependence structure among the tests.

Intuition

If you flip a coin 20 times looking for "evidence of bias," you expect at least one streak just by chance. Bonferroni compensates by making each individual test more stringent: if you test 20 hypotheses, you need $p \leq 0.05/20 = 0.0025$ for any single one.

Proof Sketch

By the union bound: $P(\exists i: \text{reject } H_{0,i} \mid \text{all } H_{0,i} \text{ true}) \leq \sum_{i=1}^m P(p_i \leq \alpha/m) = m \cdot (\alpha/m) = \alpha$.
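The correction itself is one line. A minimal sketch (the p-values below are made up for illustration):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject test i iff p_i <= alpha / m; controls FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three comparisons: only the first clears the corrected threshold 0.05 / 3.
print(bonferroni_reject([0.004, 0.03, 0.2]))
```

Note that 0.03 would have been "significant" on its own but is not after correction.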

Why It Matters

In ML, the multiple comparisons problem arises constantly: comparing many models, testing on multiple datasets, evaluating multiple metrics, running experiments with different hyperparameters. Without correction, the false positive rate can be far higher than the nominal $\alpha$.

Failure Mode

Bonferroni is very conservative: it controls the probability of any false positive, which can make it too strict when $m$ is large. Many true effects are missed. When you care about controlling the proportion of false positives rather than their existence, use FDR control instead.

FDR Control: Benjamini-Hochberg

Definition

False Discovery Rate

The false discovery rate is the expected proportion of false positives among all rejected hypotheses:

$$\text{FDR} = \mathbb{E}\left[\frac{\text{false rejections}}{\text{total rejections}}\right]$$

The Benjamini-Hochberg (BH) procedure controls FDR at level $q$:

  1. Order the $m$ p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot q$
  3. Reject all hypotheses with $p_{(i)} \leq p_{(k)}$

BH is less conservative than Bonferroni and is preferred when testing many hypotheses (e.g., comparing models on 50 datasets).
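The three steps above translate directly into code. A minimal sketch (the p-values at the bottom are invented for illustration):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean reject-list controlling FDR at level q (BH procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # step 1: sort p-values
    k = 0  # step 2: largest rank whose p-value is under the BH line
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    reject = [False] * m  # step 3: reject everything at or below rank k
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))
```

On this input only the two smallest p-values are rejected; a naive $p < 0.05$ rule would have rejected five.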

Paired Tests for Model Comparison

When comparing two models on the same data, you should use paired tests that account for the correlation between the two models' errors on each example.

Definition

Paired t-Test for Model Comparison

Given $n$ test examples, let $d_i = \ell(A, x_i) - \ell(B, x_i)$ be the difference in loss between models A and B on example $i$. The paired t-test statistic is:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}, \quad \bar{d} = \frac{1}{n}\sum_i d_i, \quad s_d^2 = \frac{1}{n-1}\sum_i (d_i - \bar{d})^2$$

Under $H_0: \mathbb{E}[d_i] = 0$, the statistic $t$ follows a $t$-distribution with $n-1$ degrees of freedom (exactly if the $d_i$ are i.i.d. normal, approximately for large $n$ by the central limit theorem).

Definition

Wilcoxon Signed-Rank Test

A non-parametric alternative to the paired t-test. It does not assume normality of the differences $d_i$. Instead, it ranks the absolute differences $|d_i|$ and compares the sum of ranks for positive vs. negative differences. Use it when the differences are not approximately normal (e.g., heavy-tailed or skewed distributions).
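The paired t-statistic from the definition above can be computed in a few lines. A minimal sketch in pure Python (the 0/1 losses are a toy example; with $n = 8$ the result would be compared against the two-sided $t_7$ critical value, about 2.365 at $\alpha = 0.05$):

```python
import math

def paired_t_statistic(losses_a, losses_b):
    """t = dbar / (s_d / sqrt(n)) for per-example loss differences d_i."""
    d = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(d)
    dbar = sum(d) / n
    s2 = sum((x - dbar) ** 2 for x in d) / (n - 1)  # sample variance of d_i
    return dbar / math.sqrt(s2 / n)

# Toy per-example 0/1 losses for two classifiers on the same 8 examples:
a = [0, 1, 0, 0, 1, 0, 0, 1]
b = [1, 1, 0, 1, 1, 0, 1, 1]
print(f"t = {paired_t_statistic(a, b):.3f}")
```

Here $|t| \approx 2.05 < 2.365$, so at $n = 8$ this difference is not significant; in practice you would use far more examples (and a library routine that also returns the p-value).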

Bootstrap Hypothesis Testing

Definition

Bootstrap Test

When the distribution of the test statistic is unknown or the sample is small:

  1. Compute the observed test statistic $T_{\text{obs}}$ (e.g., the difference in accuracy)
  2. Generate $B$ resamples under $H_0$ (e.g., by permuting model labels)
  3. Compute $T_b^*$ for each resample
  4. The p-value is the fraction of resampled statistics at least as extreme as the observed one: $p = \frac{1}{B}\sum_{b=1}^B \mathbf{1}[|T_b^*| \geq |T_{\text{obs}}|]$

Bootstrap tests are flexible: they work for any test statistic and make minimal distributional assumptions.
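The four steps above can be sketched as a paired sign-flip permutation test (a close cousin of the bootstrap test: under $H_0$ the sign of each paired difference $d_i$ is arbitrary, so resampling flips signs at random; the differences passed in below are illustrative):

```python
import random

def paired_permutation_pvalue(d, num_resamples=10_000, seed=0):
    """Two-sided permutation p-value for paired differences d_i.
    Step 1: observed statistic; steps 2-3: sign-flip resamples; step 4: count extremes."""
    rng = random.Random(seed)
    n = len(d)
    t_obs = abs(sum(d) / n)
    hits = 0
    for _ in range(num_resamples):
        t_b = abs(sum(x if rng.random() < 0.5 else -x for x in d) / n)
        if t_b >= t_obs:
            hits += 1
    return hits / num_resamples

# All differences favor model B over model A: small p-value expected.
print(paired_permutation_pvalue([0.3, 0.1, 0.4, 0.2, 0.25, 0.15, 0.35, 0.1]))
```

This makes no normality assumption and works for any statistic you can recompute on resampled data.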

Common Confusions

Watch Out

p less than 0.05 does NOT mean 95% probability that H1 is true

This is the most common misinterpretation in all of statistics. The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$. To get the latter, you need Bayes' theorem and a prior on $H_0$. A p-value of 0.03 means: "if the models were truly equal, there would be a 3% chance of seeing a difference at least this large." It does not mean "there is a 97% chance that model A is better."

Watch Out

Statistical significance is not practical significance

A difference can be statistically significant (small p-value) but practically meaningless (tiny effect size). With enough data, you can detect arbitrarily small differences. Always report effect sizes and confidence intervals alongside p-values. A 0.1% accuracy improvement that is "statistically significant" may not matter in practice.

Watch Out

Do not use unpaired tests for paired data

When comparing models on the same test set, the predictions are correlated (both models see the same examples). An unpaired t-test ignores this correlation and has lower power. Always use a paired test (paired t-test, Wilcoxon signed-rank, or bootstrap with pairing) when the data is paired.

Summary

  • The p-value is $P(\text{data} \mid H_0)$, not $P(H_0 \mid \text{data})$
  • The Type I error (false positive) rate is controlled by $\alpha$; the Type II (false negative) rate depends on power
  • Multiple comparisons inflate false positives: use Bonferroni (strict) or Benjamini-Hochberg (less strict)
  • Use paired tests (paired t-test or Wilcoxon) when comparing models on the same data
  • Confidence intervals are more informative than p-values alone
  • Bootstrap tests work when you cannot derive the test statistic's distribution

Exercises

ExerciseCore

Problem

You compare 10 models on a test set and report p-values for each pairwise comparison. How many comparisons are there? If you use $\alpha = 0.05$ without correction, what is the approximate probability of at least one false positive (assuming all nulls are true)?

ExerciseCore

Problem

Design a proper statistical test to compare two classifiers (a fine-tuned BERT model and a logistic regression baseline) on a binary classification task with 1000 test examples. Specify: the null hypothesis, the test statistic, the type of test, and how you would compute the p-value.

References

Canonical:

  • Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006)
  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)

  • Benavoli et al., "Should We Really Use Post-Hoc Tests Based on Mean-Ranks?" (JMLR 2016)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

  • Statistical significance and multiple comparisons: deeper treatment of FDR, permutation tests, and replication
  • Bootstrap methods: the general bootstrap framework for inference

Last reviewed: April 2026
