
Sequential Inference

p-values

The classical evidence statistic against a null hypothesis: probability under the null of observing a test statistic at least as extreme as the one seen. Valid for a single test at a pre-specified sample size; breaks under optional stopping, repeated peeking, and selective reporting. The p-value is what e-values and confidence sequences are designed to replace or augment in sequential settings.


Why This Matters

The $p$-value is the dominant evidence statistic in published statistical practice. It is also the object most often misinterpreted and most often used outside its validity conditions. A $p$-value is a function of the data computed under a specified null hypothesis $H_0$ at a fixed sample size $n$. Its sole guarantee is distributional: under $H_0$, $p$ is stochastically larger than uniform on $[0, 1]$, so $\Pr_{H_0}(p \leq \alpha) \leq \alpha$. Reject when $p \leq \alpha$ and the long-run Type I error rate is bounded by $\alpha$.

That guarantee is brittle. It assumes a single hypothesis, a single dataset, a sample size fixed in advance, and a decision made once. Three behaviors that violate the assumption are routine in practice: looking at the data multiple times and stopping when $p$ first drops below $\alpha$ (optional stopping); trying many hypotheses and reporting only the smallest $p$ (multiple testing without adjustment); and making modeling choices conditional on the data in ways that bias the test statistic toward rejection ($p$-hacking). Each inflates Type I error well above the nominal level.

E-values and anytime-valid inference were developed to preserve frequentist control under exactly the sequential and selective behaviors that break classical $p$-values. Understanding what a $p$-value is and is not is prerequisite to using either.

Formal Setup

Definition

p-value

Given a null hypothesis $H_0$ specifying a distribution (or composite family) for the observed data $X$, and a test statistic $T(X)$ for which large values argue against $H_0$, the $p$-value is
$$p(X) = \sup_{P \in H_0} \Pr_P\!\left[T(X^*) \geq T(X)\right],$$
where $X^* \sim P$ is a hypothetical replication. For a simple null $H_0 = \{P_0\}$, the supremum collapses to $\Pr_{P_0}(T(X^*) \geq T(X))$.

Definition

Type I error

The probability that a test rejects $H_0$ when $H_0$ is true. For a test that rejects when $p \leq \alpha$, the Type I error is at most $\alpha$ provided the $p$-value is valid.

Definition

Valid p-value

A statistic $p(X) \in [0, 1]$ such that under every $P \in H_0$, $\Pr_P(p(X) \leq u) \leq u$ for all $u \in [0, 1]$. Equivalently, $p$ stochastically dominates $\mathrm{Uniform}[0,1]$ under the null.

The Null Distribution and Type I Error

The single fact that organizes the rest is that the $p$-value of a valid test is uniformly distributed under the null, or at worst stochastically larger than uniform.

Theorem

p-value is Uniform Under the Null

Statement

Let $T$ be a real-valued test statistic with continuous CDF $F$ under the simple null $H_0 = \{P_0\}$. Define the $p$-value as $p = 1 - F(T)$. Then under $P_0$, $p \sim \mathrm{Uniform}[0, 1]$. Consequently $\Pr_{P_0}(p \leq \alpha) = \alpha$ for every $\alpha \in [0, 1]$.

Intuition

The probability integral transform: any continuous random variable, run through its own CDF, becomes uniform on $[0, 1]$. The $p$-value is the right tail of that transform, which inherits the uniform distribution. Discretely distributed test statistics give super-uniform $p$-values, which keep the inequality direction but lose exactness.

Proof Sketch

For $u \in [0, 1]$, $\Pr_{P_0}(p \leq u) = \Pr_{P_0}(1 - F(T) \leq u) = \Pr_{P_0}(F(T) \geq 1 - u)$. Since $F$ is continuous, $F(T) \sim \mathrm{Uniform}[0,1]$ by the probability integral transform, so $\Pr_{P_0}(F(T) \geq 1 - u) = u$.

Why It Matters

The Type I error guarantee for the test "reject when $p \leq \alpha$" follows directly. Every classical confidence interval, $t$-test, $F$-test, and likelihood-ratio test reduces to this uniformity statement, sometimes with continuity corrections for discrete statistics.

Failure Mode

The uniformity statement is conditional on a single hypothesis tested once at a sample size $n$ fixed before data collection. Repeating the test after each new observation (optional stopping) destroys uniformity: the running minimum $p$-value over $n = 1, 2, \ldots$ is not uniform under the null. Selecting the test or the data subset based on the data ($p$-hacking) likewise breaks the construction.
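The uniformity claim is easy to check by simulation. Below is a minimal sketch (sample size, replication count, and seed are illustrative choices, not from the text) that draws Gaussian data under a simple null and verifies $\Pr(p \leq u) \approx u$:

```python
# Monte Carlo check that p = 1 - Phi(Z) is Uniform[0, 1] under the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 100_000

# Draw data under H0: mu = 0, sigma = 1, and compute one-sided p-values.
x = rng.standard_normal((reps, n))
z = np.sqrt(n) * x.mean(axis=1)      # Z = sqrt(n) * xbar / sigma
p = stats.norm.sf(z)                 # p = 1 - Phi(Z)

# Under H0, Pr(p <= u) should be approximately u for every u.
for u in (0.01, 0.05, 0.10, 0.50):
    print(f"Pr(p <= {u:.2f}) ~= {(p <= u).mean():.4f}")
```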

What a p-value Is Not

The $p$-value is not the probability that the null is true. It is not the probability that the result is due to chance. It is not one minus the probability that the alternative is true. It is not a measure of effect size.

Watch Out

p is not Pr(null is true)

$\Pr(H_0 \mid \text{data})$ is a posterior probability and requires a prior over hypotheses; the $p$-value is a tail probability computed under the assumption that $H_0$ is true. The two are equal only under specific priors that the analyst usually has no reason to adopt. Reporting "$p = 0.03$, so there is a 3 percent chance the null is true" is wrong by definition.

Watch Out

A small p is not a large effect

The $p$-value depends on both effect size and sample size. With enough data, effects of negligible practical size produce arbitrarily small $p$-values. Report a confidence interval alongside the $p$-value; the interval tells you about magnitude, the $p$-value tells you about distinguishability from the null.

Watch Out

$p > \alpha$ does not mean $H_0$ is true

Failure to reject is failure to detect, not evidence of equivalence. The $p$-value tests against the null; the alternative is not the symmetric counterpart. Use equivalence tests, posterior model probabilities, or e-values for claims of evidence in favor of the null.

Failure Modes: Optional Stopping, Multiple Testing, and p-hacking

Optional stopping

Suppose you observe $X_1, X_2, \ldots$ i.i.d. and recompute the $p$-value at each $n$. Define $\tau = \inf\{n : p_n \leq \alpha\}$ with $\tau = \infty$ if the threshold is never crossed. Under $H_0$, $\Pr(\tau < \infty)$ can be strictly larger than $\alpha$; for the standard $z$-test it equals $1$: by the law of the iterated logarithm the running $z$-statistic eventually crosses any fixed threshold, so stopping the first time $p_n \leq \alpha$ drives the Type I error to $1$ as the number of looks grows without bound.

The numerical reality at finite horizons is still bad: with $n_{\max} = 1000$ and per-look $\alpha = 0.05$, classical interim analyses on a Gaussian mean inflate the empirical Type I error from $5\%$ to roughly $50\%$ if every observation triggers a test, and to about $30\%$ already by 50 looks. Group-sequential corrections (Pocock, O'Brien-Fleming, alpha-spending) recover validity for pre-specified look schedules; they do not handle arbitrary peeking.
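A simulation along the following lines reproduces the inflation. The horizon, nominal level, replication count, seed, and the use of a two-sided $z$-test at each look are illustrative assumptions:

```python
# Simulation sketch: Type I error under continuous peeking, i.e. a
# two-sided z-test after every observation, stopping at the first rejection.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_max, alpha, reps = 1000, 0.05, 5000
z_crit = stats.norm.ppf(1 - alpha / 2)     # 1.96 for alpha = 0.05

x = rng.standard_normal((reps, n_max))     # data generated under H0
n = np.arange(1, n_max + 1)
z = np.abs(x.cumsum(axis=1)) / np.sqrt(n)  # |Z_n| at every interim look
reject = (z >= z_crit).any(axis=1)         # did any look cross the boundary?

print(f"Type I error with a test after every observation: {reject.mean():.3f}")
print(f"Type I error of the single fixed-n test:          {alpha:.3f}")
```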

Multiple testing

Performing $m$ independent tests at level $\alpha$ and reporting the smallest $p$-value as if it were a single test inflates the Type I error to $1 - (1 - \alpha)^m$. For $m = 20$, $\alpha = 0.05$ produces a $64\%$ chance of at least one spurious rejection under the global null. Bonferroni, Holm, Benjamini-Hochberg, and Storey's procedure are the standard corrections; their validity depends on whether the family is fixed or data-dependent.
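The arithmetic behind these figures, together with the standard Bonferroni repair (test each hypothesis at $\alpha/m$), in a few lines:

```python
# Family-wise error rate for m independent level-alpha tests, and the
# Bonferroni correction that restores control. Values match the text.
alpha, m = 0.05, 20

fwer_uncorrected = 1 - (1 - alpha) ** m
print(f"Pr(at least one false rejection): {fwer_uncorrected:.3f}")   # ~0.642

fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(f"FWER at Bonferroni level {alpha / m:.4f}: {fwer_bonferroni:.4f}")  # ~0.049
```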

p-hacking

The selection problem in its modeling form. The analyst tries multiple model specifications, subsets, transformations, or covariates and reports the one with the smallest $p$-value. Pre-registration and rigid analysis protocols reduce the freedom; e-values eliminate the inflation because their validity survives data-driven choices of which evidence to combine.

Canonical Example: One-sample z-test

Observe $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known and $H_0: \mu = \mu_0$. The test statistic is $Z = \sqrt{n}\,(\bar X_n - \mu_0)/\sigma$, which is $\mathcal{N}(0, 1)$ under the null. The one-sided $p$-value is $p = 1 - \Phi(Z)$.

Worked numerical case: $n = 100$, $\bar X_n = 0.21$, $\mu_0 = 0$, $\sigma = 1$. Then $Z = 10 \times 0.21 = 2.1$ and $p = 1 - \Phi(2.1) \approx 0.0179$. At $\alpha = 0.05$ the test rejects $H_0$. The interpretation: under $H_0$, the probability of seeing a sample mean of at least $0.21$ with $n = 100$ is about $1.8\%$. This says nothing about whether the alternative is true or how large $\mu$ might be.
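The same computation in SciPy, reproducing the numbers above:

```python
# One-sample z-test: Z = sqrt(n) * (xbar - mu0) / sigma, p = 1 - Phi(Z).
import numpy as np
from scipy import stats

n, xbar, mu0, sigma = 100, 0.21, 0.0, 1.0
z = np.sqrt(n) * (xbar - mu0) / sigma   # Z = 2.1
p = stats.norm.sf(z)                    # one-sided p-value via survival function
print(f"Z = {z:.2f}, p = {p:.4f}")      # Z = 2.10, p = 0.0179
```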

Worked Exercise

Exercise

Problem

You run a one-sample $z$-test against $H_0: \mu = 0$ on i.i.d. $\mathcal{N}(\mu, 1)$ data. At sample size $n = 100$ you compute the two-sided $p$-value and find $p_{100} = 0.06$. The rule for the experiment was to stop at $n = 100$. Curious, you collect $100$ more observations and recompute, getting $p_{200} = 0.03$. Does reporting $p_{200} = 0.03$ carry the usual Type I error guarantee? Quantify the inflation if you would have stopped at the first $n \in \{100, 200\}$ for which $p \leq 0.05$.
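To check your answer, here is a minimal simulation sketch of the two-look stopping rule; the replication count and seed are illustrative choices:

```python
# Estimate the Type I error of "stop at the first n in {100, 200} with
# two-sided p <= 0.05" under H0: mu = 0, sigma = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps = 100_000
z_crit = stats.norm.ppf(0.975)                     # two-sided 5% critical value

x = rng.standard_normal((reps, 200))               # data under H0
z100 = np.abs(x[:, :100].sum(axis=1)) / np.sqrt(100)
z200 = np.abs(x.sum(axis=1)) / np.sqrt(200)
reject = (z100 >= z_crit) | (z200 >= z_crit)       # reject at either look
print(f"Type I error with two looks: {reject.mean():.3f}")   # ~0.08, not 0.05
```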

Implementation Note

For continuous test statistics, the $p$-value is computed from the survival function of the null distribution. SciPy exposes this as 1 - stats.norm.cdf(z) for the $z$-test, 1 - stats.t.cdf(t, df) for the $t$-test, and the survival-function variants stats.chi2.sf and stats.f.sf for chi-square and $F$ statistics. Always use the survival function sf rather than 1 - cdf for right-tail probabilities; subtracting from $1$ loses floating-point precision for small $p$ and gives literal zeros once the true $p$ drops below about $10^{-16}$.
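A quick demonstration of the precision cliff (the cutover point is a property of double-precision arithmetic, not of SciPy):

```python
# 1 - cdf loses the right tail once cdf(z) rounds to 1.0 in float64; sf does not.
from scipy import stats

for z in (5.0, 10.0, 20.0):
    print(f"z={z:5.1f}  1-cdf: {1 - stats.norm.cdf(z):.3e}  sf: {stats.norm.sf(z):.3e}")
# At z = 10 the subtraction already returns 0.0; sf still gives ~7.6e-24.
```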

For discrete test statistics (binomial, hypergeometric, Poisson), the $p$-value is super-uniform. Mid-$p$ corrections recover near-uniformity at the cost of stochastic dominance; the choice depends on whether you want a strict upper bound on Type I error (use the standard $p$-value) or a better-calibrated decision (use the mid-$p$-value).
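A sketch of the standard versus mid-$p$ computation for a one-sided binomial test; the parameter values are illustrative:

```python
# Standard p-value P(X >= x) is a strict upper bound on Type I error;
# mid-p = P(X > x) + 0.5 * P(X = x) trades the bound for better calibration.
from scipy import stats

n, p0, x = 30, 0.5, 20                   # illustrative numbers
p_std = stats.binom.sf(x - 1, n, p0)     # P(X >= x) under H0
p_mid = stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)
print(f"standard p = {p_std:.4f}, mid-p = {p_mid:.4f}")
```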

For sequential or multiple-testing settings, do not adjust the raw $p$-value with ad-hoc rules. Use either a pre-registered procedure (Bonferroni, Holm, BH, group-sequential boundaries) or move to e-values and anytime-valid inference.

Where p-values Stand in 2026

The $p$-value remains the standard evidence summary for single, fixed-design hypothesis tests in published statistical work. Its limitations have been catalogued repeatedly: the ASA statement (Wasserstein and Lazar 2016, The American Statistician) explicitly warned against six common misinterpretations. The technical response from the post-2010 literature has been to develop tools that match the way modern data are actually generated: continuous monitoring, large numbers of subgroups, sequentially adaptive experiments. E-values, e-processes, anytime-valid confidence sequences, and safe testing are the formal replacements where the classical $p$-value's guarantees break.

For ML practice, the $p$-value still appears in benchmark significance tests, A/B test reporting, and the headline summary of medical-trial papers. The rigorous practitioner reports the $p$-value, but also reports the design (fixed-$n$? sequential?), the family (single hypothesis? many?), and either a confidence interval, an e-value, or a Bayes factor for the same null. The $p$-value alone is rarely enough.

References

Canonical:

  • Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed., Springer). Chapters 3-5 develop the formal framework: simple and composite nulls, uniformly most powerful tests, the role of the $p$-value as a level function.
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference (Springer). Chapter 10 ("Hypothesis Testing and p-values") gives the standard textbook treatment.
  • Casella, G. and Berger, R. L. (2002). Statistical Inference (2nd ed., Duxbury). Chapter 8 covers pp-values, power, and the duality with confidence sets.

Current:

  • Wasserstein, R. L. and Lazar, N. A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose." The American Statistician 70(2). The official American Statistical Association position on common misinterpretations.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31(4). The companion guide, exhaustive on misinterpretation.

Frontier:

  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Establishes the e-to-p and p-to-e calibration maps that link the two evidence frameworks.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4). Position paper on why the sequential setting needs e-values rather than p-values.
