
Sequential Inference

p-values

The classical evidence statistic against a null hypothesis: probability under the null of observing a test statistic at least as extreme as the one seen. Valid for a single test at a pre-specified sample size; breaks under optional stopping, repeated peeking, and selective reporting. The p-value is what e-values and confidence sequences are designed to replace or augment in sequential settings.


Why This Matters

The $p$-value is the dominant evidence statistic in published statistical practice. It is also the object most often misinterpreted and most often used outside its validity conditions. A $p$-value is a function of the data computed under a specified null hypothesis $H_0$ at a fixed sample size $n$. Its sole guarantee is distributional: under $H_0$, $p$ is stochastically larger than uniform on $[0, 1]$, so $\Pr_{H_0}(p \leq \alpha) \leq \alpha$. Reject when $p \leq \alpha$ and the long-run Type I error rate is bounded by $\alpha$.

That guarantee is brittle. It assumes a single hypothesis, a single dataset, a sample size fixed in advance, and a decision made once. Three behaviors that violate the assumption are routine in practice: looking at the data multiple times and stopping when $p$ first drops below $\alpha$ (optional stopping); trying many hypotheses and reporting only the smallest $p$ (multiple testing without adjustment); and making modeling choices conditional on the data in ways that bias the test statistic toward rejection ($p$-hacking). Each inflates Type I error well above the nominal level.

E-values and anytime-valid inference were developed to preserve frequentist control under exactly the sequential and selective behaviors that break classical $p$-values. Understanding what a $p$-value is and is not is prerequisite to using either.

Formal Setup

Definition

p-value

Given a null hypothesis $H_0$ specifying a distribution (or composite family) for the observed data $X$, and a test statistic $T(X)$ for which large values argue against $H_0$, the $p$-value is
$$p(X) = \sup_{P \in H_0} \Pr_P\!\left[T(X^*) \geq T(X)\right],$$
where $X^* \sim P$ is a hypothetical replication. For a simple null $H_0 = \{P_0\}$, the supremum collapses to $\Pr_{P_0}(T(X^*) \geq T(X))$.

Definition

Type I error

The probability that a test rejects $H_0$ when $H_0$ is true. For a test that rejects when $p \leq \alpha$, the Type I error is at most $\alpha$ provided the $p$-value is valid.

Definition

Valid p-value

A statistic $p(X) \in [0, 1]$ such that under every $P \in H_0$, $\Pr_P(p(X) \leq u) \leq u$ for all $u \in [0, 1]$. Equivalently, $p$ stochastically dominates $\mathrm{Uniform}[0,1]$ under the null.

The Null Distribution and Type I Error

The single fact that organizes the rest is that the $p$-value of a valid test is uniformly distributed under the null, or at worst stochastically larger than uniform.

Theorem

p-value is Uniform Under the Null

Statement

Let $T$ be a real-valued test statistic with continuous CDF $F$ under the simple null $H_0 = \{P_0\}$. Define the $p$-value as $p = 1 - F(T)$. Then under $P_0$, $p \sim \mathrm{Uniform}[0, 1]$. Consequently $\Pr_{P_0}(p \leq \alpha) = \alpha$ for every $\alpha \in [0, 1]$.

Intuition

The probability integral transform: any continuous random variable, run through its own CDF, becomes uniform on $[0, 1]$. The $p$-value is the right tail of that transform, which inherits the uniform distribution. Discretely distributed test statistics give super-uniform $p$-values, which keep the inequality direction but lose exactness.

Proof Sketch

For $u \in [0, 1]$, $\Pr_{P_0}(p \leq u) = \Pr_{P_0}(1 - F(T) \leq u) = \Pr_{P_0}(F(T) \geq 1 - u)$. Since $F$ is continuous, $F(T) \sim \mathrm{Uniform}[0,1]$ by the probability integral transform, so $\Pr_{P_0}(F(T) \geq 1 - u) = u$.

Why It Matters

The Type I error guarantee for the test "reject when $p \leq \alpha$" follows directly. Every classical confidence interval, $t$-test, $F$-test, and likelihood-ratio test reduces to this uniformity statement, sometimes with continuity corrections for discrete statistics.

Failure Mode

The uniformity statement is conditional on a single hypothesis tested once at a sample size $n$ fixed before data collection. Repeating the test after each new observation (optional stopping) destroys uniformity: the running minimum $p$-value over $n = 1, 2, \ldots$ is not uniform under the null. Selecting the test or the data subset based on the data ($p$-hacking) likewise breaks the construction.
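The uniformity claim is easy to check by simulation. Below is a minimal sketch (sample size, replication count, and seed are illustrative choices, not from the text) that draws Gaussian data under a simple null and verifies $\Pr(p \leq u) \approx u$:

```python
# Monte Carlo check that p = 1 - Phi(Z) is Uniform[0, 1] under the null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 100_000

# Draw data under H0: mu = 0, sigma = 1, and compute one-sided p-values.
x = rng.standard_normal((reps, n))
z = np.sqrt(n) * x.mean(axis=1)      # Z = sqrt(n) * xbar / sigma
p = stats.norm.sf(z)                 # p = 1 - Phi(Z)

# Under H0, Pr(p <= u) should be approximately u for every u.
for u in (0.01, 0.05, 0.10, 0.50):
    print(f"Pr(p <= {u:.2f}) ~= {(p <= u).mean():.4f}")
```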

What a p-value Is Not

The $p$-value is not the probability that the null is true. It is not the probability that the result is due to chance. It is not one minus the probability that the alternative is true. It is not a measure of effect size.

Watch Out

p is not Pr(null is true)

$\Pr(H_0 \mid \text{data})$ is a posterior probability and requires a prior over hypotheses; the $p$-value is a tail probability computed under the assumption that $H_0$ is true. The two are equal only under specific priors that the analyst usually has no reason to adopt. Reporting "$p = 0.03$, so there is a 3 percent chance the null is true" is wrong by definition.

Watch Out

A small p is not a large effect

The $p$-value depends on both effect size and sample size. With enough data, effects of negligible practical size produce arbitrarily small $p$-values. Report a confidence interval alongside the $p$-value; the interval tells you about magnitude, the $p$-value tells you about distinguishability from the null.

Watch Out

$p > \alpha$ does not mean $H_0$ is true

Failure to reject is failure to detect, not evidence of equivalence. The $p$-value tests against the null; the alternative is not the symmetric counterpart. Use equivalence tests, posterior model probabilities, or e-values for claims of evidence in favor of the null.

Failure Modes: Optional Stopping, Multiple Testing, and p-hacking

Optional stopping

Suppose you observe $X_1, X_2, \ldots$ i.i.d. and recompute the $p$-value at each $n$. Define $\tau = \inf\{n : p_n \leq \alpha\}$ with $\tau = \infty$ if the threshold is never crossed. Under $H_0$, $\Pr(\tau < \infty)$ can be strictly larger than $\alpha$; for the standard $z$-test it equals $1$: by the law of the iterated logarithm the running $z$-statistic eventually crosses any fixed threshold, so stopping the first time $p_n \leq \alpha$ drives the Type I error to $1$ as the number of looks grows without bound.

The numerical reality at finite horizons is still bad: with $n_{\max} = 1000$ and per-look $\alpha = 0.05$, classical interim analyses on a Gaussian mean inflate the empirical Type I error from $5\%$ to roughly $50\%$ if every observation triggers a test, and to about $30\%$ already by 50 looks. Group-sequential corrections (Pocock, O'Brien-Fleming, alpha-spending) recover validity for pre-specified look schedules; they do not handle arbitrary peeking.
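A simulation along the following lines reproduces the inflation. The horizon, nominal level, replication count, seed, and the use of a two-sided $z$-test at each look are illustrative assumptions:

```python
# Simulation sketch: Type I error under continuous peeking, i.e. a
# two-sided z-test after every observation, stopping at the first rejection.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_max, alpha, reps = 1000, 0.05, 5000
z_crit = stats.norm.ppf(1 - alpha / 2)     # 1.96 for alpha = 0.05

x = rng.standard_normal((reps, n_max))     # data generated under H0
n = np.arange(1, n_max + 1)
z = np.abs(x.cumsum(axis=1)) / np.sqrt(n)  # |Z_n| at every interim look
reject = (z >= z_crit).any(axis=1)         # did any look cross the boundary?

print(f"Type I error with a test after every observation: {reject.mean():.3f}")
print(f"Type I error of the single fixed-n test:          {alpha:.3f}")
```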

Multiple testing

Performing $m$ independent tests at level $\alpha$ and reporting the smallest $p$-value as if it were a single test inflates the Type I error to $1 - (1 - \alpha)^m$. For $m = 20$, $\alpha = 0.05$ produces a $64\%$ chance of at least one spurious rejection under the global null. Bonferroni, Holm, Benjamini-Hochberg, and Storey's procedure are the standard corrections; their validity depends on whether the family is fixed or data-dependent.
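The arithmetic behind these figures, together with the standard Bonferroni repair (test each hypothesis at $\alpha/m$), in a few lines:

```python
# Family-wise error rate for m independent level-alpha tests, and the
# Bonferroni correction that restores control. Values match the text.
alpha, m = 0.05, 20

fwer_uncorrected = 1 - (1 - alpha) ** m
print(f"Pr(at least one false rejection): {fwer_uncorrected:.3f}")   # ~0.642

fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(f"FWER at Bonferroni level {alpha / m:.4f}: {fwer_bonferroni:.4f}")  # ~0.049
```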

p-hacking

The selection problem in its modeling form. The analyst tries multiple model specifications, subsets, transformations, or covariates and reports the one with the smallest $p$-value. Pre-registration and rigid analysis protocols reduce the freedom; e-values eliminate the inflation because their validity survives data-driven choices of which evidence to combine.

Canonical Example: One-sample z-test

Observe $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known and $H_0: \mu = \mu_0$. The test statistic is $Z = \sqrt{n}\,(\bar X_n - \mu_0)/\sigma$, which is $\mathcal{N}(0, 1)$ under the null. The one-sided $p$-value is $p = 1 - \Phi(Z)$.

Worked numerical case: $n = 100$, $\bar X_n = 0.21$, $\mu_0 = 0$, $\sigma = 1$. Then $Z = 10 \times 0.21 = 2.1$ and $p = 1 - \Phi(2.1) \approx 0.0179$. At $\alpha = 0.05$ the test rejects $H_0$. The interpretation: under $H_0$, the probability of seeing a sample mean of at least $0.21$ with $n = 100$ is about $1.8\%$. This says nothing about whether the alternative is true or how large $\mu$ might be.
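The same computation in SciPy, reproducing the numbers above:

```python
# One-sample z-test: Z = sqrt(n) * (xbar - mu0) / sigma, p = 1 - Phi(Z).
import numpy as np
from scipy import stats

n, xbar, mu0, sigma = 100, 0.21, 0.0, 1.0
z = np.sqrt(n) * (xbar - mu0) / sigma   # Z = 2.1
p = stats.norm.sf(z)                    # one-sided p-value via survival function
print(f"Z = {z:.2f}, p = {p:.4f}")      # Z = 2.10, p = 0.0179
```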

Worked Exercise

Exercise

Problem

You run a one-sample $z$-test against $H_0: \mu = 0$ on i.i.d. $\mathcal{N}(\mu, 1)$ data. At sample size $n = 100$ you compute the two-sided $p$-value and find $p_{100} = 0.06$. The rule for the experiment was to stop at $n = 100$. Curious, you collect $100$ more observations and recompute, getting $p_{200} = 0.03$. Does reporting $p_{200} = 0.03$ carry the usual Type I error guarantee? Quantify the inflation if you would have stopped at the first $n \in \{100, 200\}$ for which $p \leq 0.05$.
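To check your answer, here is a minimal simulation sketch of the two-look stopping rule; the replication count and seed are illustrative choices:

```python
# Estimate the Type I error of "stop at the first n in {100, 200} with
# two-sided p <= 0.05" under H0: mu = 0, sigma = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps = 100_000
z_crit = stats.norm.ppf(0.975)                     # two-sided 5% critical value

x = rng.standard_normal((reps, 200))               # data under H0
z100 = np.abs(x[:, :100].sum(axis=1)) / np.sqrt(100)
z200 = np.abs(x.sum(axis=1)) / np.sqrt(200)
reject = (z100 >= z_crit) | (z200 >= z_crit)       # reject at either look
print(f"Type I error with two looks: {reject.mean():.3f}")   # ~0.08, not 0.05
```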

Implementation Note

For continuous test statistics, the $p$-value is computed from the survival function of the null distribution. SciPy exposes this as 1 - stats.norm.cdf(z) for the $z$-test, 1 - stats.t.cdf(t, df) for the $t$-test, and the survival-function variants stats.chi2.sf and stats.f.sf for chi-square and $F$ statistics. Always use the survival function sf rather than 1 - cdf for right-tail probabilities; subtracting from $1$ loses floating-point precision for small $p$ and gives literal zeros once the true $p$ drops below about $10^{-16}$.
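A quick demonstration of the precision cliff (the cutover point is a property of double-precision arithmetic, not of SciPy):

```python
# 1 - cdf loses the right tail once cdf(z) rounds to 1.0 in float64; sf does not.
from scipy import stats

for z in (5.0, 10.0, 20.0):
    print(f"z={z:5.1f}  1-cdf: {1 - stats.norm.cdf(z):.3e}  sf: {stats.norm.sf(z):.3e}")
# At z = 10 the subtraction already returns 0.0; sf still gives ~7.6e-24.
```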

For discrete test statistics (binomial, hypergeometric, Poisson), the $p$-value is super-uniform. Mid-$p$ corrections recover near-uniformity at the cost of stochastic dominance; the choice depends on whether you want a strict upper bound on Type I error (use the standard $p$-value) or a better-calibrated decision (use the mid-$p$-value).
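A sketch of the standard versus mid-$p$ computation for a one-sided binomial test; the parameter values are illustrative:

```python
# Standard p-value P(X >= x) is a strict upper bound on Type I error;
# mid-p = P(X > x) + 0.5 * P(X = x) trades the bound for better calibration.
from scipy import stats

n, p0, x = 30, 0.5, 20                   # illustrative numbers
p_std = stats.binom.sf(x - 1, n, p0)     # P(X >= x) under H0
p_mid = stats.binom.sf(x, n, p0) + 0.5 * stats.binom.pmf(x, n, p0)
print(f"standard p = {p_std:.4f}, mid-p = {p_mid:.4f}")
```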

For sequential or multiple-testing settings, do not adjust the raw $p$-value with ad-hoc rules. Use either a pre-registered procedure (Bonferroni, Holm, BH, group-sequential boundaries) or move to e-values and anytime-valid inference.

Where p-values Stand in 2026

The $p$-value remains the standard evidence summary for single, fixed-design hypothesis tests in published statistical work. Its limitations have been catalogued repeatedly: the ASA statement (Wasserstein and Lazar 2016, The American Statistician) explicitly warned against six common misinterpretations. The technical response from the post-2010 literature has been to develop tools that match the way modern data are actually generated: continuous monitoring, large numbers of subgroups, sequentially adaptive experiments. E-values, e-processes, anytime-valid confidence sequences, and safe testing are the formal replacements where the classical $p$-value's guarantees break.

For ML practice, the $p$-value still appears in benchmark significance tests, A/B test reporting, and the headline summary of medical-trial papers. The rigorous practitioner reports the $p$-value, but also reports the design (fixed-$n$? sequential?), the family (single hypothesis? many?), and either a confidence interval, an e-value, or a Bayes factor for the same null. The $p$-value alone is rarely enough.

References

Canonical:

  • Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed., Springer). Chapters 3-5 develop the formal framework: simple and composite nulls, uniformly most powerful tests, the role of the $p$-value as a level function.
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference (Springer). Chapter 10 ("Hypothesis Testing and p-values") gives the standard textbook treatment.
  • Casella, G. and Berger, R. L. (2002). Statistical Inference (2nd ed., Duxbury). Chapter 8 covers pp-values, power, and the duality with confidence sets.

Current:

  • Wasserstein, R. L. and Lazar, N. A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose." The American Statistician 70(2). The official American Statistical Association position on common misinterpretations.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31(4). The companion guide, exhaustive on misinterpretation.

Frontier:

  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Establishes the e-to-p and p-to-e calibration maps that link the two evidence frameworks.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4). Position paper on why the sequential setting needs e-values rather than p-values.
