Sequential Inference
p-values
The classical evidence statistic against a null hypothesis: probability under the null of observing a test statistic at least as extreme as the one seen. Valid for a single test at a pre-specified sample size; breaks under optional stopping, repeated peeking, and selective reporting. The p-value is what e-values and confidence sequences are designed to replace or augment in sequential settings.
Prerequisites
Why This Matters
The p-value is the dominant evidence statistic in published statistical practice. It is also the object most often misinterpreted and most often used outside its validity conditions. A p-value is a function of the data computed under a specified null hypothesis at a fixed sample size $n$. Its sole guarantee is distributional: under $H_0$, $p$ is stochastically larger than uniform on $[0,1]$, so $\Pr_{H_0}(p \le \alpha) \le \alpha$. Reject when $p \le \alpha$ and the long-run Type I error rate is bounded by $\alpha$.
That guarantee is brittle. It assumes a single hypothesis, a single dataset, a sample size fixed in advance, and a decision made once. Three behaviors that violate the assumption are routine in practice: looking at the data multiple times and stopping when $p$ first drops below $\alpha$ (optional stopping); trying many hypotheses and reporting only the smallest $p$ (multiple testing without adjustment); and making modeling choices conditional on the data that bias the test statistic toward rejection (p-hacking). Each inflates Type I error well above the nominal level.
E-values and anytime-valid inference were developed to preserve frequentist control under exactly the sequential and selective behaviors that break classical p-values. Understanding what a p-value is and is not is prerequisite to using either.
Formal Setup
p-value
Given a null hypothesis $H_0$ specifying a distribution (or composite family of distributions) for the observed data $X$, and a test statistic $T(X)$ for which large values argue against $H_0$, the p-value is $p = \sup_{P \in H_0} \Pr_P\big(T(X') \ge T(X)\big)$, where $X'$ is a hypothetical replication of the data. For a simple null $H_0 = \{P_0\}$, the supremum collapses to $p = \Pr_{P_0}\big(T(X') \ge T(X)\big)$.
Type I error
The probability that a test rejects $H_0$ when $H_0$ is true. For a test that rejects when $p \le \alpha$, the Type I error is at most $\alpha$ provided the p-value is valid.
Valid p-value
A statistic $p = p(X)$ such that under every $P \in H_0$, $\Pr_P(p \le \alpha) \le \alpha$ for all $\alpha \in [0,1]$. Equivalently, a $\mathrm{Uniform}(0,1)$ random variable is stochastically dominated by $p$ under the null.
The Null Distribution and Type I Error
The single fact that organizes the rest is that the p-value of a valid test is uniformly distributed under the null, or at worst stochastically larger than uniform.
p-value is Uniform Under the Null
Statement
Let $T$ be a real-valued test statistic with continuous CDF $F$ under the simple null $H_0$. Define the p-value as $p = 1 - F(T)$. Then under $H_0$, $p \sim \mathrm{Uniform}(0,1)$. Consequently $\Pr_{H_0}(p \le \alpha) = \alpha$ for every $\alpha \in [0,1]$.
Intuition
The probability integral transform: any continuous random variable, run through its own CDF, becomes Uniform on $[0,1]$. The p-value is the right tail of that transform, which inherits the uniform distribution. Discretely distributed test statistics give super-uniform p-values, which keep the inequality direction but lose exactness.
Proof Sketch
For $\alpha \in [0,1]$, $\Pr_{H_0}(p \le \alpha) = \Pr_{H_0}(1 - F(T) \le \alpha) = \Pr_{H_0}(F(T) \ge 1 - \alpha)$. Since $F$ is continuous, $F(T) \sim \mathrm{Uniform}(0,1)$ by the probability integral transform, so $\Pr_{H_0}(F(T) \ge 1 - \alpha) = \alpha$.
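A quick way to see the uniformity claim is to simulate it. The sketch below is a minimal illustration, assuming a one-sided $z$-test with known $\sigma = 1$ and a choice of sample size and replication count not taken from the text: it draws many datasets under the null and checks that the fraction of p-values at or below each $\alpha$ matches $\alpha$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_rep = 50, 20_000          # observations per dataset, number of null datasets

# Simulate the one-sided z-test under H_0: mu = 0 with sigma = 1 known.
x = rng.normal(loc=0.0, scale=1.0, size=(n_rep, n))
z = np.sqrt(n) * x.mean(axis=1)    # N(0, 1) under the null
p = stats.norm.sf(z)               # one-sided p-values

# Under H_0 the p-values are Uniform(0, 1): the empirical CDF at any
# alpha should be close to alpha.
for alpha in (0.01, 0.05, 0.10, 0.50):
    print(f"P(p <= {alpha:.2f}) ~= {(p <= alpha).mean():.3f}")
```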
Why It Matters
The Type I error guarantee for the test "reject when $p \le \alpha$" follows directly. Every classical confidence interval, $z$-test, $t$-test, and likelihood-ratio test reduces to this uniformity statement, sometimes with continuity corrections for discrete statistics.
Failure Mode
The uniformity statement is conditional on a single hypothesis tested once at a sample size fixed before data collection. Repeating the test after each new observation (optional stopping) destroys uniformity: the running minimum p-value over sample sizes $n$ is not uniform under the null. Selecting the test or the data subset based on the data (p-hacking) likewise breaks the construction.
What a p-value Is Not
The p-value is not the probability that the null is true. It is not the probability that the result is due to chance. It is not one minus the probability that the alternative is true. It is not a measure of effect size.
p is not Pr(null is true)
$\Pr(H_0 \mid \text{data})$ is a posterior probability and requires a prior over hypotheses; the p-value is a tail probability computed under the assumption that $H_0$ is true. The two are equal only under specific priors that the analyst usually has no reason to adopt. Reporting "p = 0.03, so there is a 3 percent chance the null is true" is wrong by definition.
A small p is not a large effect
The p-value depends on both effect size and sample size. With enough data, vanishing effects produce arbitrarily small p-values. Report a confidence interval alongside the p-value; the interval tells you about magnitude, the p-value tells you about distinguishability from the null.
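To make the distinction concrete, here is a small simulation; the effect size $0.005$ and sample size $10^6$ are illustrative assumptions chosen to exaggerate the point, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, true_mu, sigma = 1_000_000, 0.005, 1.0      # huge sample, negligible effect

x = rng.normal(true_mu, sigma, size=n)
z = np.sqrt(n) * x.mean() / sigma
p = 2 * stats.norm.sf(abs(z))                  # two-sided p-value for H_0: mu = 0

half_width = 1.96 * sigma / np.sqrt(n)         # 95% confidence interval for mu
ci = (x.mean() - half_width, x.mean() + half_width)
print(f"p = {p:.2e}")                          # tiny p despite a negligible effect
print(f"95% CI for mu: ({ci[0]:.4f}, {ci[1]:.4f})")   # the interval shows the magnitude
```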
p > alpha does not mean H_0 is true
Failure to reject is failure to detect, not evidence of equivalence. The p-value tests against the null; the alternative is not the symmetric counterpart. Use equivalence tests, posterior model probabilities, or e-values for two-sided evidence claims.
Failure Modes: Optional Stopping, Multiple Testing, and p-hacking
Optional stopping
Suppose you observe $X_1, X_2, \dots$ iid and recompute the p-value $p_n$ at each $n$. Define the stopping time $\tau = \inf\{n : p_n \le \alpha\}$, with $\tau = \infty$ if the threshold is never crossed. Under $H_0$, $\Pr_{H_0}(\tau < \infty)$ can be strictly larger than $\alpha$; for the standard $z$-test it is $1$ in the limit, by the law of the iterated logarithm. Stopping the first time $p_n \le \alpha$ therefore drives the Type I error all the way to $1$ when peeking is unlimited.
The numerical reality at any finite horizon is more forgiving but still bad: at a nominal per-look level of $\alpha = 0.05$, classical interim analyses on a Gaussian mean inflate the empirical Type I error to several times $0.05$ within a few dozen looks if every observation triggers a test, and the inflation keeps growing with the number of looks. Group-sequential corrections (Pocock, O'Brien-Fleming, alpha-spending) recover validity for pre-specified look schedules; they do not handle arbitrary peeking.
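The inflation is easy to reproduce by simulation. The sketch below is a minimal illustration, assuming a two-sided $z$-test on a Gaussian mean with known $\sigma = 1$; the horizon of 100 looks and the replication count are choices, not values from the text. It peeks after every observation and records how often the running p-value ever crosses the threshold under the null.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_max, n_rep, alpha = 100, 5_000, 0.05   # looks per experiment, simulated experiments, per-look level

x = rng.normal(size=(n_rep, n_max))                 # data generated under H_0: mu = 0, sigma = 1
n = np.arange(1, n_max + 1)
running_mean = np.cumsum(x, axis=1) / n             # sample mean after each observation
z = np.sqrt(n) * running_mean                       # z statistic at every look
p = 2 * stats.norm.sf(np.abs(z))                    # two-sided p-value at every look

fixed_n_error = (p[:, -1] <= alpha).mean()          # test once at n = n_max
peeking_error = (p.min(axis=1) <= alpha).mean()     # reject if any look crosses alpha
print(f"fixed-n Type I error: {fixed_n_error:.3f}")
print(f"peeking Type I error: {peeking_error:.3f}")
```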
Multiple testing
Performing $m$ independent tests at level $\alpha$ and reporting the smallest p-value as if it were a single test inflates the Type I error to $1 - (1 - \alpha)^m$. For $\alpha = 0.05$, $m = 20$ tests produce a $1 - 0.95^{20} \approx 0.64$ chance of at least one spurious rejection under the global null. Bonferroni, Holm, Benjamini-Hochberg, and Storey's procedure are the standard corrections; their validity depends on whether the family of hypotheses is fixed or data-dependent.
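The family-wise error arithmetic, and the effect of a Bonferroni correction, can be checked directly; a minimal sketch using the $m = 20$, $\alpha = 0.05$ case from above:

```python
m, alpha = 20, 0.05

# Family-wise error rate when each of m independent tests is run at level alpha.
fwer_uncorrected = 1 - (1 - alpha) ** m
# Bonferroni: run each test at level alpha / m instead.
fwer_bonferroni = 1 - (1 - alpha / m) ** m

print(f"uncorrected FWER: {fwer_uncorrected:.3f}")   # about 0.64
print(f"Bonferroni FWER:  {fwer_bonferroni:.3f}")    # just under 0.05
```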
p-hacking
The selection problem in its modeling form. The analyst tries multiple model specifications, subsets, transformations, or covariates and reports the one with the smallest p-value. Pre-registration and rigid analysis protocols reduce the freedom; e-values eliminate the inflation because their validity survives data-driven choices of which evidence to combine.
Canonical Example: One-sample z-test
Observe $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ iid with $\sigma$ known and null $H_0: \mu = \mu_0$. The test statistic is $Z = \sqrt{n}\,(\bar{X} - \mu_0)/\sigma$, which is $\mathcal{N}(0,1)$ under the null. The one-sided p-value is $p = 1 - \Phi(Z)$.
Worked numerical case: $n = 100$, $\sigma = 1$, $\mu_0 = 0$, $\bar{x} = 0.2$. Then $Z = \sqrt{100} \times 0.2 / 1 = 2.0$ and $p = 1 - \Phi(2.0) \approx 0.023$. At $\alpha = 0.05$ the test rejects $H_0$. The interpretation: under $H_0$, the probability of seeing a sample mean at least $0.2$ for $\mu = 0$ is about $2.3$ percent. This says nothing about whether the alternative is true or how large $\mu$ might be.
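In code, the same computation is a one-liner on the survival function; a minimal sketch using SciPy with the numbers from the worked case above:

```python
import numpy as np
from scipy import stats

n, sigma, mu0, xbar = 100, 1.0, 0.0, 0.2

z = np.sqrt(n) * (xbar - mu0) / sigma     # standard normal under H_0
p = stats.norm.sf(z)                      # one-sided p-value, P(Z >= z) under H_0

print(f"z = {z:.2f}, p = {p:.4f}")        # z = 2.00, p = 0.0228
```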
Worked Exercise
Problem
You run a one-sample $z$-test of $H_0: \mu = 0$ on iid data. At the pre-specified sample size $n$ you compute the two-sided p-value and find $p > 0.05$. The rule for the experiment was to stop at $n$. Curious, you collect more observations and recompute, getting $p \le 0.05$. Is reporting the new $p$ as a Type I error guarantee at level $0.05$ valid? Quantify the inflation if you would have stopped at the first $n$ for which $p_n \le 0.05$.
Implementation Note
For continuous test statistics, the p-value is computed from the survival function of the null distribution. SciPy exposes this as 1 - stats.norm.cdf(z) for the $z$-test, 1 - stats.t.cdf(t, df) for the $t$-test, and the survival-function variants stats.chi2.sf and stats.f.sf for chi-square and $F$ statistics. Always use the survival function sf rather than 1 - cdf when the tail is in the right end; subtracting from 1 loses floating-point precision for small tail probabilities and gives literal zeros once the true $p$ drops below roughly $10^{-16}$.
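A quick demonstration of the precision point; a minimal sketch where the value $z = 10$ is just an illustrative deep-tail point:

```python
from scipy import stats

z = 10.0  # a deep right-tail point for the standard normal

naive = 1 - stats.norm.cdf(z)   # cdf rounds to 1.0 in double precision, so this is exactly 0.0
exact = stats.norm.sf(z)        # survival function keeps full precision

print(naive)   # 0.0
print(exact)   # about 7.6e-24
```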
For discrete test statistics (binomial, hypergeometric, Poisson), the p-value is super-uniform. Mid-$p$ corrections recover near-uniformity at the cost of stochastic dominance; the choice depends on whether you want a strict upper bound on Type I error (use the standard p-value) or a better-calibrated decision (use the mid-$p$-value).
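For the binomial case, the exact and mid-$p$ one-sided p-values can be computed directly from the pmf and survival function. A minimal sketch; the counts $k = 14$ out of $n = 20$ and the null proportion $0.5$ are illustrative assumptions:

```python
from scipy import stats

n, k, p0 = 20, 14, 0.5   # trials, observed successes, null success probability

# Exact one-sided p-value: P(X >= k) under H_0, super-uniform for discrete X.
p_exact = stats.binom.sf(k - 1, n, p0)
# Mid-p correction: count only half the probability of the observed value.
p_mid = stats.binom.sf(k, n, p0) + 0.5 * stats.binom.pmf(k, n, p0)

print(f"exact p = {p_exact:.4f}, mid-p = {p_mid:.4f}")
```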
For sequential or multiple-testing settings, do not adjust the raw p-value with ad-hoc rules. Use either a pre-registered procedure (Bonferroni, Holm, BH, group-sequential boundaries) or move to e-values and anytime-valid inference.
Where p-values Stand in 2026
The p-value remains the standard evidence summary for single, fixed-design hypothesis tests in published statistical work. Its limitations have been catalogued repeatedly: the ASA statement (Wasserstein and Lazar 2016, The American Statistician) laid out six principles aimed squarely at common misinterpretations. The technical response from the post-2010 literature has been to develop tools that match the way modern data are actually generated: continuous monitoring, large numbers of subgroups, sequentially adaptive experiments. E-values, e-processes, anytime-valid confidence sequences, and safe testing are the formal replacements where the classical p-value's guarantees break.
For ML practice, the p-value still appears in benchmark significance tests, A/B test reporting, and the headline summary of medical-trial papers. The rigorous practitioner reports the p-value, but also reports the design (fixed-$n$ or sequential?), the family (single hypothesis or many?), and either a confidence interval, an e-value, or a Bayes factor for the same null. The p-value alone is rarely enough.
References
Canonical:
- Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed., Springer). Chapters 3-5 develop the formal framework: simple and composite nulls, uniformly most powerful tests, the role of the p-value as a level function.
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference (Springer). Chapter 10 ("Hypothesis Testing and p-values") gives the standard textbook treatment.
- Casella, G. and Berger, R. L. (2002). Statistical Inference (2nd ed., Duxbury). Chapter 8 covers p-values, power, and the duality with confidence sets.
Current:
- Wasserstein, R. L. and Lazar, N. A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose." The American Statistician 70(2). The official American Statistical Association position on common misinterpretations.
- Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology 31(4). The companion guide, exhaustive on misinterpretation.
Frontier:
- Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Establishes the e-to-p and p-to-e calibration maps that link the two evidence frameworks.
- Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4). Position paper on why the sequential setting needs e-values rather than p-values.
Next Topics
- E-values: the modern evidence statistic that survives optional stopping.
- Confidence sequences: time-uniform confidence intervals built from e-processes.
- Anytime-valid inference: the framework e-values are designed for.
- p-hacking and multiple testing: the family-wise and false-discovery corrections that classical p-values need.
- Statistical significance and multiple comparisons: the broader replication-crisis context.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Random Variables (layer 0A, tier 1)
- Modes of Convergence of Random Variables (layer 0B, tier 1)
- Likelihood-Ratio, Wald, and Score Tests (layer 2, tier 1)
- Hypothesis Testing for ML (layer 2, tier 2)
- Neyman-Pearson and Hypothesis Testing Theory (layer 2, tier 2)
Derived topics
- e-values (layer 2, tier 1)
- Anytime-Valid Inference (layer 3, tier 1)