Sequential Inference
E-Values and Anytime-Valid Inference
A framework for hypothesis testing whose validity survives optional stopping, optional continuation, and data-dependent peeking. E-values are nonnegative random variables with expectation at most one under the null; Ville's inequality makes the guarantee anytime-valid.
Why This Matters
Every classical hypothesis test fixes the sample size in advance. The $p$-value guarantee, $P_{H_0}(p \le \alpha) \le \alpha$, breaks the moment you peek at the data and stop early when the result looks good. This is the reason online A/B tests, adaptive clinical trials, and rolling LLM evaluations produce elevated false-positive rates in practice. The analyst follows the mathematics of a fixed-$n$ test but behaves as if they were running a sequential test, and the gap is filled with Type I errors.
E-values and anytime-valid inference close the gap. An e-value $E$ is a nonnegative statistic whose expectation under the null is at most one. By Markov's inequality, $E$ acts like a $p$-value in the sense that $P_{H_0}(E \ge 1/\alpha) \le \alpha$. The key constructive fact is that a sequentially built product of conditional e-values (equivalently, the running product of likelihood ratios with respect to the filtration $(\mathcal{F}_t)_{t \ge 0}$) is a nonnegative martingale or supermartingale under the null, hence an e-process. This gives a test that is valid at every stopping time simultaneously, not just at the sample size you had originally in mind. Peeking becomes a feature, not a failure mode. Arbitrary products of marginal (unconditional) e-values across experiments are not in general e-values, because $\mathbb{E}[E_1 E_2] \ne \mathbb{E}[E_1]\,\mathbb{E}[E_2]$ without independence; combining evidence across experiments without conditional structure relies on arithmetic averaging or weighted means, which do preserve the e-value property.
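A minimal simulation sketch of both behaviours (Python with NumPy; the drift $\delta = 0.5$, the 1000-step horizon, and the Gaussian stream are illustrative assumptions, not any standard benchmark): repeatedly applying a fixed-$n$ z-test at every step inflates the Type I error far above $\alpha$, while thresholding a likelihood-ratio test martingale at $1/\alpha$ remains valid under exactly the same peeking.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_steps, n_sims, delta = 0.05, 1000, 2000, 0.5
z_crit = 1.96  # two-sided fixed-n critical value at alpha = 0.05

naive_rejections = 0
martingale_rejections = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_steps)  # data generated under the null N(0, 1)
    # (a) Peek at every step with a fixed-n z-test: invalid under optional stopping.
    z = np.cumsum(x) / np.sqrt(np.arange(1, n_steps + 1))
    naive_rejections += np.any(np.abs(z) > z_crit)
    # (b) Test martingale: running product of likelihood ratios N(delta, 1) / N(0, 1).
    log_m = np.cumsum(delta * x - delta**2 / 2)
    martingale_rejections += np.any(log_m >= np.log(1 / alpha))

print("naive peeking Type I error:   ", naive_rejections / n_sims)       # well above 0.05
print("test-martingale Type I error: ", martingale_rejections / n_sims)  # at most 0.05
```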
The regulatory case is live: the FDA's January 2026 draft Bayesian methodology guidance is openly receptive to non-traditional success criteria, which is the opening that e-value methods have been waiting for. The tech case has been live for a decade and is finally being taught.
Formal Setup
Let $H_0$ be a null hypothesis specifying a set $\mathcal{P}_0$ of probability measures on a data stream $X_1, X_2, \ldots$ Fix a significance level $\alpha \in (0, 1)$.
E-Value
A nonnegative random variable $E$ is an e-value for $H_0$ if $\mathbb{E}_P[E] \le 1$ for every $P \in \mathcal{P}_0$. By Markov, $P(E \ge 1/\alpha) \le \alpha$ for every $P \in \mathcal{P}_0$.
The level-$\alpha$ test rejects when $E \ge 1/\alpha$.
E-Process
A sequence $(E_t)_{t \ge 0}$ adapted to a filtration $(\mathcal{F}_t)_{t \ge 0}$ is an e-process for $H_0$ if for every stopping time $\tau$ (including $\tau = \infty$) and every $P \in \mathcal{P}_0$, $\mathbb{E}_P[E_\tau] \le 1$.
Equivalently, $(E_t)$ is upper-bounded by a nonnegative supermartingale with initial value at most one under every $P \in \mathcal{P}_0$. When $(E_t)$ is itself a nonnegative martingale with $E_0 = 1$ under every $P \in \mathcal{P}_0$, it is a test martingale.
Ville's Inequality
Ville's Inequality
Statement
For any nonnegative supermartingale $(M_t)_{t \ge 0}$ with $\mathbb{E}[M_0] \le 1$ under $P$, and any $\alpha \in (0, 1)$,
$$P\Big(\sup_{t \ge 0} M_t \ge \tfrac{1}{\alpha}\Big) \le \alpha.$$
Consequently, the test "reject when $M_t \ge 1/\alpha$ at any $t$" has Type I error at most $\alpha$ uniformly over all stopping times.
Intuition
A nonnegative supermartingale with expected starting value at most one cannot consistently grow large, because the expected value is bounded forever. Markov's inequality applied at a maximum (rather than at a fixed time) gives a maximal inequality. This single bound handles every continuation strategy: no matter how the analyst decides when to stop, the supremum control holds.
Proof Sketch
Let $\tau = \inf\{t : M_t \ge 1/\alpha\}$ and apply optional stopping to $M_{\tau \wedge T}$ with any bounded horizon $T$. The stopped expectation is bounded by $\mathbb{E}[M_0] \le 1$, and on $\{\tau \le T\}$ the stopped value is at least $1/\alpha$. Markov gives $P(\tau \le T) \le \alpha$. Let $T \to \infty$ by monotone convergence.
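A quick Monte Carlo check of the bound (NumPy; the multiplicative random walk and all parameter choices below are illustrative assumptions): multiplying i.i.d. factors of $0.5$ or $1.5$ with equal probability gives a nonnegative martingale with $M_0 = 1$, and the estimated probability of ever reaching $1/\alpha$ should sit at or below $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_steps, n_sims = 0.05, 1000, 5000

# Nonnegative martingale with M_0 = 1: multiply i.i.d. factors with mean one.
factors = rng.choice([0.5, 1.5], size=(n_sims, n_steps))
m = np.cumprod(factors, axis=1)
crossing_rate = (m.max(axis=1) >= 1 / alpha).mean()

print(f"estimated P(sup_t M_t >= {1/alpha:.0f}): {crossing_rate:.4f}  (Ville bound: {alpha})")
```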
Why It Matters
This is the mathematical engine underneath every anytime-valid procedure. It converts the fixed-sample Markov inequality into a sample-size-free maximal bound, at the cost of requiring the statistic to be a supermartingale under the null. Building such a supermartingale is the constructive content of the theory.
Failure Mode
The supermartingale property is with respect to a specific filtration; test statistics that peek at future data, or that are not adapted, violate it silently. Composite nulls require the supermartingale property to hold simultaneously under every null distribution, which is where universal inference and reverse information projection (RIPr) come in.
Construction Methods
Likelihood ratios. When $H_0 = \{P_0\}$ is simple and $Q$ is an alternative, with densities $p_0$ and $q$, the likelihood ratio $E_t = \prod_{i=1}^{t} q(X_i)/p_0(X_i)$ is a $P_0$-test martingale. This recovers Wald's sequential probability ratio test as a special case.
Universal inference (Wasserman, Ramdas, Balakrishnan 2020). For any composite null $\mathcal{P}_0$, split the data into two halves $D_0$ and $D_1$. Fit a maximum likelihood estimator $\hat{q}$ for the alternative on the first half and use
$$E = \frac{\prod_{X_i \in D_1} \hat{q}(X_i)}{\sup_{P \in \mathcal{P}_0} \prod_{X_i \in D_1} p(X_i)},$$
evaluated on the second half. This is an e-value with no regularity conditions, trading efficiency for universal validity (a concrete sketch appears after this list).
Reverse information projection (RIPr) (Grünwald, de Heide, Koolen 2024). Take the reverse KL projection of the alternative onto the null, giving a minimax-optimal e-value within a specified class.
Method of mixtures and betting martingales (Shafer 2021). Mixing the likelihood ratio over a prior on the alternative parameter yields a single test martingale that is simultaneously valid against every alternative in the prior's support. Equivalently, interpret the e-process as the wealth of a player in a sequential betting game against the null, with the null's best strategy being to offer fair odds. The analyst bets; wealth accumulates when the null is misspecified.
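A hedged sketch of the universal-inference (split-LRT) construction for one concrete case: a unit-variance Gaussian stream with composite null $H_0: \mu \le 0$. The 50/50 split, the sample sizes, and the example means are arbitrary choices for illustration, not part of the original construction.

```python
import numpy as np

def universal_inference_evalue(x):
    """Split-LRT e-value for the composite null H0: mu <= 0 with unit-variance Gaussian data.

    Numerator: likelihood of the second half under the alternative MLE fitted on the first half.
    Denominator: likelihood of the second half maximized over the null (mu <= 0).
    """
    d0, d1 = x[: len(x) // 2], x[len(x) // 2 :]
    mu_alt = d0.mean()                 # alternative fit uses only the first half
    mu_null = min(d1.mean(), 0.0)      # null MLE on the second half, constrained to mu <= 0
    # Gaussian log-likelihood ratio on the second half (unit variance, constants cancel).
    log_e = np.sum((d1 - mu_null) ** 2 / 2 - (d1 - mu_alt) ** 2 / 2)
    return np.exp(log_e)

rng = np.random.default_rng(2)
print("null true (mu = -0.1):", universal_inference_evalue(rng.normal(-0.1, 1.0, 400)))
print("alt  true (mu =  0.5):", universal_inference_evalue(rng.normal(0.5, 1.0, 400)))
# Reject H0 at level alpha when the e-value exceeds 1/alpha, e.g. 20 for alpha = 0.05.
```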
p-to-e and e-to-p Calibration
Vovk and Wang (2021) characterize the admissible calibrators converting between the two. A function $f : [0, 1] \to [0, \infty]$ is a valid p-to-e calibrator if $\mathbb{E}[f(p)] \le 1$ for every $p$-value that is super-uniform under $H_0$; equivalently, if $f$ is non-increasing with $\int_0^1 f(u)\,du \le 1$.
- p-to-e: Two valid examples are $f(p) = \tfrac{1}{2\sqrt{p}}$ (which has $\int_0^1 f(u)\,du = 1$) and the family $f_\kappa(p) = \kappa\, p^{\kappa - 1}$ for $\kappa \in (0, 1)$. The proposal $f(p) = 1/p$ does not integrate to a finite value over $[0, 1]$ because $\int_0^1 u^{-1}\,du$ diverges, so it is not a valid calibrator without additional truncation or restriction.
- e-to-p: $p = \min(1, 1/E)$ is a valid p-value from any valid e-value $E$ (Markov's inequality applied to $E$).
The conversions are lossy in an expected-value sense; the standalone strength of e-values is for sequential combination via running products of conditional e-values, and for arithmetic averaging of e-values across arbitrarily dependent experiments. The analogous operations on p-values do not preserve the validity guarantee.
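Both directions in code (NumPy; the uniform-$p$ Monte Carlo check is only an illustration of the validity condition, and the default $\kappa = 1/2$ is a common but arbitrary choice):

```python
import numpy as np

def p_to_e(p, kappa=0.5):
    """Calibrator f(p) = kappa * p**(kappa - 1), valid for any kappa in (0, 1)."""
    return kappa * np.asarray(p) ** (kappa - 1.0)

def e_to_p(e):
    """Markov calibrator: p = min(1, 1/e) is a valid p-value for any e-value e."""
    return np.minimum(1.0, 1.0 / np.asarray(e))

# Validity check: with p uniform (the exact null case), the calibrated e-value has mean <= 1.
rng = np.random.default_rng(4)
p = rng.uniform(size=1_000_000)
print("mean calibrated e-value under the null:", p_to_e(p).mean())  # approximately 1
print("e = 40 converts to p =", e_to_p(40.0))                       # 0.025
```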
Confidence Sequences
Inverting an e-process for $H_0: \theta = \theta_0$ over all $\theta_0$ gives a confidence sequence: a random sequence of sets $(C_t)_{t \ge 1}$ such that $P\big(\theta \in C_t \text{ for all } t \ge 1\big) \ge 1 - \alpha$. The guarantee is uniform in time. Howard, Ramdas, McAuliffe, Sekhon (2021) give time-uniform nonparametric confidence sequences for bounded means, quantiles, and Bernoulli rates.
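A sketch of the inversion for a Bernoulli mean, instantiating the method-of-mixtures construction above with a Beta(1, 1) prior and a grid search over candidate values $\theta_0$; the prior, the grid resolution, and the simulated data are illustrative assumptions, not the boundaries of Howard et al.

```python
import numpy as np
from scipy.special import betaln

def mixture_confidence_sequence(x, alpha=0.05, a=1.0, b=1.0, grid_size=2000):
    """(1 - alpha) confidence sequence for a Bernoulli mean: at each time t, keep every
    candidate theta0 whose Beta(a, b) mixture martingale is below 1/alpha at time t."""
    grid = np.linspace(1e-4, 1 - 1e-4, grid_size)
    t = np.arange(1, len(x) + 1)
    s = np.cumsum(x)                                    # running success counts
    # log M_t(theta0) = log B(a+s, b+t-s) - log B(a, b) - [s log theta0 + (t-s) log(1-theta0)]
    log_mix = betaln(a + s, b + t - s) - betaln(a, b)
    log_null = np.outer(s, np.log(grid)) + np.outer(t - s, np.log1p(-grid))
    log_m = log_mix[:, None] - log_null
    keep = log_m < np.log(1 / alpha)                    # theta0 still plausible at time t
    lower = np.where(keep.any(axis=1), grid[keep.argmax(axis=1)], np.nan)
    upper = np.where(keep.any(axis=1), grid[grid_size - 1 - keep[:, ::-1].argmax(axis=1)], np.nan)
    return lower, upper

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, 500)                           # stream with true mean 0.3
lo, hi = mixture_confidence_sequence(x)
print("CS at t = 50: ", (round(lo[49], 3), round(hi[49], 3)))
print("CS at t = 500:", (round(lo[499], 3), round(hi[499], 3)))
```

The intervals shrink as data accrue, and the simultaneous coverage over all $t$ is exactly what Ville's inequality buys.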
Multiple Testing: The e-BH Procedure
e-BH Controls FDR Under Arbitrary Dependence
Statement
For e-values $E_1, \ldots, E_K$ corresponding to $K$ null hypotheses, let $E_{[1]} \ge \cdots \ge E_{[K]}$ denote them in decreasing order. The e-BH procedure rejects the hypotheses with the $k^*$ largest e-values, where $k^*$ is the largest index $k$ such that $E_{[k]} \ge \frac{K}{k\alpha}$. This procedure controls the false discovery rate at level $\alpha$ under arbitrary dependence among the e-values.
Intuition
The BH procedure for p-values requires positive regression dependence (PRDS) to control FDR. e-values replace that assumption with a structural property: a weighted average of e-values is an e-value, regardless of dependence. The FDR bound follows from a single application of Markov to the ratio of false discoveries to total discoveries.
Why It Matters
In LLM evaluation and multi-arm A/B testing, dependence among hypotheses is the rule, not the exception, and PRDS is rarely verifiable. e-BH gives FDR control with no dependence assumption, at the cost of slightly wider rejection regions than BH under favourable dependence.
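A direct implementation sketch of the rule as stated above; the e-values in the usage line are hypothetical numbers, not from any dataset.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """e-BH: reject the k* hypotheses with the largest e-values, where k* is the largest k
    such that the k-th largest e-value is at least K / (k * alpha).
    FDR <= alpha under arbitrary dependence."""
    e = np.asarray(e_values, dtype=float)
    K = len(e)
    order = np.argsort(-e)                        # hypothesis indices, sorted by decreasing e-value
    thresholds = K / (alpha * np.arange(1, K + 1))
    passing = np.nonzero(e[order] >= thresholds)[0]
    k_star = passing[-1] + 1 if len(passing) else 0
    return np.sort(order[:k_star])                # indices of rejected hypotheses

e_vals = [250.0, 80.0, 18.0, 3.0, 0.7, 1.2]       # hypothetical e-values for six hypotheses
print("rejected hypotheses:", e_bh(e_vals, alpha=0.05))   # [0 1]
```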
Applications
Adaptive clinical trials. The R package evalinger and its successors implement e-process monitoring with futility stops and platform-trial multiplicity control. Regulatory adoption is the live frontier (FDA draft guidance, January 2026).
Online A/B testing. Confidence sequences replace fixed-$n$ tests at Netflix, Adobe, Optimizely, and similar shops. Any decision made "when the result looks stable" is an implicit sequential test, and a confidence sequence is the honest version.
LLM evaluation. Leaderboard comparisons are sequential by design: new models arrive, benchmarks are rerun, the top-$k$ ordering shifts. Selective inference on the final ranking requires e-processes, not $p$-values.
Relationship to Bayes and Classical Sequential Analysis
Bayes factors are e-values from the null's perspective: for a simple null, the expectation of the Bayes factor (alternative marginal over null likelihood) under the null is exactly one. Wald's sequential probability ratio test is built on a likelihood-ratio test martingale. The modern e-value framework unifies these threads and adds explicit universal-inference and RIPr constructions for composite nulls that the classical theories handle only clumsily.
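A small numerical check of that identity for a Bernoulli model (the uniform prior, sample size, and simulation count are arbitrary choices for illustration): under the null $\theta = 1/2$, the average Bayes factor is close to one, which is exactly the e-value property.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
n, n_sims = 10, 200_000
k = rng.binomial(n, 0.5, n_sims)      # success counts generated under the null theta = 1/2

# Bayes factor for H1: theta ~ Uniform(0, 1) versus H0: theta = 1/2.
# Under a uniform prior the marginal likelihood of k successes is 1 / (n + 1) for every k.
bf = (1.0 / (n + 1)) / binom.pmf(k, n, 0.5)

print("E[Bayes factor | H0] ≈", bf.mean())   # close to 1, as the e-value property requires
```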
Exercises
Problem
A classical $p$-value satisfies $P_{H_0}(p \le \alpha) \le \alpha$ at a fixed sample size. An analyst peeks after every new batch of data and stops at the first peek with $p \le \alpha$. Under $H_0$, roughly what is the Type I error rate after $k$ peeks, assuming the $p$-values at the different peeks are independent?
Problem
Build a test martingale for a point null $H_0: \theta = \theta_0$ against an unspecified alternative on an i.i.d. Bernoulli stream, using a Beta mixture prior on the alternative parameter. Explain why the mixture construction gives a universally valid e-process.
Problem
State the safe-testing characterization of Grünwald, de Heide, Koolen (2024): under what conditions does the reverse information projection give a GROW-optimal e-value, and what goes wrong for composite nulls where RIPr does not exist in closed form?
Open Problems and Frontier
Constructing tight e-values for composite nulls without a closed-form RIPr is a central open line, with partial progress via numerical RIPr and GROW approximations.
Time-uniform CLT (Waudby-Smith 2024-26) extends e-values into settings that previously required fixed-$n$ asymptotics, including conditional independence testing without Model-X, observational causal inference, and semiparametric inference more broadly.
Anytime-valid conformal prediction via e-value-based prediction sets is an active frontier; the split-conformal guarantees extend to sequential prediction when the calibration scores generate a test martingale.
Regulatory adoption. The FDA January 2026 draft Bayesian guidance is the policy opening. e-value methods map cleanly onto the guidance's flexibility for non-traditional success criteria. Practical adoption in platform trials, futility monitoring, and adaptive design is the work of the next 2-3 years.
Hypothesis testing with e-values (Ramdas, Wang 2025, Waterloo book draft) is the first textbook treatment; field-wide adoption will track the book's completion.
References
Foundational:
- Ville, Étude Critique de la Notion de Collectif (Gauthier-Villars, 1939). The original maximal inequality.
- Shafer, Vovk, Probability and Finance: It's Only a Game! (Wiley, 2001). Chapters 3-5.
Modern e-value theory:
- Ramdas, Grünwald, Vovk, Shafer, "Game-Theoretic Statistics and Safe Anytime-Valid Inference." Statistical Science 38(4) (2023), 576-601. The canonical survey.
- Vovk, Wang, "E-Values: Calibration, Combination and Applications." Annals of Statistics 49(3) (2021), 1736-1754.
- Grünwald, de Heide, Koolen, "Safe Testing." Journal of the Royal Statistical Society B 86(4) (2024), 1091-1128.
- Shafer, "Testing by Betting: A Strategy for Statistical and Scientific Communication." Journal of the Royal Statistical Society A 184(2) (2021), 407-431.
Confidence sequences and time-uniform bounds:
- Howard, Ramdas, McAuliffe, Sekhon, "Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences." Annals of Statistics 49(2) (2021), 1055-1080.
- Waudby-Smith, Ramdas, "Estimating Means of Bounded Random Variables by Betting." Journal of the Royal Statistical Society B 86(1) (2024), 1-27.
Multiple testing:
- Wang, Ramdas, "False Discovery Rate Control with E-Values." Journal of the Royal Statistical Society B 84(3) (2022), 822-852.
- Wasserman, Ramdas, Balakrishnan, "Universal Inference." Proceedings of the National Academy of Sciences 117(29) (2020), 16880-16890.
Textbook (in preparation):
- Ramdas, Wang, Hypothesis Testing with E-Values. Book draft, University of Waterloo, 2025.
Next Topics
- Split conformal prediction: the companion uncertainty-quantification framework, with anytime-valid extensions an open line.
- Hypothesis testing for ML: the fixed-sample baseline that e-values replace.
- Martingale theory: the mathematical substrate; Ville's inequality is a maximal inequality for nonnegative supermartingales.
Last reviewed: April 26, 2026
Prerequisites
Foundations this topic depends on.
- Measure-Theoretic Probability (Layer 0B)
- Martingale Theory (Layer 0B)
- Hypothesis Testing for ML (Layer 2)
- Neyman-Pearson and Hypothesis Testing Theory (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (Layer 0B)
- Differentiation in Rⁿ (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Continuity in Rⁿ (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Central Limit Theorem (Layer 0B)
- Law of Large Numbers (Layer 0B)
- Random Variables (Layer 0A)
- Kolmogorov Probability Axioms (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- KL Divergence (Layer 1)
- Information Theory Foundations (Layer 0B)