Sequential Inference
e-values
A nonnegative random variable whose expectation under the null is at most one. Reciprocals of e-values behave like p-values via Markov's inequality, with the structural advantage that products of conditional e-values remain valid evidence under a filtration. E-values were developed to replace p-values where optional stopping or selective combination is unavoidable.
Why This Matters
The $p$-value is the dominant evidence statistic for fixed-sample hypothesis tests. It breaks under optional stopping and combination across studies. An e-value is the natural replacement: a nonnegative test statistic $E$ with $\mathbb{E}_{H_0}[E] \le 1$. The constraint is much weaker than uniform distribution but stronger in two operational ways. First, Markov's inequality immediately gives $P_{H_0}(E \ge 1/\alpha) \le \alpha$, so reciprocals of e-values behave like (potentially conservative) $p$-values at level $\alpha$. Second, conditional e-values multiply: if each $E_t$ is an e-value conditionally on the filtration $\mathcal{F}_{t-1}$, then the running product $M_t = \prod_{s \le t} E_s$ is a nonnegative supermartingale under the null, and Ville's inequality gives a time-uniform Type I error guarantee. Optional stopping, peeking, and accumulating evidence across rounds become safe operations.
The framing has both technical and historical roots. Vovk and Wang (2021, Annals of Statistics) formalized e-values as the dual object to $p$-values for the optional-continuation problem. Shafer (2021) recast e-values as the payoff of a betting strategy against the null, recovering Wald's SPRT (1945) as a special case. Ramdas, Grünwald, Vovk, and Shafer (2023) named the resulting framework "game-theoretic statistics" and showed it generalizes the sequential analysis literature dating back to the 1940s.
The practical payoff for ML evaluation is direct. LLM benchmarking, RLHF reward shaping, and online A/B experiments all involve sequential evidence streams where the analyst peeks at results, decides to stop or continue, and reports a number. E-values give a frequentist guarantee that survives all three of these behaviors when applied correctly.
Formal Setup
e-value
Given a null hypothesis $H_0$ specifying a (possibly composite) family of distributions for the observed data, an e-value for $H_0$ is a nonnegative measurable function $E$ of the data such that $\mathbb{E}_P[E] \le 1$ for every $P \in H_0$. A test that rejects when $E \ge 1/\alpha$ has Type I error at most $\alpha$, by Markov's inequality.
Betting interpretation
Treat $H_0$ as a casino offering fair bets under the null. The analyst starts with one unit of capital and chooses, before each observation, a nonnegative payoff function with unit expectation under $H_0$. The capital after $n$ observations is an e-value: large capital is evidence against the null because $H_0$ predicted unit expected capital. The reciprocal $1/E$ plays the role of a $p$-value with the calibration "$1/E \le \alpha$ yields an $\alpha$-level test."
Likelihood-ratio e-value
For a simple null $P_0$ with density $p_0$ and any alternative density $p_1$, the likelihood ratio $E = p_1(X)/p_0(X)$ is an e-value, since $\mathbb{E}_{P_0}[p_1(X)/p_0(X)] = \int p_1 = 1$. Wald's sequential probability ratio test (SPRT) is exactly this construction applied sequentially.
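The unit-expectation property is easy to check by simulation. A minimal sketch (Gaussian null $N(0,1)$ against alternative $N(1,1)$; the names are illustrative):

```python
import numpy as np

# Monte Carlo check: under the null P0 = N(0, 1), the likelihood ratio
# p1(X)/p0(X) against P1 = N(1, 1) has expectation 1, so it is an e-value.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)  # draws from the null P0

def lr(x, mu1=1.0):
    # p1(x)/p0(x) for unit-variance Gaussians simplifies to
    # exp(mu1 * x - mu1**2 / 2)
    return np.exp(mu1 * x - mu1**2 / 2)

e = lr(x)
print(e.mean())  # close to 1.0 under the null
```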
The Markov Bound: e-to-p Calibration
The basic guarantee is one line of probability.
Markov Bound for e-values
Statement
Let $E$ be an e-value for the null $H_0$. Then for every $\alpha \in (0, 1]$, $P_{H_0}(E \ge 1/\alpha) \le \alpha$. Equivalently, the test that rejects when $E \ge 1/\alpha$ has Type I error at most $\alpha$.
Intuition
The expected payoff of $E$ under the null is at most $1$, so the chance of seeing a payoff much larger than $1$ is small. Markov's inequality makes this quantitative: probabilities of large deviations are bounded by the mean.
Proof Sketch
By Markov's inequality, $P(E \ge t) \le \mathbb{E}[E]/t \le 1/t$ for any $t > 0$. Set $t = 1/\alpha$.
Why It Matters
The reciprocal $p = \min(1, 1/E)$ is a (super-uniform) $p$-value: $P_{H_0}(p \le \alpha) \le \alpha$ for every $\alpha \in (0, 1]$. The conservativeness factor is the slack in Markov's inequality; for the likelihood-ratio e-value under the alternative, the conservativeness vanishes asymptotically by the Neyman-Pearson lemma.
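The calibration guarantee can be checked directly: under simulated null draws, the rejection rate of the test $E \ge 1/\alpha$ stays below $\alpha$, typically well below. A minimal sketch using the Gaussian likelihood-ratio e-value (illustrative setup):

```python
import numpy as np

# Check P_H0(E >= 1/alpha) <= alpha by simulation, where E is the
# likelihood-ratio e-value of one N(0,1) observation against N(1,1).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1_000_000)   # null draws
e = np.exp(1.0 * x - 0.5)                  # LR e-value against N(1,1)

for alpha in (0.5, 0.1, 0.05, 0.01):
    rate = (e >= 1 / alpha).mean()
    print(alpha, rate)  # rate <= alpha each time, with slack (Markov is loose)
```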
Failure Mode
The bound is loose by definition; it uses only the first moment. For Gaussian shifts, the $p$-value calibrated from the likelihood-ratio e-value is roughly the square of the optimal $z$-test $p$-value at the same significance level. This conservativeness is the price for distribution-free optional-stopping validity. When optional stopping is not a concern, the $z$-test is more powerful.
Comparison with p-values
The structural difference between p-values and e-values is what each guarantees.
| Property | $p$-value | e-value |
|---|---|---|
| Definition | $P_{H_0}(p \le \alpha) \le \alpha$ for all $\alpha \in (0,1]$ | $\mathbb{E}_{H_0}[E] \le 1$ |
| Direction | small is strong evidence | large is strong evidence |
| Type I error test | reject if $p \le \alpha$ | reject if $E \ge 1/\alpha$ |
| Combination | requires independence + meta-analysis tools | conditional products are valid (e-processes) |
| Optional stopping | breaks (uniformity lost) | preserved (supermartingale) |
| Sharpest test under alternative | optimal under Neyman-Pearson | conservative by factor 2-3 in typical Gaussian shifts |
| Calibration | e-to-p: $p = \min(1, 1/E)$ | p-to-e: $1/p$ is not an e-value; calibrators are nontrivial |
The headline trade-off: e-values give up power at any fixed sample size to gain validity at every stopping time simultaneously. The exchange is favorable whenever the cost of inflated Type I error under sequential peeking exceeds the power loss.
Canonical Example: Coin-Bias Evidence
Suppose $X_1, \dots, X_n$ are iid Bernoulli($\theta$) and we test $H_0: \theta = 1/2$ against $H_1: \theta = \theta_1$ for some fixed $\theta_1 > 1/2$.
The likelihood-ratio e-value after $n$ flips is $E_n = \theta_1^{S_n} (1 - \theta_1)^{n - S_n} / (1/2)^n$, where $S_n$ is the number of heads. Under $H_0$, $\mathbb{E}[E_n] = 1$ for every $n$; under $H_1$ the e-value grows exponentially in $n$ at rate equal to the KL divergence $D(\mathrm{Ber}(\theta_1) \,\|\, \mathrm{Ber}(1/2))$.
Numerical case: plug observed $n$, $S_n$, and the chosen $\theta_1$ into the formula above. Whenever $E_n \ge 20$, the calibrated $p$-value $1/E_n \le 0.05$ rejects at $\alpha = 0.05$. Under $H_0$ this rejection happens with probability at most $0.05$, regardless of whether the experiment stopped at this $n$ or at any other time.
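A concrete instance can be computed in a few lines. The sketch below uses illustrative values ($n = 100$, $\theta_1 = 0.6$, $S_n = 65$), chosen only for demonstration:

```python
import math

# Likelihood-ratio e-value for n Bernoulli flips: null theta = 1/2 versus a
# fixed alternative theta1. The inputs below are illustrative values.
def bernoulli_lr_evalue(n, heads, theta1):
    # E_n = theta1^S * (1 - theta1)^(n - S) / (1/2)^n, computed in log space
    log_e = (heads * math.log(theta1)
             + (n - heads) * math.log(1 - theta1)
             - n * math.log(0.5))
    return math.exp(log_e)

e = bernoulli_lr_evalue(100, 65, 0.6)
print(e, min(1.0, 1 / e))  # e ~ 57, calibrated p ~ 0.018: rejects at alpha = 0.05
```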
Beyond Likelihood Ratios: Other Construction Methods
The likelihood-ratio e-value is canonical but specialized. Several general constructions cover composite nulls:
- Universal inference (Wasserman, Ramdas, Balakrishnan 2020, Annals of Statistics): split the sample, fit the alternative on one half, evaluate the likelihood ratio on the other. The resulting statistic is an e-value for arbitrary composite nulls. Always applicable; loses a factor of 2 in power from the split.
- Betting strategies (Shafer 2021): construct a sequence of payoff functions adapted to the filtration. The capital path is an e-process; its value at any stopping time is an e-value.
- Reverse information projection (Grünwald, de Heide, Koolen 2024): for composite nulls, project the alternative onto the null in KL divergence. The resulting "RIPr" e-value is admissible in a precise decision-theoretic sense.
- GROW e-values (Grünwald, de Heide, and Koolen; "growth-rate-optimal in the worst case"): maximize the expected log-payoff against the worst-case alternative; produces e-values that grow fastest under composite alternatives.
For ML evaluation specifically, simple sample-mean-based e-values for bounded outcomes (clicks, win-rates) are constructed from Hoeffding-style bets and have closed-form intervals.
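A minimal sketch of such a bet for outcomes in $[0,1]$ (not the confseq API; the function name and the fixed bet size `lam` are illustrative choices). Each round multiplies capital by the factor $1 + \lambda(x_t - m_0)$, which is nonnegative for $0 \le \lambda \le 1/m_0$ and has expectation at most one under $H_0: \mathbb{E}[x] \le m_0$:

```python
import numpy as np

def betting_e_process(xs, m0=0.5, lam=0.5):
    # Capital path M_t = prod_{s<=t} (1 + lam * (x_s - m0)).
    # Requires 0 <= lam <= 1/m0 so every factor is nonnegative on [0, 1];
    # under H0: E[x] <= m0, each factor has expectation <= 1.
    factors = 1.0 + lam * (np.asarray(xs, dtype=float) - m0)
    return np.cumprod(factors)

rng = np.random.default_rng(2)
wins = (rng.random(500) < 0.65).astype(float)  # true win-rate 0.65 (illustrative)
path = betting_e_process(wins, m0=0.5, lam=0.5)
print(path[-1])  # grows large: evidence against H0: mean <= 0.5
```

The constant bet `lam` is the simplest choice; adaptive bet sizes (e.g. empirical-Bernstein style) grow faster and are what production libraries use.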
Common Misconceptions
E is not Pr(alternative is true)
An e-value of $20$ is not "twenty times more likely that $H_1$ is true." It is the payoff of a betting strategy that started with one unit and would have unit expected return under $H_0$. The relation to posterior probability requires a prior; when the alternative is a single fixed distribution, the likelihood-ratio e-value coincides with the Bayes factor, but generally they differ.
Product of marginal e-values is not an e-value
If $E_1$ and $E_2$ are e-values computed from the same or dependent data, $E_1 E_2$ is generally not an e-value: $\mathbb{E}_{H_0}[E_1 E_2] \le 1$ requires independence (or sequential conditioning), not merely unit expectation for each factor. The safe combination of arbitrarily dependent e-values is averaging, $\tfrac{1}{2}(E_1 + E_2)$ (still an e-value by linearity), or any convex weighting; multiplicative combination is reserved for independent experiments or for conditional e-values under a filtration, which is the e-process construction.
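The failure is easy to exhibit in the extreme case of perfectly dependent e-values (a minimal sketch; the Gaussian setup is illustrative):

```python
import numpy as np

# With dependent e-values (here: two identical copies computed from the same
# data), the product inflates the null expectation above 1, while the
# average stays a valid e-value by linearity.
rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)        # null draws from N(0, 1)
e1 = np.exp(x - 0.5)                  # LR e-value against N(1, 1)
e2 = e1                               # perfectly dependent copy

print(np.mean(e1 * e2))               # well above 1: the product is NOT an e-value
print(np.mean((e1 + e2) / 2))         # close to 1: the average is still an e-value
```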
E < 1 is not evidence for H_0
An e-value below $1$ means the betting strategy lost money against $H_0$. It is uninformative about $H_0$ being true; the test merely fails to refute $H_0$. Evidence for the null requires a different statistic (a posterior, a Bayes factor, or an e-value for testing the alternative).
p to e calibration is asymmetric
$1/p$ is not in general an e-value: under the null, $1/p$ has infinite expectation when $p$ is exactly uniform. The valid p-to-e calibrators (Vovk-Wang 2021) are of the form $f(p) = \kappa p^{\kappa - 1}$ for $\kappa \in (0, 1)$ and similar formulas, with growth controlled to keep $\int_0^1 f(p) \, dp \le 1$. e-to-p calibration via $p = \min(1, 1/E)$ is loss-free in the right direction.
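The asymmetry shows up immediately in simulation: the calibrator $f(p) = \kappa p^{\kappa - 1}$ keeps unit expectation under an exactly uniform $p$-value, while the naive $1/p$ blows up. A minimal sketch:

```python
import numpy as np

# Under the null, an exact p-value is uniform on [0, 1]. The calibrator
# f(p) = kappa * p**(kappa - 1) integrates to p**kappa on [0, 1], i.e. to 1,
# so f(p) is a valid e-value; the naive 1/p has infinite expectation.
rng = np.random.default_rng(4)
p = rng.random(1_000_000)             # exactly uniform p-values

kappa = 0.5
e = kappa * p ** (kappa - 1)          # Vovk-Wang style calibrator
print(np.mean(e))                     # close to 1.0: a valid e-value
print(np.mean(1 / p))                 # large and unstable: infinite-mean statistic
```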
Worked Exercise
Problem
A coin-flipping experiment tests $H_0: \theta = 1/2$ against a fixed alternative $H_1: \theta = \theta_1 > 1/2$. You run the experiment for $n$ flips and observe $S_n$ heads (pick concrete values, e.g. $\theta_1 = 0.6$, $n = 200$, $S_n = 120$). Compute the likelihood-ratio e-value, the calibrated $p$-value $\min(1, 1/E_n)$, and the classical $p$-value from the one-sided $z$-test. Comment on the gap.
Implementation Note
For a likelihood-ratio e-value, work in log space to avoid floating-point overflow on long sequences:
```python
import numpy as np

log_e = 0.0
for x in data_stream:
    # accumulate the log likelihood ratio; p0_density and p1_density are the
    # null and alternative density functions
    log_e += np.log(p1_density(x)) - np.log(p0_density(x))
# e = np.exp(log_e); reject when log_e >= np.log(1 / alpha)
```
Equivalent to the direct product in exact arithmetic, and numerically stable even for very long sequences.
For composite nulls or distributions without a clean likelihood, the betting-strategy construction is more flexible. The confseq Python package (companion to Howard et al. 2021) provides empirical-Bernstein-style e-processes for sample-mean estimation on bounded outcomes. It wraps the construction in a streaming API that returns the e-value, the calibrated p-value, and the current confidence sequence after each new observation.
For multiple testing with arbitrary dependence, the e-BH procedure (Wang-Ramdas 2022, Annals of Statistics) replaces Benjamini-Hochberg and controls FDR without independence assumptions; the input is a vector of e-values, the output is the rejection set.
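The e-BH rule is short enough to sketch (a minimal implementation of the published rule, not the authors' code): applying BH to the reciprocals $1/e_i$ amounts to rejecting the $\hat k$ largest e-values, where $\hat k = \max\{k : e_{[k]} \ge K/(\alpha k)\}$ and $e_{[k]}$ is the $k$-th largest of $K$ e-values.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    # e-BH: reject the k_hat hypotheses with the largest e-values, where
    # k_hat = max{k : e_[k] >= K / (alpha * k)}; controls FDR at level alpha
    # under arbitrary dependence among the e-values.
    e = np.asarray(e_values, dtype=float)
    K = len(e)
    order = np.argsort(-e)                 # indices sorted by decreasing e-value
    sorted_e = e[order]
    ks = np.arange(1, K + 1)
    ok = sorted_e >= K / (alpha * ks)
    if not ok.any():
        return np.array([], dtype=int)     # nothing rejected
    k_hat = ks[ok].max()
    return np.sort(order[:k_hat])          # rejected hypothesis indices

print(e_bh([100.0, 50.0, 1.0, 0.5, 200.0], alpha=0.1))  # -> [0 1 4]
```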
Practical Example: LLM Evaluation
Standard LLM evaluation runs benchmark items and reports an accuracy number with a Wald confidence interval. Under continuous evaluation (new items arrive, the analyst peeks, decides to keep evaluating or stop), the Wald interval is anti-conservative. An e-value-based approach:
- Define $H_0: \mu \le \mu_0$, where $\mu_0$ is the accuracy threshold for "passes".
- Construct an e-value via the betting-strategy construction over the running mean.
- Continue evaluation until $E_t \ge 1/\alpha$ (for, say, $\alpha = 0.05$) or the budget is exhausted.
The Type I error guarantee holds across the entire stopping rule. If the model genuinely scores at $\mu_0$, the procedure falsely rejects at most an $\alpha$ fraction of the time; if the model scores higher, the e-value grows exponentially and the procedure stops early. The OpenAI Evals codebase and the open-source evalsync tooling have begun adopting this framing for adaptive benchmark sizing.
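The three steps above can be sketched as a stopping loop (hypothetical names, not any library's API; the constant bet size `lam` and the simulated accuracy are illustrative assumptions):

```python
import numpy as np

def sequential_eval(outcomes, mu0=0.7, alpha=0.05, lam=0.5):
    # Betting e-process against H0: accuracy <= mu0. Each correct/incorrect
    # outcome x in {0, 1} multiplies capital by the fair bet 1 + lam*(x - mu0);
    # stop and reject as soon as E_t >= 1/alpha.
    e = 1.0
    for t, x in enumerate(outcomes, start=1):
        e *= 1.0 + lam * (x - mu0)
        if e >= 1 / alpha:
            return t, e                    # early stop: reject H0
    return None, e                         # budget exhausted: no rejection

rng = np.random.default_rng(5)
outcomes = (rng.random(2000) < 0.8).astype(float)  # simulated accuracy 0.8
stop, e = sequential_eval(outcomes)
print(stop, e)  # stops well before the 2000-item budget
```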
References
Canonical:
- Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Foundational paper defining e-values, calibrators, and the e-to-p and p-to-e maps.
- Shafer, G. (2021). "Testing by betting: A strategy for statistical and scientific communication." Journal of the Royal Statistical Society, Series A 184(2), pp. 407-431. The betting interpretation, with discussion contributions from Vovk, Grünwald, and Wasserman.
- Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey of the framework as of late 2022, mapping every classical sequential test to the e-value/e-process language.
Current:
- Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). "Universal inference." Proceedings of the National Academy of Sciences 117(29). Split-sample construction of e-values for composite nulls.
- Grünwald, P., de Heide, R., and Koolen, W. (2024). "Safe testing." Journal of the Royal Statistical Society, Series B. Construction of admissible e-values via reverse information projection (RIPr).
- Wang, R. and Ramdas, A. (2022). "False discovery rate control with e-values." Journal of the Royal Statistical Society, Series B 84(3). The e-BH procedure for multiple testing under arbitrary dependence.
Historical:
- Wald, A. (1945). "Sequential tests of statistical hypotheses." Annals of Mathematical Statistics 16(2). The SPRT, which is exactly the likelihood-ratio e-process with two stopping boundaries.
Next Topics
- e-processes: the sequential version, where a running product of conditional e-values is a nonnegative supermartingale.
- Confidence sequences: the interval estimates derived from e-processes, valid at every sample size.
- Anytime-valid inference: the framing of inference under continuous monitoring.
- Safe testing: the Grünwald-de Heide-Koolen formal framework built on e-values.
- E-values and anytime-valid inference: the umbrella reference page with proofs, applications, and the e-BH multiple-testing procedure.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- Random Variables (layer 0A · tier 1)
- Concentration Inequalities (layer 1 · tier 1)
- Likelihood-Ratio, Wald, and Score Tests (layer 2 · tier 1)
- p-values (layer 2 · tier 1)
Derived topics
- Confidence Sequences (layer 2 · tier 1)
- Anytime-Valid Inference (layer 3 · tier 1)
- e-processes (layer 3 · tier 1)
- Safe Testing (layer 3 · tier 1)
Graph-backed continuations