
Sequential Inference

e-values

A nonnegative random variable whose expectation under the null is at most one. Reciprocals of e-values behave like p-values via Markov's inequality, with the structural advantage that products of conditional e-values remain valid evidence under filtration. E-values were developed to replace p-values where optional stopping or selective combination is unavoidable.

Important · Advanced · Tier 1 · Current · Core spine · ~50 min
For: ML, Stats, Research

Why This Matters

The p-value is the dominant evidence statistic for fixed-sample hypothesis tests. It breaks under optional stopping and combination across studies. An e-value is the natural replacement: a nonnegative test statistic $E$ with $\mathbb{E}_{H_0}[E] \leq 1$. The constraint differs from the uniformity required of a p-value, and it pays off in two operational ways. First, Markov's inequality immediately gives $\Pr_{H_0}(E \geq 1/\alpha) \leq \alpha$, so reciprocals of e-values behave like (potentially conservative) p-values at level $\alpha$. Second, conditional e-values multiply: if each $E_t$ is an e-value conditionally on the past with respect to a filtration $(\mathcal{F}_t)$, then the running product is a nonnegative supermartingale under the null, and Ville's inequality gives a time-uniform Type I error guarantee. Optional stopping, peeking, and accumulating evidence across rounds become safe operations.

The framing has both technical and historical roots. Vovk and Wang (2021, Annals of Statistics) formalized e-values as the dual object to pp-values for the optional-continuation problem. Shafer (2021) recast e-values as the payoff of a betting strategy against the null, recovering Wald's SPRT (1945) as a special case. Ramdas, Grünwald, Vovk, and Shafer (2023) named the resulting framework "game-theoretic statistics" and showed it generalizes the sequential analysis literature dating back to the 1940s.

The practical payoff for ML evaluation is direct. LLM benchmarking, RLHF reward shaping, and online A/B experiments all involve sequential evidence streams where the analyst peeks at results, decides to stop or continue, and reports a number. E-values give a frequentist guarantee that survives all three of these behaviors when applied correctly.

Formal Setup

Definition

e-value

Given a null hypothesis $H_0$ specifying a (possibly composite) family of distributions for the observed data, an e-value for $H_0$ is a nonnegative measurable function $E$ of the data such that $\sup_{P \in H_0} \mathbb{E}_P[E] \leq 1$. A test that rejects $H_0$ when $E \geq 1/\alpha$ has Type I error at most $\alpha$, by Markov's inequality.

Definition

Betting interpretation

Treat $H_0$ as a casino offering fair bets under the null. The analyst starts with one unit of capital and chooses, before each observation, a nonnegative payoff function with unit expectation under $H_0$. The capital after $n$ observations is an e-value: large capital is evidence against the null because $H_0$ predicted unit expected capital. The reciprocal $1/E$ plays the role of a p-value, calibrated so that rejecting when $1/E \leq \alpha$ is an $\alpha$-level test.

Definition

Likelihood-ratio e-value

For a simple null $P_0$ and any alternative with density $p_1$, the likelihood ratio $E(X) = p_1(X)/p_0(X)$ is an e-value, since $\mathbb{E}_{P_0}\!\left[\tfrac{p_1(X)}{p_0(X)}\right] = \int p_1(x)\, dx = 1$. Wald's sequential probability ratio test (SPRT) is exactly this construction applied sequentially.

The Markov Bound: e-to-p Calibration

The basic guarantee is one line of probability.

Theorem

Markov Bound for e-values

Statement

Let $E$ be an e-value for the null $H_0$. Then for every $\alpha \in (0, 1)$, $\Pr_{H_0}\!\left[E \geq \tfrac{1}{\alpha}\right] \leq \alpha$. Equivalently, the test that rejects $H_0$ when $E \geq 1/\alpha$ has Type I error at most $\alpha$.

Intuition

The expected payoff of $E$ under the null is at most $1$, so the chance of seeing a payoff much larger than $1$ is small. Markov's inequality makes this quantitative: probabilities of large deviations are bounded by the mean.

Proof Sketch

By Markov, $\Pr_{H_0}(E \geq c) \leq \mathbb{E}_{H_0}[E]/c \leq 1/c$ for any $c > 0$. Set $c = 1/\alpha$.

Why It Matters

The reciprocal $1/E$ is a (super-uniform) p-value: $\Pr_{H_0}(1/E \leq \alpha) = \Pr_{H_0}(E \geq 1/\alpha) \leq \alpha$. The conservativeness factor is the slack in Markov's inequality; for the likelihood-ratio e-value under the alternative, the conservativeness vanishes asymptotically by the Neyman-Pearson lemma.

Failure Mode

The bound is loose by construction; it uses only the first moment. For Gaussian shifts, the p-value calibrated from the likelihood-ratio e-value is roughly the square of the optimal z-test p-value at the same significance level. This conservativeness is the price for distribution-free optional-stopping validity. When optional stopping is not a concern, the z-test is more powerful.
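Both the Markov guarantee and its looseness show up in a quick simulation (a sketch; the choices of $\alpha$, $n$, and the betting alternative $q$ are illustrative):

```python
import numpy as np

# Simulate the likelihood-ratio e-value for a fair coin (the null holds)
# and check the rejection rate of the test "reject when E >= 1/alpha".
rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 50, 20_000
q = 0.6  # alternative bias used to build the e-value

heads = rng.binomial(n, 0.5, size=trials)  # data generated under H0
log_e = heads * np.log(2 * q) + (n - heads) * np.log(2 * (1 - q))
reject_rate = np.mean(log_e >= np.log(1 / alpha))
# reject_rate sits far below alpha = 0.05: Markov's slack is visible
```

The empirical rejection rate lands well under the nominal $5\%$, which is the conservativeness the failure mode describes.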

Comparison with p-values

The structural difference between p-values and e-values is what each guarantees.

Property | p-value | e-value
Definition | $\Pr_{H_0}(p \leq u) \leq u$ for all $u$ | $\mathbb{E}_{H_0}[E] \leq 1$
Direction | small is strong evidence | large is strong evidence
Type I error test | reject if $p \leq \alpha$ | reject if $E \geq 1/\alpha$
Combination | requires independence + meta-analysis tools | conditional products are valid (e-processes)
Optional stopping | breaks (uniformity lost) | preserved (supermartingale)
Sharpest test under alternative | optimal under Neyman-Pearson | conservative by factor 2-3 in typical Gaussian shifts
Calibration | $E \to p$: $p = 1/E$ is loss-free | $p \to E$: $E = 1/p$ is not an e-value; calibrators are nontrivial

The headline trade-off: e-values give up power at any fixed $n$ to gain validity at every stopping time simultaneously. The exchange is favorable whenever the cost of inflated Type I error under sequential peeking exceeds the power loss.

Canonical Example: Coin-Bias Evidence

Suppose $X_1, X_2, \ldots$ are iid Bernoulli$(\theta)$ and we test $H_0: \theta = 1/2$ against $H_1: \theta = q$ for some fixed $q > 1/2$.

The likelihood-ratio e-value after $n$ flips is $E_n = \prod_{i=1}^n \frac{p_q(X_i)}{p_{1/2}(X_i)} = (2q)^{S_n}\,(2(1-q))^{n - S_n}$, where $S_n = \sum_i X_i$ is the number of heads. Under $H_0$, $\mathbb{E}[E_n] = 1$ for every $n$; under $H_1$ the e-value grows exponentially in $n$ at rate equal to the KL divergence $\mathrm{KL}(q \,\|\, 1/2)$.

Numerical case: $n = 100$ flips, $q = 0.6$, observed $S_n = 60$. Then $E_{100} = (1.2)^{60}\,(0.8)^{40} \approx 7.5$. The calibrated p-value $1/E_{100} \approx 0.13$ does not reject at $\alpha = 0.05$, even though the classical one-sided p-value for 60 heads is roughly $0.03$: the gap is the price of anytime validity. Had $E_n$ reached $1/\alpha = 20$, the rejection would have been valid at level $5\%$ regardless of whether the experiment stopped at $n = 100$ or at any other time.
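The coin computation is a one-liner in log space (a sketch using `numpy` only):

```python
import numpy as np

# Likelihood-ratio e-value for H0: theta = 1/2 vs H1: theta = q,
# computed in log space to avoid overflow on long sequences.
n, q, S = 100, 0.6, 60
log_e = S * np.log(2 * q) + (n - S) * np.log(2 * (1 - q))
e_value = float(np.exp(log_e))        # ≈ 7.5
p_calibrated = min(1.0, 1 / e_value)  # ≈ 0.13
```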

Beyond Likelihood Ratios: Other Construction Methods

The likelihood-ratio e-value is canonical but specialized. Several general constructions cover composite nulls:

  • Universal inference (Wasserman, Ramdas, Balakrishnan 2020, Annals of Statistics): split the sample, fit the alternative on one half, evaluate the likelihood ratio on the other. The resulting statistic is an e-value for arbitrary composite nulls. Always applicable; loses a factor of 2 in power from the split.
  • Betting strategies (Shafer 2021): construct a sequence of payoff functions adapted to the filtration. The capital path is an e-process; its value at any stopping time is an e-value.
  • Reverse information projection (Grünwald, de Heide, Koolen 2024): for composite nulls, project the alternative onto the null in KL divergence. The resulting "RIPr" e-value is admissible in a precise decision-theoretic sense.
  • GROW e-values (Grünwald–de Heide–Koolen; "growth-rate optimal in the worst case"): maximize the expected log-payoff against the worst-case alternative, producing e-values that grow fastest under composite alternatives.
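The universal-inference recipe can be sketched for a composite Bernoulli null $H_0: \theta \leq 1/2$ (the function name, split rule, and clipping constant below are illustrative choices, not from a specific paper or library):

```python
import numpy as np

def universal_e_value(xs, null_max=0.5):
    """Split-sample (universal inference) e-value for H0: theta <= null_max,
    Bernoulli data. Fit the alternative on the first half, then evaluate the
    likelihood ratio on the second half, with the null-restricted MLE in the
    denominator. Valid for the composite null by the universal-inference
    argument (the denominator only shrinks the statistic)."""
    xs = np.asarray(xs, dtype=float)
    d1, d0 = xs[: len(xs) // 2], xs[len(xs) // 2 :]
    theta1 = np.clip(d1.mean(), 1e-6, 1 - 1e-6)                  # alternative fit on d1
    theta0 = min(float(np.clip(d0.mean(), 1e-6, 1 - 1e-6)), null_max)  # null MLE on d0
    s, m = d0.sum(), len(d0)
    log_e = s * np.log(theta1 / theta0) + (m - s) * np.log((1 - theta1) / (1 - theta0))
    return float(np.exp(log_e))
```

The factor-of-two power loss shows up because only half the data enters the likelihood ratio.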

For ML evaluation specifically, simple sample-mean-based e-values for bounded outcomes (clicks, win-rates) are constructed from Hoeffding-style bets and have closed-form intervals.
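A minimal version of such a bet, assuming outcomes in $[0, 1]$ and a fixed betting fraction (`betting_e_process` and its parameters are an illustrative sketch, not the confseq API):

```python
import numpy as np

def betting_e_process(xs, m0, lam):
    """Capital path of a fixed-fraction bet against H0: mean <= m0,
    for outcomes in [0, 1]. Each factor 1 + lam*(x - m0) is nonnegative
    for lam in [0, 1/m0] and has expectation at most 1 under the null,
    so the running product is an e-process."""
    xs = np.asarray(xs, dtype=float)
    assert 0 <= lam <= 1 / m0
    return np.cumprod(1 + lam * (xs - m0))

# Two wins then a loss at m0 = 0.5, lam = 1: capital 1.5, 2.25, 1.125
path = betting_e_process([1, 1, 0], m0=0.5, lam=1.0)
```

The value of the path at any stopping time is an e-value, which is what makes the construction safe under peeking.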

Common Misconceptions

Watch Out

E is not Pr(alternative is true)

An e-value of 2020 is not "twenty times more likely that H1H_1 is true." It is the payoff of a betting strategy that started with one unit and would have unit expected return under H0H_0. The relation to posterior probability requires a prior; under a flat prior the e-value coincides with the Bayes factor, but generally they differ.

Watch Out

Products of dependent e-values are not e-values

If $E_1$ and $E_2$ are e-values for two independent experiments, the product $E_1 E_2$ is again an e-value, since $\mathbb{E}[E_1 E_2] = \mathbb{E}[E_1]\,\mathbb{E}[E_2] \leq 1$. The trap is dependence: for e-values computed from overlapping or dependent data, the product can have expectation far above one. The combination that is always valid under arbitrary dependence is averaging (still an e-value by linearity of expectation); multiplication is reserved for independent experiments or for conditional e-values along a filtration, which is the e-process construction.
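A simulation makes the distinction concrete (illustrative construction: $E = 2\cdot\mathbf{1}\{X = 1\}$ for a fair coin is an e-value, and squaring it mimics multiplying two perfectly dependent copies):

```python
import numpy as np

# Under the null (fair coin), E = 2*1{X=1} has mean 1, so it is an e-value.
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=200_000)
e = 2.0 * x

mean_e = e.mean()                    # ~ 1.0: valid e-value
mean_product = (e * e).mean()        # ~ 2.0: dependent product is NOT an e-value
mean_average = ((e + e) / 2).mean()  # ~ 1.0: averaging stays valid
```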

Watch Out

E < 1 is not evidence for H_0

An e-value below $1$ means the betting strategy lost money against $H_0$. That is not evidence that $H_0$ is true; it only fails to refute $H_0$. Evidence for the null requires a different statistic (a posterior, a Bayes factor, or an e-value constructed with the roles of the hypotheses swapped).

Watch Out

p-to-e calibration is asymmetric

$E = 1/p$ is not in general an e-value: when $p$ is exactly uniform under the null, $1/p$ has infinite expectation. Valid p-to-e calibrators (Vovk-Wang 2021) are decreasing functions $f$ with $\int_0^1 f(u)\,du \leq 1$; examples are the power family $f_\kappa(p) = \kappa p^{\kappa - 1}$ for $\kappa \in (0,1)$ and its uniform mixture $f(p) = (1 - p + p \log p)/(p(\log p)^2)$, whose growth is controlled to keep $\mathbb{E}_{H_0}[E] \leq 1$. e-to-p calibration via $p = 1/E$ is loss-free in the other direction.
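The power-family calibrator is a one-liner (a sketch; `calibrate_p_to_e` is an illustrative name):

```python
def calibrate_p_to_e(p, kappa=0.5):
    """Power-family p-to-e calibrator f(p) = kappa * p**(kappa - 1),
    kappa in (0, 1). It integrates to exactly 1 over [0, 1], so f(p)
    has expectation at most 1 when p is super-uniform under the null."""
    assert 0 < kappa < 1 and 0 < p <= 1
    return kappa * p ** (kappa - 1)

# p = 0.01 calibrates to e = 0.5 * 0.01**(-0.5) = 5.0 at kappa = 0.5,
# far smaller than the invalid "1/p = 100": the price of validity.
```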

Worked Exercise

Exercise · Advanced

Problem

A coin-flipping experiment tests $H_0: \theta = 1/2$ against $H_1: \theta = 0.55$. You run the experiment for $n = 500$ flips and observe $S_n = 280$ heads. Compute the likelihood-ratio e-value, the calibrated p-value $1/E_n$, and the classical p-value from the one-sided z-test. Comment on the gap.
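One way to carry out the computation, using only the standard library (the ≈ values follow from the formulas in this section):

```python
import math
from statistics import NormalDist

n, S, q = 500, 280, 0.55

# Likelihood-ratio e-value in log space
log_e = S * math.log(2 * q) + (n - S) * math.log(2 * (1 - q))
e_n = math.exp(log_e)               # ≈ 33.4
p_calibrated = 1 / e_n              # ≈ 0.030

# Classical one-sided z-test for comparison
z = (S - n / 2) / math.sqrt(n / 4)
p_classical = 1 - NormalDist().cdf(z)  # ≈ 0.0036
```

Both approaches reject at $\alpha = 0.05$, but the calibrated p-value is roughly an order of magnitude larger: the gap is the Markov slack that buys anytime validity.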

Implementation Note

For a likelihood-ratio e-value, work in log space to avoid floating-point overflow on long sequences:

import numpy as np

log_e = 0.0
for x in data_stream:
    # accumulate the log likelihood ratio of alternative to null at x
    log_e += np.log(p1_density(x)) - np.log(p0_density(x))
# e = np.exp(log_e); reject when log_e >= np.log(1 / alpha)

Working in log space is equivalent in practice and remains numerically stable for $n$ up to $10^6$ or more.

For composite nulls or distributions without a clean likelihood, the betting-strategy construction is more flexible. The confseq Python package (companion to Howard et al. 2021) provides empirical-Bernstein-style e-processes for sample-mean estimation on bounded outcomes, wrapping the construction in a streaming API that returns the e-value, the calibrated p-value, and the current confidence sequence after each new observation.

For multiple testing with arbitrary dependence, the e-BH procedure (Wang-Ramdas 2022, Annals of Statistics) replaces Benjamini-Hochberg and controls FDR without independence assumptions; the input is a vector of e-values, the output is the rejection set.
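A compact sketch of the procedure as described (the function name is illustrative): reject the hypotheses holding the $\hat{k}$ largest e-values, where $\hat{k} = \max\{k : e_{(k)} \geq K/(\alpha k)\}$ and $e_{(1)} \geq \cdots \geq e_{(K)}$.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """e-BH sketch: FDR control at level alpha for K e-values under
    arbitrary dependence. Returns the indices of rejected hypotheses."""
    e = np.asarray(e_values, dtype=float)
    K = len(e)
    order = np.argsort(-e)                      # indices in decreasing e order
    thresh = K / (alpha * np.arange(1, K + 1))  # K/(alpha*k) for k = 1..K
    passing = np.nonzero(e[order] >= thresh)[0]
    if len(passing) == 0:
        return np.array([], dtype=int)
    k_hat = passing.max() + 1
    return np.sort(order[:k_hat])

# e-values 100, 50, 3, 1 at alpha = 0.05, K = 4: thresholds 80, 40, 26.7, 20,
# so the first two hypotheses are rejected.
rejected = e_bh([100, 50, 3, 1])
```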

Practical Example: LLM Evaluation

Standard LLM evaluation runs $n$ benchmark items and reports an accuracy number with a Wald confidence interval. Under continuous evaluation (new items arrive, the analyst peeks, decides to keep evaluating or stop), the Wald interval is anti-conservative. An e-value-based approach:

  1. Define $H_0$: model accuracy $\leq 0.7$ (the threshold for "passes").
  2. Construct an e-value $E_n$ via the betting-strategy construction over the running mean.
  3. Continue evaluation until $E_n \geq 1/\alpha = 20$ (for $\alpha = 0.05$) or the budget is exhausted.

The $5\%$ Type I error guarantee holds across the entire stopping rule. If the model genuinely scores at $0.7$, the procedure stops with $E \geq 20$ at most $5\%$ of the time; if the model scores higher, the e-value grows exponentially and the procedure stops early. The OpenAI Evals codebase and the open-source evalsync tooling have begun adopting this framing for adaptive benchmark sizing.
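The three-step recipe can be sketched with a fixed betting fraction (all names and parameters below are illustrative, not the API of any eval framework):

```python
import numpy as np

def adaptive_eval(items, m0=0.7, alpha=0.05, lam=0.5):
    """Sequential pass/fail sketch: bet against H0: accuracy <= m0.
    Each factor 1 + lam*(x - m0) has expectation at most 1 under the null
    (for lam in [0, 1/m0]), so stopping the first time e >= 1/alpha keeps
    Type I error at most alpha no matter when we peek or stop."""
    e, n = 1.0, 0
    for n, correct in enumerate(items, start=1):
        e *= 1 + lam * (correct - m0)
        if e >= 1 / alpha:
            return "pass", n, e
    return "undecided", n, e

# A model at 0.85 true accuracy typically passes after a few dozen items.
rng = np.random.default_rng(3)
verdict, n_used, e_final = adaptive_eval(rng.binomial(1, 0.85, size=10_000))
```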

References

Canonical:

  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Foundational paper defining e-values, calibrators, and the e-to-p and p-to-e maps.
  • Shafer, G. (2021). "Testing by betting: A strategy for statistical and scientific communication." Journal of the Royal Statistical Society, Series A 184(2), pp. 407-431. The betting interpretation, with discussion contributions from Vovk, Grünwald, and Wasserman.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey of the framework as of late 2022, mapping every classical sequential test to the e-value/e-process language.

Current:

  • Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). "Universal inference." Proceedings of the National Academy of Sciences 117(29). Split-sample construction of e-values for composite nulls.
  • Grünwald, P., de Heide, R., and Koolen, W. (2024). "Safe testing." Journal of the Royal Statistical Society, Series B. Construction of admissible e-values via reverse information projection (RIPr).
  • Wang, R. and Ramdas, A. (2022). "False discovery rate control with e-values." Journal of the Royal Statistical Society, Series B 84(3). The e-BH procedure for multiple testing under arbitrary dependence.

Historical:

  • Wald, A. (1945). "Sequential tests of statistical hypotheses." Annals of Mathematical Statistics 16(2). The SPRT, which is exactly the likelihood-ratio e-process with two stopping boundaries.

Next Topics

  • e-processes: the sequential version, where a running product of conditional e-values is a nonnegative supermartingale.
  • Confidence sequences: the interval estimates derived from e-processes, valid at every sample size.
  • Anytime-valid inference: the framing of inference under continuous monitoring.
  • Safe testing: the Grünwald-de Heide-Koolen formal framework built on e-values.
  • E-values and anytime-valid inference: the umbrella reference page with proofs, applications, and the e-BH multiple-testing procedure.

Last reviewed: May 13, 2026
