
Sequential Inference

e-values

A nonnegative random variable whose expectation under the null is at most one. Reciprocals of e-values behave like p-values via Markov's inequality, with the structural advantage that products of conditional e-values remain valid evidence under filtration. E-values were developed to replace p-values where optional stopping or selective combination is unavoidable.

Important · Advanced · Tier 1 · Current · Core spine · ~50 min
For: ML, Stats, Research

Why This Matters

The p-value is the dominant evidence statistic for fixed-sample hypothesis tests. It breaks under optional stopping and combination across studies. An e-value is the natural replacement: a nonnegative test statistic $E$ with $\mathbb{E}_{H_0}[E] \leq 1$. The constraint differs from the uniformity required of a p-value, and it pays off in two operational ways. First, Markov's inequality immediately gives $\Pr_{H_0}(E \geq 1/\alpha) \leq \alpha$, so reciprocals of e-values behave like (potentially conservative) p-values at level $\alpha$. Second, conditional e-values multiply: if each $E_t$ is an e-value conditionally on the past with respect to a filtration $(\mathcal{F}_t)$, then the running product is a nonnegative supermartingale under the null, and Ville's inequality gives a time-uniform Type I error guarantee. Optional stopping, peeking, and accumulating evidence across rounds become safe operations.

The framing has both technical and historical roots. Vovk and Wang (2021, Annals of Statistics) formalized e-values as the dual object to pp-values for the optional-continuation problem. Shafer (2021) recast e-values as the payoff of a betting strategy against the null, recovering Wald's SPRT (1945) as a special case. Ramdas, Grünwald, Vovk, and Shafer (2023) named the resulting framework "game-theoretic statistics" and showed it generalizes the sequential analysis literature dating back to the 1940s.

The practical payoff for ML evaluation is direct. LLM benchmarking, RLHF reward shaping, and online A/B experiments all involve sequential evidence streams where the analyst peeks at results, decides to stop or continue, and reports a number. E-values give a frequentist guarantee that survives all three of these behaviors when applied correctly.

Formal Setup

Definition

e-value

Given a null hypothesis $H_0$ specifying a (possibly composite) family of distributions for the observed data, an e-value for $H_0$ is a nonnegative measurable function $E$ of the data such that $\sup_{P \in H_0} \mathbb{E}_P[E] \leq 1$. A test that rejects $H_0$ when $E \geq 1/\alpha$ has Type I error at most $\alpha$, by Markov's inequality.

Definition

Betting interpretation

Treat $H_0$ as a casino offering fair bets under the null. The analyst starts with one unit of capital and chooses, before each observation, a nonnegative payoff function with unit expectation under $H_0$. The capital after $n$ observations is an e-value: large capital is evidence against the null because $H_0$ predicted unit expected capital. The reciprocal $1/E$ plays the role of a p-value, calibrated so that rejecting when $1/E \leq \alpha$ is an $\alpha$-level test.

Definition

Likelihood-ratio e-value

For a simple null $P_0$ and any alternative with density $p_1$, the likelihood ratio $E(X) = p_1(X)/p_0(X)$ is an e-value, since $\mathbb{E}_{P_0}\!\left[\tfrac{p_1(X)}{p_0(X)}\right] = \int p_1(x)\, dx = 1$. Wald's sequential probability ratio test (SPRT) is exactly this construction applied sequentially.

The Markov Bound: e-to-p Calibration

The basic guarantee is one line of probability.

Theorem

Markov Bound for e-values

Statement

Let $E$ be an e-value for the null $H_0$. Then for every $\alpha \in (0, 1)$, $\Pr_{H_0}\!\left[E \geq \tfrac{1}{\alpha}\right] \leq \alpha$. Equivalently, the test that rejects $H_0$ when $E \geq 1/\alpha$ has Type I error at most $\alpha$.

Intuition

The expected payoff of $E$ under the null is at most $1$, so the chance of seeing a payoff much larger than $1$ is small. Markov's inequality makes this quantitative: probabilities of large deviations are bounded by the mean.

Proof Sketch

By Markov, $\Pr_{H_0}(E \geq c) \leq \mathbb{E}_{H_0}[E]/c \leq 1/c$ for any $c > 0$. Set $c = 1/\alpha$.

Why It Matters

The reciprocal $1/E$ is a (super-uniform) p-value: $\Pr_{H_0}(1/E \leq \alpha) = \Pr_{H_0}(E \geq 1/\alpha) \leq \alpha$. The conservativeness factor is the slack in Markov's inequality; for the likelihood-ratio e-value under the alternative, the conservativeness vanishes asymptotically by the Neyman-Pearson lemma.

Failure Mode

The bound is loose by construction; it uses only the first moment. For Gaussian shifts, the p-value calibrated from the likelihood-ratio e-value is roughly the square of the optimal z-test p-value at the same significance level. This conservativeness is the price for distribution-free optional-stopping validity. When optional stopping is not a concern, the z-test is more powerful.
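Both the Markov guarantee and its looseness show up in a quick simulation (a sketch; the choices of $\alpha$, $n$, and the betting alternative $q$ are illustrative):

```python
import numpy as np

# Simulate the likelihood-ratio e-value for a fair coin (the null holds)
# and check the rejection rate of the test "reject when E >= 1/alpha".
rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 50, 20_000
q = 0.6  # alternative bias used to build the e-value

heads = rng.binomial(n, 0.5, size=trials)  # data generated under H0
log_e = heads * np.log(2 * q) + (n - heads) * np.log(2 * (1 - q))
reject_rate = np.mean(log_e >= np.log(1 / alpha))
# reject_rate sits far below alpha = 0.05: Markov's slack is visible
```

The empirical rejection rate lands well under the nominal $5\%$, which is the conservativeness the failure mode describes.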

Comparison with p-values

The structural difference between p-values and e-values is what each guarantees.

Property | p-value | e-value
Definition | $\Pr_{H_0}(p \leq u) \leq u$ for all $u$ | $\mathbb{E}_{H_0}[E] \leq 1$
Direction | small is strong evidence | large is strong evidence
Type I error test | reject if $p \leq \alpha$ | reject if $E \geq 1/\alpha$
Combination | requires independence + meta-analysis tools | conditional products are valid (e-processes)
Optional stopping | breaks (uniformity lost) | preserved (supermartingale)
Sharpest test under alternative | optimal under Neyman-Pearson | conservative by factor 2-3 in typical Gaussian shifts
Calibration | $E \to p$: $p = 1/E$ is loss-free | $p \to E$: $E = 1/p$ is not an e-value; calibrators are nontrivial

The headline trade-off: e-values give up power at any fixed $n$ to gain validity at every stopping time simultaneously. The exchange is favorable whenever the cost of inflated Type I error under sequential peeking exceeds the power loss.

Canonical Example: Coin-Bias Evidence

Suppose $X_1, X_2, \ldots$ are iid Bernoulli$(\theta)$ and we test $H_0: \theta = 1/2$ against $H_1: \theta = q$ for some fixed $q > 1/2$.

The likelihood-ratio e-value after $n$ flips is $E_n = \prod_{i=1}^n \frac{p_q(X_i)}{p_{1/2}(X_i)} = (2q)^{S_n}\,(2(1-q))^{n - S_n}$, where $S_n = \sum_i X_i$ is the number of heads. Under $H_0$, $\mathbb{E}[E_n] = 1$ for every $n$; under $H_1$ the e-value grows exponentially in $n$ at rate equal to the KL divergence $\mathrm{KL}(q \,\|\, 1/2)$.

Numerical case: $n = 100$ flips, $q = 0.6$, observed $S_n = 60$. Then $E_{100} = (1.2)^{60}\,(0.8)^{40} \approx 7.5$. The calibrated p-value $1/E_{100} \approx 0.13$ does not reject at $\alpha = 0.05$, even though the classical one-sided p-value for 60 heads is roughly $0.03$: the gap is the price of anytime validity. Had $E_n$ reached $1/\alpha = 20$, the rejection would have been valid at level $5\%$ regardless of whether the experiment stopped at $n = 100$ or at any other time.
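The coin computation is a one-liner in log space (a sketch using `numpy` only):

```python
import numpy as np

# Likelihood-ratio e-value for H0: theta = 1/2 vs H1: theta = q,
# computed in log space to avoid overflow on long sequences.
n, q, S = 100, 0.6, 60
log_e = S * np.log(2 * q) + (n - S) * np.log(2 * (1 - q))
e_value = float(np.exp(log_e))        # ≈ 7.5
p_calibrated = min(1.0, 1 / e_value)  # ≈ 0.13
```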

Beyond Likelihood Ratios: Other Construction Methods

The likelihood-ratio e-value is canonical but specialized. Several general constructions cover composite nulls:

  • Universal inference (Wasserman, Ramdas, Balakrishnan 2020, Annals of Statistics): split the sample, fit the alternative on one half, evaluate the likelihood ratio on the other. The resulting statistic is an e-value for arbitrary composite nulls. Always applicable; loses a factor of 2 in power from the split.
  • Betting strategies (Shafer 2021): construct a sequence of payoff functions adapted to the filtration. The capital path is an e-process; its value at any stopping time is an e-value.
  • Reverse information projection (Grünwald, de Heide, Koolen 2024): for composite nulls, project the alternative onto the null in KL divergence. The resulting "RIPr" e-value is admissible in a precise decision-theoretic sense.
  • GROW e-values (Grünwald–de Heide–Koolen; "growth-rate optimal in the worst case"): maximize the expected log-payoff against the worst-case alternative, producing e-values that grow fastest under composite alternatives.
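The universal-inference recipe can be sketched for a composite Bernoulli null $H_0: \theta \leq 1/2$ (the function name, split rule, and clipping constant below are illustrative choices, not from a specific paper or library):

```python
import numpy as np

def universal_e_value(xs, null_max=0.5):
    """Split-sample (universal inference) e-value for H0: theta <= null_max,
    Bernoulli data. Fit the alternative on the first half, then evaluate the
    likelihood ratio on the second half, with the null-restricted MLE in the
    denominator. Valid for the composite null by the universal-inference
    argument (the denominator only shrinks the statistic)."""
    xs = np.asarray(xs, dtype=float)
    d1, d0 = xs[: len(xs) // 2], xs[len(xs) // 2 :]
    theta1 = np.clip(d1.mean(), 1e-6, 1 - 1e-6)                  # alternative fit on d1
    theta0 = min(float(np.clip(d0.mean(), 1e-6, 1 - 1e-6)), null_max)  # null MLE on d0
    s, m = d0.sum(), len(d0)
    log_e = s * np.log(theta1 / theta0) + (m - s) * np.log((1 - theta1) / (1 - theta0))
    return float(np.exp(log_e))
```

The factor-of-two power loss shows up because only half the data enters the likelihood ratio.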

For ML evaluation specifically, simple sample-mean-based e-values for bounded outcomes (clicks, win-rates) are constructed from Hoeffding-style bets and have closed-form intervals.
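A minimal version of such a bet, assuming outcomes in $[0, 1]$ and a fixed betting fraction (`betting_e_process` and its parameters are an illustrative sketch, not the confseq API):

```python
import numpy as np

def betting_e_process(xs, m0, lam):
    """Capital path of a fixed-fraction bet against H0: mean <= m0,
    for outcomes in [0, 1]. Each factor 1 + lam*(x - m0) is nonnegative
    for lam in [0, 1/m0] and has expectation at most 1 under the null,
    so the running product is an e-process."""
    xs = np.asarray(xs, dtype=float)
    assert 0 <= lam <= 1 / m0
    return np.cumprod(1 + lam * (xs - m0))

# Two wins then a loss at m0 = 0.5, lam = 1: capital 1.5, 2.25, 1.125
path = betting_e_process([1, 1, 0], m0=0.5, lam=1.0)
```

The value of the path at any stopping time is an e-value, which is what makes the construction safe under peeking.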

Common Misconceptions

Watch Out

E is not Pr(alternative is true)

An e-value of 2020 is not "twenty times more likely that H1H_1 is true." It is the payoff of a betting strategy that started with one unit and would have unit expected return under H0H_0. The relation to posterior probability requires a prior; under a flat prior the e-value coincides with the Bayes factor, but generally they differ.

Watch Out

Products of dependent e-values are not e-values

If $E_1$ and $E_2$ are e-values for two independent experiments, the product $E_1 E_2$ is again an e-value, since $\mathbb{E}[E_1 E_2] = \mathbb{E}[E_1]\,\mathbb{E}[E_2] \leq 1$. The trap is dependence: for e-values computed from overlapping or dependent data, the product can have expectation far above one. The combination that is always valid under arbitrary dependence is averaging (still an e-value by linearity of expectation); multiplication is reserved for independent experiments or for conditional e-values along a filtration, which is the e-process construction.
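A simulation makes the distinction concrete (illustrative construction: $E = 2\cdot\mathbf{1}\{X = 1\}$ for a fair coin is an e-value, and squaring it mimics multiplying two perfectly dependent copies):

```python
import numpy as np

# Under the null (fair coin), E = 2*1{X=1} has mean 1, so it is an e-value.
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=200_000)
e = 2.0 * x

mean_e = e.mean()                    # ~ 1.0: valid e-value
mean_product = (e * e).mean()        # ~ 2.0: dependent product is NOT an e-value
mean_average = ((e + e) / 2).mean()  # ~ 1.0: averaging stays valid
```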

Watch Out

E < 1 is not evidence for H_0

An e-value below $1$ means the betting strategy lost money against $H_0$. That is not evidence that $H_0$ is true; it only fails to refute $H_0$. Evidence for the null requires a different statistic (a posterior, a Bayes factor, or an e-value constructed with the roles of the hypotheses swapped).

Watch Out

p-to-e calibration is asymmetric

$E = 1/p$ is not in general an e-value: when $p$ is exactly uniform under the null, $1/p$ has infinite expectation. Valid p-to-e calibrators (Vovk-Wang 2021) are decreasing functions $f$ with $\int_0^1 f(u)\,du \leq 1$; examples are the power family $f_\kappa(p) = \kappa p^{\kappa - 1}$ for $\kappa \in (0,1)$ and its uniform mixture $f(p) = (1 - p + p \log p)/(p(\log p)^2)$, whose growth is controlled to keep $\mathbb{E}_{H_0}[E] \leq 1$. e-to-p calibration via $p = 1/E$ is loss-free in the other direction.
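The power-family calibrator is a one-liner (a sketch; `calibrate_p_to_e` is an illustrative name):

```python
def calibrate_p_to_e(p, kappa=0.5):
    """Power-family p-to-e calibrator f(p) = kappa * p**(kappa - 1),
    kappa in (0, 1). It integrates to exactly 1 over [0, 1], so f(p)
    has expectation at most 1 when p is super-uniform under the null."""
    assert 0 < kappa < 1 and 0 < p <= 1
    return kappa * p ** (kappa - 1)

# p = 0.01 calibrates to e = 0.5 * 0.01**(-0.5) = 5.0 at kappa = 0.5,
# far smaller than the invalid "1/p = 100": the price of validity.
```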

Worked Exercise

Exercise · Advanced

Problem

A coin-flipping experiment tests $H_0: \theta = 1/2$ against $H_1: \theta = 0.55$. You run the experiment for $n = 500$ flips and observe $S_n = 280$ heads. Compute the likelihood-ratio e-value, the calibrated p-value $1/E_n$, and the classical p-value from the one-sided z-test. Comment on the gap.
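One way to carry out the computation, using only the standard library (the ≈ values follow from the formulas in this section):

```python
import math
from statistics import NormalDist

n, S, q = 500, 280, 0.55

# Likelihood-ratio e-value in log space
log_e = S * math.log(2 * q) + (n - S) * math.log(2 * (1 - q))
e_n = math.exp(log_e)               # ≈ 33.4
p_calibrated = 1 / e_n              # ≈ 0.030

# Classical one-sided z-test for comparison
z = (S - n / 2) / math.sqrt(n / 4)
p_classical = 1 - NormalDist().cdf(z)  # ≈ 0.0036
```

Both approaches reject at $\alpha = 0.05$, but the calibrated p-value is roughly an order of magnitude larger: the gap is the Markov slack that buys anytime validity.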

Implementation Note

For a likelihood-ratio e-value, work in log space to avoid floating-point overflow on long sequences:

import numpy as np

log_e = 0.0
for x in data_stream:
    # accumulate the log likelihood ratio of alternative to null at x
    log_e += np.log(p1_density(x)) - np.log(p0_density(x))
# e = np.exp(log_e); reject when log_e >= np.log(1 / alpha)

Working in log space is equivalent in practice and remains numerically stable for $n$ up to $10^6$ or more.

For composite nulls or distributions without a clean likelihood, the betting-strategy construction is more flexible. The confseq Python package (companion to Howard et al. 2021) provides empirical-Bernstein-style e-processes for sample-mean estimation on bounded outcomes, wrapping the construction in a streaming API that returns the e-value, the calibrated p-value, and the current confidence sequence after each new observation.

For multiple testing with arbitrary dependence, the e-BH procedure (Wang-Ramdas 2022, Annals of Statistics) replaces Benjamini-Hochberg and controls FDR without independence assumptions; the input is a vector of e-values, the output is the rejection set.
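A compact sketch of the procedure as described (the function name is illustrative): reject the hypotheses holding the $\hat{k}$ largest e-values, where $\hat{k} = \max\{k : e_{(k)} \geq K/(\alpha k)\}$ and $e_{(1)} \geq \cdots \geq e_{(K)}$.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """e-BH sketch: FDR control at level alpha for K e-values under
    arbitrary dependence. Returns the indices of rejected hypotheses."""
    e = np.asarray(e_values, dtype=float)
    K = len(e)
    order = np.argsort(-e)                      # indices in decreasing e order
    thresh = K / (alpha * np.arange(1, K + 1))  # K/(alpha*k) for k = 1..K
    passing = np.nonzero(e[order] >= thresh)[0]
    if len(passing) == 0:
        return np.array([], dtype=int)
    k_hat = passing.max() + 1
    return np.sort(order[:k_hat])

# e-values 100, 50, 3, 1 at alpha = 0.05, K = 4: thresholds 80, 40, 26.7, 20,
# so the first two hypotheses are rejected.
rejected = e_bh([100, 50, 3, 1])
```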

Practical Example: LLM Evaluation

Standard LLM evaluation runs $n$ benchmark items and reports an accuracy number with a Wald confidence interval. Under continuous evaluation (new items arrive, the analyst peeks, decides to keep evaluating or stop), the Wald interval is anti-conservative. An e-value-based approach:

  1. Define $H_0$: model accuracy $\leq 0.7$ (the threshold for "passes").
  2. Construct an e-value $E_n$ via the betting-strategy construction over the running mean.
  3. Continue evaluation until $E_n \geq 1/\alpha = 20$ (for $\alpha = 0.05$) or the budget is exhausted.

The $5\%$ Type I error guarantee holds across the entire stopping rule. If the model genuinely scores at $0.7$, the procedure stops with $E \geq 20$ at most $5\%$ of the time; if the model scores higher, the e-value grows exponentially and the procedure stops early. The OpenAI Evals codebase and the open-source evalsync tooling have begun adopting this framing for adaptive benchmark sizing.
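The three-step recipe can be sketched with a fixed betting fraction (all names and parameters below are illustrative, not the API of any eval framework):

```python
import numpy as np

def adaptive_eval(items, m0=0.7, alpha=0.05, lam=0.5):
    """Sequential pass/fail sketch: bet against H0: accuracy <= m0.
    Each factor 1 + lam*(x - m0) has expectation at most 1 under the null
    (for lam in [0, 1/m0]), so stopping the first time e >= 1/alpha keeps
    Type I error at most alpha no matter when we peek or stop."""
    e, n = 1.0, 0
    for n, correct in enumerate(items, start=1):
        e *= 1 + lam * (correct - m0)
        if e >= 1 / alpha:
            return "pass", n, e
    return "undecided", n, e

# A model at 0.85 true accuracy typically passes after a few dozen items.
rng = np.random.default_rng(3)
verdict, n_used, e_final = adaptive_eval(rng.binomial(1, 0.85, size=10_000))
```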

References

Canonical:

  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. Foundational paper defining e-values, calibrators, and the e-to-p and p-to-e maps.
  • Shafer, G. (2021). "Testing by betting: A strategy for statistical and scientific communication." Journal of the Royal Statistical Society, Series A 184(2), pp. 407-431. The betting interpretation, with discussion contributions from Vovk, Grünwald, and Wasserman.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. Survey of the framework as of late 2022, mapping every classical sequential test to the e-value/e-process language.

Current:

  • Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). "Universal inference." Proceedings of the National Academy of Sciences 117(29). Split-sample construction of e-values for composite nulls.
  • Grünwald, P., de Heide, R., and Koolen, W. (2024). "Safe testing." Journal of the Royal Statistical Society, Series B. Construction of admissible e-values via reverse information projection (RIPr).
  • Wang, R. and Ramdas, A. (2022). "False discovery rate control with e-values." Journal of the Royal Statistical Society, Series B 84(3). The e-BH procedure for multiple testing under arbitrary dependence.

Historical:

  • Wald, A. (1945). "Sequential tests of statistical hypotheses." Annals of Mathematical Statistics 16(2). The SPRT, which is exactly the likelihood-ratio e-process with two stopping boundaries.

Next Topics

  • e-processes: the sequential version, where a running product of conditional e-values is a nonnegative supermartingale.
  • Confidence sequences: the interval estimates derived from e-processes, valid at every sample size.
  • Anytime-valid inference: the framing of inference under continuous monitoring.
  • Safe testing: the Grünwald-de Heide-Koolen formal framework built on e-values.
  • E-values and anytime-valid inference: the umbrella reference page with proofs, applications, and the e-BH multiple-testing procedure.

Last reviewed: May 13, 2026
