

Safe Testing

A formal framework for hypothesis testing where every test statistic is an e-value and every sequential procedure is an e-process. Safe tests survive optional stopping, optional continuation, and selective combination by construction. The Grünwald-de Heide-Koolen 2024 formalization replaces Neyman-Pearson tests with admissible tests built on reverse information projection.

Advanced · Research · Tier 1 · Frontier · Core spine · ~50 min
For: ML, Stats, Research

Why This Matters

Classical Neyman-Pearson testing fixes a sample size, a null, and a (possibly composite) alternative. The test is constructed to maximize power at level $\alpha$. The framework breaks under optional stopping, optional continuation, and the combination of evidence across studies that share the same null. The patches (group-sequential boundaries, alpha-spending, meta-analytic adjustments) are rigid and brittle.

Safe testing, formalized by Grünwald, de Heide, and Koolen (2024, Journal of the Royal Statistical Society Series B), is the framework where every test is built from e-values by construction. The framework imposes a single requirement: the test statistic is a nonnegative random variable whose expectation under the null is at most one. From that constraint, every standard sequential operation (peeking, stopping early, continuing past a planned stop, combining studies, adapting the analysis) is automatically valid.

The unification matters because it replaces a constellation of mutually incompatible classical techniques with a single coherent framework. Neyman-Pearson tests, group-sequential tests, Bayes factors, likelihood-ratio tests, and the SPRT all become special cases of safe tests under different choices of evidence functional. The cost of the unification is a small power loss at the optimum fixed-$n$ alternative; the gain is robust validity across the actual decision rules used in practice.

Formal Setup

Definition

Safe test

A safe test at level $\alpha$ for a null hypothesis $H_0$ is a function $S(X)$ of the data such that $S(X) \geq 1/\alpha$ implies rejection, and $S$ is an e-value: $\mathbb{E}_P[S] \leq 1$ for every $P \in H_0$. The Type I error of the rejection rule is at most $\alpha$ by Markov's inequality.
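
The Markov step can be checked directly in simulation. A minimal sketch (a toy example of my own, not from the paper): for the simple null N(0,1) against the alternative N(1,1), the one-observation likelihood ratio exp(x - 1/2) is an exact e-value.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

# E-value for the simple null N(0,1) vs alternative N(1,1), one observation:
# S(x) = phi(x - 1) / phi(x) = exp(x - 1/2); under the null E[S] = 1.
def e_value(x):
    return np.exp(x - 0.5)

# Reject when S >= 1/alpha; Markov bounds the Type I error by alpha.
x_null = rng.normal(0.0, 1.0, size=200_000)
type1 = np.mean(e_value(x_null) >= 1 / alpha)
print(f"empirical Type I error: {type1:.5f} (Markov bound: {alpha})")
```

The empirical rate lands far below $\alpha$ here: Markov is loose for this e-value, and the bound, not an exact level, is what the construction guarantees.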

Definition

Safe anytime-valid test

A safe anytime-valid test is a sequential procedure with rejection rule "stop and reject the first time $S_t \geq 1/\alpha$" where $(S_t)$ is an e-process. By Ville's inequality, the Type I error at every stopping time is at most $\alpha$.
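
Ville's inequality can also be checked by simulation (again a toy example, not from the paper): the running likelihood-ratio product for N(0,1) versus N(1,1) is a nonnegative martingale under the null, and the fraction of null sample paths that ever cross $1/\alpha$ stays below $\alpha$ no matter when you stop.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_steps, n_runs = 0.05, 400, 10_000

# E-process: S_t = prod_{i<=t} exp(X_i - 1/2) for N(0,1) vs N(1,1);
# a nonnegative martingale with E[S_t] = 1 under the null.
x = rng.normal(0.0, 1.0, size=(n_runs, n_steps))
log_S = np.cumsum(x - 0.5, axis=1)

# Ville: P(sup_t S_t >= 1/alpha) <= alpha, so ANY stopping rule is safe.
ever_crossed = np.mean(log_S.max(axis=1) >= np.log(1 / alpha))
print(f"null paths ever crossing 1/alpha: {ever_crossed:.4f} (bound: {alpha})")
```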

Definition

Reverse information projection (RIPr)

For an alternative density $P_1$ and a composite null $\mathcal{P}_0$, the RIPr is the element of $\mathcal{P}_0$ that minimizes Kullback-Leibler divergence to $P_1$: $P_0^* = \arg\min_{P_0 \in \mathcal{P}_0} \mathrm{KL}(P_1 \| P_0)$. The likelihood ratio $p_1(X)/p_0^*(X)$ is an e-value for the composite null $\mathcal{P}_0$ and is admissible in the safe-testing decision-theoretic sense.
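
A toy illustration of the projection (my own one-parameter example, not the paper's algorithm): for the null family {N(θ,1): θ ≤ 0} and alternative N(1,1), KL(N(1,1) ‖ N(θ,1)) = (1 − θ)²/2, so the RIPr is N(0,1), and the resulting likelihood ratio has null expectation at most one across the whole family.

```python
import numpy as np

# Null family {N(theta,1): theta <= 0}, alternative P1 = N(1,1).
# KL(N(1,1) || N(theta,1)) = (1 - theta)^2 / 2 -> minimized at theta = 0.
thetas = np.linspace(-3.0, 0.0, 3001)
theta_star = thetas[np.argmin((1.0 - thetas) ** 2 / 2.0)]
print(f"RIPr parameter: theta* = {theta_star:.3f}")

# The e-value p1(X)/p0*(X) = exp(X - 1/2) satisfies E_P[S] <= 1 for EVERY
# null point (the expectation is exp(theta)), with equality at the RIPr.
rng = np.random.default_rng(2)
means = []
for theta in (-1.0, -0.5, 0.0):
    x = rng.normal(theta, 1.0, size=200_000)
    means.append(np.exp(x - 0.5).mean())
    print(f"theta = {theta:+.1f}: mean e-value = {means[-1]:.3f}")
```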

The Core Result

The construction that makes safe testing work is the e-value itself.

Theorem

E-Value Construction Yields a Safe Test

Statement

Let $S(X)$ be a nonnegative function of the data $X$ with $\sup_{P \in H_0} \mathbb{E}_P[S] \leq 1$. The test that rejects $H_0$ when $S(X) \geq 1/\alpha$ is a safe test at level $\alpha$: its Type I error is at most $\alpha$, uniformly over $H_0$.

If additionally $(S_t)$ is an e-process for $H_0$, the sequential rule "stop and reject at the first $t$ with $S_t \geq 1/\alpha$" is a safe anytime-valid test at level $\alpha$: Type I error at most $\alpha$ for every stopping rule.

Intuition

The Markov inequality gives the single-shot guarantee; Ville's inequality gives the sequential guarantee. The construction does not optimize for power; it guarantees validity. Power comes from clever choice of the e-value or e-process.

Proof Sketch

Single-shot: $\Pr_P(S \geq 1/\alpha) \leq \alpha \cdot \mathbb{E}_P[S] \leq \alpha$ by Markov's inequality.

Sequential: Ville's inequality applied to the nonnegative supermartingale $(S_t)$.

Validity holds under any $P \in H_0$, simultaneously over the entire null family, because the e-value/e-process bound is uniform in $P$.

Why It Matters

Safe tests are the operational object the analyst computes. The framework substitutes "construct an e-value" for the classical recipe "compute a sufficient statistic, normalize it, look up the tail probability of its null distribution." Every classical Neyman-Pearson test maps to a safe test (the likelihood ratio is the e-value); the converse is not true (some safe tests have no Neyman-Pearson analog).

Failure Mode

The e-value must satisfy $\mathbb{E}_P[S] \leq 1$ for every $P \in H_0$, not just for a chosen point null. Composite nulls require constructions that handle the entire null family at once: universal inference, RIPr, or numerical worst-case search.

Construction Methods

The Grünwald-de Heide-Koolen 2024 paper organizes safe-test construction into four families:

Simple-versus-simple (likelihood ratio). When both $H_0 = \{P_0\}$ and $H_1 = \{P_1\}$ are simple, the e-value is the likelihood ratio $p_1/p_0$. This is the SPRT in the sequential case. The construction is exact, and the safe test is the (uniformly most powerful) Neyman-Pearson test up to the Markov calibration.

Simple null, composite alternative. Pick a single point $P_1^*$ in the alternative (Bayes-marginalized over a prior, the worst case under a minimax criterion, or the growth-rate-optimal point). The likelihood ratio against this fixed $P_1^*$ is an e-value. The GROW (growth-rate-optimal in the worst case) e-value maximizes the worst-case expected log-payoff over the alternative.
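
The Bayes-marginalized option can be sketched in a few lines (a toy mixture chosen for illustration): averaging per-point likelihood ratios under prior weights yields a single statistic whose null expectation is exactly one, hence an e-value for the simple null.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simple null N(0,1); composite alternative {N(mu,1)} mixed under a prior.
mus = np.array([0.25, 0.5, 1.0])
w = np.array([0.5, 0.3, 0.2])              # prior weights, sum to 1

def mixture_e_value(x):
    # Per-point likelihood ratio: exp(mu * sum(x) - n * mu^2 / 2).
    n, s = len(x), x.sum()
    return float(np.sum(w * np.exp(mus * s - n * mus**2 / 2)))

# Under the null, E[mixture] = sum_j w_j * E[LR_j] = sum_j w_j = 1.
x_null = rng.normal(0.0, 1.0, size=(50_000, 5))
mean_e = np.mean([mixture_e_value(row) for row in x_null])
print(f"mean mixture e-value under H0: {mean_e:.3f}")
```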

Composite null, composite alternative. Two main constructions:

  1. Universal inference (Wasserman-Ramdas-Balakrishnan 2020): split the sample, fit a model on one half, evaluate the likelihood ratio on the other. Always applicable; loses a factor of $\sqrt{2}$ in power.
  2. Reverse information projection (Grünwald-de Heide-Koolen 2024): replace $P_0$ in the likelihood ratio by the RIPr of the alternative onto the null. The resulting e-value is admissible and approaches Bayes-factor power.
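
Construction 1 can be sketched concretely (a toy one-dimensional Gaussian instance of universal inference, my own example): fit the alternative on half the data, take the null MLE on the other half, and the resulting split likelihood ratio is an e-value for the composite null.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.05

# Universal inference for the composite null {N(theta,1): theta <= 0}.
def split_lr_e_value(x):
    d1, d0 = x[: len(x) // 2], x[len(x) // 2:]
    theta_alt = d1.mean()                  # alternative fitted on D1
    theta_null = min(d0.mean(), 0.0)       # null MLE on D0
    log_lr = 0.5 * np.sum((d0 - theta_null) ** 2 - (d0 - theta_alt) ** 2)
    return np.exp(log_lr)

# Under a null point (theta = -0.2), rejecting at e >= 1/alpha keeps
# the Type I error below alpha (in fact far below: Markov is loose).
x_null = rng.normal(-0.2, 1.0, size=(20_000, 20))
rejects = np.mean([split_lr_e_value(row) >= 1 / alpha for row in x_null])
print(f"empirical Type I error: {rejects:.5f} (bound: {alpha})")
```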

Sample-mean tests on bounded outcomes. The betting-strategy construction (Waudby-Smith-Ramdas 2024) builds e-processes for $H_0: \mu \leq \mu_0$ via predictable bets. Tightest known intervals for bounded-outcome A/B tests.
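
A minimal betting sketch (with a constant bet for clarity; Waudby-Smith and Ramdas use adaptive, predictable bet sizes): for $H_0: \mu \leq \mu_0$ with outcomes in [0,1], the wealth process $\prod_t (1 + \lambda(X_t - \mu_0))$ stays nonnegative whenever $0 \leq \lambda \leq 1/\mu_0$ and is a supermartingale under the null, hence an e-process.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, mu0, lam = 0.05, 0.4, 0.5   # 0 <= lam <= 1/mu0 keeps wealth >= 0

# Wealth process S_t = prod_{s<=t} (1 + lam * (X_s - mu0)).
# Under H0: mu <= mu0 each factor has mean <= 1, so S_t is a supermartingale.

# H0 false (true mean 0.6): wealth grows; rejection comes quickly.
x_alt = rng.uniform(0.3, 0.9, size=400)
log_S = np.cumsum(np.log1p(lam * (x_alt - mu0)))
hits = log_S >= np.log(1 / alpha)
first_reject = int(np.argmax(hits)) + 1 if hits.any() else None
print("first rejection at t =", first_reject)

# H0 true (true mean 0.35): ever-crossing rate stays below alpha (Ville).
x_null = rng.uniform(0.0, 0.7, size=(10_000, 400))
log_S0 = np.cumsum(np.log1p(lam * (x_null - mu0)), axis=1)
ever = np.mean(log_S0.max(axis=1) >= np.log(1 / alpha))
print(f"null ever-crossing rate: {ever:.4f} (bound: {alpha})")
```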

Comparison: Neyman-Pearson vs Safe Tests

| Property | Neyman-Pearson test | Safe test |
| --- | --- | --- |
| Decision rule | Reject if $T \geq t_\alpha$, the $1-\alpha$ quantile of $T$ under $H_0$ | Reject if $S \geq 1/\alpha$ |
| Sample size | Fixed before data collection | Any stopping rule |
| Type I error | Exactly $\alpha$ (continuous case) | At most $\alpha$ (often strictly less) |
| Power at fixed $n$ | Optimal under simple alternatives | Within constants of optimal |
| Composite null | Needs uniform tail bound (often hard) | Handled by universal inference or RIPr |
| Multiple testing | Bonferroni / BH adjustments after the fact | e-BH integrates into the framework natively |
| Combination across studies | Meta-analytic, needs independence | Conditional product or averaging is automatic |

The headline trade-off: safe tests sacrifice a power factor (the Markov-to-uniform-distribution gap, typically a factor of 2 to 3 in $p$-value calibration) to gain robustness across every operational dimension that makes classical testing fragile in practice.
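
The gap can be made concrete with a $p$-to-e calibrator (this uses the $\kappa$-family of Vovk and Wang; $\kappa = 1/2$ is an arbitrary illustrative choice): $e(p) = \kappa p^{\kappa-1}$ integrates to one over a uniform null $p$-value, so it is an e-value, but rejecting at $e \geq 1/\alpha$ demands a far smaller $p$ than the classical cutoff.

```python
import numpy as np

kappa, alpha = 0.5, 0.05

# Calibrator e(p) = kappa * p^(kappa - 1); its integral over Uniform(0,1)
# is 1, so e(p) is an e-value whenever p is a valid p-value.
rng = np.random.default_rng(6)
p = rng.uniform(size=1_000_000)
mean_e = (kappa * p ** (kappa - 1)).mean()
print(f"mean e(p) under a uniform null: {mean_e:.3f}")

# e(p) >= 1/alpha  <=>  p <= (alpha * kappa)^(1 / (1 - kappa))
p_thresh = (alpha * kappa) ** (1 / (1 - kappa))
print(f"e-value rejection needs p <= {p_thresh:.6f} (classical cutoff: {alpha})")
```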

Canonical Example: Safe Sequential t-Test

Test $H_0: \mu = 0$ for iid Gaussian data $X_1, X_2, \ldots \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. The classical $t$-test fixes $n$, computes $T_n = \sqrt{n}\,\bar{X}_n / s_n$, and rejects when $|T_n| \geq t_{\alpha/2}$, the upper $\alpha/2$ quantile of the $t$-distribution with $n - 1$ degrees of freedom.

The Grünwald-Henzi-Lardy safe sequential $t$-test (2024) replaces the fixed $n$ with an e-process built from the Bayesian mixture of $t$-densities against a fixed alternative variance prior. The construction:

  1. Place a Cauchy prior on the alternative $\mu$ and a scaled inverse-chi-squared prior on $\sigma^2$.
  2. The ratio of the marginal likelihood under the Bayesian alternative to the likelihood under $H_0$ is the e-value: a Bayes factor.
  3. The running Bayes factor is an e-process under $H_0$. Reject at the first $t$ where the e-value exceeds $1/\alpha = 20$ (for $\alpha = 0.05$).

The safe $t$-test gives Type I error $\leq 5\%$ across all stopping times. At a fixed $n = 100$, the safe-$t$ power is approximately 85% of the classical $t$-test power for $\mu = 0.3\sigma$, a 15% relative power loss. For sequential designs where the analyst would peek 10+ times, the safe test dominates the corrected classical test.
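
The full safe $t$-test marginalizes over both $\mu$ and $\sigma^2$, which has no elementary closed form; a known-variance analog shows the same mechanics. Assuming $\sigma = 1$ and a $\mathcal{N}(0, \tau^2)$ prior on $\mu$ (a simplification for illustration, not the Grünwald-Henzi-Lardy construction), the running Bayes factor is $S_n = (1 + n\tau^2)^{-1/2} \exp(\tau^2 s_n^2 / (2(1 + n\tau^2)))$ with $s_n$ the running sum.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, tau2 = 0.05, 1.0

# Running Bayes factor for H0: mu = 0 vs mu ~ N(0, tau2), sigma = 1 known:
# log S_n = -0.5*log(1 + n*tau2) + tau2 * s_n^2 / (2*(1 + n*tau2)).
def log_bayes_factor(x):
    n = np.arange(1, len(x) + 1)
    s = np.cumsum(x)
    return -0.5 * np.log(1 + n * tau2) + tau2 * s**2 / (2 * (1 + n * tau2))

# Under the alternative mu = 0.3, the e-process grows like n*mu^2/2
# and crosses 1/alpha = 20 at a data-dependent time.
x = rng.normal(0.3, 1.0, size=2000)
log_bf = log_bayes_factor(x)
crossed = np.flatnonzero(log_bf >= np.log(1 / alpha))
print("first crossing at n =", int(crossed[0]) + 1 if crossed.size else None)
```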

Worked Exercise

Exercise · Advanced

Problem

A standard Neyman-Pearson likelihood-ratio test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ on iid data rejects when $\prod_{i=1}^n p_{\theta_1}(X_i)/p_{\theta_0}(X_i) \geq c$ for a threshold $c$ chosen to give Type I error $\alpha$. Show that this is exactly a safe test (in the single-shot sense) with the threshold $c = 1/\alpha$. Quantify the difference in $c$ between the Markov calibration and the exact-distribution calibration for $\mathcal{N}(0, 1)$ versus $\mathcal{N}(1, 1)$ at $n = 16$.

Implementation Note

The safestats R package (companion to Grünwald-de Heide-Koolen 2024) implements safe tests for the binomial, Gaussian, $t$, $\chi^2$, and Poisson cases with mixture priors chosen for growth-rate optimality against bounded alternatives.

For ML-style applications with bounded outcomes (accuracy on a benchmark, conversion-rate differential in an A/B test), the confseq Python package provides empirical-Bernstein safe tests with closed-form e-process update rules. Each new observation contributes a multiplicative factor; the cumulative log-e-value can be tracked in $O(1)$ per observation.

A common implementation pitfall: the e-value construction must be specified before the data it is evaluated on are seen. The betting strategy or the alternative density choice is a pre-specified (predictable) element of the design. Adapting the e-value construction after looking at the same data violates the supermartingale property and breaks the guarantee. Pre-register the e-value functional; the stopping time remains free.

Practical Example: Multi-Site Clinical Trial Meta-Analysis

A pharmaceutical sponsor coordinates a multi-center trial of a new treatment against placebo. Centers complete enrollment at different times. The classical approach pre-specifies a meta-analysis at the planned end date and combines pp-values via Fisher's combination. Peeking at interim results is forbidden.

A safe-testing approach:

  1. Each center constructs an e-value for the per-site null hypothesis.
  2. Combine the per-site e-values via averaging (still a valid e-value).
  3. The aggregate e-value can be monitored continuously across centers without inflating Type I error.
  4. The combined trial can stop when the aggregate e-value crosses 1/α1/\alpha, even if some centers have not finished enrollment.
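
The combination step above can be sketched as follows (toy Gaussian sites and a point alternative chosen for illustration; the averaging identity is the part that matters): since each site's statistic has null expectation at most one, so does their average.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, mu1 = 0.05, 0.2

# Per-site e-value: likelihood ratio of N(mu1,1) to N(0,1) on that site's data.
def site_e_value(x):
    return float(np.exp(mu1 * x.sum() - len(x) * mu1**2 / 2))

# Unequal enrollment across three sites, all under the global null.
sites = [rng.normal(0.0, 1.0, size=n) for n in (40, 25, 10)]
e_combined = np.mean([site_e_value(x) for x in sites])
print(f"combined e-value: {e_combined:.3f} (reject if >= {1 / alpha:.0f})")

# Validity check: averaging preserves the e-value property,
# E[mean_k S_k] = mean_k E[S_k] <= 1, regardless of per-site sample sizes.
trials = [
    np.mean([site_e_value(rng.normal(0.0, 1.0, size=n)) for n in (40, 25, 10)])
    for _ in range(5000)
]
print(f"mean combined e-value under H0 across trials: {np.mean(trials):.3f}")
```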

The Type I error guarantee survives the unequal enrollment, the rolling site completions, and the continuous monitoring. The classical alternative requires pre-specifying every interim look and using O'Brien-Fleming boundaries scaled by information fraction; the safe approach is operationally simpler and matches or beats the classical power at moderate effect sizes.

References

Canonical:

  • Grünwald, P., de Heide, R., and Koolen, W. (2024). "Safe testing." Journal of the Royal Statistical Society, Series B 86(4), pp. 1091-1128. The foundational paper that defines the safe-testing framework, the RIPr construction, and the admissibility theory.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. The survey paper that places safe testing in the broader game-theoretic-statistics framework.
  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. The e-value foundations on which safe testing is built.

Current:

  • Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). "Universal inference." Proceedings of the National Academy of Sciences 117(29). Split-sample safe tests for composite nulls.
  • Henzi, A. and Ziegel, J. F. (2022). "Valid sequential inference on probability forecast performance." Biometrika 109(3). Safe testing for forecast evaluation, central to modern ML model comparison.
  • Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. Bounded-outcome safe tests with state-of-the-art intervals.

Critique and context:

  • Pawel, S. and Held, L. (2022). "The sceptical Bayes factor for the assessment of replication success." Journal of the Royal Statistical Society, Series A 185(2). Compares safe testing to Bayes factor for replication.
  • Berger, J. O. and Sellke, T. (1987). "Testing a point null hypothesis: the irreconcilability of p values and evidence." Journal of the American Statistical Association 82(397). The historical critique of pp-values that safe testing addresses.


Last reviewed: May 13, 2026
