

Safe Testing

A formal framework for hypothesis testing where every test statistic is an e-value and every sequential procedure is an e-process. Safe tests survive optional stopping, optional continuation, and selective combination by construction. The Grünwald-de Heide-Koolen 2024 formalization replaces Neyman-Pearson tests with admissible tests built on reverse information projection.

Advanced · Research · Tier 1 · Frontier · Core spine · ~50 min
For: ML, Stats, Research

Why This Matters

Classical Neyman-Pearson testing fixes a sample size, a null, and a (possibly composite) alternative. The test is constructed to maximize power at level $\alpha$. The framework breaks under optional stopping, optional continuation, and the combination of evidence across studies that share the same null. The patches (group-sequential boundaries, alpha-spending, meta-analytic adjustments) are rigid and brittle.

Safe testing, formalized by Grünwald, de Heide, and Koolen (2024, Journal of the Royal Statistical Society Series B), is the framework where every test is built from e-values by construction. The framework imposes a single requirement: the test statistic is a nonnegative random variable whose expectation under the null is at most one. From that constraint, every standard sequential operation (peeking, stopping early, continuing past a planned stop, combining studies, adapting the analysis) is automatically valid.

The unification matters because it replaces a constellation of mutually incompatible classical techniques with a single coherent framework. Neyman-Pearson tests, group-sequential tests, Bayes factors, likelihood-ratio tests, and the SPRT all become special cases of safe tests under different choices of evidence functional. The cost of the unification is a small power loss at the optimum fixed-$n$ alternative; the gain is robust validity across the actual decision rules used in practice.

Formal Setup

Definition

Safe test

A safe test at level $\alpha$ for a null hypothesis $H_0$ is a function $S(X)$ of the data such that $S(X) \geq 1/\alpha$ implies rejection, and $S$ is an e-value: $\mathbb{E}_P[S] \leq 1$ for every $P \in H_0$. The Type I error of the rejection rule is at most $\alpha$ by Markov's inequality.
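
The Markov step can be checked directly in simulation. A minimal sketch (a toy example of my own, not from the paper): for the simple null N(0,1) against the alternative N(1,1), the one-observation likelihood ratio exp(x - 1/2) is an exact e-value.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

# E-value for the simple null N(0,1) vs alternative N(1,1), one observation:
# S(x) = phi(x - 1) / phi(x) = exp(x - 1/2); under the null E[S] = 1.
def e_value(x):
    return np.exp(x - 0.5)

# Reject when S >= 1/alpha; Markov bounds the Type I error by alpha.
x_null = rng.normal(0.0, 1.0, size=200_000)
type1 = np.mean(e_value(x_null) >= 1 / alpha)
print(f"empirical Type I error: {type1:.5f} (Markov bound: {alpha})")
```

The empirical rate lands far below $\alpha$ here: Markov is loose for this e-value, and the bound, not an exact level, is what the construction guarantees.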

Definition

Safe anytime-valid test

A safe anytime-valid test is a sequential procedure with rejection rule "stop and reject the first time $S_t \geq 1/\alpha$" where $(S_t)$ is an e-process. By Ville's inequality, the Type I error at every stopping time is at most $\alpha$.
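
Ville's inequality can also be checked by simulation (again a toy example, not from the paper): the running likelihood-ratio product for N(0,1) versus N(1,1) is a nonnegative martingale under the null, and the fraction of null sample paths that ever cross $1/\alpha$ stays below $\alpha$ no matter when you stop.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_steps, n_runs = 0.05, 400, 10_000

# E-process: S_t = prod_{i<=t} exp(X_i - 1/2) for N(0,1) vs N(1,1);
# a nonnegative martingale with E[S_t] = 1 under the null.
x = rng.normal(0.0, 1.0, size=(n_runs, n_steps))
log_S = np.cumsum(x - 0.5, axis=1)

# Ville: P(sup_t S_t >= 1/alpha) <= alpha, so ANY stopping rule is safe.
ever_crossed = np.mean(log_S.max(axis=1) >= np.log(1 / alpha))
print(f"null paths ever crossing 1/alpha: {ever_crossed:.4f} (bound: {alpha})")
```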

Definition

Reverse information projection (RIPr)

For an alternative density $P_1$ and a composite null $\mathcal{P}_0$, the RIPr is the element of $\mathcal{P}_0$ that minimizes Kullback-Leibler divergence to $P_1$: $P_0^* = \arg\min_{P_0 \in \mathcal{P}_0} \mathrm{KL}(P_1 \| P_0)$. The likelihood ratio $p_1(X)/p_0^*(X)$ is an e-value for the composite null $\mathcal{P}_0$ and is admissible in the safe-testing decision-theoretic sense.
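
A toy illustration of the projection (my own one-parameter example, not the paper's algorithm): for the null family {N(θ,1): θ ≤ 0} and alternative N(1,1), KL(N(1,1) ‖ N(θ,1)) = (1 − θ)²/2, so the RIPr is N(0,1), and the resulting likelihood ratio has null expectation at most one across the whole family.

```python
import numpy as np

# Null family {N(theta,1): theta <= 0}, alternative P1 = N(1,1).
# KL(N(1,1) || N(theta,1)) = (1 - theta)^2 / 2 -> minimized at theta = 0.
thetas = np.linspace(-3.0, 0.0, 3001)
theta_star = thetas[np.argmin((1.0 - thetas) ** 2 / 2.0)]
print(f"RIPr parameter: theta* = {theta_star:.3f}")

# The e-value p1(X)/p0*(X) = exp(X - 1/2) satisfies E_P[S] <= 1 for EVERY
# null point (the expectation is exp(theta)), with equality at the RIPr.
rng = np.random.default_rng(2)
means = []
for theta in (-1.0, -0.5, 0.0):
    x = rng.normal(theta, 1.0, size=200_000)
    means.append(np.exp(x - 0.5).mean())
    print(f"theta = {theta:+.1f}: mean e-value = {means[-1]:.3f}")
```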

The Core Result

The construction that makes safe testing work is the e-value itself.

Theorem

E-Value Construction Yields a Safe Test

Statement

Let $S(X)$ be a nonnegative function of the data $X$ with $\sup_{P \in H_0} \mathbb{E}_P[S] \leq 1$. The test that rejects $H_0$ when $S(X) \geq 1/\alpha$ is a safe test at level $\alpha$: its Type I error is at most $\alpha$, uniformly over $H_0$.

If additionally $(S_t)$ is an e-process for $H_0$, the sequential rule "stop and reject at the first $t$ with $S_t \geq 1/\alpha$" is a safe anytime-valid test at level $\alpha$: Type I error at most $\alpha$ for every stopping rule.

Intuition

The Markov inequality gives the single-shot guarantee; Ville's inequality gives the sequential guarantee. The construction does not optimize for power; it guarantees validity. Power comes from clever choice of the e-value or e-process.

Proof Sketch

Single-shot: $\Pr_P(S \geq 1/\alpha) \leq \alpha \cdot \mathbb{E}_P[S] \leq \alpha$ by Markov's inequality.

Sequential: Ville's inequality applied to the nonnegative supermartingale $(S_t)$.

Validity holds under any $P \in H_0$, simultaneously over the entire null family, because the e-value/e-process bound is uniform in $P$.

Why It Matters

Safe tests are the operational object the analyst computes. The framework substitutes "construct an e-value" for the classical recipe "compute a sufficient statistic, normalize it, look up the tail probability of its null distribution." Every classical Neyman-Pearson test maps to a safe test (the likelihood ratio is the e-value); the converse is not true (some safe tests have no Neyman-Pearson analog).

Failure Mode

The e-value must satisfy $\mathbb{E}_P[S] \leq 1$ for every $P \in H_0$, not just for a chosen point null. Composite nulls require constructions that handle the entire null family at once: universal inference, RIPr, or numerical worst-case search.

Construction Methods

The Grünwald-de Heide-Koolen 2024 paper organizes safe-test construction into four families:

Simple-versus-simple (likelihood ratio). When both $H_0 = \{P_0\}$ and $H_1 = \{P_1\}$ are simple, the e-value is the likelihood ratio $p_1/p_0$. This is the SPRT in the sequential case. The construction is exact, and the safe test is the (uniformly most powerful) Neyman-Pearson test up to the Markov calibration.

Simple null, composite alternative. Pick a single point $P_1^*$ in the alternative (Bayes-marginalized over a prior, the worst case under a minimax criterion, or the growth-rate-optimal point). The likelihood ratio against this fixed $P_1^*$ is an e-value. The GROW (growth-rate-optimal in the worst case) e-value maximizes the worst-case expected log-payoff over the alternative.
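
The Bayes-marginalized option can be sketched in a few lines (a toy mixture chosen for illustration): averaging per-point likelihood ratios under prior weights yields a single statistic whose null expectation is exactly one, hence an e-value for the simple null.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simple null N(0,1); composite alternative {N(mu,1)} mixed under a prior.
mus = np.array([0.25, 0.5, 1.0])
w = np.array([0.5, 0.3, 0.2])              # prior weights, sum to 1

def mixture_e_value(x):
    # Per-point likelihood ratio: exp(mu * sum(x) - n * mu^2 / 2).
    n, s = len(x), x.sum()
    return float(np.sum(w * np.exp(mus * s - n * mus**2 / 2)))

# Under the null, E[mixture] = sum_j w_j * E[LR_j] = sum_j w_j = 1.
x_null = rng.normal(0.0, 1.0, size=(50_000, 5))
mean_e = np.mean([mixture_e_value(row) for row in x_null])
print(f"mean mixture e-value under H0: {mean_e:.3f}")
```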

Composite null, composite alternative. Two main constructions:

  1. Universal inference (Wasserman-Ramdas-Balakrishnan 2020): split the sample, fit a model on one half, evaluate the likelihood ratio on the other. Always applicable; loses a factor of $\sqrt{2}$ in power.
  2. Reverse information projection (Grünwald-de Heide-Koolen 2024): replace $P_0$ in the likelihood ratio by the RIPr of the alternative onto the null. The resulting e-value is admissible and approaches Bayes-factor power.
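
Construction 1 can be sketched concretely (a toy one-dimensional Gaussian instance of universal inference, my own example): fit the alternative on half the data, take the null MLE on the other half, and the resulting split likelihood ratio is an e-value for the composite null.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.05

# Universal inference for the composite null {N(theta,1): theta <= 0}.
def split_lr_e_value(x):
    d1, d0 = x[: len(x) // 2], x[len(x) // 2:]
    theta_alt = d1.mean()                  # alternative fitted on D1
    theta_null = min(d0.mean(), 0.0)       # null MLE on D0
    log_lr = 0.5 * np.sum((d0 - theta_null) ** 2 - (d0 - theta_alt) ** 2)
    return np.exp(log_lr)

# Under a null point (theta = -0.2), rejecting at e >= 1/alpha keeps
# the Type I error below alpha (in fact far below: Markov is loose).
x_null = rng.normal(-0.2, 1.0, size=(20_000, 20))
rejects = np.mean([split_lr_e_value(row) >= 1 / alpha for row in x_null])
print(f"empirical Type I error: {rejects:.5f} (bound: {alpha})")
```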

Sample-mean tests on bounded outcomes. The betting-strategy construction (Waudby-Smith-Ramdas 2024) builds e-processes for $H_0: \mu \leq \mu_0$ via predictable bets. Tightest known intervals for bounded-outcome A/B tests.
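
A minimal betting sketch (with a constant bet for clarity; Waudby-Smith and Ramdas use adaptive, predictable bet sizes): for $H_0: \mu \leq \mu_0$ with outcomes in [0,1], the wealth process $\prod_t (1 + \lambda(X_t - \mu_0))$ stays nonnegative whenever $0 \leq \lambda \leq 1/\mu_0$ and is a supermartingale under the null, hence an e-process.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, mu0, lam = 0.05, 0.4, 0.5   # 0 <= lam <= 1/mu0 keeps wealth >= 0

# Wealth process S_t = prod_{s<=t} (1 + lam * (X_s - mu0)).
# Under H0: mu <= mu0 each factor has mean <= 1, so S_t is a supermartingale.

# H0 false (true mean 0.6): wealth grows; rejection comes quickly.
x_alt = rng.uniform(0.3, 0.9, size=400)
log_S = np.cumsum(np.log1p(lam * (x_alt - mu0)))
hits = log_S >= np.log(1 / alpha)
first_reject = int(np.argmax(hits)) + 1 if hits.any() else None
print("first rejection at t =", first_reject)

# H0 true (true mean 0.35): ever-crossing rate stays below alpha (Ville).
x_null = rng.uniform(0.0, 0.7, size=(10_000, 400))
log_S0 = np.cumsum(np.log1p(lam * (x_null - mu0)), axis=1)
ever = np.mean(log_S0.max(axis=1) >= np.log(1 / alpha))
print(f"null ever-crossing rate: {ever:.4f} (bound: {alpha})")
```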

Comparison: Neyman-Pearson vs Safe Tests

| Property | Neyman-Pearson test | Safe test |
| --- | --- | --- |
| Decision rule | Reject if $T \geq t_\alpha$, the $1-\alpha$ quantile of $T$ under $H_0$ | Reject if $S \geq 1/\alpha$ |
| Sample size | Fixed before data collection | Any stopping rule |
| Type I error | Exactly $\alpha$ (continuous case) | At most $\alpha$ (often strictly less) |
| Power at fixed $n$ | Optimal under simple alternatives | Within constants of optimal |
| Composite null | Needs uniform tail bound (often hard) | Handled by universal inference or RIPr |
| Multiple testing | Bonferroni / BH adjustments after the fact | e-BH integrates into the framework natively |
| Combination across studies | Meta-analytic, needs independence | Conditional product or averaging is automatic |

The headline trade-off: safe tests sacrifice a power factor (the Markov-to-uniform-distribution gap, typically a factor of 2 to 3 in $p$-value calibration) to gain robustness across every operational dimension that makes classical testing fragile in practice.
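
The gap can be made concrete with a $p$-to-e calibrator (this uses the $\kappa$-family of Vovk and Wang; $\kappa = 1/2$ is an arbitrary illustrative choice): $e(p) = \kappa p^{\kappa-1}$ integrates to one over a uniform null $p$-value, so it is an e-value, but rejecting at $e \geq 1/\alpha$ demands a far smaller $p$ than the classical cutoff.

```python
import numpy as np

kappa, alpha = 0.5, 0.05

# Calibrator e(p) = kappa * p^(kappa - 1); its integral over Uniform(0,1)
# is 1, so e(p) is an e-value whenever p is a valid p-value.
rng = np.random.default_rng(6)
p = rng.uniform(size=1_000_000)
mean_e = (kappa * p ** (kappa - 1)).mean()
print(f"mean e(p) under a uniform null: {mean_e:.3f}")

# e(p) >= 1/alpha  <=>  p <= (alpha * kappa)^(1 / (1 - kappa))
p_thresh = (alpha * kappa) ** (1 / (1 - kappa))
print(f"e-value rejection needs p <= {p_thresh:.6f} (classical cutoff: {alpha})")
```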

Canonical Example: Safe Sequential t-Test

Test $H_0: \mu = 0$ for iid Gaussian data $X_1, X_2, \ldots \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. The classical $t$-test fixes $n$, computes $T_n = \sqrt{n}\,\bar{X}_n / s_n$, and rejects when $|T_n| \geq t_{\alpha/2}$, the upper $\alpha/2$ quantile of the $t$-distribution with $n - 1$ degrees of freedom.

The Grünwald-Henzi-Lardy safe sequential $t$-test (2024) replaces the fixed $n$ with an e-process built from the Bayesian mixture of $t$-densities against a fixed alternative variance prior. The construction:

  1. Place a Cauchy prior on the alternative $\mu$ and a scaled inverse-chi-squared prior on $\sigma^2$.
  2. The ratio of the marginal likelihood under the Bayesian alternative to the likelihood under $H_0$ is the e-value: a Bayes factor.
  3. The running Bayes factor is an e-process under $H_0$. Reject at the first $t$ where the e-value exceeds $1/\alpha = 20$ (for $\alpha = 0.05$).

The safe $t$-test gives Type I error $\leq 5\%$ across all stopping times. At a fixed $n = 100$, the safe-$t$ power is approximately 85% of the classical $t$-test power for $\mu = 0.3\sigma$, a 15% relative power loss. For sequential designs where the analyst would peek 10+ times, the safe test dominates the corrected classical test.
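
The full safe $t$-test marginalizes over both $\mu$ and $\sigma^2$, which has no elementary closed form; a known-variance analog shows the same mechanics. Assuming $\sigma = 1$ and a $\mathcal{N}(0, \tau^2)$ prior on $\mu$ (a simplification for illustration, not the Grünwald-Henzi-Lardy construction), the running Bayes factor is $S_n = (1 + n\tau^2)^{-1/2} \exp(\tau^2 s_n^2 / (2(1 + n\tau^2)))$ with $s_n$ the running sum.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, tau2 = 0.05, 1.0

# Running Bayes factor for H0: mu = 0 vs mu ~ N(0, tau2), sigma = 1 known:
# log S_n = -0.5*log(1 + n*tau2) + tau2 * s_n^2 / (2*(1 + n*tau2)).
def log_bayes_factor(x):
    n = np.arange(1, len(x) + 1)
    s = np.cumsum(x)
    return -0.5 * np.log(1 + n * tau2) + tau2 * s**2 / (2 * (1 + n * tau2))

# Under the alternative mu = 0.3, the e-process grows like n*mu^2/2
# and crosses 1/alpha = 20 at a data-dependent time.
x = rng.normal(0.3, 1.0, size=2000)
log_bf = log_bayes_factor(x)
crossed = np.flatnonzero(log_bf >= np.log(1 / alpha))
print("first crossing at n =", int(crossed[0]) + 1 if crossed.size else None)
```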

Worked Exercise

Exercise · Advanced

Problem

A standard Neyman-Pearson likelihood-ratio test for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ on iid data rejects when $\prod_{i=1}^n p_{\theta_1}(X_i)/p_{\theta_0}(X_i) \geq c$ for a threshold $c$ chosen to give Type I error $\alpha$. Show that this is exactly a safe test (in the single-shot sense) with the threshold $c = 1/\alpha$. Quantify the difference in $c$ between the Markov calibration and the exact-distribution calibration for $\mathcal{N}(0, 1)$ versus $\mathcal{N}(1, 1)$ at $n = 16$.

Implementation Note

The safestats R package (companion to Grünwald-de Heide-Koolen 2024) implements safe tests for the binomial, Gaussian, $t$, $\chi^2$, and Poisson cases with mixture priors chosen for growth-rate optimality against bounded alternatives.

For ML-style applications with bounded outcomes (accuracy on a benchmark, conversion-rate differential in an A/B test), the confseq Python package provides empirical-Bernstein safe tests with closed-form e-process update rules. Each new observation contributes a multiplicative factor; the cumulative log-e-value can be tracked in $O(1)$ per observation.

A common implementation pitfall: the e-value construction must be specified before the data it is evaluated on are seen. The betting strategy or the alternative density choice is a pre-specified (predictable) element of the design. Adapting the e-value construction after looking at the same data violates the supermartingale property and breaks the guarantee. Pre-register the e-value functional; the stopping time remains free.

Practical Example: Multi-Site Clinical Trial Meta-Analysis

A pharmaceutical sponsor coordinates a multi-center trial of a new treatment against placebo. Centers complete enrollment at different times. The classical approach pre-specifies a meta-analysis at the planned end date and combines pp-values via Fisher's combination. Peeking at interim results is forbidden.

A safe-testing approach:

  1. Each center constructs an e-value for the per-site null hypothesis.
  2. Combine the per-site e-values via averaging (still a valid e-value).
  3. The aggregate e-value can be monitored continuously across centers without inflating Type I error.
  4. The combined trial can stop when the aggregate e-value crosses 1/α1/\alpha, even if some centers have not finished enrollment.
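
The combination step above can be sketched as follows (toy Gaussian sites and a point alternative chosen for illustration; the averaging identity is the part that matters): since each site's statistic has null expectation at most one, so does their average.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, mu1 = 0.05, 0.2

# Per-site e-value: likelihood ratio of N(mu1,1) to N(0,1) on that site's data.
def site_e_value(x):
    return float(np.exp(mu1 * x.sum() - len(x) * mu1**2 / 2))

# Unequal enrollment across three sites, all under the global null.
sites = [rng.normal(0.0, 1.0, size=n) for n in (40, 25, 10)]
e_combined = np.mean([site_e_value(x) for x in sites])
print(f"combined e-value: {e_combined:.3f} (reject if >= {1 / alpha:.0f})")

# Validity check: averaging preserves the e-value property,
# E[mean_k S_k] = mean_k E[S_k] <= 1, regardless of per-site sample sizes.
trials = [
    np.mean([site_e_value(rng.normal(0.0, 1.0, size=n)) for n in (40, 25, 10)])
    for _ in range(5000)
]
print(f"mean combined e-value under H0 across trials: {np.mean(trials):.3f}")
```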

The Type I error guarantee survives the unequal enrollment, the rolling site completions, and the continuous monitoring. The classical alternative requires pre-specifying every interim look and using O'Brien-Fleming boundaries scaled by information fraction; the safe approach is operationally simpler and matches or beats the classical power at moderate effect sizes.

References

Canonical:

  • Grünwald, P., de Heide, R., and Koolen, W. (2024). "Safe testing." Journal of the Royal Statistical Society, Series B 86(4), pp. 1091-1128. The foundational paper that defines the safe-testing framework, the RIPr construction, and the admissibility theory.
  • Ramdas, A., Grünwald, P., Vovk, V., and Shafer, G. (2023). "Game-theoretic statistics and safe anytime-valid inference." Statistical Science 38(4), pp. 576-601. The survey paper that places safe testing in the broader game-theoretic-statistics framework.
  • Vovk, V. and Wang, R. (2021). "E-values: Calibration, combination and applications." Annals of Statistics 49(3), pp. 1736-1754. The e-value foundations on which safe testing is built.

Current:

  • Wasserman, L., Ramdas, A., and Balakrishnan, S. (2020). "Universal inference." Proceedings of the National Academy of Sciences 117(29). Split-sample safe tests for composite nulls.
  • Henzi, A. and Ziegel, J. F. (2022). "Valid sequential inference on probability forecast performance." Biometrika 109(3). Safe testing for forecast evaluation, central to modern ML model comparison.
  • Waudby-Smith, I. and Ramdas, A. (2024). "Estimating means of bounded random variables by betting." Journal of the Royal Statistical Society, Series B 86(1), pp. 1-27. Bounded-outcome safe tests with state-of-the-art intervals.

Critique and context:

  • Pawel, S. and Held, L. (2022). "The sceptical Bayes factor for the assessment of replication success." Journal of the Royal Statistical Society, Series A 185(2). Compares safe testing to Bayes factor for replication.
  • Berger, J. O. and Sellke, T. (1987). "Testing a point null hypothesis: the irreconcilability of p values and evidence." Journal of the American Statistical Association 82(397). The historical critique of pp-values that safe testing addresses.


Last reviewed: May 13, 2026
