Statistical Estimation
Permutation Tests
Exchangeability-based hypothesis testing: under the null of no group effect, the labels are exchangeable, so the distribution of any test statistic under random relabeling gives an exact null reference. Exact for small samples, approximated by Monte Carlo for large samples, robust under non-Normality and heavy tails.
Why This Matters
Permutation tests answer a question that classical asymptotic tests evade: how do you test a hypothesis when you do not want to assume a parametric model, the central limit theorem has not had time to converge, or the test statistic of interest has no closed-form null distribution? The answer is to build the null reference distribution from the data itself, by reshuffling the labels under a constraint (exchangeability) that is true under the null.
The result is a test that is exact in size (not asymptotic, not approximate; the Type I error rate is the nominal $\alpha$ at every $\alpha$), distribution-free under exchangeability, and applicable to any test statistic the researcher chooses. The cost is computational: an exact permutation test enumerates all $n!$ relabelings (or all $\binom{n_1+n_2}{n_1}$ distinct splits in the two-sample case), which is feasible for small $n$ and infeasible for large $n$. The Monte Carlo permutation test samples a random subset of relabelings, trading exactness for tractability, and recovers an arbitrarily close approximation by drawing more samples.
Permutation tests are the basis of most modern nonparametric inference. They power the standard t-test alternative under non-Normality, the standard two-sample test alternative to Welch, and most modern kernel-based, distance-based, and feature-importance tests. They are sometimes confused with the bootstrap; the two are related but solve different problems. The distinction is in the constraint: permutation tests preserve the null distribution; the bootstrap preserves the empirical distribution.
Exchangeability
A finite collection of random variables $X_1, \dots, X_n$ is exchangeable if its joint distribution is invariant under permutation of indices:

$$(X_1, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, \dots, X_{\pi(n)}) \quad \text{for every permutation } \pi \text{ of } \{1, \dots, n\}.$$
I.i.d. samples are exchangeable. Exchangeable random variables can be dependent (e.g., draws from a Polya urn), so the class is strictly larger than i.i.d. The use in permutation testing is one-directional: under the null hypothesis of no group effect, the labels we attach to the observations are exchangeable, even if the observations themselves are dependent in some structured way.
For a two-sample problem with observed values $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$, the null hypothesis "the $x$ and $y$ samples come from the same distribution" implies the pooled sample of $n_1 + n_2$ values is exchangeable. Permutation testing exploits this directly.
Permutation Test for Two Samples
Suppose we observe two samples of sizes $n_1$ and $n_2$ and want to test $H_0: F = G$, where $F$ and $G$ are the underlying distributions. Let $T$ be any two-sample test statistic (difference in sample means, Mann-Whitney rank-sum, KS statistic, kernel maximum mean discrepancy, etc.).
The permutation-test procedure is:
- Compute the observed value $T_{\text{obs}} = T(x_1, \dots, x_{n_1}; y_1, \dots, y_{n_2})$.
- Pool the $n_1 + n_2$ observations.
- For every way to split the pooled sample into groups of sizes $n_1$ and $n_2$, compute $T$ for that split. There are $\binom{n_1+n_2}{n_1}$ splits.
- The permutation $p$-value for a two-sided test is the proportion of splits with $|T| \ge |T_{\text{obs}}|$.
The randomness in the test is purely a property of the labels under the null. The data values are held fixed; the labels are reshuffled. This is the key difference from the bootstrap, which resamples observations with replacement and changes the data values.
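The enumeration procedure above can be sketched in a few lines. This is a minimal illustration using the difference-of-means statistic; the function name is ours, not a library API:

```python
from itertools import combinations

def exact_perm_test(x, y):
    """Exact two-sided permutation test, difference-of-means statistic.

    Enumerates all C(n1+n2, n1) splits of the pooled sample and returns
    the proportion of splits with |T| >= |T_obs|.
    """
    mean = lambda v: sum(v) / len(v)
    pooled, n1 = list(x) + list(y), len(x)
    t_obs = abs(mean(x) - mean(y))
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += abs(mean(xs) - mean(ys)) >= t_obs
        total += 1
    return hits / total

# Of the 6 splits of {1,2,3,4} into two pairs, only the original split
# and its mirror are as extreme as the observed difference:
print(exact_perm_test([1, 2], [3, 4]))  # 1/3
```

Note that the observed labeling is itself one of the enumerated splits, so the exact permutation $p$-value is never zero.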
Permutation Test Exactness
Statement
For any test statistic $T$ and any level $\alpha \in (0, 1)$, the permutation test that rejects when the observed $T$ exceeds the $1 - \alpha$ quantile of the permutation distribution has Type I error rate at most $\alpha$. If the test is randomized at the boundary (reject with an appropriately chosen probability when $T$ equals the quantile), the size is exactly $\alpha$.
Intuition
Under the null, the joint distribution of the labels and the data is invariant under relabeling. The permutation distribution of $T$ under random relabeling is therefore the conditional distribution of $T$ given the data values, with the labels uniform over relabelings. Rejecting when $T_{\text{obs}}$ is in the upper-$\alpha$ tail of this conditional distribution has exact level $\alpha$ conditional on the data, hence unconditionally.
Proof Sketch
By exchangeability under the null, the data is jointly distributed identically to its image under any relabeling. The conditional distribution of the labeling given the unordered multiset of values is therefore uniform over the $\binom{n_1+n_2}{n_1}$ labelings. A rejection rule based on the upper-$\alpha$ quantile of this uniform distribution has size $\alpha$ conditional on the values, and by the tower property, size $\alpha$ marginally. Tie-handling at the boundary uses a randomized rule to handle the discrete grid of permutation $p$-values.
Why It Matters
Exact size at every $\alpha$, not asymptotic. The test is distribution-free under exchangeability: no parametric assumptions, no central limit theorem invocation, no degree-of-freedom adjustment. The choice of test statistic is unconstrained; any function of the data and labels works. Powerful statistics give powerful tests, but the choice does not affect the test's level.
Failure Mode
Exchangeability is the load-bearing assumption. If the null hypothesis does not imply exchangeability, the test is not valid. Two-sample tests of "do the samples have the same mean" do not yield exchangeability when the distributions differ in variance but share a mean; for those, the permutation test on the difference of means has the wrong size. Workarounds include studentizing the test statistic (use the t-statistic instead of the raw difference) or testing distributional equality rather than just the mean.
Monte Carlo Approximation
Enumerating all splits is feasible only for small samples. For $n_1 = n_2 = 20$, there are $\binom{40}{20} \approx 1.4 \times 10^{11}$ splits, impractical. The Monte Carlo permutation test draws $B$ random permutations and uses the empirical proportion as the $p$-value:

$$\hat{p} = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\{|T_b| \ge |T_{\text{obs}}|\}}{B + 1},$$

where the "$+1$" in numerator and denominator ensures the $p$-value is conservative (does not undershoot when the event $\{|T_b| \ge |T_{\text{obs}}|\}$ has positive probability under the permutation distribution).
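A minimal Monte Carlo sketch with the $+1$ correction, again using the difference-of-means statistic (function name and defaults are illustrative):

```python
import random

def mc_perm_test(x, y, B=999, seed=0):
    """Monte Carlo permutation p-value, +1-corrected:
    p = (1 + #{b : |T_b| >= |T_obs|}) / (B + 1)."""
    rng = random.Random(seed)
    mean = lambda v: sum(v) / len(v)
    pooled, n1 = list(x) + list(y), len(x)
    t_obs = abs(mean(x) - mean(y))
    hits = 0
    for _ in range(B):
        rng.shuffle(pooled)                 # one uniform random relabeling
        hits += abs(mean(pooled[:n1]) - mean(pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)
```

With identical samples the observed statistic is 0, every permutation ties or exceeds it, and the $p$-value is exactly 1; the $+1$ also guarantees $\hat{p} \ge 1/(B+1) > 0$.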
Monte Carlo Permutation Validity
Statement
The Monte Carlo permutation test with $B$ random permutations and the $+1$-corrected $p$-value satisfies $\Pr_{H_0}(\hat{p} \le \alpha) \le \alpha$ for every $\alpha \in (0, 1)$. The size of the test is bounded above by $\alpha$, with equality whenever $\alpha(B+1)$ is an integer and in the limit $B \to \infty$.
Intuition
Under the null, the observed statistic $T_{\text{obs}}$ and the permutation-resampled statistics $T_1, \dots, T_B$ are exchangeable (they are draws from the same permutation distribution conditional on the data, plus the "extra" original ordering). The rank of $T_{\text{obs}}$ among the $B + 1$ values is uniform on $\{1, \dots, B+1\}$, so the $p$-value is uniform on the grid $\{1/(B+1), 2/(B+1), \dots, 1\}$. Rejecting when $\hat{p} \le \alpha$ gives size $\lfloor \alpha(B+1) \rfloor / (B+1) \le \alpha$ exactly.
Proof Sketch
Under $H_0$, the original-ordering statistic $T_{\text{obs}}$ has the same conditional distribution as each $T_b$ given the data. The vector $(T_{\text{obs}}, T_1, \dots, T_B)$ is exchangeable, so the rank of $T_{\text{obs}}$ is uniformly distributed on $\{1, \dots, B+1\}$ (ties broken at random). The event $\{\hat{p} \le \alpha\}$ corresponds to the rank of $T_{\text{obs}}$ being among the top $\lfloor \alpha(B+1) \rfloor$ values; this has probability $\lfloor \alpha(B+1) \rfloor / (B+1) \le \alpha$.
Why It Matters
The "$+1$" correction is what makes Monte Carlo permutation tests valid at every $B$, not just asymptotically. Without the correction the $p$-value is anticonservative. Modern statistical software uses the $(1 + \text{count})/(B + 1)$ formulation by default for permutation $p$-values.
Failure Mode
The validity argument assumes the permutations are drawn uniformly, either without replacement from the set of all permutations or with replacement; both work because the exchangeability argument is the same. Drawing permutations with replacement is the standard implementation. The Monte Carlo $p$-value lives on the discrete grid $\{1/(B+1), \dots, 1\}$; very small $p$-values require correspondingly large $B$. For $\alpha = 0.05$, take $B$ at least $999$.
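The attainable $p$-values form a discrete grid determined by $B$; a tiny sketch makes the resolution constraint concrete (the helper name is ours):

```python
# The +1-corrected Monte Carlo p-value can only take values on the
# grid {1/(B+1), 2/(B+1), ..., 1}; B controls the resolution.
def p_grid(B):
    return [k / (B + 1) for k in range(1, B + 2)]

print(min(p_grid(19)))    # 0.05  -- B = 19 can just barely reject at alpha = 0.05
print(min(p_grid(999)))   # 0.001 -- B = 999 resolves p-values down to 10^-3
```

A $B$ smaller than $1/\alpha - 1$ cannot reject at level $\alpha$ at all, since even the smallest attainable $p$-value exceeds $\alpha$.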
Permutation Tests Versus the Bootstrap
The two procedures are often confused. They are related but solve different problems.
| Aspect | Permutation test | Bootstrap |
|---|---|---|
| Goal | Test a null hypothesis | Estimate a sampling distribution |
| Resampling unit | Labels (or sign flips, ranks) | Observations |
| Resampling method | Without replacement (relabeling) | With replacement |
| Preserved quantity | Null distribution | Empirical distribution |
| Output | $p$-value | Standard error or confidence interval |
| Exact at finite $n$? | Yes, under exchangeability | No (consistent only) |
| Assumptions | Exchangeability under null | Bootstrap consistency for the statistic |
Use the bootstrap when you want to estimate variability or build confidence intervals without a parametric model. Use a permutation test when you want to test a null hypothesis with exact size and minimal assumptions. The two are complementary, not competitors.
A common pattern: use the bootstrap for the confidence interval and the permutation test for the $p$-value. The two can yield slightly different conclusions (e.g., a bootstrap CI that excludes zero and a permutation $p$-value above 0.05) when the test statistic is studentized differently or the bootstrap targets a different parameter.
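To make the contrast in the table concrete, here is a sketch of both procedures on the same difference-of-means statistic. All names, data, and parameter choices are illustrative, not from the text:

```python
import random

def mean_diff(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def perm_pvalue(x, y, B, rng):
    """Permutation: reshuffle labels, data values fixed -- tests a null."""
    pooled, n1 = list(x) + list(y), len(x)
    t_obs, hits = abs(mean_diff(x, y)), 0
    for _ in range(B):
        rng.shuffle(pooled)
        hits += abs(mean_diff(pooled[:n1], pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)          # +1-corrected p-value

def boot_ci(x, y, B, rng, alpha=0.05):
    """Bootstrap: resample observations with replacement -- estimates variability."""
    diffs = sorted(mean_diff(rng.choices(x, k=len(x)),
                             rng.choices(y, k=len(y))) for _ in range(B))
    return diffs[int(alpha / 2 * B)], diffs[int((1 - alpha / 2) * B) - 1]

rng = random.Random(0)
x = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
y = [12.0, 11.7, 12.3, 11.9, 12.1, 11.8]
p = perm_pvalue(x, y, B=999, rng=rng)    # small p: reject equal distributions
lo, hi = boot_ci(x, y, B=999, rng=rng)   # percentile CI for the mean difference
print(p, lo, hi)
```

The permutation routine never changes the data values, only their group labels; the bootstrap routine changes the values (via resampling) and never touches a null hypothesis.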
Choosing a Test Statistic
Any function of the data works. The choice affects power, not size. Some canonical choices:
| Setting | Test statistic | Comments |
|---|---|---|
| Two-sample mean comparison | Difference of sample means $\bar{x} - \bar{y}$ | Sensitive to heavy tails |
| Two-sample mean comparison (studentized) | $t$-statistic with pooled or Welch standard error | More reliable size under unequal variances and heavy tails |
| Two-sample distributional comparison | Two-sample KS statistic | Sensitive to any difference in distribution |
| Two-sample rank comparison | Wilcoxon rank-sum | Distribution-free under the null, even without permutation |
| Independence in a contingency table | Pearson $\chi^2$ | Permutation version replaces the asymptotic $\chi^2$ reference with exact randomization |
| Independence of two variables | Sample correlation | Or distance correlation for nonlinear dependence |
| Two-sample kernel comparison | Maximum Mean Discrepancy (MMD) | Sensitive to high-dimensional differences |
Studentized statistics (the t-statistic instead of the raw mean difference) typically have more reliable size when the null does not imply full distributional equality but only equal means. The permutation test on the raw mean difference is exact only when distributions are fully equal under the null; the permutation test on the studentized statistic is asymptotically valid even when only the means agree.
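The size distortion described above can be seen in a small simulation: the null of equal means holds, but unequal variances and unequal sample sizes break exchangeability, so the raw-difference permutation test over-rejects while the studentized version stays close to level. Sample sizes, variances, and replication counts below are illustrative choices, not from the text:

```python
import math
import random

def welch_t(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

def raw_diff(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def perm_p(x, y, stat, B, rng):
    pooled, n1 = list(x) + list(y), len(x)
    t_obs, hits = abs(stat(x, y)), 0
    for _ in range(B):
        rng.shuffle(pooled)
        hits += abs(stat(pooled[:n1], pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)

# Null is true (equal means) but the pooled sample is NOT exchangeable:
# a small high-variance group against a large low-variance group.
rng = random.Random(1)
reps, B, rej_raw, rej_t = 200, 99, 0, 0
for _ in range(reps):
    x = [rng.gauss(0, 5) for _ in range(5)]
    y = [rng.gauss(0, 1) for _ in range(50)]
    rej_raw += perm_p(x, y, raw_diff, B, rng) <= 0.05
    rej_t += perm_p(x, y, welch_t, B, rng) <= 0.05
print("raw-difference rejection rate:", rej_raw / reps)  # far above 0.05
print("studentized rejection rate:", rej_t / reps)       # much closer to 0.05
```

The raw difference is calibrated against a permutation distribution whose spread reflects the pooled variance, which badly understates the true variance of the mean difference here; studentizing removes most of that mismatch.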
Sign-Flip and Rotation Tests
Permutation tests apply more broadly than label swapping.
- Sign-flip permutation. For paired data with hypothesized zero treatment effect, randomly flipping the sign of each within-pair difference preserves the null distribution. The Wilcoxon signed-rank test is a sign-flip permutation test on the rank-transformed differences.
- Rotation tests. For data with rotational symmetry under the null (e.g., directional data, or some functional-data settings), random rotations of the residuals give the null reference distribution.
- Stratified permutation. When the data have an external structure (e.g., blocks, time, location) that should be preserved, permutation is restricted to within-stratum relabelings. This is the basis of randomization tests in randomized block designs.
The unifying theme is that the group of transformations under which the null distribution is invariant determines the right resampling scheme. Exchangeability gives the symmetric group of all permutations; sign symmetry gives the sign-flip group; rotational invariance gives the rotation group.
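The sign-flip scheme from the list above can be sketched by enumerating all $2^n$ sign patterns, here with the sum-of-differences statistic (an illustrative helper, not a library API):

```python
from itertools import product

def sign_flip_test(d):
    """Exact sign-flip permutation test for paired differences.

    Under the null (differences symmetric about 0), each |d_i| is equally
    likely to carry either sign, so all 2^n sign patterns are equiprobable.
    Statistic: |sum of signed differences|; two-sided p-value by enumeration.
    """
    t_obs = abs(sum(d))
    hits = total = 0
    for signs in product([1, -1], repeat=len(d)):
        hits += abs(sum(s * v for s, v in zip(signs, d))) >= t_obs
        total += 1
    return hits / total

# With d = [1, 1, 1], only the all-plus and all-minus patterns reach
# |sum| = 3, so the p-value is 2/8 = 0.25.
print(sign_flip_test([1, 1, 1]))  # 0.25
```

Replacing each $d_i$ by its signed rank before summing would turn this into the Wilcoxon signed-rank test mentioned above.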
Common Confusions
Permutation does not equal bootstrap
Permutation and bootstrap are different procedures with different purposes. Permutation resamples labels under a null; bootstrap resamples observations to estimate variability. Using "bootstrap" loosely to mean "resampling-based" obscures the distinction; the size guarantees are different.
Permutation tests are not assumption-free
The exchangeability assumption under the null is real and can fail. Two-sample mean tests on data with unequal variances violate exchangeability under the equal-means null (because exchangeable data must have equal variances too). Use studentized statistics or test the broader null of equal distributions, not just equal means.
Choice of statistic affects power, not validity
The level of a permutation test does not depend on the test statistic. The power does. A poor choice of statistic gives a valid but powerless test; a good choice gives a valid and powerful test. This is the opposite of asymptotic tests, where the choice of statistic affects both the level (through the asymptotic distribution) and the power.
Discrete p-values from finite B
Monte Carlo permutation tests produce $p$-values on the discrete grid $\{1/(B+1), 2/(B+1), \dots, 1\}$. To detect a true $p$-value of $0.001$ reliably, $B$ should be at least a few thousand. For $\alpha = 0.05$, use $B \ge 999$ to ensure the test can resolve the rejection region.
Exercises
Problem
A two-sample test compares $n_1$ values $x_1, \dots, x_{n_1}$ against $n_2$ values $y_1, \dots, y_{n_2}$, with $n_1$ and $n_2$ small enough to enumerate. Use the difference of sample means as the test statistic, enumerate all $\binom{n_1+n_2}{n_1}$ splits, and compute the two-sided exact permutation $p$-value.
Problem
A paired study reports differences $d_1, \dots, d_n$ for $n$ pairs. Use a sign-flip permutation test with the sum-of-differences statistic $\sum_i d_i$ to test the null of symmetry about zero exactly.
Problem
Show by simulation argument that when the two-sample test uses the studentized t-statistic instead of the raw difference of means, the permutation test is asymptotically valid even under unequal variances, while the permutation test using the raw difference is not.
Problem
Suppose you want a permutation test for the null of zero correlation between two variables $X$ and $Y$ measured on $n$ paired observations $(x_i, y_i)$. Construct the test, identify the exact permutation scheme, and explain why pairing-permutation (shuffling one variable's order while keeping the other fixed) is the right scheme.
References
Canonical:
- Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 15 (permutation and randomization tests).
- Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.4 on randomization tests).
- Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses (2005), Chapter 1 and 2.
Foundational papers:
- Fisher, Design of Experiments (1935), introduced randomization tests in the context of agricultural experiments.
- Pitman, "Significance tests which may be applied to samples from any populations" (JRSS Supplement, 1937), the systematic treatment.
Studentized permutation and modern theory:
- Janssen and Pauls, "How do bootstrap and permutation tests work?" (Annals of Statistics, 2003), studentized permutation under unequal variances.
- Romano and Wolf, Multiple Hypothesis Testing (Cambridge, 2010), permutation tests in multiple-testing settings.
Last reviewed: May 11, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Bootstrap Methods (layer 2 · tier 1)
- Hypothesis Testing for ML (layer 2 · tier 2)
- Neyman-Pearson and Hypothesis Testing Theory (layer 2 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.