Statistical Estimation
Permutation Tests
Exchangeability-based hypothesis testing: under the null of no group effect, the labels are exchangeable, so the distribution of any test statistic under random relabeling gives an exact null reference. Exact for small samples, approximated by Monte Carlo for large samples, robust under non-Normality and heavy tails.
Why This Matters
Permutation tests answer a question that classical asymptotic tests evade: how do you test a hypothesis when you do not want to assume a parametric model, the central limit theorem has not had time to converge, or the test statistic of interest has no closed-form null distribution? The answer is to build the null reference distribution from the data itself, by reshuffling the labels under a constraint (exchangeability) that is true under the null.
The result is a test that is exact in size (not asymptotic, not approximate; the Type I error rate is the nominal $\alpha$ at every $\alpha$), distribution-free under exchangeability, and applicable to any test statistic the researcher chooses. The cost is computational: an exact permutation test enumerates all $n!$ relabelings (or all $\binom{n_1+n_2}{n_1}$ distinct splits in the two-sample case), which is feasible for small $n$ and infeasible for large $n$. The Monte Carlo permutation test samples a random subset of relabelings, trading exactness for tractability, and recovers an arbitrarily close approximation by drawing more samples.
Permutation tests are the basis of most modern nonparametric inference. They power the standard t-test alternative under non-Normality, the standard two-sample test alternative to Welch, and most modern kernel-based, distance-based, and feature-importance tests. They are sometimes confused with the bootstrap; the two are related but solve different problems. The distinction is in the constraint: permutation tests preserve the null distribution; the bootstrap preserves the empirical distribution.
Exchangeability
A finite collection of random variables $X_1, \dots, X_n$ is exchangeable if its joint distribution is invariant under permutation of indices:

$$(X_1, \dots, X_n) \overset{d}{=} (X_{\pi(1)}, \dots, X_{\pi(n)}) \quad \text{for every permutation } \pi \text{ of } \{1, \dots, n\}.$$
I.i.d. samples are exchangeable. Exchangeable random variables can be dependent (e.g., draws from a Polya urn), so the class is strictly larger than i.i.d. The use in permutation testing is one-directional: under the null hypothesis of no group effect, the labels we attach to the observations are exchangeable, even if the observations themselves are dependent in some structured way.
For a two-sample problem with observed values $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$, the null hypothesis "the $x$ and $y$ samples come from the same distribution" implies the pooled sample of $n_1 + n_2$ values is exchangeable. Permutation testing exploits this directly.
Permutation Test for Two Samples
Suppose we observe two samples of sizes $n_1$ and $n_2$ and want to test $H_0: F = G$, where $F$ and $G$ are the underlying distributions. Let $T$ be any two-sample test statistic (difference in sample means, Mann-Whitney rank-sum, KS statistic, kernel maximum mean discrepancy, etc.).
The permutation-test procedure is:
- Compute the observed value $T_{\text{obs}} = T(x_1, \dots, x_{n_1}; y_1, \dots, y_{n_2})$.
- Pool the $n_1 + n_2$ observations.
- For every way to split the pooled sample into groups of sizes $n_1$ and $n_2$, compute $T$ for that split. There are $\binom{n_1+n_2}{n_1}$ splits.
- The permutation $p$-value for a two-sided test is the proportion of splits with $|T| \ge |T_{\text{obs}}|$.
The randomness in the test is purely a property of the labels under the null. The data values are held fixed; the labels are reshuffled. This is the key difference from the bootstrap, which resamples observations with replacement and changes the data values.
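The enumeration procedure above can be sketched in a few lines. This is a minimal illustration using the difference-of-means statistic; the function name is ours, not a library API:

```python
from itertools import combinations

def exact_perm_test(x, y):
    """Exact two-sided permutation test, difference-of-means statistic.

    Enumerates all C(n1+n2, n1) splits of the pooled sample and returns
    the proportion of splits with |T| >= |T_obs|.
    """
    mean = lambda v: sum(v) / len(v)
    pooled, n1 = list(x) + list(y), len(x)
    t_obs = abs(mean(x) - mean(y))
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += abs(mean(xs) - mean(ys)) >= t_obs
        total += 1
    return hits / total

# Of the 6 splits of {1,2,3,4} into two pairs, only the original split
# and its mirror are as extreme as the observed difference:
print(exact_perm_test([1, 2], [3, 4]))  # 1/3
```

Note that the observed labeling is itself one of the enumerated splits, so the exact permutation $p$-value is never zero.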
Permutation Test Exactness
Statement
For any test statistic $T$ and any level $\alpha \in (0, 1)$, the permutation test that rejects when the observed $T$ exceeds the $1 - \alpha$ quantile of the permutation distribution has Type I error rate at most $\alpha$. If the test is randomized at the boundary (reject with an appropriately chosen probability when $T$ equals the quantile), the size is exactly $\alpha$.
Intuition
Under the null, the joint distribution of the labels and the data is invariant under relabeling. The permutation distribution of $T$ under random relabeling is therefore the conditional distribution of $T$ given the data values, with the labels uniform over relabelings. Rejecting when $T_{\text{obs}}$ is in the upper-$\alpha$ tail of this conditional distribution has exact level $\alpha$ conditional on the data, hence unconditionally.
Proof Sketch
By exchangeability under the null, the data is jointly distributed identically to its image under any relabeling. The conditional distribution of the labeling given the unordered multiset of values is therefore uniform over the $\binom{n_1+n_2}{n_1}$ labelings. A rejection rule based on the upper-$\alpha$ quantile of this uniform distribution has size $\alpha$ conditional on the values, and by the tower property, size $\alpha$ marginally. Tie-handling at the boundary uses a randomized rule to handle the discrete grid of permutation $p$-values.
Why It Matters
Exact size at every $\alpha$, not asymptotic. The test is distribution-free under exchangeability: no parametric assumptions, no central limit theorem invocation, no degree-of-freedom adjustment. The choice of test statistic is unconstrained; any function of the data and labels works. Powerful statistics give powerful tests, but the choice does not affect the test's level.
Failure Mode
Exchangeability is the load-bearing assumption. If the null hypothesis does not imply exchangeability, the test is not valid. Two-sample tests of "do the samples have the same mean" do not yield exchangeability when the distributions differ in variance but share a mean; for those, the permutation test on the difference of means has the wrong size. Workarounds include studentizing the test statistic (use the t-statistic instead of the raw difference) or testing distributional equality rather than just the mean.
Monte Carlo Approximation
Enumerating all splits is feasible only for small samples. For $n_1 = n_2 = 20$, there are $\binom{40}{20} \approx 1.4 \times 10^{11}$ splits, impractical. The Monte Carlo permutation test draws $B$ random permutations and uses the empirical proportion as the $p$-value:

$$\hat{p} = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\{|T_b| \ge |T_{\text{obs}}|\}}{B + 1},$$

where the "$+1$" in numerator and denominator ensures the $p$-value is conservative (does not undershoot when the event $\{|T_b| \ge |T_{\text{obs}}|\}$ has positive probability under the permutation distribution).
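A minimal Monte Carlo sketch with the $+1$ correction, again using the difference-of-means statistic (function name and defaults are illustrative):

```python
import random

def mc_perm_test(x, y, B=999, seed=0):
    """Monte Carlo permutation p-value, +1-corrected:
    p = (1 + #{b : |T_b| >= |T_obs|}) / (B + 1)."""
    rng = random.Random(seed)
    mean = lambda v: sum(v) / len(v)
    pooled, n1 = list(x) + list(y), len(x)
    t_obs = abs(mean(x) - mean(y))
    hits = 0
    for _ in range(B):
        rng.shuffle(pooled)                 # one uniform random relabeling
        hits += abs(mean(pooled[:n1]) - mean(pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)
```

With identical samples the observed statistic is 0, every permutation ties or exceeds it, and the $p$-value is exactly 1; the $+1$ also guarantees $\hat{p} \ge 1/(B+1) > 0$.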
Monte Carlo Permutation Validity
Statement
The Monte Carlo permutation test with $B$ random permutations and the $+1$-corrected $p$-value satisfies $\Pr_{H_0}(\hat{p} \le \alpha) \le \alpha$ for every $\alpha \in (0, 1)$. The size of the test is bounded above by $\alpha$, with equality whenever $\alpha(B+1)$ is an integer and in the limit $B \to \infty$.
Intuition
Under the null, the observed statistic $T_{\text{obs}}$ and the permutation-resampled statistics $T_1, \dots, T_B$ are exchangeable (they are draws from the same permutation distribution conditional on the data, plus the "extra" original ordering). The rank of $T_{\text{obs}}$ among the $B + 1$ values is uniform on $\{1, \dots, B+1\}$, so the $p$-value is uniform on the grid $\{1/(B+1), 2/(B+1), \dots, 1\}$. Rejecting when $\hat{p} \le \alpha$ gives size $\lfloor \alpha(B+1) \rfloor / (B+1) \le \alpha$ exactly.
Proof Sketch
Under $H_0$, the original-ordering statistic $T_{\text{obs}}$ has the same conditional distribution as each $T_b$ given the data. The vector $(T_{\text{obs}}, T_1, \dots, T_B)$ is exchangeable, so the rank of $T_{\text{obs}}$ is uniformly distributed on $\{1, \dots, B+1\}$ (ties broken at random). The event $\{\hat{p} \le \alpha\}$ corresponds to the rank of $T_{\text{obs}}$ being among the top $\lfloor \alpha(B+1) \rfloor$ values; this has probability $\lfloor \alpha(B+1) \rfloor / (B+1) \le \alpha$.
Why It Matters
The "$+1$" correction is what makes Monte Carlo permutation tests valid at every $B$, not just asymptotically. Without the correction the $p$-value is anticonservative. Modern statistical software uses the $(1 + \text{count})/(B + 1)$ formulation by default for permutation $p$-values.
Failure Mode
The validity argument assumes the permutations are drawn uniformly, either without replacement from the set of all permutations or with replacement; both work because the exchangeability argument is the same. Drawing permutations with replacement is the standard implementation. The Monte Carlo $p$-value lives on the discrete grid $\{1/(B+1), \dots, 1\}$; very small $p$-values require correspondingly large $B$. For $\alpha = 0.05$, take $B$ at least $999$.
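The attainable $p$-values form a discrete grid determined by $B$; a tiny sketch makes the resolution constraint concrete (the helper name is ours):

```python
# The +1-corrected Monte Carlo p-value can only take values on the
# grid {1/(B+1), 2/(B+1), ..., 1}; B controls the resolution.
def p_grid(B):
    return [k / (B + 1) for k in range(1, B + 2)]

print(min(p_grid(19)))    # 0.05  -- B = 19 can just barely reject at alpha = 0.05
print(min(p_grid(999)))   # 0.001 -- B = 999 resolves p-values down to 10^-3
```

A $B$ smaller than $1/\alpha - 1$ cannot reject at level $\alpha$ at all, since even the smallest attainable $p$-value exceeds $\alpha$.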
Permutation Tests Versus the Bootstrap
The two procedures are often confused. They are related but solve different problems.
| Aspect | Permutation test | Bootstrap |
|---|---|---|
| Goal | Test a null hypothesis | Estimate a sampling distribution |
| Resampling unit | Labels (or sign flips, ranks) | Observations |
| Resampling method | Without replacement (relabeling) | With replacement |
| Preserved quantity | Null distribution | Empirical distribution |
| Output | $p$-value | Standard error or confidence interval |
| Exact at finite $n$? | Yes, under exchangeability | No (consistent only) |
| Assumptions | Exchangeability under null | Bootstrap consistency for the statistic |
Use the bootstrap when you want to estimate variability or build confidence intervals without a parametric model. Use a permutation test when you want to test a null hypothesis with exact size and minimal assumptions. The two are complementary, not competitors.
A common pattern: use the bootstrap for the confidence interval and the permutation test for the $p$-value. The two can yield slightly different conclusions (e.g., a bootstrap CI that excludes zero and a permutation $p$-value above 0.05) when the test statistic is studentized differently or the bootstrap targets a different parameter.
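To make the contrast in the table concrete, here is a sketch of both procedures on the same difference-of-means statistic. All names, data, and parameter choices are illustrative, not from the text:

```python
import random

def mean_diff(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def perm_pvalue(x, y, B, rng):
    """Permutation: reshuffle labels, data values fixed -- tests a null."""
    pooled, n1 = list(x) + list(y), len(x)
    t_obs, hits = abs(mean_diff(x, y)), 0
    for _ in range(B):
        rng.shuffle(pooled)
        hits += abs(mean_diff(pooled[:n1], pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)          # +1-corrected p-value

def boot_ci(x, y, B, rng, alpha=0.05):
    """Bootstrap: resample observations with replacement -- estimates variability."""
    diffs = sorted(mean_diff(rng.choices(x, k=len(x)),
                             rng.choices(y, k=len(y))) for _ in range(B))
    return diffs[int(alpha / 2 * B)], diffs[int((1 - alpha / 2) * B) - 1]

rng = random.Random(0)
x = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
y = [12.0, 11.7, 12.3, 11.9, 12.1, 11.8]
p = perm_pvalue(x, y, B=999, rng=rng)    # small p: reject equal distributions
lo, hi = boot_ci(x, y, B=999, rng=rng)   # percentile CI for the mean difference
print(p, lo, hi)
```

The permutation routine never changes the data values, only their group labels; the bootstrap routine changes the values (via resampling) and never touches a null hypothesis.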
Choosing a Test Statistic
Any function of the data works. The choice affects power, not size. Some canonical choices:
| Setting | Test statistic | Comments |
|---|---|---|
| Two-sample mean comparison | Difference of sample means $\bar{x} - \bar{y}$ | Sensitive to heavy tails |
| Two-sample mean comparison (studentized) | $t$-statistic with pooled or Welch standard error | More reliable size under unequal variances and heavy tails |
| Two-sample distributional comparison | Two-sample KS statistic | Sensitive to any difference in distribution |
| Two-sample rank comparison | Wilcoxon rank-sum | Distribution-free under the null, even without permutation |
| Independence in a contingency table | Pearson $\chi^2$ | Permutation version replaces the asymptotic $\chi^2$ reference with exact randomization |
| Independence of two variables | Sample correlation | Or distance correlation for nonlinear dependence |
| Two-sample kernel comparison | Maximum Mean Discrepancy (MMD) | Sensitive to high-dimensional differences |
Studentized statistics (the t-statistic instead of the raw mean difference) typically have more reliable size when the null does not imply full distributional equality but only equal means. The permutation test on the raw mean difference is exact only when distributions are fully equal under the null; the permutation test on the studentized statistic is asymptotically valid even when only the means agree.
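The size distortion described above can be seen in a small simulation: the null of equal means holds, but unequal variances and unequal sample sizes break exchangeability, so the raw-difference permutation test over-rejects while the studentized version stays close to level. Sample sizes, variances, and replication counts below are illustrative choices, not from the text:

```python
import math
import random

def welch_t(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

def raw_diff(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def perm_p(x, y, stat, B, rng):
    pooled, n1 = list(x) + list(y), len(x)
    t_obs, hits = abs(stat(x, y)), 0
    for _ in range(B):
        rng.shuffle(pooled)
        hits += abs(stat(pooled[:n1], pooled[n1:])) >= t_obs
    return (1 + hits) / (B + 1)

# Null is true (equal means) but the pooled sample is NOT exchangeable:
# a small high-variance group against a large low-variance group.
rng = random.Random(1)
reps, B, rej_raw, rej_t = 200, 99, 0, 0
for _ in range(reps):
    x = [rng.gauss(0, 5) for _ in range(5)]
    y = [rng.gauss(0, 1) for _ in range(50)]
    rej_raw += perm_p(x, y, raw_diff, B, rng) <= 0.05
    rej_t += perm_p(x, y, welch_t, B, rng) <= 0.05
print("raw-difference rejection rate:", rej_raw / reps)  # far above 0.05
print("studentized rejection rate:", rej_t / reps)       # much closer to 0.05
```

The raw difference is calibrated against a permutation distribution whose spread reflects the pooled variance, which badly understates the true variance of the mean difference here; studentizing removes most of that mismatch.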
Sign-Flip and Rotation Tests
Permutation tests apply more broadly than label swapping.
- Sign-flip permutation. For paired data with hypothesized zero treatment effect, randomly flipping the sign of each within-pair difference preserves the null distribution. The Wilcoxon signed-rank test is a sign-flip permutation test on the rank-transformed differences.
- Rotation tests. For data with rotational symmetry under the null (e.g., directional data, or some functional-data settings), random rotations of the residuals give the null reference distribution.
- Stratified permutation. When the data have an external structure (e.g., blocks, time, location) that should be preserved, permutation is restricted to within-stratum relabelings. This is the basis of randomization tests in randomized block designs.
The unifying theme is that the group of transformations under which the null distribution is invariant determines the right resampling scheme. Exchangeability gives the symmetric group of all permutations; sign symmetry gives the sign-flip group; rotational invariance gives the rotation group.
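The sign-flip scheme from the list above can be sketched by enumerating all $2^n$ sign patterns, here with the sum-of-differences statistic (an illustrative helper, not a library API):

```python
from itertools import product

def sign_flip_test(d):
    """Exact sign-flip permutation test for paired differences.

    Under the null (differences symmetric about 0), each |d_i| is equally
    likely to carry either sign, so all 2^n sign patterns are equiprobable.
    Statistic: |sum of signed differences|; two-sided p-value by enumeration.
    """
    t_obs = abs(sum(d))
    hits = total = 0
    for signs in product([1, -1], repeat=len(d)):
        hits += abs(sum(s * v for s, v in zip(signs, d))) >= t_obs
        total += 1
    return hits / total

# With d = [1, 1, 1], only the all-plus and all-minus patterns reach
# |sum| = 3, so the p-value is 2/8 = 0.25.
print(sign_flip_test([1, 1, 1]))  # 0.25
```

Replacing each $d_i$ by its signed rank before summing would turn this into the Wilcoxon signed-rank test mentioned above.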
Common Confusions
Permutation does not equal bootstrap
Permutation and bootstrap are different procedures with different purposes. Permutation resamples labels under a null; bootstrap resamples observations to estimate variability. Using "bootstrap" loosely to mean "resampling-based" obscures the distinction; the size guarantees are different.
Permutation tests are not assumption-free
The exchangeability assumption under the null is real and can fail. Two-sample mean tests on data with unequal variances violate exchangeability under the equal-means null (because exchangeable data must have equal variances too). Use studentized statistics or test the broader null of equal distributions, not just equal means.
Choice of statistic affects power, not validity
The level of a permutation test does not depend on the test statistic. The power does. A poor choice of statistic gives a valid but powerless test; a good choice gives a valid and powerful test. This is the opposite of asymptotic tests, where the choice of statistic affects both the level (through the asymptotic distribution) and the power.
Discrete p-values from finite B
Monte Carlo permutation tests produce $p$-values on the discrete grid $\{1/(B+1), 2/(B+1), \dots, 1\}$. To detect a true $p$-value of $0.001$ reliably, $B$ should be at least a few thousand. For $\alpha = 0.05$, use $B \ge 999$ to ensure the test can resolve the rejection region.
Exercises
Problem
A two-sample test compares $n_1$ values $x_1, \dots, x_{n_1}$ against $n_2$ values $y_1, \dots, y_{n_2}$, with $n_1$ and $n_2$ small enough to enumerate. Use the difference of sample means as the test statistic, enumerate all $\binom{n_1+n_2}{n_1}$ splits, and compute the two-sided exact permutation $p$-value.
Problem
A paired study reports differences $d_1, \dots, d_n$ for $n$ pairs. Use a sign-flip permutation test with the sum-of-differences statistic $\sum_i d_i$ to test the null of symmetry about zero exactly.
Problem
Show by simulation argument that when the two-sample test uses the studentized t-statistic instead of the raw difference of means, the permutation test is asymptotically valid even under unequal variances, while the permutation test using the raw difference is not.
Problem
Suppose you want a permutation test for the null of zero correlation between two variables $X$ and $Y$ measured on $n$ paired observations $(x_i, y_i)$. Construct the test, identify the exact permutation scheme, and explain why pairing-permutation (shuffling one variable's order while keeping the other fixed) is the right scheme.
References
Canonical:
- Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 15 (permutation and randomization tests).
- Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.4 on randomization tests).
- Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses (2005), Chapter 1 and 2.
Foundational papers:
- Fisher, Design of Experiments (1935), introduced randomization tests in the context of agricultural experiments.
- Pitman, "Significance tests which may be applied to samples from any populations" (JRSS Supplement, 1937), the systematic treatment.
Studentized permutation and modern theory:
- Janssen and Pauls, "How do bootstrap and permutation tests work?" (Annals of Statistics, 2003), studentized permutation under unequal variances.
- Romano and Wolf, Multiple Hypothesis Testing (Cambridge, 2010), permutation tests in multiple-testing settings.
Last reviewed: May 11, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Bootstrap Methods (layer 2 · tier 1)
- Hypothesis Testing for ML (layer 2 · tier 2)
- Neyman-Pearson and Hypothesis Testing Theory (layer 2 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.