
Statistical Estimation

Permutation Tests

Exchangeability-based hypothesis testing: under the null of no group effect, the labels are exchangeable, so the distribution of any test statistic under random relabeling gives an exact null reference. Exact for small samples, approximated by Monte Carlo for large samples, robust under non-Normality and heavy tails.


Why This Matters

Permutation tests answer a question that classical asymptotic tests evade: how do you test a hypothesis when you do not want to assume a parametric model, the central limit theorem has not had time to converge, or the test statistic of interest has no closed-form null distribution? The answer is to build the null reference distribution from the data itself, by reshuffling the labels under a constraint (exchangeability) that is true under the null.

The result is a test that is exact in size (not asymptotic, not approximate; the Type I error rate is the nominal $\alpha$ at every $n$), distribution-free under exchangeability, and applicable to any test statistic the researcher chooses. The cost is computational: an exact permutation test enumerates $n!$ relabelings (or $\binom{n_X + n_Y}{n_X}$ in the two-sample case), which is feasible for small $n$ and infeasible for large $n$. The Monte Carlo permutation test samples a random subset of relabelings, trading exactness for tractability, and recovers an arbitrarily close approximation by drawing more samples.

Permutation tests are the basis of most modern nonparametric inference. They provide the standard alternative to the $t$-test under non-Normality and to Welch's test in the two-sample setting, and they underlie most modern kernel-based, distance-based, and feature-importance tests. They are sometimes confused with the bootstrap; the two are related but solve different problems. The distinction is in the constraint: permutation tests preserve the null distribution; the bootstrap preserves the empirical distribution.

Exchangeability

A finite collection of random variables $(X_1,\dots,X_n)$ is exchangeable if its joint distribution is invariant under permutation of indices:

$$(X_1,\dots,X_n)\stackrel{d}{=}(X_{\pi(1)},\dots,X_{\pi(n)})\qquad\text{for every permutation }\pi.$$

I.i.d. samples are exchangeable. Exchangeable random variables can be dependent (e.g., draws from a Polya urn), so the class is strictly larger than i.i.d. The use in permutation testing is one-directional: under the null hypothesis of no group effect, the labels we attach to the observations are exchangeable, even if the observations themselves are dependent in some structured way.

For a two-sample problem with observed values $(X_1,\dots,X_{n_X})$ and $(Y_1,\dots,Y_{n_Y})$, the null hypothesis $H_0$: "the $X$ and $Y$ samples come from the same distribution" implies the pooled sample is exchangeable. Permutation testing exploits this directly.

Permutation Test for Two Samples

Suppose we observe two samples with sizes $n_X$ and $n_Y$ and want to test $H_0: F_X = F_Y$, where $F_X$ and $F_Y$ are the underlying distributions. Let $T(\cdot,\cdot)$ be any two-sample test statistic (difference in sample means, Mann-Whitney rank-sum, KS statistic, kernel maximum mean discrepancy, etc.).

The permutation-test procedure is:

  1. Compute the observed value $T^* = T(\mathbf{X}, \mathbf{Y})$.
  2. Pool the $N = n_X + n_Y$ observations.
  3. For every way to split the pooled sample into groups of sizes $n_X$ and $n_Y$, compute $T$ for that split. There are $\binom{N}{n_X}$ splits.
  4. The permutation $p$-value for a two-sided test is the proportion of splits with $|T| \ge |T^*|$.

The randomness in the test is purely a property of the labels under the null. The data values are held fixed; the labels are reshuffled. This is the key difference from the bootstrap, which resamples observations with replacement and changes the data values.
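The enumeration procedure above can be sketched in a few lines of plain Python. This is an illustrative implementation, not a library API; the function name `exact_perm_pvalue` and the default mean-difference statistic are my own choices.

```python
from itertools import combinations

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

def exact_perm_pvalue(x, y, stat=mean_diff):
    """Two-sided exact permutation p-value by full enumeration.

    Enumerates every way to assign len(x) of the pooled values to the
    first group and counts the splits with |T| >= |T*|.  The observed
    split is one of the enumerated splits, so the p-value is never 0.
    """
    pooled = list(x) + list(y)
    n, n_x = len(pooled), len(x)
    t_obs = abs(stat(x, y))
    exceed = total = 0
    for idx in combinations(range(n), n_x):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(stat(a, b)) >= t_obs:
            exceed += 1
    return exceed / total

p = exact_perm_pvalue([3, 5, 7, 8, 11], [2, 4, 6, 9])  # enumerates all 126 splits
```

Because every split is visited exactly once, the returned value is an exact multiple of $1/\binom{N}{n_X}$, matching the discrete permutation distribution.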

Theorem

Permutation Test Exactness

Statement

For any test statistic $T$ and any level $\alpha\in(0,1)$, the permutation test that rejects when the observed $T$ exceeds the $1-\alpha$ quantile of the permutation distribution has Type I error rate at most $\alpha$. If the test is randomized at the boundary (reject with the appropriate probability when $T$ equals the $1-\alpha$ quantile), the size is exactly $\alpha$.

Intuition

Under the null, the joint distribution of the labels and the data is invariant under relabeling. The permutation distribution of $T$ under random relabeling is therefore the conditional distribution of $T$ given the data values, with the labels uniform over relabelings. Rejecting when $T$ is in the upper-$\alpha$ tail of this conditional distribution has exact level $\alpha$ conditional on the data, hence unconditionally.

Proof Sketch

By exchangeability under the null, the data $(D_1,\dots,D_N) = (\text{labels},\text{values})$ is jointly distributed identically to its image under any relabeling. The conditional distribution of $T$ given the unordered multiset of values is therefore uniform over the labelings. A rejection rule based on the upper-$\alpha$ quantile of this uniform distribution has size $\alpha$ conditional on the values, and by the tower property, size $\alpha$ marginally. Tie-handling at the boundary uses a randomized rule to handle the discrete grid of permutation $p$-values.

Why It Matters

Exact size at every $n$, not asymptotic. The test is distribution-free under exchangeability: no parametric assumptions, no central limit theorem invocation, no degrees-of-freedom adjustment. The choice of test statistic $T$ is unconstrained; any function of the data and labels works. Powerful statistics give powerful tests, but the choice does not affect the test's level.

Failure Mode

Exchangeability is the load-bearing assumption. If the null hypothesis does not imply exchangeability, the test is not valid. Two-sample tests of "do the samples have the same mean" do not yield exchangeability when the distributions differ in variance but share a mean; for those, the permutation test on the difference of means has the wrong size. Workarounds include studentizing the test statistic (use the t-statistic instead of the raw difference) or testing distributional equality rather than just the mean.

Monte Carlo Approximation

Enumerating all $\binom{N}{n_X}$ splits is feasible only for small samples. For $n_X = n_Y = 20$, there are about $1.4\times 10^{11}$ splits, impractical. The Monte Carlo permutation test draws $B$ random permutations and uses the empirical proportion as the $p$-value:

$$\hat p_B = \frac{1 + \#\{b: |T_b|\ge |T^*|\}}{1 + B},$$

where the "+1" in numerator and denominator ensures the $p$-value is conservative (does not undershoot when $T^*$ has positive probability under the permutation distribution).
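The formula above translates directly into code. A minimal sketch, assuming a mean-difference statistic; the function name `mc_perm_pvalue` and the fixed seed are illustrative choices, not a standard API.

```python
import random

def mc_perm_pvalue(x, y, num_perm=9999, seed=0):
    """Monte Carlo two-sample permutation p-value with the +1 correction.

    Each shuffle of the pooled sample is one random relabeling.  Returns
    (1 + #{|T_b| >= |T*|}) / (1 + num_perm), which is conservative at
    every finite num_perm.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    t_obs = abs(sum(x) / len(x) - sum(y) / len(y))
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n_x], pooled[n_x:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + num_perm)

p = mc_perm_pvalue([3, 5, 7, 8, 11], [2, 4, 6, 9], num_perm=999)
```

With $B = 999$ permutations the returned $p$-value lies on the grid $\{1/1000, 2/1000, \dots, 1\}$.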

Theorem

Monte Carlo Permutation Validity

Statement

The Monte Carlo permutation test with $B$ random permutations and the $(B+1)$-corrected $p$-value satisfies $\mathbb{P}_{H_0}(\hat p_B \le \alpha)\le\alpha$ for every $\alpha\in(0,1)$. The size of the test is bounded above by $\alpha$, with equality in the limit $B\to\infty$.

Intuition

Under the null, the observed statistic $T^*$ and the $B$ permutation-resampled statistics $T_1,\dots,T_B$ are exchangeable: conditional on the data, all $B+1$ are draws from the same permutation distribution. The rank of $T^*$ among the $B+1$ values is therefore uniform on $\{1,2,\dots,B+1\}$ (absent ties), so the $p$-value $\hat p_B$ is uniform on the grid $\{1/(B+1), 2/(B+1),\dots,1\}$. Rejecting when $\hat p_B\le\alpha$ gives size at most $\alpha$.

Proof Sketch

Under $H_0$, the original-ordering statistic $T^*$ has the same conditional distribution as each $T_b$ given the data. The vector $(T^*, T_1,\dots,T_B)$ is exchangeable, so the rank of $T^*$ is uniformly distributed on $\{1,\dots,B+1\}$. The event $\{\hat p_B\le\alpha\}$ corresponds to the rank of $|T^*|$ being among the top $\lfloor\alpha(B+1)\rfloor$ values; this has probability $\lfloor\alpha(B+1)\rfloor/(B+1)\le\alpha$.
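The rank-uniformity argument can be checked by simulation: for continuous data under the null, $\mathbb{P}(\hat p_B \le \alpha)$ should equal $\lfloor\alpha(B+1)\rfloor/(B+1)$. A sketch under assumed settings (Normal null data, $B = 19$, sum-difference statistic; all names and constants are illustrative):

```python
import random

def null_mc_pvalue(rng, n=8, num_perm=19):
    """One simulated null dataset (both groups N(0,1)) and its
    +1-corrected Monte Carlo permutation p-value."""
    data = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    x = data[:n]
    t_obs = abs(sum(x) - (sum(data) - sum(x)))  # group-sum difference
    pooled = data[:]
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)
        if abs(sum(pooled[:n]) - sum(pooled[n:])) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + num_perm)

rng = random.Random(1)
pvals = [null_mc_pvalue(rng) for _ in range(2000)]
# With B = 19, the grid is {1/20, ..., 1}; at alpha = 0.25 = 5/20 the
# rejection probability should be exactly 5/20 up to simulation noise.
alpha = 0.25
frac = sum(q <= alpha for q in pvals) / len(pvals)
```

Across 2000 null replications, `frac` concentrates near 0.25, consistent with the uniform-rank claim.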

Why It Matters

The "+1" correction is what makes Monte Carlo permutation tests valid (level $\alpha$) at every $B$, not just asymptotically. Without the correction the $p$-value is anticonservative. Modern statistical software uses the "+1" formulation by default for permutation $p$-values.

Failure Mode

The validity argument assumes the $B$ permutations are drawn uniformly at random, either with replacement (the standard implementation) or without replacement from the set of all permutations; the exchangeability argument is the same in both cases. The Monte Carlo $p$-value lies on the discrete grid $\{1/(B+1), 2/(B+1),\dots, 1\}$; very small $p$-values require correspondingly large $B$. For $\alpha = 0.001$, take at least $B = 10{,}000$.

Permutation Tests Versus the Bootstrap

The two procedures are often confused. They are related but solve different problems.

| Aspect | Permutation test | Bootstrap |
| --- | --- | --- |
| Goal | Test a null hypothesis | Estimate a sampling distribution |
| Resampling unit | Labels (or sign flips, ranks) | Observations |
| Resampling method | Without replacement (relabeling) | With replacement |
| Preserved quantity | Null distribution | Empirical distribution |
| Output | $p$-value | Standard error or confidence interval |
| Exact at finite $n$? | Yes, under exchangeability | No (consistent only) |
| Assumptions | Exchangeability under null | Bootstrap consistency for the statistic |

Use the bootstrap when you want to estimate variability or build confidence intervals without a parametric model. Use a permutation test when you want to test a null hypothesis with exact size and minimal assumptions. The two are complementary, not competitors.

A common pattern: use the bootstrap for the confidence interval and the permutation test for the $p$-value. The two can yield slightly different conclusions (e.g., a bootstrap CI that excludes zero and a permutation $p$-value above 0.05) when the test statistic is studentized differently or the bootstrap targets a different parameter.
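The complementary roles can be seen side by side in code. A sketch, not a library API: `bootstrap_ci` (percentile bootstrap for the mean difference), `perm_pvalue`, and the example data are all illustrative.

```python
import random

def mean_diff(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def bootstrap_ci(x, y, num_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean difference: resample each
    group WITH replacement, preserving the empirical distributions."""
    rng = random.Random(seed)
    diffs = sorted(
        mean_diff([rng.choice(x) for _ in x], [rng.choice(y) for _ in y])
        for _ in range(num_boot)
    )
    lo = diffs[int(alpha / 2 * num_boot)]
    hi = diffs[int((1 - alpha / 2) * num_boot) - 1]
    return lo, hi

def perm_pvalue(x, y, num_perm=4999, seed=0):
    """Permutation p-value: reshuffle labels WITHOUT replacement within
    each draw, preserving the pooled (null) distribution."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    t_obs = abs(mean_diff(x, y))
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)
        if abs(mean_diff(pooled[:n_x], pooled[n_x:])) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + num_perm)

x = [5.1, 4.8, 6.0, 5.5, 5.9, 4.6]
y = [4.2, 4.9, 4.4, 5.0, 4.1, 4.3]
lo, hi = bootstrap_ci(x, y)   # variability estimate for the effect
p = perm_pvalue(x, y)         # test of the no-effect null
```

Note the structural difference: the bootstrap resamples values and changes the data; the permutation test holds the values fixed and reshuffles the labels.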

Choosing a Test Statistic

Any function of the data works. The choice affects power, not size. Some canonical choices:

| Setting | Test statistic | Comments |
| --- | --- | --- |
| Two-sample mean comparison | $\bar X - \bar Y$ | Sensitive to heavy tails |
| Two-sample mean comparison (studentized) | $t$-statistic with pooled $S_p$ | Studentized; better behavior under heavy tails |
| Two-sample distributional comparison | Two-sample KS statistic | Sensitive to any difference |
| Two-sample rank comparison | Wilcoxon rank-sum | Distribution-free under null, even without permutation |
| Independence in $r\times c$ table | Pearson chi-squared | Permutation version replaces asymptotic chi-squared with exact randomization |
| Independence of two variables | Sample correlation | Or distance correlation for nonlinear dependence |
| Two-sample kernel comparison | Maximum mean discrepancy | Sensitive to high-dimensional differences |

Studentized statistics (the t-statistic instead of the raw mean difference) typically have more reliable size when the null does not imply full distributional equality but only equal means. The permutation test on the raw mean difference is exact only when distributions are fully equal under the null; the permutation test on the studentized statistic is asymptotically valid even when only the means agree.
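Swapping in a studentized statistic leaves the permutation machinery unchanged; only the statistic function differs. A sketch using the Welch $t$-statistic (the function names, data, and seed are illustrative assumptions):

```python
import math
import random

def welch_t(x, y):
    """Welch t-statistic: the mean difference studentized by its
    estimated standard error, robust to unequal variances."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def perm_pvalue(x, y, stat, num_perm=4999, seed=0):
    """Two-sided Monte Carlo permutation p-value for any statistic."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)
    t_obs = abs(stat(x, y))
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)
        if abs(stat(pooled[:n_x], pooled[n_x:])) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + num_perm)

# Same machinery, different statistic; only the size/power behavior changes.
raw = lambda a, b: sum(a) / len(a) - sum(b) / len(b)
x = [1.0, 2.5, 0.3, 1.8, 2.2]                           # small variance
y = [9.0, -6.0, 4.0, 12.0, -7.0, 5.5, -3.0, 8.0]        # large variance
p_raw = perm_pvalue(x, y, raw)
p_stud = perm_pvalue(x, y, welch_t)
```

Under an equal-means null with unequal variances and unequal sample sizes, `p_raw` can be badly sized while `p_stud` is asymptotically valid, as the text describes.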

Sign-Flip and Rotation Tests

Permutation tests apply more broadly than label swapping.

  • Sign-flip permutation. For paired data with hypothesized zero treatment effect, randomly flipping the sign of each within-pair difference $D_i$ preserves the null distribution. The Wilcoxon signed-rank test is a sign-flip permutation test on the rank-transformed differences.
  • Rotation tests. For data with rotational symmetry under the null (e.g., directional data, or some functional-data settings), random rotations of the residuals give the null reference distribution.
  • Stratified permutation. When the data have an external structure (e.g., blocks, time, location) that should be preserved, permutation is restricted to within-stratum relabelings. This is the basis of randomization tests in randomized block designs.

The unifying theme is that the group of transformations under which the null distribution is invariant determines the right resampling scheme. Exchangeability gives the symmetric group of all permutations; sign symmetry gives the sign-flip group; rotational invariance gives the rotation group.
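The sign-flip scheme from the first bullet can be sketched by enumerating all $2^n$ sign assignments, which is exact for small $n$ (the function name and example data are illustrative):

```python
from itertools import product

def sign_flip_pvalue(d):
    """Exact sign-flip permutation test of H0: E[D] = 0 with the
    sum-of-differences statistic; enumerates all 2^n sign vectors."""
    t_obs = abs(sum(d))
    exceed = total = 0
    for signs in product((-1, 1), repeat=len(d)):
        total += 1
        if abs(sum(s * v for s, v in zip(signs, d))) >= t_obs:
            exceed += 1
    return exceed / total

p = sign_flip_pvalue([1.5, -0.6, 2.2, 0.8, -0.3, 1.1])  # 2^6 = 64 sign vectors
```

Here the invariance group is the sign-flip group $\{-1,+1\}^n$ rather than the symmetric group, but the logic is identical: compute the statistic over every element of the group and report the tail proportion.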

Common Confusions

Watch Out

Permutation does not equal bootstrap

Permutation and bootstrap are different procedures with different purposes. Permutation resamples labels under a null; bootstrap resamples observations to estimate variability. Using "bootstrap" loosely to mean "resampling-based" obscures the distinction; the size guarantees are different.

Watch Out

Permutation tests are not assumption-free

The exchangeability assumption under the null is real and can fail. Two-sample mean tests on data with unequal variances violate exchangeability under the equal-means null (because exchangeable data must have equal variances too). Use studentized statistics or test the broader null of equal distributions, not just equal means.

Watch Out

Choice of statistic affects power, not validity

The level of a permutation test does not depend on the test statistic. The power does. A poor choice of statistic gives a valid but powerless test; a good choice gives a valid and powerful test. This contrasts with asymptotic tests, where the choice of statistic affects both the finite-sample level (through the quality of the asymptotic approximation) and the power.

Watch Out

Discrete p-values from finite B

Monte Carlo permutation tests produce $p$-values on the discrete grid $\{1/(B+1), 2/(B+1),\dots, 1\}$. To detect a true $p$-value of $0.01$ reliably, $B$ should be at least a few thousand. For $\alpha = 0.001$, use $B \ge 10^4$ to ensure the test can resolve the rejection region.
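The resolution constraint is easy to make concrete: with the +1 correction the smallest attainable $p$-value is $1/(B+1)$, so rejecting at level $\alpha$ requires $1/(B+1) \le \alpha$. A minimal helper (the function name is mine):

```python
import math

def min_permutations(alpha):
    """Smallest B for which the +1-corrected p-value can reach alpha,
    i.e. the smallest B with 1/(B+1) <= alpha."""
    return math.ceil(1 / alpha) - 1

print(min_permutations(0.05))   # 19
print(min_permutations(0.001))  # 999
```

These are hard lower bounds for the test to be able to reject at all; reliable power near the boundary requires $B$ well above them, as the guidance above says.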

Exercises

ExerciseCore

Problem

A two-sample test compares $n_X = 5$ values $(3, 5, 7, 8, 11)$ against $n_Y = 4$ values $(2, 4, 6, 9)$. Use the difference of sample means as the test statistic, enumerate all $\binom{9}{5} = 126$ permutations, and compute the two-sided exact permutation $p$-value.

ExerciseCore

Problem

A paired study reports differences $D = (3, -1, 5, 2, -2, 4, 6)$ for $n = 7$ pairs. Use a sign-flip permutation test with the sum-of-differences statistic to test $H_0:\mathbb{E}[D] = 0$ exactly.

ExerciseAdvanced

Problem

Show by simulation argument that when the two-sample test uses the studentized t-statistic instead of the raw difference of means, the permutation test is asymptotically valid even under unequal variances, while the permutation test using the raw difference is not.

ExerciseAdvanced

Problem

Suppose you want a permutation test for the null of zero correlation between two variables $X$ and $Y$ measured on $n$ paired observations. Construct the test, identify the exact permutation scheme, and explain why pairing-permutation (shuffling one variable's order while keeping the other fixed) is the right scheme.

References

Canonical:

  • Lehmann and Romano, Testing Statistical Hypotheses (2005), Chapter 15 (permutation and randomization tests).
  • Casella and Berger, Statistical Inference (2002), Chapter 8 (Section 8.4 on randomization tests).
  • Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses (2005), Chapters 1 and 2.

Foundational papers:

  • Fisher, Design of Experiments (1935), introduced randomization tests in the context of agricultural experiments.
  • Pitman, "Significance tests which may be applied to samples from any populations" (JRSS Supplement, 1937), the systematic treatment.

Studentized permutation and modern theory:

  • Janssen and Pauls, "How do bootstrap and permutation tests work?" (Annals of Statistics, 2003), studentized permutation under unequal variances.
  • Romano and Wolf, Multiple Hypothesis Testing (Cambridge, 2010), permutation tests in multiple-testing settings.

Last reviewed: May 11, 2026
