Statistical Foundations
Kernel Two-Sample Tests
Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
Why This Matters
The two-sample problem is: given samples from two distributions $P$ and $Q$, test whether $P = Q$. Classical hypothesis tests (t-test, KS test) work in low dimensions but fail in high dimensions. The Maximum Mean Discrepancy (MMD) is a kernel-based test statistic that is nonparametric, works in arbitrary dimensions, and has strong theoretical guarantees.
MMD is used in practice for GAN evaluation, domain adaptation detection, dataset comparison, and anywhere you need to test distributional equality without parametric assumptions.
Mental Model
Embed both distributions into a reproducing kernel Hilbert space (RKHS). In this space, each distribution becomes a single point (its mean embedding). The MMD is the distance between these two points. If the kernel is characteristic (e.g., Gaussian), the embedding is injective: distinct distributions map to distinct points, so MMD is zero if and only if $P = Q$.
Formal Setup
Let $P$ and $Q$ be probability distributions on a measurable space $\mathcal{X}$, and let $\mathcal{H}$ be an RKHS with kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
Mean Embedding
The mean embedding (or kernel mean map) of a distribution $P$ into an RKHS $\mathcal{H}$ with kernel $k$ is:

$$\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)] \in \mathcal{H}$$

By the reproducing property, for any $f \in \mathcal{H}$: $\langle f, \mu_P \rangle_{\mathcal{H}} = \mathbb{E}_{x \sim P}[f(x)]$.
Maximum Mean Discrepancy (MMD)
The maximum mean discrepancy between $P$ and $Q$ with respect to an RKHS $\mathcal{H}$ is:

$$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$

Equivalently, $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$.
Main Theorems
MMD as RKHS Distance Between Mean Embeddings
Statement
The squared MMD can be expressed as:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

If $k$ is a characteristic kernel (e.g., Gaussian RBF, Laplacian), then $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$.
Intuition
The three terms compare within-$P$ similarity, cross similarity between $P$ and $Q$, and within-$Q$ similarity. If $P = Q$, all three expectations are equal and the expression is zero. A characteristic kernel ensures that different distributions always produce different mean embeddings, so MMD is a true metric on distributions.
Proof Sketch
Expand $\|\mu_P - \mu_Q\|_{\mathcal{H}}^2 = \langle \mu_P, \mu_P \rangle - 2\langle \mu_P, \mu_Q \rangle + \langle \mu_Q, \mu_Q \rangle$ using bilinearity. Each inner product reduces to a kernel expectation, e.g., $\langle \mu_P, \mu_Q \rangle = \mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)]$, by the reproducing property applied twice. The characterization that $\mathrm{MMD}(P, Q) = 0 \iff P = Q$ for characteristic kernels follows from the injectivity of the mean embedding map.
Why It Matters
This theorem gives a computable expression for the distance between distributions that requires only kernel evaluations on samples. No density estimation is needed. This makes MMD practical in high dimensions where density estimation is intractable.
Failure Mode
If the kernel is not characteristic, $\mathrm{MMD}(P, Q) = 0$ does not imply $P = Q$. For example, a linear kernel $k(x, y) = x^\top y$ only compares means: $\mathrm{MMD}^2(P, Q) = \|\mathbb{E}_P[x] - \mathbb{E}_Q[y]\|^2$. It cannot detect differences in variance or higher moments.
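As a quick numerical illustration of this blind spot, the sketch below (NumPy; the zero-mean Gaussians with different variances and the unit RBF bandwidth are illustrative choices) compares biased MMD estimates under a linear and an RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same mean (zero), different variances: a linear kernel cannot tell these apart.
x = rng.normal(0.0, 1.0, size=(2000, 1))
y = rng.normal(0.0, 3.0, size=(2000, 1))

def mmd2_biased(x, y, kernel):
    """Biased (V-statistic) MMD^2: mean of each kernel block, diagonals included."""
    return kernel(x, x).mean() - 2 * kernel(x, y).mean() + kernel(y, y).mean()

def linear(a, b):
    return a @ b.T

def rbf(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

print(mmd2_biased(x, y, linear))  # near zero: the means match
print(mmd2_biased(x, y, rbf))     # clearly positive: the variances differ
```

With a linear kernel the biased estimate is exactly $\|\bar{x} - \bar{y}\|^2$, so it vanishes whenever the sample means agree, while the RBF estimate picks up the variance difference.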
Unbiased U-Statistic Estimator of MMD Squared
Statement
Given samples $\{x_1, \dots, x_m\}$ from $P$ and $\{y_1, \dots, y_n\}$ from $Q$, the following U-statistic is an unbiased estimator of $\mathrm{MMD}^2(P, Q)$:

$$\widehat{\mathrm{MMD}}^2_u = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$

Under $H_0: P = Q$ (with $m = n$), the scaled statistic $m\,\widehat{\mathrm{MMD}}^2_u$ converges in distribution to a weighted sum of chi-squared random variables. Under $H_1: P \neq Q$, the estimator converges to $\mathrm{MMD}^2(P, Q) > 0$.
Intuition
The estimator replaces population expectations with sample averages. Summing over $i \neq j$ (excluding diagonal terms) removes the bias that would come from $k(x_i, x_i)$ and $k(y_i, y_i)$ terms, which inflate the within-distribution similarity.
Proof Sketch
The unbiasedness follows from the U-statistic property: each term is a sample average of a symmetric kernel function, and the expectation of a U-statistic equals the corresponding population quantity. The asymptotic distribution under $H_0$ follows from the theory of degenerate U-statistics (the kernel of the U-statistic has expectation zero under $H_0$, making it a degenerate second-order U-statistic).
Why It Matters
A biased estimator could be positive even when $P = Q$, inflating the false positive rate. The unbiased estimator centers correctly at zero under the null, making it suitable for hypothesis testing.
Failure Mode
The U-statistic estimator has $O((m+n)^2)$ computation cost because it requires all pairwise kernel evaluations. For large datasets, this is expensive. Linear-time estimators exist (using random pairings) but have higher variance.
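A minimal NumPy sketch of the unbiased U-statistic estimator (Gaussian RBF kernel; the bandwidth $\sigma = 1$ and the Gaussian test data are illustrative choices):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased U-statistic estimator of MMD^2 (diagonal terms excluded)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Exclude i == j in the within-sample sums by subtracting the diagonal.
    sum_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    sum_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2 * kxy.mean()

rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=(500, 2))
y = rng.normal(0, 1, size=(500, 2))  # same distribution: estimate near zero
z = rng.normal(1, 1, size=(500, 2))  # shifted mean: estimate clearly positive
print(mmd2_unbiased(x, y))  # may be slightly negative -- unbiased, not nonnegative
print(mmd2_unbiased(x, z))
```

Note that, unlike the population quantity, the unbiased estimate can be slightly negative under the null; this is expected and harmless for testing.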
Permutation Test for P-Values
The asymptotic null distribution (a weighted sum of chi-squared variables) is hard to compute in practice. The standard approach is a permutation test:
1. Compute $\widehat{\mathrm{MMD}}^2_u$ on the original data.
2. Pool the samples $\{x_1, \dots, x_m, y_1, \dots, y_n\}$.
3. Randomly split the pool into two groups of sizes $m$ and $n$.
4. Compute $\widehat{\mathrm{MMD}}^2_u$ on the permuted split.
5. Repeat steps 3-4 $B$ times (a few hundred to a few thousand is typical).
6. The p-value is the fraction of permuted statistics that exceed the observed statistic.
This is exact under $H_0$ (the permutation distribution is the correct null distribution when $P = Q$) and requires no distributional assumptions.
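The steps above can be sketched as follows (NumPy; the mean shift of 0.8, bandwidth $\sigma = 1$, and 200 permutations are illustrative choices):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_u(x, y, sigma=1.0):
    """Unbiased U-statistic MMD^2 estimator."""
    m, n = len(x), len(y)
    kxx, kyy, kxy = (rbf_kernel(x, x, sigma), rbf_kernel(y, y, sigma),
                     rbf_kernel(x, y, sigma))
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2 * kxy.mean())

def permutation_test(x, y, n_perms=200, sigma=1.0, seed=0):
    """Permutation p-value for H0: P = Q using the unbiased MMD^2 statistic."""
    rng = np.random.default_rng(seed)
    observed = mmd2_u(x, y, sigma)
    pooled = np.concatenate([x, y])
    m = len(x)
    count = 0
    for _ in range(n_perms):
        perm = rng.permutation(len(pooled))
        stat = mmd2_u(pooled[perm[:m]], pooled[perm[m:]], sigma)
        count += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perms + 1)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1, size=(100, 2))
y = rng.normal(0.8, 1, size=(100, 2))
print(permutation_test(x, y))  # small p-value: the distributions differ
```

Recomputing kernel matrices for every permutation is wasteful; a production implementation would compute the pooled $(m+n) \times (m+n)$ kernel matrix once and index into it for each split.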
MMD Test Has Power Against All Alternatives
Statement
The MMD permutation test with a characteristic kernel is consistent: as $m, n \to \infty$, the test rejects with probability approaching 1 whenever $P \neq Q$. The test power depends on the ratio of $\mathrm{MMD}^2(P, Q)$ to the standard deviation of $h$, where $h$ is the kernel of the U-statistic. For a Gaussian RBF kernel with bandwidth $\sigma$, the power depends on the choice of $\sigma$: too small and the test only detects local differences; too large and the test only detects mean differences.
Intuition
Consistency means the test eventually detects any departure from $H_0$ given enough data. This is stronger than parametric tests, which can only detect departures in specific directions (e.g., mean shift for a t-test). The bandwidth $\sigma$ controls the "resolution" of the test.
Why It Matters
A test that can detect any difference between distributions is valuable precisely because you typically do not know in advance how $P$ and $Q$ differ. The trade-off is that MMD may require more samples than a specialized test that targets a known difference type.
Failure Mode
In practice, bandwidth selection strongly affects power. The median heuristic ($\sigma$ = median pairwise distance over the pooled sample) is common but not optimal. Gretton et al. (2012) propose maximizing an estimate of test power over $\sigma$.
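The median heuristic itself is straightforward to compute; a minimal sketch (pooling both samples and taking the median over off-diagonal pairwise distances):

```python
import numpy as np

def median_heuristic_bandwidth(x, y):
    """Median of pairwise Euclidean distances over the pooled sample.
    A common default for the RBF bandwidth sigma; simple but not power-optimal."""
    pooled = np.concatenate([x, y])
    d2 = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(-1)
    # Keep only strictly upper-triangular entries (diagonal distances are zero).
    iu = np.triu_indices(len(pooled), k=1)
    return np.sqrt(np.median(d2[iu]))

rng = np.random.default_rng(3)
x = rng.normal(0, 1, size=(200, 5))
y = rng.normal(0, 1, size=(200, 5))
sigma = median_heuristic_bandwidth(x, y)
print(sigma)
```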
Connection to GAN Evaluation
The Fréchet Inception Distance (FID), widely used for GAN evaluation, is related to MMD. FID assumes the Inception feature distributions are Gaussian and computes a parametric distance between the fitted Gaussians. MMD with a Gaussian kernel on Inception features provides a nonparametric alternative that does not assume Gaussianity.
The kernel Inception distance (KID) is exactly the unbiased MMD estimator applied to Inception features with a polynomial kernel. KID has the advantage of being unbiased (FID is biased for finite samples) and having simpler statistical properties.
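A sketch of KID under this definition, using the cubic polynomial kernel $k(x, y) = (x^\top y / d + 1)^3$ of Binkowski et al.; random Gaussian vectors stand in for Inception features here, purely for illustration:

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel used by KID: (x.y / d + 1)^3, d = feature dimension."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(feat_real, feat_fake):
    """KID = unbiased MMD^2 U-statistic with the polynomial kernel.
    In practice feat_* are Inception features; any feature matrices work here."""
    m, n = len(feat_real), len(feat_fake)
    kxx = poly_kernel(feat_real, feat_real)
    kyy = poly_kernel(feat_fake, feat_fake)
    kxy = poly_kernel(feat_real, feat_fake)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2 * kxy.mean())

rng = np.random.default_rng(4)
real = rng.normal(0.0, 1, size=(500, 64))
fake_good = rng.normal(0.0, 1, size=(500, 64))  # matches the "real" distribution
fake_bad = rng.normal(0.5, 1, size=(500, 64))   # systematically shifted features
print(kid(real, fake_good))  # near zero
print(kid(real, fake_bad))   # clearly positive
```

Because the estimator is unbiased, KID computed on disjoint subsets can be averaged, and its value does not drift systematically with sample size the way FID does.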
Common Confusions
MMD is not a distance between samples, but between distributions
The MMD is a population quantity defined in terms of $P$ and $Q$. The estimator $\widehat{\mathrm{MMD}}^2_u$ is a statistic computed from samples. A large estimated MMD could reflect a true distributional difference or just sampling noise. The permutation test distinguishes between these cases.
The kernel choice matters for what differences MMD detects
A linear kernel only detects mean differences. A Gaussian kernel with small bandwidth detects local density differences but may miss global distributional shifts. There is no universally best kernel; the choice should reflect what types of distributional differences you care about.
MMD is not the same as KL divergence
KL divergence requires density estimation and is asymmetric. MMD requires only kernel evaluations and is symmetric. KL divergence can be infinite when the supports do not overlap; MMD is always finite for bounded kernels. They measure different things.
Summary
- MMD = distance between the mean embeddings of $P$ and $Q$ in an RKHS: $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$
- Squared MMD involves three terms: within-$P$, cross, and within-$Q$ kernel expectations
- Characteristic kernel (Gaussian, Laplacian): $\mathrm{MMD}(P, Q) = 0 \iff P = Q$
- Unbiased U-statistic estimator has $O((m+n)^2)$ cost
- Permutation test gives exact p-values without distributional assumptions
- MMD is consistent: detects any departure from $P = Q$ with enough data
- KID (kernel Inception distance) = MMD applied to Inception features
Exercises
Problem
You have 200 samples from $P$ and 200 samples from $Q$. You run a permutation test with $B$ permutations and observe that 12 of the permuted MMD values exceed your observed MMD. What is the p-value? Do you reject at significance level $\alpha$?
Problem
Show that the MMD with a linear kernel $k(x, y) = x^\top y$ reduces to the distance between means. Specifically, show that $\mathrm{MMD}^2(P, Q) = \|\mathbb{E}_P[x] - \mathbb{E}_Q[y]\|^2$, so that the biased estimator equals $\|\bar{x} - \bar{y}\|^2$ for sample means $\bar{x}, \bar{y}$. Why does this make the linear kernel unsuitable as a general two-sample test?
References
Canonical:
- Gretton et al., "A Kernel Two-Sample Test" (JMLR, 2012)
- Gretton et al., "A Kernel Method for the Two-Sample Problem" (NeurIPS, 2006)
Current:
- Sutherland et al., "Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy" (2017)
- Binkowski et al., "Demystifying MMD GANs" (2018); introduces KID
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
The natural next steps from kernel two-sample tests:
- Gaussian processes for ML: kernel methods applied to Bayesian prediction
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)