Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Statistical Foundations

Kernel Two-Sample Tests

Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.

Advanced · Tier 2 · Stable · ~50 min

Prerequisites

Why This Matters

The two-sample problem is: given samples from two distributions $P$ and $Q$, test whether $P = Q$. Classical hypothesis tests (t-test, KS test) work well in low dimensions but lose power or do not apply in high dimensions. The Maximum Mean Discrepancy (MMD) is a kernel-based test statistic that is nonparametric, works in arbitrary dimensions, and has strong theoretical guarantees.

MMD is used in practice for GAN evaluation, domain adaptation detection, dataset comparison, and anywhere you need to test distributional equality without parametric assumptions.

Mental Model

Embed both distributions into a reproducing kernel Hilbert space (RKHS). In this space, each distribution becomes a single point (its mean embedding). The MMD is the distance between these two points. If the kernel is characteristic (e.g., Gaussian), the embedding is injective: distinct distributions map to distinct points, so MMD is zero if and only if $P = Q$.

Formal Setup

Let $P$ and $Q$ be probability distributions on a measurable space $\mathcal{X}$, and let $\mathcal{H}$ be an RKHS with kernel $k$.

Definition

Mean Embedding

The mean embedding (or kernel mean map) of a distribution $P$ into an RKHS $\mathcal{H}$ is:

$$\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)] \in \mathcal{H}$$

By the reproducing property, for any $f \in \mathcal{H}$: $\langle f, \mu_P \rangle_{\mathcal{H}} = \mathbb{E}_{x \sim P}[f(x)]$.

Definition

Maximum Mean Discrepancy (MMD)

The maximum mean discrepancy between $P$ and $Q$ with respect to an RKHS $\mathcal{H}$ is:

$$\text{MMD}(\mathcal{H}, P, Q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left|\mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]\right|$$

Equivalently, $\text{MMD}(\mathcal{H}, P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$.

Main Theorems

Theorem

MMD as RKHS Distance Between Mean Embeddings

Statement

The squared MMD can be expressed as:

$$\text{MMD}^2(\mathcal{H}, P, Q) = \mathbb{E}_{x,x' \sim P}[k(x,x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x,y)] + \mathbb{E}_{y,y' \sim Q}[k(y,y')]$$

If $k$ is a characteristic kernel (e.g., Gaussian RBF, Laplacian), then $\text{MMD}(\mathcal{H}, P, Q) = 0$ if and only if $P = Q$.

Intuition

The three terms compare within-$P$ similarity, cross similarity between $P$ and $Q$, and within-$Q$ similarity. If $P = Q$, all three expectations are equal and the expression is zero. A characteristic kernel ensures that different distributions always produce different mean embeddings, so MMD is a true metric on distributions.

Proof Sketch

Expand $\|\mu_P - \mu_Q\|_{\mathcal{H}}^2 = \langle \mu_P - \mu_Q, \mu_P - \mu_Q \rangle$ using bilinearity. Each inner product satisfies $\langle \mu_P, \mu_Q \rangle = \mathbb{E}_{x \sim P,\, y \sim Q}[k(x,y)]$ by the reproducing property applied twice. The characterization that $\text{MMD} = 0 \iff P = Q$ for characteristic kernels follows from the injectivity of the mean embedding map.

Why It Matters

This theorem gives a computable expression for the distance between distributions that requires only kernel evaluations on samples. No density estimation is needed. This makes MMD practical in high dimensions where density estimation is intractable.
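Because the three-term expression involves only kernel expectations, it can be estimated by averaging kernel evaluations over samples. The following is a minimal NumPy sketch (the function names, the Gaussian kernel choice, and the bandwidth $\sigma = 1$ are illustrative assumptions, not part of the theorem); it uses the simple plug-in average, which includes the diagonal terms and is therefore slightly biased:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd2_plugin(X, Y, sigma=1.0):
    """Plug-in estimate of MMD^2: within-P mean - 2 * cross mean + within-Q mean."""
    return (rbf_kernel(X, X, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))  # samples from P
Y = rng.normal(0.0, 1.0, size=(500, 2))  # samples from Q = P
Z = rng.normal(1.0, 1.0, size=(500, 2))  # samples from a mean-shifted Q != P
print(mmd2_plugin(X, Y))  # small: only sampling noise plus the diagonal bias
print(mmd2_plugin(X, Z))  # clearly positive
```

Note that `mmd2_plugin(X, Y)` is positive even for identical distributions; the unbiased U-statistic version discussed below removes the diagonal terms responsible for that.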

Failure Mode

If the kernel is not characteristic, $\text{MMD} = 0$ does not imply $P = Q$. For example, a linear kernel $k(x,y) = x^\top y$ only compares means: $\text{MMD}^2 = \|\mathbb{E}[X] - \mathbb{E}[Y]\|^2$. It cannot detect differences in variance or higher moments.
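A quick numerical illustration of this failure mode (the distributions and sample sizes here are arbitrary choices for the demo): two Gaussians with equal means but very different variances are invisible to the linear-kernel MMD.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(20000, 3))  # P = N(0, I)
Y = rng.normal(0.0, 2.0, size=(20000, 3))  # Q = N(0, 4I): same mean, larger variance

def mmd2_linear_kernel(X, Y):
    # With k(x, y) = x^T y the three-term formula collapses to ||E[X] - E[Y]||^2,
    # estimated here by the squared distance between the sample means.
    d = X.mean(axis=0) - Y.mean(axis=0)
    return float(d @ d)

print(mmd2_linear_kernel(X, Y))  # approximately 0, despite P != Q
```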

Theorem

Unbiased U-Statistic Estimator of MMD Squared

Statement

Given $m$ samples from $P$ and $n$ samples from $Q$, the following U-statistic is an unbiased estimator of $\text{MMD}^2$:

$$\widehat{\text{MMD}}^2_u = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j} k(x_i, y_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)$$

Under $H_0: P = Q$, the statistic $m \cdot \widehat{\text{MMD}}^2_u$ converges in distribution to a weighted sum of chi-squared random variables. Under $H_1: P \neq Q$, the estimator converges to $\text{MMD}^2(P, Q) > 0$.
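The estimator fits in a few lines of NumPy (a sketch with an assumed Gaussian kernel and fixed bandwidth; subtracting the trace is what removes the diagonal $i = j$ terms):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """U-statistic estimator: the diagonal terms k(x_i, x_i), k(y_j, y_j) are excluded."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx - 2.0 * Kxy.mean() + term_yy

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(400, 2))
Y = rng.normal(0.0, 1.0, size=(400, 2))  # same distribution as X
Z = rng.normal(1.0, 1.0, size=(400, 2))  # mean-shifted
print(mmd2_unbiased(X, Y))  # centered at 0; can come out slightly negative
print(mmd2_unbiased(X, Z))  # converges to MMD^2 > 0
```

Unlike the plug-in estimate, this statistic has expectation exactly zero under $H_0$, at the cost of occasionally taking small negative values.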

Intuition

The estimator replaces population expectations with sample averages. Using $i \neq j$ (excluding diagonal terms) removes the bias that would come from $k(x_i, x_i)$ terms, which inflate the within-distribution similarity.

Proof Sketch

The unbiasedness follows from the U-statistic property: each term is a sample average of a symmetric kernel function, and the expectation of a U-statistic equals the corresponding population quantity. The asymptotic distribution under $H_0$ follows from the theory of degenerate U-statistics (the kernel of the U-statistic has expectation zero under $H_0$, making it a degenerate second-order U-statistic).

Why It Matters

A biased estimator could be positive even when $P = Q$, inflating the false positive rate. The unbiased estimator centers correctly at zero under the null, making it suitable for hypothesis testing.

Failure Mode

The U-statistic estimator has $O(m^2 + n^2)$ computation cost because it requires all pairwise kernel evaluations. For large datasets, this is expensive. Linear-time estimators exist (using random pairings) but have higher variance.
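One linear-time variant averages the U-statistic kernel $h$ over disjoint sample pairs rather than over all pairs; the sketch below (an assumed implementation, with pairs formed by splitting each i.i.d. sample in half) uses $O(m + n)$ kernel evaluations:

```python
import numpy as np

def mmd2_linear_time(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate: average h over disjoint pairs instead of
    all pairs. For i.i.d. data, splitting each sample in half yields valid pairs."""
    n = min(len(X), len(Y)) // 2
    x1, x2 = X[:n], X[n:2 * n]
    y1, y2 = Y[:n], Y[n:2 * n]
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma**2))
    # h combines one within-P, one within-Q, and two cross kernel evaluations per pair
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean()

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(5000, 2))
Y = rng.normal(0.0, 1.0, size=(5000, 2))  # same distribution
Z = rng.normal(1.0, 1.0, size=(5000, 2))  # mean-shifted
print(mmd2_linear_time(X, Y))  # near 0
print(mmd2_linear_time(X, Z))  # positive, but noisier than the quadratic-time estimate
```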

Permutation Test for P-Values

The asymptotic null distribution (weighted chi-squared) is hard to compute in practice. The standard approach is a permutation test:

  1. Compute $\widehat{\text{MMD}}^2_u$ on the original data.
  2. Pool the samples $\{x_1, \ldots, x_m, y_1, \ldots, y_n\}$.
  3. Randomly split the pool into two groups of sizes $m$ and $n$.
  4. Compute $\widehat{\text{MMD}}^2_u$ on the permuted split.
  5. Repeat steps 3-4 many times (e.g., $B = 1000$).
  6. The p-value is the fraction of permuted statistics that exceed the observed statistic.

This is exact under $H_0$ (the permutation distribution is the correct null distribution when $P = Q$) and requires no distributional assumptions.
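The steps above can be sketched as follows (the Gaussian kernel, fixed bandwidth, number of permutations, and sample sizes are all illustrative assumptions; the add-one correction is a common convention that keeps the p-value strictly positive):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma), rbf_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            - 2.0 * Kxy.mean()
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1)))

def mmd_permutation_test(X, Y, sigma=1.0, n_perms=500, seed=0):
    """Permutation p-value for H0: P = Q."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled = np.vstack([X, Y])
    m = len(X)
    exceed = 0
    for _ in range(n_perms):
        idx = rng.permutation(len(pooled))  # random re-split of the pooled sample
        exceed += mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], sigma) >= observed
    return (exceed + 1) / (n_perms + 1)  # add-one correction: p is never exactly 0

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.0, 1.0, size=(100, 2))
Z = rng.normal(1.0, 1.0, size=(100, 2))
print(mmd_permutation_test(X, Z))  # small p-value: reject H0
print(mmd_permutation_test(X, Y))  # larger p-value: fail to reject
```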

Proposition

MMD Test Has Power Against All Alternatives

Statement

The MMD permutation test with a characteristic kernel is consistent: as $m, n \to \infty$, the test rejects $H_0$ with probability approaching 1 whenever $P \neq Q$. The test power depends on $\text{MMD}^2(P,Q) / \sqrt{\text{Var}(\hat{h})}$, where $\hat{h}$ is the kernel of the U-statistic. For a Gaussian RBF kernel with bandwidth $\sigma$, the power depends on the choice of $\sigma$: too small and the test only detects local differences; too large and the test only detects mean differences.

Intuition

Consistency means the test eventually detects any departure from $P = Q$ given enough data. This is stronger than parametric tests, which can only detect departures in specific directions (e.g., mean shift for a t-test). The bandwidth $\sigma$ controls the "resolution" of the test.

Why It Matters

A test that can detect any difference between distributions is valuable precisely because you typically do not know in advance how $P$ and $Q$ differ. The trade-off is that MMD may require more samples than a specialized test that targets a known difference type.

Failure Mode

In practice, bandwidth selection strongly affects power. The median heuristic ($\sigma$ = median pairwise distance) is common but not optimal. Gretton et al. (2012) propose maximizing an estimate of test power over $\sigma$.
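The median heuristic is a one-liner over the pooled sample. This sketch uses one common convention ($\sigma$ = median Euclidean distance over distinct pairs); some references instead set $\sigma^2$ to the median squared distance, or to half of it, so check which convention a given library uses:

```python
import numpy as np

def median_heuristic(X, Y):
    """Bandwidth sigma = median pairwise Euclidean distance over the pooled sample."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * Z @ Z.T
    dists = np.sqrt(np.maximum(sq, 0.0))
    # keep only the strictly upper-triangular entries: distinct pairs, no diagonal
    return np.median(dists[np.triu_indices_from(dists, k=1)])

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(300, 10))
Y = rng.normal(0.5, 1.0, size=(300, 10))
print(median_heuristic(X, Y))  # roughly sqrt(2 * 10) for standard normal data
```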

Connection to GAN Evaluation

The Fréchet Inception Distance (FID), widely used for GAN evaluation, is related to MMD. FID assumes the Inception feature distributions are Gaussian and computes a parametric distance. MMD with a Gaussian kernel on Inception features provides a nonparametric alternative that does not assume Gaussianity.

The kernel Inception distance (KID) is exactly the unbiased MMD estimator applied to Inception features with a polynomial kernel. KID has the advantage of being unbiased (FID is biased for finite samples) and having simpler statistical properties.
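A KID sketch, using the degree-3 polynomial kernel $(x^\top y / d + 1)^3$ reported by Binkowski et al. (2018). Here random arrays stand in for Inception features, so the numbers are purely illustrative; in a real evaluation the inputs would be feature vectors extracted from real and generated images:

```python
import numpy as np

def poly_kernel(A, B):
    """Degree-3 polynomial kernel k(x, y) = (x^T y / d + 1)^3, where d = feature dim."""
    d = A.shape[1]
    return (A @ B.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 between real and generated feature sets."""
    m, n = len(real_feats), len(fake_feats)
    Kxx = poly_kernel(real_feats, real_feats)
    Kyy = poly_kernel(fake_feats, fake_feats)
    Kxy = poly_kernel(real_feats, fake_feats)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            - 2.0 * Kxy.mean()
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1)))

# Random arrays stand in for Inception features in this demo.
rng = np.random.default_rng(6)
real = rng.normal(0.0, 1.0, size=(500, 64))
good = rng.normal(0.0, 1.0, size=(500, 64))  # "generator" matching the real distribution
bad = rng.normal(0.5, 1.0, size=(500, 64))   # "generator" with shifted features
print(kid(real, good))  # near 0
print(kid(real, bad))   # clearly positive
```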

Common Confusions

Watch Out

MMD is not a distance between samples, but between distributions

The MMD is a population quantity defined in terms of $P$ and $Q$. The estimator $\widehat{\text{MMD}}^2_u$ is a statistic computed from samples. A large estimated MMD could reflect a true distributional difference or just sampling noise. The permutation test distinguishes between these cases.

Watch Out

The kernel choice matters for what differences MMD detects

A linear kernel only detects mean differences. A Gaussian kernel with small bandwidth detects local density differences but may miss global distributional shifts. There is no universally best kernel; the choice should reflect what types of distributional differences you care about.

Watch Out

MMD is not the same as KL divergence

KL divergence requires density estimation and is asymmetric. MMD requires only kernel evaluations and is symmetric. KL divergence can be infinite when the supports do not overlap; MMD is always finite for bounded kernels. They measure different things.

Summary

  • MMD = distance between mean embeddings of $P$ and $Q$ in RKHS
  • Squared MMD involves three terms: within-$P$, cross, and within-$Q$ kernel expectations
  • Characteristic kernel (Gaussian, Laplacian): $\text{MMD} = 0 \iff P = Q$
  • Unbiased U-statistic estimator has $O((m+n)^2)$ cost
  • Permutation test gives exact p-values without distributional assumptions
  • MMD is consistent: detects any departure from $P = Q$ with enough data
  • KID (kernel Inception distance) = MMD applied to Inception features

Exercises

ExerciseCore

Problem

You have 200 samples from $P$ and 200 samples from $Q$. You run a permutation test with $B = 999$ permutations and observe that 12 of the permuted MMD values exceed your observed MMD. What is the p-value? Do you reject $H_0: P = Q$ at significance level $\alpha = 0.05$?

ExerciseAdvanced

Problem

Show that the MMD with a linear kernel $k(x,y) = x^\top y$ reduces to the distance between means. Specifically, show that $\text{MMD}^2 = \|\mathbb{E}_P[X] - \mathbb{E}_Q[Y]\|^2$. Why does this make the linear kernel unsuitable as a general two-sample test?

References

Canonical:

  • Gretton et al., "A Kernel Two-Sample Test" (JMLR, 2012)
  • Gretton et al., "A Kernel Method for the Two-Sample Problem" (NeurIPS, 2006)

Current:

  • Sutherland et al., "Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy" (2017)

  • Binkowski et al., "Demystifying MMD GANs" (2018), which introduces KID

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Next Topics

The natural next steps from kernel two-sample tests:

  • Gaussian processes for ML: kernel methods applied to Bayesian prediction

Last reviewed: April 2026
