Statistical Foundations
Kernel Two-Sample Tests
Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
Why This Matters
The two-sample problem is: given samples from two distributions $P$ and $Q$, test whether $P = Q$. Classical hypothesis tests (t-test, KS test) work in low dimensions but fail in high dimensions. The Maximum Mean Discrepancy (MMD) is a kernel-based test statistic that is nonparametric, works in arbitrary dimensions, and has strong theoretical guarantees.
MMD is used in practice for GAN evaluation, domain adaptation detection, dataset comparison, and anywhere you need to test distributional equality without parametric assumptions.
Mental Model
Embed both distributions into a reproducing kernel Hilbert space (RKHS). In this space, each distribution becomes a single point (its mean embedding). The MMD is the distance between these two points. If the kernel is characteristic (e.g., Gaussian), the embedding is injective: distinct distributions map to distinct points, so MMD is zero if and only if $P = Q$.
Formal Setup
Let $P$ and $Q$ be probability distributions on a measurable space $\mathcal{X}$, and let $\mathcal{H}$ be an RKHS with kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
Mean Embedding
The mean embedding (or kernel mean map) of a distribution $P$ into an RKHS $\mathcal{H}$ with kernel $k$ is:

$$\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)] \in \mathcal{H}$$

By the reproducing property, for any $f \in \mathcal{H}$: $\langle f, \mu_P \rangle_{\mathcal{H}} = \mathbb{E}_{x \sim P}[f(x)]$.
Maximum Mean Discrepancy (MMD)
The maximum mean discrepancy between $P$ and $Q$ with respect to an RKHS $\mathcal{H}$ is:

$$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$

Equivalently, $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$.
Main Theorems
MMD as RKHS Distance Between Mean Embeddings
Statement
The squared MMD can be expressed as:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

If $k$ is a characteristic kernel (e.g., Gaussian RBF, Laplacian), then $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$.
Intuition
The three terms compare within-$P$ similarity, cross similarity between $P$ and $Q$, and within-$Q$ similarity. If $P = Q$, all three expectations are equal and the expression is zero. A characteristic kernel ensures that different distributions always produce different mean embeddings, so MMD is a true metric on distributions.
Proof Sketch
Expand $\|\mu_P - \mu_Q\|_{\mathcal{H}}^2 = \langle \mu_P, \mu_P \rangle - 2\langle \mu_P, \mu_Q \rangle + \langle \mu_Q, \mu_Q \rangle$ using bilinearity. Each inner product reduces to a kernel expectation, e.g., $\langle \mu_P, \mu_Q \rangle = \mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)]$, by the reproducing property applied twice. The characterization that $\mathrm{MMD}(P, Q) = 0 \iff P = Q$ for characteristic kernels follows from the injectivity of the mean embedding map.
Why It Matters
This theorem gives a computable expression for the distance between distributions that requires only kernel evaluations on samples. No density estimation is needed. This makes MMD practical in high dimensions where density estimation is intractable.
Failure Mode
If the kernel is not characteristic, $\mathrm{MMD}(P, Q) = 0$ does not imply $P = Q$. For example, a linear kernel $k(x, y) = x^\top y$ only compares means: $\mathrm{MMD}^2(P, Q) = \|\mathbb{E}_P[x] - \mathbb{E}_Q[y]\|^2$. It cannot detect differences in variance or higher moments.
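As a quick numerical illustration of this blind spot, the sketch below (NumPy; the zero-mean Gaussians with different variances and the unit RBF bandwidth are illustrative choices) compares biased MMD estimates under a linear and an RBF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same mean (zero), different variances: a linear kernel cannot tell these apart.
x = rng.normal(0.0, 1.0, size=(2000, 1))
y = rng.normal(0.0, 3.0, size=(2000, 1))

def mmd2_biased(x, y, kernel):
    """Biased (V-statistic) MMD^2: mean of each kernel block, diagonals included."""
    return kernel(x, x).mean() - 2 * kernel(x, y).mean() + kernel(y, y).mean()

def linear(a, b):
    return a @ b.T

def rbf(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

print(mmd2_biased(x, y, linear))  # near zero: the means match
print(mmd2_biased(x, y, rbf))     # clearly positive: the variances differ
```

With a linear kernel the biased estimate is exactly $\|\bar{x} - \bar{y}\|^2$, so it vanishes whenever the sample means agree, while the RBF estimate picks up the variance difference.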
Unbiased U-Statistic Estimator of MMD Squared
Statement
Given samples $\{x_1, \dots, x_m\}$ from $P$ and $\{y_1, \dots, y_n\}$ from $Q$, the following U-statistic is an unbiased estimator of $\mathrm{MMD}^2(P, Q)$:

$$\widehat{\mathrm{MMD}}^2_u = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$

Under $H_0: P = Q$ (with $m = n$), the scaled statistic $m\,\widehat{\mathrm{MMD}}^2_u$ converges in distribution to a weighted sum of chi-squared random variables. Under $H_1: P \neq Q$, the estimator converges to $\mathrm{MMD}^2(P, Q) > 0$.
Intuition
The estimator replaces population expectations with sample averages. Summing over $i \neq j$ (excluding diagonal terms) removes the bias that would come from $k(x_i, x_i)$ and $k(y_i, y_i)$ terms, which inflate the within-distribution similarity.
Proof Sketch
The unbiasedness follows from the U-statistic property: each term is a sample average of a symmetric kernel function, and the expectation of a U-statistic equals the corresponding population quantity. The asymptotic distribution under $H_0$ follows from the theory of degenerate U-statistics (the kernel of the U-statistic has expectation zero under $H_0$, making it a degenerate second-order U-statistic).
Why It Matters
A biased estimator could be positive even when $P = Q$, inflating the false positive rate. The unbiased estimator centers correctly at zero under the null, making it suitable for hypothesis testing.
Failure Mode
The U-statistic estimator has $O((m+n)^2)$ computation cost because it requires all pairwise kernel evaluations. For large datasets, this is expensive. Linear-time estimators exist (using random pairings) but have higher variance.
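A minimal NumPy sketch of the unbiased U-statistic estimator (Gaussian RBF kernel; the bandwidth $\sigma = 1$ and the Gaussian test data are illustrative choices):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased U-statistic estimator of MMD^2 (diagonal terms excluded)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Exclude i == j in the within-sample sums by subtracting the diagonal.
    sum_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    sum_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2 * kxy.mean()

rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=(500, 2))
y = rng.normal(0, 1, size=(500, 2))  # same distribution: estimate near zero
z = rng.normal(1, 1, size=(500, 2))  # shifted mean: estimate clearly positive
print(mmd2_unbiased(x, y))  # may be slightly negative -- unbiased, not nonnegative
print(mmd2_unbiased(x, z))
```

Note that, unlike the population quantity, the unbiased estimate can be slightly negative under the null; this is expected and harmless for testing.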
Permutation Test for P-Values
The asymptotic null distribution (a weighted sum of chi-squared variables) is hard to compute in practice. The standard approach is a permutation test:
1. Compute $\widehat{\mathrm{MMD}}^2_u$ on the original data.
2. Pool the samples $\{x_1, \dots, x_m, y_1, \dots, y_n\}$.
3. Randomly split the pool into two groups of sizes $m$ and $n$.
4. Compute $\widehat{\mathrm{MMD}}^2_u$ on the permuted split.
5. Repeat steps 3-4 $B$ times (a few hundred to a few thousand is typical).
6. The p-value is the fraction of permuted statistics that exceed the observed statistic.
This is exact under $H_0$ (the permutation distribution is the correct null distribution when $P = Q$) and requires no distributional assumptions.
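The steps above can be sketched as follows (NumPy; the mean shift of 0.8, bandwidth $\sigma = 1$, and 200 permutations are illustrative choices):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_u(x, y, sigma=1.0):
    """Unbiased U-statistic MMD^2 estimator."""
    m, n = len(x), len(y)
    kxx, kyy, kxy = (rbf_kernel(x, x, sigma), rbf_kernel(y, y, sigma),
                     rbf_kernel(x, y, sigma))
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2 * kxy.mean())

def permutation_test(x, y, n_perms=200, sigma=1.0, seed=0):
    """Permutation p-value for H0: P = Q using the unbiased MMD^2 statistic."""
    rng = np.random.default_rng(seed)
    observed = mmd2_u(x, y, sigma)
    pooled = np.concatenate([x, y])
    m = len(x)
    count = 0
    for _ in range(n_perms):
        perm = rng.permutation(len(pooled))
        stat = mmd2_u(pooled[perm[:m]], pooled[perm[m:]], sigma)
        count += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perms + 1)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1, size=(100, 2))
y = rng.normal(0.8, 1, size=(100, 2))
print(permutation_test(x, y))  # small p-value: the distributions differ
```

Recomputing kernel matrices for every permutation is wasteful; a production implementation would compute the pooled $(m+n) \times (m+n)$ kernel matrix once and index into it for each split.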
MMD Test Has Power Against All Alternatives
Statement
The MMD permutation test with a characteristic kernel is consistent: as $m, n \to \infty$, the test rejects with probability approaching 1 whenever $P \neq Q$. The test power depends on the ratio of $\mathrm{MMD}^2(P, Q)$ to the standard deviation of $h$, where $h$ is the kernel of the U-statistic. For a Gaussian RBF kernel with bandwidth $\sigma$, the power depends on the choice of $\sigma$: too small and the test only detects local differences; too large and the test only detects mean differences.
Intuition
Consistency means the test eventually detects any departure from $H_0$ given enough data. This is stronger than parametric tests, which can only detect departures in specific directions (e.g., mean shift for a t-test). The bandwidth $\sigma$ controls the "resolution" of the test.
Why It Matters
A test that can detect any difference between distributions is valuable precisely because you typically do not know in advance how $P$ and $Q$ differ. The trade-off is that MMD may require more samples than a specialized test that targets a known difference type.
Failure Mode
In practice, bandwidth selection strongly affects power. The median heuristic ($\sigma$ = median pairwise distance over the pooled sample) is common but not optimal. Gretton et al. (2012) propose maximizing an estimate of test power over $\sigma$.
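The median heuristic itself is straightforward to compute; a minimal sketch (pooling both samples and taking the median over off-diagonal pairwise distances):

```python
import numpy as np

def median_heuristic_bandwidth(x, y):
    """Median of pairwise Euclidean distances over the pooled sample.
    A common default for the RBF bandwidth sigma; simple but not power-optimal."""
    pooled = np.concatenate([x, y])
    d2 = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(-1)
    # Keep only strictly upper-triangular entries (diagonal distances are zero).
    iu = np.triu_indices(len(pooled), k=1)
    return np.sqrt(np.median(d2[iu]))

rng = np.random.default_rng(3)
x = rng.normal(0, 1, size=(200, 5))
y = rng.normal(0, 1, size=(200, 5))
sigma = median_heuristic_bandwidth(x, y)
print(sigma)
```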
Connection to GAN Evaluation
The Fréchet Inception Distance (FID), widely used for GAN evaluation, is related to MMD. FID assumes the Inception feature distributions are Gaussian and computes a parametric distance between the fitted Gaussians. MMD with a Gaussian kernel on Inception features provides a nonparametric alternative that does not assume Gaussianity.
The kernel Inception distance (KID) is exactly the unbiased MMD estimator applied to Inception features with a polynomial kernel. KID has the advantage of being unbiased (FID is biased for finite samples) and having simpler statistical properties.
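A sketch of KID under this definition, using the cubic polynomial kernel $k(x, y) = (x^\top y / d + 1)^3$ of Binkowski et al.; random Gaussian vectors stand in for Inception features here, purely for illustration:

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel used by KID: (x.y / d + 1)^3, d = feature dimension."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(feat_real, feat_fake):
    """KID = unbiased MMD^2 U-statistic with the polynomial kernel.
    In practice feat_* are Inception features; any feature matrices work here."""
    m, n = len(feat_real), len(feat_fake)
    kxx = poly_kernel(feat_real, feat_real)
    kyy = poly_kernel(feat_fake, feat_fake)
    kxy = poly_kernel(feat_real, feat_fake)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2 * kxy.mean())

rng = np.random.default_rng(4)
real = rng.normal(0.0, 1, size=(500, 64))
fake_good = rng.normal(0.0, 1, size=(500, 64))  # matches the "real" distribution
fake_bad = rng.normal(0.5, 1, size=(500, 64))   # systematically shifted features
print(kid(real, fake_good))  # near zero
print(kid(real, fake_bad))   # clearly positive
```

Because the estimator is unbiased, KID computed on disjoint subsets can be averaged, and its value does not drift systematically with sample size the way FID does.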
Common Confusions
MMD is not a distance between samples, but between distributions
The MMD is a population quantity defined in terms of $P$ and $Q$. The estimator $\widehat{\mathrm{MMD}}^2_u$ is a statistic computed from samples. A large estimated MMD could reflect a true distributional difference or just sampling noise. The permutation test distinguishes between these cases.
The kernel choice matters for what differences MMD detects
A linear kernel only detects mean differences. A Gaussian kernel with small bandwidth detects local density differences but may miss global distributional shifts. There is no universally best kernel; the choice should reflect what types of distributional differences you care about.
MMD is not the same as KL divergence
KL divergence requires density estimation and is asymmetric. MMD requires only kernel evaluations and is symmetric. KL divergence can be infinite when the supports do not overlap; MMD is always finite for bounded kernels. They measure different things.
Summary
- MMD = distance between the mean embeddings of $P$ and $Q$ in an RKHS: $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$
- Squared MMD involves three terms: within-$P$, cross, and within-$Q$ kernel expectations
- Characteristic kernel (Gaussian, Laplacian): $\mathrm{MMD}(P, Q) = 0 \iff P = Q$
- Unbiased U-statistic estimator has $O((m+n)^2)$ cost
- Permutation test gives exact p-values without distributional assumptions
- MMD is consistent: detects any departure from $P = Q$ with enough data
- KID (kernel Inception distance) = MMD applied to Inception features
Exercises
Problem
You have 200 samples from $P$ and 200 samples from $Q$. You run a permutation test with $B$ permutations and observe that 12 of the permuted MMD values exceed your observed MMD. What is the p-value? Do you reject at significance level $\alpha$?
Problem
Show that the MMD with a linear kernel $k(x, y) = x^\top y$ reduces to the distance between means. Specifically, show that $\mathrm{MMD}^2(P, Q) = \|\mathbb{E}_P[x] - \mathbb{E}_Q[y]\|^2$, so that the biased estimator equals $\|\bar{x} - \bar{y}\|^2$ for sample means $\bar{x}, \bar{y}$. Why does this make the linear kernel unsuitable as a general two-sample test?
References
Canonical:
- Gretton et al., "A Kernel Two-Sample Test" (JMLR, 2012)
- Gretton et al., "A Kernel Method for the Two-Sample Problem" (NeurIPS, 2006)
Current:
- Sutherland et al., "Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy" (2017)
- Binkowski et al., "Demystifying MMD GANs" (2018); introduces KID
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
The natural next steps from kernel two-sample tests:
- Gaussian processes for ML: kernel methods applied to Bayesian prediction
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)