
Statistical Estimation

Goodness-of-Fit Tests

KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.


Why This Matters

Before you apply any parametric method, you need to know whether your distributional assumptions hold. Maximum likelihood estimation assumes a parametric family. The central limit theorem assumes finite variance. Confidence intervals assume approximate normality of the estimator. All of these depend on distributional structure that can be tested.

Goodness-of-fit tests ask: does this data come from the distribution I think it does? The answer determines whether your downstream analysis is trustworthy or built on sand. Applications range from checking normality assumptions to testing whether financial data follows Benford's law for fraud detection.

The Probability Integral Transform

Theorem

Probability Integral Transform

Statement

If X is a continuous random variable with CDF F, then:

U = F(X) \sim \text{Uniform}(0, 1)

Conversely, if U \sim \text{Uniform}(0, 1), then F^{-1}(U) \sim F.

Intuition

The CDF maps every distribution to the uniform. If you push data through its own CDF, the result should look uniform. If it does not look uniform, the assumed CDF is wrong. This is the conceptual engine behind all CDF-based goodness-of-fit tests.

Proof Sketch

\Pr[F(X) \leq u] = \Pr[X \leq F^{-1}(u)] = F(F^{-1}(u)) = u for u \in [0, 1]. This is the CDF of Uniform(0, 1).

Why It Matters

The PIT is the theoretical foundation of calibration checks, quantile residuals, and CDF-based test statistics. When you check whether a model's predicted probabilities are calibrated, you are implicitly using the PIT: well-calibrated predictions, pushed through the data CDF, should be uniform.

Failure Mode

Requires continuity of F. For discrete distributions, F(X) is not uniform but has a more complex distribution. Modified versions exist (the randomized PIT) but are less clean.
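The theorem is easy to check numerically: pushing samples through their own CDF should pass a uniformity test, while pushing them through the wrong CDF should fail one. A minimal sketch using scipy (the seed, sample size, and distribution parameters are illustrative):

```python
# Probability integral transform in practice: N(2, 3^2) samples pushed
# through the correct CDF look uniform; through the wrong CDF they do not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=5000)

# Correct CDF: transformed values should be consistent with Uniform(0, 1).
u_right = stats.norm.cdf(x, loc=2.0, scale=3.0)
p_right = stats.kstest(u_right, "uniform").pvalue

# Wrong CDF (standard normal): the transform is visibly non-uniform.
u_wrong = stats.norm.cdf(x)
p_wrong = stats.kstest(u_wrong, "uniform").pvalue

print(f"correct CDF: p = {p_right:.3f}")  # typically well above 0.05
print(f"wrong CDF:   p = {p_wrong:.2g}")  # tiny: uniformity rejected
```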

The Kolmogorov-Smirnov Test

Definition

Empirical CDF

The empirical CDF of a sample x_1, \ldots, x_n is:

\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x_i \leq t]

This is a step function that jumps by 1/n at each data point. By the law of large numbers, \hat{F}_n(t) \to F(t) pointwise. By the Glivenko-Cantelli theorem, the convergence is uniform.
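The definition maps directly to code. The helper below is an illustrative sketch, not a library function; it uses the fact that a binary search into the sorted sample counts the points at or below t:

```python
# Minimal empirical CDF: F_hat_n(t) = #{x_i <= t} / n, a right-continuous
# step function that jumps by 1/n at each data point.
import numpy as np

def ecdf(sample):
    """Return a callable t -> F_hat_n(t) for the given sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    def f_hat(t):
        # side="right" counts sample points with x_i <= t
        return np.searchsorted(x, t, side="right") / n
    return f_hat

f = ecdf([3.0, 1.0, 2.0])
print(f(0.5), f(1.0), f(2.5), f(10.0))  # 0.0, then 1/3, 2/3, 1.0
```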

Theorem

Kolmogorov-Smirnov Test

Statement

The KS test statistic is:

D_n = \sup_t |\hat{F}_n(t) - F_0(t)|

Under H_0: F = F_0, the distribution of D_n is distribution-free (it does not depend on F_0). Asymptotically:

\sqrt{n}\, D_n \xrightarrow{d} K

where K is the Kolmogorov distribution with CDF \Pr[K \leq t] = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}.

Reject H_0 when D_n exceeds the critical value from the Kolmogorov distribution.

Intuition

The KS test measures the maximum vertical distance between the empirical CDF and the hypothesized CDF. If the data really came from F_0, this distance should be small (of order 1/\sqrt{n}). A large distance indicates the data does not follow F_0.

Proof Sketch

The distribution-free property follows from the probability integral transform: if F = F_0, then F_0(X_i) \sim \text{Uniform}(0, 1), so D_n has the same distribution as the KS statistic for testing uniformity, regardless of what F_0 is.

Why It Matters

The KS test is nonparametric and distribution-free under H_0. It requires no binning (unlike chi-squared tests) and works for any continuous distribution. It is the default first test when checking distributional assumptions.

Failure Mode

The KS test has maximum power in the center of the distribution and low power in the tails. It detects location and scale shifts well but is insensitive to differences concentrated in the tails. For tail-sensitive testing, Anderson-Darling is better. The KS test also requires F_0 to be fully specified: if you estimate parameters from the data and then test using those estimates (the Lilliefors problem), the standard KS critical values are too large and the test becomes badly conservative.
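In practice, D_n and the Kolmogorov p-value come from scipy.stats.kstest. A sketch under the assumption that the Exponential(1) null is fixed in advance, not fitted to the data (seed and alternative distribution are illustrative):

```python
# KS test against a fully specified null. Data from the null distribution
# should give a small D_n; data from a very different one should not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

x_null = rng.exponential(scale=1.0, size=200)   # truly Exponential(1)
res_null = stats.kstest(x_null, "expon")
print(f"null data: D_n = {res_null.statistic:.3f}, p = {res_null.pvalue:.3f}")

x_alt = rng.gamma(3.0, size=200)                # Gamma(3), far from Expon(1)
res_alt = stats.kstest(x_alt, "expon")
print(f"alt data:  D_n = {res_alt.statistic:.3f}, p = {res_alt.pvalue:.2g}")
```

With n = 200, the critical value at \alpha = 0.05 is roughly 1.36/\sqrt{200} \approx 0.096: the null sample's D_n typically falls below it, while the Gamma(3) sample's lands far above.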

Theorem

Glivenko-Cantelli Theorem

Statement

\sup_t |\hat{F}_n(t) - F(t)| \xrightarrow{a.s.} 0

The empirical CDF converges uniformly to the true CDF, almost surely. This is the uniform law of large numbers for CDFs.

Intuition

At each point t, \hat{F}_n(t) is a sample mean of Bernoulli variables, so it converges by the law of large numbers. Glivenko-Cantelli upgrades this from pointwise to uniform convergence, which is much stronger.

Why It Matters

This theorem justifies the empirical CDF as a consistent estimator of the true CDF. It is the CDF analog of uniform convergence in learning theory, and the reason why CDF-based tests like KS have power against all alternatives.

Failure Mode

The rate of convergence is O(1/\sqrt{n}) by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. For heavy-tailed distributions, convergence in the tails can be slow in practice even though the uniform rate is the same.
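The DKW inequality, \Pr[\sup_t |\hat{F}_n(t) - F(t)| > \epsilon] \leq 2 e^{-2 n \epsilon^2}, inverts to a finite-sample uniform confidence band of half-width \epsilon = \sqrt{\log(2/\alpha) / (2n)}. A small sketch of the 1/\sqrt{n} rate:

```python
# DKW band half-width: solving 2*exp(-2*n*eps^2) = alpha for eps gives a
# (1 - alpha) uniform confidence band around the empirical CDF.
import numpy as np

def dkw_epsilon(n, alpha=0.05):
    """Half-width of the (1 - alpha) DKW confidence band for n samples."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: half-width = {dkw_epsilon(n):.4f}")
# 100x more data shrinks the band by exactly 10x: the O(1/sqrt(n)) rate.
```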

Normality Tests

When the specific question is "is this data Gaussian?", specialized tests outperform the generic KS test.

Shapiro-Wilk test. Computes the ratio of two variance estimates: one based on the order statistics' expected values under normality, one the usual sample variance. The test statistic is:

W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

where x_{(i)} are the order statistics and a_i are coefficients derived from the expected values and covariance matrix of the standard normal order statistics.

Shapiro-Wilk has the highest power among normality tests for small to moderate sample sizes (n < 5000). It detects asymmetry, heavy tails, and light tails effectively. Its main limitation is computational: the exact critical values require specific tables, and the test was originally designed for n \leq 50 (though extensions exist for larger samples).

Anderson-Darling test. A modification of the KS test that puts more weight on the tails:

A^2 = -n - \frac{1}{n} \sum_{i=1}^n (2i - 1)\left[\log F_0(x_{(i)}) + \log(1 - F_0(x_{(n+1-i)}))\right]

The weighting function 1/[F_0(t)(1 - F_0(t))] amplifies discrepancies where F_0(t) is near 0 or 1, making Anderson-Darling more sensitive to tail departures than KS.
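Both tests are available in scipy. A sketch on Student-t(5) data, which is symmetric and bell-shaped in the center but heavy in the tails, so any detected departure comes from tail behavior (sample size and seed are illustrative); note that scipy.stats.anderson reports critical values rather than a p-value:

```python
# Normality tests on heavy-tailed data: Shapiro-Wilk gives a p-value,
# Anderson-Darling gives A^2 plus critical values at fixed levels.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = stats.t.rvs(df=5, size=500, random_state=rng)

w, p_sw = stats.shapiro(x)                  # Shapiro-Wilk
print(f"Shapiro-Wilk:     W = {w:.4f}, p = {p_sw:.3g}")

ad = stats.anderson(x, dist="norm")         # Anderson-Darling
print(f"Anderson-Darling: A^2 = {ad.statistic:.3f}")
print(f"critical values (15/10/5/2.5/1%): {ad.critical_values}")
# Reject normality at the 5% level when A^2 exceeds the third critical value.
```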

Power Comparison

| Test | Best for | Weakness |
| --- | --- | --- |
| Kolmogorov-Smirnov | General-purpose, any continuous CDF | Low tail sensitivity; requires fully specified F_0 |
| Shapiro-Wilk | Normality testing, small samples | Only tests normality, not general distributions |
| Anderson-Darling | Tail-sensitive testing | Requires computing \log F_0(x_{(i)}); slower |
| Chi-squared | Discrete distributions, large samples | Requires binning; low power for continuous data |
| Lilliefors | Normality/exponentiality with estimated parameters | Specific to certain families; limited critical value tables |

Common Confusions

Watch Out

The KS test requires a fully specified null distribution

If you estimate \mu and \sigma from the data and then test normality using KS with F_0 = \mathcal{N}(\hat{\mu}, \hat{\sigma}^2), the test is invalid. Fitting the parameters makes the data match F_0 better than a prespecified null would, shrinking the null distribution of D_n and making the standard critical values too conservative (you reject too infrequently). The Lilliefors correction provides the correct critical values for this setting. Alternatively, use Shapiro-Wilk, which is designed for composite hypotheses.
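A quick Monte Carlo sketch makes the problem visible: even when H_0 is true, p-values from the naive fit-then-test procedure are far from uniform (the trial count, sample size, and seed below are arbitrary):

```python
# The Lilliefors problem by simulation: fit mu-hat and sigma-hat from the
# data, reuse them as the KS null, and the naive p-values pile up near 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
pvals = []
for _ in range(500):
    x = rng.normal(size=50)                 # H0 is true: data is normal
    mu, sigma = x.mean(), x.std(ddof=1)     # estimated from the same data
    pvals.append(stats.kstest(x, "norm", args=(mu, sigma)).pvalue)
pvals = np.array(pvals)

# A valid test would reject ~5% of the time and have median p near 0.5.
print(f"rejection rate at alpha = 0.05: {np.mean(pvals < 0.05):.3f}")
print(f"median p-value:                 {np.median(pvals):.3f}")
```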

Watch Out

Failing to reject does not confirm the distribution

A goodness-of-fit test can only reject the null. Not rejecting means the data is consistent with F_0, not that F_0 is true. With small n, you lack power to detect even substantial departures. With large n, you will reject almost any parametric model because no real data exactly follows a theoretical distribution. The useful question is not "is my data exactly Gaussian?" but "is the Gaussian approximation good enough for my purpose?"

Watch Out

p-values from goodness-of-fit tests are not effect sizes

A small p-value from a KS test says the departure from F_0 is statistically significant, but it says nothing about how large or practically important the departure is. Report D_n (the maximum CDF discrepancy) alongside the p-value. A D_n of 0.02 with n = 50{,}000 is statistically significant but practically negligible for most applications.
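A contaminated-normal example makes the distinction concrete (the 5% contamination level, sample size, and seed are arbitrary):

```python
# Significant but negligible: at n = 50,000, a small contamination of the
# normal model is decisively rejected by KS even though D_n stays tiny.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50_000
heavy = rng.random(n) < 0.05                  # 5% of points get sd 3
x = np.where(heavy, rng.normal(0.0, 3.0, n), rng.normal(0.0, 1.0, n))

res = stats.kstest(x, "norm")
print(f"D_n = {res.statistic:.4f}, p = {res.pvalue:.2g}")
# p is tiny, but D_n is on the order of 0.01: for many downstream uses
# the N(0, 1) approximation remains perfectly serviceable.
```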

Exercises

ExerciseCore

Problem

You have 200 observations and want to test whether they come from an Exponential(1) distribution. You compute Dn=0.08D_n = 0.08. The KS critical value at α=0.05\alpha = 0.05 for n=200n = 200 is approximately 1.36/2000.0961.36/\sqrt{200} \approx 0.096. Do you reject?

ExerciseAdvanced

Problem

Explain why the Anderson-Darling test is more powerful than KS for detecting heavy-tailed alternatives to normality. What is the weighting function, and how does it change the test's sensitivity?

References

Canonical:

  • Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapter 14
  • D'Agostino & Stephens, Goodness-of-Fit Techniques (1986). The definitive reference for GOF testing.
  • Shapiro & Wilk, "An Analysis of Variance Test for Normality" (Biometrika, 1965)

Current:

  • Razali & Wah, "Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests" (Journal of Statistical Modeling and Analytics, 2011)
  • Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons" (JASA, 1974)
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

Next Topics

  • Neyman-Pearson theory: the theoretical framework underlying all hypothesis tests
  • Calibration: when you need to test whether a model's predicted probabilities match observed frequencies

Last reviewed: April 2026
