Statistical Estimation
Goodness-of-Fit Tests
KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with power comparisons and guidance on when each test fails.
Why This Matters
Before you apply any parametric method, you need to know whether your distributional assumptions hold. Maximum likelihood estimation assumes a parametric family. The central limit theorem assumes finite variance. Confidence intervals assume approximate normality of the estimator. All of these depend on distributional structure that can be tested.
Goodness-of-fit tests ask: does this data come from the distribution I think it does? The answer determines whether your downstream analysis is trustworthy or built on sand. Applications range from checking normality assumptions to testing whether financial data follows Benford's law for fraud detection.
The Probability Integral Transform
Probability Integral Transform
Statement
If $X$ is a continuous random variable with CDF $F$, then:
$$F(X) \sim \mathrm{Uniform}(0,1).$$
Conversely, if $U \sim \mathrm{Uniform}(0,1)$, then $F^{-1}(U)$ has CDF $F$.
Intuition
The CDF maps every distribution to the uniform. If you push data through its own CDF, the result should look uniform. If it does not look uniform, the assumed CDF is wrong. This is the conceptual engine behind all CDF-based goodness-of-fit tests.
Proof Sketch
$P(F(X) \le u) = P(X \le F^{-1}(u)) = F(F^{-1}(u)) = u$ for $u \in [0,1]$. This is the CDF of Uniform(0,1).
Why It Matters
The PIT is the theoretical foundation of calibration checks, quantile residuals, and CDF-based test statistics. When you check whether a model's predicted probabilities are calibrated, you are implicitly using the PIT: well-calibrated predictions, pushed through the data CDF, should be uniform.
Failure Mode
Requires continuity of $F$. For discrete distributions, $F(X)$ is not uniform but has a more complex distribution. Modified versions exist (the randomized PIT) but are less clean.
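The PIT is easy to see numerically. The following sketch (an illustration, not from the source; the distributions and sample size are arbitrary choices) pushes exponential data through its own CDF and through a wrong CDF, then checks uniformity with a KS test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)        # data from Exponential(scale=2)

u_correct = stats.expon.cdf(x, scale=2.0)        # push through the true CDF
u_wrong = stats.norm.cdf(x, loc=2.0, scale=2.0)  # push through a wrong CDF

# KS test for uniformity of the transformed values.
p_correct = stats.kstest(u_correct, "uniform").pvalue
p_wrong = stats.kstest(u_wrong, "uniform").pvalue
print(f"true CDF:  p = {p_correct:.3f}")  # typically large: looks uniform
print(f"wrong CDF: p = {p_wrong:.3g}")    # tiny: uniformity clearly rejected
```

The transformed values under the true CDF pass the uniformity check; under the misspecified CDF the test rejects decisively.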
The Kolmogorov-Smirnov Test
Empirical CDF
The empirical CDF of a sample $X_1, \dots, X_n$ is:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}.$$
This is a step function that jumps by $1/n$ at each data point. By the law of large numbers, $\hat{F}_n(x) \to F(x)$ pointwise. By the Glivenko-Cantelli theorem, the convergence is uniform.
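The empirical CDF takes only a few lines to implement (an illustrative sketch; the function name `ecdf` is ours):

```python
import numpy as np

def ecdf(sample):
    """Return a function x -> fraction of sample points <= x."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = len(data)
    def F_hat(x):
        # searchsorted with side="right" counts points <= x
        return np.searchsorted(data, x, side="right") / n
    return F_hat

F_hat = ecdf([3.0, 1.0, 2.0, 2.0])
print(F_hat(0.5), F_hat(2.0), F_hat(10.0))  # 0.0 0.75 1.0
```

Note the step-function behavior: the value jumps by $1/n$ at each distinct data point (by $2/n$ at the tied value 2.0 here).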
Kolmogorov-Smirnov Test
Statement
The KS test statistic is:
$$D_n = \sup_x \left| \hat{F}_n(x) - F_0(x) \right|.$$
Under $H_0: F = F_0$, the distribution of $D_n$ is distribution-free (does not depend on $F_0$). Asymptotically:
$$\sqrt{n}\, D_n \xrightarrow{d} K,$$
where $K$ is the Kolmogorov distribution with CDF $P(K \le x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}$.
Reject $H_0$ when $D_n$ exceeds the critical value from the Kolmogorov distribution.
Intuition
The KS test measures the maximum vertical distance between the empirical CDF and the hypothesized CDF. If the data really came from $F_0$, this distance should be small (of order $n^{-1/2}$). A large distance indicates the data does not follow $F_0$.
Proof Sketch
The distribution-free property follows from the probability integral transform: if $X_i \sim F_0$, then $F_0(X_i) \sim \mathrm{Uniform}(0,1)$, so $D_n$ has the same distribution as the KS statistic for testing uniformity, regardless of what $F_0$ is.
Why It Matters
The KS test is nonparametric and distribution-free under $H_0$. It requires no binning (unlike chi-squared tests) and works for any continuous distribution. It is the default first test when checking distributional assumptions.
Failure Mode
The KS test has maximum power in the center of the distribution and low power in the tails. It detects location and scale shifts well but is insensitive to differences concentrated in the tails. For tail-sensitive testing, Anderson-Darling is better. The KS test also requires $F_0$ to be fully specified: if you estimate parameters from the data and then test using those estimates (the Lilliefors problem), the standard KS critical values are too large, so the test is conservative and loses substantial power.
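In scipy, the test statistic and p-value come from `scipy.stats.kstest`, which accepts a callable CDF. A sketch of both the valid and the invalid usage (sample size and seed are arbitrary choices of ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=300)

# Valid: F0 is a fixed, fully specified N(0, 1), chosen before seeing the data.
res = stats.kstest(x, stats.norm(loc=0.0, scale=1.0).cdf)
print(f"D = {res.statistic:.4f}, p = {res.pvalue:.3f}")

# Invalid: plugging in parameters estimated from the same data (the
# Lilliefors problem). The statistic is fine, but the standard p-value
# no longer has the right null distribution.
res_bad = stats.kstest(x, stats.norm(loc=x.mean(), scale=x.std(ddof=1)).cdf)
print(f"D = {res_bad.statistic:.4f}, p = {res_bad.pvalue:.3f}  # do not trust")
```

The second call runs without complaint; nothing in the software warns you that the reported p-value is miscalibrated.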
Glivenko-Cantelli Theorem
Statement
The empirical CDF converges uniformly to the true CDF, almost surely:
$$\sup_x \left| \hat{F}_n(x) - F(x) \right| \xrightarrow{a.s.} 0.$$
This is the uniform law of large numbers for CDFs.
Intuition
At each point $x$, $\hat{F}_n(x)$ is a sample mean of Bernoulli($F(x)$) variables, so it converges by the law of large numbers. Glivenko-Cantelli upgrades this from pointwise to uniform convergence, which is much stronger.
Why It Matters
This theorem justifies the empirical CDF as a consistent estimator of the true CDF. It is the CDF analog of uniform convergence in learning theory, and the reason why CDF-based tests like KS have power against all alternatives.
Failure Mode
The rate of convergence is $O(n^{-1/2})$: by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, $P\left(\sup_x |\hat{F}_n(x) - F(x)| > \epsilon\right) \le 2e^{-2n\epsilon^2}$. For heavy-tailed distributions, convergence in the tails can be slow in practice even though the uniform rate is the same.
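The DKW inequality can be inverted to give a uniform confidence band for the true CDF: with probability at least $1 - \alpha$, $\sup_x |\hat{F}_n(x) - F(x)| \le \sqrt{\ln(2/\alpha)/(2n)}$. A sketch of the band half-width (the helper name `dkw_epsilon` is ours):

```python
import numpy as np

def dkw_epsilon(n, alpha=0.05):
    """Half-width of the DKW uniform confidence band at level 1 - alpha."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

for n in [100, 1000, 10000]:
    print(n, round(dkw_epsilon(n), 4))
```

The half-width shrinks like $n^{-1/2}$: quadrupling the sample size halves the band.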
Normality Tests
When the specific question is "is this data Gaussian?", specialized tests outperform the generic KS test.
Shapiro-Wilk test. Computes the ratio of two variance estimates: one based on the order statistics' expected values under normality, one the usual sample variance. The test statistic is:
$$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$
where $x_{(1)} \le \cdots \le x_{(n)}$ are the order statistics and the $a_i$ are coefficients derived from the expected values and covariance matrix of the standard normal order statistics.
Shapiro-Wilk has the highest power among normality tests for small to moderate sample sizes. It detects asymmetry, heavy tails, and light tails effectively. Its main limitation is computational: the exact critical values require specific tables, and the test was originally designed for $n \le 50$ (though extensions, such as Royston's algorithm, handle larger samples).
Anderson-Darling test. A modification of the KS approach that puts more weight on the tails:
$$A^2 = n \int_{-\infty}^{\infty} \frac{\left( \hat{F}_n(x) - F_0(x) \right)^2}{F_0(x)\left( 1 - F_0(x) \right)} \, dF_0(x).$$
The weighting function $\left[ F_0(x)(1 - F_0(x)) \right]^{-1}$ amplifies discrepancies where $F_0(x)$ is near 0 or 1, making Anderson-Darling more sensitive to tail departures than KS.
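Both tests are one call in scipy. A sketch on heavy-tailed data (the $t_4$ alternative, sample size, and seed are illustrative choices of ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_t(df=4, size=100)  # t(4): symmetric but heavy-tailed

w, p_sw = stats.shapiro(x)
print(f"Shapiro-Wilk: W = {w:.4f}, p = {p_sw:.4f}")

# scipy's Anderson-Darling returns the statistic plus a table of
# critical values at fixed significance levels (parameters are
# estimated internally, so the composite null is handled for you).
ad = stats.anderson(x, dist="norm")
print(f"Anderson-Darling: A^2 = {ad.statistic:.4f}")
for level, crit in zip(ad.significance_level, ad.critical_values):
    print(f"  alpha = {level / 100:.2f}: critical value {crit:.3f}")
```

Note the interface difference: `shapiro` returns a p-value, while `anderson` asks you to compare the statistic against tabulated critical values.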
Power Comparison
| Test | Best for | Weakness |
|---|---|---|
| Kolmogorov-Smirnov | General-purpose, any continuous CDF | Low tail sensitivity, requires fully specified $F_0$ |
| Shapiro-Wilk | Normality testing, small samples | Only tests normality, not general distributions |
| Anderson-Darling | Tail-sensitive testing | Requires computing $A^2$ and distribution-specific critical values, slower |
| Chi-squared | Discrete distributions, large samples | Requires binning, low power for continuous data |
| Lilliefors | Normality/exponentiality with estimated parameters | Specific to certain families, limited critical value tables |
Common Confusions
The KS test requires a fully specified null distribution
If you estimate $\hat{\mu}$ and $\hat{\sigma}$ from the data and then test normality using KS with $N(\hat{\mu}, \hat{\sigma}^2)$, the test is invalid. The parameter estimation changes the null distribution of $D_n$, making the standard critical values too conservative (you reject too infrequently and lose power). The Lilliefors correction provides the correct critical values for this setting. Alternatively, use Shapiro-Wilk, which is designed for the composite hypothesis.
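You can see the Lilliefors effect directly by simulating the null distribution of $D_n$ with estimated parameters (a Monte Carlo sketch; the sample size, simulation count, and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_sim = 50, 2000
d_lillie = np.empty(n_sim)
for i in range(n_sim):
    # Data really is normal, but we fit mu and sigma before testing.
    x = rng.normal(size=n)
    fitted = stats.norm(loc=x.mean(), scale=x.std(ddof=1))
    d_lillie[i] = stats.kstest(x, fitted.cdf).statistic

# 95th percentile of the estimated-parameter null vs the standard KS one.
crit_lillie = np.quantile(d_lillie, 0.95)
crit_ks = stats.kstwo.ppf(0.95, n)  # exact one-sample KS null for n = 50
print(f"Lilliefors critical value: {crit_lillie:.3f}")
print(f"Standard KS critical value: {crit_ks:.3f}")
```

The standard critical value is visibly larger than the estimated-parameter one, which is exactly why using it after fitting parameters rejects too infrequently.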
Failing to reject does not confirm the distribution
A goodness-of-fit test can only reject the null. Not rejecting means the data is consistent with $F_0$, not that $F_0$ is true. With small $n$, you lack power to detect even substantial departures. With large $n$, you will reject almost any parametric model because no real data exactly follows a theoretical distribution. The useful question is not "is my data exactly Gaussian?" but "is the Gaussian approximation good enough for my purpose?"
p-values from goodness-of-fit tests are not effect sizes
A small p-value from a KS test says the departure from $F_0$ is statistically significant, but it says nothing about how large or practically important the departure is. Report $D_n$ (the maximum CDF discrepancy) alongside the p-value. A $D_n$ of 0.02 with $n = 10{,}000$, say, is statistically significant but practically negligible for most applications.
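A quick demonstration of the significance-vs-effect-size gap (a sketch; the mild scale-mixture alternative and all numbers are illustrative choices of ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200_000
# Mixture: 95% N(0,1) + 5% N(0,4) -- a mild, practically minor departure
# whose population-level CDF discrepancy is under 0.01.
x = np.where(rng.random(n) < 0.95,
             rng.normal(0.0, 1.0, n),
             rng.normal(0.0, 2.0, n))

res = stats.kstest(x, stats.norm(0, 1).cdf)
print(f"D = {res.statistic:.4f}, p = {res.pvalue:.2e}")
# D stays small, yet p is astronomically small at this sample size.
```

The p-value screams; the statistic, read as an effect size, whispers. Report both.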
Exercises
Problem
You have 200 observations and want to test whether they come from an Exponential(1) distribution. Suppose you compute $D_{200} = 0.12$. The KS critical value at $\alpha = 0.05$ for $n = 200$ is approximately $1.358/\sqrt{200} \approx 0.096$. Do you reject?
Problem
Explain why the Anderson-Darling test is more powerful than KS for detecting heavy-tailed alternatives to normality. What is the weighting function, and how does it change the test's sensitivity?
References
Canonical:
- Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapter 14
- D'Agostino & Stephens, Goodness-of-Fit Techniques (1986). The definitive reference for GOF testing.
- Shapiro & Wilk, "An Analysis of Variance Test for Normality" (Biometrika, 1965)
Current:
- Razali & Wah, "Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests" (Journal of Statistical Modeling and Analytics, 2011)
- Stephens, "EDF Statistics for Goodness of Fit and Some Comparisons" (JASA, 1974)
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
Next Topics
- Neyman-Pearson theory: the theoretical framework underlying all hypothesis tests
- Calibration: when you need to test whether a model's predicted probabilities match observed frequencies
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)