
Analysis of Variance

One-way ANOVA decomposes the total sum of squares into a between-group component and a within-group component. Under iid normal data with equal group variances, the ratio of mean squares has an $F$ distribution and gives an exact test of the equal-means null hypothesis. The Welch correction handles unequal variances. Two-way ANOVA partitions further into main effects and interaction. Post-hoc procedures (Tukey HSD, Bonferroni, Scheffé) correct for the multiple-comparison problem that naive pairwise $t$-tests ignore.


Why This Matters

The one-way ANOVA $F$-statistic is the first multi-group comparison every student of statistics learns, and it is the engine inside fixed-effects regression, mixed-effects models, designed experiments, randomized trials, and quality-control inspection. The construction is short: decompose the total variability into a between-group piece (signal) and a within-group piece (noise), form the ratio, and read off the test statistic.

ANOVA also shows where variance-stabilizing transformations and the multivariate normal pay off in applications: the $F$ distribution under the null is exact for iid normal data with equal variances, and approximate otherwise. The same machinery generalizes to two-way layouts with interaction, to repeated-measures designs, and to all of the post-hoc multiple-comparison machinery.

One-Way ANOVA: Setup and Decomposition

Consider $K \geq 2$ groups with $n_k$ observations in group $k$, total $N = \sum_k n_k$. Write
$$Y_{kj} = \mu_k + \varepsilon_{kj}, \quad k = 1, \ldots, K,\; j = 1, \ldots, n_k,$$
where $\mu_k$ is the group mean and $\varepsilon_{kj}$ are iid noise. Define the group means and the grand mean
$$\bar Y_k = \frac{1}{n_k}\sum_{j=1}^{n_k} Y_{kj}, \quad \bar Y = \frac{1}{N}\sum_{k=1}^K\sum_{j=1}^{n_k} Y_{kj}.$$

Theorem

One-Way ANOVA Sum-of-Squares Decomposition

Statement

The total sum of squares decomposes as
$$\underbrace{\sum_{k=1}^K \sum_{j=1}^{n_k} (Y_{kj} - \bar Y)^2}_{\text{SS}_{\text{tot}}} \;=\; \underbrace{\sum_{k=1}^K n_k (\bar Y_k - \bar Y)^2}_{\text{SS}_{\text{between}}} \;+\; \underbrace{\sum_{k=1}^K \sum_{j=1}^{n_k} (Y_{kj} - \bar Y_k)^2}_{\text{SS}_{\text{within}}}.$$
The decomposition is purely algebraic and does not require any distributional assumption.

Proof Sketch

Add and subtract $\bar Y_k$ inside the squared deviation: $Y_{kj} - \bar Y = (Y_{kj} - \bar Y_k) + (\bar Y_k - \bar Y)$. Expand the square. The cross-term sums to zero because $\sum_j (Y_{kj} - \bar Y_k) = 0$ inside each group, leaving the two stated sums of squares.

Why It Matters

The decomposition is the geometric heart of ANOVA: $\text{SS}_{\text{between}}$ is the squared distance from the group-mean fit to the grand-mean fit, and $\text{SS}_{\text{within}}$ is the squared distance from the data to the group-mean fit. Under iid normal data with equal variances, these two squared distances are independent and have known chi-squared distributions; their ratio gives the $F$ statistic.

Failure Mode

The decomposition is exact. The downstream $F$-distribution claim is what fails under non-normal data or unequal variances; the decomposition itself holds for any partition of the data.
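The identity can be checked numerically in a few lines. The sketch below uses arbitrary simulated data (the means, scales, and group sizes are purely illustrative), since the decomposition holds for any numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three groups with different means and sizes; any data works --
# the decomposition is algebraic, not distributional.
groups = [rng.normal(loc=m, scale=2.0, size=n)
          for m, n in [(5.0, 8), (6.0, 12), (9.0, 10)]]

y = np.concatenate(groups)
grand_mean = y.mean()

ss_tot = ((y - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# SS_tot = SS_between + SS_within, up to floating-point error
assert np.isclose(ss_tot, ss_between + ss_within)
```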

The F-Statistic and Its Distribution

The standard one-way ANOVA test is for the null hypothesis $H_0 : \mu_1 = \mu_2 = \cdots = \mu_K$.

Theorem

F Distribution of the One-Way ANOVA Statistic Under the Null

Statement

Under $H_0 : \mu_1 = \cdots = \mu_K = \mu$ and the iid $N(\mu, \sigma^2)$ assumption,
$$F = \frac{\text{SS}_{\text{between}} / (K - 1)}{\text{SS}_{\text{within}} / (N - K)} = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \;\sim\; F_{K - 1,\, N - K}.$$
The two sums of squares satisfy
$$\frac{\text{SS}_{\text{between}}}{\sigma^2} \sim \chi^2_{K - 1}, \quad \frac{\text{SS}_{\text{within}}}{\sigma^2} \sim \chi^2_{N - K},$$
and they are independent.

Intuition

Project the data onto the subspace of vectors that are constant within each group (this gives the group means) and onto its orthogonal complement (this gives the within-group residuals). Under joint normality with equal variances, the two projections are jointly normal with zero cross-covariance, hence independent (see multivariate normal). The squared norms are independent chi-squared with degrees of freedom equal to the dimensions of the two subspaces.

Proof Sketch

Under $H_0$, $Y_{kj} \sim N(\mu, \sigma^2)$ independently. Consider the data vector $\mathbf{Y} \in \mathbb{R}^N$. Let $V_1$ be the subspace of $\mathbb{R}^N$ where each group has its own constant value; let $V_0 \subset V_1$ be the subspace where all entries are the same constant. Then $\text{SS}_{\text{within}} = \|\mathbf{Y} - P_{V_1}\mathbf{Y}\|^2$ and $\text{SS}_{\text{between}} = \|P_{V_1}\mathbf{Y} - P_{V_0}\mathbf{Y}\|^2$. The subspace $V_1$ has dimension $K$ and $V_0$ has dimension $1$.

Under $H_0$, $\mathbf{Y} - \mu\mathbf{1} \sim N_N(0, \sigma^2 I_N)$. Projecting an isotropic Gaussian onto orthogonal subspaces gives independent Gaussians of the lower dimensions, and squared norms become scaled chi-squareds:
$$\frac{\text{SS}_{\text{within}}}{\sigma^2} \sim \chi^2_{N - K}, \quad \frac{\text{SS}_{\text{between}}}{\sigma^2} \sim \chi^2_{K - 1},$$
independently. The ratio of independent scaled chi-squareds, each divided by its degrees of freedom, has the $F$ distribution by definition.

Why It Matters

This is one of the cleanest exact-distribution results in classical statistics. Under the stated assumptions, the test has level exactly $\alpha$ for any finite sample size $N$. The same construction underlies the $F$ tests in linear regression, ANCOVA, repeated-measures, and split-plot designs; they differ only in the choice of nested subspaces.

Failure Mode

Three assumptions can break. (1) Non-normality. The $F$ test tolerates mild non-normality when the sample sizes are equal, but its level and power degrade when sample sizes are unequal. (2) Unequal variances (heteroscedasticity). The $F$ statistic no longer has an $F$ distribution; use the Welch correction below. (3) Non-independence (e.g., repeated measurements, clustered data). The decomposition itself is fine, but the chi-squared degrees of freedom are wrong; use a mixed model.
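The construction above can be sketched directly, assuming NumPy and SciPy are available; the simulated groups are illustrative. The hand-built statistic and $p$-value are checked against scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
K, n = 4, 15                      # 4 groups, 15 observations each
N = K * n
groups = [rng.normal(loc=0.0, scale=1.0, size=n) for _ in range(K)]

y = np.concatenate(groups)
grand_mean = y.mean()
ss_between = sum(n * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = MS_between / MS_within with (K-1, N-K) degrees of freedom
F = (ss_between / (K - 1)) / (ss_within / (N - K))
p = stats.f.sf(F, K - 1, N - K)   # upper-tail probability of F_{K-1, N-K}

# scipy's one-way ANOVA should agree with the hand computation
F_ref, p_ref = stats.f_oneway(*groups)
assert np.isclose(F, F_ref) and np.isclose(p, p_ref)
```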

Unequal Variances: The Welch Correction

When the assumption of equal variances across groups fails (the Behrens-Fisher problem), the ratio MS-between / MS-within no longer has an $F$ distribution. Welch (1951) proposed a correction that approximates the null distribution as $F_{K - 1,\, \nu^*}$ for a data-driven degrees-of-freedom adjustment $\nu^*$.

Theorem

Welch ANOVA Statistic

Statement

Define weights $w_k = n_k / S_k^2$ where $S_k^2$ is the within-group sample variance, $w_\cdot = \sum_k w_k$, and the weighted grand mean $\tilde Y = \sum_k w_k \bar Y_k / w_\cdot$. The Welch statistic is
$$F^* = \frac{\sum_k w_k (\bar Y_k - \tilde Y)^2 / (K - 1)}{1 + \frac{2(K - 2)}{K^2 - 1} \sum_k \frac{(1 - w_k / w_\cdot)^2}{n_k - 1}}.$$
Under the equal-means null and approximate normality, $F^*$ is approximately $F_{K - 1,\, \nu^*}$ where
$$\nu^* = \frac{K^2 - 1}{3 \sum_k \frac{(1 - w_k / w_\cdot)^2}{n_k - 1}}.$$

Intuition

The naive equal-variance pooled estimate of $\sigma^2$ is replaced by a weighted average that gives more weight to groups with smaller variance. The degrees of freedom are then adjusted to match the first two moments of the resulting distribution.

Why It Matters

Welch's correction is the default ANOVA in most statistical software (e.g., oneway.test in R) when variances are not assumed equal. For two groups, the same idea gives Welch's $t$-test, which is now the recommended replacement for the equal-variance two-sample $t$-test in nearly all applied contexts.

Failure Mode

The Welch approximation is good when sample sizes are at least moderate (each $n_k \geq 5$ is a reasonable rule of thumb). For very small group sizes the approximation can be poor, and a permutation or bootstrap alternative is safer.
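SciPy's f_oneway does not apply the Welch correction, so the sketch below implements the stated formulas directly (the function name welch_anova and the simulated unequal-variance data are this example's own):

```python
import numpy as np
from scipy import stats

def welch_anova(groups):
    """Welch's F* statistic, approximate df nu*, and p-value.

    Implements the formulas stated above for a list of 1-D samples.
    """
    K = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    s2 = np.array([np.var(g, ddof=1) for g in groups])

    w = n / s2                            # w_k = n_k / S_k^2
    w_dot = w.sum()
    y_tilde = (w * m).sum() / w_dot       # weighted grand mean

    a = ((1 - w / w_dot) ** 2 / (n - 1)).sum()
    num = (w * (m - y_tilde) ** 2).sum() / (K - 1)
    den = 1 + 2 * (K - 2) / (K ** 2 - 1) * a
    f_star = num / den
    nu_star = (K ** 2 - 1) / (3 * a)
    return f_star, nu_star, stats.f.sf(f_star, K - 1, nu_star)

rng = np.random.default_rng(2)
# Unequal variances and unequal sizes: the classic Behrens-Fisher setting
groups = [rng.normal(0, 1, 10), rng.normal(0, 3, 25), rng.normal(0, 5, 40)]
f_star, nu_star, p = welch_anova(groups)
```

As a sanity check, for $K = 2$ the statistic reduces to the square of Welch's two-sample $t$, which can be compared against scipy.stats.ttest_ind with equal_var=False.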

Two-Way ANOVA: Main Effects and Interaction

For a two-factor design with factor $A$ at $I$ levels and factor $B$ at $J$ levels, the model with $n$ replicates per cell is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, I,\; j = 1, \ldots, J,\; k = 1, \ldots, n,$$
with sum-to-zero constraints $\sum_i \alpha_i = 0$, $\sum_j \beta_j = 0$, and $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$ for identifiability.

Theorem

Two-Way ANOVA Sum-of-Squares Decomposition

Statement

With $N = IJn$ observations, the total sum of squares decomposes orthogonally into four components:
$$\text{SS}_{\text{tot}} \;=\; \text{SS}_A + \text{SS}_B + \text{SS}_{AB} + \text{SS}_E,$$
where
$$\text{SS}_A = Jn \sum_i (\bar Y_{i \cdot \cdot} - \bar Y)^2, \quad \text{SS}_B = In \sum_j (\bar Y_{\cdot j \cdot} - \bar Y)^2,$$
$$\text{SS}_{AB} = n \sum_{i, j}(\bar Y_{ij \cdot} - \bar Y_{i \cdot \cdot} - \bar Y_{\cdot j \cdot} + \bar Y)^2, \quad \text{SS}_E = \sum_{i, j, k}(Y_{ijk} - \bar Y_{ij \cdot})^2.$$
Under $H_0$ for each of the three effects, the corresponding $F$-statistic
$$F_{\text{effect}} = \frac{\text{SS}_{\text{effect}} / \text{df}_{\text{effect}}}{\text{SS}_E / (IJ(n - 1))}$$
has an exact $F$ distribution with degrees of freedom $(\text{df}_{\text{effect}},\, IJ(n - 1))$, where $\text{df}_A = I - 1$, $\text{df}_B = J - 1$, $\text{df}_{AB} = (I - 1)(J - 1)$.

Intuition

The four components are orthogonal projections onto four nested subspaces. The dimension counts give the degrees of freedom. Under jointly normal data, the chi-squared distributions are independent (the multivariate normal independence property again) and each ratio is $F$-distributed.

Why It Matters

This is the design-of-experiments core. The interaction sum of squares $\text{SS}_{AB}$ is the part of the variability that is not explained by the main effects acting additively; a significant interaction means the effect of one factor depends on the level of the other. Without checking the interaction, additive main-effects conclusions can be misleading.

Failure Mode

Unbalanced designs (unequal $n$ per cell) break the orthogonality of the decomposition. Type I, II, and III sums of squares (sequential, hierarchical, partial) become different, and the choice depends on the scientific question. Modern software defaults vary; specify the type explicitly in any reported analysis.
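For a balanced layout the four sums of squares can be computed by direct averaging. A minimal NumPy sketch, with illustrative effect sizes and cell counts, verifies the orthogonal decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, n = 3, 4, 5                     # levels of A, levels of B, replicates
# Balanced layout: Y[i, j, k] is replicate k in cell (i, j)
Y = (rng.normal(size=(I, J, n))
     + np.array([0.0, 1.0, 2.0])[:, None, None]        # main effect of A
     + np.array([0.0, 0.5, 1.0, 1.5])[None, :, None])  # main effect of B

grand = Y.mean()
a_mean = Y.mean(axis=(1, 2))          # row means    \bar Y_{i..}
b_mean = Y.mean(axis=(0, 2))          # column means \bar Y_{.j.}
cell_mean = Y.mean(axis=2)            # cell means   \bar Y_{ij.}

ss_a = J * n * ((a_mean - grand) ** 2).sum()
ss_b = I * n * ((b_mean - grand) ** 2).sum()
ss_ab = n * ((cell_mean - a_mean[:, None] - b_mean[None, :] + grand) ** 2).sum()
ss_e = ((Y - cell_mean[:, :, None]) ** 2).sum()
ss_tot = ((Y - grand) ** 2).sum()

# Balanced design: the four pieces add up exactly
assert np.isclose(ss_tot, ss_a + ss_b + ss_ab + ss_e)
```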

Post-Hoc Comparisons: Why Pairwise t-Tests Are Wrong

A significant ANOVA $F$ test rejects $H_0 : \mu_1 = \cdots = \mu_K$ but does not identify which group means differ. Running $\binom{K}{2}$ pairwise $t$-tests at level $\alpha$ each inflates the family-wise error rate (FWER) to approximately $1 - (1 - \alpha)^{\binom{K}{2}}$, which for $K = 5$ and $\alpha = 0.05$ already exceeds $0.4$. Post-hoc procedures fix this.

Bonferroni. Run each of the $m = \binom{K}{2}$ pairwise tests at level $\alpha / m$. The FWER is at most $\alpha$ by the union bound. Bonferroni is conservative (the bound is loose when tests are positively correlated), and most useful when the number of comparisons is small.

Tukey HSD. Use the studentized-range distribution of the maximum standardized difference among $K$ group means. The HSD critical value $q_{\alpha; K, N - K}$ comes from the distribution of $\max_{i, j} (\bar Y_i - \bar Y_j) / \sqrt{\text{MS}_E / n}$ under the equal-means null with equal sample sizes. Confidence intervals $\bar Y_i - \bar Y_j \pm q_{\alpha; K, N - K} \sqrt{\text{MS}_E / n}$ have simultaneous coverage $1 - \alpha$ over all $\binom{K}{2}$ pairs. Tukey is the standard choice for all-pairs comparisons with balanced data; it is sharper than Bonferroni in this setting.

Scheffé. Adjust each contrast (any linear combination $\sum_k c_k \bar Y_k$ with $\sum_k c_k = 0$) using the critical value $\sqrt{(K - 1) F_{\alpha; K - 1, N - K}}$. Scheffé gives simultaneous coverage for the infinite family of all contrasts, not just pairwise ones; it is the right tool for data-driven contrast exploration but is overly conservative for restricted families.

The choice among these is determined by the family of comparisons of interest. Pre-specified small families: Bonferroni or Holm. All pairwise: Tukey. Open-ended contrast search: Scheffé.
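The FWER arithmetic and the Bonferroni rule can be sketched in a few lines. Note one simplification: this uses plain pairwise $t$-tests with per-pair variance estimates rather than the pooled $\text{MS}_E$, and the simulated null data are illustrative:

```python
import numpy as np
from itertools import combinations
from scipy import stats

alpha, K = 0.05, 5
m = K * (K - 1) // 2                  # number of pairwise comparisons
naive_fwer = 1 - (1 - alpha) ** m     # approximate FWER of uncorrected tests
print(naive_fwer)                     # exceeds 0.4 for K = 5, alpha = 0.05

rng = np.random.default_rng(4)
groups = [rng.normal(0, 1, 12) for _ in range(K)]  # null: all means equal

# Bonferroni: each pairwise test at level alpha / m keeps FWER <= alpha
for i, j in combinations(range(K), 2):
    _, p = stats.ttest_ind(groups[i], groups[j])
    if p < alpha / m:
        print(f"groups {i} and {j} differ (p = {p:.4f})")
```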

Common Confusions

Watch Out

A significant F does not say which groups differ

The ANOVA $F$ test answers the omnibus question "are any of the group means different?", not "which pairs of means differ?" The post-hoc procedures answer the second question; running them only after $F$ rejects ("protected" testing) is one common approach. Reporting only the $F$ test result and informally describing which means look biggest is not a valid inference.

Watch Out

ANOVA does not require the groups to be ordered

Group labels in ANOVA are nominal: there is no ordering and no notion of distance between groups. If the levels are ordered (dose levels, age bins), a regression with the level as a numeric predictor or a trend test is usually more informative than the ANOVA $F$ test.

Watch Out

Equal-variance is the assumption most often violated

ANOVA tolerates mild non-normality reasonably well, especially with balanced and large samples. It is far more sensitive to unequal variances combined with unequal sample sizes. When in doubt, use Welch.

Watch Out

ANOVA is a special case of linear regression

The one-way ANOVA $F$ statistic equals the $F$ statistic from regressing $Y$ on $K - 1$ dummy variables encoding group membership. Two-way ANOVA is the same as regressing on dummies for both factors plus their product terms. Modern statistical packages use the regression formulation as the unifying frame; ANOVA tables are a presentation choice, not a separate methodology.
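The equivalence can be verified directly: regress $y$ on an intercept plus $K - 1$ dummies via least squares and form the regression $F$ from the residual sums of squares. A sketch with simulated data (group means and sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
K, n = 3, 20
N = K * n
groups = [rng.normal(loc=m, size=n) for m in (0.0, 0.5, 1.0)]
y = np.concatenate(groups)
labels = np.repeat(np.arange(K), n)

# Design matrix: intercept plus K-1 dummies (group 0 as baseline)
X = np.column_stack([np.ones(N)] + [(labels == k).astype(float)
                                    for k in range(1, K)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

rss = (resid ** 2).sum()              # full-model residual SS
tss = ((y - y.mean()) ** 2).sum()     # intercept-only residual SS
F_reg = ((tss - rss) / (K - 1)) / (rss / (N - K))

# The regression F equals the one-way ANOVA F
F_anova, _ = stats.f_oneway(*groups)
assert np.isclose(F_reg, F_anova)
```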

Exercises

ExerciseCore

Problem

Three groups of $n_1 = n_2 = n_3 = 10$ observations are drawn iid from $N(\mu_k, \sigma^2)$. The group means are $\bar Y_1 = 5$, $\bar Y_2 = 6$, $\bar Y_3 = 8$, and the within-group sample variances are $S_1^2 = S_2^2 = S_3^2 = 4$. Compute the $F$ statistic and state its degrees of freedom under the equal-means null.

ExerciseCore

Problem

For $K = 4$ groups, find the Bonferroni-corrected significance level for pairwise comparisons such that the family-wise error rate is at most $0.05$. How much smaller is this than $0.05$?

ExerciseAdvanced

Problem

Show that for $K = 2$ groups with $n_1 = n_2 = n$, the ANOVA $F$ statistic equals the square of the two-sample equal-variance $t$ statistic, with $F_{1, 2n - 2}$ matching $t_{2n - 2}^2$.

References

Canonical:

  • Casella and Berger, Statistical Inference (2002), 2nd edition, Chapter 11
  • Lehmann and Romano, Testing Statistical Hypotheses (2005), 3rd edition, Chapter 7
  • Scheffé, The Analysis of Variance (1959). The original monograph; still the deepest reference on the geometry.

Foundational papers:

  • Fisher, "Statistical Methods for Research Workers" (1925) introduced the technique; Chapter 7.
  • Welch, "On the comparison of several mean values: an alternative approach" (Biometrika, 1951), volume 38, pages 330-336
  • Tukey, "The problem of multiple comparisons" (unpublished manuscript, 1953; later in The Collected Works of John W. Tukey, Volume VIII)
  • Bonferroni, "Teoria statistica delle classi e calcolo delle probabilità" (1936). The union-bound correction.

Applied references:

  • Box, Hunter, and Hunter, Statistics for Experimenters (2005), 2nd edition. The design-of-experiments perspective.
  • Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (2009), 2nd edition, Section 3.2 (ANOVA-as-regression).
  • Hsu, Multiple Comparisons: Theory and Methods (1996). The post-hoc-procedure reference.


Last reviewed: May 12, 2026
