
Regression Methods

Gauss-Markov Theorem

Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is why OLS is called the best linear unbiased estimator (BLUE), and understanding when the theorem's assumptions fail is just as important as the result itself.


Why This Matters

When you run a linear regression using ordinary least squares (OLS), you are making an implicit claim: this is the best way to estimate the coefficients. The Gauss-Markov theorem tells you exactly when that claim is justified, and when it is not.

Understanding Gauss-Markov is essential because it tells you the default choice (OLS) and the conditions under which you should deviate from it.

Mental Model

You want to estimate the coefficients $\beta$ in a linear model. There are many possible linear estimators: weighted least squares, ridge regression (which is biased), arbitrary linear combinations of the data. Among all estimators that are both linear and unbiased, OLS gives you the one with the smallest variance.

This is a strong optimality guarantee, but it comes with conditions. Break the conditions and the guarantee evaporates.

Formal Setup

Definition

Linear Regression Model

The linear regression model is:

$$y = X\beta + \epsilon$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank), $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\epsilon \in \mathbb{R}^n$ is the error vector.

Definition

Gauss-Markov Assumptions

The Gauss-Markov assumptions on the error vector $\epsilon$ are:

  1. Zero mean: $\mathbb{E}[\epsilon] = 0$
  2. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$
  3. Uncorrelated errors: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$

In matrix form: $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I$.

Note: Gaussian (normal) errors are not required.

Definition

OLS Estimator

The ordinary least squares estimator is:

$$\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$$

This minimizes the sum of squared residuals: $\|y - X\beta\|^2$.
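As a quick sanity check, the closed form can be compared against NumPy's least-squares solver. This is a minimal sketch on made-up data; the sizes, seed, and `beta_true` are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 100 observations, p = 3 predictors (illustrative values).
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem, more stably
# for ill-conditioned designs.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True: both solve the same problem
```

In practice `lstsq` (or a QR decomposition) is preferred over forming $X^T X$ explicitly, since squaring the matrix squares its condition number.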

Definition

Linear Unbiased Estimator

An estimator $\tilde{\beta} = Cy$ is linear (it is a linear function of $y$) and unbiased if $\mathbb{E}[\tilde{\beta}] = \beta$ for all $\beta$. The unbiasedness condition requires $CX = I_p$.

The Theorem

Theorem

Gauss-Markov Theorem

Statement

Under the Gauss-Markov assumptions, the OLS estimator $\hat{\beta}_{\text{OLS}} = (X^T X)^{-1} X^T y$ is BLUE, the best linear unbiased estimator. That is, for any other linear unbiased estimator $\tilde{\beta} = Cy$:

$$\text{Var}(\tilde{\beta}_j) \geq \text{Var}(\hat{\beta}_{\text{OLS},j}) \quad \text{for all } j = 1, \ldots, p$$

Equivalently, $\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{\text{OLS}})$ is positive semidefinite.

Intuition

OLS is the most efficient use of the data among all linear unbiased methods. Any other linear unbiased estimator must have equal or larger variance for every coefficient. You cannot do better without either (a) introducing bias, (b) using a nonlinear estimator, or (c) exploiting additional knowledge about the errors.

Proof Sketch

Let $\tilde{\beta} = Cy$ be any linear unbiased estimator. Write $C = (X^T X)^{-1} X^T + D$ for some matrix $D$. Unbiasedness ($CX = I$) forces $DX = 0$. Compute the variance:

$$\text{Var}(\tilde{\beta}) = \sigma^2 CC^T = \sigma^2 (X^T X)^{-1} + \sigma^2 DD^T$$

The cross term vanishes because $DX = 0$. Since $DD^T$ is positive semidefinite, $\text{Var}(\tilde{\beta}) \geq \text{Var}(\hat{\beta}_{\text{OLS}})$ in the matrix (Loewner) ordering.
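The proof sketch can be checked numerically. In this sketch (toy data; every size and seed is illustrative), a matrix $D$ with $DX = 0$ is built by projecting an arbitrary matrix onto the orthogonal complement of the column space of $X$, and the variance difference is verified to be positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small arbitrary design matrix.
n, p = 30, 2
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T                      # OLS: beta_hat = C_ols @ y

# Build D with D @ X = 0: project an arbitrary M onto the orthogonal
# complement of col(X).
M = rng.normal(size=(p, n))
P = X @ XtX_inv @ X.T                      # projection onto col(X)
D = M @ (np.eye(n) - P)                    # then D @ X = 0 exactly

C = C_ols + D                              # another linear unbiased estimator
sigma2 = 1.0
var_ols = sigma2 * XtX_inv                 # Var(beta_hat_OLS)
var_alt = sigma2 * C @ C.T                 # Var(beta_tilde) = sigma^2 C C^T

# The difference equals sigma^2 D D^T, hence positive semidefinite:
diff = var_alt - var_ols
print(np.linalg.eigvalsh(diff).min() >= -1e-8)  # True
```

Any choice of $M$ gives a valid competitor; the projection step is what enforces the unbiasedness constraint $CX = I$.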

Why It Matters

This theorem justifies OLS as the default method for linear regression. If the Gauss-Markov assumptions hold, there is no reason to use any other linear unbiased estimator. This is why OLS is taught first and used most often.

Failure Mode

The theorem fails when: (1) errors are heteroscedastic ($\text{Var}(\epsilon_i) \neq \sigma^2$), (2) errors are correlated ($\text{Cov}(\epsilon_i, \epsilon_j) \neq 0$ for some $i \neq j$), or (3) you are willing to accept bias in exchange for lower variance (regularization). In all three cases, OLS is no longer optimal.

What BLUE Means

B = Best (minimum variance)

L = Linear (estimator is a linear function of $y$)

U = Unbiased ($\mathbb{E}[\hat{\beta}] = \beta$)

E = Estimator

The restriction to linear estimators is important. There may exist nonlinear unbiased estimators with lower variance. And biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance.

When Assumptions Fail

Heteroscedasticity

If $\text{Var}(\epsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Use weighted least squares (WLS):

$$\hat{\beta}_{\text{WLS}} = (X^T W X)^{-1} X^T W y$$

where $W = \text{diag}(1/\sigma_1^2, \ldots, 1/\sigma_n^2)$. WLS is BLUE under heteroscedastic errors.
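A small simulation makes the efficiency gap visible. This is an illustrative sketch: the sizes, seed, and error scales are made up, and the per-observation standard deviations are assumed known, which real applications must estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Heteroscedastic setup: error SDs grow with the observation index.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
sigmas = np.linspace(0.2, 3.0, n)          # assumed-known per-observation SDs
W = np.diag(1.0 / sigmas**2)

ols_draws, wls_draws = [], []
for _ in range(2000):                      # Monte Carlo over error draws
    y = X @ beta_true + rng.normal(scale=sigmas)
    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y))
    wls_draws.append(np.linalg.solve(X.T @ W @ X, X.T @ W @ y))

# Both estimators are unbiased, but WLS has the smaller sampling variance.
print(np.var(ols_draws, axis=0))
print(np.var(wls_draws, axis=0))           # smaller for each coefficient
```

WLS downweights the noisy observations, which is exactly the "additional knowledge about the errors" that lets it beat OLS.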

Correlated Errors

If $\text{Var}(\epsilon) = \sigma^2 \Omega$ where $\Omega \neq I$, use generalized least squares (GLS):

$$\hat{\beta}_{\text{GLS}} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$$

GLS is BLUE under the general covariance structure. OLS ignores correlations and is inefficient.

Biased Estimators

Ridge regression $\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$ is biased, so Gauss-Markov does not apply. But ridge can have lower mean squared error than OLS, especially when $X^T X$ is nearly singular (high multicollinearity).
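A collinear toy example shows the tradeoff. All data and the penalty $\lambda$ here are illustrative, and $\lambda$ is not tuned; it is just a sketch of the bias-for-variance trade:

```python
import numpy as np

rng = np.random.default_rng(4)

# Nearly collinear design: second column is almost a copy of the first.
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # high multicollinearity
X = np.column_stack([x1, x2])
beta_true = np.array([1.0, 1.0])
lam = 1.0                                  # illustrative, untuned penalty

ols_err, ridge_err = [], []
for _ in range(1000):
    y = X @ beta_true + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    ols_err.append(np.sum((b_ols - beta_true) ** 2))
    ridge_err.append(np.sum((b_ridge - beta_true) ** 2))

# Ridge is biased, yet its total MSE is far lower here: the nearly
# singular X^T X makes the OLS variance explode.
print(np.mean(ols_err), np.mean(ridge_err))
```

This is not a contradiction of Gauss-Markov; ridge simply left the competition by giving up unbiasedness.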

Common Confusions

Watch Out

Gauss-Markov does NOT require Gaussian errors

The theorem says nothing about the distribution of $\epsilon$ beyond its first two moments (zero mean, constant variance, uncorrelated). The errors can follow any distribution. Gaussian errors are needed for the F-test and t-test to have exact distributions, but not for Gauss-Markov itself.
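This can be seen in simulation with decidedly non-Gaussian errors. A sketch on illustrative data: uniform errors satisfy the Gauss-Markov assumptions (zero mean, constant variance, independent and hence uncorrelated draws), and OLS behaves exactly as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
half = 1.5
sigma2 = half**2 / 3.0                     # Var(Uniform(-a, a)) = a^2 / 3

draws = []
for _ in range(5000):
    # Uniform errors: zero mean, constant variance, uncorrelated.
    y = X @ beta_true + rng.uniform(-half, half, size=n)
    draws.append(np.linalg.solve(X.T @ X, X.T @ y))
draws = np.asarray(draws)

# OLS is unbiased and its covariance matches sigma^2 (X^T X)^{-1},
# with no normality assumption anywhere.
print(draws.mean(axis=0))                  # close to beta_true
print(np.cov(draws.T))
print(sigma2 * np.linalg.inv(X.T @ X))     # theoretical covariance
```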

Watch Out

BLUE does not mean best among ALL estimators

OLS is best among linear unbiased estimators. Biased estimators (ridge, lasso) or nonlinear estimators can do better in terms of mean squared error. The theorem is about a restricted competition.

Watch Out

Unbiasedness is not always desirable

The bias-variance tradeoff shows that a small amount of bias can dramatically reduce variance, lowering overall MSE. Gauss-Markov optimizes for zero bias, which is the right goal for inference but not always for prediction.

Summary

  • OLS is BLUE: best linear unbiased estimator under Gauss-Markov assumptions
  • Assumptions: $\mathbb{E}[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2 I$; no Gaussianity needed
  • Heteroscedasticity breaks optimality; use WLS
  • Correlated errors break optimality; use GLS
  • Biased estimators (ridge) are outside the scope of the theorem but can beat OLS in MSE
  • BLUE is about a restricted competition: linear and unbiased only

Exercises

ExerciseCore

Problem

State the three Gauss-Markov assumptions. For each, give an example of a real dataset where that assumption might be violated.

ExerciseAdvanced

Problem

Prove that if $\tilde{\beta} = Cy$ is linear and unbiased for $\beta$, then $CX = I_p$. Start from the definition of unbiasedness.

ExerciseResearch

Problem

Consider the James-Stein estimator, which shrinks OLS estimates toward zero. It is biased but has strictly lower total MSE than OLS when $p \geq 3$. How does this not contradict Gauss-Markov?

References

Canonical:

  • Gauss (1821) and Markov (1912): the original results
  • Greene, Econometric Analysis (2018), Chapter 4
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 3.2

Current:

  • Hayashi, Econometrics (2000), Chapter 1: a clear modern treatment

Last reviewed: April 2026
