Regression Methods
Gauss-Markov Theorem
Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is the BLUE theorem, and understanding when its assumptions fail is just as important as the result itself.
Why This Matters
When you run a linear regression using ordinary least squares (OLS), you are making an implicit claim: this is the best way to estimate the coefficients. The Gauss-Markov theorem tells you exactly when that claim is justified, and when it is not.
Understanding Gauss-Markov is essential because it tells you the default choice (OLS) and the conditions under which you should deviate from it.
Mental Model
You want to estimate the coefficients in a linear model. There are many possible linear estimators: weighted least squares, ridge regression (which is biased), arbitrary linear combinations of the data. Among all estimators that are both linear and unbiased, OLS gives you the one with the smallest variance.
This is a strong optimality guarantee, but it comes with conditions. Break the conditions and the guarantee evaporates.
Formal Setup
Linear Regression Model
The linear regression model is:

$$y = X\beta + \varepsilon$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix (assumed fixed and of full column rank), $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\varepsilon \in \mathbb{R}^n$ is the error vector.
Gauss-Markov Assumptions
The Gauss-Markov assumptions on the error vector are:
- Zero mean: $\mathbb{E}[\varepsilon_i] = 0$ for all $i$
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$
- Uncorrelated errors: $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
In matrix form: $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I$.
Note: Gaussian (normal) errors are not required.
OLS Estimator
The ordinary least squares estimator is:

$$\hat{\beta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y$$

This minimizes the sum of squared residuals: $\|y - X\beta\|^2 = \sum_{i=1}^n (y_i - x_i^\top \beta)^2$.
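As a minimal sketch (data and variable names here are illustrative), the closed-form estimator can be computed directly with NumPy and cross-checked against the library's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                    # fixed design, full column rank
beta_true = np.array([2.0, -1.0, 0.5])         # true coefficients (for the simulation)
y = X @ beta_true + rng.normal(scale=0.3, size=n)  # homoscedastic, uncorrelated errors

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library solver, which minimizes ||y - X b||^2
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)
```

In practice `np.linalg.lstsq` (or a QR-based solver) is preferred over forming $X^\top X$ explicitly, since the normal equations square the condition number of the problem.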
Linear Unbiased Estimator
An estimator $\tilde{\beta} = Cy$ is linear (it is a linear function of $y$) and unbiased if $\mathbb{E}[\tilde{\beta}] = \beta$ for all $\beta$. The unbiasedness condition requires $CX = I$.
The Theorem
Gauss-Markov Theorem
Statement
Under the Gauss-Markov assumptions, the OLS estimator is BLUE: the Best Linear Unbiased Estimator. That is, for any other linear unbiased estimator $\tilde{\beta}$:

$$\operatorname{Var}(\tilde{\beta}_j) \geq \operatorname{Var}(\hat{\beta}_j) \quad \text{for every } j$$

Equivalently, $\operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta}_{\text{OLS}})$ is positive semidefinite.
Intuition
OLS is the most efficient use of the data among all linear unbiased methods. Any other linear unbiased estimator must have equal or larger variance for every coefficient. You cannot do better without either (a) introducing bias, (b) using a nonlinear estimator, or (c) exploiting additional knowledge about the errors.
Proof Sketch
Let $\tilde{\beta} = Cy$ be any linear unbiased estimator. Write $C = (X^\top X)^{-1} X^\top + D$ for some matrix $D$. Unbiasedness ($CX = I$) forces $DX = 0$. Compute the variance:

$$\operatorname{Var}(\tilde{\beta}) = \sigma^2 C C^\top = \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top$$

Since $\sigma^2 D D^\top$ is positive semidefinite, $\operatorname{Var}(\tilde{\beta}) \succeq \operatorname{Var}(\hat{\beta}_{\text{OLS}})$ in the matrix (Loewner) ordering. The cross term vanishes because $DX = 0$.
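The proof sketch can be checked numerically. The sketch below (a hedged illustration; the construction of $D$ via a projection is one convenient choice, not the only one) builds a competing linear unbiased estimator and verifies that its excess variance over OLS is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))
sigma2 = 1.0

# OLS weighting matrix A = (X'X)^{-1} X', so Var(A y) = sigma^2 (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T

# Build another linear unbiased estimator C = A + D with D X = 0:
# project a random matrix onto the orthogonal complement of col(X)
M = rng.normal(size=(p, n))
D = M @ (np.eye(n) - X @ XtX_inv @ X.T)    # D X = 0 by construction
C = A + D

# Unbiasedness: C X = I
assert np.allclose(C @ X, np.eye(p))

# Var(C y) = sigma^2 C C' = sigma^2 (X'X)^{-1} + sigma^2 D D' (cross term is zero)
diff = sigma2 * (C @ C.T) - sigma2 * XtX_inv
assert np.allclose(diff, sigma2 * D @ D.T)

# The excess variance is positive semidefinite: all eigenvalues >= 0
eigs = np.linalg.eigvalsh(diff)
assert eigs.min() >= -1e-10
```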
Why It Matters
This theorem justifies OLS as the default method for linear regression. If the Gauss-Markov assumptions hold, there is no reason to use any other linear unbiased estimator. This is why OLS is taught first and used most often.
Failure Mode
The theorem fails when: (1) errors are heteroscedastic ($\operatorname{Var}(\varepsilon_i) = \sigma_i^2$ varies with $i$), (2) errors are correlated ($\operatorname{Cov}(\varepsilon_i, \varepsilon_j) \neq 0$ for some $i \neq j$), or (3) you are willing to accept bias in exchange for lower variance (regularization). In all three cases, OLS is no longer optimal.
What BLUE Means
B = Best (minimum variance)
L = Linear (estimator is a linear function of $y$)
U = Unbiased ($\mathbb{E}[\hat{\beta}] = \beta$)
E = Estimator
The restriction to linear estimators is important. There may exist nonlinear unbiased estimators with lower variance. And biased estimators (like ridge regression) can have lower mean squared error by trading bias for variance.
When Assumptions Fail
Heteroscedasticity
If $\operatorname{Var}(\varepsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Use weighted least squares (WLS):

$$\hat{\beta}_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y$$

where $W = \operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2)$. WLS is BLUE under heteroscedastic errors.
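A short simulation sketch (the error-variance pattern and seed are illustrative assumptions) fits WLS and confirms that its theoretical coefficient variances are no larger than those of OLS under heteroscedasticity:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma_i = 0.5 * x                          # error sd grows with x: heteroscedastic
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma_i, size=n)

# WLS: beta = (X' W X)^{-1} X' W y with W = diag(1 / sigma_i^2)
W = np.diag(1.0 / sigma_i**2)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # still unbiased, less efficient

# Compare exact sampling variances: OLS's sandwich form vs the WLS form
Sigma = np.diag(sigma_i**2)
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
var_wls = np.linalg.inv(X.T @ W @ X)
assert np.all(np.diag(var_wls) <= np.diag(var_ols) + 1e-12)
```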
Correlated Errors
If $\operatorname{Cov}(\varepsilon) = \sigma^2 \Omega$ where $\Omega \neq I$, use generalized least squares (GLS):

$$\hat{\beta}_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$$

GLS is BLUE under the general covariance structure $\sigma^2 \Omega$. OLS ignores the correlations and is inefficient.
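A sketch under an assumed AR(1)-style correlation structure (the choice of $\Omega$ here is illustrative) computes the GLS estimator and checks that its variance never exceeds that of OLS:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])

# AR(1)-style correlation: Omega[i, j] = rho^{|i - j|}
rho = 0.7
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw correlated errors via the Cholesky factor of Omega
L = np.linalg.cholesky(Omega)
eps = L @ rng.normal(size=n)
y = X @ np.array([0.5, 0.1]) + eps

# GLS: beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# GLS is BLUE under Omega: its variance is never larger than OLS's
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv
var_gls = np.linalg.inv(X.T @ Oinv @ X)
assert np.all(np.diag(var_gls) <= np.diag(var_ols) + 1e-12)
```

In practice $\Omega$ is unknown and must be estimated (feasible GLS); this sketch assumes it is given.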
Biased Estimators
Ridge regression is biased, so Gauss-Markov does not apply. But ridge can have lower mean squared error than OLS, especially when $X^\top X$ is nearly singular (high multicollinearity).
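A small Monte Carlo sketch (the near-collinear design, penalty $\lambda = 1$, and replication count are illustrative assumptions) shows ridge beating OLS on estimation MSE when the design is nearly singular:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 2
z = rng.normal(size=n)
# Two nearly collinear columns -> X'X close to singular
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])
beta = np.array([1.0, 1.0])

lam = 1.0
reps = 500
mse_ols, mse_ridge = 0.0, 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse_ols += np.sum((b_ols - beta) ** 2)
    mse_ridge += np.sum((b_ridge - beta) ** 2)

print(mse_ridge / reps, mse_ols / reps)  # ridge's MSE is far smaller here
```

Ridge accepts a little bias in exchange for a large variance reduction, exactly the trade Gauss-Markov rules out by insisting on unbiasedness.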
Common Confusions
Gauss-Markov does NOT require Gaussian errors
The theorem says nothing about the distribution of $\varepsilon$ beyond its first two moments (zero mean, constant variance, uncorrelated). The errors can follow any distribution. Gaussian errors are needed for the F-test and t-test to have exact distributions, but not for Gauss-Markov itself.
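To make the point concrete, this sketch (a hedged illustration with uniform errors; any zero-mean, constant-variance, uncorrelated distribution would do) shows OLS remains unbiased with decidedly non-Gaussian errors:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0])

# Uniform errors on [-1, 1]: zero mean, constant variance, uncorrelated.
# The Gauss-Markov assumptions hold with no Gaussianity anywhere.
reps = 2000
est = np.zeros(p)
for _ in range(reps):
    eps = rng.uniform(-1.0, 1.0, size=n)
    y = X @ beta + eps
    est += np.linalg.solve(X.T @ X, X.T @ y)

print(est / reps)  # averaged estimates sit close to the true beta
```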
BLUE does not mean best among ALL estimators
OLS is best among linear unbiased estimators. Biased estimators (ridge, lasso) or nonlinear estimators can do better in terms of mean squared error. The theorem is about a restricted competition.
Unbiasedness is not always desirable
The bias-variance tradeoff shows that a small amount of bias can dramatically reduce variance, lowering overall MSE. Gauss-Markov optimizes for zero bias, which is the right goal for inference but not always for prediction.
Summary
- OLS is BLUE: best linear unbiased estimator under Gauss-Markov assumptions
- Assumptions: $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Cov}(\varepsilon) = \sigma^2 I$; no Gaussianity needed
- Heteroscedasticity breaks optimality; use WLS
- Correlated errors break optimality; use GLS
- Biased estimators (ridge) are outside the scope of the theorem but can beat OLS in MSE
- BLUE is about a restricted competition: linear and unbiased only
Exercises
Problem
State the three Gauss-Markov assumptions. For each, give an example of a real dataset where that assumption might be violated.
Problem
Prove that if $\tilde{\beta} = Cy$ is linear and unbiased for $\beta$, then $CX = I$. Start from the definition of unbiasedness.
Problem
Consider the James-Stein estimator, which shrinks OLS estimates toward zero. It is biased but has strictly lower total MSE than OLS when $p \geq 3$. How does this not contradict Gauss-Markov?
References
Canonical:
- Gauss (1821) and Markov (1912): the original results
- Greene, Econometric Analysis (2018), Chapter 4
- Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 3.2
Current:
- Hayashi, Econometrics (2000), Chapter 1: a clear modern treatment
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)