

Linear Regression

Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.


Why This Matters

[Figure: scatter of data with the fitted line y = 0.7x + 1.5; vertical segments mark the residuals eᵢ = yᵢ - (wxᵢ + b). OLS minimizes the sum of squared residuals.]

Linear regression is the most fundamental supervised learning method. Every idea in regression (projection, residuals, bias-variance, regularization) generalizes directly to more complex models. If you understand linear regression at the level of linear algebra (not just "fit a line"), you have the skeleton key for half of statistical learning.

Mental Model

You have data points in a high-dimensional space. The columns of your design matrix X span a subspace. OLS finds the point in that subspace closest to your target vector y. This is an orthogonal projection. The residuals are the component of y orthogonal to the column space of X.

Formal Setup

We observe n input-output pairs. Stack the inputs into a design matrix X \in \mathbb{R}^{n \times d} and responses into y \in \mathbb{R}^n. We seek a weight vector w \in \mathbb{R}^d minimizing the sum of squared residuals.

Definition

Ordinary Least Squares

The OLS estimator minimizes the squared loss:

\hat{w}_{\text{OLS}} = \arg\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^2

Setting the gradient to zero yields the normal equations:

X^\top X \hat{w} = X^\top y

When X^\top X is invertible, the closed-form solution is:

\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y
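
As a quick sanity check, the closed form can be computed directly in NumPy on synthetic data; a minimal sketch (the sizes, seed, and true weights below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: solve X^T X w = X^T y (prefer solve over an explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's numerically stabler least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice, `lstsq` (or a QR decomposition) is preferred over forming X^\top X, which squares the condition number.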

Definition

Hat Matrix

The hat matrix (or projection matrix) is:

H = X(X^\top X)^{-1} X^\top

It projects y onto the column space of X: the fitted values are \hat{y} = Hy. The matrix "puts the hat on y."

Key properties: H is symmetric and idempotent (H^2 = H), \text{tr}(H) = d, and its eigenvalues are all 0 or 1.
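
These properties are easy to verify numerically; a minimal sketch on a random design matrix (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)

sym_err = np.abs(H - H.T).max()       # H is symmetric
idem_err = np.abs(H @ H - H).max()    # H is idempotent: H^2 = H
trace = np.trace(H)                   # tr(H) = d
eigs = np.linalg.eigvalsh(H)          # eigenvalues are all 0 or 1
```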

Definition

Residuals

The residual vector is:

e = y - \hat{y} = (I - H)y

By the normal equations, X^\top e = 0: the residuals are orthogonal to every column of X. This is the geometric content of OLS.

The Projection Interpretation

The OLS solution has a direct geometric meaning. The column space of X, denoted \text{col}(X), is a d-dimensional subspace of \mathbb{R}^n. The fitted vector \hat{y} = Xw must lie in \text{col}(X).

OLS finds the point in \text{col}(X) closest to y in Euclidean distance. By the projection theorem in linear algebra, this is the orthogonal projection of y onto \text{col}(X). The residual e = y - \hat{y} is perpendicular to \text{col}(X), which is exactly the statement X^\top e = 0.
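
One consequence of the orthogonality X^\top e = 0 is the Pythagorean decomposition \|y\|^2 = \|\hat{y}\|^2 + \|e\|^2, since \hat{y} \perp e. A minimal sketch checking both facts on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 80, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ w_hat
e = y - y_fit

ortho = np.abs(X.T @ e).max()   # X^T e = 0: residuals orthogonal to col(X)

# Pythagoras: ||y||^2 = ||y_fit||^2 + ||e||^2
lhs = y @ y
rhs = y_fit @ y_fit + e @ e
```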

Ridge Regression as Regularized OLS

Definition

Ridge Regression

Ridge regression adds an \ell_2 penalty:

\hat{w}_{\text{ridge}} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2

The closed-form solution is:

\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y

The addition of \lambda I ensures invertibility and shrinks the estimate toward zero, trading bias for reduced variance.
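
A minimal sketch of the ridge closed form on synthetic data, also probing the \lambda \to 0 and \lambda \to \infty limits (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def ridge(lam):
    # Closed form: (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_small = ridge(1e-8)   # lam -> 0 recovers OLS
w_large = ridge(1e8)    # lam -> infinity shrinks toward zero
w_mid = ridge(10.0)     # intermediate lam: strictly smaller norm than OLS
```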

Main Theorems

Theorem

Gauss-Markov Theorem

Statement

Under the linear model y = Xw + \varepsilon where \mathbb{E}[\varepsilon] = 0 and \text{Var}(\varepsilon) = \sigma^2 I, the OLS estimator \hat{w}_{\text{OLS}} is the Best Linear Unbiased Estimator (BLUE). That is, among all unbiased estimators that are linear in y, OLS has the smallest variance (in the matrix sense):

\text{Var}(\tilde{w}) - \text{Var}(\hat{w}_{\text{OLS}}) \succeq 0

for any other linear unbiased estimator \tilde{w}.

Intuition

OLS is not just an unbiased linear estimator. It is the best one. You cannot reduce variance by using a different linear combination of y without introducing bias. If you want lower variance, you must either accept bias (ridge regression) or use nonlinear methods.

Proof Sketch

Let \tilde{w} = Cy be any linear unbiased estimator. Unbiasedness requires CX = I. Write C = (X^\top X)^{-1}X^\top + D, where DX = 0 follows from CX = I. The cross terms in the variance vanish precisely because DX = 0, so \text{Var}(\tilde{w}) = \sigma^2(X^\top X)^{-1} + \sigma^2 DD^\top \succeq \sigma^2(X^\top X)^{-1} = \text{Var}(\hat{w}_{\text{OLS}}).
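
The construction in the proof sketch can be instantiated numerically: build an alternative unbiased estimator C = (X^\top X)^{-1}X^\top + D with DX = 0 by projecting an arbitrary matrix onto the orthogonal complement of \text{col}(X), then compare exact covariances (all sizes, seeds, and scale factors below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 2
X = rng.normal(size=(n, d))
sigma = 1.0

C_ols = np.linalg.inv(X.T @ X) @ X.T   # OLS as a linear map: w_hat = C_ols @ y
H = X @ C_ols                          # hat matrix

# D X = 0 by construction: project an arbitrary matrix onto (I - H),
# so C_alt is also linear and unbiased (C_alt @ X = I)
D = 0.3 * rng.normal(size=(d, n)) @ (np.eye(n) - H)
C_alt = C_ols + D

# Exact covariances under Var(eps) = sigma^2 I: Var(C y) = sigma^2 C C^T
var_ols = sigma**2 * C_ols @ C_ols.T
var_alt = sigma**2 * C_alt @ C_alt.T

# Gauss-Markov: var_alt - var_ols is positive semidefinite
gap_eigs = np.linalg.eigvalsh(var_alt - var_ols)
```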

Why It Matters

The Gauss-Markov theorem tells you exactly what you give up by regularizing. Ridge regression is biased, so it falls outside the Gauss-Markov scope. But the bias-variance tradeoff can still make it better in terms of mean squared error. The theorem defines the frontier of what unbiased linear estimation can achieve.

Failure Mode

If errors are heteroscedastic or correlated (violating \text{Var}(\varepsilon) = \sigma^2 I), OLS is no longer BLUE. Generalized least squares (GLS) reclaims optimality by accounting for the error covariance structure.

Connection to Maximum Likelihood

Under the Gaussian noise model y = Xw + \varepsilon with \varepsilon \sim \mathcal{N}(0, \sigma^2 I), the log-likelihood is:

\log p(y \mid X, w, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - Xw\|_2^2

Maximizing over w is equivalent to minimizing \|y - Xw\|_2^2, which is exactly OLS. So OLS = MLE under Gaussian noise. This also means ridge regression corresponds to MAP estimation with a Gaussian prior on w.
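
A small numerical check that, for fixed \sigma^2, the Gaussian negative log-likelihood is minimized at the OLS weights, and that the MLE for \sigma^2 plugs in the OLS residuals as \|e\|_2^2 / n (data, seed, and true weights are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)  # true sigma = 1

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def nll(w, sigma2):
    # Gaussian negative log-likelihood of the linear model
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * (r @ r) / sigma2

# For any fixed sigma^2, the NLL is minimized exactly at the OLS weights
base = nll(w_ols, 1.0)
perturbed = min(nll(w_ols + 0.1 * rng.normal(size=d), 1.0) for _ in range(200))

# MLE for sigma^2: ||e||^2 / n with the OLS residuals
e = y - X @ w_ols
sigma2_mle = (e @ e) / n
```

Note sigma2_mle divides by n, not n - d, so it is slightly biased downward; the unbiased variance estimate uses n - d.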

Canonical Examples

Example

Simple linear regression in 2D

With a single feature and intercept, X = [\mathbf{1}, x] where x is the feature vector. The normal equations give the familiar slope and intercept:

\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The hat matrix H has diagonal entries h_{ii} called leverages. Points with high leverage (far from \bar{x}) have outsized influence on the fit.
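
The textbook formulas and the design-matrix solution can be compared directly; the sketch below also computes the leverages h_{ii} (the data is synthetic, loosely echoing the figure's line y = 0.7x + 1.5):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 6.0, size=40)
y = 0.7 * x + 1.5 + 0.2 * rng.normal(size=40)

# Textbook slope and intercept formulas
xb, yb = x.mean(), y.mean()
beta1 = np.sum((x - xb) * (y - yb)) / np.sum((x - xb) ** 2)
beta0 = yb - beta1 * xb

# Same fit from the design matrix X = [1, x]
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Leverages h_ii: sum to d = 2, largest for points farthest from x-bar
H = X @ np.linalg.solve(X.T @ X, X.T)
lev = np.diag(H)
```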

Example

Polynomial regression as linear regression

Fitting a polynomial y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots is still linear regression. The design matrix has columns [1, x, x^2, \ldots]. The model is linear in the parameters, not the features. This is why the term "linear" in linear regression refers to linearity in w, not in x.
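
A minimal sketch of quadratic regression via an explicit design matrix [1, x, x^2] (true coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.normal(size=50)

# Columns [1, x, x^2]: nonlinear in x, but linear in the parameters,
# so ordinary least squares applies unchanged
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```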

Common Confusions

Watch Out

OLS minimizes squared residuals, not perpendicular distances

A frequent misconception is that OLS minimizes the perpendicular (orthogonal) distance from each point to the regression line. It does not. OLS minimizes the vertical distances (residuals in y). Minimizing perpendicular distances gives total least squares (or orthogonal regression), which is a different estimator. The distinction matters when both x and y have measurement error.
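
The two estimators can be compared concretely: OLS from the usual slope formula, and total least squares via the SVD of the centered data matrix, whose last right-singular vector is the fitted line's normal direction (a standard construction, not from the text; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)   # noise in y only, true slope 1

# OLS: minimizes vertical residuals
xc, yc = x - x.mean(), y - y.mean()
ols_slope = np.sum(xc * yc) / np.sum(xc * xc)

# Total least squares: minimizes perpendicular distances
_, _, Vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
a, b = Vt[-1]              # normal vector of the TLS line
tls_slope = -a / b
```

With positive correlation the TLS slope always exceeds the OLS slope (they coincide only for a perfect fit), which is one way to see they are genuinely different estimators.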

Watch Out

Invertibility of X^T X is not guaranteed

The normal equations require X^\top X to be invertible, which happens when X has full column rank (\text{rank}(X) = d). This fails when features are linearly dependent or when n < d. Ridge regression fixes this: X^\top X + \lambda I is always invertible for \lambda > 0.
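
A minimal sketch of the failure and the ridge fix, using an exactly collinear design matrix (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x1 = rng.normal(size=n)
# Second column is an exact multiple of the first: rank(X) = 2 < d = 3
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=n)])
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X.T @ X)   # singular: rank 2, not 3

# Ridge restores invertibility: X^T X + lam I is positive definite for lam > 0
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```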

Summary

  • The normal equations X^\top X w = X^\top y are the first-order optimality conditions for least squares
  • OLS is an orthogonal projection of y onto the column space of X
  • The hat matrix H = X(X^\top X)^{-1}X^\top maps y to fitted values
  • Gauss-Markov: OLS is BLUE under homoscedastic, uncorrelated errors
  • OLS = MLE under Gaussian noise; ridge = MAP with a Gaussian prior
  • Residuals are orthogonal to every predictor: X^\top e = 0

Exercises

ExerciseCore

Problem

Show that the residual vector e = y - X\hat{w} satisfies X^\top e = 0. What does this mean geometrically?

ExerciseCore

Problem

For ridge regression with penalty \lambda, show that the solution can be written as \hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. What happens as \lambda \to 0 and \lambda \to \infty?

ExerciseAdvanced

Problem

Prove that if \varepsilon \sim \mathcal{N}(0, \sigma^2 I) and y = Xw + \varepsilon, then the MLE for w is exactly \hat{w}_{\text{OLS}}. What is the MLE for \sigma^2?


References

Canonical:

  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 3
  • Seber & Lee, Linear Regression Analysis (2003), Chapters 3-4
  • Wasserman, All of Statistics (2004), Chapter 13

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 3

Last reviewed: April 2026
