ML Methods
Linear Regression
Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.
Why This Matters
Linear regression is the most fundamental supervised learning method. Every idea in regression (projection, residuals, bias-variance, regularization) generalizes directly to more complex models. If you understand linear regression at the level of linear algebra (not just "fit a line"), you have the skeleton key for half of statistical learning.
Mental Model
You have data points in a high-dimensional space. The columns of your design matrix $X$ span a subspace. OLS finds the point in that subspace closest to your target vector $y$. This is an orthogonal projection. The residuals are the component of $y$ orthogonal to the column space of $X$.
Formal Setup
We observe $n$ input-output pairs $(x_i, y_i)$. Stack the inputs into a design matrix $X \in \mathbb{R}^{n \times p}$ and the responses into a vector $y \in \mathbb{R}^n$. We seek a weight vector $\beta \in \mathbb{R}^p$ minimizing the sum of squared residuals.
Ordinary Least Squares
The OLS estimator minimizes the squared loss:

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2$$

Setting the gradient to zero yields the normal equations:

$$X^\top X \beta = X^\top y$$

When $X^\top X$ is invertible, the closed-form solution is:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
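The closed-form solution is easy to check numerically. A minimal NumPy sketch on synthetic data (all variable names here are illustrative): solving the normal equations directly agrees with NumPy's least-squares solver, which uses a more numerically stable factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form via the normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, prefer lstsq: it avoids forming X^T X explicitly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Forming $X^\top X$ squares the condition number of the problem, which is why library solvers factorize $X$ itself instead.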
Hat Matrix
The hat matrix (or projection matrix) is:

$$H = X (X^\top X)^{-1} X^\top$$

It projects onto the column space of $X$: the fitted values are $\hat{y} = Hy$. The matrix "puts the hat on $y$."
Key properties: $H$ is symmetric and idempotent ($H^2 = H$), $\operatorname{tr}(H) = p$, and its eigenvalues are all 0 or 1.
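Each of these properties can be verified directly on a random design matrix; a short sketch (synthetic $X$, hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))

# H = X (X^T X)^{-1} X^T, computed without an explicit inverse
H = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(H, H.T)          # symmetric
assert np.allclose(H @ H, H)        # idempotent: projecting twice changes nothing
assert np.isclose(np.trace(H), p)   # trace equals the number of columns of X

# Eigenvalues of a projection are 0 or 1
eigvals = np.linalg.eigvalsh(H)
assert np.all((np.abs(eigvals) < 1e-8) | (np.abs(eigvals - 1) < 1e-8))
```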
Residuals
The residual vector is:

$$e = y - \hat{y} = (I - H)\,y$$

By the normal equations, $X^\top e = 0$: the residuals are orthogonal to every column of $X$. This is the geometric content of OLS.
The Projection Interpretation
The OLS solution has a direct geometric meaning. The column space of $X$, denoted $\operatorname{col}(X)$, is a $p$-dimensional subspace of $\mathbb{R}^n$. The fitted vector $\hat{y} = X\hat{\beta}$ must lie in $\operatorname{col}(X)$.
OLS finds the point in $\operatorname{col}(X)$ closest to $y$ in Euclidean distance. By the projection theorem in linear algebra, this is the orthogonal projection of $y$ onto $\operatorname{col}(X)$. The residual $e = y - \hat{y}$ is perpendicular to $\operatorname{col}(X)$, which is exactly the statement $X^\top e = 0$.
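The orthogonality statement is worth seeing numerically; a minimal sketch on random data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residual = y - X @ beta_hat

# The residual is orthogonal to every column of X: X^T e = 0
assert np.allclose(X.T @ residual, 0, atol=1e-8)
```

Note that this holds for any target $y$, even pure noise: orthogonality is a property of the projection, not of the data-generating process.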
Ridge Regression as Regularized OLS
Ridge Regression
Ridge regression adds an $\ell_2$ penalty:

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

The closed-form solution is:

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The addition of $\lambda I$ ensures invertibility and shrinks the estimate toward zero, trading bias for reduced variance.
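The shrinkage behavior follows directly from the closed form. A small sketch (synthetic data; `ridge` is a hypothetical helper, not a library function) showing the coefficient norm decreasing as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + rng.normal(size=60)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Larger lambda shrinks the coefficient vector toward zero
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
assert all(a >= b for a, b in zip(norms, norms[1:]))
```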
Main Theorems
Gauss-Markov Theorem
Statement
Under the linear model $y = X\beta + \varepsilon$, where $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 I$, the OLS estimator is the Best Linear Unbiased Estimator (BLUE). That is, among all unbiased estimators that are linear in $y$, OLS has the smallest variance (in the matrix sense):

$$\operatorname{Var}(\tilde{\beta}) \succeq \operatorname{Var}(\hat{\beta})$$

for any other linear unbiased estimator $\tilde{\beta}$.
Intuition
OLS is not just an unbiased linear estimator. It is the best one. You cannot reduce variance by using a different linear combination of $y$ without introducing bias. If you want lower variance, you must either accept bias (ridge regression) or use nonlinear methods.
Proof Sketch
Let $\tilde{\beta} = Cy$ be any linear unbiased estimator. Unbiasedness requires $CX = I$. Write $C = (X^\top X)^{-1} X^\top + D$, where $DX = 0$. Then

$$\operatorname{Var}(\tilde{\beta}) = \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top \succeq \sigma^2 (X^\top X)^{-1} = \operatorname{Var}(\hat{\beta}),$$

since $D D^\top$ is positive semidefinite and the cross terms vanish ($X^\top D^\top = (DX)^\top = 0$).
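The proof sketch can be mirrored with a Monte Carlo check. The construction below (synthetic data; `D` is built as $M(I - H)$ for a random $M$, which guarantees $DX = 0$) compares OLS against another linear unbiased estimator and confirms OLS has smaller total variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0])

A = np.linalg.solve(X.T @ X, X.T)              # OLS weights (X^T X)^{-1} X^T
H = X @ A
D = rng.normal(size=(p, n)) @ (np.eye(n) - H)  # D X = 0 by construction
C = A + 0.5 * D                                 # C X = I, so C y is unbiased too

ols_est, other_est = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(size=n)
    ols_est.append(A @ y)
    other_est.append(C @ y)

var_ols = np.var(ols_est, axis=0).sum()
var_other = np.var(other_est, axis=0).sum()
assert var_ols < var_other   # OLS wins, as Gauss-Markov predicts
```

Both estimators are unbiased (their means converge to `beta`); the only difference is the extra variance contributed by the $DD^\top$ term.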
Why It Matters
The Gauss-Markov theorem tells you exactly what you give up by regularizing. Ridge regression is biased, so it falls outside the Gauss-Markov scope. But the bias-variance tradeoff can still make it better in terms of mean squared error. The theorem defines the frontier of what unbiased linear estimation can achieve.
Failure Mode
If the errors are heteroscedastic or correlated (violating $\operatorname{Cov}(\varepsilon) = \sigma^2 I$), OLS is no longer BLUE. Generalized least squares (GLS) reclaims optimality by accounting for the error covariance structure.
Connection to Maximum Likelihood
Under the Gaussian noise model $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the log-likelihood is:

$$\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|_2^2$$

Maximizing over $\beta$ is equivalent to minimizing $\|y - X\beta\|_2^2$, which is exactly OLS. So OLS = MLE under Gaussian noise. This also means ridge regression corresponds to MAP estimation with a Gaussian prior on $\beta$.
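The equivalence can be checked by evaluating the Gaussian log-likelihood directly: at the MLE of $\sigma^2$, no perturbation of the OLS coefficients increases the likelihood. A sketch on synthetic data (the helper `gaussian_loglik` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=n)

def gaussian_loglik(beta, sigma2):
    """Log-likelihood of the model y = X beta + N(0, sigma2 I)."""
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (r @ r) / sigma2

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_mle = np.sum((y - X @ beta_ols) ** 2) / n   # MLE divides by n, not n - p

# Any perturbation of beta strictly lowers the likelihood
for _ in range(100):
    perturbed = beta_ols + 0.1 * rng.normal(size=p)
    assert gaussian_loglik(perturbed, sigma2_mle) < gaussian_loglik(beta_ols, sigma2_mle)
```

Note the MLE of $\sigma^2$ divides by $n$; the usual unbiased variance estimate divides by $n - p$ instead.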
Canonical Examples
Simple linear regression in 2D
With a single feature and intercept, the design matrix is $X = [\mathbf{1},\, x]$, where $x$ is the feature vector. The normal equations give the familiar slope and intercept:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The hat matrix has diagonal entries $h_{ii}$ called leverages. Points with high leverage (far from $\bar{x}$) have outsized influence on the fit.
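The textbook formulas above agree with the general matrix solution; a short sketch on synthetic data that also computes the leverages:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(size=50)

# Textbook slope/intercept formulas
xbar, ybar = x.mean(), y.mean()
slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
intercept = ybar - slope * xbar

# Same answer from the general least-squares solver
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose([intercept, slope], [b0, b1])

# Leverages are the diagonal of the hat matrix; they sum to p = 2
leverages = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
assert np.isclose(leverages.sum(), 2.0)
```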
Polynomial regression as linear regression
Fitting a polynomial $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d$ is still linear regression. The design matrix has columns $1, x, x^2, \ldots, x^d$. The model is linear in the parameters, not the features. This is why the term "linear" in linear regression refers to linearity in $\beta$, not in $x$.
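Concretely, building the polynomial design matrix reduces the problem to ordinary least squares; a sketch fitting a cubic to synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=80)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=80)

# Design matrix with columns 1, x, x^2, x^3: still linear in the coefficients
X = np.vander(x, N=4, increasing=True)
coef = np.linalg.lstsq(X, y, rcond=None)[0]

# Recovers (beta_0, beta_1, beta_2, beta_3) approx (1, -2, 0, 0.5)
assert np.allclose(coef, [1.0, -2.0, 0.0, 0.5], atol=0.2)
```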
Common Confusions
OLS minimizes squared residuals, not perpendicular distances
A frequent misconception is that OLS minimizes the perpendicular (orthogonal) distance from each point to the regression line. It does not. OLS minimizes the vertical distances (the residuals in $y$). Minimizing perpendicular distances gives total least squares (or orthogonal regression), which is a different estimator. The distinction matters when both $x$ and $y$ have measurement error.
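The two estimators can be computed side by side. A sketch on synthetic data, using the standard construction of the TLS line as the top principal direction of the centered data (via SVD):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(size=300)   # true slope 2, noise in y only

xc, yc = x - x.mean(), y - y.mean()

# OLS: minimizes vertical distances
ols_slope = np.sum(xc * yc) / np.sum(xc * xc)

# Total least squares: minimizes perpendicular distances; the line
# direction is the top right singular vector of the centered data
_, _, Vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
tls_slope = Vt[0, 1] / Vt[0, 0]
```

With noise in $y$ only, OLS is the appropriate estimator; the TLS slope differs because it attributes error to both coordinates.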
Invertibility of X^T X is not guaranteed
The normal equations require $X^\top X$ to be invertible, which happens exactly when $X$ has full column rank ($\operatorname{rank}(X) = p$). This fails when features are linearly dependent or when $p > n$. Ridge regression fixes this: $X^\top X + \lambda I$ is always invertible for $\lambda > 0$.
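The failure mode and the ridge fix are easy to demonstrate with a deliberately dependent column (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                        # linearly dependent column
X = np.column_stack([x1, x2])
y = rng.normal(size=n)

# X^T X is singular: the normal equations have no unique solution
assert np.linalg.matrix_rank(X.T @ X) < 2

# Ridge restores invertibility for any lambda > 0
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
assert np.all(np.isfinite(beta_ridge))
```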
Summary
- The normal equations $X^\top X \beta = X^\top y$ are the first-order optimality conditions for least squares
- OLS is an orthogonal projection of $y$ onto the column space of $X$
- The hat matrix $H = X(X^\top X)^{-1}X^\top$ maps $y$ to the fitted values $\hat{y}$
- Gauss-Markov: OLS is BLUE under homoscedastic, uncorrelated errors
- OLS = MLE under Gaussian noise; ridge = MAP with a Gaussian prior
- Residuals are orthogonal to every predictor: $X^\top e = 0$
Exercises
Problem
Show that the residual vector $e = y - X\hat{\beta}$ satisfies $X^\top e = 0$. What does this mean geometrically?
Problem
For ridge regression with penalty $\lambda \|\beta\|_2^2$, show that the solution can be written as $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. What happens as $\lambda \to 0$ and as $\lambda \to \infty$?
Problem
Prove that if $y = X\beta + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then the MLE for $\beta$ is exactly $\hat{\beta}_{\text{OLS}}$. What is the MLE for $\sigma^2$?
References
Canonical:
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 3
- Seber & Lee, Linear Regression Analysis (2003), Chapters 3-4
- Wasserman, All of Statistics (2004), Chapter 13
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 11
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
The natural next steps from linear regression:
- Ridge regression: what happens when you trade bias for variance
- Logistic regression: extending the linear model to classification
- Bias-variance tradeoff: the general principle behind regularization
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)
Builds on This
- Cubist and Model Trees (Layer 2)
- Data Preprocessing and Feature Engineering (Layer 1)
- Gauss-Markov Theorem (Layer 2)
- Generalized Additive Models (Layer 2)
- Implicit Bias and Modern Generalization (Layer 4)
- Lasso Regression (Layer 2)
- Longitudinal Surveys and Panel Data (Layer 3)
- MARS (Multivariate Adaptive Regression Splines) (Layer 2)
- Ridge Regression (Layer 2)
- Small Area Estimation (Layer 3)
- Time Series Forecasting Basics (Layer 2)