
Foundations

Matrix Operations and Properties

Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.


Why This Matters

Matrices are the language of ML. Weight matrices, covariance matrices, kernel matrices, Hessians. They are everywhere. You need to know what operations you can perform on them, what those operations mean, and when they are numerically safe.

This page covers the operations that appear repeatedly in ML theory and practice.

Mental Model

Think of a matrix as a linear map that transforms vectors. The properties of that map (how it scales space, whether it is invertible, how its behavior is summarized by its eigenvalues) are captured by operations like trace, determinant, and inverse. Each operation answers a different question about the map.

Trace

Definition

Trace

The trace of a square matrix $A \in \mathbb{R}^{n \times n}$ is the sum of its diagonal elements:

$$\text{tr}(A) = \sum_{i=1}^{n} A_{ii}$$

Equivalently, the trace equals the sum of eigenvalues: $\text{tr}(A) = \sum_{i=1}^{n} \lambda_i$.

Key properties of the trace:

  • Linearity: $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$ and $\text{tr}(cA) = c \cdot \text{tr}(A)$
  • Cyclic property: $\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)$
  • Transpose invariance: $\text{tr}(A) = \text{tr}(A^T)$

The cyclic property is extremely useful. It lets you rearrange matrix products inside a trace, which simplifies many derivations in ML (e.g., computing gradients of matrix expressions).
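These properties are easy to confirm with NumPy (a minimal sketch using random matrices with a fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Trace equals the sum of eigenvalues (take the real part: A is not
# symmetric, so eigenvalues may come in complex-conjugate pairs).
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)

# Linearity and transpose invariance.
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))
assert np.isclose(np.trace(A), np.trace(A.T))

# Cyclic property: tr(ABC) = tr(CAB) = tr(BCA).
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)
assert np.isclose(t1, t2) and np.isclose(t1, t3)
```

Note that the cyclic property permits only rotations of the factors, not arbitrary permutations: $\text{tr}(ABC) \neq \text{tr}(ACB)$ in general.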

Determinant

Definition

Determinant

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ equals the product of its eigenvalues:

$$\det(A) = \prod_{i=1}^{n} \lambda_i$$

Geometrically, $|\det(A)|$ measures the factor by which $A$ scales volumes. If $\det(A) = 0$, the matrix is singular (not invertible).

Key properties:

  • $\det(AB) = \det(A)\det(B)$
  • $\det(A^{-1}) = 1/\det(A)$
  • $\det(A^T) = \det(A)$
  • $\det(cA) = c^n \det(A)$ for $A \in \mathbb{R}^{n \times n}$

In ML, determinants appear in Gaussian distributions (the normalization constant involves $\det(\Sigma)$), in volume arguments in information theory, and in Bayesian model selection.
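Each of the properties above can be verified numerically (a NumPy sketch with a random matrix; the scalar 2.5 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
c = 2.5

# Multiplicativity, inverse, transpose, and scaling properties.
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
assert np.isclose(np.linalg.det(c * A), c**n * np.linalg.det(A))

# det(A) equals the product of eigenvalues (real part: complex
# eigenvalues occur in conjugate pairs, so the product is real).
assert np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A)).real)
```

The scaling rule $\det(cA) = c^n \det(A)$ is a common gotcha: scaling every entry by $c$ scales the volume by $c^n$, not $c$.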

Matrix Inverse

Definition

Matrix Inverse

For a square matrix $A \in \mathbb{R}^{n \times n}$, the inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. It exists if and only if $\det(A) \neq 0$ (equivalently, all eigenvalues are nonzero).

Key identities:

  • $(AB)^{-1} = B^{-1}A^{-1}$ (note the reversed order)
  • $(A^T)^{-1} = (A^{-1})^T$

When Inversion is Dangerous

Definition

Condition Number

The condition number of a matrix is:

$$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. A large condition number means the matrix is nearly singular and inversion is numerically unstable.

Rule of thumb: if $\kappa(A) \approx 10^k$, you lose about $k$ digits of accuracy when solving $Ax = b$ by inversion. For double precision (about 16 digits), $\kappa(A) > 10^{12}$ is dangerous.

In practice, avoid explicit matrix inversion. Use factorizations (Cholesky, LU, QR) to solve linear systems instead.
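A short NumPy illustration of both points (a random Gaussian matrix, which is modestly conditioned, so the two routes agree here; on ill-conditioned matrices they would not):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Condition number from singular values: sigma_max / sigma_min.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(s[0] / s[-1], np.linalg.cond(A))

# Prefer a solver over explicit inversion.
x_solve = np.linalg.solve(A, b)   # LU factorization under the hood
x_inv = np.linalg.inv(A) @ b      # slower and less stable; avoid
assert np.allclose(x_solve, x_inv)
```

`np.linalg.solve` factors $A$ once and back-substitutes, which is both faster and more accurate than forming $A^{-1}$ and multiplying.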

Moore-Penrose Pseudoinverse

Definition

Moore-Penrose Pseudoinverse

The Moore-Penrose pseudoinverse $A^+$ of a matrix $A \in \mathbb{R}^{m \times n}$ is the unique matrix satisfying four conditions: (1) $AA^+A = A$, (2) $A^+AA^+ = A^+$, (3) $(AA^+)^T = AA^+$, (4) $(A^+A)^T = A^+A$.

For full column rank ($\text{rank}(A) = n \leq m$):

$$A^+ = (A^T A)^{-1} A^T$$

For full row rank ($\text{rank}(A) = m \leq n$):

$$A^+ = A^T (A A^T)^{-1}$$

The pseudoinverse gives the least-squares solution to $Ax = b$ when $A$ is not square or not invertible: $x^+ = A^+ b$ minimizes $\|Ax - b\|_2$.

This is exactly what happens in linear regression: when $X$ has full column rank, the OLS solution $\hat{\beta} = (X^T X)^{-1} X^T y$ is $X^+ y$.
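A quick NumPy check on synthetic data (a sketch; the tall random design matrix has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 3
X = rng.standard_normal((m, n))   # tall: full column rank almost surely
y = rng.standard_normal(m)

# Three equivalent routes to the least-squares solution.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
beta_pinv = np.linalg.pinv(X) @ y                  # pseudoinverse (via SVD)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # dedicated solver

assert np.allclose(beta_normal, beta_pinv)
assert np.allclose(beta_pinv, beta_lstsq)
```

In practice `np.linalg.lstsq` (or a QR-based solver) is preferred: forming $X^T X$ squares the condition number, and `pinv` also handles rank-deficient $X$ gracefully.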

Transpose and Adjoint

The transpose $A^T$ swaps rows and columns: $(A^T)_{ij} = A_{ji}$.

The conjugate transpose (adjoint) $A^*$ also conjugates complex entries: $(A^*)_{ij} = \overline{A_{ji}}$. For real matrices, $A^* = A^T$.

A matrix is symmetric if $A = A^T$. A matrix is orthogonal if $A^T A = I$. These properties simplify many computations. Symmetric matrices have real eigenvalues, and orthogonal matrices preserve lengths. A symmetric matrix with all nonnegative eigenvalues is positive semidefinite.

Schur Complement

Definition

Schur Complement

Given a block matrix:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$

If $D$ is invertible, the Schur complement of $D$ in $M$ is:

$$M/D = A - B D^{-1} C$$

The determinant factors as $\det(M) = \det(D) \cdot \det(M/D)$.

The Schur complement appears in:

  • Gaussian conditioning (deriving conditional distributions from joint)
  • Block matrix inversion
  • Optimization (eliminating variables in quadratic forms)
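The determinant factorization above can be checked numerically (a minimal NumPy sketch; the diagonal shift on $D$ is only an assumption to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 3, 2
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)  # shift keeps D invertible

# Assemble the block matrix and the Schur complement of D.
M = np.block([[A, B], [C, D]])
schur = A - B @ np.linalg.inv(D) @ C             # M/D

# det(M) = det(D) * det(M/D)
assert np.isclose(np.linalg.det(M), np.linalg.det(D) * np.linalg.det(schur))
```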

Sherman-Morrison-Woodbury Formula

Theorem

Sherman-Morrison-Woodbury Formula

Statement

If $A \in \mathbb{R}^{n \times n}$ is invertible, $U \in \mathbb{R}^{n \times k}$, $C \in \mathbb{R}^{k \times k}$ is invertible, and $V \in \mathbb{R}^{k \times n}$, then:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$$

Intuition

When you add a low-rank update $UCV$ to a matrix $A$ whose inverse you already know, you can compute the new inverse by solving a smaller $k \times k$ system instead of a full $n \times n$ inversion. This is a huge saving when $k \ll n$.

Proof Sketch

Multiply $(A + UCV)$ by the proposed right-hand side and verify you get $I$. This is a direct algebraic verification: expand the product and simplify using $AA^{-1} = I$ and $CC^{-1} = I$.

Why It Matters

This formula appears throughout ML: online learning (rank-1 updates to covariance matrices), Kalman filters, Gaussian process inference with structured kernels, and Bayesian linear regression. Any time you have $A^{-1}$ and need $(A + \text{low-rank})^{-1}$, use this formula.

Failure Mode

The formula requires both $A$ and $C^{-1} + VA^{-1}U$ to be invertible. If either is singular (or nearly so), the formula is inapplicable or numerically unstable.

The special case with $k=1$ (rank-1 update) is the Sherman-Morrison formula:

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}$$
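The rank-1 formula can be verified directly against a from-scratch inverse (a NumPy sketch; the diagonal shift is just an assumption to keep $A$ safely invertible). The update costs $O(n^2)$ instead of the $O(n^3)$ of a fresh inversion:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)  # assume this is already known

# Sherman-Morrison rank-1 update: only matrix-vector products, O(n^2).
Au = A_inv @ u
vA = v @ A_inv
updated_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Agrees with recomputing the inverse from scratch, O(n^3).
assert np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v)))
```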

Common Confusions

Watch Out

Trace and determinant serve different purposes

The trace sums eigenvalues; the determinant multiplies them. A matrix can have large trace but zero determinant (e.g., $\text{diag}(100, 0)$). The trace tells you about the "total magnitude" of eigenvalues; the determinant tells you whether the matrix is invertible and how it scales volume.

Watch Out

Never invert matrices explicitly

In nearly all practical ML code, you should solve $Ax = b$ using a linear system solver (e.g., np.linalg.solve), not by computing $A^{-1}$ and then multiplying. Direct inversion is slower, less numerically stable, and rarely necessary.

Summary

  • $\text{tr}(A)$ = sum of eigenvalues; use the cyclic property freely
  • $\det(A)$ = product of eigenvalues; zero means singular
  • The pseudoinverse $A^+$ gives the least-squares solution to $Ax = b$
  • Condition number $\kappa(A)$ measures how dangerous inversion is
  • Schur complements factor block matrices and appear in Gaussian conditioning
  • Sherman-Morrison-Woodbury turns rank-$k$ updates into $k \times k$ solves
  • In practice, solve linear systems instead of inverting matrices

Exercises

Exercise (Core)

Problem

Let $A = \text{diag}(3, 5, 7)$. Compute $\text{tr}(A)$, $\det(A)$, $A^{-1}$, and $\kappa(A)$ (using the 2-norm).

Exercise (Core)

Problem

Show that $\text{tr}(AB) = \text{tr}(BA)$ for any matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$.

Exercise (Advanced)

Problem

You have computed $A^{-1}$ for a $1000 \times 1000$ matrix. A new data point arrives, requiring you to compute $(A + uv^T)^{-1}$ where $u, v \in \mathbb{R}^{1000}$. What is the computational cost using Sherman-Morrison versus recomputing the inverse from scratch?

References

Canonical:

  • Strang, Introduction to Linear Algebra (2016), Chapters 2, 5, 6
  • Horn & Johnson, Matrix Analysis (2013), Chapters 0-1
  • Golub & Van Loan, Matrix Computations (2013), Chapters 2-3 (matrix operations and LU/QR factorization)

Current:

  • Petersen & Pedersen, The Matrix Cookbook (2012). Essential reference for matrix identities.
  • Axler, Linear Algebra Done Right (2024), Chapters 3-4 (linear maps, invertibility, determinants)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 4 (matrix decompositions)

Last reviewed: April 2026
