Calculus Objects
Matrix Calculus
The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.
Why This Matters
Every time you derive a gradient for a machine learning model, whether it is linear regression, a Gaussian model, or a neural network, you are doing matrix calculus. Most ML papers skip the derivation steps, writing "taking the derivative and setting to zero" as if the reader already knows how to differentiate with respect to a vector or a matrix.
This topic gives you the identities that make those derivations mechanical. Once you know a handful of rules, you can derive gradients for any model whose loss involves traces, determinants, and quadratic forms.
Mental Model
Matrix calculus extends scalar calculus to matrices the same way vector calculus extends it to vectors. The key insight: every matrix expression is really a scalar function of many variables (the matrix entries). We package the partial derivatives into matrices of the same shape.
The derivative of a scalar with respect to a matrix is a matrix. The derivative of a scalar with respect to a vector is a vector. These are the gradients you need for optimization.
Formal Setup and Notation
Let $f(\mathbf{X})$ be a scalar-valued function of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$. The derivative of $f$ with respect to $\mathbf{X}$ is the $m \times n$ matrix of partial derivatives:
$$\frac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial f}{\partial X_{11}} & \cdots & \frac{\partial f}{\partial X_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{m1}} & \cdots & \frac{\partial f}{\partial X_{mn}} \end{bmatrix}$$
Denominator Layout Convention
In the denominator layout (also called the gradient formulation), the derivative has the same shape as the denominator $\mathbf{x}$. For a scalar $y$ and a column vector $\mathbf{x}$, $\partial y/\partial \mathbf{x}$ is a column vector. Most ML and optimization textbooks use this convention because the gradient then has the same shape as the variable and points in the direction of steepest ascent.
Numerator Layout Convention
In the numerator layout (also called the Jacobian formulation), the derivative is laid out according to the numerator. For a scalar $y$ and a column vector $\mathbf{x}$, $\partial y/\partial \mathbf{x}$ is a row vector. Many statistics textbooks use this convention.
This topic uses denominator layout throughout: every gradient has the same shape as the variable it is taken with respect to.
Core Identities
These are the identities you need to know. In all cases, $\mathbf{A}$ is a constant matrix and $\mathbf{a}$ a constant vector of appropriate dimensions, $\mathbf{x}$ is a variable vector, and $\mathbf{X}$ is a variable matrix.
Vector identities:
$$\frac{\partial (\mathbf{a}^\top \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a} \qquad \frac{\partial (\mathbf{x}^\top \mathbf{A} \mathbf{x})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$$
If $\mathbf{A}$ is symmetric, this simplifies to $2\mathbf{A}\mathbf{x}$.
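These identities are easy to verify numerically with finite differences. A minimal sketch, assuming NumPy is available (all variable names here are illustrative):

```python
# Finite-difference check of the quadratic-form identity
# d(x^T A x)/dx = (A + A^T) x.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # general (non-symmetric) constant matrix
x = rng.standard_normal(4)

f = lambda v: v @ A @ v           # scalar quadratic form v^T A v
analytic = (A + A.T) @ x          # the identity's prediction

eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)  # central difference per entry
    for e in np.eye(4)
])
print(np.max(np.abs(numeric - analytic)))
```

The same pattern (perturb one entry, take a central difference, compare) verifies any identity in this section.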
Trace identities:
$$\frac{\partial}{\partial \mathbf{X}}\operatorname{tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^\top \qquad \frac{\partial}{\partial \mathbf{X}}\operatorname{tr}(\mathbf{X}^\top \mathbf{A}\mathbf{X}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{X}$$
Log-determinant:
$$\frac{\partial}{\partial \mathbf{X}} \log\det \mathbf{X} = \mathbf{X}^{-\top}$$
If $\mathbf{X}$ is symmetric, this is $\mathbf{X}^{-1}$.
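The log-determinant identity can be checked the same way, perturbing one entry of $\mathbf{X}$ at a time. A sketch assuming NumPy (`slogdet` gives a numerically stable log-determinant):

```python
# Entrywise finite-difference check that d(log det X)/dX = X^{-T}.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
X = B @ B.T + 3 * np.eye(3)              # symmetric positive definite

f = lambda M: np.linalg.slogdet(M)[1]    # log|det M|
analytic = np.linalg.inv(X).T            # X^{-T} (equals X^{-1} here, X symmetric)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
print(np.max(np.abs(numeric - analytic)))
```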
Inverse (stated as a differential, since the derivative of a matrix with respect to a matrix does not fit in a single matrix):
$$d(\mathbf{X}^{-1}) = -\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1}$$
The Chain Rule for Matrix Expressions
Matrix Chain Rule via Differentials
To differentiate complex matrix expressions, use differentials. The differential of a scalar function $f(\mathbf{X})$ is:
$$df = \operatorname{tr}\left(\left(\frac{\partial f}{\partial \mathbf{X}}\right)^\top d\mathbf{X}\right)$$
The procedure: (1) compute the differential using rules like $d(\mathbf{X}\mathbf{Y}) = (d\mathbf{X})\mathbf{Y} + \mathbf{X}(d\mathbf{Y})$, $d(\mathbf{X}^{-1}) = -\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1}$, and $d\log\det\mathbf{X} = \operatorname{tr}(\mathbf{X}^{-1}\,d\mathbf{X})$; (2) rearrange into the form $df = \operatorname{tr}(\mathbf{A}^\top d\mathbf{X})$; (3) read off the derivative $\partial f/\partial \mathbf{X} = \mathbf{A}$.
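The inverse rule can be tested directly: perturb $\mathbf{X}$ by a small $d\mathbf{X}$ and compare the actual change in $\mathbf{X}^{-1}$ with the first-order prediction. A minimal sketch assuming NumPy:

```python
# First-order check of d(X^{-1}) = -X^{-1} (dX) X^{-1}.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep X well-conditioned
dX = 1e-6 * rng.standard_normal((4, 4))           # small perturbation

Xinv = np.linalg.inv(X)
actual = np.linalg.inv(X + dX) - Xinv             # true change in the inverse
predicted = -Xinv @ dX @ Xinv                     # differential rule's prediction
print(np.max(np.abs(actual - predicted)))         # discrepancy is second order in dX
```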
Main Theorems
Chain Rule for Matrix-to-Scalar Functions
Statement
If $f(\mathbf{Y})$ is a scalar function and $\mathbf{Y}$ depends on a scalar parameter $t$, then:
$$\frac{\partial f}{\partial t} = \operatorname{tr}\left(\left(\frac{\partial f}{\partial \mathbf{Y}}\right)^\top \frac{\partial \mathbf{Y}}{\partial t}\right)$$
For the common case where $\mathbf{Y} = \mathbf{A}\mathbf{X}\mathbf{B}$ with constant $\mathbf{A}$ and $\mathbf{B}$:
$$\frac{\partial f}{\partial \mathbf{X}} = \mathbf{A}^\top \frac{\partial f}{\partial \mathbf{Y}} \mathbf{B}^\top$$
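This case can be checked numerically. For $f(\mathbf{Y}) = \tfrac{1}{2}\|\mathbf{Y}\|_F^2$ we have $\partial f/\partial \mathbf{Y} = \mathbf{Y}$, so the rule predicts $\partial f/\partial \mathbf{X} = \mathbf{A}^\top(\mathbf{A}\mathbf{X}\mathbf{B})\mathbf{B}^\top$. A sketch assuming NumPy:

```python
# Finite-difference check of df/dX = A^T (df/dY) B^T for Y = A X B,
# using f(Y) = 0.5 * ||Y||_F^2 so that df/dY = Y.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((4, 2))
X = rng.standard_normal((3, 4))

f = lambda M: 0.5 * np.sum((A @ M @ B) ** 2)
analytic = A.T @ (A @ X @ B) @ B.T

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
print(np.max(np.abs(numeric - analytic)))
```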
Intuition
The chain rule in matrix calculus works the same as in scalar calculus: the derivative of the composition is the product of derivatives. The trace appears because we need to contract a matrix of derivatives (the Jacobian) down to a scalar. The key is keeping track of transposes.
Proof Sketch
Write the differential: $df = \operatorname{tr}\left((\partial f/\partial \mathbf{Y})^\top d\mathbf{Y}\right)$. If $\mathbf{Y} = \mathbf{A}\mathbf{X}\mathbf{B}$, then $d\mathbf{Y} = \mathbf{A}(d\mathbf{X})\mathbf{B}$. Substitute and use the cyclic property of the trace: $df = \operatorname{tr}\left((\partial f/\partial \mathbf{Y})^\top \mathbf{A}(d\mathbf{X})\mathbf{B}\right) = \operatorname{tr}\left(\mathbf{B}(\partial f/\partial \mathbf{Y})^\top \mathbf{A}\,d\mathbf{X}\right) = \operatorname{tr}\left(\left(\mathbf{A}^\top \tfrac{\partial f}{\partial \mathbf{Y}} \mathbf{B}^\top\right)^\top d\mathbf{X}\right)$. Read off $\partial f/\partial \mathbf{X} = \mathbf{A}^\top(\partial f/\partial \mathbf{Y})\mathbf{B}^\top$ as the coefficient of $d\mathbf{X}$.
Why It Matters
This is how you derive gradients for any model. Want the gradient of the log-likelihood of a Gaussian with respect to the mean? Chain rule through the quadratic form. With respect to the covariance? Chain rule through the log-determinant and the inverse.
Failure Mode
The most common error is getting transposes wrong. In denominator layout, $\partial(\mathbf{a}^\top\mathbf{x})/\partial\mathbf{x} = \mathbf{a}$ (a column vector), but in numerator layout it is $\mathbf{a}^\top$ (a row vector). Always state your convention and check dimensions.
Canonical Examples
Maximum likelihood for multivariate Gaussian
The log-likelihood of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ given data $\mathbf{x}_1, \dots, \mathbf{x}_n$ is $\ell = -\frac{n}{2}\log\det\boldsymbol{\Sigma} - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) + \text{const}$. Taking $\partial\ell/\partial\boldsymbol{\mu}$: the derivative of the quadratic form gives $\sum_i \boldsymbol{\Sigma}^{-1}(\mathbf{x}_i - \boldsymbol{\mu})$. Setting to zero gives $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_i \mathbf{x}_i$. Taking $\partial\ell/\partial\boldsymbol{\Sigma}$: use $\partial\log\det\boldsymbol{\Sigma}/\partial\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{-1}$ and $\partial(\mathbf{a}^\top\boldsymbol{\Sigma}^{-1}\mathbf{a})/\partial\boldsymbol{\Sigma} = -\boldsymbol{\Sigma}^{-1}\mathbf{a}\mathbf{a}^\top\boldsymbol{\Sigma}^{-1}$. Setting to zero gives $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_i (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top$.
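The closed forms can be verified by checking that both gradients vanish at $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ on sample data. A minimal sketch assuming NumPy (the data-generating numbers are arbitrary):

```python
# Check that the Gaussian log-likelihood gradients vanish at the MLEs:
# mu_hat = sample mean, Sigma_hat = (1/n) * centered scatter matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3)) * np.array([1.0, 2.0, 0.5])  # toy data

n = X.shape[0]
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n

Sinv = np.linalg.inv(Sigma_hat)
# Gradient w.r.t. mu:    sum_i Sigma^{-1} (x_i - mu)
grad_mu = Sinv @ centered.sum(axis=0)
# Gradient w.r.t. Sigma: -(n/2) Sigma^{-1} + (1/2) Sigma^{-1} S Sigma^{-1},
# where S is the centered scatter matrix sum_i (x_i - mu)(x_i - mu)^T.
S = centered.T @ centered
grad_Sigma = -0.5 * n * Sinv + 0.5 * Sinv @ S @ Sinv
print(np.max(np.abs(grad_mu)), np.max(np.abs(grad_Sigma)))  # both ~ 0
```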
Linear regression gradient
For $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$, expand to $\mathbf{y}^\top\mathbf{y} - 2\mathbf{y}^\top\mathbf{X}\mathbf{w} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$. Then $\partial L/\partial\mathbf{w} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}$. Setting to zero gives the normal equations $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$.
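The derivation can be confirmed by solving the normal equations and comparing against a library least-squares solver. A sketch assuming NumPy:

```python
# Solve the normal equations X^T X w = X^T y, check the result against
# np.linalg.lstsq, and check that the gradient vanishes at the solution.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # from the normal equations
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # QR/SVD-based reference

grad = -2 * X.T @ y + 2 * X.T @ X @ w_normal    # dL/dw at the solution
print(np.max(np.abs(w_normal - w_lstsq)), np.max(np.abs(grad)))
```

(In numerical practice you would prefer the QR or SVD route over forming $\mathbf{X}^\top\mathbf{X}$, which squares the condition number; the normal equations are the derivation tool, not the recommended algorithm.)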
Common Confusions
Layout conventions cause most errors
Different textbooks use different layout conventions, and mixing them produces wrong transposes. Pick one convention (denominator layout for ML) and stick with it. When reading papers, check which convention they use by verifying a simple case like $\partial(\mathbf{a}^\top\mathbf{x})/\partial\mathbf{x}$.
The derivative of a trace is not a trace
$\partial\operatorname{tr}(\mathbf{A}\mathbf{X})/\partial\mathbf{X} = \mathbf{A}^\top$ is a matrix, not a scalar. The trace of a matrix is a scalar, but its derivative with respect to a matrix is a matrix. This is the gradient: it tells you how much the trace changes when you perturb each entry of $\mathbf{X}$.
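A quick way to internalize this is to perturb a single entry of $\mathbf{X}$ and watch the trace respond at the rate given by the corresponding entry of $\mathbf{A}^\top$. A minimal sketch assuming NumPy:

```python
# Entrywise check that d(tr(AX))/dX = A^T: perturbing X[i, j]
# changes tr(AX) at rate A[j, i].
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

f = lambda M: np.trace(A @ M)
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
print(np.max(np.abs(numeric - A.T)))
```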
Summary
- $\partial(\mathbf{x}^\top\mathbf{A}\mathbf{x})/\partial\mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$; for symmetric $\mathbf{A}$ this is $2\mathbf{A}\mathbf{x}$
- Use differentials for complex expressions: compute $df$, rearrange into $\operatorname{tr}(\mathbf{A}^\top d\mathbf{X})$, read off the derivative $\partial f/\partial\mathbf{X} = \mathbf{A}$
- Always check dimensions and state your layout convention
- Most ML uses denominator layout (the gradient has the same shape as the variable)
Exercises
Problem
Derive $\partial(\mathbf{x}^\top\mathbf{A}\mathbf{x})/\partial\mathbf{x}$, where $\mathbf{A}$ is a constant square matrix and $\mathbf{x}$ is a vector.
Problem
Derive $\partial\log\det\mathbf{X}/\partial\mathbf{X}$ using differentials. Assume $\mathbf{X}$ is symmetric positive definite.
Problem
Derive the derivative $\partial\operatorname{tr}(\mathbf{X}^\top\mathbf{A}\mathbf{X})/\partial\mathbf{X}$ using the differential method, where $\mathbf{A}$ is a constant $m \times m$ matrix and $\mathbf{X}$ is an $m \times n$ matrix.
References
Canonical:
- Petersen & Pedersen, The Matrix Cookbook (2012)
- Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 1-5
- Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (matrix calculus identities)
Current:
- Parr & Howard, "The Matrix Calculus You Need for Deep Learning" (2018)
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.3 (gradient-based optimization)
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5.3 (gradients of matrix and vector expressions)
Next Topics
The natural next steps from matrix calculus:
- Backpropagation: applying the chain rule to computation graphs
- Automatic differentiation: computing derivatives without symbolic math
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)