Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Matrix Calculus

The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.

Core · Tier 1 · Stable · ~50 min

Why This Matters

Every time you derive a gradient for a machine learning model, whether it is linear regression, a Gaussian model, or a neural network, you are doing matrix calculus. Most ML papers skip the derivation steps, writing "taking the derivative and setting to zero" as if the reader knows how to differentiate $\log \det(X)$ with respect to $X$.

This topic gives you the identities that make those derivations mechanical. Once you know a handful of rules, you can derive gradients for any model whose loss involves traces, determinants, and quadratic forms.

Mental Model

Matrix calculus extends scalar calculus to matrices the same way vector calculus extends it to vectors. The key insight: every matrix expression is really a scalar function of many variables (the matrix entries). We package the partial derivatives into matrices of the same shape.

The derivative of a scalar with respect to a matrix is a matrix. The derivative of a scalar with respect to a vector is a vector. These are the gradients you need for optimization.

Formal Setup and Notation

Let $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ be a scalar-valued function of a matrix $X$. The derivative of $f$ with respect to $X$ is the matrix of partial derivatives:

$$\frac{\partial f}{\partial X} \in \mathbb{R}^{m \times n}, \quad \left[\frac{\partial f}{\partial X}\right]_{ij} = \frac{\partial f}{\partial X_{ij}}$$
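This definition can be sanity-checked numerically. The sketch below (using NumPy, which the text does not assume; all names are illustrative) approximates each partial derivative $\partial f / \partial X_{ij}$ with a central difference for $f(X) = \sum_{ij} X_{ij}^2$, whose matrix derivative is $2X$ entrywise:

```python
import numpy as np

# f(X) = sum of squared entries; each partial is df/dX_ij = 2 * X_ij,
# so the matrix derivative df/dX should equal 2X, entry by entry.
f = lambda M: np.sum(M ** 2)

X = np.arange(6.0).reshape(2, 3)
h = 1e-6
G = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h  # perturb one entry at a time
        G[i, j] = (f(X + E) - f(X - E)) / (2 * h)  # central difference

assert np.allclose(G, 2 * X, atol=1e-5)
```

Note that the numerical derivative has the same shape as $X$, exactly as the definition packages it.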

Definition

Numerator Layout Convention

In the numerator layout (also called the Jacobian formulation), the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X^T$. For a scalar $f$ and column vector $x \in \mathbb{R}^n$, $\frac{\partial f}{\partial x}$ is a row vector. Many statistics textbooks use this convention.

Definition

Denominator Layout Convention

In the denominator layout (also called the gradient formulation), the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X$. For a scalar $f$ and column vector $x \in \mathbb{R}^n$, $\frac{\partial f}{\partial x}$ is a column vector. Most ML and optimization textbooks use this convention because the gradient then has the same shape as the variable and points in the direction of steepest ascent.

This topic uses denominator layout throughout.

Core Identities

These are the identities you need to know. In all cases, $A$ is a constant matrix and $a$ a constant vector of appropriate dimensions, $x$ is a vector, and $X$ is a matrix.

Vector identities:

$$\frac{\partial}{\partial x}(a^T x) = a$$

$$\frac{\partial}{\partial x}(x^T A x) = (A + A^T)x$$

If $A$ is symmetric, this simplifies to $2Ax$.

$$\frac{\partial}{\partial x}\|x\|^2 = 2x$$
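These vector identities are easy to verify numerically. A minimal sketch, assuming NumPy and a hand-rolled central-difference gradient (both illustrative, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)
A = rng.standard_normal((n, n))

def num_grad(f, v, h=1e-6):
    """Central-difference approximation to the gradient of scalar f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

# d/dx (a^T x) = a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5)
# d/dx (x^T A x) = (A + A^T) x -- note A itself need not be symmetric
assert np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-5)
# d/dx ||x||^2 = 2x
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x, atol=1e-5)
```

Because the gradient of $x^T A x$ is $(A + A^T)x$ rather than $2Ax$, checking with a non-symmetric $A$ catches the most common mistake.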

Trace identities:

$$\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$$

$$\frac{\partial}{\partial X}\mathrm{tr}(X^T A) = A$$

$$\frac{\partial}{\partial X}\mathrm{tr}(X^T A X) = (A + A^T)X$$

$$\frac{\partial}{\partial X}\mathrm{tr}(AXB) = A^T B^T$$
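The trace identities can be verified the same way, by differencing one matrix entry at a time. A sketch, assuming NumPy (the dimension choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
X = rng.standard_normal((m, n))
A1 = rng.standard_normal((n, m))  # for tr(A X), A must be n x m
A2 = rng.standard_normal((m, m))  # for tr(X^T A X)
A3 = rng.standard_normal((2, m))  # for tr(A X B): A is p x m ...
B = rng.standard_normal((n, 2))   # ... and B is n x p

def num_grad_mat(f, M, h=1e-6):
    """Entrywise central-difference derivative of scalar f at matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (f(M + E) - f(M - E)) / (2 * h)
    return G

# d/dX tr(A X) = A^T
assert np.allclose(num_grad_mat(lambda M: np.trace(A1 @ M), X), A1.T, atol=1e-5)
# d/dX tr(X^T A X) = (A + A^T) X
assert np.allclose(num_grad_mat(lambda M: np.trace(M.T @ A2 @ M), X),
                   (A2 + A2.T) @ X, atol=1e-5)
# d/dX tr(A X B) = A^T B^T
assert np.allclose(num_grad_mat(lambda M: np.trace(A3 @ M @ B), X),
                   A3.T @ B.T, atol=1e-5)
```

A useful dimension check: in denominator layout every one of these derivatives has the same shape as $X$, which forces the transposes on the right-hand sides.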

Log-determinant:

$$\frac{\partial}{\partial X}\log \det(X) = X^{-T}$$

If $X$ is symmetric, this is $X^{-1}$.

Inverse:

$$\frac{\partial}{\partial X}\mathrm{tr}(AX^{-1}) = -(X^{-1}AX^{-1})^T = -X^{-T}A^TX^{-T}$$
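A numerical check of the log-determinant and inverse identities, again a sketch assuming NumPy. Using a non-symmetric $X$ makes the transpose in $X^{-T}$ visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
# Diagonally dominated random matrix: invertible, but NOT symmetric,
# so the transpose in the identities actually matters.
X = rng.standard_normal((n, n)) + n * np.eye(n)
A = rng.standard_normal((n, n))

def num_grad_mat(f, M, h=1e-6):
    """Entrywise central-difference derivative of scalar f at matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (f(M + E) - f(M - E)) / (2 * h)
    return G

Xi = np.linalg.inv(X)

# d/dX log det X = X^{-T}  (slogdet returns log|det|, which is numerically safer)
G1 = num_grad_mat(lambda M: np.linalg.slogdet(M)[1], X)
assert np.allclose(G1, Xi.T, atol=1e-4)

# d/dX tr(A X^{-1}) = -(X^{-1} A X^{-1})^T
G2 = num_grad_mat(lambda M: np.trace(A @ np.linalg.inv(M)), X)
assert np.allclose(G2, -(Xi @ A @ Xi).T, atol=1e-4)
```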

The Chain Rule for Matrix Expressions

Definition

Matrix Chain Rule via Differentials

To differentiate complex matrix expressions, use differentials. The differential of $f$ is:

$$df = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$$

The procedure: (1) compute the differential $df$ using rules like $d(AB) = (dA)B + A(dB)$, $d(A^{-1}) = -A^{-1}(dA)A^{-1}$, and $d(\log\det A) = \mathrm{tr}(A^{-1}\,dA)$; (2) rearrange into the form $\mathrm{tr}((\cdot)^T dX)$; (3) read off the derivative.
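The differential rules themselves can be checked by comparing the exact change under a small perturbation $dA$ with the first-order prediction: the two should agree up to second-order error. A sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
dA = 1e-6 * rng.standard_normal((n, n))          # a small perturbation

Ai = np.linalg.inv(A)

# d(A^{-1}) = -A^{-1} (dA) A^{-1}: the exact change in the inverse
# should match the first-order prediction up to O(||dA||^2).
change = np.linalg.inv(A + dA) - Ai
assert np.allclose(change, -Ai @ dA @ Ai, atol=1e-10)

# d(log det A) = tr(A^{-1} dA), checked the same way
change_ld = np.linalg.slogdet(A + dA)[1] - np.linalg.slogdet(A)[1]
assert np.allclose(change_ld, np.trace(Ai @ dA), atol=1e-10)
```

With $\|dA\| \sim 10^{-6}$, the second-order error is around $10^{-12}$, so the tolerances above are comfortable.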

Main Theorems

Proposition

Chain Rule for Matrix-to-Scalar Functions

Statement

If $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ is a scalar function and $X = X(\theta)$ depends on parameters $\theta \in \mathbb{R}^p$, then:

$$\frac{\partial f}{\partial \theta_k} = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T \frac{\partial X}{\partial \theta_k}\right)$$

For the common case where $f$ is a function of the vector $x = A\theta$ with $A$ constant:

$$\frac{\partial f}{\partial \theta} = A^T \frac{\partial f}{\partial x}$$
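A quick numerical check of the vector case, taking $f(x) = \|x\|^2$ with $x = A\theta$, so $\frac{\partial f}{\partial x} = 2x$ and the chain rule predicts $\frac{\partial f}{\partial \theta} = A^T(2A\theta)$. A sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 3
A = rng.standard_normal((n, p))
theta = rng.standard_normal(p)

# f(theta) = ||A theta||^2; chain rule: df/dtheta = A^T (2 A theta)
f = lambda t: np.sum((A @ t) ** 2)
analytic = A.T @ (2 * (A @ theta))

# central-difference check
g = np.zeros(p)
h = 1e-6
for k in range(p):
    e = np.zeros(p)
    e[k] = h
    g[k] = (f(theta + e) - f(theta - e)) / (2 * h)

assert np.allclose(g, analytic, atol=1e-4)
```

Dimension bookkeeping confirms the transpose: $\frac{\partial f}{\partial x}$ is $n \times 1$, and only $A^T$ ($p \times n$) maps it to a $p \times 1$ gradient matching $\theta$.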

Intuition

The chain rule in matrix calculus works the same as in scalar calculus: the derivative of the composition is the product of derivatives. The trace appears because we need to contract a matrix of derivatives (the Jacobian) down to a scalar. The key is keeping track of transposes.

Proof Sketch

Write the differential: $df = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$. If $X = X(\theta)$, then $dX = \sum_k \frac{\partial X}{\partial \theta_k} d\theta_k$. Substitute and use linearity of the trace: $df = \sum_k \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T \frac{\partial X}{\partial \theta_k}\right) d\theta_k$. Read off $\frac{\partial f}{\partial \theta_k}$ as the coefficient of $d\theta_k$.

Why It Matters

This is how you derive gradients for any model. Want the gradient of the log-likelihood of a Gaussian with respect to the mean? Chain rule through the quadratic form. With respect to the covariance? Chain rule through the log-determinant and the inverse.

Failure Mode

The most common error is getting transposes wrong. In denominator layout, $\frac{\partial}{\partial x}(a^T x) = a$ (a column vector), but in numerator layout it is $a^T$ (a row vector). Always state your convention and check dimensions.

Canonical Examples

Example

Maximum likelihood for multivariate Gaussian

The log-likelihood of $\mu$ and $\Sigma$ given data $\{x_i\}_{i=1}^n$ is

$$\ell = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\sum_i (x_i - \mu)^T \Sigma^{-1}(x_i - \mu) + \text{const}.$$

Taking $\frac{\partial \ell}{\partial \mu}$: the derivative of the quadratic form gives $\Sigma^{-1}\sum_i(x_i - \mu)$. Setting this to zero gives $\hat{\mu} = \bar{x}$. Taking $\frac{\partial \ell}{\partial \Sigma}$: use $\frac{\partial}{\partial \Sigma}\log\det\Sigma = \Sigma^{-1}$ and $\frac{\partial}{\partial \Sigma}\mathrm{tr}(\Sigma^{-1}A) = -\Sigma^{-1}A\Sigma^{-1}$ (valid for symmetric $\Sigma$). Setting this to zero gives $\hat{\Sigma} = \frac{1}{n}\sum_i(x_i-\bar{x})(x_i-\bar{x})^T$.
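The closed-form estimators can be confirmed against NumPy's built-ins on synthetic data. A sketch (the data-generating choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic correlated data: shift and mix standard normals
Z = rng.standard_normal((500, 3))
data = Z @ rng.standard_normal((3, 3)) + np.array([1.0, -2.0, 0.5])

mu_hat = data.mean(axis=0)                     # MLE for the mean: sample mean
centered = data - mu_hat
Sigma_hat = centered.T @ centered / len(data)  # MLE divides by n, not n - 1

# np.cov with bias=True also normalizes by n, so the two should agree
assert np.allclose(Sigma_hat, np.cov(data, rowvar=False, bias=True))
```

The $1/n$ versus $1/(n-1)$ distinction matters here: the MLE $\hat{\Sigma}$ is the biased estimator, which is why `bias=True` is needed in the comparison.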

Example

Linear regression gradient

For $f(\beta) = \|y - X\beta\|^2 = (y - X\beta)^T(y - X\beta)$, expand to $y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$. Then $\frac{\partial f}{\partial \beta} = -2X^Ty + 2X^TX\beta$. Setting this to zero gives the normal equations $X^TX\beta = X^Ty$.
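A sketch verifying the normal equations against NumPy's least-squares solver on random data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_lstsq)

# The gradient -2 X^T y + 2 X^T X beta vanishes at the solution
grad = -2 * X.T @ y + 2 * X.T @ X @ beta
assert np.allclose(grad, np.zeros(p), atol=1e-8)
```

In practice `np.linalg.lstsq` (or a QR factorization) is preferred over forming $X^TX$ explicitly, since squaring the matrix squares its condition number.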

Common Confusions

Watch Out

Layout conventions cause most errors

Different textbooks use different layout conventions, and mixing them produces wrong transposes. Pick one convention (denominator layout for ML) and stick with it. When reading papers, check which convention they use by verifying a simple case like $\frac{\partial}{\partial x}(a^T x)$.

Watch Out

The derivative of a trace is not a trace

$\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$ is a matrix, not a scalar. The trace of a matrix is a scalar, but its derivative with respect to a matrix is a matrix. This is the gradient: it tells you how much the trace changes when you perturb each entry of $X$.

Summary

  • $\frac{\partial}{\partial x}(x^TAx) = (A+A^T)x$; for symmetric $A$ this is $2Ax$
  • $\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$
  • $\frac{\partial}{\partial X}\log\det(X) = X^{-T}$
  • Use differentials for complex expressions: compute $df$, rearrange into $\mathrm{tr}((\cdot)^T dX)$, read off the derivative
  • Always check dimensions and state your layout convention
  • Most ML uses denominator layout (the gradient has the same shape as the variable)

Exercises

ExerciseCore

Problem

Derive $\frac{\partial}{\partial x}(x^TAx)$ where $A$ is a constant square matrix and $x$ is a vector.

ExerciseAdvanced

Problem

Derive $\frac{\partial}{\partial \Sigma}\log\det(\Sigma)$ using differentials. Assume $\Sigma$ is symmetric positive definite.

ExerciseResearch

Problem

Derive the derivative $\frac{\partial}{\partial X}\mathrm{tr}(X^TAX)$ using the differential method, where $A$ is a constant matrix and $X$ is an $n \times p$ matrix.

References

Canonical:

  • Petersen & Pedersen, The Matrix Cookbook (2012)
  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 1-5
  • Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (matrix calculus identities)

Current:

  • Parr & Howard, "The Matrix Calculus You Need for Deep Learning" (2018)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.3 (gradient-based optimization)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5.3 (gradients of matrix and vector expressions)


Last reviewed: April 2026
