Beta: content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Matrix Calculus

The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.

Core · Tier 1 · Stable · ~50 min

Why This Matters

Every time you derive a gradient for a machine learning model, whether it is linear regression, a Gaussian model, or a neural network, you are doing matrix calculus. Most ML papers skip the derivation steps, writing "taking the derivative and setting to zero" as if the reader knows how to differentiate $\log \det(X)$ with respect to $X$.

This topic gives you the identities that make those derivations mechanical. Once you know a handful of rules, you can derive gradients for any model whose loss involves traces, determinants, and quadratic forms.

Mental Model

Matrix calculus extends scalar calculus to matrices the same way vector calculus extends it to vectors. The key insight: every matrix expression is really a scalar function of many variables (the matrix entries). We package the partial derivatives into matrices of the same shape.

The derivative of a scalar with respect to a matrix is a matrix. The derivative of a scalar with respect to a vector is a vector. These are the gradients you need for optimization.

Formal Setup and Notation

Let $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ be a scalar-valued function of a matrix $X$. The derivative of $f$ with respect to $X$ is the matrix of partial derivatives:

$$\frac{\partial f}{\partial X} \in \mathbb{R}^{m \times n}, \quad \left[\frac{\partial f}{\partial X}\right]_{ij} = \frac{\partial f}{\partial X_{ij}}$$
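This definition can be sanity-checked numerically. The sketch below (using NumPy, which the text does not assume; all names are illustrative) approximates each partial derivative $\partial f / \partial X_{ij}$ with a central difference for $f(X) = \sum_{ij} X_{ij}^2$, whose matrix derivative is $2X$ entrywise:

```python
import numpy as np

# f(X) = sum of squared entries; each partial is df/dX_ij = 2 * X_ij,
# so the matrix derivative df/dX should equal 2X, entry by entry.
f = lambda M: np.sum(M ** 2)

X = np.arange(6.0).reshape(2, 3)
h = 1e-6
G = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h  # perturb one entry at a time
        G[i, j] = (f(X + E) - f(X - E)) / (2 * h)  # central difference

assert np.allclose(G, 2 * X, atol=1e-5)
```

Note that the numerical derivative has the same shape as $X$, exactly as the definition packages it.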

Definition

Numerator Layout Convention

In the numerator layout (also called the Jacobian formulation), the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X^T$. For a scalar $f$ and column vector $x \in \mathbb{R}^n$, $\frac{\partial f}{\partial x}$ is a row vector. Many statistics textbooks use this convention.

Definition

Denominator Layout Convention

In the denominator layout (also called the gradient formulation), the derivative $\frac{\partial f}{\partial X}$ has the same shape as $X$. For a scalar $f$ and column vector $x \in \mathbb{R}^n$, $\frac{\partial f}{\partial x}$ is a column vector. Most ML and optimization textbooks use this convention because the gradient then has the same shape as the variable and points in the direction of steepest ascent.

This topic uses denominator layout throughout.

Core Identities

These are the identities you need to know. In all cases, $A$ is a constant matrix and $a$ a constant vector of appropriate dimensions, $x$ is a vector, and $X$ is a matrix.

Vector identities:

$$\frac{\partial}{\partial x}(a^T x) = a$$

$$\frac{\partial}{\partial x}(x^T A x) = (A + A^T)x$$

If $A$ is symmetric, this simplifies to $2Ax$.

$$\frac{\partial}{\partial x}\|x\|^2 = 2x$$
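These vector identities are easy to verify numerically. A minimal sketch, assuming NumPy and a hand-rolled central-difference gradient (both illustrative, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)
A = rng.standard_normal((n, n))

def num_grad(f, v, h=1e-6):
    """Central-difference approximation to the gradient of scalar f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

# d/dx (a^T x) = a
assert np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5)
# d/dx (x^T A x) = (A + A^T) x -- note A itself need not be symmetric
assert np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-5)
# d/dx ||x||^2 = 2x
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x, atol=1e-5)
```

Because the gradient of $x^T A x$ is $(A + A^T)x$ rather than $2Ax$, checking with a non-symmetric $A$ catches the most common mistake.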

Trace identities:

$$\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$$

$$\frac{\partial}{\partial X}\mathrm{tr}(X^T A) = A$$

$$\frac{\partial}{\partial X}\mathrm{tr}(X^T A X) = (A + A^T)X$$

$$\frac{\partial}{\partial X}\mathrm{tr}(AXB) = A^T B^T$$
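The trace identities can be verified the same way, by differencing one matrix entry at a time. A sketch, assuming NumPy (the dimension choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
X = rng.standard_normal((m, n))
A1 = rng.standard_normal((n, m))  # for tr(A X), A must be n x m
A2 = rng.standard_normal((m, m))  # for tr(X^T A X)
A3 = rng.standard_normal((2, m))  # for tr(A X B): A is p x m ...
B = rng.standard_normal((n, 2))   # ... and B is n x p

def num_grad_mat(f, M, h=1e-6):
    """Entrywise central-difference derivative of scalar f at matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (f(M + E) - f(M - E)) / (2 * h)
    return G

# d/dX tr(A X) = A^T
assert np.allclose(num_grad_mat(lambda M: np.trace(A1 @ M), X), A1.T, atol=1e-5)
# d/dX tr(X^T A X) = (A + A^T) X
assert np.allclose(num_grad_mat(lambda M: np.trace(M.T @ A2 @ M), X),
                   (A2 + A2.T) @ X, atol=1e-5)
# d/dX tr(A X B) = A^T B^T
assert np.allclose(num_grad_mat(lambda M: np.trace(A3 @ M @ B), X),
                   A3.T @ B.T, atol=1e-5)
```

A useful dimension check: in denominator layout every one of these derivatives has the same shape as $X$, which forces the transposes on the right-hand sides.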

Log-determinant:

$$\frac{\partial}{\partial X}\log \det(X) = X^{-T}$$

If $X$ is symmetric, this is $X^{-1}$.

Inverse:

$$\frac{\partial}{\partial X}\mathrm{tr}(AX^{-1}) = -(X^{-1}AX^{-1})^T = -X^{-T}A^TX^{-T}$$
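A numerical check of the log-determinant and inverse identities, again a sketch assuming NumPy. Using a non-symmetric $X$ makes the transpose in $X^{-T}$ visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
# Diagonally dominated random matrix: invertible, but NOT symmetric,
# so the transpose in the identities actually matters.
X = rng.standard_normal((n, n)) + n * np.eye(n)
A = rng.standard_normal((n, n))

def num_grad_mat(f, M, h=1e-6):
    """Entrywise central-difference derivative of scalar f at matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (f(M + E) - f(M - E)) / (2 * h)
    return G

Xi = np.linalg.inv(X)

# d/dX log det X = X^{-T}  (slogdet returns log|det|, which is numerically safer)
G1 = num_grad_mat(lambda M: np.linalg.slogdet(M)[1], X)
assert np.allclose(G1, Xi.T, atol=1e-4)

# d/dX tr(A X^{-1}) = -(X^{-1} A X^{-1})^T
G2 = num_grad_mat(lambda M: np.trace(A @ np.linalg.inv(M)), X)
assert np.allclose(G2, -(Xi @ A @ Xi).T, atol=1e-4)
```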

The Chain Rule for Matrix Expressions

Definition

Matrix Chain Rule via Differentials

To differentiate complex matrix expressions, use differentials. The differential of $f$ is:

$$df = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$$

The procedure: (1) compute the differential $df$ using rules like $d(AB) = (dA)B + A(dB)$, $d(A^{-1}) = -A^{-1}(dA)A^{-1}$, and $d(\log\det A) = \mathrm{tr}(A^{-1}\,dA)$; (2) rearrange into the form $\mathrm{tr}((\cdot)^T dX)$; (3) read off the derivative.
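The differential rules themselves can be checked by comparing the exact change under a small perturbation $dA$ with the first-order prediction: the two should agree up to second-order error. A sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
dA = 1e-6 * rng.standard_normal((n, n))          # a small perturbation

Ai = np.linalg.inv(A)

# d(A^{-1}) = -A^{-1} (dA) A^{-1}: the exact change in the inverse
# should match the first-order prediction up to O(||dA||^2).
change = np.linalg.inv(A + dA) - Ai
assert np.allclose(change, -Ai @ dA @ Ai, atol=1e-10)

# d(log det A) = tr(A^{-1} dA), checked the same way
change_ld = np.linalg.slogdet(A + dA)[1] - np.linalg.slogdet(A)[1]
assert np.allclose(change_ld, np.trace(Ai @ dA), atol=1e-10)
```

With $\|dA\| \sim 10^{-6}$, the second-order error is around $10^{-12}$, so the tolerances above are comfortable.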

Main Theorems

Proposition

Chain Rule for Matrix-to-Scalar Functions

Statement

If $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ is a scalar function and $X = X(\theta)$ depends on parameters $\theta \in \mathbb{R}^p$, then:

$$\frac{\partial f}{\partial \theta_k} = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T \frac{\partial X}{\partial \theta_k}\right)$$

For the common case where $f$ is a function of the vector $x = A\theta$ with $A$ constant:

$$\frac{\partial f}{\partial \theta} = A^T \frac{\partial f}{\partial x}$$
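A quick numerical check of the vector case, taking $f(x) = \|x\|^2$ with $x = A\theta$, so $\frac{\partial f}{\partial x} = 2x$ and the chain rule predicts $\frac{\partial f}{\partial \theta} = A^T(2A\theta)$. A sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 3
A = rng.standard_normal((n, p))
theta = rng.standard_normal(p)

# f(theta) = ||A theta||^2; chain rule: df/dtheta = A^T (2 A theta)
f = lambda t: np.sum((A @ t) ** 2)
analytic = A.T @ (2 * (A @ theta))

# central-difference check
g = np.zeros(p)
h = 1e-6
for k in range(p):
    e = np.zeros(p)
    e[k] = h
    g[k] = (f(theta + e) - f(theta - e)) / (2 * h)

assert np.allclose(g, analytic, atol=1e-4)
```

Dimension bookkeeping confirms the transpose: $\frac{\partial f}{\partial x}$ is $n \times 1$, and only $A^T$ ($p \times n$) maps it to a $p \times 1$ gradient matching $\theta$.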

Intuition

The chain rule in matrix calculus works the same as in scalar calculus: the derivative of the composition is the product of derivatives. The trace appears because we need to contract a matrix of derivatives (the Jacobian) down to a scalar. The key is keeping track of transposes.

Proof Sketch

Write the differential: $df = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$. If $X = X(\theta)$, then $dX = \sum_k \frac{\partial X}{\partial \theta_k} d\theta_k$. Substitute and use linearity of the trace: $df = \sum_k \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T \frac{\partial X}{\partial \theta_k}\right) d\theta_k$. Read off $\frac{\partial f}{\partial \theta_k}$ as the coefficient of $d\theta_k$.

Why It Matters

This is how you derive gradients for any model. Want the gradient of the log-likelihood of a Gaussian with respect to the mean? Chain rule through the quadratic form. With respect to the covariance? Chain rule through the log-determinant and the inverse.

Failure Mode

The most common error is getting transposes wrong. In denominator layout, $\frac{\partial}{\partial x}(a^T x) = a$ (a column vector), but in numerator layout it is $a^T$ (a row vector). Always state your convention and check dimensions.

Canonical Examples

Example

Maximum likelihood for multivariate Gaussian

The log-likelihood of $\mu$ and $\Sigma$ given data $\{x_i\}_{i=1}^n$ is

$$\ell = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\sum_i (x_i - \mu)^T \Sigma^{-1}(x_i - \mu) + \text{const}.$$

Taking $\frac{\partial \ell}{\partial \mu}$: the derivative of the quadratic form gives $\Sigma^{-1}\sum_i(x_i - \mu)$. Setting this to zero gives $\hat{\mu} = \bar{x}$. Taking $\frac{\partial \ell}{\partial \Sigma}$: use $\frac{\partial}{\partial \Sigma}\log\det\Sigma = \Sigma^{-1}$ and $\frac{\partial}{\partial \Sigma}\mathrm{tr}(\Sigma^{-1}A) = -\Sigma^{-1}A\Sigma^{-1}$ (valid for symmetric $\Sigma$). Setting this to zero gives $\hat{\Sigma} = \frac{1}{n}\sum_i(x_i-\bar{x})(x_i-\bar{x})^T$.
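The closed-form estimators can be confirmed against NumPy's built-ins on synthetic data. A sketch (the data-generating choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic correlated data: shift and mix standard normals
Z = rng.standard_normal((500, 3))
data = Z @ rng.standard_normal((3, 3)) + np.array([1.0, -2.0, 0.5])

mu_hat = data.mean(axis=0)                     # MLE for the mean: sample mean
centered = data - mu_hat
Sigma_hat = centered.T @ centered / len(data)  # MLE divides by n, not n - 1

# np.cov with bias=True also normalizes by n, so the two should agree
assert np.allclose(Sigma_hat, np.cov(data, rowvar=False, bias=True))
```

The $1/n$ versus $1/(n-1)$ distinction matters here: the MLE $\hat{\Sigma}$ is the biased estimator, which is why `bias=True` is needed in the comparison.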

Example

Linear regression gradient

For $f(\beta) = \|y - X\beta\|^2 = (y - X\beta)^T(y - X\beta)$, expand to $y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$. Then $\frac{\partial f}{\partial \beta} = -2X^Ty + 2X^TX\beta$. Setting this to zero gives the normal equations $X^TX\beta = X^Ty$.
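A sketch verifying the normal equations against NumPy's least-squares solver on random data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_lstsq)

# The gradient -2 X^T y + 2 X^T X beta vanishes at the solution
grad = -2 * X.T @ y + 2 * X.T @ X @ beta
assert np.allclose(grad, np.zeros(p), atol=1e-8)
```

In practice `np.linalg.lstsq` (or a QR factorization) is preferred over forming $X^TX$ explicitly, since squaring the matrix squares its condition number.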

Common Confusions

Watch Out

Layout conventions cause most errors

Different textbooks use different layout conventions, and mixing them produces wrong transposes. Pick one convention (denominator layout for ML) and stick with it. When reading papers, check which convention they use by verifying a simple case like $\frac{\partial}{\partial x}(a^T x)$.

Watch Out

The derivative of a trace is not a trace

$\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$ is a matrix, not a scalar. The trace of a matrix is a scalar, but its derivative with respect to a matrix is a matrix. This is the gradient: it tells you how much the trace changes when you perturb each entry of $X$.

Summary

  • $\frac{\partial}{\partial x}(x^TAx) = (A+A^T)x$; for symmetric $A$ this is $2Ax$
  • $\frac{\partial}{\partial X}\mathrm{tr}(AX) = A^T$
  • $\frac{\partial}{\partial X}\log\det(X) = X^{-T}$
  • Use differentials for complex expressions: compute $df$, rearrange into $\mathrm{tr}((\cdot)^T dX)$, read off the derivative
  • Always check dimensions and state your layout convention
  • Most ML uses denominator layout (the gradient has the same shape as the variable)

Exercises

ExerciseCore

Problem

Derive $\frac{\partial}{\partial x}(x^TAx)$ where $A$ is a constant square matrix and $x$ is a vector.

ExerciseAdvanced

Problem

Derive $\frac{\partial}{\partial \Sigma}\log\det(\Sigma)$ using differentials. Assume $\Sigma$ is symmetric positive definite.

ExerciseResearch

Problem

Derive the derivative $\frac{\partial}{\partial X}\mathrm{tr}(X^TAX)$ using the differential method, where $A$ is a constant matrix and $X$ is an $n \times p$ matrix.

References

Canonical:

  • Petersen & Pedersen, The Matrix Cookbook (2012)
  • Magnus & Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics (2019), Chapters 1-5
  • Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (matrix calculus identities)

Current:

  • Parr & Howard, "The Matrix Calculus You Need for Deep Learning" (2018)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 4.3 (gradient-based optimization)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5.3 (gradients of matrix and vector expressions)


Last reviewed: April 2026
