
Calculus Objects

The Jacobian Matrix

The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.

Core · Tier 1 · Stable · ~35 min

Why This Matters

The Jacobian is the derivative of a vector-valued function. It is the matrix that makes the chain rule work for functions from \mathbb{R}^n to \mathbb{R}^m. Backpropagation is a chain of Jacobian multiplications: when you compose layers in a neural network, the chain rule multiplies Jacobians. Forward mode automatic differentiation computes Jacobian-vector products (JVPs). Reverse mode (backprop) computes vector-Jacobian products (VJPs).

Beyond deep learning, the Jacobian appears in coordinate transformations (the Jacobian determinant is the volume change factor), in the inverse function theorem, and in sensitivity analysis throughout science and engineering.

Figure: the m × n Jacobian matrix J = [∂f_i/∂x_j]. Row i is the gradient of output f_i; column j is the sensitivity of every output to input x_j. Chain rule: J(g ∘ f) = J(g) · J(f); this is why backpropagation is matrix multiplication.

Mental Model

For a scalar function f: \mathbb{R} \to \mathbb{R}, the derivative f'(x) gives the best linear approximation: f(x + h) \approx f(x) + f'(x) h.

For a vector-valued function f: \mathbb{R}^n \to \mathbb{R}^m (a map between spaces studied in differentiation in \mathbb{R}^n), the Jacobian matrix plays exactly the same role. It is the matrix J such that:

f(x + h) \approx f(x) + J \cdot h

The Jacobian generalizes the derivative to maps between multi-dimensional spaces. Each row is the gradient of one output component; each column shows how one input affects all outputs.

Core Definitions

Definition

Jacobian Matrix

For a differentiable function f: \mathbb{R}^n \to \mathbb{R}^m with component functions f = (f_1, \ldots, f_m)^T, the Jacobian matrix at a point x is the m \times n matrix:

[J_f(x)]_{ij} = \frac{\partial f_i}{\partial x_j}

Explicitly:

J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}

Row i is the gradient of the i-th component f_i (transposed). Column j shows how all outputs change when x_j is perturbed.

Definition

Special Cases by Dimension

The Jacobian unifies several familiar derivative objects:

  • f: \mathbb{R}^n \to \mathbb{R} (scalar-valued): the Jacobian is a 1 \times n row vector, the transpose of the gradient: J_f = (\nabla f)^T.
  • f: \mathbb{R} \to \mathbb{R}^m (curve in space): the Jacobian is an m \times 1 column vector, the tangent vector f'(t).
  • f: \mathbb{R}^n \to \mathbb{R}^n (square case): the Jacobian is n \times n and has a well-defined determinant.

The Linear Approximation

The defining property of the Jacobian is that it gives the best linear approximation:

f(x + h) = f(x) + J_f(x) \cdot h + o(\|h\|)

where o(\|h\|) means the error goes to zero faster than \|h\|. This is the multi-dimensional version of f(x+h) \approx f(x) + f'(x)h. The Jacobian is the unique matrix with this property (when f is differentiable).
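This defining property is easy to check numerically. The sketch below uses an arbitrary illustrative map f(x, y) = (xy, x + y²) with its hand-computed Jacobian, and compares f(x + h) against f(x) + J_f(x) h for a small h:

```python
import numpy as np

# Illustrative map f: R^2 -> R^2, f(x, y) = (x*y, x + y^2),
# with Jacobian J = [[y, x], [1, 2y]] computed by hand.
def f(v):
    x, y = v
    return np.array([x * y, x + y**2])

def jacobian(v):
    x, y = v
    return np.array([[y,   x],
                     [1.0, 2 * y]])

x0 = np.array([1.0, 2.0])
h = np.array([1e-5, -2e-5])

# f(x + h) should match f(x) + J h up to o(||h||).
exact = f(x0 + h)
linear = f(x0) + jacobian(x0) @ h
print(np.max(np.abs(exact - linear)))  # error is O(||h||^2), far below ||h||
```

Shrinking h further makes the error fall quadratically, which is exactly the o(\|h\|) behavior in the definition.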

The Chain Rule in Matrix Form

Theorem

Chain Rule for Jacobians

Statement

Let g: \mathbb{R}^n \to \mathbb{R}^p and f: \mathbb{R}^p \to \mathbb{R}^m. The Jacobian of the composition f \circ g is the product of Jacobians:

J_{f \circ g}(x) = J_f(g(x)) \cdot J_g(x)

This is an m \times n matrix, computed as the product of an m \times p matrix and a p \times n matrix.

Intuition

The chain rule says: the linear approximation of a composition is the composition of the linear approximations. Since linear maps compose by matrix multiplication, Jacobians multiply. Each Jacobian represents "how much things change locally," and the chain rule says these local changes multiply.

Proof Sketch

Using the linear approximation:

f(g(x+h)) \approx f(g(x) + J_g(x)h) \approx f(g(x)) + J_f(g(x)) \cdot J_g(x) \cdot h

The first step uses the linear approximation of g, and the second uses the linear approximation of f. The composite linear map is J_f(g(x)) \cdot J_g(x), which is the Jacobian of f \circ g by uniqueness of the derivative.

Making this rigorous requires controlling the o(\|h\|) error terms, which is done via the standard multivariable chain rule proof.

Why It Matters

This is the theoretical foundation of backpropagation. A neural network is a composition of layers: f = f_L \circ f_{L-1} \circ \cdots \circ f_1. The Jacobian of the entire network is:

J_f = J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_1}

Backprop computes this product efficiently starting from the output side, J_{f_L}, as a sequence of vector-Jacobian products (reverse mode). The chain rule is the reason backpropagation works.
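A quick numerical sanity check of the product rule, using two small hand-picked maps (the functions and evaluation point are illustrative assumptions, not from any particular network):

```python
import numpy as np

def g(v):            # g: R^2 -> R^3
    x, y = v
    return np.array([x + y, x * y, np.sin(x)])

def Jg(v):           # Jacobian of g, computed by hand
    x, y = v
    return np.array([[1.0,        1.0],
                     [y,          x  ],
                     [np.cos(x),  0.0]])

def f(u):            # f: R^3 -> R^2
    a, b, c = u
    return np.array([a * b, b + c**2])

def Jf(u):           # Jacobian of f, computed by hand
    a, b, c = u
    return np.array([[b,   a,   0.0],
                     [0.0, 1.0, 2 * c]])

x0 = np.array([0.5, -1.0])
chain = Jf(g(x0)) @ Jg(x0)          # product of Jacobians, a 2x2 matrix

# Compare against a central finite difference of the composition f∘g.
eps = 1e-6
fd = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    fd[:, j] = (f(g(x0 + e)) - f(g(x0 - e))) / (2 * eps)

print(np.max(np.abs(chain - fd)))   # tiny: the two Jacobians agree
```

The matrix product of the two hand-derived Jacobians matches the finite-difference Jacobian of the composition, which is the chain rule in action.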

Failure Mode

The chain rule requires differentiability. Functions with kinks (like ReLU at zero) are not differentiable at those points. In practice, deep learning frameworks define a subgradient or use the derivative from one side (e.g., ReLU'(0) = 0 by convention). The chain rule still works almost everywhere since the non-differentiable points have measure zero.

The Jacobian Determinant

When f: \mathbb{R}^n \to \mathbb{R}^n (same input and output dimension), the Jacobian is a square matrix and has a determinant.

Definition

Jacobian Determinant

The Jacobian determinant \det J_f(x) measures how f locally scales volumes. If A \subset \mathbb{R}^n is a small region near x:

\text{Volume}(f(A)) \approx |\det J_f(x)| \cdot \text{Volume}(A)

This is the basis for the change of variables formula in integration:

\int_{f(A)} g(y) \, dy = \int_A g(f(x)) \, |\det J_f(x)| \, dx

The sign of the Jacobian determinant indicates whether f preserves or reverses orientation. A positive determinant means the orientation is preserved; negative means it is reversed.
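The volume-scaling interpretation can be tested directly: push a tiny square through a map and compare the area of its image to |det J| times the original area. The map below (f(x, y) = (x² − y, x + y³)) and the base point are arbitrary illustrative choices:

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 - y, x + y**3])

def det_J(v):
    x, y = v
    # J = [[2x, -1], [1, 3y^2]], so det J = 6xy^2 + 1.
    return 6 * x * y**2 + 1

x0, s = np.array([1.0, 0.5]), 1e-4      # tiny square of side s at x0
corners = [x0, x0 + [s, 0], x0 + [s, s], x0 + [0, s]]
img = [f(c) for c in corners]

# Shoelace formula for the area of the (nearly parallelogram) image.
area = 0.5 * abs(sum(p[0] * q[1] - q[0] * p[1]
                     for p, q in zip(img, img[1:] + img[:1])))
ratio = area / s**2
print(ratio, abs(det_J(x0)))   # ratio ≈ |det J(x0)| = 2.5
```

As the square shrinks, the area ratio converges to |det J_f(x_0)|, exactly the local volume-scaling statement above.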

The Inverse Function Theorem

Theorem

Inverse Function Theorem

Statement

Let f: \mathbb{R}^n \to \mathbb{R}^n be continuously differentiable and suppose \det J_f(a) \neq 0 at a point a. Then there exists a neighborhood U of a on which f is a diffeomorphism (smooth bijection with smooth inverse), and the Jacobian of the inverse at b = f(a) is:

J_{f^{-1}}(b) = [J_f(a)]^{-1}

The derivative of the inverse is the inverse of the derivative.

Intuition

If the Jacobian at a point is invertible (nonzero determinant), then f is locally one-to-one and onto. It does not collapse any directions. The linear approximation is invertible, so the nonlinear function is locally invertible too. And the derivative of the inverse is what you would guess: invert the derivative.

Proof Sketch

The proof uses the contraction mapping theorem. Define \phi(x) = x - J_f(a)^{-1}(f(x) - y) for a target value y near f(a). Since J_f(a) is invertible and f is C^1, \phi is a contraction near a, so it has a unique fixed point x with f(x) = y. This gives local invertibility. The formula for the Jacobian of the inverse follows from the chain rule applied to f^{-1}(f(x)) = x.
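A small numerical check of the formula J_{f^{-1}}(b) = [J_f(a)]^{-1}, using the polar-to-Cartesian map, whose inverse (r, θ) = (√(x² + y²), atan2(y, x)) is known in closed form; the evaluation point is an arbitrary choice:

```python
import numpy as np

def J_f(r, th):
    # Jacobian of f(r, θ) = (r cos θ, r sin θ).
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

def J_finv(x, y):
    # Jacobian of the inverse map (r, θ) = (sqrt(x^2 + y^2), atan2(y, x)),
    # computed by hand.
    r2 = x**2 + y**2
    r = np.sqrt(r2)
    return np.array([[ x / r,  y / r],
                     [-y / r2, x / r2]])

a = (2.0, np.pi / 3)                       # point in (r, θ) space
b = (a[0] * np.cos(a[1]), a[0] * np.sin(a[1]))  # its image in (x, y)

# The Jacobian of the inverse equals the inverse of the Jacobian.
print(np.max(np.abs(J_finv(*b) - np.linalg.inv(J_f(*a)))))  # ~0
```

This works here because r = 2 > 0, so det J_f = r is nonzero and the theorem applies.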

Why It Matters

The inverse function theorem is foundational in differential geometry, in implicit methods for ODEs, and, in machine learning, in normalizing flows. Normalizing flows require computing \det J_f to evaluate the density of transformed distributions. The inverse function theorem guarantees that the flow is locally invertible whenever the Jacobian is nonsingular.

Failure Mode

The theorem is local: f may be locally invertible near a but not globally invertible. Example: f(x) = x^2 on \mathbb{R} has f'(1) = 2 \neq 0, so f is locally invertible near x = 1. But f is not globally invertible (f(1) = f(-1) = 1). Also, the Jacobian must be nonsingular at the point in question. If \det J_f(a) = 0, the function may fold or collapse directions near a.

The Jacobian in Backpropagation

The connection between the Jacobian and automatic differentiation is fundamental:

Jacobian-vector product (JVP), forward mode:

Given a vector v \in \mathbb{R}^n, compute J_f(x) \cdot v \in \mathbb{R}^m. This tells you how a perturbation v in the input propagates to the output. Cost: one forward pass. Efficient when n < m (few inputs, many outputs).

Vector-Jacobian product (VJP), reverse mode (backpropagation):

Given a vector u \in \mathbb{R}^m, compute u^T \cdot J_f(x) \in \mathbb{R}^n. This tells you how each input component affects the output along direction u. Cost: one backward pass. Efficient when m < n (few outputs, many inputs).

For a loss function L: \mathbb{R}^n \to \mathbb{R} (one output, n parameters), reverse mode computes the gradient \nabla L in one backward pass. This is why backprop is efficient for neural network training.
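The shapes involved can be made concrete with an explicit Jacobian (here a randomly generated stand-in, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                      # 4 inputs, 3 outputs
J = rng.normal(size=(m, n))      # stand-in for J_f(x)

v = rng.normal(size=n)           # input-space perturbation (tangent)
u = rng.normal(size=m)           # output-space sensitivity (cotangent)

jvp = J @ v                      # shape (m,): how v propagates forward
vjp = u @ J                      # shape (n,): how u pulls back to inputs

# For a scalar loss (m = 1), the VJP with u = [1.0] is the gradient itself.
print(jvp.shape, vjp.shape)      # (3,) (4,)
```

Autodiff frameworks compute these products without ever materializing J, which is the whole point: for large n and m, the m × n matrix may be too big to store, but Jv and uᵀJ are just vectors.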

Canonical Examples

Example

Jacobian of polar-to-Cartesian transformation

The transformation from polar coordinates (r, \theta) to Cartesian (x, y) is:

f(r, \theta) = \begin{pmatrix} r \cos\theta \\ r \sin\theta \end{pmatrix}

The Jacobian is:

J_f(r, \theta) = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}

The Jacobian determinant is:

\det J_f = r\cos^2\theta + r\sin^2\theta = r

This is the familiar factor r in the polar coordinate area element dA = r \, dr \, d\theta. The Jacobian determinant directly gives you the area scaling factor for coordinate transformations.

Example

Jacobian of a two-layer neural network

Consider a simple network f(x) = \sigma(W_2 \, \sigma(W_1 x + b_1) + b_2) where \sigma is an elementwise activation. By the chain rule:

J_f(x) = \text{diag}(\sigma'(z_2)) \cdot W_2 \cdot \text{diag}(\sigma'(z_1)) \cdot W_1

where z_1 = W_1 x + b_1 and z_2 = W_2 \sigma(z_1) + b_2.

Each layer contributes its Jacobian as a factor in the product. The \text{diag}(\sigma'(\cdot)) terms come from the elementwise nonlinearity. Backpropagation computes VJPs through this product starting from the output-side factor.
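This closed-form product can be checked against finite differences. The sketch below uses tanh as the activation and random placeholder weights (not trained parameters, just an illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
sigma = np.tanh
dsigma = lambda z: 1 - np.tanh(z)**2     # tanh'(z)

def f(x):
    return sigma(W2 @ sigma(W1 @ x + b1) + b2)

def jacobian(x):
    # J_f(x) = diag(σ'(z2)) · W2 · diag(σ'(z1)) · W1, per the formula above.
    z1 = W1 @ x + b1
    z2 = W2 @ sigma(z1) + b2
    return np.diag(dsigma(z2)) @ W2 @ np.diag(dsigma(z1)) @ W1

x0 = rng.normal(size=2)

# Central finite-difference Jacobian, one input coordinate at a time.
eps = 1e-6
fd = np.column_stack([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                      for e in np.eye(2)])

print(np.max(np.abs(jacobian(x0) - fd)))  # tiny: formula matches
```

Note that the closed-form version needs one pass through the network per layer factor, while the finite-difference version needs 2n forward passes and is only approximate; this gap is what autodiff exploits.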

Common Confusions

Watch Out

The Jacobian is a matrix, the gradient is a vector

For f: \mathbb{R}^n \to \mathbb{R} (scalar output), the gradient \nabla f \in \mathbb{R}^n is a column vector, while the Jacobian J_f \in \mathbb{R}^{1 \times n} is a row vector. They carry the same information: J_f = (\nabla f)^T. But for f: \mathbb{R}^n \to \mathbb{R}^m (vector output), the "gradient" is not well-defined as a single vector; the Jacobian is the correct generalization. Each row of the Jacobian is the gradient of one component.

Watch Out

JVP vs VJP is not just a transpose

A Jacobian-vector product Jv and a vector-Jacobian product u^T J involve the same matrix J but are computed differently by autodiff. They are not interchangeable: JVPs propagate perturbations forward (tangent mode), while VJPs propagate sensitivities backward (adjoint/cotangent mode). For a composition f = f_L \circ \cdots \circ f_1, the JVP consumes the Jacobian product starting from the input side (J_{f_1}); the VJP starts from the output side (J_{f_L}). This choice of multiplication order is the difference between forward mode and reverse mode automatic differentiation.

Watch Out

Jacobian determinant only makes sense for square Jacobians

If f: \mathbb{R}^n \to \mathbb{R}^m with m \neq n, the Jacobian is rectangular and has no determinant. Volume scaling in this case is described by \sqrt{\det(J^T J)} (for m > n, measuring how f scales n-dimensional volumes in \mathbb{R}^m). The change-of-variables formula with |\det J| requires m = n.
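A one-line check of the rectangular case: for a curve f: \mathbb{R} \to \mathbb{R}^2, the factor \sqrt{\det(J^T J)} reduces to the speed \|f'(t)\|. The unit circle makes the answer easy to verify by hand:

```python
import numpy as np

def J(t):
    # f(t) = (cos t, sin t); the Jacobian is the 2x1 tangent vector f'(t).
    return np.array([[-np.sin(t)],
                     [ np.cos(t)]])

t = 0.7
Jt = J(t)

# sqrt(det(JᵀJ)) = ||f'(t)||, which is 1 everywhere on the unit circle.
print(np.sqrt(np.linalg.det(Jt.T @ Jt)))
```

Here JᵀJ is the 1 × 1 matrix [sin²t + cos²t] = [1], so the 1-dimensional volume (arc length) scaling factor is 1, as expected for a unit-speed parametrization.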

Summary

  • The Jacobian J_f \in \mathbb{R}^{m \times n} contains all first partial derivatives: [J]_{ij} = \partial f_i / \partial x_j
  • Linear approximation: f(x + h) \approx f(x) + J_f(x) \cdot h
  • Chain rule: J_{f \circ g} = J_f \cdot J_g; Jacobians multiply
  • Jacobian determinant: measures local volume scaling; gives the r in r \, dr \, d\theta
  • Inverse function theorem: if \det J_f \neq 0, f is locally invertible
  • Forward mode autodiff computes JVPs (Jv); reverse mode computes VJPs (u^T J)
  • For a scalar loss (m = 1), a VJP is exactly a gradient computation, which is why backprop works

Exercises

ExerciseCore

Problem

Compute the Jacobian of the polar-to-Cartesian transformation f(r, \theta) = (r\cos\theta, \; r\sin\theta)^T at the point (r, \theta) = (2, \pi/4). Verify that the Jacobian determinant equals r = 2.

ExerciseAdvanced

Problem

Let f: \mathbb{R}^3 \to \mathbb{R}^2 be defined by f(x, y, z) = (x^2 + yz, \; e^x \sin(y+z))^T.

(a) Compute the Jacobian J_f(x, y, z). (b) At the point (0, 0, 0), compute the JVP J_f \cdot v for v = (1, 1, 1)^T. (c) At the same point, compute the VJP u^T \cdot J_f for u = (1, 0)^T.

References

Canonical:

  • Spivak, Calculus on Manifolds (1965), Chapter 2
  • Rudin, Principles of Mathematical Analysis (1976), Chapter 9

Current:

  • Griewank & Walther, Evaluating Derivatives (2008), for the autodiff connection
  • Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018)

Next Topics

The natural next steps from the Jacobian:

  • Automatic differentiation: how JVPs and VJPs are computed efficiently, forward mode vs. reverse mode, and the computational graph perspective
  • The Hessian matrix: the Jacobian of the gradient, encoding second-order curvature information

Last reviewed: April 2026
