Calculus Objects
The Jacobian Matrix
The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
Why This Matters
The Jacobian is the derivative of a vector-valued function. It is the matrix that makes the chain rule work for functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. Backpropagation is a chain of Jacobian multiplications. When you compose layers in a neural network, the chain rule multiplies Jacobians. Forward mode automatic differentiation computes Jacobian-vector products (JVPs). Reverse mode (backprop) computes vector-Jacobian products (VJPs).
Beyond deep learning, the Jacobian appears in coordinate transformations (the Jacobian determinant is the volume change factor), in the inverse function theorem, and in sensitivity analysis throughout science and engineering.
Mental Model
For a scalar function $f: \mathbb{R} \to \mathbb{R}$, the derivative gives the best linear approximation: $f(x + h) \approx f(x) + f'(x)\,h$.
For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ (a map between spaces studied in differentiation in $\mathbb{R}^n$), the Jacobian matrix plays exactly the same role. It is the matrix $J_f(x)$ such that:
$$f(x + h) \approx f(x) + J_f(x)\,h$$
The Jacobian generalizes the derivative to maps between multi-dimensional spaces. Each row is the gradient of one output component; each column shows how one input affects all outputs.
Core Definitions
Jacobian Matrix
For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}^m$ with component functions $f_1, \dots, f_m$, the Jacobian matrix at a point $x$ is the $m \times n$ matrix:
$$\big(J_f(x)\big)_{ij} = \frac{\partial f_i}{\partial x_j}(x)$$
Explicitly:
$$J_f(x) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}$$
Row $i$ is the gradient of the $i$-th component $f_i$ (transposed). Column $j$ shows how all outputs change when $x_j$ is perturbed.
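The definition can be checked numerically. A minimal sketch, with an illustrative map $f: \mathbb{R}^2 \to \mathbb{R}^3$, comparing a central finite-difference Jacobian against the analytic one:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

# Illustrative map f(x, y) = (xy, sin x, y^2); row i is the gradient of f_i.
f = lambda v: np.array([v[0] * v[1], np.sin(v[0]), v[1] ** 2])
x = np.array([1.0, 2.0])
J = numerical_jacobian(f, x)

# Analytic Jacobian at (1, 2): [[y, x], [cos x, 0], [0, 2y]]
J_true = np.array([[2.0, 1.0], [np.cos(1.0), 0.0], [0.0, 4.0]])
assert np.allclose(J, J_true, atol=1e-5)
```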
Special Cases by Dimension
The Jacobian unifies several familiar derivative objects:
- $m = 1$ (scalar-valued): the Jacobian is a $1 \times n$ row vector, the transpose of the gradient: $J_f(x) = \nabla f(x)^\top$.
- $n = 1$ (curve in space): the Jacobian is an $m \times 1$ column vector, the tangent vector $f'(t)$.
- $m = n$ (square case): the Jacobian is $n \times n$ and has a well-defined determinant.
The Linear Approximation
The defining property of the Jacobian is that it gives the best linear approximation:
$$f(x + h) = f(x) + J_f(x)\,h + o(\|h\|)$$
where $o(\|h\|)$ means the error goes to zero faster than $\|h\|$. This is the multi-dimensional version of $f(x + h) \approx f(x) + f'(x)\,h$. The Jacobian is the unique matrix with this property (when $f$ is differentiable).
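The $o(\|h\|)$ behavior is easy to observe numerically: for a smooth map the error of the linear approximation shrinks like $\|h\|^2$, so the ratio error$/\|h\|$ tends to zero. A small illustrative check (the map and point are arbitrary choices):

```python
import numpy as np

# Check f(x + h) = f(x) + J h + o(||h||): the ratio ||error|| / ||h||
# should shrink as h -> 0.
f = lambda v: np.array([v[0] * v[1], v[0] ** 2 + v[1]])
x = np.array([1.0, 2.0])
J = np.array([[x[1], x[0]], [2 * x[0], 1.0]])  # analytic Jacobian at x

direction = np.array([0.6, 0.8])  # unit vector
ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * direction
    error = f(x + h) - (f(x) + J @ h)
    ratios.append(np.linalg.norm(error) / np.linalg.norm(h))

# The ratio shrinks roughly linearly in ||h|| (the error is O(||h||^2) here).
assert ratios[0] > ratios[1] > ratios[2]
```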
The Chain Rule in Matrix Form
Chain Rule for Jacobians
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$. The Jacobian of the composition $g \circ f$ is the product of Jacobians:
$$J_{g \circ f}(x) = J_g(f(x)) \, J_f(x)$$
This is a $p \times n$ matrix, computed as the product of a $p \times m$ matrix and an $m \times n$ matrix.
Intuition
The chain rule says: the linear approximation of a composition is the composition of the linear approximations. Since linear maps compose by matrix multiplication, Jacobians multiply. Each Jacobian represents "how much things change locally," and the chain rule says these local changes multiply.
Proof Sketch
Using the linear approximation:
$$g(f(x + h)) \approx g\big(f(x) + J_f(x)\,h\big) \approx g(f(x)) + J_g(f(x))\,J_f(x)\,h$$
The first step uses the linear approximation of $f$, and the second uses the linear approximation of $g$. The composite linear map is $J_g(f(x))\,J_f(x)$, which is the Jacobian of $g \circ f$ by uniqueness of the derivative.
Making this rigorous requires controlling the error terms, which is done via the standard multivariable chain rule proof.
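The product rule for Jacobians can be sanity-checked numerically. A minimal sketch with illustrative maps $f: \mathbb{R}^2 \to \mathbb{R}^3$ and $g: \mathbb{R}^3 \to \mathbb{R}^2$, approximating each Jacobian by central differences:

```python
import numpy as np

def jac(fun, x, eps=1e-6):
    """Central finite-difference Jacobian of fun at x."""
    fx = np.asarray(fun(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return J

f = lambda v: np.array([v[0] ** 2, v[0] * v[1], np.exp(v[1])])
g = lambda w: np.array([w[0] + w[1] * w[2], np.sin(w[0])])
x = np.array([0.5, -0.3])

lhs = jac(lambda v: g(f(v)), x)   # Jacobian of the composition (2 x 2)
rhs = jac(g, f(x)) @ jac(f, x)    # product of Jacobians: (2 x 3) @ (3 x 2)
assert np.allclose(lhs, rhs, atol=1e-4)
```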
Why It Matters
This is the theoretical foundation of backpropagation. A neural network is a composition of layers: $f = f_L \circ \cdots \circ f_2 \circ f_1$. The Jacobian of the entire network is:
$$J_f = J_{f_L} \cdots J_{f_2} \, J_{f_1}$$
with each factor evaluated at the corresponding intermediate activation.
Backprop computes this product efficiently from right to left (reverse mode), yielding vector-Jacobian products. The chain rule is the reason backpropagation works.
Failure Mode
The chain rule requires differentiability. Functions with kinks (like ReLU at zero) are not differentiable at those points. In practice, deep learning frameworks define a subgradient or use the derivative from one side (e.g., ReLU'(0) = 0 by convention). The chain rule still works almost everywhere since the non-differentiable points have measure zero.
The Jacobian Determinant
When $m = n$ (same input and output dimension), the Jacobian is a square matrix and has a determinant.
Jacobian Determinant
The Jacobian determinant $\det J_f(x)$ measures how $f$ locally scales volumes. If $R$ is a small region near $x$:
$$\operatorname{vol}(f(R)) \approx |\det J_f(x)| \cdot \operatorname{vol}(R)$$
This is the basis for the change of variables formula in integration:
$$\int_{f(R)} g(y)\, dy = \int_R g(f(x)) \, |\det J_f(x)| \, dx$$
The sign of the Jacobian determinant indicates whether $f$ preserves or reverses orientation. A positive determinant means the orientation is preserved; negative means it is reversed.
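The volume-scaling interpretation can be verified by pushing a tiny square through a map and comparing the image area to $|\det J|$. A sketch with an illustrative map and point:

```python
import numpy as np

# Estimate how f scales area near a point by mapping a tiny square and
# comparing the image parallelogram's area to |det J_f| * eps^2.
f = lambda v: np.array([v[0] ** 2 - v[1], v[0] * v[1]])
x = np.array([1.5, 0.5])
J = np.array([[2 * x[0], -1.0], [x[1], x[0]]])  # analytic Jacobian at x

eps = 1e-4
# Images of the two edge vectors of a tiny square based at x.
u = f(x + np.array([eps, 0.0])) - f(x)
v = f(x + np.array([0.0, eps])) - f(x)
# Area of the image parallelogram via the 2D cross product.
image_area = abs(u[0] * v[1] - u[1] * v[0])
assert np.isclose(image_area / eps ** 2, abs(np.linalg.det(J)), rtol=1e-3)
```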
The Inverse Function Theorem
Inverse Function Theorem
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable and suppose $\det J_f(a) \neq 0$ at a point $a$. Then there exists a neighborhood of $a$ on which $f$ is a diffeomorphism (smooth bijection with smooth inverse), and the Jacobian of the inverse at $b = f(a)$ is:
$$J_{f^{-1}}(b) = \big[J_f(a)\big]^{-1}$$
The derivative of the inverse is the inverse of the derivative.
Intuition
If the Jacobian at a point is invertible (nonzero determinant), then $f$ is locally one-to-one and onto. It does not collapse any directions. The linear approximation is invertible, so the nonlinear function is locally invertible too. And the derivative of the inverse is what you would guess: invert the derivative.
Proof Sketch
The proof uses the contraction mapping theorem. For a target value $y$ near $f(a)$, define $\varphi(x) = x + J_f(a)^{-1}\big(y - f(x)\big)$. Since $J_f(a)$ is invertible and $f$ is $C^1$, $\varphi$ is a contraction near $a$, so it has a unique fixed point $x$ with $f(x) = y$. This gives local invertibility. The formula for the Jacobian of the inverse follows from the chain rule applied to $f^{-1} \circ f = \mathrm{id}$: $J_{f^{-1}}(f(x)) \, J_f(x) = I$.
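The inverse-Jacobian formula can be checked directly on a map whose inverse is known in closed form; the map $f(x, y) = (e^x,\, x + y^3)$ below is an illustrative choice:

```python
import numpy as np

# f(x, y) = (e^x, x + y^3) has inverse f^{-1}(u, v) = (ln u, (v - ln u)^{1/3}).
# Check J_{f^{-1}}(f(a)) = [J_f(a)]^{-1} at a sample point a.
a = np.array([0.5, 1.0])
J_f = np.array([[np.exp(a[0]), 0.0],
                [1.0, 3 * a[1] ** 2]])       # analytic Jacobian of f at a

b_u, b_v = np.exp(a[0]), a[0] + a[1] ** 3    # b = f(a)
w = b_v - np.log(b_u)                        # equals y^3 at the point

# Analytic Jacobian of f^{-1} at b (rows: d(ln u), d((v - ln u)^{1/3})).
J_inv = np.array([[1.0 / b_u, 0.0],
                  [-(1.0 / b_u) * (1.0 / 3.0) * w ** (-2.0 / 3.0),
                   (1.0 / 3.0) * w ** (-2.0 / 3.0)]])

assert np.allclose(J_inv, np.linalg.inv(J_f))
```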
Why It Matters
The inverse function theorem is foundational in differential geometry, in implicit methods for ODEs, and, in machine learning, in normalizing flows. Normalizing flows require computing $\det J_f$ to evaluate the density of transformed distributions via the change of variables formula. The inverse function theorem guarantees that the flow is locally invertible whenever the Jacobian is nonsingular.
Failure Mode
The theorem is local: $f$ may be locally invertible near $a$ but not globally invertible. Example: $f(x, y) = (e^x \cos y, \, e^x \sin y)$ on $\mathbb{R}^2$ has $\det J_f(x, y) = e^{2x} \neq 0$ everywhere, so $f$ is locally invertible near every point. But $f$ is not globally invertible ($f(x, y) = f(x, y + 2\pi)$). Also, the Jacobian must be nonsingular at the point in question. If $\det J_f(a) = 0$, the function may fold or collapse directions near $a$.
The Jacobian in Backpropagation
The connection between the Jacobian and automatic differentiation is fundamental:
Jacobian-vector product (JVP), forward mode:
Given a vector $v \in \mathbb{R}^n$, compute $J_f(x)\,v$. This tells you how a perturbation $v$ in the input propagates to the output. Cost: one forward pass. Efficient when $n \ll m$ (few inputs, many outputs).
Vector-Jacobian product (VJP), reverse mode (backpropagation):
Given a vector $u \in \mathbb{R}^m$, compute $u^\top J_f(x)$. This tells you how each input component affects the output along direction $u$. Cost: one backward pass. Efficient when $m \ll n$ (few outputs, many inputs).
For a loss function $L: \mathbb{R}^n \to \mathbb{R}$ (one output, $n$ parameters), reverse mode computes the gradient $\nabla L$ in one backward pass. This is why backprop is efficient for neural network training.
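The two accumulation orders can be sketched with explicit per-layer Jacobian matrices; random matrices stand in for the layers' Jacobians in this illustration:

```python
import numpy as np

# jacs holds per-layer Jacobians J_1, ..., J_L; the full Jacobian of the
# composition is J = J_L @ ... @ J_1.
rng = np.random.default_rng(0)
dims = [5, 7, 4, 1]                       # n = 5 inputs -> ... -> m = 1 output
jacs = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]

def jvp(v):
    """Forward mode: propagate a tangent v from input to output (J @ v)."""
    for J in jacs:
        v = J @ v
    return v

def vjp(u):
    """Reverse mode: propagate a cotangent u from output to input (u^T J)."""
    for J in reversed(jacs):
        u = u @ J
    return u

J_full = jacs[2] @ jacs[1] @ jacs[0]
v = rng.standard_normal(5)
assert np.allclose(jvp(v), J_full @ v)
# With one output, the VJP with u = [1] is the full 1 x n Jacobian,
# i.e. the gradient.
assert np.allclose(vjp(np.ones(1)), J_full.ravel())
```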
Canonical Examples
Jacobian of polar-to-Cartesian transformation
The transformation from polar coordinates $(r, \theta)$ to Cartesian $(x, y)$ is:
$$x = r \cos\theta, \qquad y = r \sin\theta$$
The Jacobian is:
$$J = \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix}$$
The Jacobian determinant is:
$$\det J = r \cos^2\theta + r \sin^2\theta = r$$
This is the familiar factor $r$ in the polar coordinate area element $dA = r \, dr \, d\theta$. The Jacobian determinant directly gives you the area scaling factor for coordinate transformations.
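A quick numeric check of $\det J = r$ at an arbitrary sample point:

```python
import numpy as np

# Polar-to-Cartesian Jacobian at an illustrative point (r, theta) = (2.0, 0.7).
r, theta = 2.0, 0.7
J = np.array([[np.cos(theta), -r * np.sin(theta)],
              [np.sin(theta),  r * np.cos(theta)]])

# det J = r cos^2(theta) + r sin^2(theta) = r
assert np.isclose(np.linalg.det(J), r)
```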
Jacobian of a two-layer neural network
Consider a simple network $f(x) = W_2 \, \sigma(W_1 x)$ where $\sigma$ is an elementwise activation. By the chain rule:
$$J_f(x) = W_2 \, \operatorname{diag}\!\big(\sigma'(z_1)\big) \, W_1$$
where $z_1 = W_1 x$ and $\operatorname{diag}(\sigma'(z_1))$ is the diagonal matrix of elementwise derivatives.
Each layer contributes its Jacobian as a factor in the product. The $\operatorname{diag}$ terms come from the elementwise nonlinearity. Backpropagation computes VJPs through this product right-to-left.
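The formula can be checked against finite differences; the shapes, random weights, and the choice $\sigma = \tanh$ are illustrative:

```python
import numpy as np

# Jacobian of f(x) = W2 @ tanh(W1 @ x) via the chain-rule formula
# J = W2 @ diag(tanh'(z1)) @ W1, checked against central differences.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

z1 = W1 @ x
J_formula = W2 @ np.diag(1 - np.tanh(z1) ** 2) @ W1  # tanh' = 1 - tanh^2

f = lambda v: W2 @ np.tanh(W1 @ v)
eps = 1e-6
J_fd = np.zeros((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_fd[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

assert np.allclose(J_formula, J_fd, atol=1e-5)
```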
Common Confusions
The Jacobian is a matrix, the gradient is a vector
For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), the gradient $\nabla f$ is a column vector, while the Jacobian is a row vector. They carry the same information: $J_f = \nabla f^\top$. But for $f: \mathbb{R}^n \to \mathbb{R}^m$ (vector output), the "gradient" is not well-defined as a single vector; the Jacobian is the correct generalization. Each row of the Jacobian is the gradient of one component.
JVP vs VJP is not just a transpose
A Jacobian-vector product and a vector-Jacobian product involve the same matrix but are computed differently by autodiff. They are not interchangeable: JVPs propagate perturbations forward (tangent mode), while VJPs propagate sensitivities backward (adjoint/cotangent mode). For a composition $f = f_L \circ \cdots \circ f_1$ with Jacobian product $J_{f_L} \cdots J_{f_1}$, the JVP accumulates the product starting from $J_{f_1}$ (following the forward evaluation), while the VJP accumulates it starting from $J_{f_L}$. This choice of multiplication order is the difference between forward mode and reverse mode automatic differentiation.
Jacobian determinant only makes sense for square Jacobians
If $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m \neq n$, the Jacobian is rectangular and has no determinant. Volume scaling in this case is described by $\sqrt{\det\!\big(J^\top J\big)}$ (for $n \leq m$, measuring how $f$ scales $n$-dimensional volumes in $\mathbb{R}^m$). The change-of-variables formula with $|\det J|$ requires $m = n$.
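A small example of the rectangular case, using a helix $f(t) = (\cos t, \sin t, t)$ as an illustrative curve: here $\sqrt{\det(J^\top J)}$ is the speed $\|f'(t)\|$, which equals $\sqrt{2}$ at every $t$.

```python
import numpy as np

# For a non-square Jacobian, sqrt(det(J^T J)) gives the n-dimensional
# volume scaling factor. For the helix f(t) = (cos t, sin t, t), the
# 3x1 Jacobian is the tangent vector and the factor is the speed.
t = 0.9
J = np.array([[-np.sin(t)], [np.cos(t)], [1.0]])   # 3 x 1 Jacobian
scale = np.sqrt(np.linalg.det(J.T @ J))
assert np.isclose(scale, np.sqrt(2))
```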
Summary
- The Jacobian contains all first partial derivatives: $(J_f)_{ij} = \partial f_i / \partial x_j$
- Linear approximation: $f(x + h) \approx f(x) + J_f(x)\,h$
- Chain rule: $J_{g \circ f}(x) = J_g(f(x)) \, J_f(x)$; Jacobians multiply
- Jacobian determinant: measures local volume scaling; gives the factor $|\det J_f(x)|$ in the change of variables formula
- Inverse function theorem: if $\det J_f(a) \neq 0$, $f$ is locally invertible, with $J_{f^{-1}}(f(a)) = [J_f(a)]^{-1}$
- Forward mode autodiff computes JVPs ($J v$); reverse mode computes VJPs ($u^\top J$)
- For a scalar loss ($m = 1$), the VJP is exactly a gradient computation, which is why backprop works
Exercises
Problem
Compute the Jacobian of the polar-to-Cartesian transformation at a general point $(r, \theta)$. Verify that the Jacobian determinant equals $r$.
Problem
Let be defined by .
(a) Compute the Jacobian . (b) At the point , compute the JVP for . (c) At the same point, compute the VJP for .
References
Canonical:
- Spivak, Calculus on Manifolds (1965), Chapter 2
- Rudin, Principles of Mathematical Analysis (1976), Chapter 9
Current:
- Griewank & Walther, Evaluating Derivatives (2008), for the autodiff connection
- Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018)
Next Topics
The natural next steps from the Jacobian:
- Automatic differentiation: how JVPs and VJPs are computed efficiently, forward mode vs. reverse mode, and the computational graph perspective
- The Hessian matrix: the Jacobian of the gradient, encoding second-order curvature information
Last reviewed: April 2026
Builds on This
- Automatic Differentiation (Layer 1)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Implicit Differentiation (Layer 2)
- Inverse and Implicit Function Theorem (Layer 0A)
- Matrix Calculus (Layer 1)
- Normalizing Flows (Layer 3)
- Physics-Informed Neural Networks (Layer 4)