# Vector Calculus Chain Rule
The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.
## Why This Matters
The single-variable chain rule says $(f \circ g)'(x) = f'(g(x)) \, g'(x)$. The multivariable version replaces those scalar derivatives with Jacobians, and the product becomes a matrix product. That one upgrade is enough to derive backpropagation in one line: a deep network is a composition $f_L \circ \cdots \circ f_1$, its Jacobian is the product $J_{f_L} \cdots J_{f_1}$, and the gradient of a scalar loss is a vector-Jacobian product evaluated right-to-left.
This page states the rule with assumptions, gives the matrix form, derives the gradient corollary, and shows the implicit chain rule that handles constraints. It does not re-derive backprop end-to-end; that lives on its own page.
## The Rule

### Multivariable Chain Rule
**Statement.** Let $g: \mathbb{R}^n \to \mathbb{R}^m$ be differentiable at $x$, with $y = g(x)$, and let $f: \mathbb{R}^m \to \mathbb{R}^p$ be differentiable at $y$. Then the composition $f \circ g$ is differentiable at $x$ and its Jacobian satisfies

$$J_{f \circ g}(x) = J_f(g(x)) \, J_g(x),$$

where the right-hand side is matrix multiplication of a $p \times m$ matrix with an $m \times n$ matrix.
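The statement can be sanity-checked numerically. The maps below are hypothetical examples chosen for illustration ($g: \mathbb{R}^2 \to \mathbb{R}^3$, $f: \mathbb{R}^3 \to \mathbb{R}^2$), with Jacobians estimated by central finite differences; the tolerances are likewise illustrative:

```python
# Check J_{f∘g}(x) = J_f(g(x)) J_g(x) on a small example.
import numpy as np

def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    return np.array([y[0] + y[1], y[1] * y[2]])

def jacobian(fn, x, eps=1e-6):
    """Central finite differences: one column per input coordinate."""
    fx = fn(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return J

x = np.array([0.7, -1.3])
lhs = jacobian(lambda t: f(g(t)), x)      # Jacobian of the composition
rhs = jacobian(f, g(x)) @ jacobian(g, x)  # product of Jacobians
assert np.allclose(lhs, rhs, atol=1e-5)
```

The check passes for any differentiable pair of maps; only the shapes ($2 \times 3$ times $3 \times 2$ here) change.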
**Intuition.** Differentiability of $g$ at $x$ means $g$ admits a best linear approximation: $g(x + h) = g(x) + J_g(x) h + o(\|h\|)$. Apply $f$ and use its own linear approximation at $g(x)$. The composed approximation is linear with matrix $J_f(g(x)) J_g(x)$. Because the linear approximation of a differentiable map is unique, this matrix is the Jacobian of $f \circ g$.
**Proof Sketch.** By assumption $g(x + h) = g(x) + J_g(x) h + r_g(h)$ with $r_g(h) = o(\|h\|)$, and $f(y + k) = f(y) + J_f(y) k + r_f(k)$ with $r_f(k) = o(\|k\|)$. Set $k = J_g(x) h + r_g(h)$ and substitute:

$$f(g(x + h)) = f(g(x)) + J_f(g(x)) \big( J_g(x) h + r_g(h) \big) + r_f(k).$$

The term $J_f(g(x)) r_g(h)$ is $o(\|h\|)$ because $J_f(g(x))$ is a fixed linear map. The term $r_f(k)$ is $o(\|k\|)$ and $\|k\| = O(\|h\|)$, so it is $o(\|h\|)$ as well. The remainder is $o(\|h\|)$, identifying $J_f(g(x)) J_g(x)$ as the Jacobian of $f \circ g$.
**Why It Matters.** Every layer in a neural network is a function $f_i: \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$. A depth-$L$ network is the composition $f_L \circ \cdots \circ f_1$. The chain rule gives $J = J_{f_L} \cdots J_{f_1}$, a product of $L$ matrices. Forward-mode autodiff evaluates this product from the input end as Jacobian-vector products; reverse-mode (backprop) evaluates it from the loss end as vector-Jacobian products, starting from the loss gradient. For a scalar loss with $n_0$ inputs, the choice of order is the difference between $O(n_0 \sum_i n_i n_{i+1})$ and $O(\sum_i n_i n_{i+1})$ flops with asymmetric memory profiles, which is why reverse mode wins for scalar losses with many parameters.
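The cost asymmetry can be sketched by counting multiply-adds for a hypothetical chain of layer widths. The widths below are illustrative, not from the text, and the count ignores the forward pass itself:

```python
# Multiply-add counts for the full gradient of a scalar loss through a
# chain of Jacobians with layer widths n_0, ..., n_L (n_L = 1).
widths = [1000, 1000, 1000, 1000, 1]  # hypothetical sizes, scalar output

# Reverse mode: one sweep of vector-Jacobian products down the chain;
# each step with an n_i x n_{i+1} Jacobian costs n_i * n_{i+1} flops.
reverse_flops = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))

# Forward mode needs one Jacobian-vector sweep per input coordinate,
# i.e. n_0 sweeps of the same per-step cost, to recover the gradient.
forward_flops = widths[0] * reverse_flops

print(f"reverse: {reverse_flops:,}  forward: {forward_flops:,}")
```

With these widths reverse mode does about 3 million multiply-adds and forward mode a thousand times more, at the price of storing intermediate activations for the backward sweep.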
**Failure Mode.** Differentiability at the inner point is required, not just continuity. Composing a non-differentiable inner map with a differentiable outer one breaks the rule even when the composition happens to be differentiable. ReLU activations are not differentiable at zero, so neural-network practice uses subgradients there; modern autodiff frameworks pick a convention (typically $0$ or $1$) and document it.
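A minimal sketch of the convention choice, using a hypothetical `relu_grad` helper (the name and the `at_zero` parameter are illustrative, not any framework's API):

```python
# ReLU is not differentiable at 0; any value in [0, 1] is a valid
# subgradient there, so the value at exactly 0 is a free convention.
import numpy as np

def relu_grad(x, at_zero=0.0):
    """Elementwise ReLU 'derivative' with a chosen subgradient at 0."""
    grad = (x > 0).astype(float)
    grad[x == 0] = at_zero  # the convention lives on this one line
    return grad

x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x, at_zero=0.0))  # one common convention
print(relu_grad(x, at_zero=1.0))  # an equally valid alternative
```

Away from zero the two conventions agree; they differ only on the measure-zero set where the true derivative does not exist.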
## Gradient Corollary

For scalar-valued compositions the chain rule has a familiar special case. Let $g: \mathbb{R}^n \to \mathbb{R}^m$ and $L: \mathbb{R}^m \to \mathbb{R}$ be differentiable. Define $\ell(x) = L(g(x))$ for $x \in \mathbb{R}^n$. By the theorem, $J_\ell(x) = J_L(g(x)) \, J_g(x)$. Since $L$ is scalar-valued, $J_L(g(x))$ is a row vector of size $m$, and $J_g(x)$ is $m \times n$. The product is a row vector of size $n$, so

$$\nabla (L \circ g)(x) = J_g(x)^T \, \nabla L(g(x)).$$

This is a vector-Jacobian product (VJP): the upstream gradient $\nabla L(g(x))$ is propagated backward by left-multiplying by $J_g(x)^T$. Backprop is exactly this identity applied recursively layer by layer.
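The identity $\nabla(L \circ g) = J_g^T \, \nabla L$ can be checked numerically. The maps below are hypothetical examples chosen for illustration: $g: \mathbb{R}^3 \to \mathbb{R}^2$ with an analytic Jacobian, and $L(y) = \tfrac{1}{2}\|y\|^2$ so that $\nabla L(y) = y$:

```python
# Check the gradient corollary: VJP vs. finite-difference gradient.
import numpy as np

def g(x):
    return np.array([x[0] * x[2], np.tanh(x[1])])

def L(y):
    return 0.5 * float(y @ y)  # so that grad L(y) = y

def num_grad(fn, x, eps=1e-6):
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        grad[j] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return grad

x = np.array([0.4, -0.9, 1.5])
y = g(x)
J_g = np.array([[x[2], 0.0, x[0]],
                [0.0, 1.0 - np.tanh(x[1]) ** 2, 0.0]])  # analytic Jacobian
vjp = J_g.T @ y  # J_g^T times the upstream gradient (here grad L(y) = y)
assert np.allclose(vjp, num_grad(lambda t: L(g(t)), x), atol=1e-5)
```

Note the VJP never forms $J_{L \circ g}$ explicitly; it only ever multiplies a vector through one transposed Jacobian at a time, which is what makes reverse mode cheap.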
## Implicit Function Chain Rule
Sometimes a variable is defined implicitly: $F(x, y) = 0$ specifies $y$ as a function of $x$ near a point where $\partial F / \partial y$ is invertible. The implicit function theorem guarantees a smooth $y(x)$ exists locally, and differentiating $F(x, y(x)) = 0$ via the chain rule gives

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}.$$

[Implicit differentiation](/topics/implicit-differentiation) underpins meta-learning by implicit gradients, deep-equilibrium models (DEQs), and the gradient of an [optimization](/topics/projected-gradient-descent) solution with respect to its parameters.

## Worked Example: Two-Layer Network

Let $g(x) = \sigma(W_1 x)$ and $h(x) = W_2 g(x)$ for activation $\sigma$ applied componentwise, with $x \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$. The Jacobian of the inner map is $J_g(x) = \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1$, an $m \times n$ matrix. The Jacobian of the outer map at $y = g(x)$ is $W_2$. By the chain rule the composed Jacobian is

$$J_h(x) = W_2 \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1.$$

For a scalar loss $L(h(x))$ the gradient with respect to $x$ is the VJP $\nabla_x L = W_1^T \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_2^T \cdot \nabla L$, evaluated right-to-left, which is precisely what backprop computes.

## Common Confusions

<Confusion title="Order of multiplication matters">
The product $J_f \cdot J_g$ is not the same as $J_g \cdot J_f$ in general, because matrix multiplication is non-commutative and even the shapes typically disagree. The outer Jacobian $J_f$ is evaluated at the inner output $g(x)$ and goes on the left. Reversing the order is the most common chain-rule mistake on multivariable calculus exams.
</Confusion>

<Confusion title="Gradients are not Jacobians">
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient $\nabla f$ is a column vector and the Jacobian is the row vector $J_f = (\nabla f)^T$. Many sources collapse the distinction and write $\nabla(f \circ g) = J_g^T \nabla f$, hiding the transpose inside the notation.
When you implement autodiff or read papers, track the orientation explicitly: the chain rule gives row vectors via $J_L J_g$ and gradients via $J_g^T \nabla L$.
</Confusion>

## Exercises

<Exercise difficulty="core">
<Problem>
Let $f(x, y) = x^2 + y^2$ and let $g(t) = (t \cos t, t \sin t)$ trace out a spiral. Compute $\frac{d}{dt}(f \circ g)(t)$ via the chain rule, then verify by direct computation.
</Problem>
<Hint>
The Jacobian of $f$ as a $1 \times 2$ row is $(2x, 2y)$. The Jacobian of $g$ as a $2 \times 1$ column is $(\cos t - t \sin t, \sin t + t \cos t)^T$. Multiply.
</Hint>
<Solution>
By the chain rule, $\frac{d}{dt}(f \circ g) = J_f(g(t)) \cdot J_g(t)$. Substituting $x = t \cos t$, $y = t \sin t$: $J_f(g(t)) = (2 t \cos t, 2 t \sin t)$. Then

$$J_f \cdot J_g = 2 t \cos t (\cos t - t \sin t) + 2 t \sin t (\sin t + t \cos t) = 2 t (\cos^2 t + \sin^2 t) + 2 t^2 (\sin t \cos t - \sin t \cos t) = 2 t.$$

Direct: $f(g(t)) = t^2 \cos^2 t + t^2 \sin^2 t = t^2$, and $d/dt[t^2] = 2t$. The chain rule and direct computation agree.
</Solution>
</Exercise>

<Exercise difficulty="advanced">
<Problem>
Use the chain rule to derive the gradient of the squared loss $L(W) = \frac{1}{2} \|W x - y\|^2$ with respect to $W \in \mathbb{R}^{m \times n}$ for fixed $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. Show that $\nabla_W L = (W x - y) x^T$.
</Problem>
<Hint>
Treat $W \mapsto W x$ as a linear map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^m$. Its Jacobian (as a linear operator on matrices) is $M \mapsto M x$. Compose with $r \mapsto \frac{1}{2}\|r - y\|^2$, whose gradient is $r - y$, and apply the matrix version of the chain rule using the trace inner product $\langle A, B \rangle = \mathrm{tr}(A^T B)$.
</Hint>
<Solution>
Let $r = W x - y$. Then $\nabla_r L = r$. The directional derivative of $W \mapsto W x$ in direction $\Delta W$ is $\Delta W \cdot x$.
The chain rule (under the trace inner product) gives the gradient as the unique matrix $G$ such that $\langle G, \Delta W \rangle = \langle r, \Delta W \cdot x \rangle$ for all $\Delta W$. The right side equals $\mathrm{tr}(\Delta W^T \cdot r x^T) = \langle r x^T, \Delta W \rangle$. By uniqueness $G = r x^T = (W x - y) x^T$. This is the rank-one outer-product gradient that drives least-squares fitting and the linear layer in any neural network.
</Solution>
</Exercise>

## References

<ReferenceTabs>
<ReferenceTab id="rudin-1976" label="Rudin 1976" type="book">
Walter Rudin. *Principles of Mathematical Analysis* (3rd ed.). McGraw-Hill, 1976. Theorem 9.15: the chain rule for differentiable maps between Banach spaces. The cleanest statement and proof in the literature.
</ReferenceTab>
<ReferenceTab id="spivak-1965" label="Spivak 1965" type="book">
Michael Spivak. *Calculus on Manifolds*. W. A. Benjamin, 1965. Chapter 2.5: chain rule via best linear approximation. Short, modern, coordinate-free presentation.
</ReferenceTab>
<ReferenceTab id="apostol-1969" label="Apostol 1969" type="book">
Tom M. Apostol. *Calculus, Volume II* (2nd ed.). Wiley, 1969. Sections 8.10-8.12: multivariable chain rule with worked examples and the implicit-function variant.
</ReferenceTab>
<ReferenceTab id="baydin-2018" label="Baydin et al. 2018" type="paper">
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. *Automatic Differentiation in Machine Learning: A Survey*. JMLR 18, 2018. Section 3 develops forward and reverse mode as the two associativity orderings of chained Jacobians. [arXiv:1502.05767](https://arxiv.org/abs/1502.05767)
</ReferenceTab>
<ReferenceTab id="griewank-2008" label="Griewank and Walther 2008" type="book">
Andreas Griewank and Andrea Walther. *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation* (2nd ed.). SIAM, 2008. Chapter 3: matrix-form chain rule and the cheap-gradient principle.
</ReferenceTab>
<ReferenceTab id="magnus-2019" label="Magnus and Neudecker 2019" type="book">
Jan R. Magnus and Heinz Neudecker. *Matrix Differential Calculus with Applications in Statistics and Econometrics* (3rd ed.). Wiley, 2019. Chapter 5: matrix chain rule with the trace inner product, the working tool for matrix gradients in ML.
</ReferenceTab>
</ReferenceTabs>

## Related Topics

- [The Jacobian Matrix](/topics/the-jacobian-matrix)
- [Automatic Differentiation](/topics/automatic-differentiation)
- [Feedforward Networks and Backpropagation](/topics/feedforward-networks-and-backpropagation)
- [The Hessian Matrix](/topics/the-hessian-matrix)
- [Divergence, Curl, and Line Integrals](/topics/divergence-curl-and-line-integrals)

Last reviewed: April 18, 2026
## Prerequisites

Foundations this topic depends on.

- The Jacobian Matrix
- Vectors, Matrices, and Linear Maps