
# Vector Calculus Chain Rule

The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.


## Why This Matters

The single-variable chain rule says $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$. The multivariable version replaces those scalar derivatives with Jacobians, and the product becomes a matrix product. That one upgrade is enough to derive backpropagation in one line: a deep network is a composition $f_L \circ f_{L-1} \circ \cdots \circ f_1$, its Jacobian is the product $J_L J_{L-1} \cdots J_1$, and the gradient of a scalar loss is a vector-Jacobian product evaluated right-to-left.

This page states the rule with assumptions, gives the matrix form, derives the gradient corollary, and shows the implicit chain rule that handles constraints. It does not re-derive backprop end-to-end; that lives on its own page.

## The Rule

Theorem

Multivariable Chain Rule

Statement

Let $g: U \to V$ be differentiable at $x_0 \in U \subset \mathbb{R}^n$, with $g(U) \subset V \subset \mathbb{R}^m$, and let $f: V \to \mathbb{R}^p$ be differentiable at $y_0 = g(x_0)$. Then the composition $h = f \circ g$ is differentiable at $x_0$ and its Jacobian satisfies

$$J_h(x_0) = J_f(g(x_0)) \cdot J_g(x_0),$$

where the right-hand side is matrix multiplication of a $p \times m$ matrix with an $m \times n$ matrix.
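The statement can be sanity-checked numerically. The sketch below uses two maps invented for illustration ($g: \mathbb{R}^3 \to \mathbb{R}^2$, $f: \mathbb{R}^2 \to \mathbb{R}^2$) and approximates all Jacobians by central differences:

```python
import numpy as np

# Hypothetical maps, chosen only for illustration.
def g(x):                      # R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):                      # R^2 -> R^2
    return np.array([y[0] + y[1] ** 2, np.exp(y[0])])

def jacobian_fd(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    fx = func(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (func(x + e) - func(x - e)) / (2 * eps)
    return J

x0 = np.array([0.5, -1.2, 0.3])
J_chain = jacobian_fd(f, g(x0)) @ jacobian_fd(g, x0)   # J_f(g(x0)) · J_g(x0), a 2x3 matrix
J_direct = jacobian_fd(lambda x: f(g(x)), x0)          # Jacobian of the composition h = f ∘ g
assert np.allclose(J_chain, J_direct, atol=1e-5)
```

The two 2×3 matrices agree to finite-difference accuracy, as the theorem predicts.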

Intuition

Differentiability of $g$ at $x_0$ means $g$ admits a best linear approximation: $g(x_0 + h) = g(x_0) + J_g(x_0) h + o(\|h\|)$. Apply $f$ and use its own linear approximation at $g(x_0)$. The composed approximation is linear with matrix $J_f(g(x_0)) \cdot J_g(x_0)$. Because the linear approximation of a differentiable map is unique, this matrix is the Jacobian of $h$.

Proof Sketch

By assumption $g(x_0 + h) - g(x_0) = J_g(x_0) h + r_g(h)$ with $r_g(h) = o(\|h\|)$, and $f(y_0 + k) - f(y_0) = J_f(y_0) k + r_f(k)$ with $r_f(k) = o(\|k\|)$. Set $k = J_g(x_0) h + r_g(h)$ and substitute:

$$h(x_0 + h) - h(x_0) = J_f(y_0) J_g(x_0) h + J_f(y_0) r_g(h) + r_f(k).$$

The term $J_f(y_0) r_g(h)$ is $o(\|h\|)$ because $J_f(y_0)$ is a fixed linear map. The term $r_f(k)$ is $o(\|k\|)$ and $\|k\| \leq \|J_g(x_0)\| \|h\| + \|r_g(h)\|$, so $r_f(k) = o(\|h\|)$ as well. The remainder is $o(\|h\|)$, identifying $J_f(y_0) J_g(x_0)$ as the Jacobian of $h$.
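The $o(\|h\|)$ behavior of the remainder can be watched directly. In this sketch, $h$ and its hand-computed Jacobian are invented for illustration; shrinking the step $t$ by 10x shrinks the ratio $\|r\| / t$ by roughly 10x, confirming the remainder vanishes faster than linearly:

```python
import numpy as np

# Hypothetical smooth map h: R^2 -> R^2 and its exact Jacobian.
def h(x):
    return np.array([np.sin(x[0] * x[1]), x[0] * np.exp(x[1])])

def J_h(x):
    return np.array([[x[1] * np.cos(x[0] * x[1]), x[0] * np.cos(x[0] * x[1])],
                     [np.exp(x[1]),               x[0] * np.exp(x[1])]])

x0 = np.array([0.4, 0.9])
d = np.array([1.0, -2.0]) / np.sqrt(5.0)   # fixed unit direction

ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    # Remainder of the linear approximation at step size t.
    r = h(x0 + t * d) - h(x0) - J_h(x0) @ (t * d)
    ratios.append(np.linalg.norm(r) / t)   # ||r|| / ||t d||, should shrink with t
print(ratios)
```

Each entry of `ratios` is about a tenth of the previous one, the numerical signature of a remainder that is $O(t^2)$, hence $o(t)$.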

Why It Matters

Every layer in a neural network is a function $f_\ell: \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$. A depth-$L$ network is the composition $h = f_L \circ \cdots \circ f_1$. The chain rule gives $J_h = J_{f_L} \cdots J_{f_1}$, a product of $L$ matrices. Forward-mode autodiff multiplies left-to-right; reverse-mode (backprop) multiplies right-to-left starting from the loss gradient. For a scalar loss the choice of order is the difference between $O(d^3 L)$ flops (accumulating full $d \times d$ matrix products) and $O(d^2 L)$ flops (a single sweep of vector-Jacobian products), with asymmetric memory profiles, which is why reverse mode wins for scalar losses with many parameters.
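The two multiplication orders can be compared on stand-in Jacobians. This sketch uses random matrices in place of real layer Jacobians (in a network each $J_\ell$ would depend on the forward activations); both orders yield the same gradient, but the reverse sweep never forms a matrix-matrix product:

```python
import numpy as np

# Toy stand-ins for the layer Jacobians J_1, ..., J_L (invented for illustration).
rng = np.random.default_rng(0)
L_depth, d = 20, 64
Js = [rng.standard_normal((d, d)) / (2 * np.sqrt(d)) for _ in range(L_depth)]
v = rng.standard_normal(d)       # upstream gradient of the scalar loss

# Accumulating the full Jacobian: L-1 matrix-matrix products, O(d^3 L) flops.
J_full = Js[0]
for J in Js[1:]:
    J_full = J @ J_full          # after step k, J_full = J_k ... J_1
grad_via_full = J_full.T @ v

# Reverse mode: push the vector through each J^T, O(d^2 L) flops total.
grad = v
for J in reversed(Js):
    grad = J.T @ grad            # one vector-Jacobian product per layer

assert np.allclose(grad_via_full, grad)
```

Per layer, the full-Jacobian route costs $d^3$ multiplications against $d^2$ for the VJP, which is the cheap-gradient principle in miniature.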

Failure Mode

Differentiability at the inner point is required, not just continuity. Composing a non-differentiable inner map with a differentiable outer one breaks the rule even when the composition happens to be differentiable. ReLU activations are not differentiable at zero, so neural-network practice uses subgradients there; modern autodiff frameworks pick a convention (typically $0$ or $1$) and document it.
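A hand-rolled backward pass makes the convention explicit. The choice of $0$ at the kink below is one valid subgradient, picked as an assumption for this sketch; check your framework's documentation for the convention it actually uses:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention (an assumption here): derivative 0 at x == 0.
    # The strict inequality is what encodes that choice.
    return (x > 0).astype(x.dtype)

x = np.array([-1.0, 0.0, 2.0])
print(relu_grad(x))  # -> [0. 0. 1.]
```

Swapping `x > 0` for `x >= 0` would implement the other common convention (derivative $1$ at zero); any value in $[0, 1]$ is a valid subgradient there.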

## Gradient Corollary

For scalar-valued compositions the chain rule has a familiar special case. Let $L: \mathbb{R}^p \to \mathbb{R}$ be differentiable. Define $\ell(x) = L(g(x))$ for $g: \mathbb{R}^n \to \mathbb{R}^p$. By the theorem, $J_\ell(x_0) = J_L(g(x_0)) \cdot J_g(x_0)$. Since $L$ is scalar-valued, $J_L(y) = \nabla L(y)^T$ is a row vector of size $p$, and $J_g$ is $p \times n$. The product is a row vector of size $n$, so

$$\nabla \ell(x_0) = J_g(x_0)^T \nabla L(g(x_0)).$$

This is a vector-Jacobian product (VJP): the upstream gradient $\nabla L(g(x_0))$ is propagated backward by left-multiplying by $J_g^T$. Backprop is exactly this identity applied recursively layer by layer.
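The identity can be checked against finite differences. Here $g$ and $L$ are invented for illustration, with $J_g$ and $\nabla L$ worked out by hand:

```python
import numpy as np

# Hypothetical inner map g: R^3 -> R^2 and scalar loss L: R^2 -> R.
def g(x):
    return np.array([x[0] * x[1] + x[2], np.tanh(x[0])])

def L(y):
    return 0.5 * np.sum(y ** 2)

x0 = np.array([0.3, -0.7, 1.1])
y0 = g(x0)

# Analytic pieces: J_g is 2x3, grad L(y) = y for the half squared norm.
J_g = np.array([[x0[1],                      x0[0], 1.0],
                [1 - np.tanh(x0[0]) ** 2,    0.0,   0.0]])
grad_L = y0

vjp = J_g.T @ grad_L           # the corollary: ∇(L ∘ g)(x0) = J_g^T ∇L(g(x0))

# Central-difference gradient of the composition, for comparison.
eps = 1e-6
fd = np.array([(L(g(x0 + eps * e)) - L(g(x0 - eps * e))) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(vjp, fd, atol=1e-5)
```

Note the shapes: the upstream gradient lives in $\mathbb{R}^2$, and left-multiplying by $J_g^T$ carries it back to $\mathbb{R}^3$, the input space.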

## Implicit Function Chain Rule

Sometimes a variable is defined implicitly: $F(x, y) = 0$ specifies $y$ as a function of $x$ near a point where $\partial F / \partial y$ is invertible. The implicit function theorem guarantees a smooth $y(x)$ exists locally, and differentiating $F(x, y(x)) = 0$ via the chain rule gives

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}.$$

[Implicit differentiation](/topics/implicit-differentiation) underpins meta-learning by implicit gradients, deep-equilibrium models (DEQs), and the gradient of an [optimization](/topics/projected-gradient-descent) solution with respect to its parameters.

## Worked Example: Two-Layer Network

Let $g(x) = \sigma(W_1 x)$ and $h(x) = W_2 g(x)$ for activation $\sigma$ applied componentwise, with $x \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$. The Jacobian of the inner map is $J_g(x) = \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1$, an $m \times n$ matrix. The Jacobian of the outer map at $y = g(x)$ is $W_2$. By the chain rule the composed Jacobian is

$$J_h(x) = W_2 \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1.$$

For a scalar loss $L(h(x))$ the gradient with respect to $x$ is the VJP $\nabla_x L = W_1^T \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_2^T \cdot \nabla L$, evaluated right-to-left, which is precisely what backprop computes.

## Common Confusions

<Confusion title="Order of multiplication matters">
The product $J_f \cdot J_g$ is not the same as $J_g \cdot J_f$ in general, because matrix multiplication is non-commutative and even the shapes typically disagree. The outer Jacobian $J_f$ is evaluated at the inner output $g(x)$ and goes on the left. Reversing the order is the most common chain-rule mistake on multivariable calculus exams.
</Confusion>

<Confusion title="Gradients are not Jacobians">
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient $\nabla f$ is a column vector and the Jacobian is the row vector $J_f = (\nabla f)^T$. Many sources collapse the distinction and write $\nabla(f \circ g) = J_g^T \nabla f$, hiding the transpose inside the notation.
When you implement autodiff or read papers, track the orientation explicitly: the chain rule gives row vectors via $J_L J_g$ and gradients via $J_g^T \nabla L$.
</Confusion>

## Exercises

<Exercise difficulty="core">
<Problem>
Let $f(x, y) = x^2 + y^2$ and let $g(t) = (t \cos t, t \sin t)$ trace out a spiral. Compute $\frac{d}{dt}(f \circ g)(t)$ via the chain rule, then verify by direct computation.
</Problem>
<Hint>
The Jacobian of $f$ as a $1 \times 2$ row is $(2x, 2y)$. The Jacobian of $g$ as a $2 \times 1$ column is $(\cos t - t \sin t, \sin t + t \cos t)^T$. Multiply.
</Hint>
<Solution>
By the chain rule, $\frac{d}{dt}(f \circ g) = J_f(g(t)) \cdot J_g(t)$. Substituting $x = t \cos t$, $y = t \sin t$: $J_f(g(t)) = (2 t \cos t, 2 t \sin t)$. Then $J_f \cdot J_g = 2 t \cos t (\cos t - t \sin t) + 2 t \sin t (\sin t + t \cos t) = 2 t (\cos^2 t + \sin^2 t) + 2 t^2 (\sin t \cos t - \sin t \cos t) = 2 t$. Direct: $f(g(t)) = t^2 \cos^2 t + t^2 \sin^2 t = t^2$, and $d/dt[t^2] = 2t$. The chain rule and direct computation agree.
</Solution>
</Exercise>

<Exercise difficulty="advanced">
<Problem>
Use the chain rule to derive the gradient of the squared loss $L(W) = \frac{1}{2} \|W x - y\|^2$ with respect to $W \in \mathbb{R}^{m \times n}$ for fixed $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. Show that $\nabla_W L = (W x - y) x^T$.
</Problem>
<Hint>
Treat $W \mapsto W x$ as a linear map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^m$. Its Jacobian (as a linear operator on matrices) is $M \mapsto M x$. Compose with $r \mapsto \frac{1}{2}\|r - y\|^2$, whose gradient is $r - y$, and apply the matrix version of the chain rule using the trace inner product $\langle A, B \rangle = \mathrm{tr}(A^T B)$.
</Hint>
<Solution>
Let $r = W x - y$. Then $\nabla_r L = r$. The directional derivative of $W \mapsto W x$ in direction $\Delta W$ is $\Delta W \cdot x$.
The chain rule (under the trace inner product) gives the gradient as the unique matrix $G$ such that $\langle G, \Delta W \rangle = \langle r, \Delta W \cdot x \rangle$ for all $\Delta W$. The right side equals $\mathrm{tr}(\Delta W^T \cdot r x^T) = \langle r x^T, \Delta W \rangle$. By uniqueness $G = r x^T = (W x - y) x^T$. This is the rank-one outer-product gradient that drives least-squares fitting and the linear layer in any neural network.
</Solution>
</Exercise>

## References

<ReferenceTabs>
<ReferenceTab id="rudin-1976" label="Rudin 1976" type="book">
Walter Rudin. *Principles of Mathematical Analysis* (3rd ed.). McGraw-Hill, 1976. Theorem 9.15: the chain rule for differentiable maps between Banach spaces. The cleanest statement and proof in the literature.
</ReferenceTab>
<ReferenceTab id="spivak-1965" label="Spivak 1965" type="book">
Michael Spivak. *Calculus on Manifolds*. W. A. Benjamin, 1965. Chapter 2.5: chain rule via best linear approximation. Short, modern, coordinate-free presentation.
</ReferenceTab>
<ReferenceTab id="apostol-1969" label="Apostol 1969" type="book">
Tom M. Apostol. *Calculus, Volume II* (2nd ed.). Wiley, 1969. Sections 8.10-8.12: multivariable chain rule with worked examples and the implicit-function variant.
</ReferenceTab>
<ReferenceTab id="baydin-2018" label="Baydin et al. 2018" type="paper">
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. *Automatic Differentiation in Machine Learning: A Survey*. JMLR 18, 2018. Section 3 develops forward and reverse mode as the two associativity orderings of chained Jacobians. [arXiv:1502.05767](https://arxiv.org/abs/1502.05767)
</ReferenceTab>
<ReferenceTab id="griewank-2008" label="Griewank and Walther 2008" type="book">
Andreas Griewank and Andrea Walther. *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation* (2nd ed.). SIAM, 2008. Chapter 3: matrix-form chain rule and the cheap-gradient principle.
</ReferenceTab>
<ReferenceTab id="magnus-2019" label="Magnus and Neudecker 2019" type="book">
Jan R. Magnus and Heinz Neudecker. *Matrix Differential Calculus with Applications in Statistics and Econometrics* (3rd ed.). Wiley, 2019. Chapter 5: matrix chain rule with the trace inner product, the working tool for matrix gradients in ML.
</ReferenceTab>
</ReferenceTabs>

## Related Topics

- [The Jacobian Matrix](/topics/the-jacobian-matrix)
- [Automatic Differentiation](/topics/automatic-differentiation)
- [Feedforward Networks and Backpropagation](/topics/feedforward-networks-and-backpropagation)
- [The Hessian Matrix](/topics/the-hessian-matrix)
- [Divergence, Curl, and Line Integrals](/topics/divergence-curl-and-line-integrals)

Last reviewed: April 18, 2026
