# Vector Calculus Chain Rule
The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.
## Why This Matters
The single-variable chain rule says $(f \circ g)'(x) = f'(g(x)) \, g'(x)$. The multivariable version replaces those scalar derivatives with Jacobians, and the product becomes a matrix product. That one upgrade is enough to derive backpropagation in one line: a deep network is a composition $f_L \circ \cdots \circ f_1$, its Jacobian is the product $J_{f_L} \cdots J_{f_1}$, and the gradient of a scalar loss is a vector-Jacobian product evaluated right-to-left.
This page states the rule with assumptions, gives the matrix form, derives the gradient corollary, and shows the implicit chain rule that handles constraints. It does not re-derive backprop end-to-end; that lives on its own page.
## The Rule

### Multivariable Chain Rule
**Statement.** Let $g: \mathbb{R}^n \to \mathbb{R}^m$ be differentiable at $x$, with $y = g(x)$, and let $f: \mathbb{R}^m \to \mathbb{R}^p$ be differentiable at $y$. Then the composition $f \circ g$ is differentiable at $x$ and its Jacobian satisfies

$$J_{f \circ g}(x) = J_f(g(x)) \, J_g(x),$$

where the right-hand side is matrix multiplication of a $p \times m$ matrix with an $m \times n$ matrix.
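The statement can be sanity-checked numerically. The maps below are hypothetical examples chosen for illustration ($g: \mathbb{R}^2 \to \mathbb{R}^3$, $f: \mathbb{R}^3 \to \mathbb{R}^2$), with Jacobians estimated by central finite differences; the tolerances are likewise illustrative:

```python
# Check J_{f∘g}(x) = J_f(g(x)) J_g(x) on a small example.
import numpy as np

def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    return np.array([y[0] + y[1], y[1] * y[2]])

def jacobian(fn, x, eps=1e-6):
    """Central finite differences: one column per input coordinate."""
    fx = fn(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return J

x = np.array([0.7, -1.3])
lhs = jacobian(lambda t: f(g(t)), x)      # Jacobian of the composition
rhs = jacobian(f, g(x)) @ jacobian(g, x)  # product of Jacobians
assert np.allclose(lhs, rhs, atol=1e-5)
```

The check passes for any differentiable pair of maps; only the shapes ($2 \times 3$ times $3 \times 2$ here) change.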
**Intuition.** Differentiability of $g$ at $x$ means $g$ admits a best linear approximation: $g(x + h) = g(x) + J_g(x) h + o(\|h\|)$. Apply $f$ and use its own linear approximation at $g(x)$. The composed approximation is linear with matrix $J_f(g(x)) J_g(x)$. Because the linear approximation of a differentiable map is unique, this matrix is the Jacobian of $f \circ g$.
**Proof Sketch.** By assumption $g(x + h) = g(x) + J_g(x) h + r_g(h)$ with $r_g(h) = o(\|h\|)$, and $f(y + k) = f(y) + J_f(y) k + r_f(k)$ with $r_f(k) = o(\|k\|)$. Set $k = J_g(x) h + r_g(h)$ and substitute:

$$f(g(x + h)) = f(g(x)) + J_f(g(x)) \big( J_g(x) h + r_g(h) \big) + r_f(k).$$

The term $J_f(g(x)) r_g(h)$ is $o(\|h\|)$ because $J_f(g(x))$ is a fixed linear map. The term $r_f(k)$ is $o(\|k\|)$ and $\|k\| = O(\|h\|)$, so it is $o(\|h\|)$ as well. The remainder is $o(\|h\|)$, identifying $J_f(g(x)) J_g(x)$ as the Jacobian of $f \circ g$.
**Why It Matters.** Every layer in a neural network is a function $f_i: \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$. A depth-$L$ network is the composition $f_L \circ \cdots \circ f_1$. The chain rule gives $J = J_{f_L} \cdots J_{f_1}$, a product of $L$ matrices. Forward-mode autodiff evaluates this product from the input end as Jacobian-vector products; reverse-mode (backprop) evaluates it from the loss end as vector-Jacobian products, starting from the loss gradient. For a scalar loss with $n_0$ inputs, the choice of order is the difference between $O(n_0 \sum_i n_i n_{i+1})$ and $O(\sum_i n_i n_{i+1})$ flops with asymmetric memory profiles, which is why reverse mode wins for scalar losses with many parameters.
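The cost asymmetry can be sketched by counting multiply-adds for a hypothetical chain of layer widths. The widths below are illustrative, not from the text, and the count ignores the forward pass itself:

```python
# Multiply-add counts for the full gradient of a scalar loss through a
# chain of Jacobians with layer widths n_0, ..., n_L (n_L = 1).
widths = [1000, 1000, 1000, 1000, 1]  # hypothetical sizes, scalar output

# Reverse mode: one sweep of vector-Jacobian products down the chain;
# each step with an n_i x n_{i+1} Jacobian costs n_i * n_{i+1} flops.
reverse_flops = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))

# Forward mode needs one Jacobian-vector sweep per input coordinate,
# i.e. n_0 sweeps of the same per-step cost, to recover the gradient.
forward_flops = widths[0] * reverse_flops

print(f"reverse: {reverse_flops:,}  forward: {forward_flops:,}")
```

With these widths reverse mode does about 3 million multiply-adds and forward mode a thousand times more, at the price of storing intermediate activations for the backward sweep.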
**Failure Mode.** Differentiability at the inner point is required, not just continuity. Composing a non-differentiable inner map with a differentiable outer one breaks the rule even when the composition happens to be differentiable. ReLU activations are not differentiable at zero, so neural-network practice uses subgradients there; modern autodiff frameworks pick a convention (typically $0$ or $1$) and document it.
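A minimal sketch of the convention choice, using a hypothetical `relu_grad` helper (the name and the `at_zero` parameter are illustrative, not any framework's API):

```python
# ReLU is not differentiable at 0; any value in [0, 1] is a valid
# subgradient there, so the value at exactly 0 is a free convention.
import numpy as np

def relu_grad(x, at_zero=0.0):
    """Elementwise ReLU 'derivative' with a chosen subgradient at 0."""
    grad = (x > 0).astype(float)
    grad[x == 0] = at_zero  # the convention lives on this one line
    return grad

x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x, at_zero=0.0))  # one common convention
print(relu_grad(x, at_zero=1.0))  # an equally valid alternative
```

Away from zero the two conventions agree; they differ only on the measure-zero set where the true derivative does not exist.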
## Gradient Corollary

For scalar-valued compositions the chain rule has a familiar special case. Let $g: \mathbb{R}^n \to \mathbb{R}^m$ and $L: \mathbb{R}^m \to \mathbb{R}$ be differentiable. Define $\ell(x) = L(g(x))$ for $x \in \mathbb{R}^n$. By the theorem, $J_\ell(x) = J_L(g(x)) \, J_g(x)$. Since $L$ is scalar-valued, $J_L(g(x))$ is a row vector of size $m$, and $J_g(x)$ is $m \times n$. The product is a row vector of size $n$, so

$$\nabla (L \circ g)(x) = J_g(x)^T \, \nabla L(g(x)).$$

This is a vector-Jacobian product (VJP): the upstream gradient $\nabla L(g(x))$ is propagated backward by left-multiplying by $J_g(x)^T$. Backprop is exactly this identity applied recursively layer by layer.
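The identity $\nabla(L \circ g) = J_g^T \, \nabla L$ can be checked numerically. The maps below are hypothetical examples chosen for illustration: $g: \mathbb{R}^3 \to \mathbb{R}^2$ with an analytic Jacobian, and $L(y) = \tfrac{1}{2}\|y\|^2$ so that $\nabla L(y) = y$:

```python
# Check the gradient corollary: VJP vs. finite-difference gradient.
import numpy as np

def g(x):
    return np.array([x[0] * x[2], np.tanh(x[1])])

def L(y):
    return 0.5 * float(y @ y)  # so that grad L(y) = y

def num_grad(fn, x, eps=1e-6):
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        grad[j] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return grad

x = np.array([0.4, -0.9, 1.5])
y = g(x)
J_g = np.array([[x[2], 0.0, x[0]],
                [0.0, 1.0 - np.tanh(x[1]) ** 2, 0.0]])  # analytic Jacobian
vjp = J_g.T @ y  # J_g^T times the upstream gradient (here grad L(y) = y)
assert np.allclose(vjp, num_grad(lambda t: L(g(t)), x), atol=1e-5)
```

Note the VJP never forms $J_{L \circ g}$ explicitly; it only ever multiplies a vector through one transposed Jacobian at a time, which is what makes reverse mode cheap.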
## Implicit Function Chain Rule
Sometimes a variable is defined implicitly: $F(x, y) = 0$ specifies $y$ as a function of $x$ near a point where $\partial F / \partial y$ is invertible. The implicit function theorem guarantees a smooth $y(x)$ exists locally, and differentiating $F(x, y(x)) = 0$ via the chain rule gives

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}.$$

[Implicit differentiation](/topics/implicit-differentiation) underpins meta-learning by implicit gradients, deep-equilibrium models (DEQs), and the gradient of an [optimization](/topics/projected-gradient-descent) solution with respect to its parameters.

## Worked Example: Two-Layer Network

Let $g(x) = \sigma(W_1 x)$ and $h(x) = W_2 g(x)$ for activation $\sigma$ applied componentwise, with $x \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$. The Jacobian of the inner map is $J_g(x) = \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1$, an $m \times n$ matrix. The Jacobian of the outer map at $y = g(x)$ is $W_2$. By the chain rule the composed Jacobian is

$$J_h(x) = W_2 \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_1.$$

For a scalar loss $L(h(x))$ the gradient with respect to $x$ is the VJP $\nabla_x L = W_1^T \cdot \mathrm{diag}(\sigma'(W_1 x)) \cdot W_2^T \cdot \nabla L$, evaluated right-to-left, which is precisely what backprop computes.

## Common Confusions

<Confusion title="Order of multiplication matters">
The product $J_f \cdot J_g$ is not the same as $J_g \cdot J_f$ in general, because matrix multiplication is non-commutative and even the shapes typically disagree. The outer Jacobian $J_f$ is evaluated at the inner output $g(x)$ and goes on the left. Reversing the order is the most common chain-rule mistake on multivariable calculus exams.
</Confusion>

<Confusion title="Gradients are not Jacobians">
For a scalar-valued function $f: \mathbb{R}^n \to \mathbb{R}$ the gradient $\nabla f$ is a column vector and the Jacobian is the row vector $J_f = (\nabla f)^T$. Many sources collapse the distinction and write $\nabla(f \circ g) = J_g^T \nabla f$, hiding the transpose inside the notation.
When you implement autodiff or read papers, track the orientation explicitly: the chain rule gives row vectors via $J_L J_g$ and gradients via $J_g^T \nabla L$.
</Confusion>

## Exercises

<Exercise difficulty="core">
<Problem>
Let $f(x, y) = x^2 + y^2$ and let $g(t) = (t \cos t, t \sin t)$ trace out a spiral. Compute $\frac{d}{dt}(f \circ g)(t)$ via the chain rule, then verify by direct computation.
</Problem>
<Hint>
The Jacobian of $f$ as a $1 \times 2$ row is $(2x, 2y)$. The Jacobian of $g$ as a $2 \times 1$ column is $(\cos t - t \sin t, \sin t + t \cos t)^T$. Multiply.
</Hint>
<Solution>
By the chain rule, $\frac{d}{dt}(f \circ g) = J_f(g(t)) \cdot J_g(t)$. Substituting $x = t \cos t$, $y = t \sin t$: $J_f(g(t)) = (2 t \cos t, 2 t \sin t)$. Then

$$J_f \cdot J_g = 2 t \cos t (\cos t - t \sin t) + 2 t \sin t (\sin t + t \cos t) = 2 t (\cos^2 t + \sin^2 t) + 2 t^2 (\sin t \cos t - \sin t \cos t) = 2 t.$$

Direct: $f(g(t)) = t^2 \cos^2 t + t^2 \sin^2 t = t^2$, and $d/dt[t^2] = 2t$. The chain rule and direct computation agree.
</Solution>
</Exercise>

<Exercise difficulty="advanced">
<Problem>
Use the chain rule to derive the gradient of the squared loss $L(W) = \frac{1}{2} \|W x - y\|^2$ with respect to $W \in \mathbb{R}^{m \times n}$ for fixed $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$. Show that $\nabla_W L = (W x - y) x^T$.
</Problem>
<Hint>
Treat $W \mapsto W x$ as a linear map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^m$. Its Jacobian (as a linear operator on matrices) is $M \mapsto M x$. Compose with $r \mapsto \frac{1}{2}\|r - y\|^2$, whose gradient is $r - y$, and apply the matrix version of the chain rule using the trace inner product $\langle A, B \rangle = \mathrm{tr}(A^T B)$.
</Hint>
<Solution>
Let $r = W x - y$. Then $\nabla_r L = r$. The directional derivative of $W \mapsto W x$ in direction $\Delta W$ is $\Delta W \cdot x$.
The chain rule (under the trace inner product) gives the gradient as the unique matrix $G$ such that $\langle G, \Delta W \rangle = \langle r, \Delta W \cdot x \rangle$ for all $\Delta W$. The right side equals $\mathrm{tr}(\Delta W^T \cdot r x^T) = \langle r x^T, \Delta W \rangle$. By uniqueness $G = r x^T = (W x - y) x^T$. This is the rank-one outer-product gradient that drives least-squares fitting and the linear layer in any neural network.
</Solution>
</Exercise>

## References

<ReferenceTabs>
<ReferenceTab id="rudin-1976" label="Rudin 1976" type="book">
Walter Rudin. *Principles of Mathematical Analysis* (3rd ed.). McGraw-Hill, 1976. Theorem 9.15: the chain rule for differentiable maps between Banach spaces. The cleanest statement and proof in the literature.
</ReferenceTab>
<ReferenceTab id="spivak-1965" label="Spivak 1965" type="book">
Michael Spivak. *Calculus on Manifolds*. W. A. Benjamin, 1965. Chapter 2.5: chain rule via best linear approximation. Short, modern, coordinate-free presentation.
</ReferenceTab>
<ReferenceTab id="apostol-1969" label="Apostol 1969" type="book">
Tom M. Apostol. *Calculus, Volume II* (2nd ed.). Wiley, 1969. Sections 8.10-8.12: multivariable chain rule with worked examples and the implicit-function variant.
</ReferenceTab>
<ReferenceTab id="baydin-2018" label="Baydin et al. 2018" type="paper">
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. *Automatic Differentiation in Machine Learning: A Survey*. JMLR 18, 2018. Section 3 develops forward and reverse mode as the two associativity orderings of chained Jacobians. [arXiv:1502.05767](https://arxiv.org/abs/1502.05767)
</ReferenceTab>
<ReferenceTab id="griewank-2008" label="Griewank and Walther 2008" type="book">
Andreas Griewank and Andrea Walther. *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation* (2nd ed.). SIAM, 2008. Chapter 3: matrix-form chain rule and the cheap-gradient principle.
</ReferenceTab>
<ReferenceTab id="magnus-2019" label="Magnus and Neudecker 2019" type="book">
Jan R. Magnus and Heinz Neudecker. *Matrix Differential Calculus with Applications in Statistics and Econometrics* (3rd ed.). Wiley, 2019. Chapter 5: matrix chain rule with the trace inner product, the working tool for matrix gradients in ML.
</ReferenceTab>
</ReferenceTabs>

## Related Topics

- [The Jacobian Matrix](/topics/the-jacobian-matrix)
- [Automatic Differentiation](/topics/automatic-differentiation)
- [Feedforward Networks and Backpropagation](/topics/feedforward-networks-and-backpropagation)
- [The Hessian Matrix](/topics/the-hessian-matrix)
- [Divergence, Curl, and Line Integrals](/topics/divergence-curl-and-line-integrals)

Last reviewed: April 18, 2026
## Prerequisites

Foundations this topic depends on.

- The Jacobian Matrix
- Vectors, Matrices, and Linear Maps