
Calculus Objects

Automatic Differentiation

Forward mode computes Jacobian-vector products; reverse mode computes vector-Jacobian products. Backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.


Why This Matters

Every gradient-based optimization algorithm in machine learning depends on computing derivatives. For a neural network with millions of parameters and a scalar loss function, you need the gradient of the loss with respect to every parameter. Computing this naively would require one forward pass per parameter, which is impossibly expensive.

Automatic differentiation (autodiff) solves this. Reverse-mode autodiff (backpropagation) computes the entire gradient in a single backward pass, with cost proportional to a single forward pass. This is the algorithmic insight that makes training neural networks feasible.

Understanding the distinction between forward mode and reverse mode, and why reverse mode is efficient for the typical ML setup (many inputs, scalar output), is foundational for understanding modern deep learning frameworks.

[Figure: computational graph for $f = (x + y) \cdot y$ at $x = 3$, $y = 2$. Forward pass computes values ($v_1 = x + y = 5$, $f = v_1 \cdot y = 10$); backward pass accumulates gradients ($\partial f/\partial x = 2$, $\partial f/\partial y = 7$). Chain rule: multiply local gradients along each path, sum over all paths.]

Mental Model

Think of a computation as a directed acyclic graph (DAG). Each node is an elementary operation (add, multiply, exp, etc.). The input flows forward through the graph to produce the output. Autodiff applies the chain rule systematically along this graph.

Forward mode propagates derivatives forward through the graph alongside the computation. It answers: "if I perturb one input, how does the output change?"

Reverse mode propagates derivatives backward from the output to all inputs. It answers: "how does each input contribute to a change in the output?"

For a function $f: \mathbb{R}^n \to \mathbb{R}$ (many inputs, one output), forward mode needs $n$ passes (one per input) but reverse mode needs just one. This is why backpropagation uses reverse mode.

Formal Setup and Notation

Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$ composed of elementary operations:

$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$

where each $f_\ell: \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$ is a simple function (matrix multiplication, elementwise nonlinearity, etc.) with Jacobian $J_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$.

By the chain rule, the Jacobian of $f$ is:

$$J_f = J_L \cdot J_{L-1} \cdots J_1$$

The question is: in what order should we multiply these matrices?

Core Definitions

Definition

Jacobian-Vector Product (JVP)

A Jacobian-vector product computes $J_f \cdot v$ for a vector $v \in \mathbb{R}^n$ (called a tangent vector) without forming $J_f$ explicitly. This is done by associating the multiplications innermost-first, starting from $J_1 \cdot v$:

$$J_f \cdot v = J_L (J_{L-1} (\cdots (J_1 \cdot v) \cdots))$$

Each step produces a vector (not a matrix), so each multiplication is matrix-vector, costing $O(d_\ell \cdot d_{\ell-1})$.
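A toy sketch of this association (hypothetical helper names, plain Python lists as matrices): every intermediate is a vector, so no matrix-matrix product is ever formed.

```python
# JVP evaluated innermost-first, starting from J_1 @ v.

def matvec(A, v):
    """Matrix (list of rows) times column vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def jvp(jacobians, v):
    """Compute (J_L ... J_1) @ v, given jacobians listed as [J_1, ..., J_L]."""
    for J in jacobians:
        v = matvec(J, v)   # each step: one matrix-vector product
    return v

J1 = [[1.0, 2.0],          # J_1: R^2 -> R^2
      [0.0, 1.0]]
J2 = [[1.0, 1.0]]          # J_2: R^2 -> R^1

# Seeding with a basis vector e_1 extracts one column of the composite Jacobian
print(jvp([J1, J2], [1.0, 0.0]))   # [1.0]
```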

Definition

Vector-Jacobian Product (VJP)

A vector-Jacobian product computes $u^\top \cdot J_f$ for a vector $u \in \mathbb{R}^m$ (called a cotangent vector) without forming $J_f$ explicitly. This is done by associating outermost-first, starting from $u^\top \cdot J_L$:

$$u^\top \cdot J_f = ((\cdots (u^\top \cdot J_L) \cdot J_{L-1}) \cdots) \cdot J_1$$

Again, each step is a vector-matrix product, not matrix-matrix.
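A companion sketch (hypothetical helper names, plain Python lists as matrices): starting from $u^\top J_L$, every intermediate is again a vector, not a matrix.

```python
# VJP evaluated starting from u^T @ J_L, walking the chain backward.

def vecmat(u, A):
    """Row vector times matrix (list of rows): (u^T A)_j = sum_i u_i A[i][j]."""
    return [sum(u_i * row[j] for u_i, row in zip(u, A))
            for j in range(len(A[0]))]

def vjp(jacobians, u):
    """Compute u^T (J_L ... J_1), given jacobians listed as [J_1, ..., J_L]."""
    for J in reversed(jacobians):
        u = vecmat(u, J)   # each step: one vector-matrix product
    return u

J1 = [[1.0, 2.0],
      [0.0, 1.0]]
J2 = [[1.0, 1.0]]

# With a scalar output, u = [1.0] recovers the gradient (the Jacobian's one row)
print(vjp([J1, J2], [1.0]))   # [1.0, 3.0]
```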

Definition

Forward Mode Autodiff

Forward mode computes JVPs. Given an input perturbation direction $v \in \mathbb{R}^n$, forward mode propagates $v$ through the computation graph alongside the function evaluation, producing $J_f \cdot v$ at the output.

One forward-mode pass computes one column of the Jacobian (by setting $v = e_i$, the $i$-th standard basis vector). To get the full $m \times n$ Jacobian, you need $n$ passes.
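One standard way to implement forward mode is with dual numbers: each value carries a (primal, tangent) pair, and every operation updates both via the chain rule. A minimal sketch, assuming a hypothetical `Dual` class that supports only `+`, `*`, and `sin`:

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return x1 * x2 + sin(x1)

# Seed x1 with tangent 1 (v = e_1) to get the column df/dx1 of the Jacobian
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
print(out.dot)   # equals x2 + cos(x1) = 2 + cos(3)
```

Extracting the other column ($\partial f/\partial x_2$) requires a second pass with the tangent seeds swapped, which is exactly the $n$-passes cost described above.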

Definition

Reverse Mode Autodiff (Backpropagation)

Reverse mode computes VJPs. Given an output sensitivity $u \in \mathbb{R}^m$, reverse mode propagates $u$ backward through the computation graph, producing $u^\top J_f$ (a row of $J_f$ if $u = e_j$).

One reverse-mode pass computes one row of the Jacobian. To get the full Jacobian, you need $m$ passes.

Main Theorems

Theorem

Reverse Mode Computes All Gradients in O(1) Backward Passes

Statement

For a function $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode autodiff computes the full gradient $\nabla f \in \mathbb{R}^n$ in a single backward pass. The computational cost is at most a constant factor (typically 2-5x) times the cost of evaluating $f$ itself.

Intuition

The gradient is the single row of the $1 \times n$ Jacobian. Reverse mode computes one row per pass, so one pass suffices. This is why backpropagation is efficient: regardless of how many millions of parameters the network has, one backward pass gives the gradient with respect to all of them.

Proof Sketch

In the reverse pass, each node in the computational graph receives its adjoint (the derivative of the output with respect to that node) from its children, and passes it to its parents using the local Jacobian. Each edge in the graph is traversed once, and each local Jacobian-vector product costs proportional to the cost of the corresponding forward operation. Summing over all operations gives total cost proportional to the forward pass cost.

Why It Matters

This is arguably the single most important algorithmic result in deep learning. Without it, computing gradients for a network with $n$ parameters would cost $O(n)$ times the forward pass (one perturbation per parameter). With reverse mode, it costs $O(1)$ times the forward pass. The ratio is typically $n = 10^6$ to $10^{12}$ for modern networks.

Failure Mode

Reverse mode requires storing all intermediate values from the forward pass (the "tape"). For very deep networks or long sequences, this memory cost can be prohibitive. Gradient checkpointing trades memory for recomputation, storing only some intermediate values and recomputing the rest during the backward pass.

Proposition

Forward Mode Efficiency for Low-Dimensional Inputs

Statement

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, forward-mode autodiff computes the full Jacobian in $n$ passes, each costing $O(1)$ times the forward evaluation. When $n \ll m$ (few inputs, many outputs), forward mode is more efficient than reverse mode.

Intuition

Forward mode computes one column of the Jacobian per pass (one directional derivative). If there are few input dimensions, few passes suffice. Reverse mode would need $m$ passes (one per output dimension) in this case.

Proof Sketch

Each forward pass propagates a tangent vector through the graph. The tangent at each node is a vector of the same dimension as the node's output, and the local JVP costs proportional to the local forward operation. With $n$ passes (one per input dimension), all $n$ columns of the Jacobian are computed.

Why It Matters

In most ML applications, forward mode is not used because $n \gg m$ (many parameters, scalar loss). But forward mode is important for computing Hessian-vector products (by combining forward and reverse mode), for sensitivity analysis, and for differential equation solvers where the number of outputs may exceed the number of inputs.

Failure Mode

Forward mode does not need to store intermediate values (no tape), so it has lower memory cost than reverse mode. But for the common $f: \mathbb{R}^n \to \mathbb{R}$ case, $n$ forward passes are far more expensive than one reverse pass.

Computational Graphs

A computational graph makes the chain rule explicit. Each node $v_i$ in the graph computes an elementary operation on its inputs. The edges represent data flow.

Forward pass: evaluate $v_1, v_2, \ldots, v_N$ in topological order. Store each $v_i$.

Backward pass (reverse mode): define the adjoint $\bar{v}_i = \partial f / \partial v_i$. Initialize $\bar{v}_N = 1$ (if $v_N$ is the scalar output). For each node in reverse topological order:

$$\bar{v}_i = \sum_{j : i \to j} \bar{v}_j \cdot \frac{\partial v_j}{\partial v_i}$$

This is the chain rule applied to the graph structure.
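The adjoint recurrence can be sketched as a minimal tape-style implementation. This is a hypothetical `Var` class for scalar values only, not any framework's actual machinery; it reproduces the worked figure values for $f = (x + y) \cdot y$ at $x = 3$, $y = 2$.

```python
import math

class Var:
    """A scalar node that records its parents and local partial derivatives."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # tuples of (parent_node, dv_self/dv_parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    """Accumulate adjoints over the recorded graph in reverse topological order."""
    out.grad = 1.0
    order, seen = [], set()
    def visit(v):                     # DFS post-order gives a topological order
        if id(v) in seen:
            return
        seen.add(id(v))
        for p, _ in v.parents:
            visit(p)
        order.append(v)
    visit(out)
    for v in reversed(order):
        for p, local in v.parents:
            p.grad += v.grad * local  # chain rule: sum over all paths

x, y = Var(3.0), Var(2.0)
f = (x + y) * y           # f = 10 at x = 3, y = 2
backward(f)
print(x.grad, y.grad)     # df/dx = y = 2, df/dy = x + 2y = 7
```

Each node's adjoint is touched once per incoming edge, which is the cost argument in the proof sketch above: the backward pass costs a constant factor times the forward pass.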

Implementation Approaches

Tape-based (operator overloading): record a trace of operations during the forward pass, then replay it backward. PyTorch uses this approach with dynamic graphs: the tape is rebuilt each forward pass, allowing arbitrary Python control flow (loops, conditionals) in the model.

Source transformation: analyze the source code of the function and generate new code for the derivative function. Theano and TensorFlow 1.x used static graph compilation. JAX uses a functional transformation approach: jax.grad(f) returns a new function that computes the gradient of f.

PyTorch vs JAX: PyTorch builds the graph eagerly (tape-based, imperative). JAX traces the function lazily (functional transforms, composable). JAX's approach makes higher-order derivatives (grad of grad) and composition of transforms (vmap, jit, grad) natural.

Canonical Examples

Example

Reverse mode on a simple computation

Consider $f(x_1, x_2) = (x_1 x_2 + \sin(x_1))^2$. The computational graph has nodes: $v_1 = x_1 x_2$, $v_2 = \sin(x_1)$, $v_3 = v_1 + v_2$, $v_4 = v_3^2$. The forward pass computes each node. The backward pass starts with $\bar{v}_4 = 1$, then $\bar{v}_3 = 2v_3$, then $\bar{v}_1 = \bar{v}_3$, $\bar{v}_2 = \bar{v}_3$, and finally $\bar{x}_1 = \bar{v}_1 \cdot x_2 + \bar{v}_2 \cdot \cos(x_1)$, $\bar{x}_2 = \bar{v}_1 \cdot x_1$. One backward pass gives both partial derivatives.
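The same backward pass, transcribed line by line (hypothetical function name; the test values are arbitrary) and checked against the analytic derivative $\partial f/\partial x_1 = 2(x_1 x_2 + \sin x_1)(x_2 + \cos x_1)$:

```python
import math

def grad_f(x1, x2):
    # forward pass: store intermediates
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2
    # backward pass: adjoints in reverse topological order
    v4_bar = 1.0
    v3_bar = v4_bar * 2 * v3
    v1_bar = v3_bar
    v2_bar = v3_bar
    x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

x1, x2 = 1.5, -0.5
g1, g2 = grad_f(x1, x2)
expected = 2 * (x1 * x2 + math.sin(x1)) * (x2 + math.cos(x1))
print(abs(g1 - expected) < 1e-12)   # True
```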

Common Confusions

Watch Out

Autodiff is not finite differences

Finite differences approximate derivatives numerically:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}$$

This requires $n+1$ forward passes for $n$ partial derivatives, suffers from numerical instability (too small an $h$ gives cancellation, too large gives truncation error), and for the scalar-output case is far slower than reverse-mode autodiff. Autodiff computes exact derivatives (up to floating-point arithmetic).
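A quick numerical illustration of the step-size tradeoff, using $f(x) = \sin(x)$ at $x = 1$ (values chosen arbitrarily): a large $h$ suffers truncation error, while a tiny $h$ suffers floating-point cancellation.

```python
import math

# Forward-difference approximation of f'(1) for f = sin, versus exact cos(1)
exact = math.cos(1.0)
for h in (1e-1, 1e-5, 1e-13):
    approx = (math.sin(1.0 + h) - math.sin(1.0)) / h
    print(f"h={h:g}  error={abs(approx - exact):.1e}")
```

The error first shrinks as $h$ decreases, then grows again once cancellation in the numerator dominates; there is no $h$ that makes the approximation exact.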

Watch Out

Autodiff is not symbolic differentiation

Symbolic differentiation (like in Mathematica) manipulates mathematical expressions and can produce exponentially large derivative expressions due to expression swell. Autodiff works on numerical computations, not symbolic expressions. It evaluates derivatives at specific numerical inputs, avoiding expression swell entirely.

Watch Out

Backpropagation is not specific to neural networks

Backpropagation is reverse-mode autodiff applied to any computational graph. It was discovered independently in several fields (control theory, optimization) before being applied to neural networks. Any differentiable program can be differentiated with reverse mode.

Summary

  • Forward mode: computes JVPs ($J \cdot v$), costs $O(n)$ passes for the full Jacobian
  • Reverse mode: computes VJPs ($u^\top \cdot J$), costs $O(m)$ passes for the full Jacobian
  • For $f: \mathbb{R}^n \to \mathbb{R}$ (the ML case), reverse mode needs 1 pass, forward mode needs $n$ passes
  • Backpropagation is reverse-mode autodiff
  • Reverse mode requires storing intermediate values (memory cost)
  • PyTorch: tape-based, dynamic graphs. JAX: functional transforms, composable

Exercises

Exercise (Core)

Problem

A neural network maps $\mathbb{R}^{1000}$ to $\mathbb{R}$ (1000 parameters, scalar loss). How many passes does forward-mode autodiff need to compute the full gradient? How many does reverse mode need?

Exercise (Core)

Problem

Consider the function $f(x) = e^{e^x}$. Draw the computational graph and perform one reverse-mode pass to compute $f'(x)$.

Exercise (Advanced)

Problem

Show how to compute a Hessian-vector product $Hv = (\nabla^2 f)\, v$ using one forward-mode pass composed with one reverse-mode pass, without ever forming the full Hessian matrix. Why is this useful?

References

Canonical:

  • Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
  • Rumelhart, Hinton, Williams, "Learning representations by back-propagating errors" (1986)

Current:

  • Baydin et al., "Automatic differentiation in machine learning: a survey" (JMLR 2018)
  • Bradbury et al., "JAX: composable transformations of Python+NumPy programs" (2018)

Next Topics

The natural next steps from automatic differentiation:

  • Feedforward networks and backpropagation: applying reverse-mode autodiff to train neural networks
  • Optimizer theory (SGD, Adam, Muon): what to do with the gradients once you have them

Last reviewed: April 2026
