Calculus Objects
Automatic Differentiation
Forward mode computes Jacobian-vector products, reverse mode computes vector-Jacobian products: backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.
Why This Matters
Every gradient-based optimization algorithm in machine learning depends on computing derivatives. For a neural network with millions of parameters and a scalar loss function, you need the gradient of the loss with respect to every parameter. Computing this naively would require one forward pass per parameter, which is impossibly expensive.
Automatic differentiation (autodiff) solves this. Reverse-mode autodiff (backpropagation) computes the entire gradient in a single backward pass, with cost proportional to a single forward pass. This is the algorithmic insight that makes training neural networks feasible.
Understanding the distinction between forward mode and reverse mode, and why reverse mode is efficient for the typical ML setup (many inputs, scalar output), is foundational for understanding modern deep learning frameworks.
Mental Model
Think of a computation as a directed acyclic graph (DAG). Each node is an elementary operation (add, multiply, exp, etc.). The input flows forward through the graph to produce the output. Autodiff applies the chain rule systematically along this graph.
Forward mode propagates derivatives forward through the graph alongside the computation. It answers: "if I perturb one input, how does the output change?"
Reverse mode propagates derivatives backward from the output to all inputs. It answers: "how does each input contribute to a change in the output?"
For a function $f : \mathbb{R}^n \to \mathbb{R}$ (many inputs, one output), forward mode needs $n$ passes (one per input) but reverse mode needs just one. This is why backpropagation uses reverse mode.
Formal Setup and Notation
Consider a function $f : \mathbb{R}^n \to \mathbb{R}^m$ composed of elementary operations:
$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1,$$
where each $f_i$ is a simple function (matrix multiplication, elementwise nonlinearity, etc.) with Jacobian $J_i$.
By the chain rule, the Jacobian of $f$ is:
$$J = J_L \, J_{L-1} \cdots J_1.$$
The question is: what order do we multiply these matrices?
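The two association orders can be sketched concretely. A minimal example, assuming a small chain of three dense Jacobians (shapes chosen for illustration): grouping from the inside out uses only matrix-vector products, while forming the full Jacobian first requires matrix-matrix products. Both give the same answer.

```python
# Sketch: J = J3 @ J2 @ J1 applied to v, two association orders.
# All values here are illustrative assumptions, not from any library.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

J1 = [[1.0, 2.0], [0.0, 1.0]]   # Jacobian of layer 1 (2x2)
J2 = [[2.0, 0.0], [1.0, 3.0]]   # Jacobian of layer 2 (2x2)
J3 = [[1.0, 1.0]]               # Jacobian of layer 3 (1x2): scalar output
v  = [1.0, -1.0]

# JVP ordering: three cheap matrix-vector products, inside out
jvp = matvec(J3, matvec(J2, matvec(J1, v)))

# Same result by forming the full Jacobian first (matrix-matrix products)
J = matmul(J3, matmul(J2, J1))
full = matvec(J, v)

print(jvp, full)   # identical results, very different intermediate costs
```

For these small matrices the difference is invisible, but with layer dimensions in the thousands, the matrix-matrix route is vastly more expensive than the vector chain.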
Core Definitions
Jacobian-Vector Product (JVP)
A Jacobian-vector product computes $Jv$ for a vector $v \in \mathbb{R}^n$ (called a tangent vector) without forming $J$ explicitly. This is done by multiplying from the inside out, starting with $J_1 v$, which matches the forward evaluation order:
$$Jv = J_L \bigl( J_{L-1} \cdots (J_1 v) \bigr).$$
Each step produces a vector (not a matrix), so each multiplication is matrix-vector, costing $O(m_i n_i)$ for a dense $m_i \times n_i$ local Jacobian.
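One classical way to implement forward mode is with dual numbers: each value carries a (primal, tangent) pair, and every elementary operation updates both. The sketch below (a hypothetical `Dual` class, not any framework's API) propagates a tangent through $f(x, y) = xy + \sin(x)$:

```python
# Forward-mode sketch via dual numbers: propagating a tangent v
# through f computes the JVP without ever forming the Jacobian.
import math

class Dual:
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule: d(uv) = u dv + v du
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def f(x, y):                     # f(x, y) = x*y + sin(x)
    return x * y + sin(x)

# Tangent direction v = (1, 0) extracts df/dx = y + cos(x)
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(out.val, out.tan)
```

Seeding the tangent with the basis vector $e_i$ extracts the $i$-th column of the Jacobian, which is exactly why the full Jacobian needs $n$ such passes.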
Vector-Jacobian Product (VJP)
A vector-Jacobian product computes $v^\top J$ for a vector $v \in \mathbb{R}^m$ (called a cotangent vector) without forming $J$ explicitly. This is done by multiplying from the outside in, starting with $v^\top J_L$:
$$v^\top J = \bigl( (v^\top J_L) J_{L-1} \cdots \bigr) J_1.$$
Again, each step is a vector-matrix product, not matrix-matrix.
Forward Mode Autodiff
Forward mode computes JVPs. Given an input perturbation direction $v$, forward mode propagates the tangent through the computation graph alongside the function evaluation, producing $Jv$ at the output.
One forward-mode pass computes one column of the Jacobian (by setting $v = e_i$, the $i$-th standard basis vector). To get the full Jacobian, you need $n$ passes.
Reverse Mode Autodiff (Backpropagation)
Reverse mode computes VJPs. Given an output sensitivity $v$, reverse mode propagates it backward through the computation graph, producing $v^\top J$ (a row of $J$ when $v = e_i$).
One reverse-mode pass computes one row of the Jacobian. To get the full Jacobian, you need $m$ passes.
Main Theorems
Reverse Mode Computes All Gradients in O(1) Backward Passes
Statement
For a function $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode autodiff computes the full gradient $\nabla f$ in a single backward pass. The computational cost is at most a constant factor (typically 2-5x) times the cost of evaluating $f$ itself.
Intuition
The gradient is a single row of the Jacobian. Reverse mode computes one row per pass. So one pass suffices. This is why backpropagation is efficient: regardless of how many millions of parameters the network has, one backward pass gives the gradient with respect to all of them.
Proof Sketch
In the reverse pass, each node in the computational graph receives its adjoint (the derivative of the output with respect to that node) from its children, and passes it to its parents using the local Jacobian. Each edge in the graph is traversed once, and each local Jacobian-vector product costs proportional to the cost of the corresponding forward operation. Summing over all operations gives total cost proportional to the forward pass cost.
Why It Matters
This is arguably the single most important algorithmic result in deep learning. Without it, computing gradients for a network with $n$ parameters would cost $n$ times the forward pass (one perturbation per parameter). With reverse mode, it costs $O(1)$ times the forward pass. For modern networks, $n$ ranges from millions to hundreds of billions, so this is the difference between feasible and impossible.
Failure Mode
Reverse mode requires storing all intermediate values from the forward pass (the "tape"). For very deep networks or long sequences, this memory cost can be prohibitive. Gradient checkpointing trades memory for recomputation, storing only some intermediate values and recomputing the rest during the backward pass.
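The checkpointing trade-off can be sketched on a toy chain of identical layers. This is a simplified illustration under assumed names (`step`, `grad_checkpointed` are hypothetical): only every $k$-th activation is stored, and each segment is recomputed from its checkpoint during the backward pass.

```python
# Gradient checkpointing sketch: store every k-th activation,
# recompute the rest inside each segment during the backward pass.
import math

def step(x):                     # one "layer"
    return math.tanh(x)

def dstep(x):                    # its local derivative: 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

def grad_checkpointed(x0, n_layers, k):
    ckpts = {0: x0}              # forward: keep only checkpoints
    x = x0
    for i in range(n_layers):
        x = step(x)
        if (i + 1) % k == 0:
            ckpts[i + 1] = x
    adj = 1.0                    # backward: adjoint of the scalar output
    n_segs = -(-n_layers // k)   # ceil division
    for m in range(n_segs - 1, -1, -1):
        start, end = m * k, min(m * k + k, n_layers)
        xs = [ckpts[start]]      # recompute this segment's activations
        for _ in range(start, end - 1):
            xs.append(step(xs[-1]))
        for x in reversed(xs):   # chain rule through the segment
            adj *= dstep(x)
    return adj

def grad_full(x0, n_layers):     # baseline: store every activation
    acts = [x0]
    for _ in range(n_layers):
        acts.append(step(acts[-1]))
    g = 1.0
    for x in acts[:-1]:
        g *= dstep(x)
    return g

print(grad_checkpointed(0.5, 10, 4), grad_full(0.5, 10))
```

The checkpointed version stores $O(n/k)$ activations instead of $O(n)$, at the cost of one extra forward recomputation per segment, which is the standard memory-for-compute trade.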
Forward Mode Efficiency for Low-Dimensional Inputs
Statement
For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, forward-mode autodiff computes the full Jacobian in $n$ passes, each costing $O(1)$ times the forward evaluation. When $n \ll m$ (few inputs, many outputs), forward mode is more efficient than reverse mode.
Intuition
Forward mode computes one column of the Jacobian per pass (one directional derivative). If there are few input dimensions, few passes suffice. Reverse mode would need $m$ passes (one per output dimension) in this case.
Proof Sketch
Each forward pass propagates a tangent vector through the graph. The tangent at each node is a vector of the same dimension as the node's output, and the local JVP costs proportional to the local forward operation. With $n$ passes (one per input dimension), all $n$ columns of the Jacobian are computed.
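Column-by-column assembly can be sketched directly. Assuming two explicit local Jacobians (illustrative values), each forward pass pushes one basis vector through the chain, and $n = 2$ passes recover all columns:

```python
# Assembling the full Jacobian from n forward-mode passes,
# one standard basis vector per pass (assumed 2-input, 3-output chain).

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

J1 = [[1.0, 2.0], [3.0, 4.0]]              # layer 1 Jacobian (2x2)
J2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # layer 2 Jacobian (3x2)

def jvp(v):                                # one forward-mode pass
    return matvec(J2, matvec(J1, v))

cols = [jvp([1.0, 0.0]), jvp([0.0, 1.0])]  # n = 2 passes -> 2 columns
J = [[cols[j][i] for j in range(2)] for i in range(3)]
print(J)
```

With $m = 3$ outputs, reverse mode would need three passes here, so forward mode wins exactly when inputs are fewer than outputs.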
Why It Matters
In most ML applications, forward mode is not used because $n \gg m = 1$ (many parameters, scalar loss). But forward mode is important for computing Hessian-vector products (by combining forward and reverse mode), for sensitivity analysis, and for differential equation solvers where the number of outputs may exceed the number of inputs.
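The Hessian-vector product trick (forward-over-reverse) can be sketched as pushing a tangent through the gradient function: $Hv = \frac{d}{d\varepsilon} \nabla f(x + \varepsilon v)\big|_{\varepsilon=0}$. In the simplified sketch below the gradient is written analytically for $f(x_1, x_2) = x_1^2 x_2$ (in a real framework it would itself come from a reverse-mode pass), and forward mode is applied to it via dual numbers:

```python
# Forward-over-reverse sketch: Hv = tangent of the gradient in direction v.

class Dual:
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def grad_f(x1, x2):
    # analytic gradient of f(x1, x2) = x1^2 * x2 (stand-in for a
    # reverse-mode pass in a real autodiff system)
    return [2 * x1 * x2, x1 * x1]

def hvp(x, v):
    # seed each input with its tangent component, read tangents off the gradient
    duals = [Dual(xi, vi) for xi, vi in zip(x, v)]
    return [g.tan for g in grad_f(*duals)]

print(hvp([1.0, 2.0], [1.0, 0.0]))   # first column of the Hessian at (1, 2)
```

This never materializes the $n \times n$ Hessian, which is what makes it usable for second-order methods on models with millions of parameters.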
Failure Mode
Forward mode does not need to store intermediate values (no tape), so it has lower memory cost than reverse mode. But for the common many-inputs, scalar-output case, $n$ forward passes are far more expensive than one reverse pass.
Computational Graphs
A computational graph makes the chain rule explicit. Each node in the graph computes an elementary operation on its inputs. The edges represent data flow.
Forward pass: evaluate each node $v_i$ in topological order. Store each intermediate value $v_i$.
Backward pass (reverse mode): define the adjoint $\bar{v}_i = \partial y / \partial v_i$. Initialize $\bar{y} = 1$ (if $y$ is the scalar output). For each node in reverse topological order:
$$\bar{v}_i = \sum_{j \in \text{children}(i)} \bar{v}_j \, \frac{\partial v_j}{\partial v_i}.$$
This is the chain rule applied to the graph structure.
Implementation Approaches
Tape-based (operator overloading): record a trace of operations during the forward pass, then replay it backward. PyTorch uses this approach with dynamic graphs: the tape is rebuilt each forward pass, allowing arbitrary Python control flow (loops, conditionals) in the model.
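The operator-overloading idea fits in a few lines. This is a minimal sketch (a hypothetical `Var` class, not PyTorch's actual implementation): each operation records its parents and local derivatives, and the backward pass pushes adjoints to them.

```python
# Minimal tape-style reverse mode via operator overloading (a sketch).
import math

class Var:
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
    def __add__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.val + o.val, [(self, 1.0), (o, 1.0)])
    def __mul__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        # local derivatives: d(xy)/dx = y, d(xy)/dy = x
        return Var(self.val * o.val, [(self, o.val), (o, self.val)])
    def backward(self, adj=1.0):
        self.grad += adj                   # accumulate the adjoint
        for parent, local in self.parents:
            parent.backward(adj * local)   # chain rule to each parent

def sin(x):
    return Var(math.sin(x.val), [(x, math.cos(x.val))])

x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
f.backward()                 # one reverse pass fills every .grad
print(x.grad, y.grad)        # df/dx = y + cos(x), df/dy = x
```

This naive recursion revisits shared subgraphs; production systems instead topologically sort the recorded tape and visit each node once, which is what gives the $O(1)$-backward-pass guarantee.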
Source transformation: analyze the source code of the function and generate new code for the derivative function. Theano and TensorFlow 1.x used static graph compilation. JAX uses a functional transformation approach: jax.grad(f) returns a new function that computes the gradient of f.
PyTorch vs JAX: PyTorch builds the graph eagerly (tape-based, imperative). JAX traces the function lazily (functional transforms, composable). JAX's approach makes higher-order derivatives (grad of grad) and composition of transforms (vmap, jit, grad) natural.
Canonical Examples
Reverse mode on a simple computation
Consider $f(x, y) = (x + y)^2$. The computational graph has nodes: $x$, $y$, $u = x + y$, $f = u^2$. The forward pass computes each node. The backward pass starts with $\bar{f} = 1$, then $\bar{u} = \bar{f} \cdot 2u = 2u$, then $\bar{x} = \bar{u}$, $\bar{y} = \bar{u}$, and finally $\partial f / \partial x = 2(x + y)$, $\partial f / \partial y = 2(x + y)$. One backward pass gives both partial derivatives.
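As a concrete numeric check, here is one reverse-mode pass on the small function $f(x, y) = (x + y)^2$, with each adjoint computed by hand:

```python
# One reverse-mode pass on f(x, y) = (x + y)^2, written out explicitly.
x, y = 1.5, 2.5
u = x + y                 # forward pass
f = u * u

f_bar = 1.0               # backward pass: adjoint of the output
u_bar = f_bar * 2 * u     # df/du = 2u
x_bar = u_bar * 1.0       # du/dx = 1
y_bar = u_bar * 1.0       # du/dy = 1

print(x_bar, y_bar)       # both equal 2*(x + y)
```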
Common Confusions
Autodiff is not finite differences
Finite differences approximate derivatives numerically:
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}.$$
This requires $n$ forward passes for $n$ partial derivatives, suffers from numerical instability ($h$ too small gives cancellation, too large gives truncation error), and is far slower than reverse-mode autodiff for many-parameter gradients. Autodiff computes exact derivatives (up to floating-point arithmetic).
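The truncation-versus-cancellation trade-off is easy to see numerically. The sketch below differentiates $\sin$ at $x = 1$ with shrinking step sizes: the error first falls (truncation shrinks) and then rises again as floating-point cancellation takes over.

```python
# Finite-difference error vs step size for d/dx sin(x) at x = 1.
import math

def f(x):
    return math.sin(x)

exact = math.cos(1.0)            # true derivative
errs = {}
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    fd = (f(1.0 + h) - f(1.0)) / h
    errs[h] = abs(fd - exact)
    print(f"h={h:.0e}  error={errs[h]:.2e}")
```

No choice of $h$ recovers full precision, whereas autodiff is exact to floating point by construction.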
Autodiff is not symbolic differentiation
Symbolic differentiation (like in Mathematica) manipulates mathematical expressions and can produce exponentially large derivative expressions due to expression swell. Autodiff works on numerical computations, not symbolic expressions. It evaluates derivatives at specific numerical inputs, avoiding expression swell entirely.
Backpropagation is not specific to neural networks
Backpropagation is reverse-mode autodiff applied to any computational graph. It was discovered independently in several fields (control theory, optimization) before being applied to neural networks. Any differentiable program can be differentiated with reverse mode.
Summary
- Forward mode: computes JVPs ($Jv$), costs $n$ passes for the full Jacobian
- Reverse mode: computes VJPs ($v^\top J$), costs $m$ passes for the full Jacobian
- For $f : \mathbb{R}^n \to \mathbb{R}$ (the ML case), reverse mode needs 1 pass, forward mode needs $n$ passes
- Backpropagation is reverse-mode autodiff
- Reverse mode requires storing intermediate values (memory cost)
- PyTorch: tape-based, dynamic graphs. JAX: functional transforms, composable
Exercises
Problem
A neural network maps $\mathbb{R}^{1000}$ to $\mathbb{R}$ (1000 parameters, scalar loss). How many forward/backward passes does forward-mode autodiff need to compute the full gradient? How many does reverse mode need?
Problem
Consider the function $f(x, y) = xy + \sin(x)$. Draw the computational graph and perform one reverse-mode pass to compute $\partial f / \partial x$ and $\partial f / \partial y$.
Problem
Show how to compute a Hessian-vector product using one forward-mode pass composed with one reverse-mode pass, without ever forming the full Hessian matrix. Why is this useful?
References
Canonical:
- Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
- Rumelhart, Hinton, Williams, "Learning representations by back-propagating errors" (1986)
Current:
- Baydin et al., "Automatic differentiation in machine learning: a survey" (JMLR 2018)
- Bradbury et al., "JAX: composable transformations of Python+NumPy programs" (2018)
Next Topics
The natural next steps from automatic differentiation:
- Feedforward networks and backpropagation: applying reverse-mode autodiff to train neural networks
- Optimizer theory (SGD, Adam, Muon): what to do with the gradients once you have them
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The Jacobian Matrix (Layer 0A)