
Calculus Objects

Automatic Differentiation

Forward mode computes Jacobian-vector products; reverse mode computes vector-Jacobian products. Backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.


Why This Matters

Every gradient-based optimization algorithm in machine learning depends on computing derivatives. For a neural network with millions of parameters and a scalar loss function, you need the gradient of the loss with respect to every parameter. Computing this naively would require one forward pass per parameter, which is impossibly expensive.

Automatic differentiation (autodiff) solves this. Reverse-mode autodiff (backpropagation) computes the entire gradient in a single backward pass, with cost proportional to a single forward pass. This is the algorithmic insight that makes training neural networks feasible.

Understanding the distinction between forward mode and reverse mode, and why reverse mode is efficient for the typical ML setup (many inputs, scalar output), is foundational for understanding modern deep learning frameworks.

[Figure: computational graph for $f = (x + y) \cdot y$ at $x = 3$, $y = 2$. Forward pass computes values ($v_1 = x + y = 5$, $f = v_1 \cdot y = 10$); backward pass accumulates gradients ($\partial f/\partial x = 2$, $\partial f/\partial y = 7$). Chain rule: multiply local gradients along each path, sum over all paths.]

Mental Model

Think of a computation as a directed acyclic graph (DAG). Each node is an elementary operation (add, multiply, exp, etc.). The input flows forward through the graph to produce the output. Autodiff applies the chain rule systematically along this graph.

Forward mode propagates derivatives forward through the graph alongside the computation. It answers: "if I perturb one input, how does the output change?"

Reverse mode propagates derivatives backward from the output to all inputs. It answers: "how does each input contribute to a change in the output?"

For a function $f: \mathbb{R}^n \to \mathbb{R}$ (many inputs, one output), forward mode needs $n$ passes (one per input) but reverse mode needs just one. This is why backpropagation uses reverse mode.

Formal Setup and Notation

Consider a function $f: \mathbb{R}^n \to \mathbb{R}^m$ composed of elementary operations:

$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$

where each $f_\ell: \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$ is a simple function (matrix multiplication, elementwise nonlinearity, etc.) with Jacobian $J_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$.

By the chain rule, the Jacobian of $f$ is:

$$J_f = J_L \cdot J_{L-1} \cdots J_1$$

The question is: in what order should we multiply these matrices?

Core Definitions

Definition

Jacobian-Vector Product (JVP)

A Jacobian-vector product computes $J_f \cdot v$ for a vector $v \in \mathbb{R}^n$ (called a tangent vector) without forming $J_f$ explicitly. This is done by associating the multiplications innermost-first, starting from $J_1 \cdot v$:

$$J_f \cdot v = J_L (J_{L-1} (\cdots (J_1 \cdot v) \cdots))$$

Each step produces a vector (not a matrix), so each multiplication is matrix-vector, costing $O(d_\ell \cdot d_{\ell-1})$.
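A toy sketch of this association (hypothetical helper names, plain Python lists as matrices): every intermediate is a vector, so no matrix-matrix product is ever formed.

```python
# JVP evaluated innermost-first, starting from J_1 @ v.

def matvec(A, v):
    """Matrix (list of rows) times column vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def jvp(jacobians, v):
    """Compute (J_L ... J_1) @ v, given jacobians listed as [J_1, ..., J_L]."""
    for J in jacobians:
        v = matvec(J, v)   # each step: one matrix-vector product
    return v

J1 = [[1.0, 2.0],          # J_1: R^2 -> R^2
      [0.0, 1.0]]
J2 = [[1.0, 1.0]]          # J_2: R^2 -> R^1

# Seeding with a basis vector e_1 extracts one column of the composite Jacobian
print(jvp([J1, J2], [1.0, 0.0]))   # [1.0]
```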

Definition

Vector-Jacobian Product (VJP)

A vector-Jacobian product computes $u^\top \cdot J_f$ for a vector $u \in \mathbb{R}^m$ (called a cotangent vector) without forming $J_f$ explicitly. This is done by associating outermost-first, starting from $u^\top \cdot J_L$:

$$u^\top \cdot J_f = ((\cdots (u^\top \cdot J_L) \cdot J_{L-1}) \cdots) \cdot J_1$$

Again, each step is a vector-matrix product, not matrix-matrix.
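A companion sketch (hypothetical helper names, plain Python lists as matrices): starting from $u^\top J_L$, every intermediate is again a vector, not a matrix.

```python
# VJP evaluated starting from u^T @ J_L, walking the chain backward.

def vecmat(u, A):
    """Row vector times matrix (list of rows): (u^T A)_j = sum_i u_i A[i][j]."""
    return [sum(u_i * row[j] for u_i, row in zip(u, A))
            for j in range(len(A[0]))]

def vjp(jacobians, u):
    """Compute u^T (J_L ... J_1), given jacobians listed as [J_1, ..., J_L]."""
    for J in reversed(jacobians):
        u = vecmat(u, J)   # each step: one vector-matrix product
    return u

J1 = [[1.0, 2.0],
      [0.0, 1.0]]
J2 = [[1.0, 1.0]]

# With a scalar output, u = [1.0] recovers the gradient (the Jacobian's one row)
print(vjp([J1, J2], [1.0]))   # [1.0, 3.0]
```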

Definition

Forward Mode Autodiff

Forward mode computes JVPs. Given an input perturbation direction $v \in \mathbb{R}^n$, forward mode propagates $v$ through the computation graph alongside the function evaluation, producing $J_f \cdot v$ at the output.

One forward-mode pass computes one column of the Jacobian (by setting $v = e_i$, the $i$-th standard basis vector). To get the full $m \times n$ Jacobian, you need $n$ passes.
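One standard way to implement forward mode is with dual numbers: each value carries a (primal, tangent) pair, and every operation updates both via the chain rule. A minimal sketch, assuming a hypothetical `Dual` class that supports only `+`, `*`, and `sin`:

```python
import math

class Dual:
    """A value paired with its directional derivative (tangent)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x1, x2):
    return x1 * x2 + sin(x1)

# Seed x1 with tangent 1 (v = e_1) to get the column df/dx1 of the Jacobian
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
print(out.dot)   # equals x2 + cos(x1) = 2 + cos(3)
```

Extracting the other column ($\partial f/\partial x_2$) requires a second pass with the tangent seeds swapped, which is exactly the $n$-passes cost described above.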

Definition

Reverse Mode Autodiff (Backpropagation)

Reverse mode computes VJPs. Given an output sensitivity $u \in \mathbb{R}^m$, reverse mode propagates $u$ backward through the computation graph, producing $u^\top J_f$ (a row of $J_f$ if $u = e_j$).

One reverse-mode pass computes one row of the Jacobian. To get the full Jacobian, you need $m$ passes.

Main Theorems

Theorem

Reverse Mode Computes All Gradients in O(1) Backward Passes

Statement

For a function $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode autodiff computes the full gradient $\nabla f \in \mathbb{R}^n$ in a single backward pass. The computational cost is at most a constant factor (typically 2-5x) times the cost of evaluating $f$ itself.

Intuition

The gradient is the single row of the $1 \times n$ Jacobian. Reverse mode computes one row per pass, so one pass suffices. This is why backpropagation is efficient: regardless of how many millions of parameters the network has, one backward pass gives the gradient with respect to all of them.

Proof Sketch

In the reverse pass, each node in the computational graph receives its adjoint (the derivative of the output with respect to that node) from its children, and passes it to its parents using the local Jacobian. Each edge in the graph is traversed once, and each local Jacobian-vector product costs proportional to the cost of the corresponding forward operation. Summing over all operations gives total cost proportional to the forward pass cost.

Why It Matters

This is arguably the single most important algorithmic result in deep learning. Without it, computing gradients for a network with $n$ parameters would cost $O(n)$ times the forward pass (one perturbation per parameter). With reverse mode, it costs $O(1)$ times the forward pass. The ratio is typically $n = 10^6$ to $10^{12}$ for modern networks.

Failure Mode

Reverse mode requires storing all intermediate values from the forward pass (the "tape"). For very deep networks or long sequences, this memory cost can be prohibitive. Gradient checkpointing trades memory for recomputation, storing only some intermediate values and recomputing the rest during the backward pass.

Proposition

Forward Mode Efficiency for Low-Dimensional Inputs

Statement

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, forward-mode autodiff computes the full Jacobian in $n$ passes, each costing $O(1)$ times the forward evaluation. When $n \ll m$ (few inputs, many outputs), forward mode is more efficient than reverse mode.

Intuition

Forward mode computes one column of the Jacobian per pass (one directional derivative). If there are few input dimensions, few passes suffice. Reverse mode would need $m$ passes (one per output dimension) in this case.

Proof Sketch

Each forward pass propagates a tangent vector through the graph. The tangent at each node is a vector of the same dimension as the node's output, and the local JVP costs proportional to the local forward operation. With $n$ passes (one per input dimension), all $n$ columns of the Jacobian are computed.

Why It Matters

In most ML applications, forward mode is not used because $n \gg m$ (many parameters, scalar loss). But forward mode is important for computing Hessian-vector products (by combining forward and reverse mode), for sensitivity analysis, and for differential equation solvers where the number of outputs may exceed the number of inputs.

Failure Mode

Forward mode does not need to store intermediate values (no tape), so it has lower memory cost than reverse mode. But for the common $f: \mathbb{R}^n \to \mathbb{R}$ case, $n$ forward passes are far more expensive than one reverse pass.

Computational Graphs

A computational graph makes the chain rule explicit. Each node $v_i$ in the graph computes an elementary operation on its inputs. The edges represent data flow.

Forward pass: evaluate $v_1, v_2, \ldots, v_N$ in topological order. Store each $v_i$.

Backward pass (reverse mode): define the adjoint $\bar{v}_i = \partial f / \partial v_i$. Initialize $\bar{v}_N = 1$ (if $v_N$ is the scalar output). For each node in reverse topological order:

$$\bar{v}_i = \sum_{j : i \to j} \bar{v}_j \cdot \frac{\partial v_j}{\partial v_i}$$

This is the chain rule applied to the graph structure.
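The adjoint recurrence can be sketched as a minimal tape-style implementation. This is a hypothetical `Var` class for scalar values only, not any framework's actual machinery; it reproduces the worked figure values for $f = (x + y) \cdot y$ at $x = 3$, $y = 2$.

```python
import math

class Var:
    """A scalar node that records its parents and local partial derivatives."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # tuples of (parent_node, dv_self/dv_parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    """Accumulate adjoints over the recorded graph in reverse topological order."""
    out.grad = 1.0
    order, seen = [], set()
    def visit(v):                     # DFS post-order gives a topological order
        if id(v) in seen:
            return
        seen.add(id(v))
        for p, _ in v.parents:
            visit(p)
        order.append(v)
    visit(out)
    for v in reversed(order):
        for p, local in v.parents:
            p.grad += v.grad * local  # chain rule: sum over all paths

x, y = Var(3.0), Var(2.0)
f = (x + y) * y           # f = 10 at x = 3, y = 2
backward(f)
print(x.grad, y.grad)     # df/dx = y = 2, df/dy = x + 2y = 7
```

Each node's adjoint is touched once per incoming edge, which is the cost argument in the proof sketch above: the backward pass costs a constant factor times the forward pass.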

Implementation Approaches

Tape-based (operator overloading): record a trace of operations during the forward pass, then replay it backward. PyTorch uses this approach with dynamic graphs: the tape is rebuilt each forward pass, allowing arbitrary Python control flow (loops, conditionals) in the model.

Source transformation: analyze the source code of the function and generate new code for the derivative function. Theano and TensorFlow 1.x used static graph compilation. JAX uses a functional transformation approach: jax.grad(f) returns a new function that computes the gradient of f.

PyTorch vs JAX: PyTorch builds the graph eagerly (tape-based, imperative). JAX traces the function lazily (functional transforms, composable). JAX's approach makes higher-order derivatives (grad of grad) and composition of transforms (vmap, jit, grad) natural.

Canonical Examples

Example

Reverse mode on a simple computation

Consider $f(x_1, x_2) = (x_1 x_2 + \sin(x_1))^2$. The computational graph has nodes: $v_1 = x_1 x_2$, $v_2 = \sin(x_1)$, $v_3 = v_1 + v_2$, $v_4 = v_3^2$. The forward pass computes each node. The backward pass starts with $\bar{v}_4 = 1$, then $\bar{v}_3 = 2v_3$, then $\bar{v}_1 = \bar{v}_3$, $\bar{v}_2 = \bar{v}_3$, and finally $\bar{x}_1 = \bar{v}_1 \cdot x_2 + \bar{v}_2 \cdot \cos(x_1)$, $\bar{x}_2 = \bar{v}_1 \cdot x_1$. One backward pass gives both partial derivatives.
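The same backward pass, transcribed line by line (hypothetical function name; the test values are arbitrary) and checked against the analytic derivative $\partial f/\partial x_1 = 2(x_1 x_2 + \sin x_1)(x_2 + \cos x_1)$:

```python
import math

def grad_f(x1, x2):
    # forward pass: store intermediates
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2
    # backward pass: adjoints in reverse topological order
    v4_bar = 1.0
    v3_bar = v4_bar * 2 * v3
    v1_bar = v3_bar
    v2_bar = v3_bar
    x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

x1, x2 = 1.5, -0.5
g1, g2 = grad_f(x1, x2)
expected = 2 * (x1 * x2 + math.sin(x1)) * (x2 + math.cos(x1))
print(abs(g1 - expected) < 1e-12)   # True
```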

Common Confusions

Watch Out

Autodiff is not finite differences

Finite differences approximate derivatives numerically:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}$$

This requires $n+1$ forward passes for $n$ partial derivatives, suffers from numerical instability (too small an $h$ gives cancellation, too large gives truncation error), and for the scalar-output case is far slower than reverse-mode autodiff. Autodiff computes exact derivatives (up to floating-point arithmetic).
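A quick numerical illustration of the step-size tradeoff, using $f(x) = \sin(x)$ at $x = 1$ (values chosen arbitrarily): a large $h$ suffers truncation error, while a tiny $h$ suffers floating-point cancellation.

```python
import math

# Forward-difference approximation of f'(1) for f = sin, versus exact cos(1)
exact = math.cos(1.0)
for h in (1e-1, 1e-5, 1e-13):
    approx = (math.sin(1.0 + h) - math.sin(1.0)) / h
    print(f"h={h:g}  error={abs(approx - exact):.1e}")
```

The error first shrinks as $h$ decreases, then grows again once cancellation in the numerator dominates; there is no $h$ that makes the approximation exact.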

Watch Out

Autodiff is not symbolic differentiation

Symbolic differentiation (like in Mathematica) manipulates mathematical expressions and can produce exponentially large derivative expressions due to expression swell. Autodiff works on numerical computations, not symbolic expressions. It evaluates derivatives at specific numerical inputs, avoiding expression swell entirely.

Watch Out

Backpropagation is not specific to neural networks

Backpropagation is reverse-mode autodiff applied to any computational graph. It was discovered independently in several fields (control theory, optimization) before being applied to neural networks. Any differentiable program can be differentiated with reverse mode.

Summary

  • Forward mode: computes JVPs ($J \cdot v$), costs $O(n)$ passes for the full Jacobian
  • Reverse mode: computes VJPs ($u^\top \cdot J$), costs $O(m)$ passes for the full Jacobian
  • For $f: \mathbb{R}^n \to \mathbb{R}$ (the ML case), reverse mode needs 1 pass, forward mode needs $n$ passes
  • Backpropagation is reverse-mode autodiff
  • Reverse mode requires storing intermediate values (memory cost)
  • PyTorch: tape-based, dynamic graphs. JAX: functional transforms, composable

Exercises

Exercise (Core)

Problem

A neural network maps $\mathbb{R}^{1000}$ to $\mathbb{R}$ (1000 parameters, scalar loss). How many passes does forward-mode autodiff need to compute the full gradient? How many does reverse mode need?

Exercise (Core)

Problem

Consider the function $f(x) = e^{e^x}$. Draw the computational graph and perform one reverse-mode pass to compute $f'(x)$.

Exercise (Advanced)

Problem

Show how to compute a Hessian-vector product $Hv = (\nabla^2 f)\, v$ using one forward-mode pass composed with one reverse-mode pass, without ever forming the full Hessian matrix. Why is this useful?

References

Canonical:

  • Griewank & Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2008)
  • Rumelhart, Hinton, Williams, "Learning representations by back-propagating errors" (1986)

Current:

  • Baydin et al., "Automatic differentiation in machine learning: a survey" (JMLR 2018)
  • Bradbury et al., "JAX: composable transformations of Python+NumPy programs" (2018)

Next Topics

The natural next steps from automatic differentiation:

  • Feedforward networks and backpropagation: applying reverse-mode autodiff to train neural networks
  • Optimizer theory (SGD, Adam, Muon): what to do with the gradients once you have them

Last reviewed: April 2026
