Why backprop is reverse-mode, not forward-mode
Question
A network computes L = f4(f3(f2(f1(x)))), where x is a vector input and L is a scalar loss. By the chain rule, dL/dx is the Jacobian product J_4 * J_3 * J_2 * J_1, where J_i is the Jacobian of f_i evaluated at its input. The product can be accumulated from the right, starting with J_2 * J_1 (forward mode, following the order of computation), or from the left, starting with J_4 * J_3 (reverse mode, working backward from the scalar loss). Why does backpropagation accumulate from the loss side?
Why this matters
This is the computational reason deep learning is feasible at scale. One forward-mode pass computes the derivative along a single input direction (a Jacobian-vector product), so the full gradient with respect to n inputs costs O(n * forward cost). One reverse-mode pass computes the gradient of a single scalar output with respect to all inputs at once (a vector-Jacobian product), costing O(forward cost) regardless of input dimension. Because the loss is scalar, reverse accumulation keeps every intermediate a row vector of adjoints, so each layer contributes only a vector-matrix product; forward accumulation of the full gradient would carry matrix-sized intermediates. For a scalar loss with millions of parameters, reverse mode wins by a factor of millions. Getting this wrong in an interview signals the candidate has not thought about why backprop has the cost structure it does.
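The cost gap can be made concrete with a dependency-free sketch. The Jacobians below are illustrative made-up values (not from the source); the only structural fact that matters is that J4 is 1 x d because the loss is scalar. Counting scalar multiplications shows reverse accumulation doing O(d^2) work per layer versus O(d^3) for forward accumulation of the full gradient:

```python
# Two groupings of J4 @ J3 @ J2 @ J1 for a scalar-output chain.
# Matrices are lists of rows; `mults` counts scalar multiplications,
# our proxy for cost.

mults = 0

def matmul(A, B):
    """(m x n) @ (n x p) matrix product; costs m*n*p multiplications."""
    global mults
    mults += len(A) * len(B) * len(B[0])
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d = 3
# Illustrative layer Jacobians; J4 is 1 x d because L is a scalar.
J1 = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
J2 = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
J3 = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
J4 = [[1.0, 2.0, 3.0]]

# Reverse mode: accumulate from the loss side. Every intermediate is a
# 1 x d row vector, so each step is a vector-matrix product, O(d^2).
mults = 0
grad = matmul(matmul(matmul(J4, J3), J2), J1)
reverse_cost = mults

# Forward mode: accumulate in computation order. Intermediates are
# d x d matrices, so each step is a matrix-matrix product, O(d^3).
mults = 0
grad_fwd = matmul(J4, matmul(J3, matmul(J2, J1)))
forward_cost = mults

assert grad == grad_fwd             # same gradient either way
assert reverse_cost < forward_cost  # 27 vs 63 multiplications at d=3
```

At d = 3 the gap is 27 vs 63 multiplications; the ratio grows linearly with d, which is the "factor of millions" above.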
Common mistake
Believing reverse mode is mathematically necessary. Both modes compute the same derivatives; reverse mode is preferred only because of its cost advantage for scalar-output functions. In the opposite regime, a scalar input and vector output (e.g. the sensitivity of every network output to a single scalar hyperparameter), forward mode is cheaper and is the right tool.
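A sketch of that flipped case, with hypothetical made-up Jacobians: for a scalar input t, the chain's Jacobian J3 @ J2 @ J1 has J1 of shape d x 1, so a single forward-mode pass seeded with dt = 1 propagates one column through matrix-vector products and yields all d output derivatives at once; reverse mode would need one backward pass per output component:

```python
# Scalar input t, vector output: one forward pass recovers the whole
# derivative column. Matrices are lists of rows.

def matvec(A, v):
    """(m x n) matrix times length-n vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

d = 3
J1_col = [1.0, 0.0, 2.0]  # d x 1 Jacobian of f1 w.r.t. scalar t
J2 = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
J3 = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]

# Tangent flows in computation order f1 -> f2 -> f3; each step is a
# matrix-vector product, O(d^2), independent of the output count.
tangent = J1_col               # seed: d(f1)/dt
tangent = matvec(J2, tangent)  # d(f2)/dt
dy_dt = matvec(J3, tangent)    # d(f3)/dt: all d output derivatives

assert dy_dt == [5.0, 10.0, 8.0]
```

The symmetry is the takeaway: reverse mode is cheap per scalar output, forward mode is cheap per scalar input, and backprop is simply the case where the output side is the small one.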
Source anchor
content/topics/feedforward-networks-and-backpropagation.mdx#backprop-complexity