Why backprop is reverse-mode, not forward-mode
Question
A network computes L = f4(f3(f2(f1(x)))), where x is a vector input and L is a scalar loss. By the chain rule, dL/dx is the Jacobian product J_4 * J_3 * J_2 * J_1, where J_i is the Jacobian of f_i evaluated at its input. The product can be accumulated from the right, starting with J_2 * J_1 (forward mode, following the order of computation), or from the left, starting with J_4 * J_3 (reverse mode, working backward from the scalar loss). Why does backpropagation accumulate from the loss side?
Why this matters
This is the computational reason deep learning is feasible at scale. One forward-mode pass computes the derivative along a single input direction (a Jacobian-vector product), so the full gradient with respect to n inputs costs O(n * forward cost). One reverse-mode pass computes the gradient of a single scalar output with respect to all inputs at once (a vector-Jacobian product), costing O(forward cost) regardless of input dimension. Because the loss is scalar, reverse accumulation keeps every intermediate a row vector of adjoints, so each layer contributes only a vector-matrix product; forward accumulation of the full gradient would carry matrix-sized intermediates. For a scalar loss with millions of parameters, reverse mode wins by a factor of millions. Getting this wrong in an interview signals the candidate has not thought about why backprop has the cost structure it does.
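The cost gap can be made concrete with a dependency-free sketch. The Jacobians below are illustrative made-up values (not from the source); the only structural fact that matters is that J4 is 1 x d because the loss is scalar. Counting scalar multiplications shows reverse accumulation doing O(d^2) work per layer versus O(d^3) for forward accumulation of the full gradient:

```python
# Two groupings of J4 @ J3 @ J2 @ J1 for a scalar-output chain.
# Matrices are lists of rows; `mults` counts scalar multiplications,
# our proxy for cost.

mults = 0

def matmul(A, B):
    """(m x n) @ (n x p) matrix product; costs m*n*p multiplications."""
    global mults
    mults += len(A) * len(B) * len(B[0])
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d = 3
# Illustrative layer Jacobians; J4 is 1 x d because L is a scalar.
J1 = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
J2 = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
J3 = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
J4 = [[1.0, 2.0, 3.0]]

# Reverse mode: accumulate from the loss side. Every intermediate is a
# 1 x d row vector, so each step is a vector-matrix product, O(d^2).
mults = 0
grad = matmul(matmul(matmul(J4, J3), J2), J1)
reverse_cost = mults

# Forward mode: accumulate in computation order. Intermediates are
# d x d matrices, so each step is a matrix-matrix product, O(d^3).
mults = 0
grad_fwd = matmul(J4, matmul(J3, matmul(J2, J1)))
forward_cost = mults

assert grad == grad_fwd             # same gradient either way
assert reverse_cost < forward_cost  # 27 vs 63 multiplications at d=3
```

At d = 3 the gap is 27 vs 63 multiplications; the ratio grows linearly with d, which is the "factor of millions" above.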
Common mistake
Believing reverse mode is mathematically necessary. Both modes compute the same derivatives; reverse mode is preferred only because of its cost advantage for scalar-output functions. In the opposite regime, a scalar input and vector output (e.g. the sensitivity of every network output to a single scalar hyperparameter), forward mode is cheaper and is the right tool.
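A sketch of that flipped case, with hypothetical made-up Jacobians: for a scalar input t, the chain's Jacobian J3 @ J2 @ J1 has J1 of shape d x 1, so a single forward-mode pass seeded with dt = 1 propagates one column through matrix-vector products and yields all d output derivatives at once; reverse mode would need one backward pass per output component:

```python
# Scalar input t, vector output: one forward pass recovers the whole
# derivative column. Matrices are lists of rows.

def matvec(A, v):
    """(m x n) matrix times length-n vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

d = 3
J1_col = [1.0, 0.0, 2.0]  # d x 1 Jacobian of f1 w.r.t. scalar t
J2 = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
J3 = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]

# Tangent flows in computation order f1 -> f2 -> f3; each step is a
# matrix-vector product, O(d^2), independent of the output count.
tangent = J1_col               # seed: d(f1)/dt
tangent = matvec(J2, tangent)  # d(f2)/dt
dy_dt = matvec(J3, tangent)    # d(f3)/dt: all d output derivatives

assert dy_dt == [5.0, 10.0, 8.0]
```

The symmetry is the takeaway: reverse mode is cheap per scalar output, forward mode is cheap per scalar input, and backprop is simply the case where the output side is the small one.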
Source anchor
content/topics/feedforward-networks-and-backpropagation.mdx#backprop-complexity