Foundations
Differentiation in $\mathbb{R}^n$
Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.
Why This Matters
Gradient descent is the workhorse of machine learning. Every neural network, every logistic regression, every optimizer computes a gradient and steps in the negative gradient direction. Understanding what the gradient is, why it points in the direction of steepest ascent, and how the chain rule composes derivatives through layers is prerequisite for understanding any optimization method.
Partial Derivatives
Partial Derivative
For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$-th variable at a point $x$ is:
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h},$$
where $e_i$ is the $i$-th standard basis vector. This measures the rate of change of $f$ when only $x_i$ varies and all other variables are held fixed.
Partial derivatives exist whenever the function is differentiable along coordinate directions. But the existence of all partial derivatives does not guarantee that $f$ is differentiable in the full sense. The total derivative (below) is the correct generalization.
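The limit definition above can be checked numerically with a central finite difference. A minimal sketch with NumPy; the function `f` here is an illustrative choice, not one from the text:

```python
import numpy as np

def f(x):
    # Illustrative scalar field: f(x0, x1) = x0^2 * x1 + sin(x1)
    return x[0] ** 2 * x[1] + np.sin(x[1])

def partial(f, x, i, h=1e-6):
    """Central-difference estimate of the i-th partial derivative at x."""
    e = np.zeros_like(x, dtype=float)
    e[i] = h                      # perturb only the i-th coordinate
    return (f(x + e) - f(x - e)) / (2 * h)

x = np.array([1.0, 2.0])
# Hand-computed partials: df/dx0 = 2*x0*x1, df/dx1 = x0^2 + cos(x1)
print(partial(f, x, 0), 2 * x[0] * x[1])
print(partial(f, x, 1), x[0] ** 2 + np.cos(x[1]))
```

The central difference agrees with the analytic partials to roughly $O(h^2)$, which is exactly "vary one coordinate, hold the rest fixed".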
The Gradient
Gradient
For $f: \mathbb{R}^n \to \mathbb{R}$ with all partial derivatives existing at $x$, the gradient is the vector of partial derivatives:
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x) \right).$$
The gradient lives in $\mathbb{R}^n$, the same space as the input $x$.
Directional Derivative
Directional Derivative
For a unit vector $v$ ($\|v\| = 1$), the directional derivative of $f$ at $x$ in direction $v$ is:
$$D_v f(x) = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}.$$
When $f$ is differentiable at $x$: $D_v f(x) = \nabla f(x) \cdot v$.
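The identity $D_v f(x) = \nabla f(x) \cdot v$ can be verified numerically by comparing the limit definition against the dot product. A sketch under an illustrative choice of $f$ (not from the text):

```python
import numpy as np

def f(x):
    # Illustrative function: f(x0, x1) = x0^2 * x1 + x1^3
    return x[0] ** 2 * x[1] + x[1] ** 3

def grad_f(x):
    # Hand-computed gradient of the function above
    return np.array([2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2])

x = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
v /= np.linalg.norm(v)            # make v a unit vector

h = 1e-6
# Limit definition (central difference along v) vs. the dot-product formula.
dir_deriv_fd = (f(x + h * v) - f(x - h * v)) / (2 * h)
dir_deriv_dot = grad_f(x) @ v
print(dir_deriv_fd, dir_deriv_dot)
```

Both numbers agree up to finite-difference error, illustrating that for differentiable $f$ the directional derivative is just a projection of the gradient onto $v$.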
The Gradient Points in the Steepest Ascent Direction
Gradient as Steepest Ascent Direction
Statement
Among all unit vectors $v$ with $\|v\| = 1$, the directional derivative $D_v f(x) = \nabla f(x) \cdot v$ is maximized when $v = \nabla f(x) / \|\nabla f(x)\|$ (assuming $\nabla f(x) \neq 0$). The maximum value is $\|\nabla f(x)\|$. The minimum is achieved by $v = -\nabla f(x) / \|\nabla f(x)\|$ with value $-\|\nabla f(x)\|$.
Intuition
The directional derivative is the dot product of the gradient with the direction: $D_v f(x) = \nabla f(x) \cdot v$. By the Cauchy-Schwarz inequality, this is maximized when $v$ is aligned with $\nabla f(x)$ and minimized when $v$ is opposite to it. The gradient direction is the direction of fastest increase; the negative gradient is the direction of fastest decrease.
Proof Sketch
By Cauchy-Schwarz: $|\nabla f(x) \cdot v| \le \|\nabla f(x)\| \, \|v\| = \|\nabla f(x)\|$, with equality when $v = \pm \nabla f(x) / \|\nabla f(x)\|$. The maximum $\|\nabla f(x)\|$ is achieved at $v = \nabla f(x) / \|\nabla f(x)\|$; the minimum $-\|\nabla f(x)\|$ at $v = -\nabla f(x) / \|\nabla f(x)\|$.
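The Cauchy-Schwarz argument can be sanity-checked empirically: sample many random unit vectors and confirm that none of them produces a larger dot product with a fixed gradient than the aligned direction. The vector `g` below is an arbitrary stand-in for some $\nabla f(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([3.0, -1.0, 2.0])            # stand-in for a gradient vector

# Sample many random unit vectors v and evaluate g . v for each.
vs = rng.normal(size=(10_000, 3))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
dots = vs @ g

# Value in the aligned direction equals the gradient norm.
best = g @ (g / np.linalg.norm(g))
print(dots.max(), best)                    # no sample exceeds ||g||
```

No sampled direction beats $\|g\|$, and none is worse than $-\|g\|$, exactly as Cauchy-Schwarz predicts.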
Why It Matters
This is the theoretical justification for gradient descent. To decrease a loss function as quickly as possible per unit step, move in the direction $-\nabla f(x)$. This idea is the foundation of first-order convex optimization. Every gradient-based optimizer (SGD, Adam, Adagrad) uses the negative gradient as its primary signal.
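A minimal gradient descent loop makes the idea concrete. The quadratic loss below is an illustrative choice with a known minimum at the origin:

```python
import numpy as np

def loss(w):
    return 0.5 * w @ w            # simple convex quadratic, minimum at w = 0

def grad(w):
    return w                      # gradient of (1/2)||w||^2 is w itself

w = np.array([4.0, -3.0])
lr = 0.1                          # learning rate (step size)
for _ in range(200):
    w = w - lr * grad(w)          # step in the negative gradient direction
print(loss(w))                    # driven very close to zero
```

Each step multiplies the iterate by $(1 - \eta)$ here, so with $\eta = 0.1$ the loss decays geometrically toward the minimum.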
Failure Mode
The gradient gives the steepest descent direction only for infinitesimal steps. For finite step sizes, the actual decrease depends on the curvature (second derivatives). A large step in the gradient direction can overshoot and increase the loss. This is why learning rate selection matters, and why second-order methods (Newton's method, natural gradient) consider curvature.
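The overshoot phenomenon can be shown on a one-dimensional quadratic with high curvature; the stability threshold for gradient descent on $\frac{1}{2} c w^2$ is a step size of $2/c$. The specific numbers below are illustrative:

```python
def loss(w):
    return 0.5 * 100.0 * w ** 2   # curvature (second derivative) = 100

def grad(w):
    return 100.0 * w

w_small, w_large = 1.0, 1.0
for _ in range(10):
    w_small -= 0.005 * grad(w_small)   # lr * curvature = 0.5 < 2: converges
    w_large -= 0.025 * grad(w_large)   # lr * curvature = 2.5 > 2: diverges
print(w_small, w_large)
```

The small step shrinks the iterate by half each iteration; the large step overshoots the minimum and oscillates with growing amplitude, increasing the loss every step.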
Total Derivative (Fréchet Derivative)
Total Derivative
A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is (Fréchet) differentiable at $x$ if there exists a linear map $A: \mathbb{R}^n \to \mathbb{R}^m$ such that:
$$\lim_{h \to 0} \frac{\|f(x + h) - f(x) - Ah\|}{\|h\|} = 0.$$
The linear map $A$ is the total derivative $Df(x)$. It is represented by the Jacobian matrix. For $m = 1$ (scalar-valued $f$), $Df(x)$ is a row vector.
The total derivative is the correct notion of differentiability for multivariable functions. It says: near $x$, $f$ is well-approximated by its linearization $f(x) + Df(x)\,h$.
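The defining limit can be observed numerically: the linearization error divided by $\|h\|$ shrinks as $h \to 0$. The map `f` and its hand-computed Jacobian `Df` below are illustrative choices:

```python
import numpy as np

def f(x):
    # Illustrative map f: R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def Df(x):
    # Its Jacobian matrix, computed by hand
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2 * x[1]]])

x = np.array([1.0, 2.0])
direction = np.array([1.0, -1.0]) / np.sqrt(2)

# The defining limit: ||f(x+h) - f(x) - Df(x) h|| / ||h|| -> 0 as h -> 0.
errs = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * direction
    errs.append(np.linalg.norm(f(x + h) - f(x) - Df(x) @ h) / np.linalg.norm(h))
print(errs)   # the normalized error shrinks with ||h||
```

For a smooth map the normalized error decays roughly linearly in $\|h\|$, confirming that $f(x) + Df(x)\,h$ is the best linear approximation near $x$.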
Chain Rule in Multiple Variables
Multivariable Chain Rule
Statement
If $g: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x$ and $f: \mathbb{R}^m \to \mathbb{R}^k$ is differentiable at $g(x)$, then the composition $f \circ g$ is differentiable at $x$ and:
$$D(f \circ g)(x) = Df(g(x)) \, Dg(x).$$
The derivative of the composition is the product of the Jacobian matrices.
Intuition
Each differentiable function is locally linear. Composing two locally linear maps gives a locally linear map, and the matrix of the composition is the product of the matrices. This is just the chain rule from single-variable calculus generalized to matrices.
Proof Sketch
Let $y = g(x)$ and $\Delta y = g(x + h) - g(x)$. By differentiability of $g$: $\Delta y = Dg(x)\,h + o(\|h\|)$. By differentiability of $f$: $f(y + \Delta y) = f(y) + Df(y)\,\Delta y + o(\|\Delta y\|)$. Since $Df(y)$ is linear, this equals $f(y) + Df(y)\,Dg(x)\,h + o(\|h\|)$. So $D(f \circ g)(x) = Df(g(x))\,Dg(x)$.
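The product-of-Jacobians formula is easy to verify numerically: multiply the hand-computed Jacobians and compare against a finite difference of the composition. The maps `g` and `f` below are illustrative choices:

```python
import numpy as np

def g(x):                      # g: R^2 -> R^2
    return np.array([x[0] ** 2, x[0] * x[1]])

def Dg(x):                     # Jacobian of g, by hand
    return np.array([[2 * x[0], 0.0],
                     [x[1], x[0]]])

def f(y):                      # f: R^2 -> R
    return y[0] + y[1] ** 2

def Df(y):                     # Jacobian of f: a 1x2 row vector
    return np.array([[1.0, 2 * y[1]]])

x = np.array([1.0, 2.0])
chain = Df(g(x)) @ Dg(x)       # D(f o g)(x) = Df(g(x)) Dg(x)

# Compare with a central finite difference of the composition.
def fg(x):
    return f(g(x))

h = 1e-6
fd = np.array([(fg(x + h * e) - fg(x - h * e)) / (2 * h)
               for e in np.eye(2)])
print(chain, fd)
```

The matrix product and the finite-difference gradient of $f \circ g$ agree, which is the chain rule in action.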
Why It Matters
Backpropagation is the chain rule applied to a computational graph. A neural network is a composition $f_L \circ \cdots \circ f_2 \circ f_1$, and:
$$D(f_L \circ \cdots \circ f_1)(x) = Df_L(z_{L-1}) \, Df_{L-1}(z_{L-2}) \cdots Df_1(x),$$
where $z_i = (f_i \circ \cdots \circ f_1)(x)$ are the intermediate activations.
Backpropagation computes this product right-to-left (reverse mode), which is efficient because the output dimension (loss = scalar) is small.
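A hand-rolled sketch of reverse mode for a tiny two-layer composition, assuming `tanh` layers (an illustrative architecture, not one from the text). Because the output is a scalar, the upstream gradient stays a vector, and each right-to-left step is a cheap matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Forward pass, caching the intermediates needed for the backward pass.
z1 = W1 @ x
a1 = np.tanh(z1)
z2 = W2 @ a1
a2 = np.tanh(z2)
loss = a2.sum()                # scalar output

# Backward pass: start from dloss/da2 and apply transposed Jacobians
# right-to-left. Each step is a vector, never a full matrix product.
g = np.ones_like(a2)           # dloss/da2 for loss = sum(a2)
g = (1 - a2 ** 2) * g          # through tanh (elementwise Jacobian)
g = W2.T @ g                   # through the linear layer W2
g = (1 - a1 ** 2) * g          # through tanh
grad_x = W1.T @ g              # dloss/dx
print(grad_x)
```

Multiplying Jacobians left-to-right would build $m \times n$ matrices at every step; propagating the scalar loss's gradient right-to-left keeps every intermediate the size of a single layer's activations.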
Failure Mode
The chain rule requires both and to be differentiable. Non-differentiable activations (ReLU at zero) technically violate this, but work in practice because the set of non-differentiable points has measure zero and subgradients suffice for optimization.
Common Confusions
Partial derivatives existing does not imply differentiability
The function $f(x, y) = \dfrac{xy}{x^2 + y^2}$ for $(x, y) \neq (0, 0)$, with $f(0, 0) = 0$, has both partial derivatives at the origin ($f_x(0, 0) = 0$ and $f_y(0, 0) = 0$) but is not continuous, let alone differentiable, at the origin: along the line $y = x$ it is identically $1/2$. The total derivative is a strictly stronger condition than the existence of partial derivatives.
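The counterexample can be probed directly in code: both coordinate-direction difference quotients at the origin vanish, yet the function sits at $1/2$ arbitrarily close to the origin along the diagonal:

```python
def f(x, y):
    # xy / (x^2 + y^2) away from the origin, 0 at the origin
    return 0.0 if x == 0.0 and y == 0.0 else x * y / (x ** 2 + y ** 2)

h = 1e-8
fx0 = (f(h, 0.0) - f(0.0, 0.0)) / h   # partial in x at the origin: exactly 0
fy0 = (f(0.0, h) - f(0.0, 0.0)) / h   # partial in y at the origin: exactly 0

on_diagonal = f(1e-12, 1e-12)         # along y = x, f equals 1/2 everywhere
print(fx0, fy0, on_diagonal)
```

The coordinate slices through the origin are identically zero (so the partials exist), but approaching along $y = x$ gives the constant value $1/2$, ruling out continuity at the origin.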
The gradient is not the same as the derivative
For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(x)$ is a column vector in $\mathbb{R}^n$. The total derivative $Df(x)$ is a row vector (a $1 \times n$ matrix). They contain the same information but are different mathematical objects: $Df(x) = \nabla f(x)^\top$. This distinction matters when composing derivatives via the chain rule.
Gradient descent works on parameters, not inputs
In ML, you compute $\nabla_\theta L(\theta)$, the gradient of the loss with respect to model parameters $\theta$. You do not optimize over inputs $x$. The gradient tells you how to change $\theta$ to reduce the loss, not how to change the data.
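A linear regression sketch makes the distinction concrete: the data `X`, `y` stay fixed throughout, and only the parameter vector `theta` is updated (the setup is illustrative):

```python
import numpy as np

# Loss L(theta) = (1/2) ||X theta - y||^2, differentiated w.r.t. theta only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # fixed inputs (the data)
y = rng.normal(size=50)               # fixed targets
theta = np.zeros(3)                   # parameters: the thing we optimize

def grad_theta(theta):
    return X.T @ (X @ theta - y)      # gradient of L w.r.t. theta

for _ in range(500):
    theta -= 0.01 * grad_theta(theta) # descend in parameter space
print(theta)
```

The loop converges to the least-squares solution; at no point does the gradient touch `X` or `y`, which enter only as constants of the loss.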
Summary
- Partial derivative: rate of change along one coordinate axis
- Gradient: vector of all partial derivatives; lives in the input space $\mathbb{R}^n$
- The gradient points in the direction of steepest ascent (Cauchy-Schwarz)
- Total derivative (Fréchet): the best linear approximation to $f$ near $x$
- Chain rule: $D(f \circ g)(x) = Df(g(x))\,Dg(x)$, i.e., multiply Jacobians
- Backpropagation is the chain rule applied right-to-left through a neural network
Exercises
Problem
For the given function $f: \mathbb{R}^2 \to \mathbb{R}$, compute $\nabla f$ and find the directional derivative in the given unit direction $v$.
Problem
A two-layer neural network computes $f(x) = W_2\,\sigma(W_1 x)$, where $\sigma$ is applied elementwise and $W_1$, $W_2$ are weight matrices. Use the chain rule to write $Df(x)$ in terms of the Jacobians of each layer.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapter 9
- Spivak, Calculus on Manifolds (1965), Chapter 2
- Apostol, Mathematical Analysis (1974), Chapter 12 (multivariable differential calculus)
Current:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 4 (Numerical Computation)
- Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (gradients and Hessians)
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5 (vector calculus)
Next Topics
- The Jacobian matrix: the full matrix of partial derivatives for vector-valued functions
- Convex optimization basics: using gradients to solve optimization problems
- Automatic differentiation: computing gradients efficiently via computation graphs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
Builds on This
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Gradient Descent Variants (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)