

Differentiation in $\mathbb{R}^n$

Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.

Core · Tier 1 · Stable · ~40 min

Why This Matters

Gradient descent is the workhorse of machine learning. Every neural network, every logistic regression, every optimizer computes a gradient and steps in the negative gradient direction. Understanding what the gradient is, why it points in the direction of steepest ascent, and how the chain rule composes derivatives through layers is prerequisite for understanding any optimization method.

Partial Derivatives

Definition

Partial Derivative

For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$-th variable at the point $a$ is:

$$\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a + h e_i) - f(a)}{h}$$

where $e_i$ is the $i$-th standard basis vector. This measures the rate of change of $f$ when only $x_i$ varies and all other variables are held fixed.

Partial derivatives exist whenever the function is differentiable along each coordinate direction. But the existence of all partial derivatives does not guarantee that $f$ is differentiable in the full sense. The total derivative (below) is the correct generalization.
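
A minimal numerical sketch of the limit definition, assuming NumPy; the test function $f(x, y) = x^2 y$ and the step size $h$ are illustrative choices, not part of the text above:

```python
import numpy as np

def partial_derivative(f, a, i, h=1e-6):
    """Forward-difference approximation of the partial derivative of f w.r.t. x_i at a."""
    a = np.asarray(a, dtype=float)
    e_i = np.zeros_like(a)
    e_i[i] = 1.0                       # i-th standard basis vector
    return (f(a + h * e_i) - f(a)) / h

# Example: f(x, y) = x^2 * y, so df/dx = 2xy and df/dy = x^2.
f = lambda p: p[0] ** 2 * p[1]
a = [1.0, 3.0]
print(partial_derivative(f, a, 0))     # ≈ 6.0
print(partial_derivative(f, a, 1))     # ≈ 1.0
```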

The Gradient

Definition

Gradient

For $f: \mathbb{R}^n \to \mathbb{R}$ with all partial derivatives existing at $a$, the gradient is the vector of partial derivatives:

$$\nabla f(a) = \left(\frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_n}(a)\right)^T \in \mathbb{R}^n$$

The gradient lives in the same space as the input $a$.
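
As a sketch (again assuming NumPy, with a made-up quadratic test function): assemble the gradient coordinate by coordinate with central differences and compare it against the known closed form $\nabla f(x) = (A + A^T)x$ for $f(x) = x^T A x$.

```python
import numpy as np

def numerical_gradient(f, a, h=1e-6):
    """Central-difference approximation of the gradient of f at a, one partial per coordinate."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a); e[i] = 1.0
        grad[i] = (f(a + h * e) - f(a - h * e)) / (2 * h)
    return grad

A = np.array([[2.0, 1.0], [0.0, 3.0]])
f = lambda x: x @ A @ x                # f(x) = x^T A x
a = np.array([1.0, -2.0])
print(numerical_gradient(f, a))        # ≈ [2, -11]
print((A + A.T) @ a)                   # analytic gradient, same values
```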

Directional Derivative

Definition

Directional Derivative

For a unit vector $v \in \mathbb{R}^n$ ($\|v\| = 1$), the directional derivative of $f$ at $a$ in direction $v$ is:

$$D_v f(a) = \lim_{t \to 0} \frac{f(a + tv) - f(a)}{t}$$

When $f$ is differentiable at $a$: $D_v f(a) = \nabla f(a)^T v$.
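
A quick numerical check of this identity (sketch; NumPy, with an arbitrary test function and point):

```python
import numpy as np

f = lambda p: p[0] ** 2 * p[1] + np.sin(p[1])
grad_f = lambda p: np.array([2 * p[0] * p[1], p[0] ** 2 + np.cos(p[1])])

a = np.array([1.0, 2.0])
v = np.array([3.0, 4.0]); v = v / np.linalg.norm(v)   # unit direction

t = 1e-6
finite_diff = (f(a + t * v) - f(a)) / t               # limit definition
dot_product = grad_f(a) @ v                           # ∇f(a)^T v
print(finite_diff, dot_product)                       # nearly equal
```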

The Gradient Points in the Steepest Ascent Direction

Theorem

Gradient as Steepest Ascent Direction

Statement

Among all unit vectors $v$ with $\|v\| = 1$, the directional derivative $D_v f(a) = \nabla f(a)^T v$ is maximized when $v = \nabla f(a) / \|\nabla f(a)\|$ (assuming $\nabla f(a) \neq 0$). The maximum value is $\|\nabla f(a)\|$. The minimum is achieved by $v = -\nabla f(a) / \|\nabla f(a)\|$, with value $-\|\nabla f(a)\|$.

Intuition

The directional derivative $\nabla f(a)^T v$ is the dot product of the gradient with the direction. By the Cauchy-Schwarz inequality, this is maximized when $v$ is aligned with $\nabla f(a)$ and minimized when $v$ points in the opposite direction. The gradient direction is the direction of fastest increase; the negative gradient is the direction of fastest decrease.

Proof Sketch

By Cauchy-Schwarz: $\nabla f(a)^T v \leq \|\nabla f(a)\| \cdot \|v\| = \|\nabla f(a)\|$, with equality when $v = \nabla f(a) / \|\nabla f(a)\|$. Similarly, the minimum $-\|\nabla f(a)\|$ is achieved when $v = -\nabla f(a) / \|\nabla f(a)\|$.
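
The theorem is also easy to see empirically: among many random unit directions, none has a directional derivative exceeding $\|\nabla f(a)\|$. A sketch (NumPy; the function $f(x, y) = x^2 + y^4$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
grad_f = lambda p: np.array([2 * p[0], 4 * p[1] ** 3])   # ∇f for f(x, y) = x^2 + y^4
a = np.array([1.0, 0.5])
g = grad_f(a)

dirs = rng.normal(size=(10_000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)       # random unit vectors
slopes = dirs @ g                                         # D_v f(a) = ∇f(a)^T v for each v

print(slopes.max())                  # ≤ ‖∇f(a)‖, approached as v aligns with g/‖g‖
print(np.linalg.norm(g))             # ‖∇f(a)‖, the slope in the gradient direction
```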

Why It Matters

This is the theoretical justification for gradient descent. To decrease a loss function $\mathcal{L}(\theta)$ as quickly as possible per unit step, move in the direction $-\nabla \mathcal{L}(\theta)$. This idea is the foundation of first-order optimization. Every gradient-based optimizer (SGD, Adam, Adagrad) uses the negative gradient as its primary signal.

Failure Mode

The gradient gives the steepest descent direction only for infinitesimal steps. For finite step sizes, the actual decrease depends on the curvature (second derivatives). A large step in the gradient direction can overshoot and increase the loss. This is why learning rate selection matters, and why second-order methods (Newton's method, natural gradient) consider curvature.
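
A minimal gradient descent sketch on an ill-conditioned quadratic (NumPy; the matrix and learning rates are illustrative). With a small step the loss shrinks; with a step beyond $2/\lambda_{\max}$ of the Hessian, the same update rule overshoots and diverges:

```python
import numpy as np

H = np.diag([1.0, 100.0])                    # Hessian of an ill-conditioned quadratic
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def run(lr, steps=20):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * grad(x)                 # step in the negative gradient direction
    return f(x)

print(run(lr=0.005))   # small step: loss shrinks toward 0
print(run(lr=0.05))    # step > 2/λ_max = 0.02: iterates diverge, loss blows up
```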

Total Derivative (Fréchet Derivative)

Definition

Total Derivative

A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is (Fréchet) differentiable at $a$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{h \to 0} \frac{\|f(a + h) - f(a) - Lh\|}{\|h\|} = 0$$

The linear map $L$ is the total derivative $Df(a)$. It is represented by the $m \times n$ Jacobian matrix. For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar-valued), $Df(a) = \nabla f(a)^T$ is a row vector.

The total derivative is the correct notion of differentiability for multivariable functions. It says that $f$ near $a$ is well approximated by its linearization $f(a) + Df(a)(x - a)$.
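
The defining limit can be watched numerically: the linearization error divided by $\|h\|$ shrinks as $h \to 0$. A sketch (NumPy; the map $f(x, y) = (xy, \sin x + y^2)$ and its hand-computed Jacobian are illustrative):

```python
import numpy as np

# f: R^2 -> R^2, f(x, y) = (x*y, sin(x) + y^2), with Jacobian computed by hand.
f = lambda p: np.array([p[0] * p[1], np.sin(p[0]) + p[1] ** 2])
Df = lambda p: np.array([[p[1], p[0]],
                         [np.cos(p[0]), 2 * p[1]]])

a = np.array([0.7, -1.2])
L = Df(a)
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * np.array([1.0, 2.0])
    err = np.linalg.norm(f(a + h) - f(a) - L @ h) / np.linalg.norm(h)
    print(scale, err)                 # error/‖h‖ shrinks roughly linearly with ‖h‖
```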

Chain Rule in Multiple Variables

Theorem

Multivariable Chain Rule

Statement

If $g: \mathbb{R}^n \to \mathbb{R}^k$ is differentiable at $a$ and $f: \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $g(a)$, then the composition $f \circ g$ is differentiable at $a$ and:

$$D(f \circ g)(a) = Df(g(a)) \cdot Dg(a)$$

The derivative of the composition is the product of the Jacobian matrices.

Intuition

Each differentiable function is locally linear. Composing two locally linear maps gives a locally linear map, and the matrix of the composition is the product of the matrices. This is just the chain rule from single-variable calculus generalized to matrices.

Proof Sketch

Let $L_f = Df(g(a))$ and $L_g = Dg(a)$. By differentiability of $g$: $g(a + h) = g(a) + L_g h + o(\|h\|)$. By differentiability of $f$: $f(g(a + h)) = f(g(a) + L_g h + o(\|h\|)) = f(g(a)) + L_f(L_g h + o(\|h\|)) + o(\|L_g h + o(\|h\|)\|)$. Since $L_f$ is linear, this equals $f(g(a)) + L_f L_g h + o(\|h\|)$. So $D(f \circ g)(a) = L_f L_g$.
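
A numerical sanity check of the theorem (sketch; NumPy, with made-up maps $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}^2$): the finite-difference Jacobian of $f \circ g$ matches the product of the analytic Jacobians.

```python
import numpy as np

g  = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
Dg = lambda x: np.array([[x[1], x[0]],
                         [np.cos(x[0]), 0.0],
                         [0.0, 2 * x[1]]])
f  = lambda z: np.array([z[0] + z[1] * z[2], np.exp(z[0])])
Df = lambda z: np.array([[1.0, z[2], z[1]],
                         [np.exp(z[0]), 0.0, 0.0]])

a = np.array([0.5, 1.5])
chain = Df(g(a)) @ Dg(a)                         # product of the Jacobians

h = 1e-6
numeric = np.column_stack([
    (f(g(a + h * e)) - f(g(a))) / h              # finite-difference columns of D(f∘g)(a)
    for e in np.eye(2)
])
print(np.max(np.abs(chain - numeric)))           # ≈ 0
```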

Why It Matters

Backpropagation is the chain rule applied to a computational graph. A neural network is a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, and, writing $z_i = (f_i \circ \cdots \circ f_1)(a)$ for the output of layer $i$:

$$Df(a) = Df_L(z_{L-1}) \cdot Df_{L-1}(z_{L-2}) \cdots Df_1(a)$$

Backpropagation computes this product right-to-left (reverse mode), which is efficient because the output dimension (loss = scalar) is small.
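
A sketch of reverse-mode accumulation for a tiny two-layer network (NumPy; the layer sizes, the tanh nonlinearity, and the squared loss are arbitrary choices for illustration). The scalar loss gradient is pushed backwards through each layer's local Jacobian, so no full Jacobian of the whole network is ever formed:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4))           # layer 1 weights
W2 = rng.normal(size=(1, 3))           # layer 2 weights (scalar output)
x = rng.normal(size=4)

def forward(W1):
    z1 = W1 @ x                        # layer 1 pre-activation
    h1 = np.tanh(z1)                   # elementwise nonlinearity
    z2 = (W2 @ h1)[0]                  # scalar output
    return 0.5 * z2 ** 2, z1, h1, z2   # squared loss and intermediates

loss, z1, h1, z2 = forward(W1)

# Reverse pass: push dloss/dz2 backwards through each local Jacobian.
d_z2 = z2                              # dloss/dz2 for the squared loss
d_h1 = W2[0] * d_z2                    # through the linear readout
d_z1 = d_h1 * (1.0 - np.tanh(z1) ** 2) # through elementwise tanh
d_W1 = np.outer(d_z1, x)               # dloss/dW1, same shape as W1

# Finite-difference check of one entry of dloss/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[1, 2] += eps
print(d_W1[1, 2], (forward(W1p)[0] - loss) / eps)   # nearly equal
```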

Failure Mode

The chain rule requires both ff and gg to be differentiable. Non-differentiable activations (ReLU at zero) technically violate this, but work in practice because the set of non-differentiable points has measure zero and subgradients suffice for optimization.

Common Confusions

Watch Out

The existence of partial derivatives does not imply differentiability

The function $f(x, y) = xy / (x^2 + y^2)$ for $(x, y) \neq (0, 0)$, with $f(0, 0) = 0$, has both partial derivatives at the origin ($\partial f / \partial x = 0$ and $\partial f / \partial y = 0$) but is not continuous, let alone differentiable, there: along the line $y = x$, $f(t, t) = 1/2$ for every $t \neq 0$. The total derivative is a strictly stronger condition than the existence of partial derivatives.
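
A quick numerical illustration (plain Python sketch): both coordinate slices through the origin are identically zero, so the partials there vanish, yet arbitrarily close to the origin the function takes the value $1/2$ along the diagonal.

```python
def f(x, y):
    return 0.0 if (x, y) == (0.0, 0.0) else x * y / (x ** 2 + y ** 2)

h = 1e-8
print((f(h, 0.0) - f(0.0, 0.0)) / h)   # df/dx at the origin: 0.0
print((f(0.0, h) - f(0.0, 0.0)) / h)   # df/dy at the origin: 0.0
print(f(1e-12, 1e-12))                 # 0.5, no matter how small: f is not continuous at (0, 0)
```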

Watch Out

The gradient is not the same as the derivative

For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(a)$ is a column vector in $\mathbb{R}^n$. The total derivative $Df(a)$ is a row vector (a $1 \times n$ matrix). They contain the same information but are different mathematical objects: $Df(a) = \nabla f(a)^T$. This distinction matters when composing derivatives via the chain rule.

Watch Out

Gradient descent works on parameters, not inputs

In ML, you compute $\nabla_\theta \mathcal{L}(\theta)$, the gradient with respect to the model parameters $\theta$. You do not optimize over the inputs $x$. The gradient tells you how to change $\theta$ to reduce the loss, not how to change the data.

Summary

  • Partial derivative: rate of change along one coordinate axis
  • Gradient: vector of all partial derivatives; lives in input space
  • The gradient points in the direction of steepest ascent (Cauchy-Schwarz)
  • Total derivative (Fréchet): the best linear approximation to $f$ near $a$
  • Chain rule: $D(f \circ g)(a) = Df(g(a)) \cdot Dg(a)$, i.e., multiply Jacobians
  • Backpropagation is the chain rule applied right-to-left through a neural network

Exercises

Exercise · Core

Problem

Let $f(x, y) = x^2 y + e^{xy}$. Compute $\nabla f(1, 0)$ and find the directional derivative in the direction $v = (3/5, 4/5)$.

Exercise · Advanced

Problem

A two-layer neural network computes $f(x) = \sigma(W_2 \sigma(W_1 x))$, where $\sigma$ is applied elementwise and $W_1 \in \mathbb{R}^{k \times n}$, $W_2 \in \mathbb{R}^{1 \times k}$. Use the chain rule to write $\partial f / \partial W_1$ in terms of the Jacobians of each layer.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 9
  • Spivak, Calculus on Manifolds (1965), Chapter 2
  • Apostol, Mathematical Analysis (1974), Chapter 12 (multivariable differential calculus)

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 4 (Numerical Computation)
  • Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (gradients and Hessians)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5 (vector calculus)

Next Topics

  • The Jacobian matrix: the full matrix of partial derivatives for vector-valued functions
  • Convex optimization basics: using gradients to solve optimization problems
  • Automatic differentiation: computing gradients efficiently via computation graphs

Last reviewed: April 2026
