

Differentiation in $\mathbb{R}^n$

Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.

Core · Tier 1 · Stable · ~40 min

Why This Matters

Gradient descent is the workhorse of machine learning. Every neural network, every logistic regression, every optimizer computes a gradient and steps in the negative gradient direction. Understanding what the gradient is, why it points in the direction of steepest ascent, and how the chain rule composes derivatives through layers is prerequisite for understanding any optimization method.

Partial Derivatives

Definition

Partial Derivative

For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$-th variable at the point $a$ is:

$$\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a + h e_i) - f(a)}{h}$$

where $e_i$ is the $i$-th standard basis vector. This measures the rate of change of $f$ when only $x_i$ varies and all other variables are held fixed.

Partial derivatives exist whenever the function is differentiable along each coordinate direction. But the existence of all partial derivatives does not guarantee that $f$ is differentiable in the full sense. The total derivative (below) is the correct generalization.
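
A minimal numerical sketch of the limit definition, assuming NumPy; the test function $f(x, y) = x^2 y$ and the step size $h$ are illustrative choices, not part of the text above:

```python
import numpy as np

def partial_derivative(f, a, i, h=1e-6):
    """Forward-difference approximation of the partial derivative of f w.r.t. x_i at a."""
    a = np.asarray(a, dtype=float)
    e_i = np.zeros_like(a)
    e_i[i] = 1.0                       # i-th standard basis vector
    return (f(a + h * e_i) - f(a)) / h

# Example: f(x, y) = x^2 * y, so df/dx = 2xy and df/dy = x^2.
f = lambda p: p[0] ** 2 * p[1]
a = [1.0, 3.0]
print(partial_derivative(f, a, 0))     # ≈ 6.0
print(partial_derivative(f, a, 1))     # ≈ 1.0
```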

The Gradient

Definition

Gradient

For $f: \mathbb{R}^n \to \mathbb{R}$ with all partial derivatives existing at $a$, the gradient is the vector of partial derivatives:

$$\nabla f(a) = \left(\frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_n}(a)\right)^T \in \mathbb{R}^n$$

The gradient lives in the same space as the input $a$.
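
As a sketch (again assuming NumPy, with a made-up quadratic test function): assemble the gradient coordinate by coordinate with central differences and compare it against the known closed form $\nabla f(x) = (A + A^T)x$ for $f(x) = x^T A x$.

```python
import numpy as np

def numerical_gradient(f, a, h=1e-6):
    """Central-difference approximation of the gradient of f at a, one partial per coordinate."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a); e[i] = 1.0
        grad[i] = (f(a + h * e) - f(a - h * e)) / (2 * h)
    return grad

A = np.array([[2.0, 1.0], [0.0, 3.0]])
f = lambda x: x @ A @ x                # f(x) = x^T A x
a = np.array([1.0, -2.0])
print(numerical_gradient(f, a))        # ≈ [2, -11]
print((A + A.T) @ a)                   # analytic gradient, same values
```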

Directional Derivative

Definition

Directional Derivative

For a unit vector $v \in \mathbb{R}^n$ ($\|v\| = 1$), the directional derivative of $f$ at $a$ in direction $v$ is:

$$D_v f(a) = \lim_{t \to 0} \frac{f(a + tv) - f(a)}{t}$$

When $f$ is differentiable at $a$: $D_v f(a) = \nabla f(a)^T v$.
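
A quick numerical check of this identity (sketch; NumPy, with an arbitrary test function and point):

```python
import numpy as np

f = lambda p: p[0] ** 2 * p[1] + np.sin(p[1])
grad_f = lambda p: np.array([2 * p[0] * p[1], p[0] ** 2 + np.cos(p[1])])

a = np.array([1.0, 2.0])
v = np.array([3.0, 4.0]); v = v / np.linalg.norm(v)   # unit direction

t = 1e-6
finite_diff = (f(a + t * v) - f(a)) / t               # limit definition
dot_product = grad_f(a) @ v                           # ∇f(a)^T v
print(finite_diff, dot_product)                       # nearly equal
```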

The Gradient Points in the Steepest Ascent Direction

Theorem

Gradient as Steepest Ascent Direction

Statement

Among all unit vectors $v$ with $\|v\| = 1$, the directional derivative $D_v f(a) = \nabla f(a)^T v$ is maximized when $v = \nabla f(a) / \|\nabla f(a)\|$ (assuming $\nabla f(a) \neq 0$). The maximum value is $\|\nabla f(a)\|$. The minimum is achieved by $v = -\nabla f(a) / \|\nabla f(a)\|$, with value $-\|\nabla f(a)\|$.

Intuition

The directional derivative $\nabla f(a)^T v$ is the dot product of the gradient with the direction. By the Cauchy-Schwarz inequality, this is maximized when $v$ is aligned with $\nabla f(a)$ and minimized when $v$ points in the opposite direction. The gradient direction is the direction of fastest increase; the negative gradient is the direction of fastest decrease.

Proof Sketch

By Cauchy-Schwarz: $\nabla f(a)^T v \leq \|\nabla f(a)\| \cdot \|v\| = \|\nabla f(a)\|$, with equality when $v = \nabla f(a) / \|\nabla f(a)\|$. Similarly, the minimum $-\|\nabla f(a)\|$ is achieved when $v = -\nabla f(a) / \|\nabla f(a)\|$.
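
The theorem is also easy to see empirically: among many random unit directions, none has a directional derivative exceeding $\|\nabla f(a)\|$. A sketch (NumPy; the function $f(x, y) = x^2 + y^4$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
grad_f = lambda p: np.array([2 * p[0], 4 * p[1] ** 3])   # ∇f for f(x, y) = x^2 + y^4
a = np.array([1.0, 0.5])
g = grad_f(a)

dirs = rng.normal(size=(10_000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)       # random unit vectors
slopes = dirs @ g                                         # D_v f(a) = ∇f(a)^T v for each v

print(slopes.max())                  # ≤ ‖∇f(a)‖, approached as v aligns with g/‖g‖
print(np.linalg.norm(g))             # ‖∇f(a)‖, the slope in the gradient direction
```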

Why It Matters

This is the theoretical justification for gradient descent. To decrease a loss function $\mathcal{L}(\theta)$ as quickly as possible per unit step, move in the direction $-\nabla \mathcal{L}(\theta)$. This idea is the foundation of first-order optimization. Every gradient-based optimizer (SGD, Adam, Adagrad) uses the negative gradient as its primary signal.

Failure Mode

The gradient gives the steepest descent direction only for infinitesimal steps. For finite step sizes, the actual decrease depends on the curvature (second derivatives). A large step in the gradient direction can overshoot and increase the loss. This is why learning rate selection matters, and why second-order methods (Newton's method, natural gradient) consider curvature.
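
A minimal gradient descent sketch on an ill-conditioned quadratic (NumPy; the matrix and learning rates are illustrative). With a small step the loss shrinks; with a step beyond $2/\lambda_{\max}$ of the Hessian, the same update rule overshoots and diverges:

```python
import numpy as np

H = np.diag([1.0, 100.0])                    # Hessian of an ill-conditioned quadratic
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def run(lr, steps=20):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * grad(x)                 # step in the negative gradient direction
    return f(x)

print(run(lr=0.005))   # small step: loss shrinks toward 0
print(run(lr=0.05))    # step > 2/λ_max = 0.02: iterates diverge, loss blows up
```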

Total Derivative (Fréchet Derivative)

Definition

Total Derivative

A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is (Fréchet) differentiable at $a$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{h \to 0} \frac{\|f(a + h) - f(a) - Lh\|}{\|h\|} = 0$$

The linear map $L$ is the total derivative $Df(a)$. It is represented by the $m \times n$ Jacobian matrix. For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar-valued), $Df(a) = \nabla f(a)^T$ is a row vector.

The total derivative is the correct notion of differentiability for multivariable functions. It says that $f$ near $a$ is well approximated by its linearization $f(a) + Df(a)(x - a)$.
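
The defining limit can be watched numerically: the linearization error divided by $\|h\|$ shrinks as $h \to 0$. A sketch (NumPy; the map $f(x, y) = (xy, \sin x + y^2)$ and its hand-computed Jacobian are illustrative):

```python
import numpy as np

# f: R^2 -> R^2, f(x, y) = (x*y, sin(x) + y^2), with Jacobian computed by hand.
f = lambda p: np.array([p[0] * p[1], np.sin(p[0]) + p[1] ** 2])
Df = lambda p: np.array([[p[1], p[0]],
                         [np.cos(p[0]), 2 * p[1]]])

a = np.array([0.7, -1.2])
L = Df(a)
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * np.array([1.0, 2.0])
    err = np.linalg.norm(f(a + h) - f(a) - L @ h) / np.linalg.norm(h)
    print(scale, err)                 # error/‖h‖ shrinks roughly linearly with ‖h‖
```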

Chain Rule in Multiple Variables

Theorem

Multivariable Chain Rule

Statement

If $g: \mathbb{R}^n \to \mathbb{R}^k$ is differentiable at $a$ and $f: \mathbb{R}^k \to \mathbb{R}^m$ is differentiable at $g(a)$, then the composition $f \circ g$ is differentiable at $a$ and:

$$D(f \circ g)(a) = Df(g(a)) \cdot Dg(a)$$

The derivative of the composition is the product of the Jacobian matrices.

Intuition

Each differentiable function is locally linear. Composing two locally linear maps gives a locally linear map, and the matrix of the composition is the product of the matrices. This is just the chain rule from single-variable calculus generalized to matrices.

Proof Sketch

Let $L_f = Df(g(a))$ and $L_g = Dg(a)$. By differentiability of $g$: $g(a + h) = g(a) + L_g h + o(\|h\|)$. By differentiability of $f$: $f(g(a + h)) = f(g(a) + L_g h + o(\|h\|)) = f(g(a)) + L_f(L_g h + o(\|h\|)) + o(\|L_g h + o(\|h\|)\|)$. Since $L_f$ is linear, this equals $f(g(a)) + L_f L_g h + o(\|h\|)$. So $D(f \circ g)(a) = L_f L_g$.
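
A numerical sanity check of the theorem (sketch; NumPy, with made-up maps $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}^2$): the finite-difference Jacobian of $f \circ g$ matches the product of the analytic Jacobians.

```python
import numpy as np

g  = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
Dg = lambda x: np.array([[x[1], x[0]],
                         [np.cos(x[0]), 0.0],
                         [0.0, 2 * x[1]]])
f  = lambda z: np.array([z[0] + z[1] * z[2], np.exp(z[0])])
Df = lambda z: np.array([[1.0, z[2], z[1]],
                         [np.exp(z[0]), 0.0, 0.0]])

a = np.array([0.5, 1.5])
chain = Df(g(a)) @ Dg(a)                         # product of the Jacobians

h = 1e-6
numeric = np.column_stack([
    (f(g(a + h * e)) - f(g(a))) / h              # finite-difference columns of D(f∘g)(a)
    for e in np.eye(2)
])
print(np.max(np.abs(chain - numeric)))           # ≈ 0
```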

Why It Matters

Backpropagation is the chain rule applied to a computational graph. A neural network is a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, and, writing $z_i = (f_i \circ \cdots \circ f_1)(a)$ for the output of layer $i$:

$$Df(a) = Df_L(z_{L-1}) \cdot Df_{L-1}(z_{L-2}) \cdots Df_1(a)$$

Backpropagation computes this product right-to-left (reverse mode), which is efficient because the output dimension (loss = scalar) is small.
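
A sketch of reverse-mode accumulation for a tiny two-layer network (NumPy; the layer sizes, the tanh nonlinearity, and the squared loss are arbitrary choices for illustration). The scalar loss gradient is pushed backwards through each layer's local Jacobian, so no full Jacobian of the whole network is ever formed:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4))           # layer 1 weights
W2 = rng.normal(size=(1, 3))           # layer 2 weights (scalar output)
x = rng.normal(size=4)

def forward(W1):
    z1 = W1 @ x                        # layer 1 pre-activation
    h1 = np.tanh(z1)                   # elementwise nonlinearity
    z2 = (W2 @ h1)[0]                  # scalar output
    return 0.5 * z2 ** 2, z1, h1, z2   # squared loss and intermediates

loss, z1, h1, z2 = forward(W1)

# Reverse pass: push dloss/dz2 backwards through each local Jacobian.
d_z2 = z2                              # dloss/dz2 for the squared loss
d_h1 = W2[0] * d_z2                    # through the linear readout
d_z1 = d_h1 * (1.0 - np.tanh(z1) ** 2) # through elementwise tanh
d_W1 = np.outer(d_z1, x)               # dloss/dW1, same shape as W1

# Finite-difference check of one entry of dloss/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[1, 2] += eps
print(d_W1[1, 2], (forward(W1p)[0] - loss) / eps)   # nearly equal
```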

Failure Mode

The chain rule requires both ff and gg to be differentiable. Non-differentiable activations (ReLU at zero) technically violate this, but work in practice because the set of non-differentiable points has measure zero and subgradients suffice for optimization.

Common Confusions

Watch Out

The existence of partial derivatives does not imply differentiability

The function $f(x, y) = xy / (x^2 + y^2)$ for $(x, y) \neq (0, 0)$, with $f(0, 0) = 0$, has both partial derivatives at the origin ($\partial f / \partial x = 0$ and $\partial f / \partial y = 0$) but is not continuous, let alone differentiable, there: along the line $y = x$, $f(t, t) = 1/2$ for every $t \neq 0$. The total derivative is a strictly stronger condition than the existence of partial derivatives.
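
A quick numerical illustration (plain Python sketch): both coordinate slices through the origin are identically zero, so the partials there vanish, yet arbitrarily close to the origin the function takes the value $1/2$ along the diagonal.

```python
def f(x, y):
    return 0.0 if (x, y) == (0.0, 0.0) else x * y / (x ** 2 + y ** 2)

h = 1e-8
print((f(h, 0.0) - f(0.0, 0.0)) / h)   # df/dx at the origin: 0.0
print((f(0.0, h) - f(0.0, 0.0)) / h)   # df/dy at the origin: 0.0
print(f(1e-12, 1e-12))                 # 0.5, no matter how small: f is not continuous at (0, 0)
```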

Watch Out

The gradient is not the same as the derivative

For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(a)$ is a column vector in $\mathbb{R}^n$. The total derivative $Df(a)$ is a row vector (a $1 \times n$ matrix). They contain the same information but are different mathematical objects: $Df(a) = \nabla f(a)^T$. This distinction matters when composing derivatives via the chain rule.

Watch Out

Gradient descent works on parameters, not inputs

In ML, you compute $\nabla_\theta \mathcal{L}(\theta)$, the gradient with respect to the model parameters $\theta$. You do not optimize over the inputs $x$. The gradient tells you how to change $\theta$ to reduce the loss, not how to change the data.

Summary

  • Partial derivative: rate of change along one coordinate axis
  • Gradient: vector of all partial derivatives; lives in input space
  • The gradient points in the direction of steepest ascent (Cauchy-Schwarz)
  • Total derivative (Fréchet): the best linear approximation to $f$ near $a$
  • Chain rule: $D(f \circ g)(a) = Df(g(a)) \cdot Dg(a)$, i.e., multiply Jacobians
  • Backpropagation is the chain rule applied right-to-left through a neural network

Exercises

Exercise · Core

Problem

Let $f(x, y) = x^2 y + e^{xy}$. Compute $\nabla f(1, 0)$ and find the directional derivative in the direction $v = (3/5, 4/5)$.

Exercise · Advanced

Problem

A two-layer neural network computes $f(x) = \sigma(W_2 \sigma(W_1 x))$, where $\sigma$ is applied elementwise and $W_1 \in \mathbb{R}^{k \times n}$, $W_2 \in \mathbb{R}^{1 \times k}$. Use the chain rule to write $\partial f / \partial W_1$ in terms of the Jacobians of each layer.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 9
  • Spivak, Calculus on Manifolds (1965), Chapter 2
  • Apostol, Mathematical Analysis (1974), Chapter 12 (multivariable differential calculus)

Current:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 4 (Numerical Computation)
  • Boyd & Vandenberghe, Convex Optimization (2004), Appendix A.4 (gradients and Hessians)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 5 (vector calculus)

Next Topics

  • The Jacobian matrix: the full matrix of partial derivatives for vector-valued functions
  • Convex optimization basics: using gradients to solve optimization problems
  • Automatic differentiation: computing gradients efficiently via computation graphs

Last reviewed: April 2026
