Foundations
Taylor Expansion
Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
Why This Matters
Every optimization algorithm you use in ML is a Taylor approximation in disguise.
Gradient descent: approximate $f$ to first order, then move in the direction that decreases this linear approximation.
Newton's method: approximate $f$ to second order, then minimize this quadratic model exactly.
Adam, L-BFGS, natural gradient: all variations on which Taylor terms to keep and how to estimate them. Understanding Taylor expansion means understanding the foundation of all gradient-based optimization.
Single Variable Taylor Expansion
Taylor Polynomial
The $n$-th order Taylor polynomial of $f$ centered at $a$ is:

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k$$

This is the unique polynomial of degree at most $n$ that matches $f$ and its first $n$ derivatives at $a$.
The cases that matter most:
First order (linear approximation):

$$f(x) \approx f(a) + f'(a)(x - a)$$

Second order (quadratic approximation):

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2$$
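To make the two cases concrete, here is a small numerical check, using $f(x) = e^x$ centered at $a = 0$ as an illustrative choice (not a function from the text above). The first-order error shrinks like $x^2$, the second-order error like $x^3$:

```python
import math

def taylor_exp(x, order):
    """Taylor polynomial of exp centered at a = 0, evaluated at x."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

x = 0.1
true = math.exp(x)
print(abs(taylor_exp(x, 1) - true))  # first order: error on the order of x^2 / 2
print(abs(taylor_exp(x, 2) - true))  # second order: error on the order of x^3 / 6
```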
Remainder Terms
The approximation is only useful if you can bound the error.
Lagrange Remainder
If $f$ is $(n+1)$ times continuously differentiable on an interval containing $a$ and $x$, the remainder after the $n$-th order Taylor polynomial is:

$$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - a)^{n+1}$$

for some $\xi$ between $a$ and $x$.
Integral Form of Remainder
Under the same conditions:

$$R_n(x) = \int_a^x \frac{f^{(n+1)}(t)}{n!} (x - t)^n \, dt$$

This form is often more useful for bounding because you can estimate the integral directly without locating the unknown point $\xi$.
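As a sketch of how the integral bound is used in practice (with $f = \sin$ as an illustrative choice): expanding to second order at 0 gives $P_2(x) = x$, and $|f'''| \le 1$ bounds the remainder integral by $|x|^3 / 3!$. A quick numerical check:

```python
import math

# f = sin, expanded to second order at a = 0: P2(x) = x (the x^2 term vanishes).
# Since |f'''| <= 1, the integral form gives |R2(x)| <= |x|^3 / 3!.
x = 0.5
p2 = x
actual_error = abs(math.sin(x) - p2)
bound = abs(x) ** 3 / math.factorial(3)
print(actual_error, bound)  # the actual error sits just under the bound
assert actual_error <= bound
```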
Main Theorems
Taylor Theorem with Lagrange Remainder
Statement
If $f$ is $(n+1)$ times continuously differentiable on an interval containing $a$ and $x$, then:

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!} (x - a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - a)^{n+1}$$

for some $\xi$ between $a$ and $x$.
Intuition
The Taylor polynomial matches $f$ perfectly at $a$. The error depends on how much the $(n+1)$-th derivative varies. If $|f^{(n+1)}|$ is bounded by $M$ on the interval between $a$ and $x$, then $|R_n(x)| \le \frac{M}{(n+1)!} |x - a|^{n+1}$.
Proof Sketch
Define

$$g(t) = f(x) - \sum_{k=0}^{n} \frac{f^{(k)}(t)}{k!} (x - t)^k - C \, \frac{(x - t)^{n+1}}{(n+1)!}$$

where the constant $C$ is chosen so that $g(a) = 0$. Note $g(x) = 0$ trivially. Differentiating in $t$, the sum telescopes, leaving $g'(t) = \frac{(x - t)^n}{n!} \left( C - f^{(n+1)}(t) \right)$. By Rolle's theorem there is a $\xi$ between $a$ and $x$ where $g'(\xi) = 0$, so $C = f^{(n+1)}(\xi)$. Substituting this $C$ back gives the Lagrange form.
Why It Matters
This is what lets you bound the error of gradient descent. If you use the first-order approximation and $f$ has bounded second derivative ($|f''| \le L$), then the approximation error over a step of size $h$ is at most $\frac{L h^2}{2}$. This is exactly why gradient descent with step size $\eta \le 1/L$ converges for $L$-smooth functions.
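The $\frac{L h^2}{2}$ bound can be checked numerically. A sketch, using $f = \sin$ as an illustrative choice (so $|f''| \le L$ with $L = 1$): the gap between $f(x + h)$ and the first-order model $f(x) + f'(x) h$ stays under $L h^2 / 2$.

```python
import math

# f(x) = sin(x): |f''| <= L with L = 1.
# First-order model at x: f(x) + f'(x) * h. Taylor bounds the gap by L * h^2 / 2.
L = 1.0
x, h = 0.8, 0.3
model = math.sin(x) + math.cos(x) * h
actual = math.sin(x + h)
gap = abs(actual - model)
print(gap, L * h**2 / 2)
assert gap <= L * h**2 / 2
```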
Failure Mode
The theorem requires sufficient differentiability. If $f$ is only once differentiable, you cannot write a second-order expansion with Lagrange remainder. The bound is also only useful when $|x - a|$ is small; for large deviations the remainder can dominate.
Multivariate Taylor Expansion
For $f : \mathbb{R}^n \to \mathbb{R}$, the first-order expansion at $x_0$ is:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0)$$

where $\nabla f$ is the gradient.

The second-order expansion is:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \frac{1}{2} (x - x_0)^\top H(x_0) (x - x_0)$$

where $H$ is the Hessian matrix with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$.
If $f$ is twice continuously differentiable, the Hessian is symmetric. If additionally $f$ is convex, the Hessian is positive semidefinite at every point.
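A small sketch of the multivariate quadratic model, using the illustrative function $f(x, y) = \sin x + xy + y^2$ (chosen here, not from the text) with its gradient and Hessian computed by hand; the model's error shrinks like the cube of the step:

```python
import math

# f(x, y) = sin(x) + x*y + y^2, with hand-computed gradient and Hessian.
def f(x, y):
    return math.sin(x) + x * y + y * y

def grad(x, y):
    return (math.cos(x) + y, x + 2 * y)

def hess(x, y):
    return ((-math.sin(x), 1.0),
            (1.0,           2.0))

x0, y0 = 0.5, -0.2
dx, dy = 0.01, 0.02
g = grad(x0, y0)
H = hess(x0, y0)
# Second-order model: f + g^T d + (1/2) d^T H d
quad = (f(x0, y0)
        + g[0] * dx + g[1] * dy
        + 0.5 * (H[0][0] * dx * dx + 2 * H[0][1] * dx * dy + H[1][1] * dy * dy))
err = abs(f(x0 + dx, y0 + dy) - quad)
print(err)  # third order in the step size: far below the step itself
assert err < 1e-6
```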
Canonical Examples
Gradient descent step size from Taylor
Let $f$ be $L$-smooth, meaning $\|\nabla^2 f(x)\| \le L$ everywhere. By second-order Taylor, the gradient step $x^+ = x - \eta \nabla f(x)$ satisfies:

$$f(x^+) \le f(x) - \eta \|\nabla f(x)\|^2 + \frac{L \eta^2}{2} \|\nabla f(x)\|^2$$

Minimizing the right side over $\eta$ gives $\eta = 1/L$, yielding:

$$f\!\left(x - \tfrac{1}{L} \nabla f(x)\right) \le f(x) - \frac{1}{2L} \|\nabla f(x)\|^2$$
This is the descent lemma, the workhorse of gradient descent convergence proofs.
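The inequality is easy to verify on an example. A sketch with the illustrative function $f(x) = x^2 + \cos x$, which has $f''(x) = 2 - \cos x \in [1, 3]$, so $L = 3$:

```python
import math

# L-smooth test function: f(x) = x^2 + cos(x), with f''(x) = 2 - cos(x) in [1, 3], so L = 3.
L = 3.0
def f(x):  return x * x + math.cos(x)
def df(x): return 2 * x - math.sin(x)

x = 2.0
x_new = x - df(x) / L                 # gradient step with eta = 1/L
lhs = f(x_new)
rhs = f(x) - df(x) ** 2 / (2 * L)     # descent lemma's guaranteed decrease
print(lhs, rhs)
assert lhs <= rhs                     # the step decreases f at least this much
```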
Common Confusions
Taylor expansion is local, not global
The Taylor polynomial centered at $a$ approximates $f$ well near $a$. It says nothing about the function far from $a$. The classic warning is $f(x) = e^{-1/x^2}$ (with $f(0) = 0$): every derivative at 0 vanishes, so its Taylor series at 0 is identically zero, which converges everywhere, yet it agrees with $f$ only at $x = 0$; no polynomial centered at 0 captures the function's behavior. In optimization, this locality is why step sizes must be small enough.
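Locality also bites for perfectly ordinary functions. A sketch using $\log(1+x)$ (an illustrative choice): its degree-9 Taylor polynomial at 0 is excellent near 0 and useless at $x = 2$, outside the series' radius of convergence:

```python
import math

# Degree-9 Taylor polynomial of log(1+x) at 0: sum of (-1)^(k+1) x^k / k.
def taylor_log1p(x, n=9):
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, n + 1))

for x in (0.2, 0.9, 2.0):
    print(x, abs(taylor_log1p(x) - math.log1p(x)))
# near 0 the error is tiny; at x = 2 the polynomial is off by orders of magnitude
```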
Second-order methods are not always better
Newton's method uses the Hessian and converges faster per step. But computing and inverting an $n \times n$ Hessian costs $O(n^2)$ storage and $O(n^3)$ time. For neural networks with millions of parameters, this is infeasible. First-order methods win by being cheap per step, even if they need more steps.
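The per-step advantage is visible even in one dimension. A sketch, minimizing the illustrative function $f(x) = x^2 + \cos x$ (so $f'(x) = 2x - \sin x$, $f''(x) = 2 - \cos x \le 3$) and counting iterations for each method:

```python
import math

# Minimize f(x) = x^2 + cos(x): f'(x) = 2x - sin(x), f''(x) = 2 - cos(x).
# Count iterations until |f'| < 1e-8, starting from x = 2.
def df(x):  return 2 * x - math.sin(x)
def d2f(x): return 2 - math.cos(x)

x, newton_steps = 2.0, 0
while abs(df(x)) > 1e-8:
    x -= df(x) / d2f(x)          # Newton: uses the full second-order model
    newton_steps += 1

x, gd_steps = 2.0, 0
while abs(df(x)) > 1e-8:
    x -= df(x) / 3.0             # gradient descent with step 1/L, L = 3
    gd_steps += 1

print(newton_steps, gd_steps)    # Newton needs far fewer (but costlier) iterations
assert newton_steps < gd_steps
```

In $n$ dimensions each Newton iteration additionally pays the $O(n^3)$ linear solve, which is the trade-off the paragraph above describes.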
Exercises
Problem
Compute the second-order Taylor expansion of a smooth function $f$ of your choice at a point $a$. Use the Lagrange remainder to bound the error for $x$ in a neighborhood of $a$.
Problem
Let $f$ be twice continuously differentiable with $|f''(x)| \le M$ for all $x$. Prove that $|f(x) - f(a) - f'(a)(x - a)| \le \frac{M}{2} (x - a)^2$.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapter 5
- Apostol, Mathematical Analysis (1974), Chapter 5
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.1 (Taylor approximation in optimization)
- Nesterov, Introductory Lectures on Convex Optimization (2004), Section 1.2
Last reviewed: April 2026