
Foundations

Taylor Expansion

Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.


Why This Matters

Every optimization algorithm you use in ML is a Taylor approximation in disguise.

Gradient descent: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h$ (first order), then move in the direction that decreases this linear approximation.

Newton's method: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x)\, h$ (second order), then minimize this quadratic exactly.

Adam, L-BFGS, natural gradient: all variations on which Taylor terms to keep and how to estimate them. Understanding Taylor expansion means understanding the foundation of all gradient-based optimization.

Single Variable Taylor Expansion

Definition

Taylor Polynomial

The $k$-th order Taylor polynomial of $f$ centered at $a$ is:

$$T_k(x; a) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x - a)^j$$

This is the unique polynomial of degree $\leq k$ that matches $f$ and its first $k$ derivatives at $a$.

The cases that matter most:

First order (linear approximation):

$$f(x + h) \approx f(x) + f'(x)\, h$$

Second order (quadratic approximation):

$$f(x + h) \approx f(x) + f'(x)\, h + \frac{1}{2} f''(x)\, h^2$$
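A quick numerical sketch of these two approximations (our own illustration, not part of the text's development), using $f(x) = e^x$ at $a = 0$, where $f(0) = f'(0) = f''(0) = 1$:

```python
import math

# Linear and quadratic Taylor approximations of f(x) = e^x around 0,
# where f(0) = f'(0) = f''(0) = 1.
def taylor1(h):
    return 1.0 + h                  # f(0) + f'(0) h

def taylor2(h):
    return 1.0 + h + 0.5 * h ** 2   # f(0) + f'(0) h + (1/2) f''(0) h^2

for h in [0.5, 0.1, 0.01]:
    exact = math.exp(h)
    print(f"h={h:<4}: first-order error {abs(exact - taylor1(h)):.1e}, "
          f"second-order error {abs(exact - taylor2(h)):.1e}")
```

Shrinking $h$ by a factor of 10 shrinks the first-order error by roughly $100\times$ ($h^2$ scaling) and the second-order error by roughly $1000\times$ ($h^3$ scaling).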

Remainder Terms

The approximation is only useful if you can bound the error.

Definition

Lagrange Remainder

If $f$ is $(k+1)$ times continuously differentiable on an interval containing $a$ and $x$, the remainder after the $k$-th order Taylor polynomial is:

$$R_k(x; a) = \frac{f^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}$$

for some $c$ between $a$ and $x$.

Definition

Integral Form of Remainder

Under the same conditions:

$$R_k(x; a) = \int_a^x \frac{f^{(k+1)}(t)}{k!}(x - t)^k \, dt$$

This form is often more useful for bounding because you can estimate the integral directly without locating the unknown point $c$.
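A small check of the remainder bound (our own example): for $f = \sin$, every derivative is bounded by $M = 1$, and $T_2(x) = x$ since $f''(0) = 0$, so the Lagrange form gives $|\sin x - x| \leq |x|^3/3!$:

```python
import math

# Lagrange bound for f = sin at a = 0, k = 2: T2(x) = x and |f'''| <= 1,
# so |sin(x) - x| <= |x|^3 / 3!.
for x in [0.5, 0.2, 0.05]:
    err = abs(math.sin(x) - x)
    bound = abs(x) ** 3 / math.factorial(3)
    print(f"x={x}: error {err:.2e} <= bound {bound:.2e}")
    assert err <= bound
```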

Main Theorems

Theorem

Taylor Theorem with Lagrange Remainder

Statement

If $f: \mathbb{R} \to \mathbb{R}$ is $(k+1)$ times continuously differentiable on an interval $I$ containing $a$ and $x$, then:

$$f(x) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x-a)^j + \frac{f^{(k+1)}(c)}{(k+1)!}(x-a)^{k+1}$$

for some $c$ between $a$ and $x$.

Intuition

The Taylor polynomial matches $f$ and its first $k$ derivatives exactly at $a$. The error is controlled by the size of the $(k+1)$-th derivative: if $|f^{(k+1)}| \leq M$ on $I$, then $|R_k| \leq M|x-a|^{k+1}/(k+1)!$.

Proof Sketch

Define $g(t) = f(x) - T_k(x; t) - C(x-t)^{k+1}$, where $C$ is chosen so that $g(a) = 0$; note $g(x) = 0$ automatically. By Rolle's theorem there is a $c$ between $a$ and $x$ with $g'(c) = 0$. Differentiating in $t$, the Taylor-polynomial terms telescope, leaving $g'(t) = -\frac{f^{(k+1)}(t)}{k!}(x-t)^k + C(k+1)(x-t)^k$; solving $g'(c) = 0$ for $C$ gives the Lagrange form.

Why It Matters

This is what lets you bound the error of gradient descent. If you use the first-order approximation and $f$ has bounded second derivative ($|f''| \leq L$), then the approximation error over a step of size $h$ is at most $\frac{L}{2}h^2$. This is exactly why gradient descent with step size $1/L$ converges for $L$-smooth functions.

Failure Mode

The theorem requires sufficient differentiability. If $f$ is only once differentiable, you cannot write a second-order expansion with Lagrange remainder. The bound is also only useful when $|x - a|$ is small; for large deviations the remainder can dominate.

Multivariate Taylor Expansion

For $f: \mathbb{R}^n \to \mathbb{R}$, the first-order expansion at $x$ is:

$$f(x + h) = f(x) + \nabla f(x)^T h + O(\|h\|^2)$$

where $\nabla f(x) \in \mathbb{R}^n$ is the gradient.

The second-order expansion is:

$$f(x + h) = f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x) \, h + O(\|h\|^3)$$

where $\nabla^2 f(x) \in \mathbb{R}^{n \times n}$ is the Hessian matrix with entries $[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

If $f$ is twice continuously differentiable, the Hessian is symmetric. If additionally $f$ is convex, the Hessian is positive semidefinite at every point.
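To see the $O(\|h\|^3)$ behavior concretely, here is a sketch (our example; NumPy assumed available) using $f(x) = \log \sum_i e^{x_i}$, whose gradient is the softmax vector $p$ and whose Hessian is $\operatorname{diag}(p) - pp^T$:

```python
import numpy as np

# f(x) = log(sum(exp(x))): gradient is softmax(x) = p, Hessian is diag(p) - p p^T.
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad(x):
    p = np.exp(x - np.max(x))       # shift for numerical stability
    return p / p.sum()

def hess(x):
    p = grad(x)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
h = rng.normal(size=4)

errors = []
for t in [1e-1, 1e-2, 1e-3]:
    step = t * h
    quad = f(x) + grad(x) @ step + 0.5 * step @ hess(x) @ step
    errors.append(abs(f(x + step) - quad))
    print(f"t={t:.0e}: |f(x+h) - quadratic model| = {errors[-1]:.2e}")
```

Each $10\times$ reduction in the step scale shrinks the error by roughly $1000\times$, matching the cubic remainder.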

Canonical Examples

Example

Gradient descent step size from Taylor

Let $f$ be $L$-smooth, meaning $\|\nabla^2 f(x)\|_{\text{op}} \leq L$ everywhere. By second-order Taylor:

$$f(x - \eta \nabla f(x)) \leq f(x) - \eta \|\nabla f(x)\|^2 + \frac{L \eta^2}{2} \|\nabla f(x)\|^2$$

Minimizing the right side over $\eta$ gives $\eta^* = 1/L$, yielding:

$$f\left(x - \tfrac{1}{L} \nabla f(x)\right) \leq f(x) - \frac{1}{2L} \|\nabla f(x)\|^2$$

This is the descent lemma, the workhorse of gradient descent convergence proofs.
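A direct numerical check of the descent lemma (our own sketch) on a random convex quadratic: for $f(x) = \frac{1}{2}x^T A x$ the Hessian is $A$, so $L = \lambda_{\max}(A)$ and each $1/L$ step must decrease $f$ by at least $\|\nabla f\|^2 / 2L$:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A symmetric positive definite: grad f = A x,
# Hessian = A, so f is L-smooth with L = largest eigenvalue of A.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
L = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

x = rng.normal(size=5)
values = [f(x)]
for _ in range(50):
    g = A @ x
    x = x - g / L                   # gradient step with eta = 1/L
    # Descent lemma: guaranteed decrease of at least ||g||^2 / (2L).
    assert f(x) <= values[-1] - g @ g / (2 * L) + 1e-12
    values.append(f(x))
print(f"f decreased from {values[0]:.3f} to {values[-1]:.3e} in 50 steps")
```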

Common Confusions

Watch Out

Taylor expansion is local, not global

The Taylor polynomial centered at $a$ approximates $f$ well near $a$; it says nothing about the function far from $a$. The Taylor series of $e^x$ happens to converge everywhere, but $e^{-1/x^2}$ (extended by $0$ at $x = 0$) is smooth with every Taylor coefficient at $0$ equal to zero, so no Taylor polynomial centered at $0$ tells you anything about it away from the origin. In optimization, this locality is why step sizes must be small enough.

Watch Out

Second-order methods are not always better

Newton's method uses the Hessian and converges faster per step. But computing and inverting an $n \times n$ Hessian costs $O(n^2)$ storage and $O(n^3)$ time. For neural networks with millions of parameters, this is infeasible. First-order methods win by being cheap per step, even if they need more steps.
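The tradeoff is easy to see on an ill-conditioned quadratic (a toy sketch of ours): Newton's quadratic model is exact there, so one step lands on the minimizer, while gradient descent with step $1/L$ crawls along the flat direction:

```python
import numpy as np

# f(x) = 0.5 x^T A x with condition number 100; the minimizer is the origin.
A = np.diag([1.0, 100.0])
L = 100.0                                  # largest eigenvalue of A
x0 = np.array([1.0, 1.0])

# Newton step: x - (Hessian)^{-1} grad f = x - A^{-1}(A x) = 0 exactly.
x_newton = x0 - np.linalg.solve(A, A @ x0)

# Gradient descent with eta = 1/L: the flat coordinate only shrinks by
# a factor of 0.99 per step.
x_gd = x0.copy()
for _ in range(100):
    x_gd = x_gd - (A @ x_gd) / L

print("Newton, 1 step:  ", np.linalg.norm(x_newton))
print("GD, 100 steps:   ", np.linalg.norm(x_gd))
```

Of course, each Newton step here required an $n \times n$ linear solve; at neural-network scale even forming the Hessian is out of reach, which is the point of the warning above.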

Exercises

ExerciseCore

Problem

Compute the second-order Taylor expansion of $f(x) = \log(1 + x)$ at $x = 0$. Use the Lagrange remainder to bound the error for $|x| \leq 0.1$.

ExerciseAdvanced

Problem

Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable with $\|\nabla^2 f(x)\|_{\text{op}} \leq L$ for all $x$. Prove that $|f(y) - f(x) - \nabla f(x)^T(y-x)| \leq \frac{L}{2}\|y-x\|^2$.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 5
  • Apostol, Mathematical Analysis (1974), Chapter 5

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.1 (Taylor approximation in optimization)
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Section 1.2

Last reviewed: April 2026
