
Foundations

Taylor Expansion

Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.


Why This Matters

Every optimization algorithm you use in ML is a Taylor approximation in disguise.

Gradient descent: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h$ (first order), then move in the direction that decreases this linear approximation.

Newton's method: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x)\, h$ (second order), then minimize this quadratic exactly.

Adam, L-BFGS, natural gradient: all variations on which Taylor terms to keep and how to estimate them. Understanding Taylor expansion means understanding the foundation of all gradient-based optimization.

Single Variable Taylor Expansion

Definition

Taylor Polynomial

The $k$-th order Taylor polynomial of $f$ centered at $a$ is:

$$T_k(x; a) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x - a)^j$$

This is the unique polynomial of degree $\leq k$ that matches $f$ and its first $k$ derivatives at $a$.

The cases that matter most:

First order (linear approximation):

$$f(x + h) \approx f(x) + f'(x)\, h$$

Second order (quadratic approximation):

$$f(x + h) \approx f(x) + f'(x)\, h + \frac{1}{2} f''(x)\, h^2$$
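A quick numerical sketch of these two approximations (our own illustration, not part of the text's development), using $f(x) = e^x$ at $a = 0$, where $f(0) = f'(0) = f''(0) = 1$:

```python
import math

# Linear and quadratic Taylor approximations of f(x) = e^x around 0,
# where f(0) = f'(0) = f''(0) = 1.
def taylor1(h):
    return 1.0 + h                  # f(0) + f'(0) h

def taylor2(h):
    return 1.0 + h + 0.5 * h ** 2   # f(0) + f'(0) h + (1/2) f''(0) h^2

for h in [0.5, 0.1, 0.01]:
    exact = math.exp(h)
    print(f"h={h:<4}: first-order error {abs(exact - taylor1(h)):.1e}, "
          f"second-order error {abs(exact - taylor2(h)):.1e}")
```

Shrinking $h$ by a factor of 10 shrinks the first-order error by roughly $100\times$ ($h^2$ scaling) and the second-order error by roughly $1000\times$ ($h^3$ scaling).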

Remainder Terms

The approximation is only useful if you can bound the error.

Definition

Lagrange Remainder

If $f$ is $(k+1)$ times continuously differentiable on an interval containing $a$ and $x$, the remainder after the $k$-th order Taylor polynomial is:

$$R_k(x; a) = \frac{f^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}$$

for some $c$ between $a$ and $x$.

Definition

Integral Form of Remainder

Under the same conditions:

$$R_k(x; a) = \int_a^x \frac{f^{(k+1)}(t)}{k!}(x - t)^k \, dt$$

This form is often more useful for bounding because you can estimate the integral directly without locating the unknown point $c$.
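A small check of the remainder bound (our own example): for $f = \sin$, every derivative is bounded by $M = 1$, and $T_2(x) = x$ since $f''(0) = 0$, so the Lagrange form gives $|\sin x - x| \leq |x|^3/3!$:

```python
import math

# Lagrange bound for f = sin at a = 0, k = 2: T2(x) = x and |f'''| <= 1,
# so |sin(x) - x| <= |x|^3 / 3!.
for x in [0.5, 0.2, 0.05]:
    err = abs(math.sin(x) - x)
    bound = abs(x) ** 3 / math.factorial(3)
    print(f"x={x}: error {err:.2e} <= bound {bound:.2e}")
    assert err <= bound
```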

Main Theorems

Theorem

Taylor Theorem with Lagrange Remainder

Statement

If $f: \mathbb{R} \to \mathbb{R}$ is $(k+1)$ times continuously differentiable on an interval $I$ containing $a$ and $x$, then:

$$f(x) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x-a)^j + \frac{f^{(k+1)}(c)}{(k+1)!}(x-a)^{k+1}$$

for some $c$ between $a$ and $x$.

Intuition

The Taylor polynomial matches $f$ and its first $k$ derivatives exactly at $a$. The error is controlled by the size of the $(k+1)$-th derivative: if $|f^{(k+1)}| \leq M$ on $I$, then $|R_k| \leq M|x-a|^{k+1}/(k+1)!$.

Proof Sketch

Define $g(t) = f(x) - T_k(x; t) - C(x-t)^{k+1}$, where $C$ is chosen so that $g(a) = 0$; note $g(x) = 0$ automatically. By Rolle's theorem there is a $c$ between $a$ and $x$ with $g'(c) = 0$. Differentiating in $t$, the Taylor-polynomial terms telescope, leaving $g'(t) = -\frac{f^{(k+1)}(t)}{k!}(x-t)^k + C(k+1)(x-t)^k$; solving $g'(c) = 0$ for $C$ gives the Lagrange form.

Why It Matters

This is what lets you bound the error of gradient descent. If you use the first-order approximation and $f$ has bounded second derivative ($|f''| \leq L$), then the approximation error over a step of size $h$ is at most $\frac{L}{2}h^2$. This is exactly why gradient descent with step size $1/L$ converges for $L$-smooth functions.

Failure Mode

The theorem requires sufficient differentiability. If $f$ is only once differentiable, you cannot write a second-order expansion with Lagrange remainder. The bound is also only useful when $|x - a|$ is small; for large deviations the remainder can dominate.

Multivariate Taylor Expansion

For $f: \mathbb{R}^n \to \mathbb{R}$, the first-order expansion at $x$ is:

$$f(x + h) = f(x) + \nabla f(x)^T h + O(\|h\|^2)$$

where $\nabla f(x) \in \mathbb{R}^n$ is the gradient.

The second-order expansion is:

$$f(x + h) = f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x) \, h + O(\|h\|^3)$$

where $\nabla^2 f(x) \in \mathbb{R}^{n \times n}$ is the Hessian matrix with entries $[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

If $f$ is twice continuously differentiable, the Hessian is symmetric. If additionally $f$ is convex, the Hessian is positive semidefinite at every point.
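To see the $O(\|h\|^3)$ behavior concretely, here is a sketch (our example; NumPy assumed available) using $f(x) = \log \sum_i e^{x_i}$, whose gradient is the softmax vector $p$ and whose Hessian is $\operatorname{diag}(p) - pp^T$:

```python
import numpy as np

# f(x) = log(sum(exp(x))): gradient is softmax(x) = p, Hessian is diag(p) - p p^T.
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad(x):
    p = np.exp(x - np.max(x))       # shift for numerical stability
    return p / p.sum()

def hess(x):
    p = grad(x)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
h = rng.normal(size=4)

errors = []
for t in [1e-1, 1e-2, 1e-3]:
    step = t * h
    quad = f(x) + grad(x) @ step + 0.5 * step @ hess(x) @ step
    errors.append(abs(f(x + step) - quad))
    print(f"t={t:.0e}: |f(x+h) - quadratic model| = {errors[-1]:.2e}")
```

Each $10\times$ reduction in the step scale shrinks the error by roughly $1000\times$, matching the cubic remainder.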

Canonical Examples

Example

Gradient descent step size from Taylor

Let $f$ be $L$-smooth, meaning $\|\nabla^2 f(x)\|_{\text{op}} \leq L$ everywhere. By second-order Taylor:

$$f(x - \eta \nabla f(x)) \leq f(x) - \eta \|\nabla f(x)\|^2 + \frac{L \eta^2}{2} \|\nabla f(x)\|^2$$

Minimizing the right side over $\eta$ gives $\eta^* = 1/L$, yielding:

$$f\left(x - \tfrac{1}{L} \nabla f(x)\right) \leq f(x) - \frac{1}{2L} \|\nabla f(x)\|^2$$

This is the descent lemma, the workhorse of gradient descent convergence proofs.
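A direct numerical check of the descent lemma (our own sketch) on a random convex quadratic: for $f(x) = \frac{1}{2}x^T A x$ the Hessian is $A$, so $L = \lambda_{\max}(A)$ and each $1/L$ step must decrease $f$ by at least $\|\nabla f\|^2 / 2L$:

```python
import numpy as np

# f(x) = 0.5 x^T A x with A symmetric positive definite: grad f = A x,
# Hessian = A, so f is L-smooth with L = largest eigenvalue of A.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
L = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

x = rng.normal(size=5)
values = [f(x)]
for _ in range(50):
    g = A @ x
    x = x - g / L                   # gradient step with eta = 1/L
    # Descent lemma: guaranteed decrease of at least ||g||^2 / (2L).
    assert f(x) <= values[-1] - g @ g / (2 * L) + 1e-12
    values.append(f(x))
print(f"f decreased from {values[0]:.3f} to {values[-1]:.3e} in 50 steps")
```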

Common Confusions

Watch Out

Taylor expansion is local, not global

The Taylor polynomial centered at $a$ approximates $f$ well near $a$; it says nothing about the function far from $a$. The Taylor series of $e^x$ happens to converge everywhere, but $e^{-1/x^2}$ (extended by $0$ at $x = 0$) is smooth with every Taylor coefficient at $0$ equal to zero, so no Taylor polynomial centered at $0$ tells you anything about it away from the origin. In optimization, this locality is why step sizes must be small enough.

Watch Out

Second-order methods are not always better

Newton's method uses the Hessian and converges faster per step. But computing and inverting an $n \times n$ Hessian costs $O(n^2)$ storage and $O(n^3)$ time. For neural networks with millions of parameters, this is infeasible. First-order methods win by being cheap per step, even if they need more steps.
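The tradeoff is easy to see on an ill-conditioned quadratic (a toy sketch of ours): Newton's quadratic model is exact there, so one step lands on the minimizer, while gradient descent with step $1/L$ crawls along the flat direction:

```python
import numpy as np

# f(x) = 0.5 x^T A x with condition number 100; the minimizer is the origin.
A = np.diag([1.0, 100.0])
L = 100.0                                  # largest eigenvalue of A
x0 = np.array([1.0, 1.0])

# Newton step: x - (Hessian)^{-1} grad f = x - A^{-1}(A x) = 0 exactly.
x_newton = x0 - np.linalg.solve(A, A @ x0)

# Gradient descent with eta = 1/L: the flat coordinate only shrinks by
# a factor of 0.99 per step.
x_gd = x0.copy()
for _ in range(100):
    x_gd = x_gd - (A @ x_gd) / L

print("Newton, 1 step:  ", np.linalg.norm(x_newton))
print("GD, 100 steps:   ", np.linalg.norm(x_gd))
```

Of course, each Newton step here required an $n \times n$ linear solve; at neural-network scale even forming the Hessian is out of reach, which is the point of the warning above.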

Exercises

ExerciseCore

Problem

Compute the second-order Taylor expansion of $f(x) = \log(1 + x)$ at $x = 0$. Use the Lagrange remainder to bound the error for $|x| \leq 0.1$.

ExerciseAdvanced

Problem

Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable with $\|\nabla^2 f(x)\|_{\text{op}} \leq L$ for all $x$. Prove that $|f(y) - f(x) - \nabla f(x)^T(y-x)| \leq \frac{L}{2}\|y-x\|^2$.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 5
  • Apostol, Mathematical Analysis (1974), Chapter 5

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.1 (Taylor approximation in optimization)
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Section 1.2

Last reviewed: April 2026
