
Numerical Optimization

Newton's Method

The gold standard for fast local convergence: use second-order information (the Hessian) to take optimal quadratic steps. Quadratic convergence when it works, catastrophic failure when it doesn't.


Why This Matters

Gradient descent uses only first-order information: the gradient tells you which direction is downhill, but not how far to go. Newton's method uses second-order information, the curvature of the function, to determine both the direction and the step size. When the function is smooth and you are close to a minimum, this yields quadratic convergence: the number of correct digits doubles at each step.

Newton's method is the theoretical gold standard for local convergence. Every optimization algorithm you use in ML is either Newton's method, an approximation to it, or motivated by understanding where Newton's method fails and why.

[Figure: Newton's method on $f(x) = (x-2)^2 + 1$. Each step: $x_{n+1} = x_n - f'(x_n)/f''(x_n)$. Quadratic convergence: errors square each step.]

Mental Model

At the current iterate $x^{(t)}$, build a second-order Taylor approximation to the function. This gives you a quadratic model. Minimize the quadratic exactly. This has a closed-form solution. Move to that minimizer. Repeat.

If the function is actually quadratic, Newton's method converges in one step. If the function is "nearly quadratic" near the optimum (which smooth functions are), convergence is extremely fast.

Newton's Method for Root Finding

The original form of Newton's method finds roots of equations.

Definition

Newton's Method (Root Finding)

Given a differentiable function $f: \mathbb{R} \to \mathbb{R}$, Newton's method for finding a root $f(x^*) = 0$ iterates:

$$x^{(t+1)} = x^{(t)} - \frac{f(x^{(t)})}{f'(x^{(t)})}$$

Geometrically: linearize $f$ at $x^{(t)}$ and find where the tangent line crosses zero.
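As a minimal numerical sketch (the function $f(x) = x^2 - 2$ and starting point are illustrative choices, not from the text), the iteration converges to $\sqrt{2}$ in a handful of steps:

```python
# Scalar Newton root finding: iterate x <- x - f(x)/f'(x) until the step is tiny.
# Illustrative example: f(x) = x^2 - 2, whose positive root is sqrt(2).
import math

def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.4142135623730951
```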

In multiple dimensions, finding a root of $F: \mathbb{R}^d \to \mathbb{R}^d$ becomes:

$$x^{(t+1)} = x^{(t)} - [J_F(x^{(t)})]^{-1} F(x^{(t)})$$

where $J_F$ is the Jacobian matrix of $F$.
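The multidimensional iteration can be sketched as follows (the system, intersecting the circle $x^2 + y^2 = 4$ with the line $y = x$, is an illustrative choice); in practice one solves the linear system $J_F \, \delta = F$ rather than inverting $J_F$:

```python
# Multidimensional Newton root finding: x <- x - J_F(x)^{-1} F(x).
# Illustrative system: the circle x^2 + y^2 = 4 intersected with the line y = x.
import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, y - x])

def J(v):
    x, y = v
    return np.array([[2 * x, 2 * y],
                     [-1.0, 1.0]])

v = np.array([1.0, 2.0])                    # starting guess
for _ in range(20):
    delta = np.linalg.solve(J(v), F(v))     # solve J delta = F (no explicit inverse)
    v = v - delta
    if np.linalg.norm(delta) < 1e-12:
        break
print(v)  # ≈ [sqrt(2), sqrt(2)]
```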

Newton's Method for Optimization

To minimize $f: \mathbb{R}^d \to \mathbb{R}$, we want $\nabla f(x^*) = 0$. Apply Newton's root-finding method to $F = \nabla f$:

Definition

Newton's Method (Optimization)

The Newton step for minimizing $f$ is:

$$x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})$$

where $\nabla^2 f(x^{(t)}) \in \mathbb{R}^{d \times d}$ is the Hessian matrix of second derivatives, and $\nabla f(x^{(t)}) \in \mathbb{R}^d$ is the gradient.

The Second-Order Taylor Motivation

At $x^{(t)}$, the second-order Taylor expansion of $f$ is:

$$f(x) \approx f(x^{(t)}) + \nabla f(x^{(t)})^\top (x - x^{(t)}) + \frac{1}{2}(x - x^{(t)})^\top \nabla^2 f(x^{(t)}) (x - x^{(t)})$$

Call this quadratic model $m(x)$. Setting $\nabla m(x) = 0$:

$$\nabla f(x^{(t)}) + \nabla^2 f(x^{(t)})(x - x^{(t)}) = 0$$

$$x = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})$$

This is exactly the Newton step. Newton's method minimizes the local quadratic model at each iteration.
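The step above can be sketched in code. This is an illustrative example, not from the source: we minimize $f(v) = \sum_i e^{v_i} + \tfrac{1}{2}\|v\|^2$, whose gradient $e^{v} + v$ vanishes when every coordinate equals $-W(1) \approx -0.5671$ (the Omega constant), and the diagonal Hessian is positive definite everywhere:

```python
# Newton's method for optimization: x <- x - [Hessian]^{-1} gradient.
# Illustrative objective: f(v) = sum(exp(v_i)) + 0.5 * ||v||^2.
import numpy as np

def grad(v):
    return np.exp(v) + v

def hess(v):
    return np.diag(np.exp(v) + 1.0)       # positive definite for all v

v = np.zeros(3)
for _ in range(20):
    step = np.linalg.solve(hess(v), grad(v))   # solve the Newton system
    v = v - step
    if np.linalg.norm(step) < 1e-12:
        break
print(v)  # each coordinate ≈ -0.5671432904 (the Omega constant)
```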

Quadratic Convergence

Theorem

Quadratic Convergence of Newton's Method

Statement

Let $f: \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable with $\nabla^2 f$ Lipschitz continuous (constant $L$). Let $x^*$ be a local minimizer with $\nabla^2 f(x^*) \succ 0$. Then there exists $\delta > 0$ such that if $\|x^{(0)} - x^*\| < \delta$, Newton's method converges quadratically:

$$\|x^{(t+1)} - x^*\| \leq C \|x^{(t)} - x^*\|^2$$

for some constant $C > 0$ depending on $L$ and $\nabla^2 f(x^*)$.

Intuition

Quadratic convergence means the error is squared at each step. If you start with error $10^{-2}$, after one step it is roughly $10^{-4}$, then $10^{-8}$, then $10^{-16}$. The number of correct digits doubles every iteration. This is spectacularly fast, much faster than the linear convergence of gradient descent, where you only gain a constant factor per step.
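The error squaring is easy to observe numerically. A small sketch (the problem, Newton root finding on $f(x) = x^2 - 2$ with $x^* = \sqrt{2}$, is an illustrative choice):

```python
# Track the error |x_t - x*| under Newton's iteration for f(x) = x^2 - 2.
# Each error is roughly the square of the previous one (until machine precision).
import math

x, x_star = 3.0, math.sqrt(2)
errors = []
for _ in range(6):
    errors.append(abs(x - x_star))
    x = x - (x**2 - 2) / (2 * x)        # Newton step: x - f(x)/f'(x)

for e in errors:
    print(f"{e:.2e}")
# The errors fall roughly like e, e^2, e^4, e^8, ...
```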

Proof Sketch

At $x^*$, $\nabla f(x^*) = 0$. The Newton step gives:

$$x^{(t+1)} - x^* = x^{(t)} - x^* - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})$$

$$= [\nabla^2 f(x^{(t)})]^{-1} \left[\nabla^2 f(x^{(t)})(x^{(t)} - x^*) - (\nabla f(x^{(t)}) - \nabla f(x^*))\right]$$

The term in brackets is the error in a first-order Taylor expansion of $\nabla f$. By the Lipschitz condition on the Hessian, this is $O(\|x^{(t)} - x^*\|^2)$. Applying $[\nabla^2 f(x^{(t)})]^{-1}$, whose norm is bounded because $\nabla^2 f(x^{(t)})$ is close to the positive definite $\nabla^2 f(x^*)$ for $x^{(t)}$ near $x^*$, preserves the quadratic bound.

Why It Matters

Quadratic convergence is the gold standard. No first-order method can achieve it. This speed explains why Newton's method (and its approximations) dominate when applicable.

Failure Mode

The "sufficiently close" requirement is critical. Far from the optimum, pure Newton's method can diverge, overshoot, or converge to a maximum or saddle point (Newton finds stationary points, not necessarily minima). This motivates damped Newton and trust-region methods.

Damped Newton (Newton with Line Search)

Pure Newton takes the full step $-[\nabla^2 f]^{-1} \nabla f$. Far from the optimum, this step may be too large or even go uphill. The fix:

$$x^{(t+1)} = x^{(t)} - \alpha_t [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})$$

where $\alpha_t \in (0, 1]$ is chosen by a line search (e.g., backtracking line search with the Armijo condition). Far from the optimum, $\alpha_t$ may be small; close to the optimum, $\alpha_t = 1$ is accepted and you recover the full quadratic convergence rate.

The Newton direction $d = -[\nabla^2 f]^{-1} \nabla f$ is a descent direction whenever $\nabla^2 f \succ 0$ (positive definite), because:

$$\nabla f^\top d = -\nabla f^\top [\nabla^2 f]^{-1} \nabla f < 0$$
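Damped Newton with Armijo backtracking can be sketched as below. The test function $f(x) = \sqrt{1 + x^2}$ is an illustrative choice: its pure Newton map is $x \mapsto -x^3$, which diverges for $|x| > 1$, so the line search is doing real work here:

```python
# Damped Newton with Armijo backtracking on f(x) = sqrt(1 + x^2).
# Pure Newton maps x to -x^3 here and diverges for |x| > 1;
# the backtracking line search keeps steps safe, and alpha = 1
# is accepted near the optimum, restoring quadratic convergence.
import math

f = lambda x: math.sqrt(1 + x**2)
fp = lambda x: x / math.sqrt(1 + x**2)         # f'
fpp = lambda x: (1 + x**2) ** -1.5             # f'' > 0 everywhere

x = 2.0                                        # pure Newton would diverge from here
for _ in range(100):
    d = -fp(x) / fpp(x)                        # Newton direction
    alpha, c = 1.0, 1e-4
    # Armijo condition: require sufficient decrease before accepting the step
    while f(x + alpha * d) > f(x) + c * alpha * fp(x) * d:
        alpha *= 0.5
    x += alpha * d
    if abs(fp(x)) < 1e-12:
        break
print(x)  # ≈ 0.0, the unique minimizer
```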

Why Newton Is Impractical for Large-Scale ML

The Hessian $\nabla^2 f \in \mathbb{R}^{d \times d}$ is the bottleneck:

  1. Storage: For a neural network with $d = 10^8$ parameters, the Hessian has $d^2 = 10^{16}$ entries. This does not fit in any memory.
  2. Computation: Computing the full Hessian requires at least $O(d^2)$ work (one unit per entry), and solving the linear system $\nabla^2 f \cdot \Delta x = -\nabla f$ takes $O(d^3)$ with direct methods.
  3. Non-convexity: For non-convex problems (neural networks), the Hessian may have negative eigenvalues, making the Newton direction point uphill rather than downhill.

This is why quasi-Newton methods (which approximate the Hessian cheaply) and first-order methods (SGD, Adam) dominate in practice.

Common Confusions

Watch Out

Newton for optimization vs. Newton for root-finding are different algorithms

Newton's method for root-finding solves $f(x) = 0$. Newton's method for optimization solves $\nabla f(x) = 0$. They have the same structure (linearize and solve), but the function being linearized is different: $f$ itself in root-finding, $\nabla f$ in optimization. In 1D, Newton for optimization uses the second derivative $f''$ (curvature), while root-finding uses the first derivative $f'$ (slope). The convergence theory is analogous, but the algorithms solve structurally different problems.

Watch Out

Newton's method can converge to maxima or saddle points

Newton's method finds stationary points where $\nabla f = 0$. It does not distinguish between minima, maxima, and saddle points. If you start near a saddle point, Newton may converge to it happily. Checking that the Hessian is positive definite at convergence is necessary to confirm you found a local minimum. Modifications like adding a multiple of the identity to the Hessian ($\nabla^2 f + \lambda I$) ensure positive definiteness.
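The $\nabla^2 f + \lambda I$ modification can be sketched as follows. The saddle example $f(x, y) = x^2 - y^2$ and the shift rule are illustrative choices: we pick $\lambda$ just large enough that the smallest eigenvalue of the shifted Hessian is positive, which guarantees the resulting direction is a descent direction:

```python
# Regularized Newton direction: if the Hessian is not positive definite,
# add lambda * I until its smallest eigenvalue clears a margin, so the
# step is guaranteed to be a descent direction instead of heading to a saddle.
import numpy as np

def regularized_newton_direction(g, H, margin=1e-2):
    lam_min = np.linalg.eigvalsh(H)[0]          # smallest eigenvalue of H
    lam = max(0.0, margin - lam_min)            # shift so eigenvalues >= margin
    return -np.linalg.solve(H + lam * np.eye(H.shape[0]), g)

# f(x, y) = x^2 - y^2 has a saddle at the origin: Hessian = diag(2, -2).
g = np.array([2.0, -2.0])                       # gradient of f at the point (1, 1)
H = np.diag([2.0, -2.0])                        # indefinite Hessian

d = regularized_newton_direction(g, H)
print(g @ d < 0)  # True: descent direction despite the indefinite Hessian
```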

Summary

  • Newton's step: $x^{(t+1)} = x^{(t)} - [\nabla^2 f(x^{(t)})]^{-1} \nabla f(x^{(t)})$
  • Motivated by minimizing the local second-order Taylor model
  • Quadratic convergence near a strict local minimum with Lipschitz Hessian
  • Converges in exactly one step for quadratic functions
  • Requires computing and inverting the $d \times d$ Hessian: $O(d^2)$ storage, $O(d^3)$ solve
  • Impractical for $d > 10^4$ in its pure form
  • Damped Newton (line search) gives global convergence guarantees

Exercises

ExerciseCore

Problem

Let $f(x) = \frac{1}{2} x^\top A x - b^\top x$ where $A \succ 0$ (positive definite). Show that Newton's method converges to the minimizer $x^* = A^{-1}b$ in exactly one step from any starting point.
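A quick numerical check of the claim (the random matrix construction is an illustrative choice): on a positive definite quadratic, one Newton step from an arbitrary start lands exactly at $x^* = A^{-1}b$, because $x^{(0)} - A^{-1}(Ax^{(0)} - b) = A^{-1}b$:

```python
# One Newton step on f(x) = 0.5 x^T A x - b^T x with A positive definite
# reaches the minimizer x* = A^{-1} b from any starting point.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # random positive definite matrix
b = rng.standard_normal(4)

x0 = rng.standard_normal(4) * 10     # arbitrary far-away starting point
g = A @ x0 - b                       # gradient of the quadratic at x0
x1 = x0 - np.linalg.solve(A, g)      # single Newton step (the Hessian is A)

x_star = np.linalg.solve(A, b)
print(np.allclose(x1, x_star))  # True
```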

ExerciseAdvanced

Problem

Show that gradient descent on the same quadratic $f(x) = \frac{1}{2}x^\top Ax - b^\top x$ with step size $\alpha = 2/(\lambda_{\max} + \lambda_{\min})$ (the optimal fixed step size) has convergence rate:

$$\|x^{(t)} - x^*\|_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^t \|x^{(0)} - x^*\|_A$$

where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. Compare this to Newton's one-step convergence.
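The contrast can be verified numerically. In this illustrative setup, $A$ is diagonal with eigenvalues 1 and 10 (so $\kappa = 10$ and the rate is $9/11 \approx 0.818$); gradient descent with the optimal fixed step still has a visible error after 50 iterations, while Newton would finish in one:

```python
# Gradient descent with alpha = 2/(lmax + lmin) on a quadratic:
# the A-norm error contracts by (k - 1)/(k + 1) per step, as in the exercise.
import numpy as np

A = np.diag([1.0, 10.0])             # eigenvalues 1 and 10, condition number k = 10
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
a_norm = lambda v: np.sqrt(v @ A @ v)

alpha = 2.0 / (10.0 + 1.0)           # optimal fixed step size
rate = (10.0 - 1.0) / (10.0 + 1.0)   # (k - 1)/(k + 1) ≈ 0.818

x = np.array([5.0, 5.0])
e0 = a_norm(x - x_star)
for t in range(1, 51):
    x = x - alpha * (A @ x - b)                          # gradient step
    assert a_norm(x - x_star) <= rate**t * e0 + 1e-12    # the bound holds
print(a_norm(x - x_star))  # still nonzero after 50 steps; Newton needs one
```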

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and Section 3.3 (Newton with line search)
  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.5
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapter 5

Current:

  • Nesterov, Lectures on Convex Optimization (2018), Chapter 1
  • Conn, Gould, Toint, Trust-Region Methods (2000), Chapter 6 (trust-region Newton methods)
  • Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review, Section 4 (second-order methods at scale)

Next Topics

The natural next steps from Newton's method:

  • Quasi-Newton methods: approximate the Hessian cheaply (BFGS, L-BFGS)
  • Line search methods: choosing step sizes with Wolfe conditions

Last reviewed: April 2026
