Numerical Optimization
Newton's Method
The gold standard for fast local convergence: use second-order information (the Hessian) to take steps that exactly minimize a local quadratic model. Quadratic convergence when it works, catastrophic failure when it doesn't.
Why This Matters
Gradient descent uses only first-order information: the gradient tells you which direction is downhill, but not how far to go. Newton's method uses second-order information (the curvature of the function) to determine both the direction and the step size. When the function is smooth and you are close to a minimum, this yields quadratic convergence: the number of correct digits doubles at each step.
Newton's method is the theoretical gold standard for local convergence. Every optimization algorithm you use in ML is either Newton's method, an approximation to it, or motivated by understanding where Newton's method fails and why.
Mental Model
At the current iterate $x_k$, build a second-order Taylor approximation to the function. This gives you a quadratic model. Minimize the quadratic exactly. This has a closed-form solution. Move to that minimizer. Repeat.
If the function is actually quadratic, Newton's method converges in one step. If the function is "nearly quadratic" near the optimum (which smooth functions are), convergence is extremely fast.
Newton's Method for Root Finding
The original form of Newton's method finds roots of equations.
Newton's Method (Root Finding)
Given a differentiable function $f: \mathbb{R} \to \mathbb{R}$, Newton's method for finding a root of $f(x) = 0$ iterates:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

Geometrically: linearize $f$ at $x_k$ and find where the tangent line crosses zero.

In multiple dimensions, finding a root of $F: \mathbb{R}^n \to \mathbb{R}^n$ becomes:

$$x_{k+1} = x_k - J_F(x_k)^{-1} F(x_k)$$

where $J_F$ is the Jacobian matrix of $F$.
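The 1-D iteration above can be sketched in a few lines. This is a minimal illustration, not a production root finder (no safeguards against $f'(x_k) = 0$):

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """1-D Newton iteration: x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fprime(x)
    return x

# Find the positive root of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ~1.41421356...
```

From $x_0 = 1$ the iterates are $1.5$, $1.41\overline{6}$, $1.41421568\ldots$, already matching $\sqrt{2}$ to many digits by the fourth step.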
Newton's Method for Optimization
To minimize $f$, we want $\nabla f(x) = 0$. Apply Newton's root-finding method to $\nabla f$:
Newton's Method (Optimization)
The Newton step for minimizing $f$ is:

$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$$

where $\nabla^2 f(x_k)$ is the Hessian matrix of second derivatives, and $\nabla f(x_k)$ is the gradient.
The Second-Order Taylor Motivation
At $x_k$, the second-order Taylor expansion of $f$ is:

$$m_k(p) = f(x_k) + \nabla f(x_k)^\top p + \tfrac{1}{2}\, p^\top \nabla^2 f(x_k)\, p$$

Call this quadratic model $m_k$. Setting $\nabla m_k(p) = 0$:

$$\nabla f(x_k) + \nabla^2 f(x_k)\, p = 0 \quad\Longrightarrow\quad p = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$$
This is exactly the Newton step. Newton's method minimizes the local quadratic model at each iteration.
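A sketch of one iteration of this loop, on a hypothetical smooth objective chosen for illustration (any twice-differentiable $f$ with the same interfaces would do). Note that it solves the linear system $\nabla^2 f(x_k)\, p = -\nabla f(x_k)$ rather than forming the inverse Hessian explicitly:

```python
import numpy as np

# Hypothetical objective for illustration: f(x) = exp(x0) + exp(-x0) + x1^2,
# minimized at the origin, with a positive definite Hessian there.
grad = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]) + np.exp(-x[0]), 0.0], [0.0, 2.0]])

def newton_step(x):
    # Solve H p = -g instead of computing H^{-1} (cheaper and more stable).
    p = np.linalg.solve(hess(x), -grad(x))
    return x + p

x = np.array([1.0, 1.0])
for _ in range(5):
    x = newton_step(x)
print(x)  # essentially [0, 0] after a handful of steps
```

The quadratic coordinate is solved exactly in one step; the non-quadratic one collapses at the quadratic rate described below.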
Quadratic Convergence
Quadratic Convergence of Newton's Method
Statement
Let $f$ be twice continuously differentiable with Lipschitz continuous Hessian (constant $L$). Let $x^*$ be a local minimizer with $\nabla^2 f(x^*) \succ 0$. Then there exists $\delta > 0$ such that if $\|x_0 - x^*\| < \delta$, Newton's method converges quadratically:

$$\|x_{k+1} - x^*\| \le C\, \|x_k - x^*\|^2$$

for some constant $C$ depending on $L$ and $\|\nabla^2 f(x^*)^{-1}\|$.
Intuition
Quadratic convergence means the error is squared at each step. If you start with error $10^{-2}$, after one step it is roughly $10^{-4}$, then $10^{-8}$, then $10^{-16}$. The number of correct digits doubles every iteration. This is spectacularly fast, much faster than the linear convergence of gradient descent, where you only gain a constant factor per step.
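The digit-doubling behavior is easy to observe numerically. A small check on a 1-D problem chosen for illustration (the objective is an assumption, not from the text):

```python
import math

# Minimize f(x) = exp(x) - 2x, whose minimizer is x* = ln 2.
# Newton update: x <- x - f'(x)/f''(x), with f'(x) = exp(x) - 2, f''(x) = exp(x).
xstar = math.log(2)
x = 0.0
errors = []
for _ in range(6):
    errors.append(abs(x - xstar))
    x -= (math.exp(x) - 2) / math.exp(x)
print(errors)  # each error is roughly the square of the previous one
```

The printed errors drop like $0.69,\ 0.31,\ 4\times 10^{-2},\ 9\times 10^{-4},\ 4\times 10^{-7},\ 10^{-13}$: once in the basin, the exponent doubles every step.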
Proof Sketch
At $x^*$, $\nabla f(x^*) = 0$. The Newton step gives:

$$x_{k+1} - x^* = \nabla^2 f(x_k)^{-1}\left[\nabla^2 f(x_k)(x_k - x^*) - \big(\nabla f(x_k) - \nabla f(x^*)\big)\right]$$

The term in brackets is the error in a first-order Taylor expansion of $\nabla f$. By the Lipschitz condition on the Hessian, its norm is at most $\tfrac{L}{2}\|x_k - x^*\|^2$. Applying $\nabla^2 f(x_k)^{-1}$ (which is close to $\nabla^2 f(x^*)^{-1}$ for $x_k$ near $x^*$, hence bounded) preserves the quadratic bound.
Why It Matters
Quadratic convergence is the gold standard. No first-order method can achieve it. This speed explains why Newton's method (and its approximations) dominate when applicable.
Failure Mode
The "sufficiently close" requirement is critical. Far from the optimum, pure Newton's method can diverge, overshoot, or converge to a maximum or saddle point (Newton finds stationary points, not necessarily minima). This motivates damped Newton and trust-region methods.
Damped Newton (Newton with Line Search)
Pure Newton takes the full step $x_{k+1} = x_k + p_k$ with $p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$. Far from the optimum, this step may be too large or even go uphill. The fix:

$$x_{k+1} = x_k + \alpha_k\, p_k$$

where $\alpha_k \in (0, 1]$ is chosen by a line search (e.g., backtracking line search with the Armijo condition). Far from the optimum, $\alpha_k$ may be small; close to the optimum, $\alpha_k = 1$ is accepted and you recover the full quadratic convergence rate.
The Newton direction $p_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$ is a descent direction whenever $\nabla^2 f(x_k) \succ 0$ (positive definite), because:

$$\nabla f(x_k)^\top p_k = -\nabla f(x_k)^\top \nabla^2 f(x_k)^{-1}\, \nabla f(x_k) < 0$$
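A minimal sketch of damped Newton with backtracking, tested on the classic cautionary example $f(x) = \sqrt{1 + x^2}$: pure Newton on this function maps $x \mapsto -x^3$ and diverges from any $|x_0| > 1$, while the damped version converges (the example function is our choice, not from the text):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton's method globalized with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)          # Newton direction
        alpha, c = 1.0, 1e-4
        # Shrink alpha until the Armijo sufficient-decrease condition holds.
        while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

# f(x) = sqrt(1 + x^2): pure Newton diverges from |x0| > 1, damping rescues it.
f = lambda x: np.sqrt(1 + x[0] ** 2)
grad = lambda x: np.array([x[0] / np.sqrt(1 + x[0] ** 2)])
hess = lambda x: np.array([[(1 + x[0] ** 2) ** -1.5]])
xmin = damped_newton(f, grad, hess, [2.0])
print(xmin)  # ~[0.]
```

From $x_0 = 2$ the backtracking rejects the full step, accepts $\alpha = 0.25$, and once inside the basin the full step $\alpha = 1$ is always accepted, restoring the quadratic rate.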
Why Newton Is Impractical for Large-Scale ML
The Hessian is the bottleneck:
- Storage: For a neural network with $n$ parameters, the Hessian has $n^2$ entries. For $n$ in the millions or billions, this does not fit in any memory.
- Computation: Computing the full Hessian requires work for each of its $n^2$ entries, and solving the Newton linear system takes $O(n^3)$ with direct methods.
- Non-convexity: For non-convex problems (neural networks), the Hessian may have negative eigenvalues, making the Newton direction ascend rather than descend.
This is why quasi-Newton methods (which approximate the Hessian cheaply) and first-order methods (SGD, Adam) dominate in practice.
Common Confusions
Newton for optimization vs. Newton for root-finding are different algorithms
Newton's method for root-finding solves $f(x) = 0$. Newton's method for optimization solves $\nabla f(x) = 0$. They have the same structure (linearize and solve), but the function being linearized is different: $f$ itself in root-finding, $\nabla f$ in optimization. In 1D, Newton for optimization uses the second derivative (curvature), while root-finding uses the first derivative (slope). The convergence theory is analogous but the algorithms solve structurally different problems.
Newton's method can converge to maxima or saddle points
Newton's method finds stationary points where $\nabla f(x) = 0$. It does not distinguish between minima, maxima, and saddle points. If you start near a saddle point, Newton may converge to it happily. Checking that the Hessian is positive definite at convergence is necessary to confirm you found a local minimum. Modifications like adding a multiple of the identity to the Hessian ($\nabla^2 f(x_k) + \lambda I$ with $\lambda > 0$ large enough) ensure positive definiteness.
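One common way to implement that modification, sketched under the assumption that positive definiteness is detected via an attempted Cholesky factorization (a standard trick, e.g. in Nocedal & Wright's Hessian-modification schemes):

```python
import numpy as np

def modified_newton_direction(g, H, lam0=1e-3):
    """Add lam * I to H until Cholesky succeeds (i.e. H + lam I is positive
    definite), guaranteeing the resulting direction is a descent direction."""
    lam = 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + lam * np.eye(len(g)))
            break
        except np.linalg.LinAlgError:   # not positive definite: increase lam
            lam = max(2 * lam, lam0)
    # Solve (H + lam I) p = -g using the Cholesky factor L L^T.
    p = np.linalg.solve(L.T, np.linalg.solve(L, -g))
    return p

# Indefinite Hessian: the pure Newton direction could head toward a saddle.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
p = modified_newton_direction(g, H)
print(g @ p)  # negative: a descent direction despite the indefinite H
```

As $\lambda \to \infty$ the direction blends toward (scaled) steepest descent, so this interpolates between Newton and gradient descent.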
Summary
- Newton's step: $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$
- Motivated by minimizing the local second-order Taylor model
- Quadratic convergence near a strict local minimum with Lipschitz Hessian
- Converges in exactly 1 step for quadratic functions
- Requires computing and inverting the Hessian: $O(n^2)$ storage, $O(n^3)$ solve
- Impractical for large $n$ in its pure form
- Damped Newton (line search) gives global convergence guarantees
Exercises
Problem
Let $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ where $A \succ 0$ (positive definite). Show that Newton's method converges to the minimizer $x^* = A^{-1} b$ in exactly one step from any starting point.
Problem
Show that gradient descent on the same quadratic with step size $\alpha = \frac{2}{\lambda_{\min}(A) + \lambda_{\max}(A)}$ (the optimal fixed step size) has convergence rate:

$$\|x_{k+1} - x^*\| \le \frac{\kappa - 1}{\kappa + 1}\, \|x_k - x^*\|$$

where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$ is the condition number. Compare this to Newton's one-step convergence.
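A quick numerical sanity check of the one-step claim (a check, not a proof), assuming the quadratic form $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with gradient $Ax - b$ and Hessian $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # random symmetric positive definite matrix
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)        # minimizer of 0.5 x^T A x - b^T x

x0 = rng.standard_normal(n)
# One Newton step: gradient is A x - b, Hessian is A.
x1 = x0 - np.linalg.solve(A, A @ x0 - b)
print(np.linalg.norm(x1 - xstar))    # zero up to machine precision
```

Algebraically $x_1 = x_0 - A^{-1}(A x_0 - b) = A^{-1} b = x^*$ regardless of $x_0$, which is what the code confirms in floating point.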
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 2-3 and Section 3.3 (Newton with line search)
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.5
- Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapter 5
Current:
- Nesterov, Lectures on Convex Optimization (2018), Chapter 1
- Conn, Gould, Toint, Trust-Region Methods (2000), Chapter 6 (trust-region Newton methods)
- Bottou, Curtis, Nocedal, "Optimization Methods for Large-Scale Machine Learning" (2018), SIAM Review, Section 4 (second-order methods at scale)
Next Topics
The natural next steps from Newton's method:
- Quasi-Newton methods: approximate the Hessian cheaply (BFGS, L-BFGS)
- Line search methods: choosing step sizes with Wolfe conditions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Interior Point Methods (Layer 3)
- Nonlinear Gauss-Seidel (Layer 3)
- Quasi-Newton Methods (Layer 2)
- Secant Method (Layer 1)
- Second-Order Optimization Methods (Layer 3)
- Trust Region Methods (Layer 2)