
Numerical Optimization

Trust Region Methods

Build a local quadratic model of the objective, minimize it within a trusted ball, and adjust the ball size based on prediction quality. Trust region methods provide convergence guarantees even for non-convex problems and inspired TRPO in reinforcement learning.


Why This Matters

Newton's method converges fast when it works, but it can diverge spectacularly when the quadratic approximation is poor (far from the optimum, non-convex regions, negative curvature). Line search methods address this by fixing a descent direction first and then searching for a good step length. Trust region methods take the opposite approach: they control the step size directly.

The idea is simple and powerful: build a local model you trust within a ball of radius \Delta. If the model predicts well, grow the ball. If it predicts poorly, shrink the ball. This adaptive strategy gives global convergence guarantees even for non-convex objectives, making trust regions the method of choice in many engineering optimization solvers.

Mental Model

You are standing on a hilly landscape in fog. You can see the ground clearly within a few meters (your trust region) but not further. You build a quadratic model of the terrain within your visible range and walk to the lowest point of that model. If the actual terrain matched your model well (you descended as expected), you trust yourself to see further next time (expand the region). If the terrain surprised you (you went up instead of down), you shrink your trusted range and try a more cautious step.

Formal Setup

Consider the unconstrained optimization problem \min_{\mathbf{x}} f(\mathbf{x}) where f is twice continuously differentiable.

Definition

Quadratic Model

At the current iterate \mathbf{x}_k, the local quadratic model is:

m_k(\mathbf{p}) = f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2} \mathbf{p}^T \mathbf{B}_k \mathbf{p}

where \mathbf{p} = \mathbf{x} - \mathbf{x}_k is the step and \mathbf{B}_k is either the exact Hessian \nabla^2 f(\mathbf{x}_k) or an approximation (e.g., BFGS).

Definition

Trust Region Subproblem

The trust region subproblem minimizes the model within a ball:

\mathbf{p}_k = \arg\min_{\mathbf{p}} \; m_k(\mathbf{p}) \quad \text{subject to} \quad \|\mathbf{p}\| \leq \Delta_k

where \Delta_k > 0 is the trust region radius. When \mathbf{B}_k is positive definite and the unconstrained Newton step \mathbf{p}^N = -\mathbf{B}_k^{-1}\nabla f satisfies \|\mathbf{p}^N\| \leq \Delta_k, the trust region constraint is inactive and the step equals the Newton step.

Definition

Ratio of Actual to Predicted Reduction

The quality of the model is measured by:

\rho_k = \frac{f(\mathbf{x}_k) - f(\mathbf{x}_k + \mathbf{p}_k)}{m_k(\mathbf{0}) - m_k(\mathbf{p}_k)}

The numerator is the actual reduction in f. The denominator is the predicted reduction from the model. If \rho_k \approx 1, the model is accurate. If \rho_k < 0, the step increased f.

The Trust Region Algorithm

The algorithm has a clean structure:

  1. Solve the subproblem. Compute (approximately) \mathbf{p}_k = \arg\min_{\|\mathbf{p}\| \leq \Delta_k} m_k(\mathbf{p})
  2. Evaluate the ratio. Compute \rho_k
  3. Accept or reject. If \rho_k > \eta_1 (typically \eta_1 = 0.1), accept the step: \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{p}_k. Otherwise, reject: \mathbf{x}_{k+1} = \mathbf{x}_k
  4. Update the radius.
    • If \rho_k > \eta_2 (e.g., 0.75) and \|\mathbf{p}_k\| = \Delta_k: expand \Delta_{k+1} = 2\Delta_k
    • If \rho_k < \eta_1: contract \Delta_{k+1} = \Delta_k / 4
    • Otherwise: keep \Delta_{k+1} = \Delta_k
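
A minimal sketch of this loop in Python with NumPy, using the Cauchy point as the approximate subproblem solver. The function names and parameter defaults here are illustrative, not taken from any library:

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Steepest-descent minimizer of the quadratic model, clipped to the ball."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0                                # negative curvature: walk to the boundary
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def trust_region_minimize(f, grad, hess, x0, delta0=1.0, delta_max=100.0,
                          eta1=0.1, eta2=0.75, tol=1e-8, max_iter=500):
    """Basic trust region loop: solve subproblem, evaluate rho, update radius."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        B = hess(x)
        p = cauchy_point(g, B, delta)
        predicted = -(g @ p + 0.5 * p @ B @ p)   # m_k(0) - m_k(p)
        actual = f(x) - f(x + p)
        rho = actual / predicted if predicted > 0 else -1.0
        if rho > eta1:                           # accept the step
            x = x + p
        if rho > eta2 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)  # model predicted well: expand
        elif rho < eta1:
            delta = delta / 4.0                  # model predicted poorly: contract
    return x
```

On a convex quadratic the model is exact, so every step is accepted (\rho_k = 1) and the radius only grows; the interesting radius dynamics appear on non-quadratic objectives.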

Solving the Subproblem

The trust region subproblem is a constrained quadratic program. Exact solution requires finding a Lagrange multiplier \mu \geq 0 such that:

(\mathbf{B}_k + \mu \mathbf{I})\mathbf{p} = -\nabla f(\mathbf{x}_k), \qquad \mu(\|\mathbf{p}\| - \Delta_k) = 0, \qquad \mathbf{B}_k + \mu \mathbf{I} \succeq 0
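
These conditions suggest a simple (if expensive) numerical scheme: treat \mu as the unknown and search in one dimension until the step just fits the ball. The NumPy sketch below is illustrative only; it ignores the degenerate "hard case" that production solvers must handle:

```python
import numpy as np

def exact_tr_step(B, g, delta, n_bisect=200):
    """Solve the trust region subproblem by a 1-D search on the
    multiplier mu (sketch; ignores the degenerate 'hard case')."""
    n = len(g)
    eigmin = np.linalg.eigvalsh(B)[0]
    lo = max(0.0, -eigmin) + 1e-12             # keep B + mu*I positive definite
    p = np.linalg.solve(B + lo * np.eye(n), -g)
    if np.linalg.norm(p) <= delta and lo <= 1e-12:
        return p                               # constraint inactive: Newton step, mu = 0
    hi = lo + 1.0                              # bracket: grow hi until the step fits
    while np.linalg.norm(np.linalg.solve(B + hi * np.eye(n), -g)) > delta:
        hi *= 2.0
    for _ in range(n_bisect):                  # bisect: ||p(mu)|| is decreasing in mu
        mid = 0.5 * (lo + hi)
        p = np.linalg.solve(B + mid * np.eye(n), -g)
        if np.linalg.norm(p) > delta:
            lo = mid
        else:
            hi = mid
    return p
```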

In practice, approximate solutions suffice. Two popular approaches:

Cauchy point. The steepest-descent step clipped to the trust region boundary. Cheap to compute and sufficient for convergence guarantees.

Dogleg method. Interpolate between the Cauchy point and the full Newton step along a piecewise linear path. Gives a better step than the Cauchy point with minimal additional cost.
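
A sketch of the dogleg step in NumPy, assuming a positive definite \mathbf{B}_k (the function name is illustrative):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg step: follow steepest descent to the Cauchy point, then
    head toward the Newton step, stopping at the trust region boundary.
    Assumes B is positive definite."""
    p_newton = -np.linalg.solve(B, g)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                           # Newton step fits: take it
    p_cauchy = -((g @ g) / (g @ B @ g)) * g       # unconstrained steepest-descent minimizer
    if np.linalg.norm(p_cauchy) >= delta:
        return -(delta / np.linalg.norm(g)) * g   # even the Cauchy point is outside: clip
    # Find tau in [0, 1] with ||p_cauchy + tau*(p_newton - p_cauchy)|| = delta
    d = p_newton - p_cauchy
    aa = d @ d
    bb = 2.0 * (p_cauchy @ d)
    cc = p_cauchy @ p_cauchy - delta**2
    tau = (-bb + np.sqrt(bb**2 - 4.0 * aa * cc)) / (2.0 * aa)
    return p_cauchy + tau * d
```

SciPy exposes a dogleg variant as `scipy.optimize.minimize(method='dogleg')`, which wraps this kind of step in the accept/reject loop described above.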

Lemma

Cauchy Point Sufficient Decrease

Statement

The Cauchy point \mathbf{p}_k^C (the steepest-descent step restricted to the trust region) achieves a predicted reduction of at least:

m_k(\mathbf{0}) - m_k(\mathbf{p}_k^C) \geq \frac{1}{2}\|\nabla f(\mathbf{x}_k)\| \min\!\left(\Delta_k, \frac{\|\nabla f(\mathbf{x}_k)\|}{\|\mathbf{B}_k\|}\right)

Intuition

The Cauchy point guarantees meaningful progress. Either the trust region is small (and you take the full steepest-descent step scaled to the boundary) or it is large (and you take an optimal steepest-descent step). Either way, the reduction is proportional to the gradient norm.

Why It Matters

This bound is the engine of the global convergence proof. Any trust region method that achieves at least the Cauchy decrease is guaranteed to converge. More sophisticated subproblem solvers (dogleg, Steihaug-CG) do better but are not strictly necessary for convergence.
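
The bound is easy to check numerically. The snippet below (a sketch using NumPy, with random symmetric and possibly indefinite \mathbf{B}) verifies the inequality against the closed-form Cauchy point:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(200):
    n = 5
    A = rng.standard_normal((n, n))
    B = A + A.T                                  # symmetric, possibly indefinite
    g = rng.standard_normal(n)
    delta = rng.uniform(0.1, 5.0)
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    # Cauchy point: steepest descent clipped to the ball
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    p = -tau * (delta / gnorm) * g
    decrease = -(g @ p + 0.5 * p @ B @ p)        # m(0) - m(p)
    bound = 0.5 * gnorm * min(delta, gnorm / np.linalg.norm(B, 2))
    assert decrease >= bound - 1e-9
```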

Global Convergence

Theorem

Global Convergence of Trust Region Methods

Statement

If the trust region method produces iterates \{\mathbf{x}_k\} with steps achieving at least the Cauchy decrease, then:

\liminf_{k \to \infty} \|\nabla f(\mathbf{x}_k)\| = 0

That is, the method finds points with gradient arbitrarily close to zero. If f has a finite number of stationary points and the iterates are bounded, then \{\mathbf{x}_k\} converges to a stationary point.

Intuition

The argument is by contradiction. If the gradient is bounded away from zero, then each accepted step achieves at least a fixed reduction (by the Cauchy point bound). Since f is bounded below, there can be only finitely many accepted steps, a contradiction. The key mechanism: if a step is rejected, the radius shrinks, the model becomes more accurate locally, and eventually a step must be accepted.

Proof Sketch

Suppose \|\nabla f(\mathbf{x}_k)\| \geq \epsilon > 0 for all k. By the Cauchy decrease lemma, each accepted step reduces f by at least c\epsilon \min(\Delta_k, \epsilon/M) where M bounds \|\mathbf{B}_k\|. If \Delta_k stays bounded below, the sum of reductions diverges, which contradicts f being bounded below. If \Delta_k \to 0, then eventually \rho_k \to 1 (the model becomes exact locally by Taylor's theorem), so steps are accepted and the radius is increased. This contradicts \Delta_k \to 0.

Why It Matters

This is a global convergence result: it holds regardless of the starting point and without assuming convexity. Line search Newton methods also converge globally under similar conditions, but trust region methods handle negative curvature more naturally (the trust region constraint prevents steps along directions of negative curvature from being too long).

Failure Mode

Convergence to a stationary point does not mean convergence to a local minimum. The method can converge to a saddle point. Modifications (using negative curvature directions) can ensure convergence to a point with positive semidefinite Hessian (second-order stationary point), but this requires more sophisticated subproblem solvers.

Trust Region vs Line Search

| Property | Trust Region | Line Search |
| --- | --- | --- |
| Controls | Step size (radius \Delta) | Step length (\alpha) along a direction |
| Direction | Emerges from subproblem | Chosen first (gradient, Newton, etc.) |
| Negative curvature | Handled naturally | Requires modifications |
| Non-convex convergence | Global convergence with Cauchy decrease | Requires Wolfe conditions |
| Cost per iteration | Subproblem solve | Function evaluations for line search |
| Adjusts | Radius based on \rho_k | Step length based on sufficient decrease |

The fundamental difference: line search first chooses a direction and then decides how far to go. Trust region decides both direction and length simultaneously by optimizing within a ball. Trust region is more robust in non-convex settings.

Connection to TRPO in Reinforcement Learning

Trust Region Policy Optimization (TRPO) applies the trust region idea to policy optimization. Instead of a Euclidean ball \|\mathbf{p}\| \leq \Delta, TRPO uses a KL divergence constraint:

\max_\theta \; \mathbb{E}\!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{subject to} \quad \mathbb{E}\!\left[D_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta)\right] \leq \delta

The KL constraint ensures the new policy is "close" to the old policy in a distribution-theoretic sense. This prevents catastrophic policy updates, the same role that the trust region radius plays in numerical optimization. PPO (Proximal Policy Optimization) approximates TRPO by clipping the importance ratio instead of solving the constrained problem, trading some theoretical elegance for simplicity.
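
PPO's clipping mechanism is easy to sketch. In the NumPy snippet below, the function name and the \epsilon = 0.2 default are illustrative choices, not from any RL library:

```python
import numpy as np

def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Clipped surrogate objective: once the importance ratio leaves
    [1 - eps, 1 + eps], moving it further yields no extra reward, so
    the incentive to leave the implicit trust region vanishes."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)        # pessimistic of the two
```

TRPO enforces closeness to the old policy explicitly via the KL constraint; the clipped objective achieves a similar effect with a single unconstrained maximization.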

Common Confusions

Watch Out

The trust region is not a constraint on the final solution

The trust region \|\mathbf{p}\| \leq \Delta_k constrains each step, not the solution. The solution is unconstrained; the trust region only limits how far you move per iteration. It is a tool for safe exploration, not a constraint on the problem itself.

Watch Out

Shrinking the radius does not mean the method is failing

Radius contraction is a healthy part of the algorithm. It means the local model was inaccurate, so the method reduces its ambition and tries a more cautious step. The convergence proof relies on radius contraction: when \Delta_k shrinks sufficiently, the quadratic Taylor approximation becomes accurate and the step is accepted.

Summary

  • Trust region: minimize a quadratic model within \|\mathbf{p}\| \leq \Delta
  • Accept the step if \rho_k = (\text{actual reduction}) / (\text{predicted reduction}) is above a threshold
  • Expand \Delta when the model predicts well; shrink when it does not
  • The Cauchy point guarantees sufficient decrease proportional to \|\nabla f\| \min(\Delta, \|\nabla f\|/\|\mathbf{B}\|)
  • Global convergence: gradient norms converge to zero, even for non-convex f
  • Trust region fixes the step size and lets the direction emerge; line search fixes the direction and then chooses the step length
  • TRPO in RL is a trust region method with KL divergence as the distance metric

Exercises

ExerciseCore

Problem

At iteration k, the gradient is \nabla f = [3, 4]^T (so \|\nabla f\| = 5), the trust region radius is \Delta = 2, and there is no curvature information (\mathbf{B}_k = \mathbf{0}). What is the Cauchy point? What is the predicted reduction?

ExerciseAdvanced

Problem

Explain why trust region methods handle negative curvature better than line search methods. Consider a 2D function where the Hessian has one positive and one negative eigenvalue at the current point.

ExerciseResearch

Problem

TRPO uses a KL divergence constraint D_{\text{KL}}(\pi_{\text{old}} \| \pi_\theta) \leq \delta instead of a Euclidean ball. Why is the KL divergence a better choice for policy optimization? What goes wrong with a Euclidean constraint on the parameters \theta?

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapter 4
  • Conn, Gould & Toint, Trust-Region Methods (2000)

Current:

  • Schulman et al., "Trust Region Policy Optimization" (ICML 2015)
  • Curtis & Robinson, "Exploiting Negative Curvature in Trust Region Methods" (2019)

Next Topics

Trust region methods connect to the broader optimization landscape in ML, including natural gradient methods (Fisher information as a trust region metric) and proximal policy optimization.

Last reviewed: April 2026
