Numerical Optimization
Augmented Lagrangian and ADMM
The augmented Lagrangian adds a quadratic penalty to the ordinary Lagrangian, enforcing constraints exactly at a finite penalty parameter. ADMM splits problems into tractable subproblems solved alternately, enabling distributed and non-smooth optimization for the Lasso, matrix completion, and large-scale ML.
Why This Matters
Many important ML problems have structure that a single monolithic optimizer cannot exploit. The Lasso has a smooth least-squares term plus a non-smooth penalty. Distributed training needs to split a problem across machines. Matrix completion has a rank constraint that must be handled carefully.
ADMM (Alternating Direction Method of Multipliers) handles all of these by splitting the problem into subproblems, each of which is easy to solve. It combines the decomposability of dual ascent with the convergence properties of the augmented Lagrangian method. Since Boyd et al.'s 2011 monograph, ADMM has become a standard tool for structured optimization in machine learning, signal processing, and statistics.
Mental Model
You want to minimize $f(x) + g(z)$ subject to $Ax + Bz = c$. You cannot minimize over $x$ and $z$ jointly because $f$ and $g$ have different structure (e.g., one is smooth, the other is non-smooth). ADMM alternates: first optimize over $x$ (holding $z$ fixed), then optimize over $z$ (holding $x$ fixed), then update a dual variable that enforces the constraint. Each step is easy; the alternation converges to the joint solution.
Formal Setup
The Penalty Method and Its Problem
The simplest approach to constrained optimization is to penalize constraint violations. For the problem $\min_x f(x)$ subject to $Ax = b$, the penalty method minimizes:

$$f(x) + \frac{\mu}{2}\|Ax - b\|_2^2$$

As $\mu \to \infty$, the solution approaches the constrained optimum. But large $\mu$ makes the problem ill-conditioned: the Hessian gains eigenvalues proportional to $\mu$, and optimization becomes slow.
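The trade-off is easy to see numerically. A minimal sketch on an assumed toy problem, minimize $\frac{1}{2}\|x\|^2$ subject to $a^T x = 1$, whose exact constrained optimum is $a/\|a\|^2$:

```python
import numpy as np

# Toy problem (an assumption for illustration): minimize 0.5*||x||^2
# subject to a^T x = 1. The exact constrained optimum is x* = a / ||a||^2.
a = np.array([1.0, 2.0])
x_star = a / (a @ a)

for mu in [1.0, 100.0, 10000.0]:
    # Penalty objective: 0.5*||x||^2 + (mu/2)*(a^T x - 1)^2.
    # Its minimizer solves (I + mu * a a^T) x = mu * a.
    H = np.eye(2) + mu * np.outer(a, a)
    x_mu = np.linalg.solve(H, mu * a)
    print(f"mu={mu:8.0f}  error={np.linalg.norm(x_mu - x_star):.2e}  "
          f"cond(H)={np.linalg.cond(H):.1e}")
```

The error shrinks like $O(1/\mu)$ while the condition number grows like $O(\mu)$: accuracy is bought with ill-conditioning.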
The Augmented Lagrangian
Augmented Lagrangian
The augmented Lagrangian for $\min_x f(x)$ subject to $Ax = b$ is:

$$L_\rho(x, y) = f(x) + y^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2$$

It combines the ordinary Lagrangian (with dual variable $y$) with a quadratic penalty (weight $\rho$).
Augmented Lagrangian Exactness
Statement
Unlike the pure penalty method, the augmented Lagrangian method achieves the exact constrained optimum at a finite penalty parameter $\rho$, provided the dual variable is updated correctly. The method of multipliers alternates:

$$x^{k+1} = \arg\min_x L_\rho(x, y^k), \qquad y^{k+1} = y^k + \rho(Ax^{k+1} - b)$$

Under convexity and the Slater condition, $x^k$ converges to the primal optimum and $y^k$ converges to the dual optimum.
Intuition
The dual variable $y$ adapts to enforce the constraint. If the constraint is violated at iteration $k$, the dual update shifts $y$ by $\rho$ times the violation, penalizing future violations more. This adaptive mechanism means you do not need $\rho \to \infty$; a moderate $\rho$ suffices because $y$ does the fine-tuning.
Why It Matters
The augmented Lagrangian method avoids the ill-conditioning of the pure penalty method while maintaining convergence. ADMM is the augmented Lagrangian method applied to a split (decomposed) problem.
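The two updates above can be run on the same kind of toy problem as before. A minimal sketch (the problem instance is an assumption for illustration):

```python
import numpy as np

# Method of multipliers on a toy problem (an illustrative assumption):
# minimize 0.5*||x||^2 subject to a^T x = 1, exact optimum x* = a / ||a||^2.
a = np.array([1.0, 2.0])
x_star = a / (a @ a)

rho = 1.0   # moderate, fixed penalty parameter -- no need to send it to infinity
y = 0.0     # dual variable (scalar, since there is one constraint)
for k in range(50):
    # x-update: minimize L_rho(x, y). Setting the gradient
    # x + y*a + rho*(a^T x - 1)*a to zero gives (I + rho*a a^T) x = (rho - y)*a.
    H = np.eye(2) + rho * np.outer(a, a)
    x = np.linalg.solve(H, (rho - y) * a)
    # dual update: y^{k+1} = y^k + rho * (constraint violation)
    y += rho * (a @ x - 1.0)

print(np.linalg.norm(x - x_star))  # essentially zero at finite rho
```

Unlike the pure penalty method, the iterates reach the exact constrained optimum while $\rho$ stays at 1.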
ADMM: The Algorithm
Consider the structured problem:

$$\min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c$$
ADMM Updates
The ADMM algorithm applies the augmented Lagrangian method with alternating minimization:

x-update: $x^{k+1} = \arg\min_x \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right)$

z-update: $z^{k+1} = \arg\min_z \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right)$

Dual update: $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$

where $u = y/\rho$ is the scaled dual variable.
The power of ADMM is that the $x$-update only involves $f$ (plus a quadratic) and the $z$-update only involves $g$ (plus a quadratic). If $f$ and $g$ have different structure, each subproblem can be solved with the appropriate specialized method.
Convergence
ADMM Convergence
Statement
Under convexity and the existence of a saddle point, ADMM converges:
- Residual convergence: $Ax^k + Bz^k - c \to 0$ (primal feasibility)
- Objective convergence: $f(x^k) + g(z^k) \to p^\star$ (optimal value)
- Dual convergence: $y^k \to y^\star$ (dual optimality)
The convergence rate is $O(1/k)$ for the primal and dual residuals.
Intuition
ADMM makes progress in two ways each iteration: the primal updates reduce the objective (or at least the augmented Lagrangian), and the dual update enforces feasibility. The $O(1/k)$ rate means it reaches moderate accuracy quickly but converges slowly to high accuracy. This is acceptable for many ML applications where approximate solutions suffice.
Proof Sketch
Define the primal residual $r^k = Ax^k + Bz^k - c$ and the dual residual $s^k = \rho A^T B (z^k - z^{k-1})$. Show that a Lyapunov function $V^k = \frac{1}{\rho}\|y^k - y^\star\|_2^2 + \rho\|B(z^k - z^\star)\|_2^2$ decreases sufficiently each iteration: the decrease is lower bounded by $\rho\|r^{k+1}\|_2^2 + \rho\|B(z^{k+1} - z^k)\|_2^2$. Since $V^k$ is bounded below by zero, the residuals must converge to zero. Telescoping the bound gives the $O(1/k)$ rate.
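In practice the primal and dual residuals double as a stopping criterion. A sketch for the common $x - z = 0$ splitting ($A = I$, $B = -I$, $c = 0$; the tolerance is a user-chosen assumption):

```python
import numpy as np

# Residuals for the splitting x - z = 0 (A = I, B = -I, c = 0),
# matching the standard definitions; usable as a stopping criterion.
def residuals(x_new, z_new, z_old, rho):
    r = np.linalg.norm(x_new - z_new)          # primal residual ||r^k||
    s = np.linalg.norm(rho * (z_old - z_new))  # dual residual ||s^k||
    return r, s

# Inside an ADMM loop one would typically stop when both are small, e.g.:
# r, s = residuals(x, z, z_prev, rho)
# if max(r, s) < 1e-6:
#     break
```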
Why It Matters
The convergence guarantee holds under very mild conditions (convexity, saddle point existence) and does not require smoothness of $f$ or $g$. This is why ADMM handles non-smooth objectives like the $\ell_1$ norm naturally.
Failure Mode
The $O(1/k)$ rate is slow for problems requiring high precision. ADMM is not competitive with specialized methods (interior point, proximal gradient with acceleration) when high-accuracy solutions are needed. The penalty parameter $\rho$ significantly affects practical convergence speed and may need adaptive tuning.
Applications
Lasso via ADMM
The Lasso problem $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$ splits as $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $g(z) = \lambda\|z\|_1$ with the constraint $x - z = 0$. The x-update is a ridge regression (closed-form): $x^{k+1} = (A^T A + \rho I)^{-1}(A^T b + \rho(z^k - u^k))$. The z-update is soft-thresholding: $z^{k+1} = S_{\lambda/\rho}(x^{k+1} + u^k)$, where $S_\kappa(a) = \operatorname{sign}(a)\max(|a| - \kappa, 0)$.
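These updates fit in a few lines of NumPy. A minimal sketch (fixed $\rho$ and iteration count are simplifying assumptions; the helper name `lasso_admm` is ours):

```python
import numpy as np

# ADMM for the Lasso: minimize 0.5*||Ax - b||^2 + lam*||x||_1,
# split as f(x) = 0.5*||Ax - b||^2 and g(z) = lam*||z||_1 with x = z.
def lasso_admm(A, b, lam, rho=1.0, n_iter=300):
    m, n = A.shape
    z = np.zeros(n)
    u = np.zeros(n)                               # scaled dual variable
    AtA_rho = A.T @ A + rho * np.eye(n)           # ridge system, fixed across iterations
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rho, Atb + rho * (z - u))      # x-update (ridge)
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # z-update (soft-threshold)
        u = u + x - z                                          # dual update
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_lasso = lasso_admm(A, b, lam=0.5)
```

With $\lambda = 0$ the soft-threshold is the identity and the iterates recover the least-squares solution; increasing $\lambda$ drives coordinates exactly to zero.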
Consensus Optimization
For distributed ML, split data across $N$ machines. Each machine $i$ has a local objective $f_i$. The consensus problem is:

$$\min_{x_1, \dots, x_N, z} \; \sum_{i=1}^N f_i(x_i) \quad \text{subject to} \quad x_i = z \text{ for all } i$$

ADMM introduces the global variable $z$ and the constraints $x_i = z$. Each local update involves only $f_i$ and can be computed independently on machine $i$. The z-update is just averaging: $z^{k+1} = \frac{1}{N}\sum_{i=1}^N (x_i^{k+1} + u_i^k)$. Communication is limited to gathering local variables and broadcasting the average.
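A serial sketch of consensus ADMM (assuming local least-squares objectives for concreteness; in a real deployment the x-updates run in parallel, one per machine):

```python
import numpy as np

# Consensus ADMM sketch (illustrative assumption: each "machine" holds a
# local least-squares objective f_i(x) = 0.5*||A_i x - b_i||^2).
def consensus_admm(As, bs, rho=1.0, n_iter=500):
    N, n = len(As), As[0].shape[1]
    z = np.zeros(n)
    us = [np.zeros(n) for _ in range(N)]
    for _ in range(n_iter):
        # Local x-updates: each is independent of the others
        xs = [np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b + rho * (z - u))
              for A, b, u in zip(As, bs, us)]
        # Global z-update: a simple average -- the only communication step
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)
        # Local dual updates
        us = [u + x - z for u, x in zip(us, xs)]
    return z

rng = np.random.default_rng(1)
As = [rng.standard_normal((10, 3)) for _ in range(3)]
bs = [rng.standard_normal(10) for _ in range(3)]
z_consensus = consensus_admm(As, bs)
```

The consensus solution matches what a single machine would get by solving the pooled problem on all the data at once.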
Matrix Completion
For matrix completion (Netflix problem), the objective involves a low-rank constraint or nuclear norm regularization. ADMM splits the problem so that one subproblem is a least-squares fit to observed entries and the other is a singular value thresholding step that enforces low rank.
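The singular value thresholding step can be sketched directly from the SVD (the helper name `svt` and the threshold parameter `tau` are ours; in the ADMM splitting, `tau` plays the role of $\lambda/\rho$):

```python
import numpy as np

# Singular value thresholding (SVT): the proximal operator of the nuclear
# norm, used as the low-rank z-update in ADMM for matrix completion.
def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Shrink singular values toward zero; small ones vanish, lowering the rank
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Singular values below `tau` are zeroed outright, so the output has reduced rank; the rest are shrunk by `tau`.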
Common Confusions
ADMM is not just a penalty method
The dual variable update is what makes ADMM work. Without it, you have a penalty method that requires $\rho \to \infty$ for exactness and suffers from ill-conditioning. The dual update corrects for constraint violations at finite $\rho$. This is the key insight of the augmented Lagrangian approach.
Rho is not a regularization parameter
The penalty weight $\rho$ affects convergence speed but not the solution: all values of $\rho$ converge to the same answer. In practice, too small a $\rho$ gives slow primal convergence, too large a $\rho$ gives slow dual convergence. Adaptive schemes adjust $\rho$ during optimization.
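A common adaptive scheme is residual balancing: increase $\rho$ when the primal residual dominates, decrease it when the dual residual dominates. A sketch (the constants `mu=10`, `tau=2` are common defaults in the literature, assumed here rather than prescribed):

```python
# Residual-balancing heuristic for adapting rho between ADMM iterations.
def update_rho(rho, r_norm, s_norm, mu=10.0, tau=2.0):
    if r_norm > mu * s_norm:   # primal residual too large: penalize harder
        return rho * tau
    if s_norm > mu * r_norm:   # dual residual too large: relax the penalty
        return rho / tau
    return rho
```

One caveat: with the scaled dual variable $u = y/\rho$, changing $\rho$ requires rescaling $u$ by $\rho_{\text{old}}/\rho_{\text{new}}$ so that $y$ stays consistent.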
Summary
- Augmented Lagrangian = Lagrangian + quadratic penalty. Achieves exact solutions at finite $\rho$ (unlike pure penalty methods)
- ADMM splits a problem into an x-update, a z-update, and a dual update. Each subproblem has simpler structure than the original
- Convergence is guaranteed under convexity at rate $O(1/k)$
- Lasso via ADMM: ridge regression + soft thresholding
- Consensus ADMM distributes optimization across machines with minimal communication (local compute + global averaging)
- ADMM excels at moderate accuracy; use specialized methods when high accuracy is required
Exercises
Problem
Write out the ADMM updates for the Lasso problem $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$. What is the closed-form solution for the x-update? What is the z-update?
Problem
In consensus ADMM with $N$ machines, each local x-update is $x_i^{k+1} = \arg\min_{x_i} \left( f_i(x_i) + \frac{\rho}{2}\|x_i - z^k + u_i^k\|_2^2 \right)$. Show that the z-update is the average $z^{k+1} = \frac{1}{N}\sum_{i=1}^N (x_i^{k+1} + u_i^k)$.
Problem
ADMM converges at $O(1/k)$ for convex problems. Can ADMM be applied to non-convex problems (e.g., neural network training)? What guarantees exist, and what can go wrong?
References
Canonical:
- Boyd et al., "Distributed Optimization and Statistical Learning via ADMM" (2011)
- Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989)
Current:
- Parikh & Boyd, "Proximal Algorithms" (Foundations and Trends, 2014)
- Hong et al., "Convergence Analysis of ADMM for Nonconvex Optimization" (2016)
Next Topics
The natural next steps from ADMM:
- Trust region methods: restrict each step to a region around the current iterate where a local model of the objective can be trusted
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Convex Duality (Layer 0B)