
Numerical Optimization

Augmented Lagrangian and ADMM

The augmented Lagrangian adds a quadratic penalty to enforce constraints while preserving exact solutions. ADMM splits problems into tractable subproblems solved alternately, enabling distributed and non-smooth optimization for Lasso, matrix completion, and large-scale ML.


Why This Matters

Many important ML problems have structure that a single monolithic optimizer cannot exploit. The Lasso has a smooth least-squares term plus a non-smooth $\ell_1$ penalty. Distributed training needs to split a problem across machines. Matrix completion has a rank constraint that must be handled carefully.

ADMM (Alternating Direction Method of Multipliers) handles all of these by splitting the problem into subproblems, each of which is easy to solve. It combines the decomposability of dual ascent with the convergence properties of the augmented Lagrangian method. Since Boyd et al.'s 2011 monograph, ADMM has become a standard tool for structured optimization in machine learning, signal processing, and statistics.

Mental Model

You want to minimize $f(\mathbf{x}) + g(\mathbf{z})$ subject to $\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}$. You cannot minimize over $\mathbf{x}$ and $\mathbf{z}$ jointly because $f$ and $g$ have different structure (e.g., one is smooth, the other is non-smooth). ADMM alternates: first optimize over $\mathbf{x}$ (holding $\mathbf{z}$ fixed), then optimize over $\mathbf{z}$ (holding $\mathbf{x}$ fixed), then update a dual variable that enforces the constraint. Each step is easy; the alternation converges to the joint solution.

Formal Setup

The Penalty Method and Its Problem

The simplest approach to constrained optimization: penalize constraint violations. For the problem $\min_{\mathbf{x}} f(\mathbf{x})$ subject to $\mathbf{A}\mathbf{x} = \mathbf{b}$, the penalty method minimizes:

$$f(\mathbf{x}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$$

As $\rho \to \infty$, the solution approaches the constrained optimum. But large $\rho$ makes the problem ill-conditioned: the Hessian has eigenvalues proportional to $\rho$, and optimization becomes slow.
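A quick numerical illustration of this ill-conditioning, on a hypothetical toy problem (minimizing $\frac{1}{2}\|\mathbf{x}\|^2$ subject to $\mathbf{a}^T\mathbf{x} = 1$; the vector $\mathbf{a}$ and the $\rho$ values are arbitrary choices for the sketch): the penalty Hessian is $\mathbf{I} + \rho\,\mathbf{a}\mathbf{a}^T$, whose condition number grows linearly in $\rho$, while the constraint is only satisfied in the limit.

```python
import numpy as np

# Penalty method for: minimize 0.5*||x||^2  subject to  a^T x = 1.
# (Toy problem; the exact constrained optimum is x* = a / ||a||^2.)
a = np.array([1.0, 2.0, 3.0])

for rho in [1.0, 100.0, 10000.0]:
    # Unconstrained penalty objective: 0.5*||x||^2 + (rho/2)*(a^T x - 1)^2.
    # Its Hessian is I + rho * a a^T; stationarity gives (I + rho a a^T) x = rho a.
    H = np.eye(3) + rho * np.outer(a, a)
    x = np.linalg.solve(H, rho * a)
    print(f"rho={rho:>8.0f}  condition number={np.linalg.cond(H):.1e}  "
          f"constraint violation={abs(a @ x - 1):.2e}")
```

The violation shrinks like $1/\rho$, but the conditioning degrades at the same rate, which is exactly the trade-off the augmented Lagrangian removes.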

The Augmented Lagrangian

Definition

Augmented Lagrangian

The augmented Lagrangian for $\min_{\mathbf{x}} f(\mathbf{x})$ subject to $\mathbf{A}\mathbf{x} = \mathbf{b}$ is:

$$\mathcal{L}_\rho(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T(\mathbf{A}\mathbf{x} - \mathbf{b}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$$

It combines the ordinary Lagrangian (with dual variable $\boldsymbol{\lambda}$) with a quadratic penalty (weight $\rho > 0$).

Proposition

Augmented Lagrangian Exactness

Statement

Unlike the pure penalty method, the augmented Lagrangian method achieves the exact constrained optimum at a finite penalty parameter $\rho$, provided the dual variable $\boldsymbol{\lambda}$ is updated correctly. The method of multipliers alternates:

$$\mathbf{x}^{k+1} = \arg\min_{\mathbf{x}} \; \mathcal{L}_\rho(\mathbf{x}, \boldsymbol{\lambda}^k)$$

$$\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^k + \rho(\mathbf{A}\mathbf{x}^{k+1} - \mathbf{b})$$

Under convexity and the Slater condition, $\mathbf{x}^k$ converges to the primal optimum and $\boldsymbol{\lambda}^k$ converges to the dual optimum.

Intuition

The dual variable $\boldsymbol{\lambda}$ adapts to enforce the constraint. If the constraint is violated at iteration $k$, $\boldsymbol{\lambda}$ increases, penalizing future violations more. This adaptive mechanism means you do not need $\rho \to \infty$; a moderate $\rho$ suffices because $\boldsymbol{\lambda}$ does the fine-tuning.

Why It Matters

The augmented Lagrangian method avoids the ill-conditioning of the pure penalty method while maintaining convergence. ADMM is the augmented Lagrangian method applied to a split (decomposed) problem.
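Before moving to the split form, the multiplier mechanism can be sketched on the same kind of tiny equality-constrained problem (an illustrative example, not from the text): with $\rho$ held fixed at a moderate value, the dual updates drive the constraint violation to essentially zero, something the pure penalty method only achieves as $\rho \to \infty$.

```python
import numpy as np

# Method of multipliers for: minimize 0.5*||x||^2  subject to  a^T x = 1.
# The exact optimum is x* = a / ||a||^2; rho stays fixed and moderate.
a = np.array([1.0, 2.0, 3.0])
rho, lam = 1.0, 0.0                      # finite penalty, scalar dual variable

for k in range(30):
    # x-update: minimize 0.5*||x||^2 + lam*(a^T x - 1) + (rho/2)*(a^T x - 1)^2.
    # Stationarity: (I + rho a a^T) x = (rho - lam) a.
    x = np.linalg.solve(np.eye(3) + rho * np.outer(a, a), (rho - lam) * a)
    # Dual update: lam absorbs the remaining constraint violation.
    lam += rho * (a @ x - 1)

x_star = a / (a @ a)
print("constraint violation:", abs(a @ x - 1))
print("distance to optimum :", np.linalg.norm(x - x_star))
```

On this toy problem each dual update contracts the dual error by a constant factor, so 30 iterations at $\rho = 1$ already reach the exact constrained optimum to numerical precision.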

ADMM: The Algorithm

Consider the structured problem:

$$\min_{\mathbf{x}, \mathbf{z}} \; f(\mathbf{x}) + g(\mathbf{z}) \quad \text{subject to} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}$$

Definition

ADMM Updates

The ADMM algorithm applies the augmented Lagrangian method with alternating minimization:

x-update: $$\mathbf{x}^{k+1} = \arg\min_{\mathbf{x}} \left[f(\mathbf{x}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z}^k - \mathbf{c} + \mathbf{u}^k\|^2\right]$$

z-update: $$\mathbf{z}^{k+1} = \arg\min_{\mathbf{z}} \left[g(\mathbf{z}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x}^{k+1} + \mathbf{B}\mathbf{z} - \mathbf{c} + \mathbf{u}^k\|^2\right]$$

Dual update: $$\mathbf{u}^{k+1} = \mathbf{u}^k + \mathbf{A}\mathbf{x}^{k+1} + \mathbf{B}\mathbf{z}^{k+1} - \mathbf{c}$$

where $\mathbf{u} = \boldsymbol{\lambda}/\rho$ is the scaled dual variable.

The power of ADMM is that the $\mathbf{x}$-update only involves $f$ (plus a quadratic) and the $\mathbf{z}$-update only involves $g$ (plus a quadratic). If $f$ and $g$ have different structure, each subproblem can be solved with the appropriate specialized method.

Convergence

Theorem

ADMM Convergence

Statement

Under convexity and the existence of a saddle point, ADMM converges:

  1. Residual convergence: $\mathbf{A}\mathbf{x}^k + \mathbf{B}\mathbf{z}^k - \mathbf{c} \to \mathbf{0}$ (primal feasibility)
  2. Objective convergence: $f(\mathbf{x}^k) + g(\mathbf{z}^k) \to p^*$ (optimal value)
  3. Dual convergence: $\mathbf{u}^k \to \mathbf{u}^*$ (dual optimality)

The convergence rate is $O(1/k)$ for the primal and dual residuals.

Intuition

ADMM makes progress in two ways each iteration: the primal updates reduce the objective (or at least the augmented Lagrangian), and the dual update enforces feasibility. The $O(1/k)$ rate means it reaches moderate accuracy quickly but converges slowly to high accuracy. This is acceptable for many ML applications where approximate solutions suffice.

Proof Sketch

Define the primal residual $\mathbf{r}^k = \mathbf{A}\mathbf{x}^k + \mathbf{B}\mathbf{z}^k - \mathbf{c}$ and the dual residual $\mathbf{s}^k = \rho \mathbf{A}^T \mathbf{B}(\mathbf{z}^k - \mathbf{z}^{k-1})$. Show that the augmented Lagrangian $\mathcal{L}_\rho$ decreases sufficiently each iteration: the decrease is lower bounded by $(\rho/2)\|\mathbf{r}^k\|^2 + (1/(2\rho))\|\mathbf{s}^k\|^2$. Since $\mathcal{L}_\rho$ is bounded below, the residuals must converge to zero. Telescoping the bound gives the $O(1/k)$ rate.

Why It Matters

The convergence guarantee holds under very mild conditions (convexity, saddle point existence) and does not require smoothness of $f$ or $g$. This is why ADMM handles non-smooth objectives like the $\ell_1$ norm naturally.

Failure Mode

The $O(1/k)$ rate is slow for problems requiring high precision. ADMM is not competitive with specialized methods (interior point, proximal gradient with acceleration) when high-accuracy solutions are needed. The penalty parameter $\rho$ significantly affects practical convergence speed and may need adaptive tuning.

Applications

Lasso via ADMM

The Lasso problem $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|_1$ splits as $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ and $g(\mathbf{z}) = \lambda\|\mathbf{z}\|_1$ with the constraint $\mathbf{x} = \mathbf{z}$. The x-update is a ridge regression (closed-form). The z-update is soft-thresholding: $z_i^{k+1} = S_{\lambda/\rho}(x_i^{k+1} + u_i^k)$ where $S_\kappa(a) = \operatorname{sign}(a)\max(|a|-\kappa, 0)$.
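The two updates can be sketched in a few lines of NumPy (a minimal illustration on synthetic data; the problem sizes, seed, and the values of $\lambda$ and $\rho$ are arbitrary choices for the sketch, not prescribed by the text):

```python
import numpy as np

# ADMM for the Lasso: minimize 0.5*||Ax - b||^2 + lam*||x||_1,
# split as f(x) = 0.5*||Ax - b||^2, g(z) = lam*||z||_1, constraint x = z.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -3.0, 1.5]                 # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)

lam, rho = 1.0, 10.0
x, z, u = np.zeros(20), np.zeros(20), np.zeros(20)

# Form the x-update system matrix once: (A^T A + rho I) x = A^T b + rho (z - u)
AtA_rhoI = A.T @ A + rho * np.eye(20)
Atb = A.T @ b

def soft_threshold(v, kappa):
    """S_kappa(v) = sign(v) * max(|v| - kappa, 0), applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

for k in range(300):
    x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))   # ridge-like x-update
    z = soft_threshold(x + u, lam / rho)                  # proximal z-update
    u = u + x - z                                         # scaled dual update

print("estimated support:", np.flatnonzero(z))
```

In production one would factor `AtA_rhoI` once (e.g. a Cholesky factorization) and reuse it every iteration, which is what makes the x-update cheap.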

Consensus Optimization

For distributed ML, split data across $N$ machines. Each machine $i$ has a local objective $f_i(\mathbf{x}_i)$. The consensus problem is:

$$\min \sum_{i=1}^{N} f_i(\mathbf{x}_i) \quad \text{subject to} \quad \mathbf{x}_1 = \mathbf{x}_2 = \cdots = \mathbf{x}_N$$

ADMM introduces a global variable $\mathbf{z}$ and the constraint $\mathbf{x}_i = \mathbf{z}$. Each local update involves only $f_i$ and can be computed independently on machine $i$. The z-update is just averaging: $\mathbf{z}^{k+1} = \frac{1}{N}\sum_i (\mathbf{x}_i^{k+1} + \mathbf{u}_i^k)$. Communication is limited to gathering local variables and broadcasting the average.
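A minimal consensus sketch, simulating the $N$ machines in a single process (the local objectives $f_i(\mathbf{x}_i) = \frac{1}{2}\|\mathbf{A}_i\mathbf{x}_i - \mathbf{b}_i\|^2$, the sizes, and $\rho$ are all illustrative assumptions, not from the text):

```python
import numpy as np

# Consensus ADMM: N workers each hold a local least-squares term and
# must agree on a global z. Here the workers are simulated in one loop.
rng = np.random.default_rng(1)
N, d = 4, 5
x_true = rng.standard_normal(d)
A_list = [rng.standard_normal((30, d)) for _ in range(N)]
b_list = [Ai @ x_true + 0.05 * rng.standard_normal(30) for Ai in A_list]

rho = 10.0
xs = [np.zeros(d) for _ in range(N)]
us = [np.zeros(d) for _ in range(N)]
z = np.zeros(d)

for k in range(200):
    # Local x-updates (embarrassingly parallel: one per machine):
    # minimize f_i(x) + (rho/2)*||x - z + u_i||^2.
    for i in range(N):
        xs[i] = np.linalg.solve(A_list[i].T @ A_list[i] + rho * np.eye(d),
                                A_list[i].T @ b_list[i] + rho * (z - us[i]))
    # Global z-update: just the average of x_i + u_i (one gather + broadcast).
    z = sum(xs[i] + us[i] for i in range(N)) / N
    # Local dual updates.
    for i in range(N):
        us[i] = us[i] + xs[i] - z

print("max disagreement:", max(np.linalg.norm(xi - z) for xi in xs))
```

At convergence every $\mathbf{x}_i$ agrees with $\mathbf{z}$, which solves the pooled least-squares problem; only the vectors $\mathbf{x}_i + \mathbf{u}_i$ and $\mathbf{z}$ would cross the network.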

Matrix Completion

For matrix completion (Netflix problem), the objective involves a low-rank constraint or nuclear norm regularization. ADMM splits the problem so that one subproblem is a least-squares fit to observed entries and the other is a singular value thresholding step that enforces low rank.

Common Confusions

Watch Out

ADMM is not just a penalty method

The dual variable update is what makes ADMM work. Without it, you have a penalty method that requires $\rho \to \infty$ for exactness and suffers from ill-conditioning. The dual update corrects for constraint violations at finite $\rho$. This is the key insight of the augmented Lagrangian approach.

Watch Out

Rho is not a regularization parameter

The penalty weight $\rho$ affects convergence speed but not the solution: every $\rho > 0$ converges to the same answer. In practice, too small a $\rho$ gives slow primal convergence, while too large a $\rho$ gives slow dual convergence. Adaptive schemes adjust $\rho$ during optimization.
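One standard adaptive scheme is the residual-balancing heuristic from Boyd et al. (2011): increase $\rho$ when the primal residual dominates, decrease it when the dual residual dominates, and rescale the scaled dual variable $\mathbf{u} = \boldsymbol{\lambda}/\rho$ whenever $\rho$ changes. A sketch (the function name is ours; the defaults $\mu = 10$, $\tau = 2$ follow the monograph's suggested values):

```python
def update_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
    """Residual-balancing heuristic for the ADMM penalty parameter.

    Grows rho when the primal residual ||r|| dominates the dual residual
    ||s||, shrinks it in the opposite case, and rescales the scaled dual
    variable u = lambda/rho so that lambda itself is unchanged.
    """
    if r_norm > mu * s_norm:
        return tau * rho, u / tau      # penalize infeasibility harder
    if s_norm > mu * r_norm:
        return rho / tau, u * tau      # ease off; dual residual lags
    return rho, u                      # residuals balanced: leave rho alone
```

Called once per iteration after the dual update, this keeps the two residuals within a factor of $\mu$ of each other, which in practice makes convergence much less sensitive to the initial choice of $\rho$.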

Summary

  • Augmented Lagrangian = Lagrangian + quadratic penalty. Achieves exact solutions at finite $\rho$ (unlike pure penalty methods)
  • ADMM splits a problem into an x-update, a z-update, and a dual update. Each subproblem has simpler structure than the original
  • Convergence is guaranteed under convexity at an $O(1/k)$ rate
  • Lasso via ADMM: ridge regression + soft thresholding
  • Consensus ADMM distributes optimization across machines with minimal communication (local compute + global averaging)
  • ADMM excels at moderate accuracy; use specialized methods when high accuracy is required

Exercises

Exercise (Core)

Problem

Write out the ADMM updates for the Lasso problem $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|_1$. What is the closed-form solution for the x-update? What is the z-update?

Exercise (Advanced)

Problem

In consensus ADMM with $N$ machines, each local x-update is $\mathbf{x}_i^{k+1} = \arg\min_{\mathbf{x}_i} \left[f_i(\mathbf{x}_i) + (\rho/2)\|\mathbf{x}_i - \mathbf{z}^k + \mathbf{u}_i^k\|^2\right]$. Show that the z-update is the average of all $\mathbf{x}_i^{k+1} + \mathbf{u}_i^k$.

Exercise (Research)

Problem

ADMM converges at $O(1/k)$ for convex problems. Can ADMM be applied to non-convex problems (e.g., neural network training)? What guarantees exist, and what can go wrong?

References

Canonical:

  • Boyd et al., "Distributed Optimization and Statistical Learning via ADMM" (2011)
  • Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989)

Current:

  • Parikh & Boyd, "Proximal Algorithms" (Foundations and Trends, 2014)
  • Hong et al., "Convergence Analysis of ADMM for Nonconvex Optimization" (2016)

Next Topics

The natural next steps from ADMM:

  • Trust region methods: another approach to constrained optimization where the constraint is a region of trust around the current iterate

Last reviewed: April 2026
