
Equilibrium and Implicit-Layer Models

Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.


Why This Matters

A standard transformer with $L$ layers computes $h^{(l)} = f_\theta(h^{(l-1)})$ sequentially for $l = 1, \ldots, L$. Training costs $O(L)$ memory (for storing intermediate activations), and the architecture must choose $L$ in advance.

Deep Equilibrium Models (DEQ) ask: what if we skip directly to where the iterations would converge? Instead of computing $L$ steps, solve for the fixed point $h^* = f_\theta(h^*)$. If such a fixed point exists, it represents the state that an infinitely deep network would reach. Training requires differentiating through this fixed point, which is done analytically via implicit differentiation, not by unrolling.

The result: infinite effective depth with constant memory. The same idea applies beyond neural networks: any computation defined as the solution of an equation rather than a sequence of explicit steps is an implicit layer.

The Fixed-Point Formulation

Proposition

Deep Equilibrium Model

Statement

A DEQ replaces the recurrence $h^{(l+1)} = f_\theta(h^{(l)})$ with the fixed-point equation:

$$h^* = f_\theta(h^*)$$

By the Banach fixed-point theorem, if $f_\theta$ is a contraction (i.e., $\|f_\theta(a) - f_\theta(b)\| \leq \gamma \|a - b\|$ for some $\gamma < 1$), then:

  1. A unique fixed point $h^*$ exists.
  2. Fixed-point iteration $h^{(l+1)} = f_\theta(h^{(l)})$ converges to $h^*$ from any initialization.
  3. The convergence rate is $\|h^{(l)} - h^*\| \leq \gamma^l \|h^{(0)} - h^*\|$.

In practice, the fixed point is found using Anderson acceleration or Broyden's method rather than simple iteration.
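To make the iteration concrete, here is a minimal PyTorch sketch of naive fixed-point iteration with a relative-residual stopping test; the function name `fixed_point_solve`, the tolerance, the iteration cap, and the contractive map built from a spectrally normalized $W$ with $\gamma = 0.8$ are illustrative assumptions, and a real DEQ solver would replace the plain iteration with Anderson acceleration or Broyden's method.

```python
import torch

def fixed_point_solve(f, h0, tol=1e-5, max_iters=100):
    """Naive fixed-point iteration h <- f(h); real DEQs use Anderson/Broyden instead."""
    h = h0
    for i in range(max_iters):
        h_next = f(h)
        # relative residual ||f(h) - h|| / ||h||
        res = (h_next - h).norm() / (h.norm() + 1e-8)
        h = h_next
        if res < tol:
            break
    return h, i + 1, res

# Illustrative contractive map: gamma * tanh(W h + x) with ||W||_2 = 1 and gamma < 1,
# so the map is gamma-Lipschitz and Banach's theorem applies.
torch.manual_seed(0)
d = 16
W = torch.randn(d, d)
W = W / torch.linalg.matrix_norm(W, ord=2)   # rescale to spectral norm 1
x = torch.randn(d)
gamma = 0.8                                   # assumed contraction factor
f = lambda h: gamma * torch.tanh(W @ h + x)

h_star, iters, res = fixed_point_solve(f, torch.zeros(d))
print(iters, res.item())  # residual shrinks geometrically, at rate at most gamma
```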

Intuition

Think of a standard ResNet as running a simulation forward in time. A DEQ says: I do not care about the trajectory, only the final equilibrium. If the system is stable (contractive), the equilibrium is unique regardless of starting point. This is like asking "what temperature will this room reach?" instead of simulating every second of heat transfer.

The equilibrium $h^*$ is the representation that an infinitely deep weight-tied network would produce. You compute it by solving an equation, not by running infinite layers.

Why It Matters

DEQ models achieve competitive performance with explicit-depth transformers while using $O(1)$ memory in depth (only the fixed point and solver state are stored, not $L$ intermediate activations). This makes very deep effective computations feasible on limited hardware. The fixed-point perspective also connects neural networks to classical numerical analysis: the solver can use decades of research on efficient fixed-point methods.

Failure Mode

The contraction assumption is critical. If $f_\theta$ is not contractive (Lipschitz constant $\geq 1$), the fixed-point iteration may diverge, oscillate, or converge to different points from different initializations. In practice, spectral normalization or careful initialization is used to ensure contractivity. Training can be unstable because the Lipschitz constant varies with $\theta$, and a parameter update can push the model from contractive to non-contractive.
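As a hedged illustration of how contractivity can be enforced, the sketch below wraps the linear map inside $f_\theta$ with PyTorch's spectral-norm parametrization and scales the cell by an explicit factor $\gamma < 1$; the layer size, the value of $\gamma$, and the tanh cell are illustrative choices rather than a prescribed recipe.

```python
import torch
import torch.nn as nn

d = 64
# Spectral normalization keeps the largest singular value of the weight near 1,
# and is re-applied after every parameter update, so updates cannot silently
# blow up the Lipschitz constant of the linear map.
lin = nn.utils.parametrizations.spectral_norm(nn.Linear(d, d))
gamma = 0.9  # explicit contraction factor; tanh is 1-Lipschitz, so f is gamma-Lipschitz in h

def f_theta(h):
    return gamma * torch.tanh(lin(h))
```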

Backpropagation Through the Fixed Point

Theorem

Implicit Differentiation for Fixed Points

Statement

Given a fixed point $h^* = f(h^*, \theta)$ and a loss $L(h^*)$, the gradient with respect to $\theta$ is:

$$\frac{dL}{d\theta} = \frac{\partial L}{\partial h^*} \left(I - \frac{\partial f}{\partial h}\bigg|_{h^*}\right)^{-1} \frac{\partial f}{\partial \theta}\bigg|_{h^*}$$

This is derived by differentiating $h^* = f(h^*, \theta)$ with respect to $\theta$:

$$\frac{dh^*}{d\theta} = \frac{\partial f}{\partial h}\frac{dh^*}{d\theta} + \frac{\partial f}{\partial \theta}$$

Solving: $\frac{dh^*}{d\theta} = \left(I - \frac{\partial f}{\partial h}\right)^{-1} \frac{\partial f}{\partial \theta}$.

The matrix inverse $(I - J_f)^{-1}$ (where $J_f = \partial f / \partial h$) is computed via a linear solve, not explicit inversion.

Intuition

Standard backprop differentiates through each layer sequentially: the gradient flows backward through all $L$ layers. Implicit differentiation bypasses this entirely. It says: whatever path the forward computation took to reach $h^*$, the gradient depends only on the Jacobian of $f$ at the fixed point, not on the iteration history.

This is why DEQ models have $O(1)$ memory for backprop: you do not need to store intermediate iterates. You only need the fixed point $h^*$ and the ability to solve a linear system involving $(I - J_f)$.

Why It Matters

This decouples the forward solve from the backward pass. You can use any solver (Anderson, Broyden, Newton) for the forward pass, and the backward pass is always the same linear system. This is more flexible than neural ODEs where the backward pass must use the adjoint ODE. It also means the gradient is exact (up to the linear solve tolerance), regardless of how many forward iterations were used.

Failure Mode

The linear system $(I - J_f)v = b$ requires $\|J_f\| < 1$ for the Neumann series $(I - J_f)^{-1} = I + J_f + J_f^2 + \cdots$ to converge. This is guaranteed when $f$ is contractive. If the contraction constant is close to 1, the linear system is ill-conditioned and the gradient can be noisy. In practice, the linear solve is done with iterative methods (conjugate gradient, GMRES) with early termination, trading exact gradients for stable approximate ones.
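One concrete way to realize this backward pass, sketched below under the assumption that $f$ is contractive and $\theta$ is a single tensor (the function name `implicit_grad` and the iteration budget are illustrative), is to solve $u = \frac{\partial L}{\partial h^*} + u J_f$ by the same kind of fixed-point iteration used in the forward pass, which amounts to truncating the Neumann series, and then form $\frac{dL}{d\theta} = u \frac{\partial f}{\partial \theta}$ with one more vector-Jacobian product.

```python
import torch

def implicit_grad(f, h_star, theta, dL_dh, n_iters=30):
    """Gradient dL/dtheta at a fixed point h* = f(h*, theta), via truncated Neumann iteration.

    Assumes theta.requires_grad is True and h_star was produced by any forward solver.
    """
    # Re-attach the autograd graph with a single call to f at the (detached) fixed point.
    h_star = h_star.detach().requires_grad_(True)
    f_val = f(h_star, theta)

    # Solve u = dL/dh* + u J_f by fixed-point iteration; each step is one VJP.
    u = dL_dh
    for _ in range(n_iters):
        (uJ,) = torch.autograd.grad(f_val, h_star, grad_outputs=u, retain_graph=True)
        u = dL_dh + uJ

    # dL/dtheta = u * (df/dtheta at h*), one final VJP.
    (grad_theta,) = torch.autograd.grad(f_val, theta, grad_outputs=u)
    return grad_theta
```

Stopping the loop early gives exactly the stable approximate gradient described above: the truncation error shrinks like $\|J_f\|^{n}$, so the closer the contraction constant is to 1, the more iterations the solve needs.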

DEQ vs Neural ODE vs Standard Networks

| Property | Standard ($L$ layers) | Neural ODE | DEQ |
| --- | --- | --- | --- |
| Forward computation | $L$ sequential steps | Adaptive ODE solve | Fixed-point solve |
| Memory (forward) | $O(L \cdot d)$ | $O(d)$ (adjoint) | $O(d)$ |
| Memory (backward) | $O(L \cdot d)$ | $O(d)$ (adjoint) | $O(d)$ |
| Effective depth | Fixed ($L$) | Adaptive (solver steps) | Infinite (at equilibrium) |
| Expressiveness | Unconstrained | Homeomorphism (no crossing) | Depends on $f$ structure |
| Training speed | Fast (parallel in batch) | Slow (sequential ODE) | Moderate (iterative solve) |
| Gradient accuracy | Exact | Approximate (numerical) | Exact (up to linear solve) |

Hypernetworks and Weight Generation

A related concept: hypernetworks are networks that generate the weights of another network. Given a context $c$, a hypernetwork $g_\phi(c)$ outputs weight matrices $\theta = g_\phi(c)$ that parameterize a task-specific network $f_\theta$.

Hypernetworks are implicit in a different sense: the relationship between $\phi$ (hypernetwork parameters) and the final prediction involves a two-stage computation $f_{g_\phi(c)}(x)$. Differentiating through this composition is straightforward with automatic differentiation but requires careful memory management when the generated weights are large.
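A minimal sketch of this two-stage computation (the sizes, the single-linear-layer target network, and the variable names are assumptions for illustration): the hypernetwork $g_\phi$ maps a context vector to the flattened weights and bias of a target linear layer, which is then applied functionally so gradients flow back into $\phi$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_out, d_ctx = 32, 8, 16

# g_phi: context c -> all parameters of the target linear layer, as one flat vector
g_phi = nn.Linear(d_ctx, d_out * d_in + d_out)

def predict(x, c):
    params = g_phi(c)                              # theta = g_phi(c)
    W = params[: d_out * d_in].view(d_out, d_in)   # reshape into a weight matrix
    b = params[d_out * d_in :]
    return F.linear(x, W, b)                       # f_theta(x) with the generated weights

x = torch.randn(4, d_in)   # a small batch of inputs
c = torch.randn(d_ctx)     # task/context vector
y = predict(x, c)          # gradients w.r.t. g_phi's parameters flow through W and b
```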

Hypernetworks connect to meta-learning (generating task-specific weights from few examples) and model compression (generating efficient weights conditioned on input properties).

Common Confusions

Watch Out

DEQ models do not actually have infinite layers

A DEQ is not an infinitely deep network. It is a network with one layer applied repeatedly until convergence. The "infinite depth" interpretation is that the fixed point is what an infinitely deep weight-tied network would converge to. But the actual computation involves a finite number of solver iterations (typically 20 to 50 in practice). The benefit is that you do not need to choose the number of iterations in advance: the solver decides when convergence is reached.

Watch Out

Implicit differentiation is not the same as unrolling

Unrolling means treating each iteration of the forward solver as a layer and backpropagating through all of them. This costs $O(K)$ memory for $K$ iterations and computes an approximate gradient that depends on the number of iterations. Implicit differentiation computes the exact gradient at the fixed point regardless of how many iterations were used to find it. The cost is one linear solve, not $K$ backward steps.

Exercises

ExerciseCore

Problem

A DEQ uses $f_\theta(h) = Wh + b$ with $W \in \mathbb{R}^{d \times d}$. What condition on $W$ ensures the fixed point exists and is unique? Compute the fixed point explicitly.

ExerciseAdvanced

Problem

Derive the implicit differentiation formula $\frac{dh^*}{d\theta} = (I - J_f)^{-1} \frac{\partial f}{\partial \theta}$ from the fixed-point equation $h^* = f(h^*, \theta)$. Under what conditions is this well-defined?

References

Canonical:

  • Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019). The foundational paper.
  • Bai, Kolter, Koltun, "Multiscale Deep Equilibrium Models" (NeurIPS 2020). Scaling DEQ to vision tasks.

Current:

  • Geng et al., "On Training Implicit Models" (NeurIPS 2021). Stability and training techniques.

  • Fung et al., "JFB: Jacobian-Free Backpropagation for Implicit Models" (2022). Avoiding the linear solve.

  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Next Topics

  • Neural ODEs: continuous-depth models that use ODE solvers instead of fixed-point solvers
  • Second-order optimization: methods that also require solving linear systems involving Jacobians

Last reviewed: April 2026
