
Equilibrium and Implicit-Layer Models

Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.


Why This Matters

A standard transformer with $L$ layers computes $h^{(l)} = f_\theta(h^{(l-1)})$ sequentially for $l = 1, \ldots, L$. Training costs $O(L)$ memory (for storing intermediate activations), and the architecture must choose $L$ in advance.

Deep Equilibrium Models (DEQ) ask: what if we skip directly to where the iterations would converge? Instead of computing $L$ steps, solve for the fixed point $h^* = f_\theta(h^*)$. If such a fixed point exists, it represents the state that an infinitely deep network would reach. Training requires differentiating through this fixed point, which is done analytically via implicit differentiation, not by unrolling.

The result: infinite effective depth with constant memory. The same idea applies beyond neural networks: any computation defined as the solution of an equation rather than a sequence of explicit steps is an implicit layer.

The Fixed-Point Formulation

Proposition

Deep Equilibrium Model

Statement

A DEQ replaces the recurrence $h^{(l+1)} = f_\theta(h^{(l)})$ with the fixed-point equation:

$$h^* = f_\theta(h^*)$$

By the Banach fixed-point theorem, if $f_\theta$ is a contraction (i.e., $\|f_\theta(a) - f_\theta(b)\| \leq \gamma \|a - b\|$ for some $\gamma < 1$), then:

  1. A unique fixed point $h^*$ exists.
  2. Fixed-point iteration $h^{(l+1)} = f_\theta(h^{(l)})$ converges to $h^*$ from any initialization.
  3. The convergence rate is $\|h^{(l)} - h^*\| \leq \gamma^l \|h^{(0)} - h^*\|$.

In practice, the fixed point is found using Anderson acceleration or Broyden's method rather than simple iteration.
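To make the iteration concrete, here is a minimal PyTorch sketch of naive fixed-point iteration with a relative-residual stopping test; the function name `fixed_point_solve`, the tolerance, the iteration cap, and the contractive map built from a spectrally normalized $W$ with $\gamma = 0.8$ are illustrative assumptions, and a real DEQ solver would replace the plain iteration with Anderson acceleration or Broyden's method.

```python
import torch

def fixed_point_solve(f, h0, tol=1e-5, max_iters=100):
    """Naive fixed-point iteration h <- f(h); real DEQs use Anderson/Broyden instead."""
    h = h0
    for i in range(max_iters):
        h_next = f(h)
        # relative residual ||f(h) - h|| / ||h||
        res = (h_next - h).norm() / (h.norm() + 1e-8)
        h = h_next
        if res < tol:
            break
    return h, i + 1, res

# Illustrative contractive map: gamma * tanh(W h + x) with ||W||_2 = 1 and gamma < 1,
# so the map is gamma-Lipschitz and Banach's theorem applies.
torch.manual_seed(0)
d = 16
W = torch.randn(d, d)
W = W / torch.linalg.matrix_norm(W, ord=2)   # rescale to spectral norm 1
x = torch.randn(d)
gamma = 0.8                                   # assumed contraction factor
f = lambda h: gamma * torch.tanh(W @ h + x)

h_star, iters, res = fixed_point_solve(f, torch.zeros(d))
print(iters, res.item())  # residual shrinks geometrically, at rate at most gamma
```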

Intuition

Think of a standard ResNet as running a simulation forward in time. A DEQ says: I do not care about the trajectory, only the final equilibrium. If the system is stable (contractive), the equilibrium is unique regardless of starting point. This is like asking "what temperature will this room reach?" instead of simulating every second of heat transfer.

The equilibrium $h^*$ is the representation that an infinitely deep weight-tied network would produce. You compute it by solving an equation, not by running infinite layers.

Why It Matters

DEQ models achieve competitive performance with explicit-depth transformers while using $O(1)$ memory in depth (only the fixed point and solver state are stored, not $L$ intermediate activations). This makes very deep effective computations feasible on limited hardware. The fixed-point perspective also connects neural networks to classical numerical analysis: the solver can use decades of research on efficient fixed-point methods.

Failure Mode

The contraction assumption is critical. If $f_\theta$ is not contractive (Lipschitz constant $\geq 1$), the fixed-point iteration may diverge, oscillate, or converge to different points from different initializations. In practice, spectral normalization or careful initialization is used to ensure contractivity. Training can be unstable because the Lipschitz constant varies with $\theta$, and a parameter update can push the model from contractive to non-contractive.
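As a hedged illustration of how contractivity can be enforced, the sketch below wraps the linear map inside $f_\theta$ with PyTorch's spectral-norm parametrization and scales the cell by an explicit factor $\gamma < 1$; the layer size, the value of $\gamma$, and the tanh cell are illustrative choices rather than a prescribed recipe.

```python
import torch
import torch.nn as nn

d = 64
# Spectral normalization keeps the largest singular value of the weight near 1,
# and is re-applied after every parameter update, so updates cannot silently
# blow up the Lipschitz constant of the linear map.
lin = nn.utils.parametrizations.spectral_norm(nn.Linear(d, d))
gamma = 0.9  # explicit contraction factor; tanh is 1-Lipschitz, so f is gamma-Lipschitz in h

def f_theta(h):
    return gamma * torch.tanh(lin(h))
```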

Backpropagation Through the Fixed Point

Theorem

Implicit Differentiation for Fixed Points

Statement

Given a fixed point $h^* = f(h^*, \theta)$ and a loss $L(h^*)$, the gradient with respect to $\theta$ is:

$$\frac{dL}{d\theta} = \frac{\partial L}{\partial h^*} \left(I - \frac{\partial f}{\partial h}\bigg|_{h^*}\right)^{-1} \frac{\partial f}{\partial \theta}\bigg|_{h^*}$$

This is derived by differentiating $h^* = f(h^*, \theta)$ with respect to $\theta$:

$$\frac{dh^*}{d\theta} = \frac{\partial f}{\partial h}\frac{dh^*}{d\theta} + \frac{\partial f}{\partial \theta}$$

Solving: $\frac{dh^*}{d\theta} = \left(I - \frac{\partial f}{\partial h}\right)^{-1} \frac{\partial f}{\partial \theta}$.

The matrix inverse $(I - J_f)^{-1}$ (where $J_f = \partial f / \partial h$) is computed via a linear solve, not explicit inversion.

Intuition

Standard backprop differentiates through each layer sequentially: the gradient flows backward through all $L$ layers. Implicit differentiation bypasses this entirely. It says: whatever path the forward computation took to reach $h^*$, the gradient depends only on the Jacobian of $f$ at the fixed point, not on the iteration history.

This is why DEQ models have $O(1)$ memory for backprop: you do not need to store intermediate iterates. You only need the fixed point $h^*$ and the ability to solve a linear system involving $(I - J_f)$.

Why It Matters

This decouples the forward solve from the backward pass. You can use any solver (Anderson, Broyden, Newton) for the forward pass, and the backward pass is always the same linear system. This is more flexible than neural ODEs where the backward pass must use the adjoint ODE. It also means the gradient is exact (up to the linear solve tolerance), regardless of how many forward iterations were used.

Failure Mode

The linear system $(I - J_f)v = b$ requires $\|J_f\| < 1$ for the Neumann series $(I - J_f)^{-1} = I + J_f + J_f^2 + \cdots$ to converge. This is guaranteed when $f$ is contractive. If the contraction constant is close to 1, the linear system is ill-conditioned and the gradient can be noisy. In practice, the linear solve is done with iterative methods (conjugate gradient, GMRES) with early termination, trading exact gradients for stable approximate ones.
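One concrete way to realize this backward pass, sketched below under the assumption that $f$ is contractive and $\theta$ is a single tensor (the function name `implicit_grad` and the iteration budget are illustrative), is to solve $u = \frac{\partial L}{\partial h^*} + u J_f$ by the same kind of fixed-point iteration used in the forward pass, which amounts to truncating the Neumann series, and then form $\frac{dL}{d\theta} = u \frac{\partial f}{\partial \theta}$ with one more vector-Jacobian product.

```python
import torch

def implicit_grad(f, h_star, theta, dL_dh, n_iters=30):
    """Gradient dL/dtheta at a fixed point h* = f(h*, theta), via truncated Neumann iteration.

    Assumes theta.requires_grad is True and h_star was produced by any forward solver.
    """
    # Re-attach the autograd graph with a single call to f at the (detached) fixed point.
    h_star = h_star.detach().requires_grad_(True)
    f_val = f(h_star, theta)

    # Solve u = dL/dh* + u J_f by fixed-point iteration; each step is one VJP.
    u = dL_dh
    for _ in range(n_iters):
        (uJ,) = torch.autograd.grad(f_val, h_star, grad_outputs=u, retain_graph=True)
        u = dL_dh + uJ

    # dL/dtheta = u * (df/dtheta at h*), one final VJP.
    (grad_theta,) = torch.autograd.grad(f_val, theta, grad_outputs=u)
    return grad_theta
```

Stopping the loop early gives exactly the stable approximate gradient described above: the truncation error shrinks like $\|J_f\|^{n}$, so the closer the contraction constant is to 1, the more iterations the solve needs.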

DEQ vs Neural ODE vs Standard Networks

| Property | Standard ($L$ layers) | Neural ODE | DEQ |
| --- | --- | --- | --- |
| Forward computation | $L$ sequential steps | Adaptive ODE solve | Fixed-point solve |
| Memory (forward) | $O(L \cdot d)$ | $O(d)$ (adjoint) | $O(d)$ |
| Memory (backward) | $O(L \cdot d)$ | $O(d)$ (adjoint) | $O(d)$ |
| Effective depth | Fixed ($L$) | Adaptive (solver steps) | Infinite (at equilibrium) |
| Expressiveness | Unconstrained | Homeomorphism (no crossing) | Depends on $f$ structure |
| Training speed | Fast (parallel in batch) | Slow (sequential ODE) | Moderate (iterative solve) |
| Gradient accuracy | Exact | Approximate (numerical) | Exact (up to linear solve) |

Hypernetworks and Weight Generation

A related concept: hypernetworks are networks that generate the weights of another network. Given a context $c$, a hypernetwork $g_\phi(c)$ outputs weight matrices $\theta = g_\phi(c)$ that parameterize a task-specific network $f_\theta$.

Hypernetworks are implicit in a different sense: the relationship between $\phi$ (hypernetwork parameters) and the final prediction involves a two-stage computation $f_{g_\phi(c)}(x)$. Differentiating through this composition is straightforward with automatic differentiation but requires careful memory management when the generated weights are large.
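A minimal sketch of this two-stage computation (the sizes, the single-linear-layer target network, and the variable names are assumptions for illustration): the hypernetwork $g_\phi$ maps a context vector to the flattened weights and bias of a target linear layer, which is then applied functionally so gradients flow back into $\phi$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_out, d_ctx = 32, 8, 16

# g_phi: context c -> all parameters of the target linear layer, as one flat vector
g_phi = nn.Linear(d_ctx, d_out * d_in + d_out)

def predict(x, c):
    params = g_phi(c)                              # theta = g_phi(c)
    W = params[: d_out * d_in].view(d_out, d_in)   # reshape into a weight matrix
    b = params[d_out * d_in :]
    return F.linear(x, W, b)                       # f_theta(x) with the generated weights

x = torch.randn(4, d_in)   # a small batch of inputs
c = torch.randn(d_ctx)     # task/context vector
y = predict(x, c)          # gradients w.r.t. g_phi's parameters flow through W and b
```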

Hypernetworks connect to meta-learning (generating task-specific weights from few examples) and model compression (generating efficient weights conditioned on input properties).

Common Confusions

Watch Out

DEQ models do not actually have infinite layers

A DEQ is not an infinitely deep network. It is a network with one layer applied repeatedly until convergence. The "infinite depth" interpretation is that the fixed point is what an infinitely deep weight-tied network would converge to. But the actual computation involves a finite number of solver iterations (typically 20 to 50 in practice). The benefit is that you do not need to choose the number of iterations in advance: the solver decides when convergence is reached.

Watch Out

Implicit differentiation is not the same as unrolling

Unrolling means treating each iteration of the forward solver as a layer and backpropagating through all of them. This costs $O(K)$ memory for $K$ iterations and computes an approximate gradient that depends on the number of iterations. Implicit differentiation computes the exact gradient at the fixed point regardless of how many iterations were used to find it. The cost is one linear solve, not $K$ backward steps.

Exercises

ExerciseCore

Problem

A DEQ uses $f_\theta(h) = Wh + b$ with $W \in \mathbb{R}^{d \times d}$. What condition on $W$ ensures the fixed point exists and is unique? Compute the fixed point explicitly.

ExerciseAdvanced

Problem

Derive the implicit differentiation formula $\frac{dh^*}{d\theta} = (I - J_f)^{-1} \frac{\partial f}{\partial \theta}$ from the fixed-point equation $h^* = f(h^*, \theta)$. Under what conditions is this well-defined?

References

Canonical:

  • Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019). The foundational paper.
  • Bai, Kolter, Koltun, "Multiscale Deep Equilibrium Models" (NeurIPS 2020). Scaling DEQ to vision tasks.

Current:

  • Geng et al., "On Training Implicit Models" (NeurIPS 2021). Stability and training techniques.

  • Fung et al., "JFB: Jacobian-Free Backpropagation for Implicit Models" (2022). Avoiding the linear solve.

  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Next Topics

  • Neural ODEs: continuous-depth models that use ODE solvers instead of fixed-point solvers
  • Second-order optimization: methods that also require solving linear systems involving Jacobians

Last reviewed: April 2026
