Beyond LLMs
Equilibrium and Implicit-Layer Models
Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.
Why This Matters
A standard transformer with $L$ layers computes $z_{i+1} = f_i(z_i, x)$ sequentially for $i = 0, \dots, L-1$. Training costs $O(L)$ memory (for storing intermediate activations), and the architecture must choose $L$ in advance.
Deep Equilibrium Models (DEQ) ask: what if we skip directly to where the iterations would converge? Instead of computing $L$ steps, solve for the fixed point $z^* = f(z^*, x)$. If such a fixed point exists, it represents the state that an infinitely deep network would reach. Training requires differentiating through this fixed point, which is done analytically via implicit differentiation, not by unrolling.
The result: infinite effective depth with constant memory. The same idea applies beyond neural networks: any computation defined as the solution of an equation rather than a sequence of explicit steps is an implicit layer.
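The idea can be sketched with plain fixed-point iteration. Below is a minimal NumPy toy; the layer $f(z, x) = \tanh(Wz + Ux)$, the dimensions, and the 0.9 spectral-norm scaling are all illustrative choices, not the architecture from the DEQ paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer": f(z, x) = tanh(W z + U x). Since tanh is 1-Lipschitz,
# scaling W to spectral norm 0.9 makes f a contraction in z.
d = 8
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)
U = rng.standard_normal((d, d))
x = rng.standard_normal(d)

def f(z, x):
    return np.tanh(W @ z + U @ x)

def deq_forward(x, tol=1e-8, max_iter=500):
    """Solve z = f(z, x) by plain fixed-point iteration."""
    z = np.zeros(d)
    for _ in range(max_iter):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

z_star = deq_forward(x)
# At equilibrium, one more application of f changes nothing (up to tol):
print(np.linalg.norm(f(z_star, x) - z_star))  # ~1e-8 or smaller
```

The loop plays the role of "infinite depth": no matter how many iterations it takes, only the current iterate is kept in memory.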
The Fixed-Point Formulation
Deep Equilibrium Model
Statement
A DEQ replaces the recurrence $z_{i+1} = f(z_i, x)$ with the fixed-point equation:

$$z^* = f(z^*, x)$$

By the Banach fixed-point theorem, if $f(\cdot, x)$ is a contraction in $z$ (i.e., $\|f(z_1, x) - f(z_2, x)\| \le \lambda \|z_1 - z_2\|$ for some $\lambda < 1$), then:
- A unique fixed point $z^*$ exists.
- Fixed-point iteration $z_{k+1} = f(z_k, x)$ converges to $z^*$ from any initialization $z_0$.
- The convergence rate is linear: $\|z_k - z^*\| \le \lambda^k \|z_0 - z^*\|$.
In practice, the fixed point is found using Anderson acceleration or Broyden's method rather than simple iteration.
Intuition
Think of a standard ResNet as running a simulation forward in time. A DEQ says: I do not care about the trajectory, only the final equilibrium. If the system is stable (contractive), the equilibrium is unique regardless of starting point. This is like asking "what temperature will this room reach?" instead of simulating every second of heat transfer.
The equilibrium is the representation that an infinitely deep weight-tied network would produce. You compute it by solving an equation, not by running infinite layers.
Why It Matters
DEQ models achieve competitive performance with explicit-depth transformers while using $O(1)$ memory in depth (only the fixed point and solver state are stored, not intermediate activations). This makes very deep effective computations feasible on limited hardware. The fixed-point perspective also connects neural networks to classical numerical analysis: the solver can use decades of research on efficient fixed-point methods.
Failure Mode
The contraction assumption is critical. If $f$ is not contractive (Lipschitz constant $\lambda \ge 1$), the fixed-point iteration may diverge, oscillate, or converge to different points from different initializations. In practice, spectral normalization or careful initialization is used to ensure contractivity. Training can be unstable because the Lipschitz constant varies with the parameters $\theta$, and a parameter update can push the model from contractive to non-contractive.
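One way to enforce contractivity is to rescale a weight matrix by an estimate of its spectral norm. A minimal sketch, assuming a target norm of 0.9 and plain power iteration (the function name and step counts are illustrative; library versions such as PyTorch's spectral norm run one or two power steps per training update instead):

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_normalize(W, target=0.9, n_power=50):
    """Rescale W so its spectral norm is at most roughly `target` (< 1)."""
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_power):              # power iteration on W W^T
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                     # estimated largest singular value
    if sigma > target:
        W = W * (target / sigma)
    return W

W = rng.standard_normal((16, 16))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))            # close to 0.9
```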
Backpropagation Through the Fixed Point
Implicit Differentiation for Fixed Points
Statement
Given a fixed point $z^* = f(z^*, \theta)$ and a loss $\ell(z^*)$, the gradient with respect to $\theta$ is:

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial z^*} \left( I - \frac{\partial f}{\partial z}\Big|_{z^*} \right)^{-1} \frac{\partial f}{\partial \theta}\Big|_{z^*}$$

This is derived by differentiating $z^* = f(z^*, \theta)$ with respect to $\theta$:

$$\frac{d z^*}{d \theta} = \frac{\partial f}{\partial z}\Big|_{z^*} \frac{d z^*}{d \theta} + \frac{\partial f}{\partial \theta}\Big|_{z^*}$$

Solving: $\frac{d z^*}{d \theta} = (I - J)^{-1} \frac{\partial f}{\partial \theta}\big|_{z^*}$, where $J = \frac{\partial f}{\partial z}\big|_{z^*}$.

The matrix inverse $(I - J)^{-1}$ is computed via a linear solve, not explicit inversion.
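The formula can be checked numerically. The sketch below differentiates a toy DEQ with respect to its input $x$ (the same formula, with $\partial f/\partial x$ in place of $\partial f/\partial \theta$) and compares the implicit gradient against finite differences through the whole solver; the shapes and the loss $\ell(z) = \sum_i z_i$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.standard_normal((d, d))
W *= 0.8 / np.linalg.norm(W, 2)          # keep f contractive
U = rng.standard_normal((d, d))
x = rng.standard_normal(d)

def f(z, x):
    return np.tanh(W @ z + U @ x)

def solve_fixed_point(x, tol=1e-12, max_iter=2000):
    z = np.zeros(d)
    for _ in range(max_iter):
        z_new = f(z, x)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new

z_star = solve_fixed_point(x)
dl_dz = np.ones(d)                        # gradient of l(z) = sum(z)

# Jacobians of f at the fixed point (tanh'(a) = 1 - tanh(a)^2 = 1 - z*^2):
D = np.diag(1.0 - z_star**2)
J = D @ W                                 # df/dz at z*
dfdx = D @ U                              # df/dx at z*

# Implicit gradient dl/dx = dl/dz (I - J)^{-1} df/dx, via a linear solve:
v = np.linalg.solve((np.eye(d) - J).T, dl_dz)
dl_dx = dfdx.T @ v

# Finite-difference check through the entire forward solver:
eps = 1e-6
fd = np.zeros(d)
for i in range(d):
    xp = x.copy(); xp[i] += eps
    xm = x.copy(); xm[i] -= eps
    fd[i] = (solve_fixed_point(xp).sum() - solve_fixed_point(xm).sum()) / (2 * eps)

print(np.max(np.abs(dl_dx - fd)))         # small: the two gradients agree
```

Note that the backward pass never saw the iteration history: it used only $z^*$ and one linear solve.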
Intuition
Standard backprop differentiates through each layer sequentially: the gradient flows backward through all layers. Implicit differentiation bypasses this entirely. It says: whatever path the forward computation took to reach $z^*$, the gradient depends only on the Jacobian of $f$ at the fixed point, not on the iteration history.
This is why DEQ models have $O(1)$ memory for backprop: you do not need to store intermediate iterates. You only need the fixed point $z^*$ and the ability to solve a linear system involving $I - J$.
Why It Matters
This decouples the forward solve from the backward pass. You can use any solver (Anderson, Broyden, Newton) for the forward pass, and the backward pass is always the same linear system. This is more flexible than neural ODEs where the backward pass must use the adjoint ODE. It also means the gradient is exact (up to the linear solve tolerance), regardless of how many forward iterations were used.
Failure Mode
The linear system requires $\rho(J) < 1$ (spectral radius below 1) for the Neumann series to converge. This is guaranteed when $f$ is contractive. If the contraction constant $\lambda$ is close to 1, the linear system is ill-conditioned and the gradient can be noisy. In practice, the linear solve is done with iterative methods (conjugate gradient, GMRES) with early termination, trading exact gradients for stable approximate ones.
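A truncated Neumann series is the simplest such approximation: $(I - J)^{-1} g \approx \sum_{k=0}^{K} J^k g$, which only needs matrix-vector products with $J$. An illustrative sketch (random $J$ scaled to norm 0.7, 50 terms, all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
J = rng.standard_normal((d, d))
J *= 0.7 / np.linalg.norm(J, 2)        # ensure ||J|| < 1
g = rng.standard_normal(d)

exact = np.linalg.solve(np.eye(d) - J, g)

# Accumulate sum_{k=0}^{50} J^k g with only matrix-vector products:
v = g.copy()
term = g.copy()
for _ in range(50):
    term = J @ term
    v += term

print(np.linalg.norm(v - exact))       # error shrinks geometrically in K
```

The closer $\|J\|$ is to 1, the more terms are needed, which is exactly the ill-conditioning described above.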
DEQ vs Neural ODE vs Standard Networks
| Property | Standard ($L$ layers) | Neural ODE | DEQ |
|---|---|---|---|
| Forward computation | $L$ sequential steps | Adaptive ODE solve | Fixed-point solve |
| Memory (forward) | $O(L)$ | $O(1)$ (adjoint) | $O(1)$ |
| Memory (backward) | $O(L)$ | $O(1)$ (adjoint) | $O(1)$ |
| Effective depth | Fixed ($L$) | Adaptive (solver steps) | Infinite (at equilibrium) |
| Expressiveness | Unconstrained | Homeomorphism (no crossing) | Depends on structure |
| Training speed | Fast (parallel in batch) | Slow (sequential ODE) | Moderate (iterative solve) |
| Gradient accuracy | Exact | Approximate (numerical) | Exact (up to linear solve) |
Hypernetworks and Weight Generation
A related concept: hypernetworks are networks that generate the weights of another network. Given a context $c$, a hypernetwork $h_\phi$ outputs weight matrices $W = h_\phi(c)$ that parameterize a task-specific network $f_W$.
Hypernetworks are implicit in a different sense: the relationship between $\phi$ (hypernetwork parameters) and the final prediction involves a two-stage computation $c \mapsto W \mapsto f_W(x)$. Differentiating through this composition is straightforward with automatic differentiation but requires careful memory management when the generated weights are large.
Hypernetworks connect to meta-learning (generating task-specific weights from few examples) and model compression (generating efficient weights conditioned on input properties).
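A minimal sketch of the two-stage computation (all shapes are made up, and the hypernetwork is a single linear map purely for brevity; real hypernetworks are MLPs or transformers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical shapes: context vector -> flat weights of a small
# one-hidden-layer target network.
ctx_dim, in_dim, hid, out_dim = 4, 3, 5, 2
n_weights = in_dim * hid + hid * out_dim

# The hypernetwork's own parameters (here a single matrix H):
H = rng.standard_normal((n_weights, ctx_dim)) * 0.1

def hypernet(c):
    """Stage 1: generate the target network's weights from context c."""
    w = H @ c
    W1 = w[: in_dim * hid].reshape(in_dim, hid)
    W2 = w[in_dim * hid:].reshape(hid, out_dim)
    return W1, W2

def target_net(x, c):
    """Stage 2: run the generated network on input x."""
    W1, W2 = hypernet(c)
    return np.tanh(x @ W1) @ W2

x = rng.standard_normal(in_dim)
c = rng.standard_normal(ctx_dim)
print(target_net(x, c).shape)   # (out_dim,)
```

Gradients with respect to `H` flow through both stages, which is exactly the composition $c \mapsto W \mapsto f_W(x)$ described above.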
Common Confusions
DEQ models do not actually have infinite layers
A DEQ is not an infinitely deep network. It is a network with one layer applied repeatedly until convergence. The "infinite depth" interpretation is that the fixed point is what an infinitely deep weight-tied network would converge to. But the actual computation involves a finite number of solver iterations (typically 20 to 50 in practice). The benefit is that you do not need to choose the number of iterations in advance: the solver decides when convergence is reached.
Implicit differentiation is not the same as unrolling
Unrolling means treating each iteration of the forward solver as a layer and backpropagating through all of them. This costs $O(K)$ memory for $K$ iterations and computes an approximate gradient that depends on the number of iterations. Implicit differentiation computes the exact gradient at the fixed point regardless of how many iterations were used to find it. The cost is one linear solve, not $K$ backward steps.
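The distinction is easy to see in a scalar toy model: the unrolled gradient depends on the iteration count $K$ and only approaches the implicit gradient as $K$ grows. The scalar map $z \mapsto \tanh(wz + x)$ and the constants below are illustrative:

```python
import numpy as np

w, x = 0.5, 1.0

def iterate(K):
    """Unroll K steps of z <- tanh(w*z + x), carrying s = dz/dw forward."""
    z, s = 0.0, 0.0
    for _ in range(K):
        z_new = np.tanh(w * z + x)
        s = (1 - z_new**2) * (z + w * s)   # chain rule through this step
        z = z_new
    return z, s

# Implicit gradient at the (numerically converged) fixed point:
z_star, _ = iterate(200)
J = (1 - z_star**2) * w                     # df/dz at z*
dfdw = (1 - z_star**2) * z_star             # df/dw at z*
implicit = dfdw / (1 - J)                   # dz*/dw = (1 - J)^{-1} df/dw

errs = {K: abs(iterate(K)[1] - implicit) for K in (1, 5, 20)}
for K, e in errs.items():
    print(K, e)                             # error shrinks as K grows
```

The unrolled estimate needs all $K$ intermediate states to backpropagate, while the implicit gradient is a single division here (a single linear solve in the vector case).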
Exercises
Problem
A DEQ uses the linear update $f(z, x) = W z + U x$ with $z \in \mathbb{R}^d$. What condition on $W$ ensures the fixed point exists and is unique? Compute the fixed point $z^*$ explicitly.
Problem
Derive the implicit differentiation formula from the fixed-point equation $z^* = f(z^*, \theta)$. Under what conditions is the resulting gradient well-defined?
References
Canonical:
- Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019). The foundational paper.
- Bai, Kolter, Koltun, "Multiscale Deep Equilibrium Models" (NeurIPS 2020). Scaling DEQ to vision tasks.
Current:
- Geng et al., "On Training Implicit Models" (NeurIPS 2021). Stability and training techniques.
- Fung et al., "JFB: Jacobian-Free Backpropagation for Implicit Models" (2022). Avoiding the linear solve.
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17.
Next Topics
- Neural ODEs: continuous-depth models that use ODE solvers instead of fixed-point solvers
- Second-order optimization: methods that also require solving linear systems involving Jacobians
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Implicit Differentiation (Layer 2)
- Automatic Differentiation (Layer 1)
Builds on This
- Continuous Thought Machines (Layer 5)