Calculus Objects
Implicit Differentiation
Differentiating through implicit equations and optimization problems: the implicit function theorem gives $dy/dx$ without solving for $y$ explicitly. Applications to bilevel optimization, deep equilibrium models, hyperparameter optimization, and meta-learning.
Why This Matters
In machine learning, you constantly need to differentiate through operations that do not have explicit formulas. How do you compute the gradient of a validation loss with respect to a hyperparameter, when the hyperparameter affects the model through an entire training procedure? How do you backpropagate through a network with infinitely many layers (a fixed point)?
Implicit differentiation answers these questions. Instead of unrolling a computation and differentiating through every step (as in standard automatic differentiation), you differentiate the conditions that define the solution (optimality conditions, fixed-point equations) and solve for the gradient directly. This is often cheaper, more memory-efficient, and more numerically stable. The central objects are Jacobian matrices and Hessians, which encode the local sensitivity of the implicit equation to its inputs.
Mental Model
Suppose you know that $y$ is defined as the solution to some equation $F(x, y) = 0$. You want $dy/dx$. You could try to solve for $y$ explicitly as a function of $x$ and then differentiate. But this is often impossible or expensive. Instead, differentiate the equation directly:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0$$

Solve for $dy/dx$. You never needed to find $y(x)$ explicitly.
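As a concrete sketch (the equation and all function names here are invented for illustration), take $F(x, y) = y^5 + y - x$: there is no closed form for $y(x)$, but implicit differentiation gives $dy/dx = 1/(5y^4 + 1)$ immediately, and we can check it numerically.

```python
# Hypothetical example: y is defined implicitly by F(x, y) = y^5 + y - x = 0,
# which has no closed-form solution for y. Implicit differentiation gives
# dy/dx = -(dF/dx)/(dF/dy) = 1 / (5*y^4 + 1).

def solve_y(x, iters=100):
    """Find y with y^5 + y = x by bisection (the map is increasing in y)."""
    lo, hi = -abs(x) - 1.0, abs(x) + 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid**5 + mid - x < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def implicit_dydx(x):
    y = solve_y(x)
    return 1.0 / (5 * y**4 + 1)  # -(dF/dx)/(dF/dy), with dF/dx = -1, dF/dy = 5y^4 + 1

# Check against a central finite difference of the numerically solved y(x)
x0, h = 2.0, 1e-5
fd = (solve_y(x0 + h) - solve_y(x0 - h)) / (2 * h)
print(abs(implicit_dydx(x0) - fd) < 1e-6)  # True
```

At $x = 2$ the solution is $y = 1$ (since $1^5 + 1 = 2$), so the derivative is exactly $1/6$.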
The Implicit Function Theorem
Implicit Function Theorem
Statement
If $F : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ is continuously differentiable, $F(x_0, y_0) = 0$, and the Jacobian $\partial F/\partial y$ is invertible at $(x_0, y_0)$, then there exists a neighborhood of $x_0$ and a unique continuously differentiable function $y(x)$ such that $y(x_0) = y_0$, $F(x, y(x)) = 0$, and:

$$\frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$$

evaluated at $(x_0, y_0)$.
Intuition
If perturbing $y$ can always correct for perturbations in $x$ (which is guaranteed by the invertibility of $\partial F/\partial y$), then the solution $y(x)$ varies smoothly with $x$. The derivative formula comes from differentiating the constraint $F(x, y(x)) = 0$ and applying the chain rule.
Proof Sketch
Differentiate $F(x, y(x)) = 0$ with respect to $x$ using the multivariable chain rule: $\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0$. Multiply both sides by $\left(\frac{\partial F}{\partial y}\right)^{-1}$ to isolate $\frac{dy}{dx}$. The existence and uniqueness of $y(x)$ follow from the contraction mapping theorem applied to the Newton iteration for solving $F(x, y) = 0$ near $(x_0, y_0)$.
Why It Matters
This is the foundation for differentiating through any implicitly defined quantity. Every application below (argmin differentiation, fixed-point differentiation, DEQs) is a special case.
Failure Mode
If $\partial F/\partial y$ is singular (non-invertible), the implicit function theorem does not apply. This happens at bifurcation points, where the solution structure changes qualitatively (e.g., multiple solutions merge). In practice, ill-conditioned Jacobians lead to numerically unstable gradients.
Differentiating Through Argmin
Differentiation Through Argmin
Statement
Let $y^*(x) = \arg\min_y f(x, y)$. At an interior minimizer, the optimality condition is $\nabla_y f(x, y^*(x)) = 0$. Applying the implicit function theorem with $F(x, y) = \nabla_y f(x, y)$:

$$\frac{dy^*}{dx} = -\left(\nabla^2_{yy} f\right)^{-1} \nabla^2_{yx} f$$

where $\nabla^2_{yy} f$ is the Hessian of $f$ with respect to $y$ and $\nabla^2_{yx} f$ is the mixed partial derivative matrix, both evaluated at $(x, y^*(x))$.
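A minimal numeric check of this formula (the quadratic $f$ and all variable names are invented): for $f(x, y) = \tfrac{1}{2} y^\top A y - x^\top y$, the inner solution is $y^*(x) = A^{-1} x$, and the formula with $\nabla^2_{yy} f = A$ and $\nabla^2_{yx} f = -I$ should recover exactly $A^{-1}$.

```python
import numpy as np

# f(x, y) = 0.5 * y^T A y - x^T y, so the argmin is y*(x) = A^{-1} x in closed form.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # symmetric positive definite

def y_star(x):
    return np.linalg.solve(A, x)              # inner argmin

H_yy = A                                      # Hessian of f in y
H_yx = -np.eye(2)                             # mixed partials: d/dx of (A y - x)
dy_dx = -np.linalg.solve(H_yy, H_yx)          # implicit-function-theorem Jacobian = A^{-1}

# Finite-difference check of one column of the Jacobian
x0, h = np.array([1.0, -2.0]), 1e-6
e0 = np.array([1.0, 0.0])
fd_col0 = (y_star(x0 + h * e0) - y_star(x0 - h * e0)) / (2 * h)
print(np.allclose(dy_dx[:, 0], fd_col0, atol=1e-6))  # True
```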
Intuition
The optimality condition $\nabla_y f = 0$ implicitly defines $y^*$ as a function of $x$. The Hessian $\nabla^2_{yy} f$ tells you how sensitive the optimality condition is to changes in $y$. The mixed derivative $\nabla^2_{yx} f$ tells you how the optimality condition shifts when $x$ changes.
Why It Matters
This formula lets you differentiate through any optimization problem without unrolling the optimization steps. Applications include hyperparameter optimization (differentiate validation loss through training), meta-learning (differentiate outer objective through inner optimization), and optimal control.
Deep Equilibrium Models (DEQs)
A standard deep network applies transformations sequentially: $z_{i+1} = f_i(z_i)$. A deep equilibrium model asks: what if every layer is the same function $f_\theta$ and the depth is taken to infinity? If the iterations converge, they reach a fixed point $z^*$ satisfying:

$$z^* = f_\theta(z^*, x)$$

where $f_\theta$ is a single layer applied repeatedly to the input $x$. This is an implicit equation $F(z^*, x) = z^* - f_\theta(z^*, x) = 0$.
Forward pass. Find $z^*$ by fixed-point iteration (or Anderson acceleration, or Newton's method).
Backward pass. Apply the implicit function theorem. The gradient of a loss $\ell$ with respect to $\theta$ is:

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial z^*} \left(I - \frac{\partial f_\theta}{\partial z^*}\right)^{-1} \frac{\partial f_\theta}{\partial \theta}$$

The key term is the inverse Jacobian $\left(I - \partial f_\theta/\partial z^*\right)^{-1}$ at the fixed point. In practice, this system is solved iteratively (vector-Jacobian products via conjugate gradient or a Neumann series) without forming the full matrix.
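A toy numerical sketch of both passes (the layer, sizes, and seed are all made up): a single layer $f(z, x) = \tanh(Wz + x)$ is iterated to its fixed point, and the implicit gradient of $\ell(z^*) = \sum_i z^*_i$ with respect to $x$ is computed by one linear solve, then checked against finite differences.

```python
import numpy as np

# Toy DEQ: f(z, x) = tanh(W z + x), with ||W|| small so f is a contraction.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))

def f(z, x):
    return np.tanh(W @ z + x)

def forward(x, iters=200):
    z = np.zeros_like(x)
    for _ in range(iters):                   # plain fixed-point iteration
        z = f(z, x)
    return z

def implicit_grad(x):
    """Gradient of l(z*) = sum(z*) w.r.t. x via the implicit function theorem."""
    z = forward(x)
    s = 1.0 - np.tanh(W @ z + x) ** 2        # tanh' at the fixed point
    J_z = s[:, None] * W                     # df/dz
    J_x = np.diag(s)                         # df/dx
    v = np.linalg.solve((np.eye(4) - J_z).T, np.ones(4))  # (I - J_z)^{-T} dl/dz*
    return J_x.T @ v

# Finite-difference check on one coordinate of x
x0, h = rng.standard_normal(4), 1e-6
e0 = np.eye(4)[0]
fd = (forward(x0 + h * e0).sum() - forward(x0 - h * e0).sum()) / (2 * h)
print(np.isclose(implicit_grad(x0)[0], fd, atol=1e-5))
```

Note that the backward pass never stores the 200 forward iterations; it only uses the converged $z^*$ and the Jacobian there.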
Why this matters. DEQs have the representational power of an infinite-depth network but the memory cost of a single layer. There is no computation graph to store for backpropagation; the implicit derivative depends only on the fixed point and the Jacobian at that point.
Applications
Bilevel Optimization
Many ML problems are bilevel: an outer objective depends on the solution of an inner optimization:

$$\min_x \; F(x, y^*(x)) \quad \text{where} \quad y^*(x) = \arg\min_y \; G(x, y)$$

Hyperparameter optimization (outer: validation loss, inner: training), neural architecture search, and data distillation all fit this framework. Implicit differentiation computes the hypergradient $dF/dx$ efficiently by differentiating through the optimality conditions of the inner problem.
Meta-Learning (MAML and Beyond)
In Model-Agnostic Meta-Learning (MAML), the inner loop adapts to a task by gradient descent and the outer loop optimizes the initialization. Unrolling the inner loop and differentiating through it requires storing all intermediate states and computing second-order gradients. Implicit differentiation through the inner loop's optimality conditions (iMAML) avoids this: it only needs the final adapted parameters and the Hessian at convergence.
Comparison: Implicit vs. Unrolled Differentiation
The choice between implicit differentiation and unrolling affects memory, compute, and correctness.
| Property | Unrolled differentiation | Implicit differentiation |
|---|---|---|
| Memory | $O(T)$ for $T$ inner iterations (stores the full computation graph) | $O(1)$ (only stores the fixed point/optimum) |
| Compute per gradient | One backward pass through $T$ steps | One linear system solve (typically 10-50 conjugate gradient iterations) |
| Convergence requirement | Works even if inner loop has not converged | Requires inner loop to have converged (otherwise the optimality condition does not hold) |
| Gradient accuracy | Exact gradient of the truncated problem | Exact gradient of the converged problem (approximate if solved iteratively) |
| Bias | Biased if inner loop is truncated early | Unbiased at convergence; the gradient is of the true bilevel objective |
| Hessian needed | No (only Jacobian-vector products from autodiff) | Yes (Hessian-vector products for the inverse Hessian solve) |
| Implementation | Standard autodiff (PyTorch/JAX) | Requires custom backward pass or libraries (e.g., jaxopt, torchopt) |
The practical tradeoff: unrolling is simpler to implement and works with any inner loop, but uses memory proportional to the number of inner steps. Implicit differentiation is more memory-efficient and gives unbiased gradients, but requires the inner problem to be solved to convergence and needs Hessian-vector products.
For hyperparameter optimization where the inner loop is full training (thousands of SGD steps), unrolling is infeasible due to memory. Implicit differentiation is the only practical option. For meta-learning with short inner loops (5-10 steps), unrolling is feasible and often preferred for simplicity.
Hyperparameter Optimization via Implicit Gradients
The most direct ML application: compute $d\mathcal{L}_{\text{val}}/d\lambda$ where $\lambda$ is a hyperparameter (regularization weight, learning rate, data augmentation strength). The training procedure defines model parameters $w^*(\lambda) = \arg\min_w \mathcal{L}_{\text{train}}(w, \lambda)$. The validation loss is $\mathcal{L}_{\text{val}}(w^*(\lambda))$.
By the chain rule and the argmin differentiation formula:

$$\frac{d\mathcal{L}_{\text{val}}}{d\lambda} = -\nabla_w \mathcal{L}_{\text{val}}^{\top} \left(\nabla^2_{ww} \mathcal{L}_{\text{train}}\right)^{-1} \nabla^2_{w\lambda} \mathcal{L}_{\text{train}}$$

The inverse Hessian-vector product is computed iteratively using conjugate gradient. Each conjugate gradient step requires one Hessian-vector product, which costs about the same as one gradient computation via automatic differentiation. Typically 10-50 CG steps suffice for a good approximation, making the total cost of one hypergradient comparable to 10-50 training gradient steps.
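A minimal sketch of this pipeline on ridge regression (the data, sizes, and helper names are all invented): the inner problem has a closed-form solution, so the CG-based hypergradient can be verified against a finite difference in $\lambda$.

```python
import numpy as np

# Inner problem: w*(lam) = argmin_w ||X w - y||^2 + lam ||w||^2.
rng = np.random.default_rng(1)
X, yt = rng.standard_normal((20, 5)), rng.standard_normal(20)
Xval, yval = rng.standard_normal((10, 5)), rng.standard_normal(10)

def w_star(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ yt)

def val_loss(lam):
    r = Xval @ w_star(lam) - yval
    return r @ r

def hvp(lam, v):
    """Hessian-vector product of the training loss: H v = 2 (X^T X + lam I) v."""
    return 2 * (X.T @ (X @ v) + lam * v)

def cg_solve(lam, b, iters=50):
    """Plain conjugate gradient for H s = b, using only Hessian-vector products."""
    s, r = np.zeros_like(b), b.copy()
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Hp = hvp(lam, p)
        a = rs / (p @ Hp)
        s, r = s + a * p, r - a * Hp
        rs_new = r @ r
        if rs_new < 1e-12:                   # stop once the residual is negligible
            break
        p, rs = r + (rs_new / rs) * p, rs_new
    return s

def hypergrad(lam):
    w = w_star(lam)
    g_val = 2 * Xval.T @ (Xval @ w - yval)   # grad of L_val w.r.t. w
    s = cg_solve(lam, g_val)                 # H^{-1} g_val via CG
    mixed = 2 * w                            # d/dlam of grad_w L_train
    return -s @ mixed

lam0, h = 0.5, 1e-5
fd = (val_loss(lam0 + h) - val_loss(lam0 - h)) / (2 * h)
print(np.isclose(hypergrad(lam0), fd, atol=1e-4))
```

Each CG iteration calls `hvp` once, which is the analogue of one extra gradient computation in a real autodiff setting.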
Common Confusions
Implicit differentiation does not require solving a linear system from scratch
The formula involves $\left(\frac{\partial F}{\partial y}\right)^{-1}$, but in practice you never form and invert this matrix. You solve a linear system iteratively using conjugate gradient or the truncated Neumann series $(I - A)^{-1} = \sum_{k=0}^{\infty} A^k$. This is efficient when $\partial F/\partial y$ is available as a matrix-vector product (via autodiff).
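A small sketch of the truncated Neumann approach (the matrix and names are invented): $(I - A)^{-1} b \approx \sum_{k=0}^{K} A^k b$ when the spectral radius of $A$ is below 1, using only matrix-vector products with $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = 0.1 * rng.standard_normal((6, 6))        # small norm, so the series converges
b = rng.standard_normal(6)

def neumann_solve(matvec, b, terms=50):
    """Approximate (I - A)^{-1} b as b + A b + A^2 b + ... using only matvecs."""
    out, term = b.copy(), b.copy()
    for _ in range(terms):
        term = matvec(term)                  # next power A^k b, computed incrementally
        out += term
    return out

approx = neumann_solve(lambda v: A @ v, b)
exact = np.linalg.solve(np.eye(6) - A, b)
print(np.allclose(approx, exact, atol=1e-8))
```

The truncation level trades gradient accuracy against the number of matrix-vector products, exactly as the CG iteration count does.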
Implicit differentiation is not the same as unrolling
Unrolling differentiates through each step of an iterative solver, storing all intermediate states. Implicit differentiation differentiates through the solution (the fixed point or optimum) directly. Unrolling uses $O(T)$ memory for $T$ iterations; implicit differentiation uses $O(1)$ memory. However, implicit differentiation requires the iteration to have converged.
Summary
- Implicit function theorem: if $F(x, y(x)) = 0$ and $\partial F/\partial y$ is invertible, then $\frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$
- Differentiating through argmin: use the optimality condition $\nabla_y f(x, y) = 0$ as the implicit equation
- DEQs: infinite-depth networks with $O(1)$-memory backpropagation via implicit differentiation at the fixed point
- Bilevel optimization and meta-learning use implicit differentiation to avoid unrolling inner optimization loops
- In practice, solve the linear system iteratively using conjugate gradient or Neumann series, not by matrix inversion
Exercises
Problem
Consider the equation $x^2 + y^2 = 1$ defining the unit circle. Use implicit differentiation to find $dy/dx$ at a point $(x, y)$ on the circle with $y \neq 0$.
Problem
In ridge regression, $w^*(\lambda) = \arg\min_w \|Xw - y\|^2 + \lambda \|w\|^2$. Compute $dw^*/d\lambda$ using implicit differentiation through the optimality conditions.
Problem
A DEQ defines $z^* = f_\theta(z^*, x)$. The backward pass requires solving a linear system with the matrix $I - \frac{\partial f_\theta}{\partial z^*}$. When does this linear system become ill-conditioned, and what are the practical consequences for training?
References
Canonical:
- Krantz & Parks, The Implicit Function Theorem (2003)
- Griewank & Walther, Evaluating Derivatives (2008), Chapter 15
Current:
- Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019)
- Blondel et al., "Efficient and Modular Implicit Differentiation" (NeurIPS 2022)
Next Topics
Implicit differentiation connects to bilevel optimization, meta-learning, physics-informed neural networks, and differentiable physics simulations throughout modern machine learning.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The Jacobian Matrix (Layer 0A)
- Automatic Differentiation (Layer 1)