Calculus Objects
Implicit Differentiation
Differentiating through implicit equations and optimization problems: the implicit function theorem gives $dy/dx$ without solving for $y$ explicitly. Applications to bilevel optimization, deep equilibrium models, hyperparameter optimization, and meta-learning.
Why This Matters
In machine learning, you constantly need to differentiate through operations that do not have explicit formulas. How do you compute the gradient of a validation loss with respect to a hyperparameter, when the hyperparameter affects the model through an entire training procedure? How do you backpropagate through a network with infinitely many layers (a fixed point)?
Implicit differentiation answers these questions. Instead of unrolling a computation and differentiating through every step (as in standard automatic differentiation), you differentiate the conditions that define the solution (optimality conditions, fixed-point equations) and solve for the gradient directly. This is often cheaper, more memory-efficient, and more numerically stable. The central objects are Jacobian matrices and Hessians, which encode the local sensitivity of the implicit equation to its inputs.
Mental Model
Suppose you know that $y$ is defined as the solution to some equation $F(x, y) = 0$. You want $dy/dx$. You could try to solve for $y$ explicitly as a function of $x$ and then differentiate. But this is often impossible or expensive. Instead, differentiate the equation directly:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0$$

Solve for $dy/dx$. You never needed to find $y(x)$ explicitly.
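As a concrete sketch (the equation and all function names here are invented for illustration), take $F(x, y) = y^5 + y - x$: there is no closed form for $y(x)$, but implicit differentiation gives $dy/dx = 1/(5y^4 + 1)$ immediately, and we can check it numerically.

```python
# Hypothetical example: y is defined implicitly by F(x, y) = y^5 + y - x = 0,
# which has no closed-form solution for y. Implicit differentiation gives
# dy/dx = -(dF/dx)/(dF/dy) = 1 / (5*y^4 + 1).

def solve_y(x, iters=100):
    """Find y with y^5 + y = x by bisection (the map is increasing in y)."""
    lo, hi = -abs(x) - 1.0, abs(x) + 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid**5 + mid - x < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def implicit_dydx(x):
    y = solve_y(x)
    return 1.0 / (5 * y**4 + 1)  # -(dF/dx)/(dF/dy), with dF/dx = -1, dF/dy = 5y^4 + 1

# Check against a central finite difference of the numerically solved y(x)
x0, h = 2.0, 1e-5
fd = (solve_y(x0 + h) - solve_y(x0 - h)) / (2 * h)
print(abs(implicit_dydx(x0) - fd) < 1e-6)  # True
```

At $x = 2$ the solution is $y = 1$ (since $1^5 + 1 = 2$), so the derivative is exactly $1/6$.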
The Implicit Function Theorem
Implicit Function Theorem
Statement
If $F : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ is continuously differentiable, $F(x_0, y_0) = 0$, and the Jacobian $\partial F/\partial y$ is invertible at $(x_0, y_0)$, then there exists a neighborhood of $x_0$ and a unique continuously differentiable function $y(x)$ such that $y(x_0) = y_0$, $F(x, y(x)) = 0$, and:

$$\frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$$

evaluated at $(x_0, y_0)$.
Intuition
If perturbing $y$ can always correct for perturbations in $x$ (which is guaranteed by the invertibility of $\partial F/\partial y$), then the solution $y(x)$ varies smoothly with $x$. The derivative formula comes from differentiating the constraint $F(x, y(x)) = 0$ and applying the chain rule.
Proof Sketch
Differentiate $F(x, y(x)) = 0$ with respect to $x$ using the multivariable chain rule: $\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0$. Multiply both sides by $\left(\frac{\partial F}{\partial y}\right)^{-1}$ to isolate $\frac{dy}{dx}$. The existence and uniqueness of $y(x)$ follow from the contraction mapping theorem applied to the Newton iteration for solving $F(x, y) = 0$ near $(x_0, y_0)$.
Why It Matters
This is the foundation for differentiating through any implicitly defined quantity. Every application below (argmin differentiation, fixed-point differentiation, DEQs) is a special case.
Failure Mode
If $\partial F/\partial y$ is singular (non-invertible), the implicit function theorem does not apply. This happens at bifurcation points, where the solution structure changes qualitatively (e.g., multiple solutions merge). In practice, ill-conditioned Jacobians lead to numerically unstable gradients.
Differentiating Through Argmin
Differentiation Through Argmin
Statement
Let $y^*(x) = \arg\min_y f(x, y)$. At an interior minimizer, the optimality condition is $\nabla_y f(x, y^*(x)) = 0$. Applying the implicit function theorem with $F(x, y) = \nabla_y f(x, y)$:

$$\frac{dy^*}{dx} = -\left(\nabla^2_{yy} f\right)^{-1} \nabla^2_{yx} f$$

where $\nabla^2_{yy} f$ is the Hessian of $f$ with respect to $y$ and $\nabla^2_{yx} f$ is the mixed partial derivative matrix, both evaluated at $(x, y^*(x))$.
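A minimal numeric check of this formula (the quadratic $f$ and all variable names are invented): for $f(x, y) = \tfrac{1}{2} y^\top A y - x^\top y$, the inner solution is $y^*(x) = A^{-1} x$, and the formula with $\nabla^2_{yy} f = A$ and $\nabla^2_{yx} f = -I$ should recover exactly $A^{-1}$.

```python
import numpy as np

# f(x, y) = 0.5 * y^T A y - x^T y, so the argmin is y*(x) = A^{-1} x in closed form.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # symmetric positive definite

def y_star(x):
    return np.linalg.solve(A, x)              # inner argmin

H_yy = A                                      # Hessian of f in y
H_yx = -np.eye(2)                             # mixed partials: d/dx of (A y - x)
dy_dx = -np.linalg.solve(H_yy, H_yx)          # implicit-function-theorem Jacobian = A^{-1}

# Finite-difference check of one column of the Jacobian
x0, h = np.array([1.0, -2.0]), 1e-6
e0 = np.array([1.0, 0.0])
fd_col0 = (y_star(x0 + h * e0) - y_star(x0 - h * e0)) / (2 * h)
print(np.allclose(dy_dx[:, 0], fd_col0, atol=1e-6))  # True
```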
Intuition
The optimality condition $\nabla_y f = 0$ implicitly defines $y^*$ as a function of $x$. The Hessian $\nabla^2_{yy} f$ tells you how sensitive the optimality condition is to changes in $y$. The mixed derivative $\nabla^2_{yx} f$ tells you how the optimality condition shifts when $x$ changes.
Why It Matters
This formula lets you differentiate through any optimization problem without unrolling the optimization steps. Applications include hyperparameter optimization (differentiate validation loss through training), meta-learning (differentiate outer objective through inner optimization), and optimal control.
Deep Equilibrium Models (DEQs)
A standard deep network applies transformations sequentially: $z_{i+1} = f_i(z_i)$. A deep equilibrium model asks: what if every layer is the same function $f_\theta$ and the depth is taken to infinity? If the iterations converge, they reach a fixed point $z^*$ satisfying:

$$z^* = f_\theta(z^*, x)$$

where $f_\theta$ is a single layer applied repeatedly to the input $x$. This is an implicit equation $F(z^*, x) = z^* - f_\theta(z^*, x) = 0$.
Forward pass. Find $z^*$ by fixed-point iteration (or Anderson acceleration, or Newton's method).
Backward pass. Apply the implicit function theorem. The gradient of a loss $\ell$ with respect to $\theta$ is:

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial z^*} \left(I - \frac{\partial f_\theta}{\partial z^*}\right)^{-1} \frac{\partial f_\theta}{\partial \theta}$$

The key term is the inverse Jacobian $\left(I - \partial f_\theta/\partial z^*\right)^{-1}$ at the fixed point. In practice, this system is solved iteratively (vector-Jacobian products via conjugate gradient or a Neumann series) without forming the full matrix.
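A toy numerical sketch of both passes (the layer, sizes, and seed are all made up): a single layer $f(z, x) = \tanh(Wz + x)$ is iterated to its fixed point, and the implicit gradient of $\ell(z^*) = \sum_i z^*_i$ with respect to $x$ is computed by one linear solve, then checked against finite differences.

```python
import numpy as np

# Toy DEQ: f(z, x) = tanh(W z + x), with ||W|| small so f is a contraction.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))

def f(z, x):
    return np.tanh(W @ z + x)

def forward(x, iters=200):
    z = np.zeros_like(x)
    for _ in range(iters):                   # plain fixed-point iteration
        z = f(z, x)
    return z

def implicit_grad(x):
    """Gradient of l(z*) = sum(z*) w.r.t. x via the implicit function theorem."""
    z = forward(x)
    s = 1.0 - np.tanh(W @ z + x) ** 2        # tanh' at the fixed point
    J_z = s[:, None] * W                     # df/dz
    J_x = np.diag(s)                         # df/dx
    v = np.linalg.solve((np.eye(4) - J_z).T, np.ones(4))  # (I - J_z)^{-T} dl/dz*
    return J_x.T @ v

# Finite-difference check on one coordinate of x
x0, h = rng.standard_normal(4), 1e-6
e0 = np.eye(4)[0]
fd = (forward(x0 + h * e0).sum() - forward(x0 - h * e0).sum()) / (2 * h)
print(np.isclose(implicit_grad(x0)[0], fd, atol=1e-5))
```

Note that the backward pass never stores the 200 forward iterations; it only uses the converged $z^*$ and the Jacobian there.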
Why this matters. DEQs have the representational power of an infinite-depth network but the memory cost of a single layer. There is no computation graph to store for backpropagation; the implicit derivative depends only on the fixed point and the Jacobian at that point.
Applications
Bilevel Optimization
Many ML problems are bilevel: an outer objective depends on the solution of an inner optimization:

$$\min_x \; F(x, y^*(x)) \quad \text{where} \quad y^*(x) = \arg\min_y \; G(x, y)$$

Hyperparameter optimization (outer: validation loss, inner: training), neural architecture search, and data distillation all fit this framework. Implicit differentiation computes the hypergradient $dF/dx$ efficiently by differentiating through the optimality conditions of the inner problem.
Meta-Learning (MAML and Beyond)
In Model-Agnostic Meta-Learning (MAML), the inner loop adapts to a task by gradient descent and the outer loop optimizes the initialization. Unrolling the inner loop and differentiating through it requires storing all intermediate states and computing second-order gradients. Implicit differentiation through the inner loop's optimality conditions (iMAML) avoids this: it only needs the final adapted parameters and the Hessian at convergence.
Comparison: Implicit vs. Unrolled Differentiation
The choice between implicit differentiation and unrolling affects memory, compute, and correctness.
| Property | Unrolled differentiation | Implicit differentiation |
|---|---|---|
| Memory | $O(T)$ for $T$ inner iterations (stores the full computation graph) | $O(1)$ (only stores the fixed point/optimum) |
| Compute per gradient | One backward pass through $T$ steps | One linear system solve (typically 10-50 conjugate gradient iterations) |
| Convergence requirement | Works even if inner loop has not converged | Requires inner loop to have converged (otherwise the optimality condition does not hold) |
| Gradient accuracy | Exact gradient of the truncated problem | Exact gradient of the converged problem (approximate if solved iteratively) |
| Bias | Biased if inner loop is truncated early | Unbiased at convergence; the gradient is of the true bilevel objective |
| Hessian needed | No (only Jacobian-vector products from autodiff) | Yes (Hessian-vector products for the inverse Hessian solve) |
| Implementation | Standard autodiff (PyTorch/JAX) | Requires custom backward pass or libraries (e.g., jaxopt, torchopt) |
The practical tradeoff: unrolling is simpler to implement and works with any inner loop, but uses memory proportional to the number of inner steps. Implicit differentiation is more memory-efficient and gives unbiased gradients, but requires the inner problem to be solved to convergence and needs Hessian-vector products.
For hyperparameter optimization where the inner loop is full training (thousands of SGD steps), unrolling is infeasible due to memory. Implicit differentiation is the only practical option. For meta-learning with short inner loops (5-10 steps), unrolling is feasible and often preferred for simplicity.
Hyperparameter Optimization via Implicit Gradients
The most direct ML application: compute $d\mathcal{L}_{\text{val}}/d\lambda$ where $\lambda$ is a hyperparameter (regularization weight, learning rate, data augmentation strength). The training procedure defines model parameters $w^*(\lambda) = \arg\min_w \mathcal{L}_{\text{train}}(w, \lambda)$. The validation loss is $\mathcal{L}_{\text{val}}(w^*(\lambda))$.
By the chain rule and the argmin differentiation formula:

$$\frac{d\mathcal{L}_{\text{val}}}{d\lambda} = -\nabla_w \mathcal{L}_{\text{val}}^{\top} \left(\nabla^2_{ww} \mathcal{L}_{\text{train}}\right)^{-1} \nabla^2_{w\lambda} \mathcal{L}_{\text{train}}$$

The inverse Hessian-vector product is computed iteratively using conjugate gradient. Each conjugate gradient step requires one Hessian-vector product, which costs about the same as one gradient computation via automatic differentiation. Typically 10-50 CG steps suffice for a good approximation, making the total cost of one hypergradient comparable to 10-50 training gradient steps.
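A minimal sketch of this pipeline on ridge regression (the data, sizes, and helper names are all invented): the inner problem has a closed-form solution, so the CG-based hypergradient can be verified against a finite difference in $\lambda$.

```python
import numpy as np

# Inner problem: w*(lam) = argmin_w ||X w - y||^2 + lam ||w||^2.
rng = np.random.default_rng(1)
X, yt = rng.standard_normal((20, 5)), rng.standard_normal(20)
Xval, yval = rng.standard_normal((10, 5)), rng.standard_normal(10)

def w_star(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ yt)

def val_loss(lam):
    r = Xval @ w_star(lam) - yval
    return r @ r

def hvp(lam, v):
    """Hessian-vector product of the training loss: H v = 2 (X^T X + lam I) v."""
    return 2 * (X.T @ (X @ v) + lam * v)

def cg_solve(lam, b, iters=50):
    """Plain conjugate gradient for H s = b, using only Hessian-vector products."""
    s, r = np.zeros_like(b), b.copy()
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Hp = hvp(lam, p)
        a = rs / (p @ Hp)
        s, r = s + a * p, r - a * Hp
        rs_new = r @ r
        if rs_new < 1e-12:                   # stop once the residual is negligible
            break
        p, rs = r + (rs_new / rs) * p, rs_new
    return s

def hypergrad(lam):
    w = w_star(lam)
    g_val = 2 * Xval.T @ (Xval @ w - yval)   # grad of L_val w.r.t. w
    s = cg_solve(lam, g_val)                 # H^{-1} g_val via CG
    mixed = 2 * w                            # d/dlam of grad_w L_train
    return -s @ mixed

lam0, h = 0.5, 1e-5
fd = (val_loss(lam0 + h) - val_loss(lam0 - h)) / (2 * h)
print(np.isclose(hypergrad(lam0), fd, atol=1e-4))
```

Each CG iteration calls `hvp` once, which is the analogue of one extra gradient computation in a real autodiff setting.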
Common Confusions
Implicit differentiation does not require solving a linear system from scratch
The formula involves $\left(\frac{\partial F}{\partial y}\right)^{-1}$, but in practice you never form and invert this matrix. You solve a linear system iteratively using conjugate gradient or the truncated Neumann series $(I - A)^{-1} = \sum_{k=0}^{\infty} A^k$. This is efficient when $\partial F/\partial y$ is available as a matrix-vector product (via autodiff).
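A small sketch of the truncated Neumann approach (the matrix and names are invented): $(I - A)^{-1} b \approx \sum_{k=0}^{K} A^k b$ when the spectral radius of $A$ is below 1, using only matrix-vector products with $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = 0.1 * rng.standard_normal((6, 6))        # small norm, so the series converges
b = rng.standard_normal(6)

def neumann_solve(matvec, b, terms=50):
    """Approximate (I - A)^{-1} b as b + A b + A^2 b + ... using only matvecs."""
    out, term = b.copy(), b.copy()
    for _ in range(terms):
        term = matvec(term)                  # next power A^k b, computed incrementally
        out += term
    return out

approx = neumann_solve(lambda v: A @ v, b)
exact = np.linalg.solve(np.eye(6) - A, b)
print(np.allclose(approx, exact, atol=1e-8))
```

The truncation level trades gradient accuracy against the number of matrix-vector products, exactly as the CG iteration count does.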
Implicit differentiation is not the same as unrolling
Unrolling differentiates through each step of an iterative solver, storing all intermediate states. Implicit differentiation differentiates through the solution (the fixed point or optimum) directly. Unrolling uses $O(T)$ memory for $T$ iterations; implicit differentiation uses $O(1)$ memory. However, implicit differentiation requires the iteration to have converged.
Summary
- Implicit function theorem: if $F(x, y(x)) = 0$ and $\partial F/\partial y$ is invertible, then $\frac{dy}{dx} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$
- Differentiating through argmin: use the optimality condition $\nabla_y f(x, y) = 0$ as the implicit equation
- DEQs: infinite-depth networks with $O(1)$-memory backpropagation via implicit differentiation at the fixed point
- Bilevel optimization and meta-learning use implicit differentiation to avoid unrolling inner optimization loops
- In practice, solve the linear system iteratively using conjugate gradient or Neumann series, not by matrix inversion
Exercises
Problem
Consider the equation $x^2 + y^2 = 1$ defining the unit circle. Use implicit differentiation to find $dy/dx$ at a point $(x, y)$ on the circle with $y \neq 0$.
Problem
In ridge regression, $w^*(\lambda) = \arg\min_w \|Xw - y\|^2 + \lambda \|w\|^2$. Compute $dw^*/d\lambda$ using implicit differentiation through the optimality conditions.
Problem
A DEQ defines $z^* = f_\theta(z^*, x)$. The backward pass requires solving a linear system with the matrix $I - \frac{\partial f_\theta}{\partial z^*}$. When does this linear system become ill-conditioned, and what are the practical consequences for training?
References
Canonical:
- Krantz & Parks, The Implicit Function Theorem (2003)
- Griewank & Walther, Evaluating Derivatives (2008), Chapter 15
Current:
- Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019)
- Blondel et al., "Efficient and Modular Implicit Differentiation" (NeurIPS 2022)
Next Topics
Implicit differentiation connects to bilevel optimization, meta-learning, physics-informed neural networks, and differentiable physics simulations throughout modern machine learning.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The Jacobian Matrix (Layer 0A)
- Automatic Differentiation (Layer 1)