# Adjoint Sensitivity Method
Compute gradients through an ODE solver by integrating a backward adjoint ODE, trading $O(N \cdot d)$ activation memory for $O(d)$ at the cost of a second integration.
## Why This Matters
A Neural ODE integrated with a Runge-Kutta solver over $N$ steps holds an activation tensor at every step. Backpropagating through the unrolled solver costs $O(N \cdot d)$ memory, the same as a residual network of depth $N$. For an adaptive solver that takes thousands of steps on a stiff trajectory, this is fatal.

The adjoint sensitivity method computes the same gradient by integrating a second ODE backward in time, recovering activations on the fly. Memory cost drops to $O(1)$ in depth, independent of $N$. The price is a second integration of the dynamics and the loss of bit-exact gradients (the recovered trajectory differs from the original by solver error).
This trade is not new to deep learning. Pontryagin derived the adjoint equation in 1962 for optimal control. The control community has used it for sixty years to differentiate through dynamical systems. Chen, Rubanova, Bettencourt, and Duvenaud (2018) imported it into deep learning and made Neural ODEs trainable at scale.
## Setup

Let $z(t) \in \mathbb{R}^d$ evolve under $\dot z = f_\theta(z, t)$ from $t_0$ to $t_1$, with initial state $z(t_0) = z_0$. Let $\mathcal{L} = L(z(t_1))$ be a scalar loss depending on the final state. We want gradients with respect to:

- the initial state $z(t_0)$,
- the parameters $\theta$, and
- the time bounds $t_0$ and $t_1$.

The naive approach unrolls the solver and applies reverse-mode autodiff. The adjoint approach defines $a(t) = \partial \mathcal{L}/\partial z(t)$ and derives an ODE for $a(t)$ that runs backward in time.
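To make the unrolled approach concrete, here is a minimal sketch of backprop through a fixed-step Euler solver on an assumed toy scalar system $\dot z = -\alpha z^3$ (the dynamics, loss, and step counts are illustrative choices, not anything prescribed by the text; a Neural ODE would replace the cubic term with a network):

```python
# Backprop through an unrolled Euler solver on dz/dt = -alpha * z**3 (toy system).
import numpy as np

def unrolled_euler_grad(z0, alpha, T, N):
    """Unroll forward Euler, store every state, then backprop through the unroll."""
    h = T / N
    zs = [z0]                                  # O(N) activation storage
    for _ in range(N):
        zs.append(zs[-1] + h * (-alpha * zs[-1] ** 3))
    L = 0.5 * zs[-1] ** 2                      # loss on the final state
    a = zs[-1]                                 # dL/dz_N
    for k in reversed(range(N)):
        # Step Jacobian dz_{k+1}/dz_k = 1 - 3*alpha*h*z_k^2 consumes the stored z_k.
        a = a * (1.0 - 3.0 * alpha * h * zs[k] ** 2)
    return L, a, len(zs)

L, dL_dz0, stored = unrolled_euler_grad(z0=2.0, alpha=0.5, T=1.0, N=4000)
# Continuous solution: z(t) = z0 / sqrt(1 + 2*alpha*z0^2*t), so the continuous
# gradient is dL/dz0 = z0 / (1 + 2*alpha*z0^2*T)^2 = 0.08 here; the discrete
# gradient approaches it as h -> 0 while storage grows linearly in N.
print(stored, dL_dz0)
```

The reverse sweep reads every stored state, which is exactly the $O(N \cdot d)$ memory cost the adjoint method removes.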
## The Adjoint Equation

**Adjoint Sensitivity for ODEs**

### Statement

Define the adjoint $a(t) = \partial \mathcal{L}/\partial z(t)$. Then $a(t)$ satisfies the backward ODE

$$\dot a = -a^\top \, \frac{\partial f_\theta(z, t)}{\partial z}$$

with terminal condition $a(t_1) = \partial L/\partial z(t_1)$. The parameter gradient is recovered by

$$\frac{\partial \mathcal{L}}{\partial \theta} = -\int_{t_1}^{t_0} a(t)^\top \, \frac{\partial f_\theta(z(t), t)}{\partial \theta} \, dt.$$

Both quantities are obtained by integrating an augmented system from $t_1$ back to $t_0$, alongside the reconstructed state $z(t)$.
### Intuition

Think of $a(t)$ as the sensitivity of the loss to a perturbation of the state at time $t$. A perturbation at time $t$ propagates forward through the dynamics and ends up shifting $z(t_1)$ by the linearized flow map. The chain rule says $a(t)^\top = a(t_1)^\top \, \partial z(t_1)/\partial z(t)$. Differentiating this in $t$ gives the adjoint ODE. The minus sign is because $a(t)$ is defined with respect to the loss at the end of the interval, so it accumulates as $t$ decreases.
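The differentiation step can be spelled out in two lines by composing sensitivities over an infinitesimal step, following the style of Chen et al. (2018), Appendix B; this makes the origin of the minus sign explicit:

```latex
\begin{aligned}
a(t)^\top &= a(t+\varepsilon)^\top \, \frac{\partial z(t+\varepsilon)}{\partial z(t)}
           = a(t+\varepsilon)^\top \Big( I + \varepsilon \, \frac{\partial f_\theta(z(t), t)}{\partial z} \Big) + O(\varepsilon^2), \\
\dot a(t)^\top &= \lim_{\varepsilon \to 0^+} \frac{a(t+\varepsilon)^\top - a(t)^\top}{\varepsilon}
             = -a(t)^\top \, \frac{\partial f_\theta(z(t), t)}{\partial z}.
\end{aligned}
```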
### Proof Sketch

Form the Lagrangian $\mathcal{J} = L(z(t_1)) + \int_{t_0}^{t_1} a(t)^\top \big( f_\theta(z, t) - \dot z \big) \, dt$, where $a(t)$ is a Lagrange multiplier. The integral is zero on the trajectory. Take the variation in $\theta$, integrate the $a^\top \delta \dot z$ term by parts using $\delta \dot z = \tfrac{d}{dt} \delta z$, and collect terms in $\delta z$. Setting the coefficient of $\delta z(t)$ to zero gives the adjoint ODE. The boundary term at $t_1$ gives the terminal condition. The remaining term $\int_{t_0}^{t_1} a^\top (\partial f_\theta / \partial \theta) \, dt$ gives the parameter gradient.
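Written out (a standard calculation, with the multiplier sign chosen so the terminal condition comes out positive), the variation collects as:

```latex
\delta \mathcal{J} =
\Big( \frac{\partial L}{\partial z(t_1)} - a(t_1)^\top \Big) \delta z(t_1)
+ a(t_0)^\top \delta z(t_0)
+ \int_{t_0}^{t_1} \Big( \dot a^\top + a^\top \frac{\partial f_\theta}{\partial z} \Big) \delta z \, dt
+ \int_{t_0}^{t_1} a^\top \frac{\partial f_\theta}{\partial \theta} \, dt \; \delta \theta.
```

Killing the first and third terms yields the terminal condition and the adjoint ODE; the second and fourth terms read off $\partial \mathcal{L}/\partial z(t_0) = a(t_0)$ and the parameter gradient.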
### Why It Matters

Three quantities flow backward from $t_1$ to $t_0$ in a single augmented ODE: the reconstructed state $z(t)$, the adjoint $a(t)$, and the running parameter gradient $g(t)$. Total memory: three vectors, independent of how many solver steps are taken. A solver that takes 10,000 adaptive steps still uses constant memory in depth.
### Failure Mode

The reconstructed $z(t)$ from the backward solve is not the same trajectory as the forward solve, because adaptive step sizes and floating-point error diverge. The adjoint gradient is therefore an approximation of the discretize-then-differentiate gradient, and the two can disagree noticeably for stiff systems. See "Discretize vs Optimize" below.
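The failure is easy to reproduce on an assumed toy dissipative system (parameters here are illustrative): integrating forward and then reconstructing backward with the same fixed-step solver loses the initial state entirely, because the backward dynamics expand the very errors the forward dynamics contracted.

```python
# Toy failure-mode demo: forward-then-backward Euler on dz/dt = -lam * z.
import numpy as np

def euler(z, t0, t1, n_steps, lam):
    """Fixed-step Euler for dz/dt = -lam*z; integrates backward if t1 < t0."""
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        z = z + h * (-lam * z)
    return z

z0, lam, T, N = 1.0, 30.0, 1.0, 100
zT = euler(z0, 0.0, T, N, lam)      # forward solve: strongly contracted state
z0_rec = euler(zT, T, 0.0, N, lam)  # backward reconstruction of z(0)
rel_err = abs(z0_rec - z0) / abs(z0)
print(rel_err)                      # O(1) relative error: the start is lost
```

Forward and backward Euler steps are not inverses of one another, so the reconstruction error compounds exponentially over the interval; any adjoint evaluated on such a trajectory queries $\partial f_\theta/\partial z$ at the wrong states.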
## The Augmented Backward ODE

The full backward integration runs three ODEs jointly from $t_1$ to $t_0$:

$$\begin{aligned} \dot z &= f_\theta(z, t) \\ \dot a &= -a^\top \, \frac{\partial f_\theta(z, t)}{\partial z} \\ \dot g &= -a^\top \, \frac{\partial f_\theta(z, t)}{\partial \theta} \end{aligned}$$

with terminal conditions $z(t_1)$ from the forward solve, $a(t_1) = \partial L/\partial z(t_1)$, and $g(t_1) = 0$. At $t = t_0$, $g(t_0) = \partial \mathcal{L}/\partial \theta$ and $a(t_0) = \partial \mathcal{L}/\partial z(t_0)$.

Each backward solver step requires evaluating $\partial f_\theta / \partial z$ and $\partial f_\theta / \partial \theta$. Both are vector-Jacobian products done with a single reverse-mode pass through the network defining $f_\theta$.

## Discretize vs Optimize

Two philosophies clash here, and they give different gradients.

**Optimize-then-discretize** (the adjoint method). Derive the continuous adjoint ODE first, then discretize it with a numerical solver. This gives the $O(1)$ memory benefit but produces gradients that are not the exact gradient of any particular discrete computation. They are the discretized version of the true continuous gradient.

**Discretize-then-optimize** (backprop through the solver). Pick a discrete solver (RK4, Dormand-Prince), unroll it, and apply reverse-mode autodiff to the unrolled graph. This gives the exact gradient of the discrete forward computation but costs $O(N \cdot d)$ memory.

The two gradients agree only in the limit of vanishing step size. For adaptive solvers on stiff problems they can disagree by enough to destabilize training. Gholaminejad et al. (2019, ANODE) documented cases where the adjoint method's gradients are wrong by tens of percent. The fix they proposed (ANODE-V2) recomputes activations at the same step locations as the forward pass, recovering bit-exact agreement with discretize-then-optimize at the cost of storing solver state.

In practice, well-conditioned Neural ODEs (small $T$, Lipschitz $f_\theta$) have agreement to several decimal places and the adjoint method is the default.
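As a concrete end-to-end check on a well-conditioned problem, here is a sketch of the augmented backward system for the assumed scalar dynamics $\dot z = -\alpha z$, where both Jacobians are available in closed form ($\partial f/\partial z = -\alpha$, $\partial f/\partial \alpha = -z$); a Neural ODE would obtain these as vector-Jacobian products through the network instead. The recovered gradients match the analytic values to solver precision.

```python
# Augmented backward ODE (state, adjoint, running parameter gradient) for the
# scalar linear system dz/dt = -alpha * z, with L = 0.5 * z(T)^2.
import numpy as np

def rk4_step(f, s, t, h):
    """One classical RK4 step; h < 0 integrates backward in time."""
    k1 = f(s, t)
    k2 = f(s + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(s + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(s + h * k3, t + h)
    return s + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def adjoint_gradients(z0, alpha, T, N=1000):
    h = T / N
    z = z0
    for k in range(N):                        # forward solve: only z(t1) is kept
        z = rk4_step(lambda s, t: -alpha * s, z, k * h, h)
    zT = z

    def aug(s, t):                            # augmented backward dynamics
        z, a, g = s
        return np.array([
            -alpha * z,                       # dz/dt = f(z)
            alpha * a,                        # da/dt = -a * df/dz = -a * (-alpha)
            a * z,                            # dg/dt = -a * df/dalpha = -a * (-z)
        ])

    s = np.array([zT, zT, 0.0])               # a(t1) = dL/dz(t1) = z(t1); g(t1) = 0
    for k in range(N):                        # integrate from t1 back to t0
        s = rk4_step(aug, s, T - k * h, -h)
    return zT, s[1], s[2]                     # z(t1), dL/dz0, dL/dalpha

zT, dz0, dalpha = adjoint_gradients(z0=2.0, alpha=0.5, T=1.0)
# Analytic values: dL/dz0 = z0*exp(-2*alpha*T), dL/dalpha = -T*z0^2*exp(-2*alpha*T).
print(abs(dz0 - 2.0 * np.exp(-1.0)), abs(dalpha + 4.0 * np.exp(-1.0)))
```

The same closed-form case is worked analytically in the first exercise below, so the two can be compared line by line.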
For stiff or chaotic dynamics, prefer discretize-then-optimize with [gradient checkpointing](/topics/activation-checkpointing).

## Pontryagin's Maximum Principle

The adjoint equation is half of Pontryagin's maximum principle (Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, 1962). The other half is an optimality condition on the control:

$$u^*(t) = \arg\max_u \, H(z(t), a(t), u, t)$$

where $H = a^\top f(z, u, t) - L(z, u, t)$ is the Hamiltonian. In a Neural ODE, the "control" is the parameter vector $\theta$, and we use gradient descent rather than the maximum principle to update it. So Neural ODE training uses the adjoint equation but not the maximum principle in full.

The maximum principle is the optimality condition for the constrained optimization problem of optimal control; the adjoint equation alone is the gradient computation. They are bundled in textbooks but separable in practice, and Neural ODEs use only the gradient half.

## Common Confusions

<Confusion title="The adjoint method is not just backprop">
Standard backprop on an unrolled solver is discretize-then-optimize: it gives the exact gradient of the discrete forward pass. The adjoint method is optimize-then-discretize: it gives a discretized approximation of the continuous gradient. The two are different gradients of different objects, and they happen to agree in the limit of vanishing step size.
</Confusion>

<Confusion title="O(1) memory is in depth, not in width">
The adjoint method uses memory proportional to the state dimension $d$, not proportional to the number of solver steps $N$. A solver taking 10,000 steps on a $d = 1024$ state still uses $O(d)$ memory. But the state itself can be arbitrarily large, so "O(1) memory" is shorthand for "no dependence on solver step count," not "constant total bytes."
</Confusion>

<Confusion title="Reconstructing z(t) backward is not free">
The backward solve of $\dot z = f_\theta(z, t)$ runs the same dynamics in reverse.
For dissipative systems (contracting in time), the forward dynamics shrink errors and the backward dynamics amplify them. Numerical reconstruction diverges from the true trajectory exponentially fast. This is why checkpointing, which saves $z$ at a few intermediate times, is often more robust than full backward reconstruction.
</Confusion>

## Summary

- The adjoint sensitivity method computes ODE gradients by integrating a backward adjoint ODE, costing $O(d)$ memory in depth instead of $O(N \cdot d)$ for solver-unrolled backprop.
- The adjoint $a(t) = \partial \mathcal{L}/\partial z(t)$ satisfies $\dot a = -a^\top \partial f_\theta / \partial z$ with terminal condition $a(t_1) = \partial L / \partial z(t_1)$.
- Three ODEs run jointly backward: state, adjoint, and accumulating parameter gradient.
- Optimize-then-discretize (adjoint) gives a discretized continuous gradient; discretize-then-optimize (backprop) gives an exact discrete gradient. They differ by solver error and can disagree noticeably on stiff systems.
- Pontryagin (1962) derived the adjoint equation for optimal control; Chen et al. (2018) imported it into deep learning for Neural ODEs.

## Exercises

<Exercise difficulty="core">
<Problem>
Consider the linear scalar ODE $\dot z = -\alpha z$ with $z(0) = z_0$, scalar parameter $\alpha > 0$, integration interval $[0, T]$, and loss $\mathcal{L} = \tfrac{1}{2} z(T)^2$. Write down the adjoint equation, solve it analytically, and compute $\partial \mathcal{L}/\partial \alpha$. Verify against direct differentiation of the closed-form solution $z(T) = z_0 e^{-\alpha T}$.
</Problem>
<Hint>
The adjoint equation is $\dot a = -a \cdot (\partial f / \partial z) = \alpha a$, which integrates from $a(T) = z(T)$ back to $a(0)$. The parameter gradient is $-\int_T^0 a(t) \, (\partial f / \partial \alpha) \, dt = -\int_0^T a(t) \, z(t) \, dt$. Be careful with the sign of $\partial f / \partial \alpha = -z$.
</Hint>
<Solution>
The Jacobian is $\partial f / \partial z = -\alpha$, so the adjoint ODE is $\dot a = -a(-\alpha) = \alpha a$. With terminal condition $a(T) = z(T) = z_0 e^{-\alpha T}$, the solution running backward is $a(t) = a(T) e^{\alpha(t - T)} = z_0 e^{-\alpha T} e^{\alpha(t - T)} = z_0 e^{\alpha t - 2 \alpha T}$.

The parameter gradient is $\frac{d\mathcal{L}}{d\alpha} = -\int_T^0 a(t) (-z(t)) \, dt = -\int_0^T a(t) z(t) \, dt$. With $z(t) = z_0 e^{-\alpha t}$, $a(t) z(t) = z_0^2 e^{\alpha t - 2\alpha T} e^{-\alpha t} = z_0^2 e^{-2\alpha T}$. The integrand is constant, so $\int_0^T z_0^2 e^{-2\alpha T} dt = T z_0^2 e^{-2\alpha T}$. Therefore $\frac{d\mathcal{L}}{d\alpha} = -T z_0^2 e^{-2\alpha T}$.

Direct check: $\mathcal{L} = \tfrac{1}{2} z_0^2 e^{-2\alpha T}$, so $\frac{d\mathcal{L}}{d\alpha} = \tfrac{1}{2} z_0^2 \cdot (-2T) e^{-2\alpha T} = -T z_0^2 e^{-2\alpha T}$. Agreement.
</Solution>
</Exercise>

<Exercise difficulty="advanced">
<Problem>
Suppose a Neural ODE on a $d = 64$ state is integrated with an adaptive Dormand-Prince solver that takes 500 steps on average. Compare the memory required for (a) standard backprop through the unrolled solver, (b) gradient checkpointing with $\sqrt{N}$ checkpoints, and (c) the adjoint method. Then explain why a researcher training on chaotic Lorenz dynamics might prefer (a) or (b) over (c) despite the higher memory cost.
</Problem>
<Hint>
Standard backprop stores activations at every solver step. Checkpointing saves a subset and recomputes between them. The adjoint reconstructs $z(t)$ backward from $z(t_1)$. Lorenz dynamics are chaotic: nearby trajectories diverge exponentially.
</Hint>
<Solution>
(a) Standard backprop: 500 activations of dimension 64 = 32,000 floats per sample. With a batch of 32, that's about 4 MB just for the ODE intermediate states (ignoring per-step Jacobian work).
(b) Checkpointing with $\sqrt{500} \approx 22$ checkpoints: 22 saved states plus the active recomputation segment, around 1,408 floats per sample. About 180 KB for batch 32. Runtime increases by roughly one extra forward solve from the recomputation.

(c) Adjoint method: three vectors — the state, the adjoint, and the running parameter gradient $g$, which lives in $\theta$ space and has dimension $|\theta|$, possibly much larger than $d$. For a network with $|\theta| = 10^5$, the dominant memory is the running gradient itself, not the per-sample state. Per-sample memory in the depth dimension is $O(d) = 64$ floats.

For chaotic Lorenz dynamics, the backward reconstruction of $z(t)$ from $z(t_1)$ amplifies floating-point error exponentially: a $10^{-7}$ perturbation at $t_1$ becomes $O(1)$ at $t_0$ over a few Lyapunov times. The reconstructed trajectory bears no resemblance to the forward one, and $\partial f / \partial z$ evaluated on the wrong trajectory gives wrong gradients. Checkpointing solves this by re-running the forward pass on short segments, which avoids the long-range divergence problem. The memory cost is acceptable; the wrong-gradient cost of (c) is not.
</Solution>
</Exercise>

## References

**Canonical:**

- Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, "The Mathematical Theory of Optimal Processes" (1962). Chapter 1 derives the maximum principle and the adjoint equation.
- Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, arXiv:1806.07366), Section 2 and Appendix B.

**Critical analysis:**

- Gholaminejad, Keutzer, Biros, "ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs" (IJCAI 2019, arXiv:1902.10298). Documents cases where the adjoint method's gradients diverge from discretize-then-optimize and proposes a fix.
- Onken, Ruthotto, "Discretize-Optimize vs. Optimize-Discretize for Time-Series Regression and Continuous Normalizing Flows" (arXiv:2005.13420).

**Continuous control background:**

- Bryson, Ho, "Applied Optimal Control: Optimization, Estimation, and Control" (1975). Chapter 2 covers continuous-time gradient computation via adjoints.
- Lewis, Vrabie, Syrmos, "Optimal Control" (3rd edition, 2012), Chapter 3.

**Neural-ODE-specific:**

- Kidger, "On Neural Differential Equations" (PhD thesis, 2022, arXiv:2202.02435). Chapter 3 covers gradient computation in detail.
- Massaroli, Poli, Park, Yamashita, Asama, "Dissecting Neural ODEs" (NeurIPS 2020, arXiv:2002.08071), Section 3 compares gradient methods.

## Next Topics

- **[Continuous normalizing flows](/topics/continuous-normalizing-flows)**: the adjoint method makes large-scale CNF training tractable
- **[Neural SDEs](/topics/neural-sdes)**: stochastic generalization requires Brownian-motion replay alongside the adjoint
- **[Activation checkpointing](/topics/activation-checkpointing)**: the discrete-network analogue, also trading compute for memory

Last reviewed: April 17, 2026
## Prerequisites

Foundations this topic depends on.

- Neural ODEs and Continuous-Depth Networks (Layer 4)
- Classical ODEs: Existence, Stability, and Numerical Methods (Layer 1)
- Continuity in R^n (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- The Jacobian Matrix (Layer 0A)
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Automatic Differentiation (Layer 1)