Beyond LLMs
Neural ODEs and Continuous-Depth Networks
Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations.
Why This Matters
A ResNet computes $h_{t+1} = h_t + f(h_t, \theta_t)$ at each layer. This is Euler's method applied to an ODE. If you take the step size to zero and the number of layers to infinity, you get a continuous dynamical system:

$$\frac{dh}{dt} = f(h(t), t, \theta)$$
This reframing is not just mathematical elegance. It gives you: constant memory backpropagation (via the adjoint method), adaptive computation depth (the ODE solver decides how many steps to take), and a bridge between deep learning and dynamical systems theory. The tradeoffs are real: training is slower, and the expressiveness is constrained by ODE theory (no crossing trajectories). Neural ODEs connect to DEQ models (which solve for the ODE's fixed point directly) and continuous thought machines (which use ODE dynamics for adaptive-depth reasoning).
The ResNet-ODE Connection
ResNet as Discretized ODE
Statement
A residual network with update $h_{t+1} = h_t + f(h_t, \theta_t)$ for $t = 0, 1, \dots, T-1$ is the Euler discretization of the initial value problem:

$$\frac{dh}{dt} = f(h(t), t, \theta), \qquad h(0) = x$$

with step size $\Delta t = 1$. In the limit $\Delta t \to 0$, the discrete trajectory converges to the continuous solution $h(t)$ (under Lipschitz conditions on $f$).
The output of the network is $h(T)$.
Intuition
Each ResNet layer adds a small correction to the hidden state. In the continuous limit, these corrections become a vector field that flows the input through a smooth trajectory. The network's "depth" becomes a continuous time variable. Deeper networks correspond to longer integration times, and the network learns the vector field that transforms inputs into useful representations.
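The correspondence can be checked numerically. Below is a minimal sketch (names like `euler_resnet` and the choice of `f` are illustrative, not from any library) that treats each residual update as one Euler step and shows that adding layers approximates the continuous flow with first-order accuracy:

```python
import numpy as np

# Toy "ResNet" whose layer update h <- h + dt * f(h) is one Euler step of
# dh/dt = f(h). W, f, and euler_resnet are illustrative names, not a real API.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1

def f(h):
    return np.tanh(W @ h)  # one residual branch

def euler_resnet(h0, n_layers, T=1.0):
    """n_layers residual updates == Euler integration with step dt = T / n_layers."""
    h, dt = h0.copy(), T / n_layers
    for _ in range(n_layers):
        h = h + dt * f(h)
    return h

h0 = rng.standard_normal(4)
fine = euler_resnet(h0, 10_000)  # near-continuous reference trajectory
err10 = np.linalg.norm(euler_resnet(h0, 10) - fine)
err100 = np.linalg.norm(euler_resnet(h0, 100) - fine)
print(err10, err100)  # error shrinks roughly 10x per 10x depth: first-order convergence
```

The 10-layer "network" and the 100-layer one compute the same flow at different resolutions, which is exactly the discretization picture above.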
Why It Matters
This perspective explains why ResNets work: the skip connection is not just a gradient-flow trick. It makes each layer an incremental transformation, and incremental transformations compose into smooth, invertible maps (under mild conditions). This is why ResNets can be very deep without the representation collapsing.
It also enables replacing the fixed $T$-layer architecture with an adaptive ODE solver that chooses its own step size. Easy inputs get fewer steps; hard inputs get more. This is adaptive computation depth without architectural changes.
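What "the solver chooses its own step size" means can be sketched in a few lines. This is a toy error controller (names, tolerances, and the controller constants are illustrative; production solvers such as Dormand-Prince refine the same idea): take a second-order Heun step, estimate the local error against a first-order Euler step, and let that estimate grow or shrink the step:

```python
import numpy as np

# Sketch of adaptive step-size control: accept a step only when the local
# error estimate (Heun minus Euler) is below tol; otherwise shrink dt.
def f(t, h):
    return -h  # simple test dynamics with exact solution h(0) * exp(-t)

def adaptive_heun(h0, t0=0.0, t1=1.0, tol=1e-5):
    t, h, dt, steps = t0, np.asarray(h0, dtype=float), 0.1, 0
    while t < t1 - 1e-12:
        dt = min(dt, t1 - t)                 # do not overshoot the endpoint
        k1 = f(t, h)
        k2 = f(t + dt, h + dt * k1)
        euler = h + dt * k1                  # 1st-order step
        heun = h + 0.5 * dt * (k1 + k2)      # 2nd-order step
        err = np.max(np.abs(heun - euler))   # local error estimate
        if err <= tol:                       # accept the step
            t, h, steps = t + dt, heun, steps + 1
        # controller: shrink on reject, grow cautiously on accept
        dt *= 0.9 * min(2.0, max(0.2, np.sqrt(tol / (err + 1e-12))))
    return h, steps

h, steps = adaptive_heun(np.array([1.0]))
print(h[0], steps)  # h[0] close to exp(-1); the solver picked its own step count
```

The number of "layers" (accepted steps) is an output of the computation, not a hyperparameter, which is the sense in which depth becomes adaptive.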
Failure Mode
The continuous limit requires the dynamics to be Lipschitz continuous. If $f$ is not Lipschitz (e.g., if it has sharp discontinuities), the ODE may not have a unique solution, and the convergence of Euler's method is not guaranteed: the classic example $\frac{dh}{dt} = \sqrt{|h|}$ with $h(0) = 0$ admits both $h(t) \equiv 0$ and $h(t) = t^2/4$. In practice, standard neural network architectures with smooth activations (tanh, softplus) satisfy this condition. ReLU is Lipschitz but not differentiable at 0; in practice this causes no issues.
The Adjoint Method
Adjoint Sensitivity Method
Statement
To compute $\partial L / \partial \theta$ for a neural ODE, define the adjoint state $a(t) = \partial L / \partial h(t)$. The adjoint satisfies a backward ODE:

$$\frac{da}{dt} = -a(t)^\top \frac{\partial f(h(t), t, \theta)}{\partial h}$$

integrated backwards from $t = T$ to $t = 0$ with initial condition $a(T) = \partial L / \partial h(T)$. The parameter gradient is:

$$\frac{\partial L}{\partial \theta} = \int_0^T a(t)^\top \frac{\partial f(h(t), t, \theta)}{\partial \theta} \, dt$$

Memory cost: $O(1)$ in depth (constant, regardless of the number of ODE solver steps), compared to $O(T)$ for standard backpropagation through $T$ layers.
Intuition
Standard backprop stores all intermediate activations to compute gradients, costing $O(T)$ memory for $T$ layers (or solver steps). The adjoint method avoids this by solving the adjoint ODE backwards in time, recomputing $h(t)$ on the fly by integrating the forward ODE backwards. This trades memory for compute: you solve two ODEs (forward and backward) instead of storing all activations.
This is the continuous analog of activation checkpointing, taken to its logical extreme.
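A minimal sketch of the adjoint computation, under simplifying assumptions so the Jacobians are easy to write down: linear dynamics $dh/dt = Ah$ (so $\partial f / \partial h = A$ and $\partial L / \partial A = \int_0^T a(t)\,h(t)^\top dt$), loss $L = \tfrac{1}{2}\|h(T)\|^2$, and fixed-step Euler in both directions. All names here are illustrative; the backward-pass gradient is checked against a finite difference:

```python
import numpy as np

# Adjoint method sketch for dh/dt = A h with L = 0.5 * ||h(T)||^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) * 0.3
h0 = rng.standard_normal(3)
T, n = 1.0, 2000
dt = T / n

def loss(A, h0):
    h = h0.copy()
    traj = [h.copy()]
    for _ in range(n):
        h = h + dt * (A @ h)                  # forward Euler
        traj.append(h.copy())
    return 0.5 * np.sum(h ** 2), traj

def adjoint_grad(A, h0):
    _, traj = loss(A, h0)
    a = traj[-1].copy()                       # a(T) = dL/dh(T) = h(T) for this loss
    gA = np.zeros_like(A)
    for k in range(n, 0, -1):                 # integrate from t = T back to t = 0
        gA += dt * np.outer(a, traj[k - 1])   # accumulate dL/dA = int a h^T dt
        a = a + dt * (A.T @ a)                # Euler step backwards in time of da/dt = -A^T a
    return gA

g = adjoint_grad(A, h0)
eps = 1e-5
A2 = A.copy(); A2[0, 1] += eps
fd = (loss(A2, h0)[0] - loss(A, h0)[0]) / eps  # finite-difference check
print(g[0, 1], fd)  # the two estimates agree closely
```

Note that this sketch stores the forward trajectory for clarity; the actual adjoint method instead re-integrates $h$ backwards alongside $a$, which is where the $O(1)$ memory comes from.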
Why It Matters
Constant-memory backpropagation enables training of very deep (or continuous-depth) networks without running out of GPU memory. For standard ResNets with $T$ layers, the memory saving is a factor of $T$. For neural ODEs where the solver may take thousands of steps, the saving is even larger.
The adjoint method is not new. It was developed in optimal control theory in the 1960s (Pontryagin's maximum principle). Neural ODEs brought it to deep learning.
Failure Mode
The backward ODE recomputation of $h(t)$ introduces numerical error. If the forward and backward solvers use different discretizations or if the dynamics are chaotic, the recomputed $h(t)$ can diverge from the original, causing gradient inaccuracy. In practice, this is mitigated by using the same adaptive solver in both directions, but it remains a source of subtle bugs. Checkpointed approaches (solving forward, saving a few checkpoints, recomputing between them) offer a middle ground.
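The checkpointing middle ground is easy to sketch (illustrative names, not a library API): store every $m$-th state on the forward pass, then rebuild each segment of $m$ states from its checkpoint during the backward pass, instead of either storing everything or re-integrating from $t=0$:

```python
import numpy as np

# Checkpointing sketch: memory drops from n states to ~n/m checkpoints,
# at the cost of recomputing each m-step segment once during backprop.
def forward_states(h0, step, n):
    h, states = h0, [h0]
    for _ in range(n):
        h = step(h)
        states.append(h)
    return states

def checkpointed_states(h0, step, n, m):
    ckpts = forward_states(h0, step, n)[::m]   # keep only every m-th state
    # to get the states inside segment 0, recompute m steps from ckpts[0]
    segment0 = forward_states(ckpts[0], step, m)
    return ckpts, segment0

step = lambda h: h + 0.01 * np.tanh(h)         # one toy residual update
ckpts, seg = checkpointed_states(np.array([1.0]), step, n=100, m=10)
print(len(ckpts), np.allclose(seg[-1], ckpts[1]))  # 11 True
```

Because the recomputation is deterministic, the rebuilt segment ends exactly at the next checkpoint, avoiding the drift that a full backward re-integration can accumulate.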
Why ODEs Constrain Expressiveness
ODE trajectories cannot cross. If $h_1(0) \neq h_2(0)$, then $h_1(t) \neq h_2(t)$ for all $t$ (by uniqueness of ODE solutions under Lipschitz conditions). This means the map $x \mapsto h(T)$ is a homeomorphism: it is continuous, invertible, and its inverse is continuous.
This is a limitation. A homeomorphism cannot change the topology of the data. If the input data has two intertwined spirals, a neural ODE cannot "untangle" them into linearly separable clusters through a continuous flow. A standard ResNet (with finite step size) can, because discrete maps are not constrained by ODE uniqueness.
In practice, this limitation is addressed by:
- Augmented neural ODEs: concatenate extra dimensions to the state, allowing the flow to "lift" data into higher dimensions where untangling is possible
- Using neural ODEs as components in a larger architecture, not as the entire model
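The non-crossing constraint is easiest to see in one dimension: for any Lipschitz vector field, the (discretized) flow map is strictly increasing, so two scalar inputs can never swap order, no matter what field the network learns. A small sketch (the particular $f$ is an arbitrary example):

```python
import numpy as np

# In 1D, the Euler flow map x -> x + dt * f(x) is strictly increasing
# whenever dt * Lip(f) < 1, so the ordering of inputs is preserved and
# no 1D flow can map -1 -> +1 while mapping +1 -> -1.
def flow_1d(x0, f, T=2.0, n=2000):
    x, dt = float(x0), T / n
    for _ in range(n):
        x = x + dt * f(x)
    return x

f = lambda x: np.sin(3.0 * x) - 0.5 * x        # arbitrary smooth vector field
xs = np.array([-1.0, -0.3, 0.2, 1.0])          # ordered inputs
ys = np.array([flow_1d(x, f) for x in xs])
print(np.all(np.diff(ys) > 0))  # True: the flow never reorders the points
```

Augmentation sidesteps this by embedding the points in a higher-dimensional space, where trajectories can pass around each other without crossing.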
Common Confusions
Neural ODEs are not just deep ResNets
The continuous formulation gives qualitatively different properties: adaptive depth, constant-memory training, invertibility constraints, and connections to physics. A 100-layer ResNet with ReLU and batch norm does not behave like a neural ODE. The ODE perspective is most useful when you actually use an ODE solver (adaptive step size, error control), not when you just view a discrete ResNet "as if" it were continuous.
Constant memory does not mean free computation
The adjoint method saves memory by recomputing activations during the backward pass. This doubles the computational cost (two ODE solves instead of one forward pass). For some applications, this tradeoff is worth it (when memory is the bottleneck). For others, standard backprop with gradient checkpointing is more practical.
Neural ODEs are not universally better than discrete networks
Neural ODEs are slower to train (ODE solver overhead), harder to parallelize (sequential integration), and more constrained in expressiveness (no trajectory crossing). They are most useful when continuous dynamics are a natural fit: time series, physical systems, normalizing flows, and problems where adaptive computation depth matters.
Exercises
Problem
A ResNet has 50 layers with hidden dimension 256. Standard backprop stores all 50 intermediate activations ($50 \times 256 = 12{,}800$ floats per input). The neural ODE adjoint method stores only the initial and final states. Compute the memory ratio. Under what conditions is the ODE approach worth the extra compute?
Problem
Explain why the non-crossing property of ODE trajectories limits the expressiveness of neural ODEs. Give a specific example of a classification problem in $\mathbb{R}^2$ that a neural ODE (without augmentation) cannot solve but a standard 2-layer network can.
References
Canonical:
- Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018, Best Paper). The foundational paper.
- Pontryagin et al., The Mathematical Theory of Optimal Processes (1962). Original adjoint method.
Current:
- Dupont et al., "Augmented Neural ODEs" (NeurIPS 2019). Fixes the expressiveness limitation.
- Kidger, "On Neural Differential Equations" (PhD thesis, 2022). Comprehensive modern treatment.
- Grathwohl et al., "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019). Neural ODEs for normalizing flows.
Next Topics
- Physics-informed neural networks: using ODE/PDE structure as inductive bias
- Normalizing flows: invertible transformations for density estimation, where the continuous formulation avoids Jacobian computation
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Automatic Differentiation (Layer 1)
Builds on This
- Continuous Thought Machines (Layer 5)