

Neural ODEs and Continuous-Depth Networks

Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations.

Advanced · Tier 2 · ~50 min

Why This Matters

A ResNet computes $h_{t+1} = h_t + f_\theta(h_t, t)$ at each layer. This is Euler's method applied to an ODE. If you take the step size to zero and the number of layers to infinity, you get a continuous dynamical system:

$$\frac{dh}{dt} = f_\theta(h(t), t)$$

This reframing is not just mathematical elegance. It gives you: constant memory backpropagation (via the adjoint method), adaptive computation depth (the ODE solver decides how many steps to take), and a bridge between deep learning and dynamical systems theory. The tradeoffs are real: training is slower, and the expressiveness is constrained by ODE theory (no crossing trajectories). Neural ODEs connect to DEQ models (which solve for the ODE's fixed point directly) and continuous thought machines (which use ODE dynamics for adaptive-depth reasoning).
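The correspondence is easy to check numerically. A minimal sketch, with a toy vector field `f` standing in for a learned $f_\theta$ (the function names are illustrative, not from any library): one residual update with a $1/L$ scaling is literally one Euler step with $\Delta t = 1/L$.

```python
import numpy as np

def f(h, t):
    # Toy vector field standing in for a learned f_theta (hypothetical).
    return np.tanh(h) + 0.1 * t

def resnet_layer(h, t, L):
    # One residual block: h_{t+1} = h_t + (1/L) f(h_t, t/L).
    return h + (1.0 / L) * f(h, t / L)

def euler_step(h, t, dt):
    # One explicit Euler step of dh/dt = f(h, t).
    return h + dt * f(h, t)

h, L = np.array([0.5, -1.0]), 10
# The residual update and the Euler step with dt = 1/L are identical.
assert np.allclose(resnet_layer(h, 3, L), euler_step(h, 3 / L, 1 / L))
```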

The ResNet-ODE Connection

Proposition

ResNet as Discretized ODE

Statement

A residual network with update $h_{t+1} = h_t + \frac{1}{L} f_\theta(h_t, t/L)$ for $t = 0, 1, \ldots, L-1$ is the Euler discretization of the initial value problem:

$$\frac{dh}{dt} = f_\theta(h(t), t), \quad h(0) = x, \quad t \in [0, 1]$$

with step size $\Delta t = 1/L$. In the limit $L \to \infty$, the discrete trajectory $\{h_0, h_1, \ldots, h_L\}$ converges to the continuous solution $h(t)$ (under Lipschitz conditions on $f_\theta$).

The output of the network is $h(1) = h(0) + \int_0^1 f_\theta(h(t), t) \, dt$.
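The convergence can be verified directly: stacking more residual layers is the same as refining the Euler step size, and the output approaches the continuous solution at Euler's $O(1/L)$ rate. A minimal numpy sketch with a toy Lipschitz vector field (chosen for illustration, not taken from the paper):

```python
import numpy as np

def f(h, t):
    # Smooth (Lipschitz) toy dynamics standing in for f_theta.
    return -h + np.sin(t)

def forward(x, L):
    # L residual layers == Euler integration of dh/dt = f on [0, 1], dt = 1/L.
    h = x
    for t in range(L):
        h = h + (1.0 / L) * f(h, t / L)
    return h

x = 1.0
fine = forward(x, 10_000)  # near-exact reference solution
# Euler's global error is O(1/L): refining L shrinks the gap to the reference.
err = [abs(forward(x, L) - fine) for L in (10, 100, 1000)]
assert err[0] > err[1] > err[2]
```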

Intuition

Each ResNet layer adds a small correction to the hidden state. In the continuous limit, these corrections become a vector field that flows the input through a smooth trajectory. The network's "depth" becomes a continuous time variable. Deeper networks correspond to longer integration times, and the network learns the vector field $f_\theta$ that transforms inputs into useful representations.

Why It Matters

This perspective explains why ResNets work: the skip connection $h_{t+1} = h_t + f(h_t)$ is not just a gradient-flow trick. It makes each layer an incremental transformation, and incremental transformations compose into smooth, invertible maps (under mild conditions). This is why ResNets can be very deep without the representation collapsing.

It also enables replacing the fixed $L$-layer architecture with an adaptive ODE solver that chooses its own step size. Easy inputs get fewer steps; hard inputs get more. This is adaptive computation depth without architectural changes.
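A minimal sketch of this idea, using a hand-rolled step-doubling Euler scheme (illustrative only; real neural ODE implementations use higher-order adaptive solvers such as Dormand-Prince): the solver compares one full step to two half steps, shrinks the step when the discrepancy exceeds a tolerance, and thereby spends more "depth" on fast-changing dynamics.

```python
import numpy as np

def adaptive_euler(f, h0, t0, t1, tol=1e-4):
    # Step-doubling Euler: compare one full step to two half steps and
    # adapt dt so that "hard" dynamics automatically get more steps.
    h, t, dt, steps = h0, t0, (t1 - t0) / 10, 0
    while t < t1:
        dt = min(dt, t1 - t)
        full = h + dt * f(h, t)
        half = h + dt / 2 * f(h, t)
        two_half = half + dt / 2 * f(half, t + dt / 2)
        err = np.abs(full - two_half).max()
        if err < tol:
            h, t, steps = two_half, t + dt, steps + 1
            dt *= 1.5   # accept and try a larger step next time
        else:
            dt *= 0.5   # reject and retry with a smaller step
    return h, steps

easy = lambda h, t: -0.1 * h   # slowly varying dynamics
hard = lambda h, t: -20.0 * h  # fast dynamics need much finer steps
_, n_easy = adaptive_euler(easy, np.ones(2), 0.0, 1.0)
_, n_hard = adaptive_euler(hard, np.ones(2), 0.0, 1.0)
assert n_hard > n_easy  # the solver spends more steps on the harder input
```

The accepted-step count plays the role of network depth, chosen per input rather than fixed by the architecture.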

Failure Mode

The continuous limit requires the dynamics $f_\theta$ to be Lipschitz continuous. If $f_\theta$ is not Lipschitz (e.g., if it has sharp discontinuities), the ODE may not have a unique solution, and the convergence of Euler's method is not guaranteed. In practice, standard neural network architectures with smooth activations (tanh, softplus) satisfy this condition. ReLU is Lipschitz (with constant 1), so uniqueness still holds; its non-differentiability at 0 only complicates the error analysis of higher-order solvers, and it works in practice.
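The textbook example of non-uniqueness is $dh/dt = \sqrt{h}$ with $h(0) = 0$, whose right-hand side is not Lipschitz at 0: both $h(t) = 0$ and $h(t) = t^2/4$ solve it. A quick numerical check of both facts:

```python
import numpy as np

# dh/dt = sqrt(h), h(0) = 0: the right-hand side is not Lipschitz at 0,
# and both h(t) = 0 and h(t) = t^2/4 are valid solutions.
f = lambda h: np.sqrt(np.maximum(h, 0.0))

# Euler from h(0) = 0 locks onto the trivial branch and never leaves it...
h, dt = 0.0, 1e-3
for _ in range(1000):
    h = h + dt * f(h)
assert h == 0.0

# ...yet h(t) = t^2/4 also satisfies the ODE: d/dt (t^2/4) = t/2 = sqrt(t^2/4).
t = np.linspace(0.1, 1.0, 10)
assert np.allclose(t / 2, f(t**2 / 4))
```

Which solution "the" ODE picks is ill-defined, so the continuous-limit guarantees quietly disappear.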

The Adjoint Method

Theorem

Adjoint Sensitivity Method

Statement

To compute $\frac{dL}{d\theta}$ for a neural ODE, define the adjoint state $a(t) = \frac{dL}{dh(t)}$. The adjoint satisfies a backward ODE:

$$\frac{da}{dt} = -a(t)^\top \frac{\partial f}{\partial h}(h(t), t, \theta)$$

integrated backwards from $t = T$ to $t = 0$ with initial condition $a(T) = \frac{dL}{dh(T)}$. The parameter gradient is:

$$\frac{dL}{d\theta} = -\int_T^0 a(t)^\top \frac{\partial f}{\partial \theta}(h(t), t, \theta) \, dt$$

Memory cost: $O(1)$ in depth (constant, regardless of the number of ODE solver steps), compared to $O(L)$ for standard backpropagation through $L$ layers.

Intuition

Standard backprop stores all intermediate activations $h_0, h_1, \ldots, h_L$ to compute gradients, costing $O(L)$ memory. The adjoint method avoids this by solving the adjoint ODE backwards in time, recomputing $h(t)$ on the fly by integrating the forward ODE backwards. This trades memory for compute: you solve two ODEs (forward and backward) instead of storing all activations.

This is the continuous analog of activation checkpointing, taken to its logical extreme.
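A minimal end-to-end sketch for scalar linear dynamics, where the adjoint gradient can be checked against a closed-form answer (the toy problem and all names are illustrative, not from the paper's code): take $dh/dt = \theta h$ with loss $L = h(T)^2$, so analytically $\frac{dL}{d\theta} = 2T h_0^2 e^{2\theta T}$.

```python
import numpy as np

# Adjoint-method sketch for dh/dt = theta * h, loss L = h(T)^2.
theta, h0, T, N = 0.7, 1.3, 1.0, 100_000
dt = T / N

# Forward pass: Euler integration, keeping only the final state (O(1) memory).
h = h0
for _ in range(N):
    h = h + dt * theta * h
hT = h

# Backward pass: integrate h, a = dL/dh, and g = dL/dtheta jointly in reverse:
#   da/dt = -a * df/dh = -a * theta
#   the parameter gradient accumulates a * df/dtheta = a * h along the way.
a, g = 2 * hT, 0.0
for _ in range(N):
    h = h - dt * theta * h  # reconstruct h(t) by running the forward ODE backwards
    a = a + dt * a * theta  # backward adjoint ODE
    g = g + dt * a * h      # accumulate the parameter gradient

exact = 2 * T * h0**2 * np.exp(2 * theta * T)
assert abs(g - exact) / exact < 1e-2  # matches up to discretization error
```

Note that only the final state crosses from the forward to the backward pass; everything else is recomputed, which is exactly where the constant-memory claim comes from.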

Why It Matters

Constant-memory backpropagation enables training of very deep (or continuous-depth) networks without running out of GPU memory. For standard ResNets with $L = 100$ layers, the memory saving is $\sim 100\times$. For neural ODEs where the solver may take thousands of steps, the saving is even larger.

The adjoint method is not new. It was developed in optimal control theory in the 1960s (Pontryagin's maximum principle). Neural ODEs brought it to deep learning.

Failure Mode

The backward ODE recomputation of $h(t)$ introduces numerical error. If the forward and backward solvers use different discretizations or if the dynamics are chaotic, the recomputed $h(t)$ can diverge from the original, causing gradient inaccuracy. In practice, this is mitigated by using the same adaptive solver in both directions, but it remains a source of subtle bugs. Checkpointed approaches (solving forward, saving a few checkpoints, recomputing between them) offer a middle ground.

Why ODEs Constrain Expressiveness

ODE trajectories cannot cross. If $h_a(0) \neq h_b(0)$, then $h_a(t) \neq h_b(t)$ for all $t$ (by uniqueness of ODE solutions under Lipschitz conditions). This means the map $h(0) \mapsto h(T)$ is a homeomorphism: it is continuous, invertible, and its inverse is continuous.

This is a limitation. A homeomorphism cannot change the topology of the data. If the input data has two intertwined spirals, a neural ODE cannot "untangle" them into linearly separable clusters through a continuous flow. A standard ResNet (with finite step size) can, because discrete maps are not constrained by ODE uniqueness.

In practice, this limitation is addressed by:

  • Augmented neural ODEs: concatenate extra dimensions to the state, allowing the flow to "lift" data into higher dimensions where untangling is possible
  • Using neural ODEs as components in a larger architecture, not as the entire model
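Both the non-crossing constraint and the augmentation fix can be demonstrated in a few lines: a 1D flow can never swap the order of two points (their trajectories would have to cross), while one extra dimension makes the swap trivial via a rotation. A toy illustration, not a trained model:

```python
import numpy as np

def flow_1d(h, f, steps=1000, T=1.0):
    # Euler-integrate a 1D ODE dh/dt = f(h) from h(0) = h to h(T).
    dt = T / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

f = lambda h: np.tanh(3 * h)  # arbitrary smooth 1D vector field
a, b = flow_1d(-1.0, f), flow_1d(1.0, f)
assert a < b  # order preserved: no 1D flow can send -1 above +1

def flow_rot(h, steps=10_000, T=1.0):
    # Augmented 2D flow dh/dt = A h, where A generates a half-turn over T = 1.
    A = np.pi * np.array([[0.0, -1.0], [1.0, 0.0]])
    dt = T / steps
    for _ in range(steps):
        h = h + dt * (A @ h)
    return h

# Lift (x, 0) into 2D; the rotation sends (x, 0) near (-x, 0), swapping order.
a2, b2 = flow_rot(np.array([-1.0, 0.0])), flow_rot(np.array([1.0, 0.0]))
assert a2[0] > b2[0]  # first coordinates swapped thanks to the extra dimension
```

The extra dimension gives trajectories room to pass around each other instead of through each other, which is the whole trick behind augmented neural ODEs.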

Common Confusions

Watch Out

Neural ODEs are not just deep ResNets

The continuous formulation gives qualitatively different properties: adaptive depth, constant-memory training, invertibility constraints, and connections to physics. A 100-layer ResNet with ReLU and batch norm does not behave like a neural ODE. The ODE perspective is most useful when you actually use an ODE solver (adaptive step size, error control), not when you just view a discrete ResNet "as if" it were continuous.

Watch Out

Constant memory does not mean free computation

The adjoint method saves memory by recomputing activations during the backward pass. This doubles the computational cost (two ODE solves instead of one forward pass). For some applications, this tradeoff is worth it (when memory is the bottleneck). For others, standard backprop with gradient checkpointing is more practical.

Watch Out

Neural ODEs are not universally better than discrete networks

Neural ODEs are slower to train (ODE solver overhead), harder to parallelize (sequential integration), and more constrained in expressiveness (no trajectory crossing). They are most useful when continuous dynamics are a natural fit: time series, physical systems, normalizing flows, and problems where adaptive computation depth matters.

Exercises

ExerciseCore

Problem

A ResNet has 50 layers with hidden dimension 256. Standard backprop stores all 50 intermediate activations ($50 \times 256$ floats). The neural ODE adjoint method stores only the initial and final states. Compute the memory ratio. Under what conditions is the ODE approach worth the extra compute?

ExerciseAdvanced

Problem

Explain why the non-crossing property of ODE trajectories limits the expressiveness of neural ODEs. Give a specific example of a classification problem in $\mathbb{R}^2$ that a neural ODE (without augmentation) cannot solve but a standard 2-layer network can.

References

Canonical:

  • Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018, Best Paper). The foundational paper.
  • Pontryagin et al., The Mathematical Theory of Optimal Processes (1962). Original adjoint method.

Current:

  • Dupont et al., "Augmented Neural ODEs" (NeurIPS 2019). Fixes the expressiveness limitation.
  • Kidger, "On Neural Differential Equations" (PhD thesis, 2022). Comprehensive modern treatment.
  • Grathwohl et al., "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019). Neural ODEs for normalizing flows.


Last reviewed: April 2026
