Neural SDEs and the Diffusion Bridge
The stochastic generalization of neural ODEs: parameterizing the drift and diffusion of an SDE with neural networks, the adjoint method extended through Brownian motion, the explicit bridge to diffusion models via the probability flow ODE, and generative neural SDEs as infinite-dimensional GANs.
Why This Matters
Neural ODEs parameterize a deterministic vector field with a neural network. Replacing $dz_t = f_\theta(z_t, t)\,dt$ with $dz_t = f_\theta(z_t, t)\,dt + g_\theta(z_t, t)\,dW_t$ turns this into a neural SDE: a learned drift plus a learned (or fixed) diffusion driven by Brownian motion. This is not a cosmetic generalization. Stochasticity changes what the model can express, what the loss must optimize, and what trajectories mean.
Two reasons to care:
- Diffusion models are neural SDEs. Score-based generative modeling fits exactly into this framework. The forward noising process is an SDE; the reverse-time generative process is an SDE; the network learns the score, which is the only unknown drift term. Understanding the SDE picture is the cleanest route to understanding why diffusion samplers work and why they admit deterministic ODE counterparts.
- Stochasticity is the right inductive bias for many time-series problems. Financial data, neural recordings, and partially observed systems have intrinsic noise that a deterministic ODE can only fit by overfitting. Neural SDEs learn both the systematic drift and the noise structure jointly. Latent SDEs (Li et al. 2020) extend this to latent-variable time-series modeling.
Setup
A neural SDE has the form

$$dz_t = f_\theta(z_t, t)\,dt + g_\theta(z_t, t)\,dW_t, \qquad z_0 = x,$$

with neural networks $f_\theta$ (drift) and $g_\theta$ (diffusion), and $W_t$ an $m$-dimensional standard Brownian motion. When $g_\theta \equiv 0$ this collapses to a Neural ODE. The solution depends on the realized path of $W_t$, so $z_t$ is a stochastic process: each forward solve produces a different trajectory, and the model represents a distribution over trajectories.
The Ito convention is standard in the ML literature and is assumed throughout this page. See stochastic calculus for ML for the difference between Ito and Stratonovich conventions and why Ito is preferred (martingale property, isometry, no anticipating integrand).
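A minimal sketch of what a forward solve looks like, using fixed-weight tanh MLPs as stand-ins for the trained networks (in practice $f_\theta$ and $g_\theta$ would be learned, e.g. via torchsde); Euler-Maruyama is the simplest Ito integrator, and all weights and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fixed-weight tanh MLPs stand in for the trained networks f_theta, g_theta.
W1, b1 = 0.5 * rng.normal(size=(8, 2)), np.zeros(8)
W2 = 0.5 * rng.normal(size=(1, 8))

def f_theta(z, t):
    """Learned drift: a two-layer tanh MLP of (z, t)."""
    return (W2 @ np.tanh(W1 @ np.array([z, t]) + b1)).item()

def g_theta(z, t):
    """Learned diffusion, made positive with a softplus."""
    return np.log1p(np.exp(W2 @ np.tanh(W1 @ np.array([z, t]) + b1))).item()

def euler_maruyama(z0, T=1.0, n_steps=200, seed=None):
    """One sample path of dz = f dt + g dW (Ito convention)."""
    path_rng = np.random.default_rng(seed)
    dt = T / n_steps
    z, path = z0, [z0]
    for k in range(n_steps):
        dW = path_rng.normal(scale=np.sqrt(dt))   # Brownian increment ~ N(0, dt)
        z = z + f_theta(z, k * dt) * dt + g_theta(z, k * dt) * dW
        path.append(z)
    return np.array(path)

# Two solves, two different trajectories: the model defines a distribution over paths.
p1, p2 = euler_maruyama(0.0, seed=1), euler_maruyama(0.0, seed=2)
```

Each call with a different seed draws a fresh Brownian path, which is exactly the "distribution over trajectories" semantics described above.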
Existence and Uniqueness
Existence and Uniqueness for Neural SDEs
Statement
Suppose $f_\theta$ and $g_\theta$ are globally Lipschitz in $z$ (uniformly in $t$) and satisfy the linear growth bound $\|f_\theta(z,t)\| + \|g_\theta(z,t)\| \le C(1 + \|z\|)$. Then there exists a unique strong solution to the neural SDE on $[0, T]$, adapted to the Brownian filtration, satisfying $\mathbb{E}\big[\sup_{0 \le t \le T} \|z_t\|^2\big] < \infty$.
Intuition
This is the SDE analog of Picard-Lindelof for classical ODEs. Lipschitz continuity controls how fast nearby paths can separate; linear growth prevents finite-time blow-up. Together they let Picard-iteration-style arguments converge in $L^2$ rather than uniformly.
Proof Sketch
Define the Picard iteration $z_t^{(n+1)} = z_0 + \int_0^t f_\theta(z_s^{(n)}, s)\,ds + \int_0^t g_\theta(z_s^{(n)}, s)\,dW_s$. Use the Ito isometry on the stochastic integral term and the Cauchy-Schwarz inequality on the drift term to show the iterates form a Cauchy sequence in the space of square-integrable adapted processes equipped with the norm $\|z\|^2 = \mathbb{E}\big[\sup_{t \le T} \|z_t\|^2\big]$. Completeness gives the unique limit. See Oksendal Theorem 5.2.1 for the full argument; the only neural-network-specific ingredient is checking the Lipschitz hypothesis for $f_\theta$ and $g_\theta$.
Why It Matters
This is the load-bearing existence guarantee for neural SDEs. Any time you train a network with a stochastic integrator, you are implicitly assuming this theorem applies. Standard architectures with smooth activations satisfy the Lipschitz condition on bounded sets; with weight constraints the linear growth bound also holds. Without these, the trajectory the integrator computes corresponds to nothing well-defined.
Failure Mode
Multiplicative noise architectures where $g$ depends nonlinearly on the state can violate global Lipschitz continuity. The CIR-style square-root diffusion ($g(z) = \sigma\sqrt{z}$) is a classical example where existence still holds but requires non-Lipschitz SDE theory (Yamada-Watanabe).
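The square-root failure mode is easy to see numerically. A sketch of Euler-Maruyama for a CIR-style diffusion with the standard clamp-at-zero guard; the parameter values are illustrative, chosen so the true process actually reaches zero:

```python
import numpy as np

def cir_euler(x0, kappa, theta, sigma, T=1.0, n=1000, seed=0):
    """Euler-Maruyama for dX = kappa*(theta - X) dt + sigma*sqrt(X) dW.

    sqrt is not Lipschitz at 0, so a step can land below zero; the max(x, 0)
    clamp (a 'full truncation'-style fix) keeps the next sqrt defined.
    Without it, the same path can produce NaN."""
    rng = np.random.default_rng(seed)
    dt = T / n
    x = x0
    for _ in range(n):
        dW = rng.normal(scale=np.sqrt(dt))
        x = x + kappa * (theta - x) * dt + sigma * np.sqrt(max(x, 0.0)) * dW
    return x

# Parameters chosen so the Feller condition 2*kappa*theta >= sigma^2 FAILS,
# i.e. the process touches zero and the non-Lipschitz point is actually visited.
xT = cir_euler(x0=0.05, kappa=2.0, theta=0.05, sigma=0.6, seed=3)
```

The clamp is a numerical patch, not a proof device: well-posedness of the continuous-time equation still rests on Yamada-Watanabe, as noted above.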
The SDE Adjoint Method
The Neural ODE adjoint method extends to SDEs, but with significant subtleties. Li et al. 2020 derived the stochastic adjoint for SDEs with diagonal noise: the adjoint state $a_t = \partial L / \partial z_t$ satisfies a backward SDE driven by the same Brownian motion used in the forward pass. The crucial implementation detail is deterministic noise reconstruction: the random seed used to sample $W_t$ in the forward pass must be replayed in reverse during the backward pass; otherwise the gradients are with respect to a different sample path than the one whose loss was evaluated.
This requirement (Li et al. call it "Brownian motion replay") makes the SDE adjoint more expensive than the constant-memory deterministic adjoint, but still far cheaper than full backpropagation through the integrator, whose memory grows with the number of solver steps. The torchsde library implements replay via the virtual Brownian tree, a binary-search structure that reconstructs $W_t$ at any queried time without storing the entire path.
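The virtual Brownian tree idea fits in a few lines. The sketch below mimics the concept, not torchsde's actual implementation or API: store only seeds, and reconstruct $W_t$ at a dyadic approximation of any queried time by Brownian-bridge bisection, so a second query of the same time replays the identical value:

```python
import numpy as np

class VirtualBrownianTree:
    """Concept sketch of a virtual Brownian tree (NOT torchsde's actual API).

    Stores only seeds. W(t) is reconstructed at a dyadic approximation of t by
    Brownian-bridge bisection; every node's sample is seeded by its position
    in the tree, so repeated queries replay identical values in O(depth) time
    and O(1) memory."""

    def __init__(self, T=1.0, seed=0, depth=20):
        self.T, self.seed, self.depth = T, seed, depth
        self.w_T = np.random.default_rng(seed).normal(scale=np.sqrt(T))

    def __call__(self, t):
        lo, hi = 0.0, self.T
        w_lo, w_hi = 0.0, self.w_T
        k = 0                                  # interval index at the current level
        for level in range(self.depth):        # binary search toward t
            mid = 0.5 * (lo + hi)
            # Seed depends only on tree position -> deterministic replay.
            node_rng = np.random.default_rng((self.seed, level, k))
            # Brownian bridge: W(mid) | W(lo), W(hi) ~ N(midpoint, (hi - lo)/4).
            w_mid = 0.5 * (w_lo + w_hi) + node_rng.normal(scale=np.sqrt((hi - lo) / 4.0))
            if t < mid:
                hi, w_hi, k = mid, w_mid, 2 * k
            else:
                lo, w_lo, k = mid, w_mid, 2 * k + 1
        return w_lo

bm = VirtualBrownianTree(T=1.0, seed=42)
w_a, w_b = bm(0.3), bm(0.3)   # identical: the path is replayable from seeds alone
```

The backward pass can therefore query the same Brownian path the forward pass used without ever storing it.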
The discretize-then-optimize alternative (backprop through the SDE solver) is exact but loses the memory advantage, and its memory scales with both the path length and the number of Brownian increments. The right choice is problem-dependent; see Onken and Ruthotto 2020 for an empirical comparison.
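The importance of replay is visible even with finite-difference gradients. The sketch below (a scalar linear SDE with hand-rolled Euler-Maruyama; all names and constants are illustrative) compares sensitivity estimates computed with replayed versus freshly sampled Brownian increments:

```python
import numpy as np

def loss(theta, dW, dt):
    """Terminal loss x_T^2 for dx = theta*x dt + 0.5 dW, x_0 = 1, integrated
    by Euler-Maruyama with a GIVEN sequence of Brownian increments dW."""
    x = 1.0
    for inc in dW:
        x = x + theta * x * dt + 0.5 * inc
    return x ** 2

rng = np.random.default_rng(0)
n, dt, eps = 200, 1.0 / 200, 1e-2
replayed, fresh = [], []
for _ in range(100):
    dW = rng.normal(scale=np.sqrt(dt), size=n)
    # Replayed noise: both sides of the finite difference see the SAME path,
    # so the difference isolates the effect of theta.
    replayed.append((loss(-1.0 + eps, dW, dt) - loss(-1.0 - eps, dW, dt)) / (2 * eps))
    # Fresh noise on one side: the difference now also contains path-to-path
    # variance, blowing up the estimator -- the analog of forgetting replay.
    dW2 = rng.normal(scale=np.sqrt(dt), size=n)
    fresh.append((loss(-1.0 + eps, dW, dt) - loss(-1.0 - eps, dW2, dt)) / (2 * eps))

print(np.std(replayed), np.std(fresh))  # replayed is far tighter
```

The same common-random-numbers logic is what Brownian motion replay buys the stochastic adjoint: a gradient tied to the specific path whose loss was evaluated.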
The Probability Flow ODE: Bridge to Diffusion Models
The deepest connection between neural SDEs and Neural ODEs is the probability flow ODE. Given a forward SDE

$$dx_t = f(x_t, t)\,dt + g(t)\,dW_t$$

with marginal density $p_t(x)$, there exists a deterministic ODE whose solutions have the same marginal density at every time $t$:
Probability Flow ODE (Song et al. 2021)
Statement
Define the deterministic ODE

$$\frac{d\tilde{x}_t}{dt} = f(\tilde{x}_t, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(\tilde{x}_t).$$

Let $\tilde{x}_t$ denote its solution with $\tilde{x}_0 \sim p_0$. Then for every $t$, $\tilde{x}_t$ has the same density $p_t$ as the SDE solution $x_t$.
Intuition
The Fokker-Planck equation for the SDE,

$$\partial_t p_t = -\nabla \cdot (f\, p_t) + \frac{1}{2}\, g(t)^2\, \Delta p_t,$$

can be rewritten in transport form $\partial_t p_t = -\nabla \cdot (\tilde{f}\, p_t)$ with velocity $\tilde{f} = f - \frac{1}{2} g^2 \nabla \log p_t$. This transport equation is the continuity equation for the deterministic ODE with right-hand side $\tilde{f}$. The two systems push the same density through space at every $t$, even though their individual sample paths are different.
Proof Sketch
Substitute $\nabla p_t = p_t \nabla \log p_t$ into the Fokker-Planck equation and rearrange. The diffusion term becomes $\nabla \cdot \big(\frac{1}{2} g^2 p_t \nabla \log p_t\big)$, and combining with the drift term gives $\partial_t p_t = -\nabla \cdot (\tilde{f}\, p_t)$ with $\tilde{f} = f - \frac{1}{2} g^2 \nabla \log p_t$. Both the SDE and the ODE marginals satisfy this PDE; uniqueness for the Fokker-Planck/continuity equation gives the result.
Why It Matters
This theorem is the formal bridge between stochastic generative modeling and deterministic neural ODE inference. Once you have a trained score model $s_\theta(x, t) \approx \nabla_x \log p_t(x)$, you can sample by integrating either the reverse-time SDE or the deterministic probability flow ODE. The ODE path is what DDIM, DPM-Solver, and EDM use for fast sampling: a Neural ODE with vector field $f(x, t) - \frac{1}{2} g(t)^2 s_\theta(x, t)$. Fewer NFE per sample, exact likelihoods (via change-of-variables), and adaptive solvers all become available.
Failure Mode
Equality holds for marginal distributions, not for joint distributions across time. The SDE and ODE produce different conditional distributions for later states given earlier ones, so the ODE cannot be used as a substitute when you need to condition on intermediate states. Stochastic samplers also tend to have different bias-variance tradeoffs in the few-step regime; see Karras et al. 2022 (EDM) for a careful empirical comparison.
This is why the Neural-ODE / diffusion-model connection is real and not analogical: modern fast samplers literally invoke an ODE solver on a learned vector field whose components are $f$ (a chosen drift) and $s_\theta \approx \nabla_x \log p_t$ (a learned score). The same torchdiffeq adaptive solver used for Neural ODE classification is used inside diffusion samplers.
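The marginal-matching claim is checkable on an Ornstein-Uhlenbeck process, where the score is available in closed form. The sketch below (illustrative constants) integrates the forward SDE and its probability flow ODE from the same initial distribution and compares the terminal variances against the exact value:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, n = 20000, 1.0, 500
dt = T / n
v0 = 4.0                                       # initial variance, illustrative

def var_t(t):
    """Exact marginal variance of dx = -x dt + sqrt(2) dW started from N(0, v0)."""
    return 1.0 + (v0 - 1.0) * np.exp(-2.0 * t)

x_sde = rng.normal(scale=np.sqrt(v0), size=N)  # both start from p_0 = N(0, v0)
x_ode = x_sde.copy()
for k in range(n):
    t = k * dt
    c = -1.0 / var_t(t)                        # score of N(0, v_t): grad log p_t(x) = c * x
    # Forward SDE: dx = -x dt + sqrt(2) dW
    x_sde = x_sde - x_sde * dt + np.sqrt(2.0 * dt) * rng.normal(size=N)
    # Probability flow ODE: dx/dt = -x - (1/2) * 2 * (c * x) = -x + x / v_t
    x_ode = x_ode + (-x_ode - c * x_ode) * dt

# Same marginal variance at t = T, as the probability flow theorem predicts.
print(np.var(x_sde), np.var(x_ode), var_t(T))
```

The individual trajectories of the two ensembles look nothing alike (the ODE paths are smooth, the SDE paths are rough), yet the marginal statistics agree.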
Generative Neural SDEs as Infinite-Dimensional GANs
The probability flow ODE perspective explains why score-based diffusion training works: the loss has a clean variational interpretation. But neural SDEs admit a second generative-modeling perspective that does not require the score-matching framing.
Kidger et al. 2021 framed the generator as an SDE integrated from random initial noise through learned drift and diffusion networks, paired with a discriminator that scores generated paths against real paths. This is a Wasserstein GAN played in the space of continuous functions, and the generator-discriminator min-max trains both networks jointly. Kidger et al. proved that under capacity assumptions this scheme can match arbitrary continuous-time stochastic processes: neural SDEs are universal approximators for time-homogeneous Ito diffusions in the Wasserstein-1 metric.
This formulation generalizes:
- Latent SDEs (Li et al. 2020): a variational-autoencoder analog where the latent path follows a learned SDE. Useful for irregularly sampled time series with intrinsic noise.
- Neural CDEs (Kidger et al. 2020): controlled differential equations driven by the data path itself rather than Brownian motion, which gives a continuous-time analog of RNNs for irregularly sampled inputs.
- Latent ODE-RNN hybrids (Rubanova et al. 2019): mix discrete RNN updates at observation times with ODE flow between observations.
Connection to Energy-Based Models
The score that drives the reverse SDE is the negative gradient of an energy-based model: if $p_t(x) \propto e^{-E_t(x)}$ (modulo the log partition function), then $\nabla_x \log p_t(x) = -\nabla_x E_t(x)$. Sampling from a diffusion model by probability flow ODE integration is gradient flow on a time-dependent energy landscape, descending the energy of the noisy distribution at each $t$ until $t = 0$, where the energy is the target data energy.
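A quick numerical check of the score/energy identity for a one-dimensional Gaussian, where the energy is quadratic (all symbols here are illustrative):

```python
import numpy as np

v = 2.0                                   # variance of the illustrative Gaussian
E = lambda x: 0.5 * x ** 2 / v            # energy of N(0, v), up to log Z
score = lambda x: -x / v                  # grad log p(x) = -grad E(x)

# Central finite difference of the energy recovers the negative score,
# confirming grad log p = -grad E (log Z drops out of the derivative).
x, h = 1.3, 1e-5
dE = (E(x + h) - E(x - h)) / (2 * h)
print(dE, -score(x))                      # both equal x / v = 0.65
```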
The neural-SDE / Neural-ODE / EBM trio is one mathematical object viewed three ways:
| Perspective | Object | Loss |
|---|---|---|
| EBM | Energy | Score matching, contrastive divergence |
| Score-based / SDE | Score | Denoising score matching at each noise level |
| Neural ODE | Probability flow vector field | Trained via the score loss, used at inference |
The same network can be trained with EBM losses, diffusion losses, or flow-matching losses, and the resulting sampler can run as an SDE or an ODE. Modern diffusion practice has converged on the score-matching loss (lowest-variance gradients) and the ODE sampler (fastest inference), but the unified object underlying all three formalisms is the same.
Common Confusions
The probability flow ODE is not the reverse-time SDE
Two different equations. The reverse-time SDE has a stochastic term and produces samples with the same joint distribution as time-reversed forward paths. The probability flow ODE is deterministic and matches only the marginal densities. They produce different individual sample paths. For final-sample quality at a given NFE budget, the comparison is empirical and depends on the noise schedule (see Karras et al. 2022, Table 4).
Diagonal noise is not a generic assumption
Most of the practical neural-SDE machinery (the stochastic adjoint of Li et al. 2020, the virtual Brownian tree, the probability flow ODE in its simplest form) assumes diagonal or even scalar diffusion. General multiplicative non-diagonal noise SDEs require more sophisticated stochastic analysis (Stratonovich corrections, Levy area approximations) and have not seen widespread ML adoption.
Brownian motion replay is essential, not optional
The SDE adjoint method is correct only if the backward pass uses the same Brownian sample path as the forward pass. Sampling fresh noise on the backward pass gives a gradient with respect to a different objective, which biases training in subtle ways. Always check that your library uses a virtual Brownian tree or seeded sampler before trusting SDE-adjoint gradients.
Exercises
Problem
Consider the Ornstein-Uhlenbeck SDE $dx_t = -\theta x_t\,dt + \sigma\,dW_t$ with $\theta, \sigma > 0$. The stationary density is $p_\infty = \mathcal{N}\!\big(0, \sigma^2 / 2\theta\big)$.
- Write the probability flow ODE corresponding to this SDE.
- Sketch why solutions of this ODE preserve the stationary density (every initial $x_0 \sim p_\infty$ stays distributed as $p_\infty$ for all $t$).
Problem
Suppose you train a score model $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ for a diffusion model with forward SDE $dx_t = f(x_t, t)\,dt + g(t)\,dW_t$ on $[0, T]$.
- Write the probability flow ODE that you would integrate from $t = T$ to $t = 0$ to sample.
- Why does this ODE require fewer NFE than the reverse-time SDE for comparable sample quality? Identify the variance source that the ODE removes.
- What goes wrong if $s_\theta$ is inaccurate near $t = 0$? Why is this region especially hard?
References
Canonical:
- Li, Wong, Chen, Duvenaud, "Scalable Gradients for Stochastic Differential Equations" (AISTATS 2020; arXiv:2001.01328). The neural-SDE adjoint method via Brownian motion replay; the virtual Brownian tree.
- Kidger, Foster, Li, Lyons, "Neural SDEs as Infinite-Dimensional GANs" (ICML 2021; arXiv:2102.03657). Generative neural SDEs; Wasserstein GAN training in path space.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021 oral; arXiv:2011.13456). The probability flow ODE (Section 4.3, Appendix D.1) — the explicit bridge to neural ODEs.
- Anderson, "Reverse-time diffusion equation models," Stochastic Processes and Their Applications 12(3):313-326 (1982). The original derivation of the time-reversed SDE.
Current:
- Tzen, Raginsky, "Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit" (arXiv:1905.09883, 2019). Theoretical analysis of latent SDEs as the continuous-time limit of latent Gaussian models.
- Kidger, Morrill, Foster, Lyons, "Neural Controlled Differential Equations for Irregular Time Series" (NeurIPS 2020 spotlight; arXiv:2005.08926). The CDE generalization; the workhorse for irregularly sampled real-world time series.
- Rubanova, Chen, Duvenaud, "Latent ODEs for Irregularly-Sampled Time Series" (NeurIPS 2019; arXiv:1907.03907). VAE-style latent dynamics with ODE flow between observations; the immediate precursor to latent SDEs.
- Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (NeurIPS 2022; arXiv:2206.00364). Empirical comparison of SDE vs. ODE samplers and the EDM noise-schedule design.
- Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022; arXiv:2206.00927). Specialized solver exploiting the structure of the diffusion probability flow ODE.
Reference / Survey:
- Kidger, "On Neural Differential Equations" (PhD thesis, Oxford, 2022; arXiv:2202.02435). Standard modern reference; Chapters 5-7 cover SDE machinery in depth.
- Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapter 5. The textbook proof of SDE existence and uniqueness; the reference for any SDE convergence argument.
Next Topics
- Diffusion models: the dominant ML application of the neural-SDE framework
- Energy-based models: the EBM perspective on the score that diffusion models learn
- Continuous normalizing flows: the deterministic-flow generative alternative built on neural ODEs
Last reviewed: April 17, 2026
Prerequisites
Foundations this topic depends on.
- Neural ODEs and Continuous-Depth Networks (Layer 4)
- Classical ODEs: Existence, Stability, and Numerical Methods (Layer 1)
- Continuity in R^n (Layer 0A)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- The Jacobian Matrix (Layer 0A)
- Skip Connections and ResNets (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Gradient Flow and Vanishing Gradients (Layer 2)
- Automatic Differentiation (Layer 1)
- Stochastic Calculus for ML (Layer 3)
- Martingale Theory (Layer 0B)
- Measure-Theoretic Probability (Layer 0B)