

Neural SDEs and the Diffusion Bridge

The stochastic generalization of neural ODEs: parameterizing the drift and diffusion of an SDE with neural networks, the adjoint method extended through Brownian motion, the explicit bridge to diffusion models via the probability flow ODE, and generative neural SDEs as infinite-dimensional GANs.


Why This Matters

Neural ODEs parameterize a deterministic vector field with a neural network. Replacing $dt$ with $dt + \sigma\, dW$ turns this into a neural SDE: a learned drift plus a learned (or fixed) diffusion driven by Brownian motion. This is not a cosmetic generalization. Stochasticity changes what the model can express, what the loss must optimize, and what trajectories mean.

Two reasons to care:

  1. Diffusion models are neural SDEs. Score-based generative modeling fits exactly into this framework. The forward noising process is an SDE; the reverse-time generative process is an SDE; the network learns the score, which is the only unknown drift term. Understanding the SDE picture is the cleanest route to understanding why diffusion samplers work and why they admit deterministic ODE counterparts.

  2. Stochasticity is the right inductive bias for many time-series problems. Financial data, neural recordings, and partially observed systems have intrinsic noise that a deterministic ODE can only fit by overfitting. Neural SDEs learn both the systematic drift and the noise structure jointly. Latent SDEs (Li et al. 2020) extend this to latent-variable time-series modeling.

Setup

A neural SDE has the form

$$d X_t = \mu_\theta(X_t, t)\, dt + \sigma_\theta(X_t, t)\, dW_t, \quad X_0 = x_0,$$

with neural networks $\mu_\theta: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$ (drift) and $\sigma_\theta: \mathbb{R}^d \times [0, T] \to \mathbb{R}^{d \times m}$ (diffusion), and $W_t$ an $m$-dimensional standard Brownian motion. When $\sigma_\theta \equiv 0$ this collapses to a Neural ODE. The randomness enters through the realized path of $W$, so $X_t$ is a stochastic process: each forward solve produces a different trajectory, and the model represents a distribution over trajectories.
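What a forward solve looks like in practice can be sketched with the Euler-Maruyama scheme. The drift and diffusion below are hand-picked tanh-style stand-ins for trained networks $\mu_\theta$ and $\sigma_\theta$ — illustrative choices of mine, not from any library:

```python
import numpy as np

def drift(x, t):
    # stand-in for the learned drift mu_theta(x, t)
    return np.tanh(-2.0 * x)

def diffusion(x, t):
    # stand-in for the learned diffusion sigma_theta(x, t); bounded and Lipschitz
    return 0.3 / (1.0 + x**2)

def euler_maruyama(x0, t0, t1, n_steps, n_paths, rng):
    """Simulate n_paths trajectories of dX = drift dt + diffusion dW."""
    dt = (t1 - t0) / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for i in range(n_steps):
        t = t0 + i * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increments
        x = x + drift(x, t) * dt + diffusion(x, t) * dW  # Euler-Maruyama step
    return x

rng = np.random.default_rng(0)
xT = euler_maruyama(x0=1.0, t0=0.0, t1=1.0, n_steps=200, n_paths=5000, rng=rng)
# each forward solve is one sample path; together the paths estimate a law
print(f"X_T: mean {xT.mean():.3f}, std {xT.std():.3f}")
```

The point is the last line: the output of a neural SDE is the empirical distribution of $X_T$ across solves, not a single value.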

The Ito convention is standard in the ML literature and is assumed throughout this page. See stochastic calculus for ML for the difference between Ito and Stratonovich conventions and why Ito is preferred (martingale property, isometry, no anticipating integrand).

Existence and Uniqueness

Theorem

Existence and Uniqueness for Neural SDEs

Statement

Suppose $\mu_\theta$ and $\sigma_\theta$ are globally Lipschitz in $x$ (uniformly in $t$) and satisfy the linear growth bound $\|\mu_\theta(x, t)\| + \|\sigma_\theta(x, t)\| \leq C(1 + \|x\|)$. Then there exists a unique strong solution $X_t$ to the neural SDE on $[0, T]$, adapted to the Brownian filtration, satisfying $\mathbb{E}[\sup_{t \leq T} \|X_t\|^2] < \infty$.

Intuition

This is the SDE analog of Picard-Lindelof for classical ODEs. Lipschitz continuity controls how fast nearby paths can separate; linear growth prevents finite-time blow-up. Together they let Picard-iteration-style arguments converge in $L^2$ rather than uniformly.

Proof Sketch

Define the iteration $X_t^{(n+1)} = X_0 + \int_0^t \mu_\theta(X_s^{(n)}, s)\, ds + \int_0^t \sigma_\theta(X_s^{(n)}, s)\, dW_s$. Use the Ito isometry on the stochastic integral term and the Cauchy-Schwarz inequality on the drift term to show the iterates form a Cauchy sequence in the space of square-integrable adapted processes equipped with the norm $\sup_{t \leq T} \mathbb{E}[\|\cdot\|^2]$. Completeness gives the unique limit. See Oksendal Theorem 5.2.1 for the full argument; the only neural-network-specific ingredient is checking the Lipschitz hypothesis for $\mu_\theta$ and $\sigma_\theta$.
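The contraction behind this sketch can be watched numerically. A toy check on the Ornstein-Uhlenbeck case ($\mu(x) = -\theta x$, constant $\sigma$), applying the Picard map on a fixed time grid with a single fixed Brownian path — all parameter choices here are illustrative:

```python
import numpy as np

theta, sigma, T, n = 1.0, 0.5, 1.0, 1000
dt = T / n
rng = np.random.default_rng(1)
W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), size=n))])
stoch = sigma * W  # with constant sigma, int_0^t sigma dW_s is just sigma * W_t

x0 = 1.0
X = np.zeros(n + 1)  # X^{(0)} identically zero
gaps = []
for _ in range(8):
    # Picard map: X^{(n+1)}_t = x0 + int_0^t mu(X^{(n)}_s) ds + int_0^t sigma dW_s
    drift_int = np.concatenate([[0.0], np.cumsum(-theta * X[:-1] * dt)])
    X_next = x0 + drift_int + stoch
    gaps.append(np.max(np.abs(X_next - X)))  # sup-norm gap between iterates
    X = X_next

print(["%.1e" % g for g in gaps])  # successive gaps collapse rapidly
```

With $\theta T = 1$ the sup-gaps between successive iterates shrink roughly like $1/n!$, the factorial decay the Gronwall-style estimate in the proof predicts.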

Why It Matters

This is the load-bearing existence guarantee for neural SDEs. Any time you train a network with a stochastic integrator, you are implicitly assuming this theorem applies. Standard architectures with smooth activations satisfy the Lipschitz condition on bounded sets; with weight constraints the linear growth bound also holds. Without these, the trajectory the integrator computes corresponds to nothing well-defined.

Failure Mode

Multiplicative noise architectures where $\sigma_\theta$ depends nonlinearly on $X$ can violate global Lipschitz continuity. The CIR-style square-root diffusion ($\sigma(x) = \sqrt{x}$, whose derivative blows up at $x = 0$) is a classical example where existence still holds but requires non-Lipschitz SDE theory (Yamada-Watanabe).
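The failure mode is easy to trigger numerically (this is my own toy example, not from the references): naive Euler-Maruyama on a CIR-style square-root diffusion steps below zero on a coarse grid, after which $\sqrt{X}$ is NaN and the path is garbage. Clamping the square-root argument (a "full truncation"-style fix, sketched here without endorsing any particular CIR scheme) keeps paths finite:

```python
import numpy as np

kappa, mu, T, n = 2.0, 0.05, 1.0, 50  # deliberately coarse grid
dt = T / n

def simulate(clamp, n_paths, seed):
    """Euler-Maruyama for dX = kappa*(mu - X) dt + sqrt(X) dW."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, 0.04)
    for _ in range(n):
        dW = rng.normal(0, np.sqrt(dt), size=n_paths)
        arg = np.maximum(x, 0.0) if clamp else x
        with np.errstate(invalid="ignore"):      # sqrt of a negative -> NaN
            x = x + kappa * (mu - x) * dt + np.sqrt(arg) * dW
    return x

naive = simulate(clamp=False, n_paths=1000, seed=2)
safe = simulate(clamp=True, n_paths=1000, seed=2)
print("NaN fraction, naive:  ", np.isnan(naive).mean())
print("NaN fraction, clamped:", np.isnan(safe).mean())
```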

The SDE Adjoint Method

The Neural ODE adjoint method extends to SDEs but with significant subtleties. Li et al. 2020 derived the stochastic adjoint for SDEs with diagonal noise:

$$\frac{dL}{d\theta} = -\int_0^T a(t)^\top \frac{\partial \mu_\theta}{\partial \theta}(X_t, t)\, dt - \int_0^T a(t)^\top \frac{\partial \sigma_\theta}{\partial \theta}(X_t, t)\, dW_t,$$

where $a(t)$ satisfies a backward SDE driven by the same Brownian motion $W_t$ used in the forward pass. The crucial implementation detail is deterministic noise reconstruction: the random seed used to sample $W_t$ in the forward pass must be replayed in reverse during the backward pass; otherwise the gradients are with respect to a different sample path than the one whose loss was evaluated.

This requirement (Li et al. call it "Brownian motion replay") keeps the SDE adjoint's memory cost constant in the integration horizon, matching the $O(1)$ of the deterministic adjoint at the price of a logarithmic time overhead per noise query — still far cheaper in memory than the $O(T)$ of full backpropagation through the integrator. The torchsde library implements this via the virtual Brownian tree, a binary-search structure that reconstructs $W_t$ at any queried time without storing the entire path.
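The virtual Brownian tree idea fits in a few lines. This toy version — illustrative only; torchsde's real implementation differs in its seeding and interpolation details — reconstructs $W(t)$ at any query time by recursive Brownian-bridge bisection, deriving each midpoint sample from a seed tied to its position in an implicit binary tree, so the path itself is never stored:

```python
import numpy as np

def W(t, T=1.0, base_seed=1234, depth=24):
    """Query one fixed Brownian path at time t in [0, T] without storing it."""
    # the endpoint W(T) is drawn deterministically from the base seed
    w1 = np.sqrt(T) * np.random.default_rng(base_seed).normal()
    t0, w0, t1 = 0.0, 0.0, T
    node = 1  # root of an implicit binary tree; left child 2n, right 2n+1
    for _ in range(depth):
        tm = 0.5 * (t0 + t1)
        # Brownian bridge: midpoint ~ N(mean of endpoints, (t1 - t0)/4),
        # sampled from an RNG seeded by the node's position in the tree
        rng = np.random.default_rng([base_seed, node])
        wm = 0.5 * (w0 + w1) + np.sqrt((t1 - t0) / 4.0) * rng.normal()
        if t <= tm:
            t1, w1, node = tm, wm, 2 * node
        else:
            t0, w0, node = tm, wm, 2 * node + 1
    # linearly interpolate on the final sub-interval (width T / 2**depth)
    return w0 + (w1 - w0) * (t - t0) / (t1 - t0)

# replaying the same seed reproduces the same path exactly -- the property
# the SDE adjoint's backward pass relies on
assert W(0.37) == W(0.37)
print(f"W(0.25) = {W(0.25):.4f}, W(0.75) = {W(0.75):.4f}")
```

Queries at different times are mutually consistent samples of one underlying path, because every midpoint is a deterministic function of its tree position; memory stays constant while each query costs `depth` RNG draws.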

The discretize-then-optimize alternative (backprop through the SDE solver) is exact but loses the memory advantage, and its memory scales with both the path length and the number of Brownian increments. The right choice is problem-dependent; see Onken and Ruthotto 2020 for an empirical comparison.

The Probability Flow ODE: Bridge to Diffusion Models

The deepest connection between neural SDEs and Neural ODEs is the probability flow ODE. Given a forward SDE

$$dX_t = \mu(X_t, t)\, dt + \sigma(t)\, dW_t,$$

with marginal density $p_t(x)$, there exists a deterministic ODE whose solutions have the same marginal density $p_t$ at every time $t$:

Theorem

Probability Flow ODE (Song et al. 2021)

Statement

Define the deterministic ODE

$$\frac{dx}{dt} = \mu(x, t) - \frac{1}{2} \sigma(t)^2 \nabla_x \log p_t(x).$$

Let $\tilde{X}_t$ denote its solution with $\tilde{X}_0 \sim p_0$. Then for every $t$, $\tilde{X}_t$ has the same density $p_t$ as the SDE solution $X_t$.

Intuition

The Fokker-Planck equation for the SDE,

$$\partial_t p_t = -\nabla \cdot (\mu p_t) + \frac{1}{2} \sigma^2 \nabla^2 p_t,$$

can be rewritten in transport form $\partial_t p_t + \nabla \cdot (v p_t) = 0$ with velocity $v(x, t) = \mu(x, t) - \frac{1}{2}\sigma(t)^2 \nabla \log p_t(x)$. This transport equation is the continuity equation for the deterministic ODE with right-hand side $v$. The two systems push the same density through space at every $t$, even though their individual sample paths are different.

Proof Sketch

Substitute $\nabla \log p_t = (\nabla p_t)/p_t$ into the Fokker-Planck equation and rearrange. The diffusion term $\frac{1}{2}\sigma^2 \nabla^2 p_t = \frac{1}{2}\sigma^2 \nabla \cdot (\nabla p_t)$ becomes $\nabla \cdot (\frac{1}{2}\sigma^2 p_t \nabla \log p_t)$, and combining with the drift term gives $\partial_t p_t = -\nabla \cdot (v p_t)$ with $v = \mu - \frac{1}{2}\sigma^2 \nabla \log p_t$. The marginals of the SDE and of the ODE both satisfy this PDE; uniqueness of solutions to the Fokker-Planck/continuity equation gives the result.

Why It Matters

This theorem is the formal bridge between stochastic generative modeling and deterministic neural ODE inference. Once you have a trained score model $s_\theta(x, t) \approx \nabla \log p_t(x)$, you can sample by integrating either the reverse-time SDE or the deterministic probability flow ODE. The ODE path is what DDIM, DPM-Solver, and EDM use for fast sampling: a Neural ODE with $f_\theta = \mu - \frac{1}{2}\sigma^2 s_\theta$. Fewer NFE per sample, exact likelihoods (via change-of-variables), and adaptive solvers all become available.

Failure Mode

Equality holds for marginal distributions, not for joint distributions across time. The SDE and ODE produce different conditional distributions $p(x_s \mid x_t)$ for $s \neq t$, so the ODE cannot be used as a substitute when you need to condition on intermediate states. Stochastic samplers also tend to have different bias-variance tradeoffs in the few-step regime; see Karras et al. 2022 (EDM) for a careful empirical comparison.

This is why the Neural-ODE / diffusion-model connection is real and not merely analogical: modern fast samplers literally invoke an ODE solver on a learned vector field of the form $\mu(x, t) - \frac{1}{2}\sigma(t)^2 s_\theta(x, t)$ — a chosen drift minus a learned score term. The same torchdiffeq adaptive solver used for Neural ODE classification is used inside diffusion samplers.
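The marginal-matching claim can be verified numerically in a case where the score is known in closed form. A sketch for the OU forward SDE $dX = -\theta X\,dt + \sigma\,dW$ started from $\mathcal{N}(0, s_0^2)$: the marginals stay Gaussian with variance $v(t) = s_0^2 e^{-2\theta t} + \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})$, so $\nabla \log p_t(x) = -x / v(t)$ exactly and no network is needed (all parameter values below are arbitrary illustrative choices):

```python
import numpy as np

theta, sigma, s0, T, n = 1.0, 0.8, 1.5, 1.0, 400
dt = T / n
rng = np.random.default_rng(3)

def v(t):
    # closed-form marginal variance of the OU process started from N(0, s0^2)
    return s0**2 * np.exp(-2 * theta * t) + sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * t))

x_sde = s0 * rng.normal(size=20000)  # samples from p_0
x_ode = x_sde.copy()                 # same initial samples for both dynamics
for i in range(n):
    t = i * dt
    # forward SDE: Euler-Maruyama step with fresh noise
    x_sde = x_sde - theta * x_sde * dt + sigma * np.sqrt(dt) * rng.normal(size=x_sde.size)
    # probability flow ODE: dx/dt = mu - (1/2) sigma^2 * score, no noise at all
    score = -x_ode / v(t)
    x_ode = x_ode + (-theta * x_ode - 0.5 * sigma**2 * score) * dt

print(f"std at T: exact {np.sqrt(v(T)):.3f} | SDE {x_sde.std():.3f} | ODE {x_ode.std():.3f}")
```

All three numbers agree to Monte-Carlo and discretization error, even though the individual SDE and ODE trajectories are entirely different — the marginal-vs-joint distinction from the failure mode above, made concrete.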

Generative Neural SDEs as Infinite-Dimensional GANs

The probability flow ODE perspective explains why score-based diffusion training works: the loss has a clean variational interpretation. But neural SDEs admit a second generative-modeling perspective that does not require the score-matching framing.

Kidger et al. 2021 frame a generator $G_\theta$ that integrates an SDE from random initial noise through learned $(\mu_\theta, \sigma_\theta)$, paired with a discriminator $D_\phi$ that scores generated paths against real paths. This is a Wasserstein GAN played in the space of continuous paths, and the generator-discriminator min-max trains both networks jointly. Kidger et al. proved that under capacity assumptions this scheme can match arbitrary continuous-time stochastic processes — they characterize neural SDEs as universal approximators for time-homogeneous Ito diffusions in the Wasserstein-1 metric.

This formulation generalizes:

  • Latent SDEs (Li et al. 2020): a variational-autoencoder analog where the latent path follows a learned SDE. Useful for irregularly sampled time series with intrinsic noise.
  • Neural CDEs (Kidger et al. 2020): controlled differential equations driven by the data path itself rather than Brownian motion, which gives a continuous-time analog of RNNs for irregularly sampled inputs.
  • Latent ODE-RNN hybrids (Rubanova et al. 2019): mix discrete RNN updates at observation times with ODE flow between observations.

Connection to Energy-Based Models

The score $\nabla_x \log p_t(x)$ that drives the reverse SDE is the negative gradient of an energy-based model: if $E_t(x) = -\log p_t(x)$ (up to the log partition function), then $\nabla \log p_t(x) = -\nabla E_t(x)$. Sampling from a diffusion model by probability flow ODE integration is gradient flow on a time-dependent energy landscape, descending the energy of the noisy distribution at each $t$ until $t = 0$, where the energy is the target data energy.

The neural-SDE / Neural-ODE / EBM trio is one mathematical object viewed three ways:

| Perspective       | Object                                                              | Loss                                          |
| ----------------- | ------------------------------------------------------------------- | --------------------------------------------- |
| EBM               | Energy $E_\theta(x)$                                                | Score matching, contrastive divergence        |
| Score-based / SDE | Score $s_\theta(x, t) = -\nabla E_t(x)$                             | Denoising score matching at each noise level  |
| Neural ODE        | Probability flow vector field $\mu - \frac{1}{2}\sigma^2 s_\theta$  | Trained via the score loss, used at inference |

The same network can be trained with EBM losses, diffusion losses, or flow-matching losses, and the resulting sampler can run as an SDE or an ODE. Modern diffusion practice has converged on the score-matching loss (lowest-variance gradients) and the ODE sampler (fastest inference), but the unified object underlying all three formalisms is the same.

Common Confusions

Watch Out

The probability flow ODE is not the reverse-time SDE

Two different equations. The reverse-time SDE has a stochastic term and produces samples with the same joint distribution as time-reversed forward paths. The probability flow ODE is deterministic and matches only the marginal densities. They produce different individual sample paths. For final-sample quality at a given NFE budget, the comparison is empirical and depends on the noise schedule (see Karras et al. 2022, Table 4).

Watch Out

Diagonal noise is not a generic assumption

Most of the practical neural-SDE machinery (the stochastic adjoint of Li et al. 2020, the virtual Brownian tree, the probability flow ODE in its simplest form) assumes diagonal or even scalar diffusion. General multiplicative non-diagonal noise SDEs require more sophisticated stochastic analysis (Stratonovich corrections, Levy area approximations) and have not seen widespread ML adoption.

Watch Out

Brownian motion replay is essential, not optional

The SDE adjoint method is correct only if the backward pass uses the same Brownian sample path as the forward pass. Sampling fresh noise on the backward pass gives a gradient with respect to a different objective, which biases training in subtle ways. Always check that your library uses a virtual Brownian tree or seeded sampler before trusting SDE-adjoint gradients.
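The bias is easy to see with a finite-difference probe — a toy stand-in for the adjoint, my own construction, not any library's API. Differentiate a Monte-Carlo loss for the OU SDE once with the Brownian path replayed across both evaluations and once with fresh noise; the exact gradient is available in closed form for comparison:

```python
import numpy as np

def loss(theta, seed):
    """Monte-Carlo estimate of E[X_T^2] for dX = -theta X dt + sigma dW, X_0 = 1."""
    sigma, T, n, paths = 0.5, 1.0, 200, 4000
    dt = T / n
    rng = np.random.default_rng(seed)
    x = np.full(paths, 1.0)
    for _ in range(n):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.normal(size=paths)
    return np.mean(x**2)

theta, eps = 1.0, 1e-3
# closed form here: E[X_T^2] = e^{-2 theta} + (0.125/theta)(1 - e^{-2 theta})
e = np.exp(-2 * theta)
exact_grad = -2 * e - 0.125 / theta**2 * (1 - e) + 0.25 / theta * e

# replayed noise: the same seed, hence the same Brownian path, in both evaluations
grad_replay = (loss(theta + eps, seed=7) - loss(theta - eps, seed=7)) / (2 * eps)
# fresh noise: independent paths in the two evaluations -- the buggy version
grad_fresh = (loss(theta + eps, seed=8) - loss(theta - eps, seed=9)) / (2 * eps)

print(f"exact {exact_grad:.3f} | replayed {grad_replay:.3f} | fresh {grad_fresh:.3f}")
```

The replayed estimate lands near the exact gradient; the fresh-noise estimate is dominated by Monte-Carlo noise amplified by $1/(2\epsilon)$ and is typically off by an order of magnitude or more.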

Exercises

Exercise (Core)

Problem

Consider the Ornstein-Uhlenbeck SDE $dX_t = -\theta X_t\, dt + \sigma\, dW_t$ with $\theta, \sigma > 0$. The stationary density is $p_\infty(x) = \mathcal{N}(0, \sigma^2/(2\theta))$.

  1. Write the probability flow ODE corresponding to this SDE.
  2. Sketch why solutions of this ODE preserve the stationary density (every initial $\tilde{X}_0 \sim p_\infty$ stays distributed as $p_\infty$ for all $t$).
Exercise (Advanced)

Problem

Suppose you train a score model $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ for a diffusion model with forward SDE $dx = -\frac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dW$ on $t \in [0, 1]$.

  1. Write the probability flow ODE that you would integrate from $t = 1$ to $t = 0$ to sample.
  2. Why does this ODE require fewer NFE than the reverse-time SDE for comparable sample quality? Identify the variance source that the ODE removes.
  3. What goes wrong if $s_\theta$ is inaccurate near $t = 0$? Why is this region especially hard?

References

Canonical:

  • Li, Wong, Chen, Duvenaud, "Scalable Gradients for Stochastic Differential Equations" (AISTATS 2020; arXiv:2001.01328). The neural-SDE adjoint method via Brownian motion replay; the virtual Brownian tree.
  • Kidger, Foster, Li, Lyons, "Neural SDEs as Infinite-Dimensional GANs" (ICML 2021; arXiv:2102.03657). Generative neural SDEs; Wasserstein GAN training in path space.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021 oral; arXiv:2011.13456). The probability flow ODE (Section 4.3, Appendix D.1) — the explicit bridge to neural ODEs.
  • Anderson, "Reverse-time diffusion equation models," Stochastic Processes and Their Applications 12(3):313-326 (1982). The original derivation of the time-reversed SDE.

Current:

  • Tzen, Raginsky, "Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit" (arXiv:1905.09883, 2019). Theoretical analysis of latent SDEs as the continuous-time limit of latent Gaussian models.
  • Kidger, Morrill, Foster, Lyons, "Neural Controlled Differential Equations for Irregular Time Series" (NeurIPS 2020 spotlight; arXiv:2005.08926). The CDE generalization; the workhorse for irregularly sampled real-world time series.
  • Rubanova, Chen, Duvenaud, "Latent ODEs for Irregularly-Sampled Time Series" (NeurIPS 2019; arXiv:1907.03907). VAE-style latent dynamics with ODE flow between observations; the immediate precursor to latent SDEs.
  • Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (NeurIPS 2022; arXiv:2206.00364). Empirical comparison of SDE vs. ODE samplers and the EDM noise-schedule design.
  • Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022; arXiv:2206.00927). Specialized solver exploiting the structure of the diffusion probability flow ODE.

Reference / Survey:

  • Kidger, "On Neural Differential Equations" (PhD thesis, Oxford, 2022; arXiv:2202.02435). Standard modern reference; Chapters 5-7 cover SDE machinery in depth.
  • Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapter 5. The textbook proof of SDE existence and uniqueness; the reference for any SDE convergence argument.


Last reviewed: April 17, 2026
