

Neural SDEs and the Diffusion Bridge

The stochastic generalization of neural ODEs: parameterizing the drift and diffusion of an SDE with neural networks, the adjoint method extended through Brownian motion, the explicit bridge to diffusion models via the probability flow ODE, and generative neural SDEs as infinite-dimensional GANs.


Why This Matters

Neural ODEs parameterize a deterministic vector field with a neural network. Replacing $dt$ with $dt + \sigma\, dW$ turns this into a neural SDE: a learned drift plus a learned (or fixed) diffusion driven by Brownian motion. This is not a cosmetic generalization. Stochasticity changes what the model can express, what the loss must optimize, and what trajectories mean.

Two reasons to care:

  1. Diffusion models are neural SDEs. Score-based generative modeling fits exactly into this framework. The forward noising process is an SDE; the reverse-time generative process is an SDE; the network learns the score, which is the only unknown drift term. Understanding the SDE picture is the cleanest route to understanding why diffusion samplers work and why they admit deterministic ODE counterparts.

  2. Stochasticity is the right inductive bias for many time-series problems. Financial data, neural recordings, and partially observed systems have intrinsic noise that a deterministic ODE can only fit by overfitting. Neural SDEs learn both the systematic drift and the noise structure jointly. Latent SDEs (Li et al. 2020) extend this to latent-variable time-series modeling.

Setup

A neural SDE has the form

$$d X_t = \mu_\theta(X_t, t)\, dt + \sigma_\theta(X_t, t)\, dW_t, \quad X_0 = x_0,$$

with neural networks $\mu_\theta: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$ (drift) and $\sigma_\theta: \mathbb{R}^d \times [0, T] \to \mathbb{R}^{d \times m}$ (diffusion), and $W_t$ an $m$-dimensional standard Brownian motion. When $\sigma_\theta \equiv 0$ this collapses to a Neural ODE. The randomness enters through the realized path of $W$, so $X_t$ is a stochastic process: each forward solve produces a different trajectory, and the model represents a distribution over trajectories.
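What a forward solve looks like in practice can be sketched with the Euler-Maruyama scheme. The drift and diffusion below are hand-picked tanh-style stand-ins for trained networks $\mu_\theta$ and $\sigma_\theta$ — illustrative choices of mine, not from any library:

```python
import numpy as np

def drift(x, t):
    # stand-in for the learned drift mu_theta(x, t)
    return np.tanh(-2.0 * x)

def diffusion(x, t):
    # stand-in for the learned diffusion sigma_theta(x, t); bounded and Lipschitz
    return 0.3 / (1.0 + x**2)

def euler_maruyama(x0, t0, t1, n_steps, n_paths, rng):
    """Simulate n_paths trajectories of dX = drift dt + diffusion dW."""
    dt = (t1 - t0) / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for i in range(n_steps):
        t = t0 + i * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increments
        x = x + drift(x, t) * dt + diffusion(x, t) * dW  # Euler-Maruyama step
    return x

rng = np.random.default_rng(0)
xT = euler_maruyama(x0=1.0, t0=0.0, t1=1.0, n_steps=200, n_paths=5000, rng=rng)
# each forward solve is one sample path; together the paths estimate a law
print(f"X_T: mean {xT.mean():.3f}, std {xT.std():.3f}")
```

The point is the last line: the output of a neural SDE is the empirical distribution of $X_T$ across solves, not a single value.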

The Ito convention is standard in the ML literature and is assumed throughout this page. See stochastic calculus for ML for the difference between Ito and Stratonovich conventions and why Ito is preferred (martingale property, isometry, no anticipating integrand).

Existence and Uniqueness

Theorem

Existence and Uniqueness for Neural SDEs

Statement

Suppose $\mu_\theta$ and $\sigma_\theta$ are globally Lipschitz in $x$ (uniformly in $t$) and satisfy the linear growth bound $\|\mu_\theta(x, t)\| + \|\sigma_\theta(x, t)\| \leq C(1 + \|x\|)$. Then there exists a unique strong solution $X_t$ to the neural SDE on $[0, T]$, adapted to the Brownian filtration, satisfying $\mathbb{E}[\sup_{t \leq T} \|X_t\|^2] < \infty$.

Intuition

This is the SDE analog of Picard-Lindelof for classical ODEs. Lipschitz continuity controls how fast nearby paths can separate; linear growth prevents finite-time blow-up. Together they let Picard-iteration-style arguments converge in $L^2$ rather than uniformly.

Proof Sketch

Define the iteration $X_t^{(n+1)} = X_0 + \int_0^t \mu_\theta(X_s^{(n)}, s)\, ds + \int_0^t \sigma_\theta(X_s^{(n)}, s)\, dW_s$. Use the Ito isometry on the stochastic integral term and the Cauchy-Schwarz inequality on the drift term to show the iterates form a Cauchy sequence in the space of square-integrable adapted processes equipped with the norm $\sup_{t \leq T} \mathbb{E}[\|\cdot\|^2]$. Completeness gives the unique limit. See Oksendal Theorem 5.2.1 for the full argument; the only neural-network-specific ingredient is checking the Lipschitz hypothesis for $\mu_\theta$ and $\sigma_\theta$.
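The contraction behind this sketch can be watched numerically. A toy check on the Ornstein-Uhlenbeck case ($\mu(x) = -\theta x$, constant $\sigma$), applying the Picard map on a fixed time grid with a single fixed Brownian path — all parameter choices here are illustrative:

```python
import numpy as np

theta, sigma, T, n = 1.0, 0.5, 1.0, 1000
dt = T / n
rng = np.random.default_rng(1)
W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), size=n))])
stoch = sigma * W  # with constant sigma, int_0^t sigma dW_s is just sigma * W_t

x0 = 1.0
X = np.zeros(n + 1)  # X^{(0)} identically zero
gaps = []
for _ in range(8):
    # Picard map: X^{(n+1)}_t = x0 + int_0^t mu(X^{(n)}_s) ds + int_0^t sigma dW_s
    drift_int = np.concatenate([[0.0], np.cumsum(-theta * X[:-1] * dt)])
    X_next = x0 + drift_int + stoch
    gaps.append(np.max(np.abs(X_next - X)))  # sup-norm gap between iterates
    X = X_next

print(["%.1e" % g for g in gaps])  # successive gaps collapse rapidly
```

With $\theta T = 1$ the sup-gaps between successive iterates shrink roughly like $1/n!$, the factorial decay the Gronwall-style estimate in the proof predicts.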

Why It Matters

This is the load-bearing existence guarantee for neural SDEs. Any time you train a network with a stochastic integrator, you are implicitly assuming this theorem applies. Standard architectures with smooth activations satisfy the Lipschitz condition on bounded sets; with weight constraints the linear growth bound also holds. Without these, the trajectory the integrator computes corresponds to nothing well-defined.

Failure Mode

Multiplicative noise architectures where $\sigma_\theta$ depends nonlinearly on $X$ can violate global Lipschitz continuity. The CIR-style square-root diffusion ($\sigma(x) = \sqrt{x}$, whose derivative blows up at $x = 0$) is a classical example where existence still holds but requires non-Lipschitz SDE theory (Yamada-Watanabe).
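The failure mode is easy to trigger numerically (this is my own toy example, not from the references): naive Euler-Maruyama on a CIR-style square-root diffusion steps below zero on a coarse grid, after which $\sqrt{X}$ is NaN and the path is garbage. Clamping the square-root argument (a "full truncation"-style fix, sketched here without endorsing any particular CIR scheme) keeps paths finite:

```python
import numpy as np

kappa, mu, T, n = 2.0, 0.05, 1.0, 50  # deliberately coarse grid
dt = T / n

def simulate(clamp, n_paths, seed):
    """Euler-Maruyama for dX = kappa*(mu - X) dt + sqrt(X) dW."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, 0.04)
    for _ in range(n):
        dW = rng.normal(0, np.sqrt(dt), size=n_paths)
        arg = np.maximum(x, 0.0) if clamp else x
        with np.errstate(invalid="ignore"):      # sqrt of a negative -> NaN
            x = x + kappa * (mu - x) * dt + np.sqrt(arg) * dW
    return x

naive = simulate(clamp=False, n_paths=1000, seed=2)
safe = simulate(clamp=True, n_paths=1000, seed=2)
print("NaN fraction, naive:  ", np.isnan(naive).mean())
print("NaN fraction, clamped:", np.isnan(safe).mean())
```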

The SDE Adjoint Method

The Neural ODE adjoint method extends to SDEs but with significant subtleties. Li et al. 2020 derived the stochastic adjoint for SDEs with diagonal noise:

$$\frac{dL}{d\theta} = -\int_0^T a(t)^\top \frac{\partial \mu_\theta}{\partial \theta}(X_t, t)\, dt - \int_0^T a(t)^\top \frac{\partial \sigma_\theta}{\partial \theta}(X_t, t)\, dW_t,$$

where $a(t)$ satisfies a backward SDE driven by the same Brownian motion $W_t$ used in the forward pass. The crucial implementation detail is deterministic noise reconstruction: the random seed used to sample $W_t$ in the forward pass must be replayed in reverse during the backward pass; otherwise the gradients are with respect to a different sample path than the one whose loss was evaluated.

This requirement (Li et al. call it "Brownian motion replay") keeps the SDE adjoint's memory cost constant in the integration horizon, matching the $O(1)$ of the deterministic adjoint at the price of a logarithmic time overhead per noise query — still far cheaper in memory than the $O(T)$ of full backpropagation through the integrator. The torchsde library implements this via the virtual Brownian tree, a binary-search structure that reconstructs $W_t$ at any queried time without storing the entire path.
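The virtual Brownian tree idea fits in a few lines. This toy version — illustrative only; torchsde's real implementation differs in its seeding and interpolation details — reconstructs $W(t)$ at any query time by recursive Brownian-bridge bisection, deriving each midpoint sample from a seed tied to its position in an implicit binary tree, so the path itself is never stored:

```python
import numpy as np

def W(t, T=1.0, base_seed=1234, depth=24):
    """Query one fixed Brownian path at time t in [0, T] without storing it."""
    # the endpoint W(T) is drawn deterministically from the base seed
    w1 = np.sqrt(T) * np.random.default_rng(base_seed).normal()
    t0, w0, t1 = 0.0, 0.0, T
    node = 1  # root of an implicit binary tree; left child 2n, right 2n+1
    for _ in range(depth):
        tm = 0.5 * (t0 + t1)
        # Brownian bridge: midpoint ~ N(mean of endpoints, (t1 - t0)/4),
        # sampled from an RNG seeded by the node's position in the tree
        rng = np.random.default_rng([base_seed, node])
        wm = 0.5 * (w0 + w1) + np.sqrt((t1 - t0) / 4.0) * rng.normal()
        if t <= tm:
            t1, w1, node = tm, wm, 2 * node
        else:
            t0, w0, node = tm, wm, 2 * node + 1
    # linearly interpolate on the final sub-interval (width T / 2**depth)
    return w0 + (w1 - w0) * (t - t0) / (t1 - t0)

# replaying the same seed reproduces the same path exactly -- the property
# the SDE adjoint's backward pass relies on
assert W(0.37) == W(0.37)
print(f"W(0.25) = {W(0.25):.4f}, W(0.75) = {W(0.75):.4f}")
```

Queries at different times are mutually consistent samples of one underlying path, because every midpoint is a deterministic function of its tree position; memory stays constant while each query costs `depth` RNG draws.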

The discretize-then-optimize alternative (backprop through the SDE solver) is exact but loses the memory advantage, and its memory scales with both the path length and the number of Brownian increments. The right choice is problem-dependent; see Onken and Ruthotto 2020 for an empirical comparison.

The Probability Flow ODE: Bridge to Diffusion Models

The deepest connection between neural SDEs and Neural ODEs is the probability flow ODE. Given a forward SDE

$$dX_t = \mu(X_t, t)\, dt + \sigma(t)\, dW_t,$$

with marginal density $p_t(x)$, there exists a deterministic ODE whose solutions have the same marginal density $p_t$ at every time $t$:

Theorem

Probability Flow ODE (Song et al. 2021)

Statement

Define the deterministic ODE

$$\frac{dx}{dt} = \mu(x, t) - \frac{1}{2} \sigma(t)^2 \nabla_x \log p_t(x).$$

Let $\tilde{X}_t$ denote its solution with $\tilde{X}_0 \sim p_0$. Then for every $t$, $\tilde{X}_t$ has the same density $p_t$ as the SDE solution $X_t$.

Intuition

The Fokker-Planck equation for the SDE,

$$\partial_t p_t = -\nabla \cdot (\mu p_t) + \frac{1}{2} \sigma^2 \nabla^2 p_t,$$

can be rewritten in transport form $\partial_t p_t + \nabla \cdot (v p_t) = 0$ with velocity $v(x, t) = \mu(x, t) - \frac{1}{2}\sigma(t)^2 \nabla \log p_t(x)$. This transport equation is the continuity equation for the deterministic ODE with right-hand side $v$. The two systems push the same density through space at every $t$, even though their individual sample paths are different.

Proof Sketch

Substitute $\nabla \log p_t = (\nabla p_t)/p_t$ into the Fokker-Planck equation and rearrange. The diffusion term $\frac{1}{2}\sigma^2 \nabla^2 p_t = \frac{1}{2}\sigma^2 \nabla \cdot (\nabla p_t)$ becomes $\nabla \cdot (\frac{1}{2}\sigma^2 p_t \nabla \log p_t)$, and combining with the drift term gives $\partial_t p_t = -\nabla \cdot (v p_t)$ with $v = \mu - \frac{1}{2}\sigma^2 \nabla \log p_t$. The marginals of the SDE and of the ODE both satisfy this PDE; uniqueness of solutions to the Fokker-Planck/continuity equation gives the result.

Why It Matters

This theorem is the formal bridge between stochastic generative modeling and deterministic neural ODE inference. Once you have a trained score model $s_\theta(x, t) \approx \nabla \log p_t(x)$, you can sample by integrating either the reverse-time SDE or the deterministic probability flow ODE. The ODE path is what DDIM, DPM-Solver, and EDM use for fast sampling: a Neural ODE with $f_\theta = \mu - \frac{1}{2}\sigma^2 s_\theta$. Fewer NFE per sample, exact likelihoods (via change-of-variables), and adaptive solvers all become available.

Failure Mode

Equality holds for marginal distributions, not for joint distributions across time. The SDE and ODE produce different conditional distributions $p(x_s \mid x_t)$ for $s \neq t$, so the ODE cannot be used as a substitute when you need to condition on intermediate states. Stochastic samplers also tend to have different bias-variance tradeoffs in the few-step regime; see Karras et al. 2022 (EDM) for a careful empirical comparison.

This is why the Neural-ODE / diffusion-model connection is real and not merely analogical: modern fast samplers literally invoke an ODE solver on a learned vector field of the form $\mu(x, t) - \frac{1}{2}\sigma(t)^2 s_\theta(x, t)$ — a chosen drift minus a learned score term. The same torchdiffeq adaptive solver used for Neural ODE classification is used inside diffusion samplers.
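The marginal-matching claim can be verified numerically in a case where the score is known in closed form. A sketch for the OU forward SDE $dX = -\theta X\,dt + \sigma\,dW$ started from $\mathcal{N}(0, s_0^2)$: the marginals stay Gaussian with variance $v(t) = s_0^2 e^{-2\theta t} + \frac{\sigma^2}{2\theta}(1 - e^{-2\theta t})$, so $\nabla \log p_t(x) = -x / v(t)$ exactly and no network is needed (all parameter values below are arbitrary illustrative choices):

```python
import numpy as np

theta, sigma, s0, T, n = 1.0, 0.8, 1.5, 1.0, 400
dt = T / n
rng = np.random.default_rng(3)

def v(t):
    # closed-form marginal variance of the OU process started from N(0, s0^2)
    return s0**2 * np.exp(-2 * theta * t) + sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * t))

x_sde = s0 * rng.normal(size=20000)  # samples from p_0
x_ode = x_sde.copy()                 # same initial samples for both dynamics
for i in range(n):
    t = i * dt
    # forward SDE: Euler-Maruyama step with fresh noise
    x_sde = x_sde - theta * x_sde * dt + sigma * np.sqrt(dt) * rng.normal(size=x_sde.size)
    # probability flow ODE: dx/dt = mu - (1/2) sigma^2 * score, no noise at all
    score = -x_ode / v(t)
    x_ode = x_ode + (-theta * x_ode - 0.5 * sigma**2 * score) * dt

print(f"std at T: exact {np.sqrt(v(T)):.3f} | SDE {x_sde.std():.3f} | ODE {x_ode.std():.3f}")
```

All three numbers agree to Monte-Carlo and discretization error, even though the individual SDE and ODE trajectories are entirely different — the marginal-vs-joint distinction from the failure mode above, made concrete.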

Generative Neural SDEs as Infinite-Dimensional GANs

The probability flow ODE perspective explains why score-based diffusion training works: the loss has a clean variational interpretation. But neural SDEs admit a second generative-modeling perspective that does not require the score-matching framing.

Kidger et al. 2021 frame a generator $G_\theta$ that integrates an SDE from random initial noise through learned $(\mu_\theta, \sigma_\theta)$, paired with a discriminator $D_\phi$ that scores generated paths against real paths. This is a Wasserstein GAN played in the space of continuous paths, and the generator-discriminator min-max trains both networks jointly. Kidger et al. proved that under capacity assumptions this scheme can match arbitrary continuous-time stochastic processes — they characterize neural SDEs as universal approximators for time-homogeneous Ito diffusions in the Wasserstein-1 metric.

This formulation generalizes:

  • Latent SDEs (Li et al. 2020): a variational-autoencoder analog where the latent path follows a learned SDE. Useful for irregularly sampled time series with intrinsic noise.
  • Neural CDEs (Kidger et al. 2020): controlled differential equations driven by the data path itself rather than Brownian motion, which gives a continuous-time analog of RNNs for irregularly sampled inputs.
  • Latent ODE-RNN hybrids (Rubanova et al. 2019): mix discrete RNN updates at observation times with ODE flow between observations.

Connection to Energy-Based Models

The score $\nabla_x \log p_t(x)$ that drives the reverse SDE is the negative gradient of an energy-based model: if $E_t(x) = -\log p_t(x)$ (up to the log partition function), then $\nabla \log p_t(x) = -\nabla E_t(x)$. Sampling from a diffusion model by probability flow ODE integration is gradient flow on a time-dependent energy landscape, descending the energy of the noisy distribution at each $t$ until $t = 0$, where the energy is the target data energy.

The neural-SDE / Neural-ODE / EBM trio is one mathematical object viewed three ways:

| Perspective       | Object                                                              | Loss                                          |
| ----------------- | ------------------------------------------------------------------- | --------------------------------------------- |
| EBM               | Energy $E_\theta(x)$                                                | Score matching, contrastive divergence        |
| Score-based / SDE | Score $s_\theta(x, t) = -\nabla E_t(x)$                             | Denoising score matching at each noise level  |
| Neural ODE        | Probability flow vector field $\mu - \frac{1}{2}\sigma^2 s_\theta$  | Trained via the score loss, used at inference |

The same network can be trained with EBM losses, diffusion losses, or flow-matching losses, and the resulting sampler can run as an SDE or an ODE. Modern diffusion practice has converged on the score-matching loss (lowest-variance gradients) and the ODE sampler (fastest inference), but the unified object underlying all three formalisms is the same.

Common Confusions

Watch Out

The probability flow ODE is not the reverse-time SDE

Two different equations. The reverse-time SDE has a stochastic term and produces samples with the same joint distribution as time-reversed forward paths. The probability flow ODE is deterministic and matches only the marginal densities. They produce different individual sample paths. For final-sample quality at a given NFE budget, the comparison is empirical and depends on the noise schedule (see Karras et al. 2022, Table 4).

Watch Out

Diagonal noise is not a generic assumption

Most of the practical neural-SDE machinery (the stochastic adjoint of Li et al. 2020, the virtual Brownian tree, the probability flow ODE in its simplest form) assumes diagonal or even scalar diffusion. General multiplicative non-diagonal noise SDEs require more sophisticated stochastic analysis (Stratonovich corrections, Levy area approximations) and have not seen widespread ML adoption.

Watch Out

Brownian motion replay is essential, not optional

The SDE adjoint method is correct only if the backward pass uses the same Brownian sample path as the forward pass. Sampling fresh noise on the backward pass gives a gradient with respect to a different objective, which biases training in subtle ways. Always check that your library uses a virtual Brownian tree or seeded sampler before trusting SDE-adjoint gradients.
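The bias is easy to see with a finite-difference probe — a toy stand-in for the adjoint, my own construction, not any library's API. Differentiate a Monte-Carlo loss for the OU SDE once with the Brownian path replayed across both evaluations and once with fresh noise; the exact gradient is available in closed form for comparison:

```python
import numpy as np

def loss(theta, seed):
    """Monte-Carlo estimate of E[X_T^2] for dX = -theta X dt + sigma dW, X_0 = 1."""
    sigma, T, n, paths = 0.5, 1.0, 200, 4000
    dt = T / n
    rng = np.random.default_rng(seed)
    x = np.full(paths, 1.0)
    for _ in range(n):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.normal(size=paths)
    return np.mean(x**2)

theta, eps = 1.0, 1e-3
# closed form here: E[X_T^2] = e^{-2 theta} + (0.125/theta)(1 - e^{-2 theta})
e = np.exp(-2 * theta)
exact_grad = -2 * e - 0.125 / theta**2 * (1 - e) + 0.25 / theta * e

# replayed noise: the same seed, hence the same Brownian path, in both evaluations
grad_replay = (loss(theta + eps, seed=7) - loss(theta - eps, seed=7)) / (2 * eps)
# fresh noise: independent paths in the two evaluations -- the buggy version
grad_fresh = (loss(theta + eps, seed=8) - loss(theta - eps, seed=9)) / (2 * eps)

print(f"exact {exact_grad:.3f} | replayed {grad_replay:.3f} | fresh {grad_fresh:.3f}")
```

The replayed estimate lands near the exact gradient; the fresh-noise estimate is dominated by Monte-Carlo noise amplified by $1/(2\epsilon)$ and is typically off by an order of magnitude or more.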

Exercises

Exercise (Core)

Problem

Consider the Ornstein-Uhlenbeck SDE $dX_t = -\theta X_t\, dt + \sigma\, dW_t$ with $\theta, \sigma > 0$. The stationary density is $p_\infty(x) = \mathcal{N}(0, \sigma^2/(2\theta))$.

  1. Write the probability flow ODE corresponding to this SDE.
  2. Sketch why solutions of this ODE preserve the stationary density (every initial $\tilde{X}_0 \sim p_\infty$ stays distributed as $p_\infty$ for all $t$).
Exercise (Advanced)

Problem

Suppose you train a score model $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ for a diffusion model with forward SDE $dx = -\frac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dW$ on $t \in [0, 1]$.

  1. Write the probability flow ODE that you would integrate from $t = 1$ to $t = 0$ to sample.
  2. Why does this ODE require fewer NFE than the reverse-time SDE for comparable sample quality? Identify the variance source that the ODE removes.
  3. What goes wrong if $s_\theta$ is inaccurate near $t = 0$? Why is this region especially hard?

References

Canonical:

  • Li, Wong, Chen, Duvenaud, "Scalable Gradients for Stochastic Differential Equations" (AISTATS 2020; arXiv:2001.01328). The neural-SDE adjoint method via Brownian motion replay; the virtual Brownian tree.
  • Kidger, Foster, Li, Lyons, "Neural SDEs as Infinite-Dimensional GANs" (ICML 2021; arXiv:2102.03657). Generative neural SDEs; Wasserstein GAN training in path space.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021 oral; arXiv:2011.13456). The probability flow ODE (Section 4.3, Appendix D.1) — the explicit bridge to neural ODEs.
  • Anderson, "Reverse-time diffusion equation models," Stochastic Processes and Their Applications 12(3):313-326 (1982). The original derivation of the time-reversed SDE.

Current:

  • Tzen, Raginsky, "Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit" (arXiv:1905.09883, 2019). Theoretical analysis of latent SDEs as the continuous-time limit of latent Gaussian models.
  • Kidger, Morrill, Foster, Lyons, "Neural Controlled Differential Equations for Irregular Time Series" (NeurIPS 2020 spotlight; arXiv:2005.08926). The CDE generalization; the workhorse for irregularly sampled real-world time series.
  • Rubanova, Chen, Duvenaud, "Latent ODEs for Irregularly-Sampled Time Series" (NeurIPS 2019; arXiv:1907.03907). VAE-style latent dynamics with ODE flow between observations; the immediate precursor to latent SDEs.
  • Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (NeurIPS 2022; arXiv:2206.00364). Empirical comparison of SDE vs. ODE samplers and the EDM noise-schedule design.
  • Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022; arXiv:2206.00927). Specialized solver exploiting the structure of the diffusion probability flow ODE.

Reference / Survey:

  • Kidger, "On Neural Differential Equations" (PhD thesis, Oxford, 2022; arXiv:2202.02435). Standard modern reference; Chapters 5-7 cover SDE machinery in depth.
  • Oksendal, Stochastic Differential Equations (6th ed., 2003), Chapter 5. The textbook proof of SDE existence and uniqueness; the reference for any SDE convergence argument.


Last reviewed: April 17, 2026
