
Mathematical Infrastructure

PDE Fundamentals for Machine Learning

The partial differential equations that appear in modern machine learning: heat and Fokker-Planck for diffusion, continuity for flow matching, Hamilton-Jacobi-Bellman for reinforcement learning, Poisson for score matching. Classification, solution concepts, and where ML actually needs PDE theory versus where it just uses the vocabulary.

Advanced · Tier 2 · Stable · ~70 min

Why This Matters

Spectral PDE explorer
Four classical PDEs solved exactly in your browser: no time-stepping, one FFT per frame, on [0, 1]² at N = 128.
What is this?

A live solver for four classical partial differential equations, running entirely in your browser. The left pane is the solution u(x,y,t). The right pane is its Fourier spectrum |\hat{u}(k,t)| on a log scale. Both update together as you drag the time slider, switch PDE modes, or paint on the field.

What makes it unusual: instead of stepping time forward in tiny numerical increments (the standard PDE-solver approach), this evaluates the exact closed-form solution at any time via a single Fourier multiplier and one inverse FFT per frame. That is why you can scrub time freely, including backwards (scrubbing past t = 0 in heat mode reproduces the ill-posedness that motivates DDPM and the score models that learn \nabla \log p_t).

Try the guided tour button in the top-right to watch all four modes and the live neural-net fit without touching anything. Or click and start painting heat onto the field.

[Explorer panes] Left: the spatial field u(x,y,t), which diffuses with blur radius \sigma = \sqrt{2\alpha t} (click + drag paints heat, shift + drag erases; the domain is periodic). Right: |\hat{u}(k,t)| on a log scale, DC centered, with rings at |k| = 8, 16, 32, 48; each mode decays as e^{-\alpha(2\pi)^2|k|^2 t}, so high-|k| modes die first. A third plot tracks the energy \|u(t)\|_2^2 / \|u(0)\|_2^2 on a log axis, with expected slope \approx -2\alpha(2\pi)^2 |k_{\text{eff}}|^2. Controls: time t, diffusivity \alpha \in [0.001, 0.08], PDE archetype, and initial condition. In heat mode the propagator is \hat{u}(k,t) = \hat{u}(k,0)\, e^{-\alpha(2\pi)^2 |k|^2 t}.
why ML cares about this equation
  • Gaussian blur with radius \sigma is this simulation at t = \sigma^2 / (2\alpha). Scale space in computer vision (Witkin 1983) is the heat equation in disguise.
  • The DDPM / score-SDE forward process diffuses the data density p_t. Every Fourier mode decays as e^{-\alpha|k|^2 t}: high-frequency detail dies first. Watch the letter A collapse into a blob. This is what Gaussian noise does to real images.
  • Scrub time backward. Each mode is now multiplied by e^{+\alpha|k|^2 t}, and high-|k| modes blow up exponentially. This is why naive reverse diffusion is ill-posed and a learned score \nabla \log p_t is required to stay on the data manifold (Anderson 1982; Song 2021).
  • A Fourier Neural Operator layer is this solver with the radial multiplier learned rather than closed-form (Li et al. 2021). The attenuation visible in the |\hat{u}(k)| pane is the target those layers learn to reproduce.
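The explorer's update rule is small enough to reproduce. A minimal numpy sketch of the same closed-form spectral solve (one Fourier multiplier, one inverse FFT); the grid size and constants mirror the demo but are our own choices, not taken from its source:

```python
import numpy as np

# One closed-form heat-equation solve on the periodic unit square:
# u_hat(k, t) = u_hat(k, 0) * exp(-alpha (2 pi)^2 |k|^2 t), then one inverse FFT.
N, alpha, t = 128, 0.015, 0.05
x = np.linspace(0.0, 1.0, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u0 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.005)   # a hot spot

k = np.fft.fftfreq(N, d=1.0 / N)             # integer wavenumbers
k2 = k[:, None] ** 2 + k[None, :] ** 2       # |k|^2 on the grid
multiplier = np.exp(-alpha * (2 * np.pi) ** 2 * k2 * t)
u_t = np.fft.ifft2(np.fft.fft2(u0) * multiplier).real

# DC mode is untouched (mass conserved); every other mode shrinks (energy decays)
print(u0.mean(), u_t.mean(), (u_t ** 2).sum() / (u0 ** 2).sum())
```

Scrubbing time is just re-evaluating `multiplier` at a different `t`, which is why no time-stepping is needed.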
What the left pane shows

The field u(x,y,t) on the periodic square [0,1]^2, rendered directly from the current solution. Paint heat with click-drag; shift-drag erases. Edges wrap around.

What the right pane shows

The log-magnitude Fourier spectrum \log|\hat{u}(k,t)|, with DC centered so k = 0 sits at the middle. Rings mark |k| = 8, 16, 32, 48.

Try this
  • Scrub time past 0 into negative values
  • Switch to the mode IC and slide k_x, k_y
  • Paint, then press ▶
Can a neural net learn what the FFT just computed? SIREN 2→24→24→1 · \omega_0 = 30

A tiny neural network (two hidden layers, sine activations) trains live to match the exact solver's output. Press ▶ train and watch the prediction improve, the error heatmap shrink, and the loss curve decay on a log axis.

The point: exact solvers give machine-precision answers in microseconds; a learned approximator converges slowly to a few-percent RMS and stops there — limited by its ~1,200 parameters, not by compute. This is the honest baseline neural PDE methods fight against on easy problems, and the reason they only win on inverse problems or high-dimensional domains where classical solvers can't go.

This demo fits a supervised target. A true PINN (Raissi 2019) swaps the target for the PDE residual u_t - \alpha\,\Delta u and never sees ground truth; it needs autograd for second derivatives, which is heavier than what runs here. The PINN failure modes appear later on this page.

Spectral 2D solver · 128² grid · radix-2 FFT · closed-form propagator. Keyboard: Space = play/pause · R = reset · 1–4 = mode · C = contours.

Robby Sneiderman · @Robby955 · MIT

Most of modern generative ML and scientific ML is quietly about partial differential equations. A diffusion model is a time-discretized simulation of a Fokker-Planck equation run backward. A flow matching model learns the velocity field of a continuity equation. A value function in reinforcement learning satisfies a Hamilton-Jacobi-Bellman equation. A score network learns \nabla \log p, which is itself a gradient of a Poisson-like object. The PINN literature and the neural operator literature (Fourier Neural Operator, DeepONet) are explicit about their PDE roots; the generative modeling literature often is not.

This page assembles the PDE material that recurs across these systems. It is not a PDE course. It is a reference that names the objects, states what they guarantee and what they do not, and points to the exact place where each PDE shows up inside a working ML model.

Mental Model

A PDE is a local constraint on an unknown function. Given u : \Omega \to \mathbb{R} on a domain \Omega \subseteq \mathbb{R}^n, a PDE of order k is an equation

F\!\left(x, u(x), \nabla u(x), \nabla^2 u(x), \ldots, \nabla^k u(x)\right) = 0, \qquad x \in \Omega,

together with boundary and initial conditions that pin down which solution is meant. The content of PDE theory is twofold: which functions u satisfy the equation (existence, regularity), and how the solution depends on the data (stability, uniqueness). For machine learning purposes, the second question matters more, because ML systems either learn the solution map directly (neural operators), or learn the velocity/score field whose existence is guaranteed by a classical PDE theorem.

The operational shift in ML is that we rarely write down and solve a PDE. We write down an ML loss whose minimizer, at population level, is exactly the classical solution or its velocity/score representation. The PDE is then a sanity check on what the loss is asking for and on what failures of the model imply about the learned object.

Classification: The Three Archetypes

Linear second-order PDEs on \mathbb{R}^n with constant coefficients take the form

\sum_{i,j} a_{ij} \, \partial_i \partial_j u + \sum_i b_i \, \partial_i u + c u = f.

Let A = (a_{ij}) be the symbol matrix. Assume A is symmetric (always achievable after symmetrization).

Theorem

Classification of Second-Order Linear PDEs

Statement

A second-order linear PDE with symbol A is:

  • Elliptic if A is definite (all eigenvalues nonzero and of the same sign). Archetype: Laplace equation \Delta u = 0.
  • Parabolic if A is degenerate with one zero eigenvalue and the remaining n-1 of the same sign, and the first-order term in the missing direction is nonzero. Archetype: heat equation \partial_t u = \Delta u.
  • Hyperbolic if A is nondegenerate with eigenvalues of mixed sign. Archetype: wave equation \partial_{tt} u = \Delta u.

Intuition

Elliptic PDEs describe equilibria: no preferred time direction, and the solution at any interior point is an average of its boundary data. Parabolic PDEs describe smoothing and diffusion: information propagates in one direction of time, and discontinuities are smoothed out. Hyperbolic PDEs describe wave-like propagation: information travels along characteristics at finite speed, and singularities persist.

Why It Matters

The archetype dictates what the solution can look like, which numerical methods are stable, and how ML should treat the problem. A PINN that works well for the heat equation can fail dramatically on the wave equation because gradient descent propagates information differently than the PDE does. Flow matching is effectively a first-order hyperbolic problem (a transport equation); diffusion model forward processes are parabolic. Mixing these up in an ML pipeline is a source of silent bugs.

Failure Mode

Most real PDEs in ML are nonlinear (Fokker-Planck with nonlinear drift, Hamilton-Jacobi-Bellman, Burgers') or first-order (continuity, transport), and do not fit cleanly into the linear-second-order classification. The classification is a ladder to climb onto, not a complete taxonomy. Viscosity solutions (Crandall-Lions 1983) were invented precisely because nonlinear first-order PDEs need a weaker notion of solution.

| Archetype | Symbol matrix A | Canonical PDE | Information flow | Typical behavior | ML counterpart |
| --- | --- | --- | --- | --- | --- |
| Elliptic | Definite (all eigenvalues same sign) | \Delta u = f | Instantaneous everywhere | Smoothing, boundary-averaged | Spectral clustering, Poisson solves, FNO benchmarks |
| Parabolic | One zero eigenvalue, rest same sign | \partial_t u = \Delta u | Forward in time only | Exponential smoothing, diffusive | Diffusion-model forward process, Gaussian blur |
| Hyperbolic | Mixed-sign eigenvalues | \partial_{tt} u = \Delta u | Along characteristics at finite speed | Wave propagation, shock formation | Flow matching (first-order transport), Burgers' benchmarks |

Six PDEs That Matter for Machine Learning

1. Heat equation (parabolic)

\partial_t u = \frac{1}{2} \Delta u, \qquad u(0, x) = u_0(x).

The closed-form solution is convolution with a Gaussian kernel of variance t:

u(t, x) = (G_t * u_0)(x), \qquad G_t(x) = (2 \pi t)^{-n/2} e^{-\|x\|^2 / (2t)}.

Where it appears in ML. The forward pass of a continuous-time diffusion model with constant diffusion coefficient is the heat equation acting on the data density (Song et al. 2021, arXiv:2011.13456). The learned score \nabla \log p_t is the gradient of the log-density of a heat-equation solution started at the data distribution. Gaussian smoothing in image processing evaluates this exact heat flow at a single time.
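The convolution formula can be checked numerically. A small numpy sketch with a Gaussian initial condition, so the exact answer is known (blurring the N(0, s^2) density by the heat kernel of variance t gives the N(0, s^2 + t) density); the discretization choices are ours:

```python
import numpy as np

# Verify u(t, .) = G_t * u0 for the heat equation dt u = (1/2) Laplacian u.
x = np.linspace(-10.0, 10.0, 2001)   # odd length keeps the convolution centered
dx = x[1] - x[0]
s2, t = 0.5, 0.3

def gaussian(x, var):
    return np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

u0 = gaussian(x, s2)                                       # initial density
u_t = np.convolve(u0, gaussian(x, t), mode="same") * dx    # (G_t * u0)(x)
exact = gaussian(x, s2 + t)                                # known closed form

print(np.abs(u_t - exact).max())    # tiny: the heat flow is exactly a blur
```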

2. Fokker-Planck equation (parabolic, linear in the density)

Theorem

Fokker-Planck Equation for an SDE

Statement

The density p_t(x) of the solution to dx = f(x, t) \, dt + g(x, t) \, dW_t with initial density p_0 satisfies the Fokker-Planck (or Kolmogorov forward) equation

\partial_t p = -\nabla \cdot (f \, p) + \tfrac{1}{2} \nabla \cdot \nabla \cdot (g g^\top p),

with initial condition p(0, x) = p_0(x). Here the second term is shorthand for \sum_{i,j} \partial_i \partial_j \big( (g g^\top)_{ij} \, p \big).

Intuition

Probability mass is conserved. The first term is pure transport of mass along the drift f. The second term is pure diffusion at rate g g^\top. Fokker-Planck is a continuity equation for a probability density with an added diffusive flux.

Proof Sketch

Apply Ito's lemma to a test function \varphi(x), take expectations, and integrate by parts twice to move derivatives off \varphi and onto p. The adjoint of the infinitesimal generator \mathcal{L} = f \cdot \nabla + \tfrac{1}{2} g g^\top : \nabla^2 is the Fokker-Planck operator. See Pavliotis, Stochastic Processes and Applications (Springer 2014), Theorem 2.4.

Why It Matters

Every score-based diffusion model is implicitly solving Fokker-Planck forward to obtain the noised marginals and its time reverse to generate samples. The forward process is dx = f(x, t) \, dt + g(t) \, dW; the learned network approximates \nabla \log p_t, and plugging this into the reverse-time SDE gives generative sampling.

Failure Mode

Fokker-Planck requires enough regularity of f, g for a density to exist. If g g^\top is degenerate (a common case in RL, where control affects only some coordinates), the equation is hypoelliptic rather than elliptic in the spatial variables and needs Hörmander's theorem (1967) to guarantee a smooth density.
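One way to see the SDE/Fokker-Planck correspondence concretely is to simulate trajectories and compare empirical moments against the stationary solution of the Fokker-Planck equation. A rough Euler-Maruyama sketch for overdamped Langevin dynamics; all step sizes and particle counts are illustrative:

```python
import numpy as np

# Euler-Maruyama for dx = -V'(x) dt + sqrt(2/beta) dW.  The Fokker-Planck
# equation has stationary solution p(x) proportional to exp(-beta V(x));
# for V(x) = x^2 / 2 and beta = 1 that is the standard Gaussian.
rng = np.random.default_rng(0)
beta, dt, n_steps, n = 1.0, 5e-3, 2000, 20000
x = rng.normal(0.0, 3.0, size=n)          # start far from equilibrium

for _ in range(n_steps):
    drift = -x                             # -V'(x) for V(x) = x^2 / 2
    x = x + drift * dt + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)

print(x.mean(), x.var())                   # should be near the N(0, 1) moments
```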

Theorem

Anderson Reverse-Time SDE

Statement

The time reversal \bar x_t = x_{T - t} of dx = f(x, t) \, dt + g(t) \, dW solves

d\bar x = \left[\,-f(\bar x, T - t) + g(T - t)^2 \, \nabla \log p_{T - t}(\bar x)\,\right] dt + g(T - t) \, d\bar W,

where \bar W is a Brownian motion in reversed time (Anderson 1982, Stochastic Processes and their Applications, 12(3), pp 313-326).

Intuition

Running an SDE backward in time is not trivial: you need a drift correction by g^2 \nabla \log p to cancel the concentration of probability that the forward process produced. The corrected drift pushes mass back toward the data distribution.

Why It Matters

This is the generative sampling equation for score-based diffusion models. The trained network is s_\theta(x, t) \approx \nabla \log p_t(x); substituting it into the Anderson SDE gives the reverse process used at inference. If the score is learned accurately, Anderson's theorem guarantees the reverse process produces samples from p_0 (Song et al. 2021).

Failure Mode

Score estimation is exact only in expectation. Score errors in low-density regions (far from the data manifold) compound over the reverse integration and produce off-manifold samples. This is the fundamental noise-schedule and sampler-design problem in diffusion models.
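The theorem can be exercised end to end in a setting where the score is available in closed form, so the score error is zero and only discretization error remains. An illustrative 1D numpy sketch with Gaussian data and constant diffusion (all names and constants ours):

```python
import numpy as np

# Forward process dx = sigma dW started from N(0, s0^2), so
# p_t = N(0, s0^2 + sigma^2 t) and the score is exactly
# grad log p_t(x) = -x / (s0^2 + sigma^2 t).  The Anderson reverse SDE with
# this score should transport p_T samples back to N(0, s0^2).
rng = np.random.default_rng(1)
s0, sigma, T, n_steps, n = 0.5, 1.0, 1.0, 1000, 50000
dt = T / n_steps

xbar = rng.normal(0.0, np.sqrt(s0**2 + sigma**2 * T), size=n)  # samples of p_T
for i in range(n_steps):
    t = T - i * dt                                   # forward time being reversed
    score = -xbar / (s0**2 + sigma**2 * t)           # exact grad log p_t
    xbar = xbar + sigma**2 * score * dt + sigma * np.sqrt(dt) * rng.normal(size=n)

print(xbar.var())    # should be near s0^2 = 0.25
```

Replacing the exact `score` with a learned network is precisely what score-based diffusion does; errors in that substitution are what the failure mode above describes.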

3. Continuity equation (first-order hyperbolic)

\partial_t p + \nabla \cdot (v \, p) = 0.

This is Fokker-Planck without the diffusion term. Given a velocity field v(t, x), probability mass is pushed along the flow of v without any smoothing.

Where it appears in ML. Flow matching (Lipman, Chen, Ben-Hamu, Nickel, Le 2023, arXiv:2210.02747), rectified flow, and continuous normalizing flows all train a network v_\theta to satisfy a continuity equation from the data distribution to a tractable base (typically a standard Gaussian). At inference time, you solve the ODE \dot x = v_\theta(t, x) deterministically, which is the method-of-characteristics solution of the continuity equation. Optimal transport via the Benamou-Brenier formulation (2000, Numerische Mathematik) is a constrained minimization over continuity-equation-compatible velocity fields.
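For Gaussian endpoints the marginal velocity field is available in closed form, so the method-of-characteristics picture can be checked directly. A numpy sketch using the independent-coupling linear path (a toy stand-in, not a trained flow-matching model):

```python
import numpy as np

# Linear interpolation x_t = (1-t) x0 + t x1 between N(0, 1) and N(0, s1^2)
# with independent endpoints has Gaussian marginals N(0, sig_t^2),
# sig_t^2 = (1-t)^2 + t^2 s1^2, and marginal velocity
# v(t, x) = (1/2) (d sig_t^2/dt) / sig_t^2 * x.  Integrating the ODE
# dx/dt = v(t, x) (no noise) should push base samples to N(0, s1^2).
rng = np.random.default_rng(2)
s1, n_steps, n = 2.0, 1000, 100000
dt = 1.0 / n_steps

x = rng.normal(size=n)                      # base samples at t = 0
for i in range(n_steps):
    t = i * dt
    sig2 = (1 - t) ** 2 + (t * s1) ** 2
    dsig2 = -2 * (1 - t) + 2 * t * s1**2    # d(sig_t^2)/dt
    x = x + (0.5 * dsig2 / sig2) * x * dt   # deterministic Euler step

print(x.var())    # should be near s1^2 = 4
```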

4. Hamilton-Jacobi-Bellman equation (fully nonlinear; first-order in the deterministic case, second-order with diffusion)

For a continuous-time stochastic control problem with running cost \ell and terminal cost \Phi, the value function V(t, x) satisfies

\partial_t V + \min_u \big\{\, \ell(x, u) + f(x, u) \cdot \nabla V + \tfrac{1}{2} \mathrm{tr}(g g^\top \nabla^2 V) \,\big\} = 0, \quad V(T, x) = \Phi(x).

Where it appears in ML. Every continuous-time reinforcement learning formulation reduces to an HJB equation under a sign flip. Control theory minimizes cost, so HJB has a \min_u; RL maximizes reward, so the RL version has a \max_u and the value function changes sign accordingly. The discrete-time Bellman equation Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') is the backward Euler discretization of the reward-form HJB with a sample-based replacement of the expectation. Entropy-regularized RL (soft Q-learning, soft actor-critic) replaces \max_a Q by the soft maximum \alpha \log \sum_a \exp(Q / \alpha), giving a soft HJB equation whose optimal policy has the Gibbs form \pi(a \mid s) \propto \exp(Q(s, a) / \alpha).
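The hard and soft Bellman backups are easy to compare on a toy problem. A numpy sketch with a hand-built two-state MDP (all rewards and transitions invented for illustration):

```python
import numpy as np

# Hard vs. soft Bellman backups on a tiny deterministic MDP: 2 states, 2 actions.
gamma = 0.9
R = np.array([[1.0, 0.0],      # R[s, a]: reward for taking action a in state s
              [0.0, 2.0]])
S_next = np.array([[0, 1],     # S_next[s, a]: deterministic successor state
                   [0, 1]])

def backup(Q, alpha=0.0):
    """One backup.  alpha = 0: hard max; alpha > 0: soft maximum
    alpha * log sum_a exp(Q / alpha) (stabilized logsumexp)."""
    if alpha == 0.0:
        V = Q.max(axis=1)
    else:
        m = Q.max(axis=1)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
    return R + gamma * V[S_next]

Q = np.zeros((2, 2))
Qs = np.zeros((2, 2))
for _ in range(500):            # gamma-contraction: converges geometrically
    Q = backup(Q)
    Qs = backup(Qs, alpha=0.01)

print(Q)                        # hard fixed point
print(np.abs(Qs - Q).max())     # soft stays within alpha*log(2)/(1-gamma) of hard
```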

The HJB equation is fully nonlinear and does not generally admit classical (differentiable) solutions. The right notion is Crandall-Lions viscosity solutions (1983, Transactions of the American Mathematical Society, 277(1), pp 1-42).

5. Poisson and Laplace equations (elliptic)

\Delta u = f \quad \text{(Poisson)}, \qquad \Delta u = 0 \quad \text{(Laplace)}.

The fundamental solution of \Delta \Phi = \delta_0 on \mathbb{R}^n is \Phi(x) = -c_n \|x\|^{2-n} for n \geq 3 (with c_n = [(n-2)\,\omega_{n-1}]^{-1} and \omega_{n-1} the surface area of the unit sphere) and \Phi(x) = \tfrac{1}{2\pi} \log \|x\| for n = 2. Any sufficiently decaying solution of Poisson can be written as a convolution with \Phi; the negative sign for n \geq 3 and the positive sign for n = 2 come from the sign of the Laplacian acting on these radially symmetric profiles and are not a typo.

Where it appears in ML. The discrete Laplace operator is the core object in spectral clustering, graph neural networks, and manifold learning: eigenvectors of the graph Laplacian approximate eigenfunctions of the Laplace-Beltrami operator on the underlying manifold (Belkin and Niyogi 2003, Neural Computation 15(6), pp 1373-1396). Poisson equations also arise directly in physics-based ML applications: solving -\Delta u = f for potential fields is the canonical linear PDE benchmark for FNOs and classical PINN tasks, because the exact solution operator is a translation-invariant convolution, which diagonal-in-Fourier methods match exactly.
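The "diagonal in Fourier" claim is directly checkable: on the periodic square, the Poisson solve is division by (2\pi)^2 |k|^2 in frequency space. A numpy sketch with a manufactured solution:

```python
import numpy as np

# Solve -Laplacian(u) = f on periodic [0,1]^2 spectrally:
# u_hat(k) = f_hat(k) / ((2 pi)^2 |k|^2), DC mode fixed by the mean-zero gauge.
N = 64
x = np.linspace(0.0, 1.0, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

u_true = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)
f = (2 * np.pi) ** 2 * (1 + 4) * u_true          # f = -Laplacian(u_true)

k = np.fft.fftfreq(N, d=1.0 / N)
k2 = (2 * np.pi) ** 2 * (k[:, None] ** 2 + k[None, :] ** 2)
k2[0, 0] = 1.0                                    # avoid division by zero at DC
u_hat = np.fft.fft2(f) / k2
u_hat[0, 0] = 0.0                                 # gauge: mean(u) = 0
u = np.fft.ifft2(u_hat).real

print(np.abs(u - u_true).max())                   # machine precision
```

An FNO layer parameterizes exactly this kind of mode-wise multiplier, which is why the Poisson family is its easiest benchmark.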

6. Burgers' equation (nonlinear, parabolic limit of inviscid hyperbolic)

\partial_t u + u \, \partial_x u = \nu \, \partial_{xx} u.

At \nu = 0 this is inviscid Burgers, which develops shocks (discontinuities) in finite time even for smooth initial data. For \nu > 0 the equation has a Cole-Hopf transformation to the heat equation.

Where it appears in ML. Burgers' is the standard benchmark problem for PINNs and neural operators because it exhibits shocks: a regime where neural PDE solvers either handle or dramatically fail to handle a nonlinear, singular feature. If a method cannot solve Burgers' as \nu \to 0, it will not solve Navier-Stokes, which has the same nonlinear advection plus a coupling and a pressure constraint.
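Shock formation is visible even with a crude scheme. A rough finite-difference sketch of viscous Burgers (forward Euler with central differences; step sizes chosen for stability, not accuracy, and not a production solver):

```python
import numpy as np

# Viscous Burgers u_t + u u_x = nu u_xx on the periodic unit interval.
# A smooth sine steepens toward a shock; nu > 0 keeps the gradient finite.
N, nu, dt, T = 256, 0.01, 1e-4, 0.2
dx = 1.0 / N
x = np.arange(N) * dx
u = np.sin(2 * np.pi * x)

def max_slope(u):
    return np.abs((np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)).max()

slope0 = max_slope(u)                        # about 2*pi initially
for _ in range(int(round(T / dt))):
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)        # central u_x
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # central u_xx
    u = u + dt * (-u * ux + nu * uxx)

print(slope0, max_slope(u))                  # the gradient steepens sharply
```

Shrinking `nu` sharpens the shock until the grid can no longer resolve it, which is the regime where both classical schemes and neural solvers start to struggle.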

Notions of Solution

Classical solutions are differentiable enough to plug into the PDE pointwise. Real applied PDEs usually do not have classical solutions, and one of the weaker notions below is what the theory actually guarantees.

Definition

Weak Solution

A weak solution of Lu = f on \Omega is a function u in a Sobolev space H^s(\Omega) for which

\int_\Omega u \, L^* \varphi \, dx = \int_\Omega f \, \varphi \, dx

for every test function \varphi compactly supported in \Omega, where L^* is the formal adjoint of L. Derivatives are moved onto the smooth test function via integration by parts; u is allowed to be nondifferentiable in the strong sense.

Definition

Viscosity Solution

A viscosity solution of a fully nonlinear first-order or second-order PDE is defined by the Crandall-Lions 1983 test-function conditions that generalize the maximum principle. Intuitively, u is a viscosity solution if, for every smooth \varphi touching u from above (resp. below) at a point x_0, the PDE holds at x_0 with \varphi in place of u and the appropriate inequality. Viscosity solutions exist and are unique under mild assumptions for HJB and many nonlinear first-order PDEs where classical and weak theories both fail.
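A concrete instance: the 1D eikonal equation |u'| = 1 with zero boundary data has many almost-everywhere-differentiable "solutions", but the monotone upwind scheme converges to the unique viscosity solution, the distance function. A short numpy sketch:

```python
import numpy as np

# 1D eikonal |u'(x)| = 1 on (0, 1), u(0) = u(1) = 0.  The monotone upwind
# (fast-sweeping) update u[i] = min(u[i-1], u[i+1]) + h selects the distance
# function min(x, 1 - x); spurious a.e. solutions like -min(x, 1 - x) are
# never produced.
N = 101
h = 1.0 / (N - 1)
x = np.linspace(0.0, 1.0, N)
u = np.full(N, 1e9)
u[0] = u[-1] = 0.0

for _ in range(2):                         # two sweep passes suffice in 1D
    for i in range(1, N - 1):              # left-to-right sweep
        u[i] = min(u[i], min(u[i - 1], u[i + 1]) + h)
    for i in range(N - 2, 0, -1):          # right-to-left sweep
        u[i] = min(u[i], min(u[i - 1], u[i + 1]) + h)

print(np.abs(u - np.minimum(x, 1.0 - x)).max())   # zero up to roundoff
```

The monotonicity of the update is what encodes the viscosity-solution selection principle numerically.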

Definition

Distributional Solution

A distributional solution is a weak solution where the test space is Schwartz functions (or compactly supported smooth functions) and u is interpreted as a distribution. This is the setting for PDEs with singular source terms or point masses, including Green's functions themselves.

| Solution notion | Regularity of u | Test function space | Typical use case | Key reference |
| --- | --- | --- | --- | --- |
| Classical | C^k for a k-th order PDE | None required | Well-posed linear PDEs on smooth domains | Evans ch. 2-4 |
| Weak | H^s Sobolev, usually s \geq 1 | H^1_0 or C^\infty_c | Linear PDEs with rough data or singular geometry | Evans ch. 5-6, Brezis ch. 8-9 |
| Viscosity | Continuous (bounded) | Smooth test functions touching u | Fully nonlinear first- and second-order PDEs, HJB | Crandall-Lions 1983 |
| Distributional | Distribution (possibly not a function) | Schwartz / C^\infty_c | PDEs with singular sources, Green's functions, fundamental solutions | Hörmander, The Analysis of Linear Partial Differential Operators |

Where PDEs Embed in ML Systems

| ML system | PDE | Learned object | What the PDE guarantees |
| --- | --- | --- | --- |
| Score-based diffusion | Fokker-Planck forward + Anderson reverse | Score \nabla \log p_t | Exact samples from the data distribution if the score is exact |
| Flow matching | Continuity equation | Velocity field v_\theta(t, x) | Deterministic coupling between base and data |
| Normalizing flows | Continuity equation | Invertible transformation | Exact log-likelihood via change of variables |
| Continuous-time RL | Hamilton-Jacobi-Bellman | Value function V(t, x) | Optimal policy from the greedy choice with respect to \nabla V |
| Score matching | Poisson-like (implicit) | Score \nabla \log p | Log-density up to a constant |
| PINNs | User-specified PDE | Solution u_\theta(x, t) | Physics-consistent interpolation if the loss is zero |
| Neural operators (FNO, DeepONet) | Parametric family of PDEs | Solution map f \mapsto u_f | Fast evaluation of the solution operator |

What Neural PDE Solvers Actually Buy You

Classical solvers (finite differences, finite elements, spectral) are the correct tool for well-posed PDEs in low dimension with simple geometry. They have rigorous error bounds, provable stability, and decades of engineering. Neural solvers are worth using when at least one of the following holds:

  • High dimension. Classical methods scale as O(N^d) grid points. For d \geq 10 (common in stochastic control, Boltzmann equations, quantum chemistry) this is infeasible. Neural parameterizations can break the curse of dimensionality when the target has enough structure (Han, Jentzen, E 2018 on deep BSDE methods for HJB, PNAS 115(34)).

  • Parametric families. An FNO or DeepONet trained across a family of PDE coefficients can amortize the solution cost: inference is one forward pass instead of a full solve. Classical solvers have no analog; each new coefficient requires a new solve.

  • Inverse problems. Backing out unknown coefficients from observations of u is a natural fit for autodiff. PINNs and neural operators can jointly fit data and residual (Raissi, Perdikaris, Karniadakis 2019, Journal of Computational Physics 378, pp 686-707).

  • Implicit access. If the PDE is only given through a simulator (Navier-Stokes with a specific turbulence model, molecular dynamics), classical PDE theory does not apply directly. ML can learn the coarse-grained map.

What neural solvers do not currently deliver: convergence guarantees at classical-solver rates, reliable performance on shock-dominated or highly multiscale problems, or competitive accuracy on standard 2D or 3D benchmarks where finite elements have been tuned for thirty years. The honest reading of the 2020-2025 literature is that neural PDE solvers extend reach into regimes classical methods cannot handle, rather than replacing classical solvers in their home regime.

Worked Example: The DDPM Forward Process Is the Heat Equation on the Data Density

A common point of confusion is how a stochastic process on samples relates to a deterministic PDE. The standard DDPM forward process adds Gaussian noise to each sample independently:

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I).

At the level of an individual trajectory, this is pure noise injection and has no PDE associated with it. But the density p_t(x) of x_t does satisfy a PDE. In the continuous-time limit with variance schedule g(t)^2 = d\beta_t/dt, the density obeys the Fokker-Planck equation for the forward SDE dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dW:

\partial_t p_t(x) \;=\; \tfrac{1}{2}\nabla\cdot\!\big(\beta(t)\, x\, p_t(x)\big) \;+\; \tfrac{1}{2}\beta(t)\,\Delta p_t(x).

In the variance-exploding limit (no drift, pure noise injection), the first term drops and we are left with

\partial_t p_t \;=\; \tfrac{1}{2}\beta(t)\,\Delta p_t,

which is exactly the heat equation with time-varying diffusivity. Take the spatial Fourier transform: every mode decays independently as

\hat{p}_t(k) \;=\; \hat{p}_0(k)\, \exp\!\left(-\tfrac{1}{2}\, |k|^2 \int_0^t \beta(s)\, ds\right).

This is the equation visible in the interactive explorer above (set \beta(t) \equiv 2\alpha to recover the standard heat propagator). The high-frequency content of the data density is attenuated exponentially in |k|^2. Running the process backward in time requires inverting this multiplier, which amplifies high-k noise by the same exponential factor. No finite amount of data lets you recover that information without a prior; the learned score \nabla \log p_t is exactly the object that pins the trajectory to the data manifold during reverse time (Anderson 1982; Song et al. 2021, ICLR).

The takeaway: DDPM training is not "noise prediction for its own sake." It is learning the Green's function of a parabolic PDE that you could, in principle, write in closed form on a trivial domain like [0, 1]^2 but never on the manifold of natural images. The spectral structure visible in the explorer's Fourier pane is literally what every diffusion model internalizes during training.
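The Fourier-multiplier formula is a statement about characteristic functions, so it can be checked by Monte Carlo without any grid. A numpy sketch with \beta(t) = 2t and a bimodal toy "data" distribution (all choices illustrative):

```python
import numpy as np

# The variance-exploding forward process adds independent Gaussian noise of
# variance int_0^t beta(s) ds, so the Fourier transform of the density (the
# characteristic function E[exp(i k x)]) is multiplied by
# exp(-k^2/2 * int beta).  Check by Monte Carlo with beta(t) = 2t.
rng = np.random.default_rng(3)
n, t = 400000, 0.5
x0 = rng.normal(0.0, 0.1, n) + rng.choice([-1.0, 1.0], n)   # two sharp modes
var_added = t ** 2                        # int_0^t 2s ds = t^2
xt = x0 + rng.normal(0.0, np.sqrt(var_added), n)

ks = np.array([1.0, 3.0, 6.0])
phi0 = np.array([np.mean(np.exp(1j * k * x0)) for k in ks])
phit = np.array([np.mean(np.exp(1j * k * xt)) for k in ks])
predicted = phi0 * np.exp(-ks ** 2 * var_added / 2)

print(np.abs(phit - predicted).max())     # small Monte Carlo error
```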

Common Confusions

Watch Out

The Fokker-Planck equation is not the SDE

Fokker-Planck is a deterministic PDE for the density p_t. The SDE is a stochastic equation for the trajectory x_t. Both encode the same Markov process, but they are distinct mathematical objects. Score-based diffusion trains on trajectories (the SDE view) and generates by integrating a reverse SDE, but the theoretical analysis is almost always stated in the density (Fokker-Planck) view.

Watch Out

A PINN is not a PDE solver in the classical sense

A PINN minimizes a composite loss that includes a PDE residual, evaluated at sampled collocation points. Minimizing the loss to zero on a finite point set does not imply the PDE holds pointwise, and no classical PINN formulation has error bounds that scale like finite element or spectral methods. Treat a PINN as a regularized regression with a physics-inspired penalty, not as a convergent numerical scheme.

Watch Out

Flow matching is deterministic; diffusion is stochastic

Flow matching trains a velocity field for a continuity equation. At inference you solve an ODE with no noise. Diffusion trains a score for a Fokker-Planck reverse SDE. At inference you integrate an SDE with injected noise. The two produce samples from the same distribution (when trained well) but have different variance properties and sampler trade-offs. The continuity equation and the Fokker-Planck equation are related: Fokker-Planck with zero diffusion is the continuity equation.

Watch Out

Viscosity solutions are not solutions that got smoothed

The name is historical: Crandall and Lions introduced the definition via a vanishing-viscosity argument (\varepsilon \Delta u added, \varepsilon \to 0). The resulting notion of solution, however, is purely algebraic and does not require any actual diffusion. Viscosity solutions are the correct notion of solution for fully nonlinear first-order and second-order PDEs, including HJB. Nothing in the definition involves smoothing.

Watch Out

Neural operators do not learn PDEs; they learn solution maps

A Fourier Neural Operator (Li et al. 2021, arXiv:2010.08895) learns a mapping a \mapsto u_a from PDE coefficients to PDE solutions, trained on a dataset of (coefficient, solution) pairs obtained by running a classical solver. It does not learn what a PDE is. Without the classical solver, there is no training data. The "neural" part accelerates repeated evaluation of an already-understood solution map; it does not replace the PDE model.

Summary

  • PDEs are local constraints on functions. The three archetypes (elliptic, parabolic, hyperbolic) dictate what solutions look like and which numerical and ML methods are stable.
  • Six PDEs recur in ML: heat, Fokker-Planck, continuity, Hamilton-Jacobi-Bellman, Poisson, Burgers'. Diffusion models, flow matching, RL, and score matching are each specific ML incarnations of one of these.
  • Classical solutions are rarely available for real problems. Weak, viscosity, and distributional solutions are the right formal objects.
  • Neural solvers extend PDE reach into high dimension, parametric families, inverse problems, and simulator-only settings, and do not compete with classical solvers in their home regime as of 2025.
  • The correct mental model: an ML system that learns a score, a velocity, or a value is learning a specific field whose mathematical existence and meaning are given by a classical PDE theorem.

Exercises

ExerciseCore

Problem

Starting from the Ito SDE dx_t = -\nabla V(x_t) \, dt + \sqrt{2 \beta^{-1}} \, dW_t (overdamped Langevin dynamics with potential V and inverse temperature \beta), write out the Fokker-Planck equation for the density p_t and identify the stationary distribution.

ExerciseCore

Problem

The heat equation \partial_t u = \tfrac{1}{2} \Delta u on \mathbb{R}^n with initial data u_0 has solution u(t, x) = (G_t * u_0)(x), where G_t is the Gaussian kernel of variance t I. Explain in what sense Gaussian smoothing of an image is a heat-equation simulation, and what "time" corresponds to in standard image-processing notation.

ExerciseAdvanced

Problem

Starting from the forward SDE dx = -\tfrac{1}{2} \beta(t) x \, dt + \sqrt{\beta(t)} \, dW (the variance-preserving diffusion used in DDPM), derive the Anderson reverse-time SDE. Identify the drift correction that makes the reverse SDE generate samples from the initial density.

ExerciseResearch

Problem

Consider training a Fourier Neural Operator on a parametric family of Poisson equations \Delta u = f on [0, 1]^2 with periodic boundary conditions, where f is drawn from a Gaussian random field prior. Explain why the FNO's translation-invariant kernel parameterization is a particularly good fit for this problem class, and identify two concrete settings where that fit breaks down.

References

Canonical PDE texts:

  • Evans, Partial Differential Equations (2nd ed., AMS 2010). Chapter 2 for the three archetypes; Chapter 5 for Sobolev spaces and weak solutions; Chapter 10 for Hamilton-Jacobi and viscosity solutions.
  • Brezis, Functional Analysis, Sobolev Spaces and Partial Differential Equations (Springer 2011). Chapters 8-9 for Sobolev theory; Chapter 10 for evolution equations.
  • Strauss, Partial Differential Equations: An Introduction (2nd ed., Wiley 2007). Chapters 1-5 for classification and the three archetypes at an introductory level.

Stochastic and Kolmogorov-forward theory:

  • Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations (Springer 2014). Chapter 2 for the Fokker-Planck derivation; Chapter 4 for stationary distributions; Chapter 6 for overdamped Langevin.
  • Øksendal, Stochastic Differential Equations (6th ed., Springer 2003). Chapters 7-8 for the Kolmogorov forward and backward equations.
  • Anderson, "Reverse-time diffusion equation models" (Stochastic Processes and their Applications, 12(3), pp 313-326, 1982). Primary source for the reverse-time SDE formula.
  • Hörmander, "Hypoelliptic second order differential equations" (Acta Mathematica, 119, pp 147-171, 1967). For degenerate diffusion operators.

Viscosity solutions and HJB:

  • Crandall, Lions, "Viscosity solutions of Hamilton-Jacobi equations" (Transactions of the American Mathematical Society, 277(1), pp 1-42, 1983). The defining paper.
  • Fleming, Soner, Controlled Markov Processes and Viscosity Solutions (2nd ed., Springer 2006). Chapters 2-3 for HJB theory with applications to control.

Optimal transport and continuity equation:

  • Villani, Topics in Optimal Transportation (AMS 2003). Chapters 1-2 for Monge-Kantorovich.
  • Villani, Optimal Transport: Old and New (Springer 2008). Chapter 23 for Wasserstein geometry and gradient flows.
  • Benamou, Brenier, "A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem" (Numerische Mathematik, 84(3), pp 375-393, 2000). Dynamical formulation of optimal transport as a constrained minimization over continuity-equation-compatible velocity fields.

Machine learning meets PDEs (current):

  • Raissi, Perdikaris, Karniadakis, "Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations" (Journal of Computational Physics, 378, pp 686-707, 2019). The reference PINN paper.
  • Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, Anandkumar, "Fourier Neural Operator for Parametric Partial Differential Equations" (ICLR 2021, arXiv:2010.08895).
  • Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, Anandkumar, "Neural Operator: Learning Maps Between Function Spaces" (JMLR 24, 2023, arXiv:2108.08481). Unifying framework for FNO, DeepONet, and related architectures.
  • Lu, Jin, Karniadakis, "DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators" (Nature Machine Intelligence 3, pp 218-229, 2021, arXiv:1910.03193).
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021, arXiv:2011.13456). Unifies score matching and Anderson reverse SDE as the generative framework.
  • Lipman, Chen, Ben-Hamu, Nickel, Le, "Flow Matching for Generative Modeling" (ICLR 2023, arXiv:2210.02747). Training objective for continuity-equation velocity fields.
  • Han, Jentzen, E, "Solving high-dimensional partial differential equations using deep learning" (PNAS, 115(34), pp 8505-8510, 2018). Deep BSDE method for HJB in high dimension.
  • Karniadakis, Kevrekidis, Lu, Perdikaris, Wang, Yang, "Physics-informed machine learning" (Nature Reviews Physics, 3, pp 422-440, 2021). Overview of the PINN and neural operator landscape.

Next Topics

  • Physics-informed neural networks: the direct application of this material to solving PDEs with neural loss functions.
  • Diffusion models: score-based generative modeling, where Fokker-Planck and the Anderson reverse SDE are the operational equations.
  • Flow matching: continuity-equation-based generative modeling with deterministic inference.
  • Neural ODEs: continuous-depth networks, adjacent to the neural operator and PDE literature.

Last reviewed: April 18, 2026
