
Mathematical Infrastructure

PDE Fundamentals for Machine Learning

The partial differential equations that appear in modern machine learning: heat and Fokker-Planck for diffusion, continuity for flow matching, Hamilton-Jacobi-Bellman for reinforcement learning, Poisson for score matching. Classification, solution concepts, and where ML actually needs PDE theory versus where it just uses the vocabulary.

Advanced · Tier 2 · Stable · ~70 min

Why This Matters

Spectral PDE explorer
Four classical PDEs solved exactly in your browser: no time-stepping, one FFT per frame, on [0, 1]² at N = 128.
What is this?

A live solver for four classical partial differential equations, running entirely in your browser. The left pane is the solution u(x,y,t). The right pane is its Fourier spectrum |\hat{u}(k,t)| on a log scale. Both update together as you drag the time slider, switch PDE modes, or paint on the field.

What makes it unusual: instead of stepping time forward in tiny numerical increments (the standard PDE-solver approach), this evaluates the exact closed-form solution at any time via a single Fourier multiplier and one inverse FFT per frame. That is why you can scrub time freely, including backwards (scrubbing past t = 0 in heat mode reproduces the ill-posedness that motivates DDPM and the score models that learn \nabla \log p_t).

Try the guided tour button in the top-right to watch all four modes and the live neural-net fit without touching anything. Or click and start painting heat onto the field.

[Explorer panes] Left: the spatial field u(x,y,t), which diffuses with blur radius \sigma = \sqrt{2\alpha t} (click + drag paints heat, shift + drag erases; the domain is periodic). Right: |\hat{u}(k,t)| on a log scale, DC centered, with rings at |k| = 8, 16, 32, 48; each mode decays as e^{-\alpha(2\pi)^2|k|^2 t}, so high-|k| modes die first. A third plot tracks the energy \|u(t)\|_2^2 / \|u(0)\|_2^2 on a log axis, with expected slope \approx -2\alpha(2\pi)^2 |k_{\text{eff}}|^2. Controls: time t, diffusivity \alpha \in [0.001, 0.08], PDE archetype, and initial condition. In heat mode the propagator is \hat{u}(k,t) = \hat{u}(k,0)\, e^{-\alpha(2\pi)^2 |k|^2 t}.
why ML cares about this equation
  • Gaussian blur with radius \sigma is this simulation at t = \sigma^2 / (2\alpha). Scale space in computer vision (Witkin 1983) is the heat equation in disguise.
  • The DDPM / score-SDE forward process diffuses the data density p_t. Every Fourier mode decays as e^{-\alpha|k|^2 t}: high-frequency detail dies first. Watch the letter A collapse into a blob. This is what Gaussian noise does to real images.
  • Scrub time backward. Each mode is now multiplied by e^{+\alpha|k|^2 t}, and high-|k| modes blow up exponentially. This is why naive reverse diffusion is ill-posed and a learned score \nabla \log p_t is required to stay on the data manifold (Anderson 1982; Song 2021).
  • A Fourier Neural Operator layer is this solver with the radial multiplier learned rather than closed-form (Li et al. 2021). The attenuation visible in the |\hat{u}(k)| pane is the target those layers learn to reproduce.
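The explorer's update rule is small enough to reproduce. A minimal numpy sketch of the same closed-form spectral solve (one Fourier multiplier, one inverse FFT); the grid size and constants mirror the demo but are our own choices, not taken from its source:

```python
import numpy as np

# One closed-form heat-equation solve on the periodic unit square:
# u_hat(k, t) = u_hat(k, 0) * exp(-alpha (2 pi)^2 |k|^2 t), then one inverse FFT.
N, alpha, t = 128, 0.015, 0.05
x = np.linspace(0.0, 1.0, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u0 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.005)   # a hot spot

k = np.fft.fftfreq(N, d=1.0 / N)             # integer wavenumbers
k2 = k[:, None] ** 2 + k[None, :] ** 2       # |k|^2 on the grid
multiplier = np.exp(-alpha * (2 * np.pi) ** 2 * k2 * t)
u_t = np.fft.ifft2(np.fft.fft2(u0) * multiplier).real

# DC mode is untouched (mass conserved); every other mode shrinks (energy decays)
print(u0.mean(), u_t.mean(), (u_t ** 2).sum() / (u0 ** 2).sum())
```

Scrubbing time is just re-evaluating `multiplier` at a different `t`, which is why no time-stepping is needed.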
What the left pane shows

The field u(x,y,t) on the periodic square [0,1]^2, rendered directly from the current solution. Paint heat with click-drag; shift-drag erases. Edges wrap around.

What the right pane shows

The log-magnitude Fourier spectrum \log|\hat{u}(k,t)|, with DC centered so k = 0 sits at the middle. Rings mark |k| = 8, 16, 32, 48.

Try this
  • Scrub time past 0 into negative values
  • Switch to the mode IC and slide k_x, k_y
  • Paint, then press ▶
Can a neural net learn what the FFT just computed? SIREN 2→24→24→1 · \omega_0 = 30

A tiny neural network (two hidden layers, sine activations) trains live to match the exact solver's output. Press ▶ train and watch the prediction improve, the error heatmap shrink, and the loss curve decay on a log axis.

The point: exact solvers give machine-precision answers in microseconds; a learned approximator converges slowly to a few-percent RMS and stops there — limited by its ~1,200 parameters, not by compute. This is the honest baseline neural PDE methods fight against on easy problems, and the reason they only win on inverse problems or high-dimensional domains where classical solvers can't go.

This demo fits a supervised target. A true PINN (Raissi 2019) swaps the target for the PDE residual u_t - \alpha\,\Delta u and never sees ground truth; it needs autograd for second derivatives, which is heavier than what runs here. The PINN failure modes appear later on this page.

Spectral 2D solver · 128² grid · radix-2 FFT · closed-form propagator. Keyboard: Space = play/pause · R = reset · 1–4 = mode · C = contours.

Robby Sneiderman · @Robby955 · MIT

Most of modern generative ML and scientific ML is quietly about partial differential equations. A diffusion model is a time-discretized simulation of a Fokker-Planck equation run backward. A flow matching model learns the velocity field of a continuity equation. A value function in reinforcement learning satisfies a Hamilton-Jacobi-Bellman equation. A score network learns \nabla \log p, which is itself a gradient of a Poisson-like object. The PINN literature and the neural operator literature (Fourier Neural Operator, DeepONet) are explicit about their PDE roots; the generative modeling literature often is not.

This page assembles the PDE material that recurs across these systems. It is not a PDE course. It is a reference that names the objects, states what they guarantee and what they do not, and points to the exact place where each PDE shows up inside a working ML model.

Mental Model

A PDE is a local constraint on an unknown function. Given u : \Omega \to \mathbb{R} on a domain \Omega \subseteq \mathbb{R}^n, a PDE of order k is an equation

F\!\left(x, u(x), \nabla u(x), \nabla^2 u(x), \ldots, \nabla^k u(x)\right) = 0, \qquad x \in \Omega,

together with boundary and initial conditions that pin down which solution is meant. The content of PDE theory is twofold: which functions u satisfy the equation (existence, regularity), and how the solution depends on the data (stability, uniqueness). For machine learning purposes, the second question matters more, because ML systems either learn the solution map directly (neural operators), or learn the velocity/score field whose existence is guaranteed by a classical PDE theorem.

The operational shift in ML is that we rarely write down and solve a PDE. We write down an ML loss whose minimizer, at population level, is exactly the classical solution or its velocity/score representation. The PDE is then a sanity check on what the loss is asking for and on what failures of the model imply about the learned object.

Classification: The Three Archetypes

Linear second-order PDEs on \mathbb{R}^n with constant coefficients take the form

\sum_{i,j} a_{ij} \, \partial_i \partial_j u + \sum_i b_i \, \partial_i u + c u = f.

Let A = (a_{ij}) be the symbol matrix. Assume A is symmetric (always achievable after symmetrization).

Theorem

Classification of Second-Order Linear PDEs

Statement

A second-order linear PDE with symbol A is:

  • Elliptic if A is definite (all eigenvalues nonzero and of the same sign). Archetype: Laplace equation \Delta u = 0.
  • Parabolic if A is degenerate with one zero eigenvalue and the remaining n-1 of the same sign, and the first-order term in the missing direction is nonzero. Archetype: heat equation \partial_t u = \Delta u.
  • Hyperbolic if A is nondegenerate with eigenvalues of mixed sign. Archetype: wave equation \partial_{tt} u = \Delta u.

Intuition

Elliptic PDEs describe equilibria: no preferred time direction, and the solution at any interior point is an average of its boundary data. Parabolic PDEs describe smoothing and diffusion: information propagates in one direction of time, and discontinuities are smoothed out. Hyperbolic PDEs describe wave-like propagation: information travels along characteristics at finite speed, and singularities persist.

Why It Matters

The archetype dictates what the solution can look like, which numerical methods are stable, and how ML should treat the problem. A PINN that works well for the heat equation can fail dramatically on the wave equation because gradient descent propagates information differently than the PDE does. Flow matching is effectively a first-order hyperbolic problem (a transport equation); diffusion model forward processes are parabolic. Mixing these up in an ML pipeline is a source of silent bugs.

Failure Mode

Most real PDEs in ML are nonlinear (Fokker-Planck with nonlinear drift, Hamilton-Jacobi-Bellman, Burgers') or first-order (continuity, transport), and do not fit cleanly into the linear-second-order classification. The classification is a ladder to climb onto, not a complete taxonomy. Viscosity solutions (Crandall-Lions 1983) were invented precisely because nonlinear first-order PDEs need a weaker notion of solution.

| Archetype | Symbol matrix A | Canonical PDE | Information flow | Typical behavior | ML counterpart |
| --- | --- | --- | --- | --- | --- |
| Elliptic | Definite (all eigenvalues same sign) | \Delta u = f | Instantaneous everywhere | Smoothing, boundary-averaged | Spectral clustering, Poisson solves, FNO benchmarks |
| Parabolic | One zero eigenvalue, rest same sign | \partial_t u = \Delta u | Forward in time only | Exponential smoothing, diffusive | Diffusion-model forward process, Gaussian blur |
| Hyperbolic | Mixed-sign eigenvalues | \partial_{tt} u = \Delta u | Along characteristics at finite speed | Wave propagation, shock formation | Flow matching (first-order transport), Burgers' benchmarks |

Six PDEs That Matter for Machine Learning

1. Heat equation (parabolic)

\partial_t u = \frac{1}{2} \Delta u, \qquad u(0, x) = u_0(x).

The closed-form solution is convolution with a Gaussian kernel of variance t:

u(t, x) = (G_t * u_0)(x), \qquad G_t(x) = (2 \pi t)^{-n/2} e^{-\|x\|^2 / (2t)}.

Where it appears in ML. The forward pass of a continuous-time diffusion model with constant diffusion coefficient is the heat equation acting on the data density (Song et al. 2021, arXiv:2011.13456). The learned score \nabla \log p_t is the gradient of the log-density of a heat-equation solution started at the data distribution. Gaussian smoothing in image processing evaluates this exact heat flow at a single time.
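The convolution formula can be checked numerically. A small numpy sketch with a Gaussian initial condition, so the exact answer is known (blurring the N(0, s^2) density by the heat kernel of variance t gives the N(0, s^2 + t) density); the discretization choices are ours:

```python
import numpy as np

# Verify u(t, .) = G_t * u0 for the heat equation dt u = (1/2) Laplacian u.
x = np.linspace(-10.0, 10.0, 2001)   # odd length keeps the convolution centered
dx = x[1] - x[0]
s2, t = 0.5, 0.3

def gaussian(x, var):
    return np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

u0 = gaussian(x, s2)                                       # initial density
u_t = np.convolve(u0, gaussian(x, t), mode="same") * dx    # (G_t * u0)(x)
exact = gaussian(x, s2 + t)                                # known closed form

print(np.abs(u_t - exact).max())    # tiny: the heat flow is exactly a blur
```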

2. Fokker-Planck equation (parabolic, linear in the density)

Theorem

Fokker-Planck Equation for an SDE

Statement

The density p_t(x) of the solution to dx = f(x, t) \, dt + g(x, t) \, dW_t with initial density p_0 satisfies the Fokker-Planck (or Kolmogorov forward) equation

\partial_t p = -\nabla \cdot (f \, p) + \tfrac{1}{2} \nabla \cdot \nabla \cdot (g g^\top p),

with initial condition p(0, x) = p_0(x). Here the second term is shorthand for \sum_{i,j} \partial_i \partial_j \big( (g g^\top)_{ij} \, p \big).

Intuition

Probability mass is conserved. The first term is pure transport of mass along the drift f. The second term is pure diffusion at rate g g^\top. Fokker-Planck is a continuity equation for a probability density with an added diffusive flux.

Proof Sketch

Apply Ito's lemma to a test function \varphi(x), take expectations, and integrate by parts twice to move derivatives off \varphi and onto p. The adjoint of the infinitesimal generator \mathcal{L} = f \cdot \nabla + \tfrac{1}{2} g g^\top : \nabla^2 is the Fokker-Planck operator. See Pavliotis, Stochastic Processes and Applications (Springer 2014), Theorem 2.4.

Why It Matters

Every score-based diffusion model is implicitly solving Fokker-Planck forward to obtain the noised marginals and its time reverse to generate samples. The forward process is dx = f(x, t) \, dt + g(t) \, dW; the learned network approximates \nabla \log p_t, and plugging this into the reverse-time SDE gives generative sampling.

Failure Mode

Fokker-Planck requires enough regularity of f, g for a density to exist. If g g^\top is degenerate (a common case in RL, where control affects only some coordinates), the equation is hypoelliptic rather than elliptic in the spatial variables and needs Hörmander's theorem (1967) to guarantee a smooth density.
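One way to see the SDE/Fokker-Planck correspondence concretely is to simulate trajectories and compare empirical moments against the stationary solution of the Fokker-Planck equation. A rough Euler-Maruyama sketch for overdamped Langevin dynamics; all step sizes and particle counts are illustrative:

```python
import numpy as np

# Euler-Maruyama for dx = -V'(x) dt + sqrt(2/beta) dW.  The Fokker-Planck
# equation has stationary solution p(x) proportional to exp(-beta V(x));
# for V(x) = x^2 / 2 and beta = 1 that is the standard Gaussian.
rng = np.random.default_rng(0)
beta, dt, n_steps, n = 1.0, 5e-3, 2000, 20000
x = rng.normal(0.0, 3.0, size=n)          # start far from equilibrium

for _ in range(n_steps):
    drift = -x                             # -V'(x) for V(x) = x^2 / 2
    x = x + drift * dt + np.sqrt(2.0 * dt / beta) * rng.normal(size=n)

print(x.mean(), x.var())                   # should be near the N(0, 1) moments
```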

Theorem

Anderson Reverse-Time SDE

Statement

The time reversal \bar x_t = x_{T - t} of dx = f(x, t) \, dt + g(t) \, dW solves

d\bar x = \left[\,-f(\bar x, T - t) + g(T - t)^2 \, \nabla \log p_{T - t}(\bar x)\,\right] dt + g(T - t) \, d\bar W,

where \bar W is a Brownian motion in reversed time (Anderson 1982, Stochastic Processes and their Applications, 12(3), pp 313-326).

Intuition

Running an SDE backward in time is not trivial: you need a drift correction by g^2 \nabla \log p to cancel the concentration of probability that the forward process produced. The corrected drift pushes mass back toward the data distribution.

Why It Matters

This is the generative sampling equation for score-based diffusion models. The trained network is s_\theta(x, t) \approx \nabla \log p_t(x); substituting it into the Anderson SDE gives the reverse process used at inference. If the score is learned accurately, Anderson's theorem guarantees the reverse process produces samples from p_0 (Song et al. 2021).

Failure Mode

Score estimation is exact only in expectation. Score errors in low-density regions (far from the data manifold) compound over the reverse integration and produce off-manifold samples. This is the fundamental noise-schedule and sampler-design problem in diffusion models.
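The theorem can be exercised end to end in a setting where the score is available in closed form, so the score error is zero and only discretization error remains. An illustrative 1D numpy sketch with Gaussian data and constant diffusion (all names and constants ours):

```python
import numpy as np

# Forward process dx = sigma dW started from N(0, s0^2), so
# p_t = N(0, s0^2 + sigma^2 t) and the score is exactly
# grad log p_t(x) = -x / (s0^2 + sigma^2 t).  The Anderson reverse SDE with
# this score should transport p_T samples back to N(0, s0^2).
rng = np.random.default_rng(1)
s0, sigma, T, n_steps, n = 0.5, 1.0, 1.0, 1000, 50000
dt = T / n_steps

xbar = rng.normal(0.0, np.sqrt(s0**2 + sigma**2 * T), size=n)  # samples of p_T
for i in range(n_steps):
    t = T - i * dt                                   # forward time being reversed
    score = -xbar / (s0**2 + sigma**2 * t)           # exact grad log p_t
    xbar = xbar + sigma**2 * score * dt + sigma * np.sqrt(dt) * rng.normal(size=n)

print(xbar.var())    # should be near s0^2 = 0.25
```

Replacing the exact `score` with a learned network is precisely what score-based diffusion does; errors in that substitution are what the failure mode above describes.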

3. Continuity equation (first-order hyperbolic)

\partial_t p + \nabla \cdot (v \, p) = 0.

This is Fokker-Planck without the diffusion term. Given a velocity field v(t, x), probability mass is pushed along the flow of v without any smoothing.

Where it appears in ML. Flow matching (Lipman, Chen, Ben-Hamu, Nickel, Le 2023, arXiv:2210.02747), rectified flow, and continuous normalizing flows all train a network v_\theta to satisfy a continuity equation from the data distribution to a tractable base (typically a standard Gaussian). At inference time, you solve the ODE \dot x = v_\theta(t, x) deterministically, which is the method-of-characteristics solution of the continuity equation. Optimal transport via the Benamou-Brenier formulation (2000, Numerische Mathematik) is a constrained minimization over continuity-equation-compatible velocity fields.
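For Gaussian endpoints the marginal velocity field is available in closed form, so the method-of-characteristics picture can be checked directly. A numpy sketch using the independent-coupling linear path (a toy stand-in, not a trained flow-matching model):

```python
import numpy as np

# Linear interpolation x_t = (1-t) x0 + t x1 between N(0, 1) and N(0, s1^2)
# with independent endpoints has Gaussian marginals N(0, sig_t^2),
# sig_t^2 = (1-t)^2 + t^2 s1^2, and marginal velocity
# v(t, x) = (1/2) (d sig_t^2/dt) / sig_t^2 * x.  Integrating the ODE
# dx/dt = v(t, x) (no noise) should push base samples to N(0, s1^2).
rng = np.random.default_rng(2)
s1, n_steps, n = 2.0, 1000, 100000
dt = 1.0 / n_steps

x = rng.normal(size=n)                      # base samples at t = 0
for i in range(n_steps):
    t = i * dt
    sig2 = (1 - t) ** 2 + (t * s1) ** 2
    dsig2 = -2 * (1 - t) + 2 * t * s1**2    # d(sig_t^2)/dt
    x = x + (0.5 * dsig2 / sig2) * x * dt   # deterministic Euler step

print(x.var())    # should be near s1^2 = 4
```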

4. Hamilton-Jacobi-Bellman equation (fully nonlinear; first-order in the deterministic case, second-order with diffusion)

For a continuous-time stochastic control problem with running cost \ell and terminal cost \Phi, the value function V(t, x) satisfies

\partial_t V + \min_u \big\{\, \ell(x, u) + f(x, u) \cdot \nabla V + \tfrac{1}{2} \mathrm{tr}(g g^\top \nabla^2 V) \,\big\} = 0, \quad V(T, x) = \Phi(x).

Where it appears in ML. Every continuous-time reinforcement learning formulation reduces to an HJB equation under a sign flip. Control theory minimizes cost, so HJB has a \min_u; RL maximizes reward, so the RL version has a \max_u and the value function changes sign accordingly. The discrete-time Bellman equation Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') is the backward Euler discretization of the reward-form HJB with a sample-based replacement of the expectation. Entropy-regularized RL (soft Q-learning, soft actor-critic) replaces \max_a Q by the soft maximum \alpha \log \sum_a \exp(Q / \alpha), giving a soft HJB equation whose optimal policy has the Gibbs form \pi(a \mid s) \propto \exp(Q(s, a) / \alpha).
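The hard and soft Bellman backups are easy to compare on a toy problem. A numpy sketch with a hand-built two-state MDP (all rewards and transitions invented for illustration):

```python
import numpy as np

# Hard vs. soft Bellman backups on a tiny deterministic MDP: 2 states, 2 actions.
gamma = 0.9
R = np.array([[1.0, 0.0],      # R[s, a]: reward for taking action a in state s
              [0.0, 2.0]])
S_next = np.array([[0, 1],     # S_next[s, a]: deterministic successor state
                   [0, 1]])

def backup(Q, alpha=0.0):
    """One backup.  alpha = 0: hard max; alpha > 0: soft maximum
    alpha * log sum_a exp(Q / alpha) (stabilized logsumexp)."""
    if alpha == 0.0:
        V = Q.max(axis=1)
    else:
        m = Q.max(axis=1)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
    return R + gamma * V[S_next]

Q = np.zeros((2, 2))
Qs = np.zeros((2, 2))
for _ in range(500):            # gamma-contraction: converges geometrically
    Q = backup(Q)
    Qs = backup(Qs, alpha=0.01)

print(Q)                        # hard fixed point
print(np.abs(Qs - Q).max())     # soft stays within alpha*log(2)/(1-gamma) of hard
```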

The HJB equation is fully nonlinear and does not generally admit classical (differentiable) solutions. The right notion is Crandall-Lions viscosity solutions (1983, Transactions of the American Mathematical Society, 277(1), pp 1-42).

5. Poisson and Laplace equations (elliptic)

\Delta u = f \quad \text{(Poisson)}, \qquad \Delta u = 0 \quad \text{(Laplace)}.

The fundamental solution of \Delta \Phi = \delta_0 on \mathbb{R}^n is \Phi(x) = -c_n \|x\|^{2-n} for n \geq 3 (with c_n = [(n-2)\,\omega_{n-1}]^{-1} and \omega_{n-1} the surface area of the unit sphere) and \Phi(x) = \tfrac{1}{2\pi} \log \|x\| for n = 2. Any sufficiently decaying solution of Poisson can be written as a convolution with \Phi; the negative sign for n \geq 3 and the positive sign for n = 2 come from the sign of the Laplacian acting on these radially symmetric profiles and are not a typo.

Where it appears in ML. The discrete Laplace operator is the core object in spectral clustering, graph neural networks, and manifold learning: eigenvectors of the graph Laplacian approximate eigenfunctions of the Laplace-Beltrami operator on the underlying manifold (Belkin and Niyogi 2003, Neural Computation 15(6), pp 1373-1396). Poisson equations also arise directly in physics-based ML applications: solving -\Delta u = f for potential fields is the canonical linear PDE benchmark for FNOs and classical PINN tasks, because the exact solution operator is a translation-invariant convolution, which diagonal-in-Fourier methods match exactly.
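The "diagonal in Fourier" claim is directly checkable: on the periodic square, the Poisson solve is division by (2\pi)^2 |k|^2 in frequency space. A numpy sketch with a manufactured solution:

```python
import numpy as np

# Solve -Laplacian(u) = f on periodic [0,1]^2 spectrally:
# u_hat(k) = f_hat(k) / ((2 pi)^2 |k|^2), DC mode fixed by the mean-zero gauge.
N = 64
x = np.linspace(0.0, 1.0, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

u_true = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)
f = (2 * np.pi) ** 2 * (1 + 4) * u_true          # f = -Laplacian(u_true)

k = np.fft.fftfreq(N, d=1.0 / N)
k2 = (2 * np.pi) ** 2 * (k[:, None] ** 2 + k[None, :] ** 2)
k2[0, 0] = 1.0                                    # avoid division by zero at DC
u_hat = np.fft.fft2(f) / k2
u_hat[0, 0] = 0.0                                 # gauge: mean(u) = 0
u = np.fft.ifft2(u_hat).real

print(np.abs(u - u_true).max())                   # machine precision
```

An FNO layer parameterizes exactly this kind of mode-wise multiplier, which is why the Poisson family is its easiest benchmark.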

6. Burgers' equation (nonlinear, parabolic limit of inviscid hyperbolic)

\partial_t u + u \, \partial_x u = \nu \, \partial_{xx} u.

At \nu = 0 this is inviscid Burgers, which develops shocks (discontinuities) in finite time even for smooth initial data. For \nu > 0 the equation has a Cole-Hopf transformation to the heat equation.

Where it appears in ML. Burgers' is the standard benchmark problem for PINNs and neural operators because it exhibits shocks: a regime where neural PDE solvers either handle or dramatically fail to handle a nonlinear, singular feature. If a method cannot solve Burgers' as \nu \to 0, it will not solve Navier-Stokes, which has the same nonlinear advection plus a coupling and a pressure constraint.
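Shock formation is visible even with a crude scheme. A rough finite-difference sketch of viscous Burgers (forward Euler with central differences; step sizes chosen for stability, not accuracy, and not a production solver):

```python
import numpy as np

# Viscous Burgers u_t + u u_x = nu u_xx on the periodic unit interval.
# A smooth sine steepens toward a shock; nu > 0 keeps the gradient finite.
N, nu, dt, T = 256, 0.01, 1e-4, 0.2
dx = 1.0 / N
x = np.arange(N) * dx
u = np.sin(2 * np.pi * x)

def max_slope(u):
    return np.abs((np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)).max()

slope0 = max_slope(u)                        # about 2*pi initially
for _ in range(int(round(T / dt))):
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)        # central u_x
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # central u_xx
    u = u + dt * (-u * ux + nu * uxx)

print(slope0, max_slope(u))                  # the gradient steepens sharply
```

Shrinking `nu` sharpens the shock until the grid can no longer resolve it, which is the regime where both classical schemes and neural solvers start to struggle.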

Notions of Solution

Classical solutions are differentiable enough to plug into the PDE pointwise. Real applied PDEs usually do not have classical solutions, and one of the weaker notions below is what the theory actually guarantees.

Definition

Weak Solution

A weak solution of Lu = f on \Omega is a function u in a Sobolev space H^s(\Omega) for which

\int_\Omega u \, L^* \varphi \, dx = \int_\Omega f \, \varphi \, dx

for every test function \varphi compactly supported in \Omega, where L^* is the formal adjoint of L. Derivatives are moved onto the smooth test function via integration by parts; u is allowed to be nondifferentiable in the strong sense.

Definition

Viscosity Solution

A viscosity solution of a fully nonlinear first-order or second-order PDE is defined by the Crandall-Lions 1983 test-function conditions that generalize the maximum principle. Intuitively, u is a viscosity solution if, for every smooth \varphi touching u from above (resp. below) at a point x_0, the PDE holds at x_0 with \varphi in place of u and the appropriate inequality. Viscosity solutions exist and are unique under mild assumptions for HJB and many nonlinear first-order PDEs where classical and weak theories both fail.
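A concrete instance: the 1D eikonal equation |u'| = 1 with zero boundary data has many almost-everywhere-differentiable "solutions", but the monotone upwind scheme converges to the unique viscosity solution, the distance function. A short numpy sketch:

```python
import numpy as np

# 1D eikonal |u'(x)| = 1 on (0, 1), u(0) = u(1) = 0.  The monotone upwind
# (fast-sweeping) update u[i] = min(u[i-1], u[i+1]) + h selects the distance
# function min(x, 1 - x); spurious a.e. solutions like -min(x, 1 - x) are
# never produced.
N = 101
h = 1.0 / (N - 1)
x = np.linspace(0.0, 1.0, N)
u = np.full(N, 1e9)
u[0] = u[-1] = 0.0

for _ in range(2):                         # two sweep passes suffice in 1D
    for i in range(1, N - 1):              # left-to-right sweep
        u[i] = min(u[i], min(u[i - 1], u[i + 1]) + h)
    for i in range(N - 2, 0, -1):          # right-to-left sweep
        u[i] = min(u[i], min(u[i - 1], u[i + 1]) + h)

print(np.abs(u - np.minimum(x, 1.0 - x)).max())   # zero up to roundoff
```

The monotonicity of the update is what encodes the viscosity-solution selection principle numerically.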

Definition

Distributional Solution

A distributional solution is a weak solution where the test space is Schwartz functions (or compactly supported smooth functions) and u is interpreted as a distribution. This is the setting for PDEs with singular source terms or point masses, including Green's functions themselves.

| Solution notion | Regularity of u | Test function space | Typical use case | Key reference |
| --- | --- | --- | --- | --- |
| Classical | C^k for a k-th order PDE | None required | Well-posed linear PDEs on smooth domains | Evans ch. 2-4 |
| Weak | H^s Sobolev, usually s \geq 1 | H^1_0 or C^\infty_c | Linear PDEs with rough data or singular geometry | Evans ch. 5-6, Brezis ch. 8-9 |
| Viscosity | Continuous (bounded) | Smooth test functions touching u | Fully nonlinear first- and second-order PDEs, HJB | Crandall-Lions 1983 |
| Distributional | Distribution (possibly not a function) | Schwartz / C^\infty_c | PDEs with singular sources, Green's functions, fundamental solutions | Hörmander, The Analysis of Linear Partial Differential Operators |

Where PDEs Embed in ML Systems

| ML system | PDE | Learned object | What the PDE guarantees |
| --- | --- | --- | --- |
| Score-based diffusion | Fokker-Planck forward + Anderson reverse | Score \nabla \log p_t | Exact samples from the data distribution if the score is exact |
| Flow matching | Continuity equation | Velocity field v_\theta(t, x) | Deterministic coupling between base and data |
| Normalizing flows | Continuity equation | Invertible transformation | Exact log-likelihood via change of variables |
| Continuous-time RL | Hamilton-Jacobi-Bellman | Value function V(t, x) | Optimal policy from the greedy choice with respect to \nabla V |
| Score matching | Poisson-like (implicit) | Score \nabla \log p | Log-density up to a constant |
| PINNs | User-specified PDE | Solution u_\theta(x, t) | Physics-consistent interpolation if the loss is zero |
| Neural operators (FNO, DeepONet) | Parametric family of PDEs | Solution map f \mapsto u_f | Fast evaluation of the solution operator |

What Neural PDE Solvers Actually Buy You

Classical solvers (finite differences, finite elements, spectral) are the correct tool for well-posed PDEs in low dimension with simple geometry. They have rigorous error bounds, provable stability, and decades of engineering. Neural solvers are worth using when at least one of the following holds:

  • High dimension. Classical methods scale as O(N^d) grid points. For d \geq 10 (common in stochastic control, Boltzmann equations, quantum chemistry) this is infeasible. Neural parameterizations can break the curse of dimensionality when the target has enough structure (Han, Jentzen, E 2018 on deep BSDE methods for HJB, PNAS 115(34)).

  • Parametric families. An FNO or DeepONet trained across a family of PDE coefficients can amortize the solution cost: inference is one forward pass instead of a full solve. Classical solvers have no analog; each new coefficient requires a new solve.

  • Inverse problems. Backing out unknown coefficients from observations of u is a natural fit for autodiff. PINNs and neural operators can jointly fit data and residual (Raissi, Perdikaris, Karniadakis 2019, Journal of Computational Physics 378, pp 686-707).

  • Implicit access. If the PDE is only given through a simulator (Navier-Stokes with a specific turbulence model, molecular dynamics), classical PDE theory does not apply directly. ML can learn the coarse-grained map.

What neural solvers do not currently deliver: convergence guarantees at classical-solver rates, reliable performance on shock-dominated or highly multiscale problems, or competitive accuracy on standard 2D or 3D benchmarks where finite elements have been tuned for thirty years. The honest reading of the 2020-2025 literature is that neural PDE solvers extend reach into regimes classical methods cannot handle, rather than replacing classical solvers in their home regime.

Worked Example: The DDPM Forward Process Is the Heat Equation on the Data Density

A common point of confusion is how a stochastic process on samples relates to a deterministic PDE. The standard DDPM forward process adds Gaussian noise to each sample independently:

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I).

At the level of an individual trajectory, this is pure noise injection and has no PDE associated with it. But the density p_t(x) of x_t does satisfy a PDE. In the continuous-time limit with variance schedule g(t)^2 = d\beta_t/dt, the density obeys the Fokker-Planck equation for the forward SDE dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dW:

\partial_t p_t(x) \;=\; \tfrac{1}{2}\nabla\cdot\!\big(\beta(t)\, x\, p_t(x)\big) \;+\; \tfrac{1}{2}\beta(t)\,\Delta p_t(x).

In the variance-exploding limit (no drift, pure noise injection), the first term drops and we are left with

\partial_t p_t \;=\; \tfrac{1}{2}\beta(t)\,\Delta p_t,

which is exactly the heat equation with time-varying diffusivity. Take the spatial Fourier transform: every mode decays independently as

\hat{p}_t(k) \;=\; \hat{p}_0(k)\, \exp\!\left(-\tfrac{1}{2}\, |k|^2 \int_0^t \beta(s)\, ds\right).

This is the equation visible in the interactive explorer above (set \beta(t) \equiv 2\alpha to recover the standard heat propagator). The high-frequency content of the data density is attenuated exponentially in |k|^2. Running the process backward in time requires inverting this multiplier, which amplifies high-k noise by the same exponential factor. No finite amount of data lets you recover that information without a prior; the learned score \nabla \log p_t is exactly the object that pins the trajectory to the data manifold during reverse time (Anderson 1982; Song et al. 2021, ICLR).

The takeaway: DDPM training is not "noise prediction for its own sake." It is learning the Green's function of a parabolic PDE that you could, in principle, write in closed form on a trivial domain like [0, 1]^2 but never on the manifold of natural images. The spectral structure visible in the explorer's Fourier pane is literally what every diffusion model internalizes during training.
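The Fourier-multiplier formula is a statement about characteristic functions, so it can be checked by Monte Carlo without any grid. A numpy sketch with \beta(t) = 2t and a bimodal toy "data" distribution (all choices illustrative):

```python
import numpy as np

# The variance-exploding forward process adds independent Gaussian noise of
# variance int_0^t beta(s) ds, so the Fourier transform of the density (the
# characteristic function E[exp(i k x)]) is multiplied by
# exp(-k^2/2 * int beta).  Check by Monte Carlo with beta(t) = 2t.
rng = np.random.default_rng(3)
n, t = 400000, 0.5
x0 = rng.normal(0.0, 0.1, n) + rng.choice([-1.0, 1.0], n)   # two sharp modes
var_added = t ** 2                        # int_0^t 2s ds = t^2
xt = x0 + rng.normal(0.0, np.sqrt(var_added), n)

ks = np.array([1.0, 3.0, 6.0])
phi0 = np.array([np.mean(np.exp(1j * k * x0)) for k in ks])
phit = np.array([np.mean(np.exp(1j * k * xt)) for k in ks])
predicted = phi0 * np.exp(-ks ** 2 * var_added / 2)

print(np.abs(phit - predicted).max())     # small Monte Carlo error
```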

Common Confusions

Watch Out

The Fokker-Planck equation is not the SDE

Fokker-Planck is a deterministic PDE for the density p_t. The SDE is a stochastic equation for the trajectory x_t. Both encode the same Markov process, but they are distinct mathematical objects. Score-based diffusion trains on trajectories (the SDE view) and generates by integrating a reverse SDE, but the theoretical analysis is almost always stated in the density (Fokker-Planck) view.

Watch Out

A PINN is not a PDE solver in the classical sense

A PINN minimizes a composite loss that includes a PDE residual, evaluated at sampled collocation points. Minimizing the loss to zero on a finite point set does not imply the PDE holds pointwise, and no classical PINN formulation has error bounds that scale like finite element or spectral methods. Treat a PINN as a regularized regression with a physics-inspired penalty, not as a convergent numerical scheme.

Watch Out

Flow matching is deterministic; diffusion is stochastic

Flow matching trains a velocity field for a continuity equation. At inference you solve an ODE with no noise. Diffusion trains a score for a Fokker-Planck reverse SDE. At inference you integrate an SDE with injected noise. The two produce samples from the same distribution (when trained well) but have different variance properties and sampler trade-offs. The continuity equation and the Fokker-Planck equation are related: Fokker-Planck with zero diffusion is the continuity equation.

Watch Out

Viscosity solutions are not solutions that got smoothed

The name is historical: Crandall and Lions introduced the definition via a vanishing-viscosity argument (\varepsilon \Delta u added, \varepsilon \to 0). The resulting notion of solution, however, is purely algebraic and does not require any actual diffusion. Viscosity solutions are the correct notion of solution for fully nonlinear first-order and second-order PDEs, including HJB. Nothing in the definition involves smoothing.

Watch Out

Neural operators do not learn PDEs; they learn solution maps

A Fourier Neural Operator (Li et al. 2021, arXiv:2010.08895) learns a mapping a \mapsto u_a from PDE coefficients to PDE solutions, trained on a dataset of (coefficient, solution) pairs obtained by running a classical solver. It does not learn what a PDE is. Without the classical solver, there is no training data. The "neural" part accelerates repeated evaluation of an already-understood solution map; it does not replace the PDE model.

Summary

  • PDEs are local constraints on functions. The three archetypes (elliptic, parabolic, hyperbolic) dictate what solutions look like and which numerical and ML methods are stable.
  • Six PDEs recur in ML: heat, Fokker-Planck, continuity, Hamilton-Jacobi-Bellman, Poisson, Burgers'. Diffusion models, flow matching, RL, and score matching are each specific ML incarnations of one of these.
  • Classical solutions are rarely available for real problems. Weak, viscosity, and distributional solutions are the right formal objects.
  • Neural solvers extend PDE reach into high dimension, parametric families, inverse problems, and simulator-only settings, and do not compete with classical solvers in their home regime as of 2025.
  • The correct mental model: an ML system that learns a score, a velocity, or a value is learning a specific field whose mathematical existence and meaning are given by a classical PDE theorem.

Exercises

ExerciseCore

Problem

Starting from the Ito SDE dx_t = -\nabla V(x_t) \, dt + \sqrt{2 \beta^{-1}} \, dW_t (overdamped Langevin dynamics with potential V and inverse temperature \beta), write out the Fokker-Planck equation for the density p_t and identify the stationary distribution.

ExerciseCore

Problem

The heat equation \partial_t u = \tfrac{1}{2} \Delta u on \mathbb{R}^n with initial data u_0 has solution u(t, x) = (G_t * u_0)(x), where G_t is the Gaussian kernel of variance t I. Explain in what sense Gaussian smoothing of an image is a heat-equation simulation, and what "time" corresponds to in standard image-processing notation.

ExerciseAdvanced

Problem

Starting from the forward SDE dx = -\tfrac{1}{2} \beta(t) x \, dt + \sqrt{\beta(t)} \, dW (the variance-preserving diffusion used in DDPM), derive the Anderson reverse-time SDE. Identify the drift correction that makes the reverse SDE generate samples from the initial density.

ExerciseResearch

Problem

Consider training a Fourier Neural Operator on a parametric family of Poisson equations \Delta u = f on [0, 1]^2 with periodic boundary conditions, where f is drawn from a Gaussian random field prior. Explain why the FNO's translation-invariant kernel parameterization is a particularly good fit for this problem class, and identify two concrete settings where that fit breaks down.

References

Canonical PDE texts:

  • Evans, Partial Differential Equations (2nd ed., AMS 2010). Chapter 2 for the three archetypes; Chapter 5 for Sobolev spaces and weak solutions; Chapter 10 for Hamilton-Jacobi and viscosity solutions.
  • Brezis, Functional Analysis, Sobolev Spaces and Partial Differential Equations (Springer 2011). Chapters 8-9 for Sobolev theory; Chapter 10 for evolution equations.
  • Strauss, Partial Differential Equations: An Introduction (2nd ed., Wiley 2007). Chapters 1-5 for classification and the three archetypes at an introductory level.

Stochastic and Kolmogorov-forward theory:

  • Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations (Springer 2014). Chapter 2 for the Fokker-Planck derivation; Chapter 4 for stationary distributions; Chapter 6 for overdamped Langevin.
  • Øksendal, Stochastic Differential Equations (6th ed., Springer 2003). Chapters 7-8 for the Kolmogorov forward and backward equations.
  • Anderson, "Reverse-time diffusion equation models" (Stochastic Processes and their Applications, 12(3), pp 313-326, 1982). Primary source for the reverse-time SDE formula.
  • Hörmander, "Hypoelliptic second order differential equations" (Acta Mathematica, 119, pp 147-171, 1967). For degenerate diffusion operators.

Viscosity solutions and HJB:

  • Crandall, Lions, "Viscosity solutions of Hamilton-Jacobi equations" (Transactions of the American Mathematical Society, 277(1), pp 1-42, 1983). The defining paper.
  • Fleming, Soner, Controlled Markov Processes and Viscosity Solutions (2nd ed., Springer 2006). Chapters 2-3 for HJB theory with applications to control.

Optimal transport and continuity equation:

  • Villani, Topics in Optimal Transportation (AMS 2003). Chapters 1-2 for Monge-Kantorovich.
  • Villani, Optimal Transport: Old and New (Springer 2008). Chapter 23 for Wasserstein geometry and gradient flows.
  • Benamou, Brenier, "A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem" (Numerische Mathematik, 84(3), pp 375-393, 2000). Dynamical formulation of optimal transport as a constrained minimization over continuity-equation-compatible velocity fields.

Machine learning meets PDEs (current):

  • Raissi, Perdikaris, Karniadakis, "Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations" (Journal of Computational Physics, 378, pp 686-707, 2019). The reference PINN paper.
  • Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, Anandkumar, "Fourier Neural Operator for Parametric Partial Differential Equations" (ICLR 2021, arXiv:2010.08895).
  • Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, Anandkumar, "Neural Operator: Learning Maps Between Function Spaces" (JMLR 24, 2023, arXiv:2108.08481). Unifying framework for FNO, DeepONet, and related architectures.
  • Lu, Jin, Karniadakis, "DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators" (Nature Machine Intelligence 3, pp 218-229, 2021, arXiv:1910.03193).
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021, arXiv:2011.13456). Unifies score matching and Anderson reverse SDE as the generative framework.
  • Lipman, Chen, Ben-Hamu, Nickel, Le, "Flow Matching for Generative Modeling" (ICLR 2023, arXiv:2210.02747). Training objective for continuity-equation velocity fields.
  • Han, Jentzen, E, "Solving high-dimensional partial differential equations using deep learning" (PNAS, 115(34), pp 8505-8510, 2018). Deep BSDE method for HJB in high dimension.
  • Karniadakis, Kevrekidis, Lu, Perdikaris, Wang, Yang, "Physics-informed machine learning" (Nature Reviews Physics, 3, pp 422-440, 2021). Overview of the PINN and neural operator landscape.

Next Topics

  • Physics-informed neural networks: the direct application of this material to solving PDEs with neural loss functions.
  • Diffusion models: score-based generative modeling, where Fokker-Planck and the Anderson reverse SDE are the operational equations.
  • Flow matching: continuity-equation-based generative modeling with deterministic inference.
  • Neural ODEs: continuous-depth networks, adjacent to the neural operator and PDE literature.

Last reviewed: April 18, 2026
