

DeepONet

DeepONet (Lu, Karniadakis et al., 2021) approximates nonlinear operators between function spaces by splitting a network into a branch (encoding the input function at fixed sensors) and a trunk (encoding query coordinates), then taking an inner product. The architecture is the practical realization of Chen and Chen's 1995 universal approximation theorem for operators.


Why This Matters

DeepONet (Lu, Jin, Pang, Zhang, and Karniadakis, Nature Machine Intelligence 2021) is the operator-learning architecture grounded in a theorem that predates it by 26 years: Chen and Chen's 1995 universal approximation theorem for nonlinear operators. The theorem says a single-hidden-layer neural network can approximate any continuous operator $G : V \to C(K')$ between function spaces, provided the input function is sampled at finitely many fixed sensor locations. DeepONet is the modern, deep-learning realization of that construction.

The architecture splits computation in two: a branch network encodes the discretized input function $u$ at sensors $\{x_1, \ldots, x_m\}$ into a coefficient vector, and a trunk network encodes a query coordinate $y$ into a basis vector. Their inner product produces $(G u)(y)$, the value of the target operator's output function at $y$. This branch-trunk split is what gives DeepONet its theoretical interpretability: the trunk learns a basis for the output function space, the branch learns coefficients in that basis.

In the data-driven PDE landscape, DeepONet competes directly with the Fourier Neural Operator and adjacent variants (graph neural operators, MIONet, PI-DeepONet). FNO often wins on regular-grid benchmarks where its $O(N \log N)$ FFT-per-layer cost is decisive; DeepONet wins on geometry-flexible problems and where the branch input is naturally low-dimensional, such as parameterized PDE coefficients or boundary-condition functions sampled sparsely. The Karniadakis group's DeepXDE library codified the architecture and spawned the PI-DeepONet, MIONet, and DeepM&Mnet variants now standard in scientific ML.

The reason to care: DeepONet is the cleanest example of an operator-learning architecture whose approximation guarantees are stated and proved as theorems about operators, not as heuristics about networks. Reading it teaches you what "learning an operator" actually means as a function-space problem.

Mental Model

The branch network produces a vector of $p$ coefficients $b(u) \in \mathbb{R}^p$. The trunk network produces a vector of $p$ basis evaluations $t(y) \in \mathbb{R}^p$. Their inner product $b(u)^\top t(y)$ approximates $(G u)(y)$. Read the trunk as a learned basis $\{\varphi_1, \ldots, \varphi_p\}$ over the output function's domain; read the branch as a learned encoder that maps the input function to expansion coefficients in that basis.

The basis is learned, not fixed. POD truncates onto the top eigenvectors of an empirical covariance; Fourier methods project onto $e^{i k \cdot y}$; spectral element methods project onto Legendre polynomials. DeepONet jointly learns the basis (via the trunk) and the projection map (via the branch) end-to-end, optimizing both for the operator class observed in training data.

Formal Statement

Definition

DeepONet (Branch-Trunk Operator Network)

Let $V \subset C(K)$ be a compact set of input functions on a compact domain $K \subset \mathbb{R}^d$, and let $G : V \to C(K')$ be the target operator producing functions on $K' \subset \mathbb{R}^{d'}$. Fix sensor locations $\{x_1, \ldots, x_m\} \subset K$. A DeepONet is a parametric operator $G_\theta : V \to C(K')$ defined by

$$(G_\theta u)(y) = \sum_{k=1}^{p} b_k(u(x_1), \ldots, u(x_m)) \cdot t_k(y) + b_0$$

where:

  • the branch network $b : \mathbb{R}^m \to \mathbb{R}^p$ is a neural network (typically an MLP or CNN) acting on the sensor-value vector
  • the trunk network $t : \mathbb{R}^{d'} \to \mathbb{R}^p$ is a neural network acting on the query coordinate
  • $b_0 \in \mathbb{R}$ is a learned scalar bias
  • $p$ is the basis size (the number of branch outputs equals the number of trunk outputs)

The output is a scalar $(G_\theta u)(y) \in \mathbb{R}$. Vector-valued operators are handled by replicating the architecture across output channels.

The bias $b_0$ matters: without it, $(G_\theta u)(y)$ vanishes whenever $b(u) = 0$, ruling out affine reconstructions like $u \mapsto u + c$ where $c$ is a nonzero constant target offset. Lu et al. (2021) report consistent improvement when $b_0$ is included.
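The definition above can be sketched directly in NumPy. This is an untrained toy with random weights, just to make the branch-trunk shape flow concrete; `mlp` and `forward` are hypothetical helpers written for this page, not part of any DeepONet library.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP parameters (hypothetical helper, untrained)."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, c) in enumerate(params):
        x = x @ W + c
        if i < len(params) - 1:      # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

m, p = 100, 40                       # sensor count, basis size
branch = mlp([m, 64, p])             # b : R^m -> R^p
trunk = mlp([1, 64, p])              # t : R^{d'} -> R^p with d' = 1
b0 = 0.0                             # the learned scalar bias (untrained here)

xs = np.linspace(0.0, 1.0, m)        # fixed sensor locations
u_at_sensors = np.sin(np.pi * xs)    # one input function, sampled at sensors
ys = np.linspace(0.0, 1.0, 50).reshape(-1, 1)  # 50 query coordinates

b_u = forward(branch, u_at_sensors)  # coefficient vector, shape (p,)
t_y = forward(trunk, ys)             # basis evaluations, shape (50, p)
out = t_y @ b_u + b0                 # (G_theta u)(y) at each query point
print(out.shape)                     # (50,)
```

Note that the branch runs once per input function while the trunk runs once per query point; the inner product is where the two meet.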

Chen–Chen Universal Approximation

Theorem

Chen–Chen 1995 Universal Approximation for Nonlinear Operators

Statement

Let $V$ be a compact subset of $C(K)$ for compact $K \subset \mathbb{R}^d$, let $K' \subset \mathbb{R}^{d'}$ be compact, and let $G : V \to C(K')$ be a continuous operator. For any $\varepsilon > 0$, there exist a positive integer $m$, sensor locations $x_1, \ldots, x_m \in K$, positive integers $p$ and $q$, real coefficients $c_k^i, \xi_k^{ij}, \theta_k^i, \zeta_k \in \mathbb{R}$, and weights $w_k \in \mathbb{R}^{d'}$ such that

$$\left| (G u)(y) - \sum_{k=1}^{p} \underbrace{\left[ \sum_{i=1}^{q} c_k^i \, \sigma\!\left( \sum_{j=1}^{m} \xi_k^{ij} u(x_j) + \theta_k^i \right) \right]}_{\text{branch coefficient } b_k(u)} \cdot \underbrace{\sigma(w_k \cdot y + \zeta_k)}_{\text{trunk basis } t_k(y)} \right| < \varepsilon$$

uniformly in $u \in V$ and $y \in K'$, where $\sigma$ is any non-polynomial Tauber–Wiener activation.

Intuition

Three ingredients combine. First, $V \subset C(K)$ is compact, so finite sensor sampling captures functions to arbitrary accuracy (a Stone–Weierstrass-style density argument). Second, classical universal approximation lets a neural network approximate the now finite-dimensional map from sensor values to expansion coefficients. Third, a separate neural network approximates the target basis functions evaluated at query points. The bilinear pairing reassembles them.

Proof Sketch

Step 1 (sensor discretization). Since $V$ is compact in $C(K)$, the evaluation map $u \mapsto (u(x_1), \ldots, u(x_m))$ is uniformly continuous on $V$ for sufficiently dense sensors, so $u$ is determined up to $\varepsilon$ by its sensor values.

Step 2 (output approximation). The target output function $(G u) \in C(K')$ lies in a compact set (the continuous image of a compact set). Apply classical universal approximation in the $y$-variable: there exist $p$ ridge functions $\sigma(w_k \cdot y + \zeta_k)$ and coefficients depending on $u$ such that the linear combination approximates $(G u)(y)$ uniformly.

Step 3 (branch approximation). The coefficients in Step 2 are continuous functionals of $u$. By Step 1, they are continuous functions of the sensor-value vector $(u(x_1), \ldots, u(x_m)) \in \mathbb{R}^m$. Apply classical universal approximation a second time: a single-hidden-layer network in the sensor variables approximates each coefficient.

Step 4 (combine). The bilinear pairing of the branch (sensor values to coefficients) and trunk (query to basis) recovers $(G u)(y)$ within $\varepsilon$ uniformly.

Why It Matters

This theorem is the theoretical backbone of every branch-trunk operator network. It says the architecture class is dense in the space of continuous operators between function spaces, given enough sensors and basis size. Without this result, DeepONet would be a heuristic; with it, the architecture is the natural instantiation of a 1995 approximation theorem.

Failure Mode

The theorem is non-quantitative: it does not say how $m$, $p$, or network width scale with the target accuracy $\varepsilon$, nor with the smoothness of $G$ or the regularity of the input space $V$. Practical bounds came later (Lanthaler-Mishra-Karniadakis 2022; see the next section). The continuity assumption on $G$ is also material: discontinuous operators (e.g., shock-forming hyperbolic PDE solution maps at the shock) fall outside the theorem's scope.

Quantitative Error Bounds

The Chen–Chen theorem guarantees existence; it does not give rates. Lanthaler, Mishra, and Karniadakis (Transactions of Mathematics and Its Applications 6, 2022) provide the first explicit bounds. For a Lipschitz operator $G$ between Sobolev spaces, the DeepONet approximation error decomposes into three additive pieces:

$$\|G - G_\theta\|_{L^2(\mu)}^2 \leq \mathcal{E}_{\text{enc}}^2 + \mathcal{E}_{\text{rec}}^2 + \mathcal{E}_{\text{approx}}^2$$

where:

  • $\mathcal{E}_{\text{enc}}$ is the encoding error from sensor discretization: it depends on the modulus of continuity of $u \mapsto (u(x_1), \ldots, u(x_m))$ on the input function space and decays as $m \to \infty$ at a rate set by the input-space smoothness.
  • $\mathcal{E}_{\text{rec}}$ is the reconstruction error from the finite-rank trunk basis: this is the analog of singular-value truncation error and decays as $p \to \infty$ at a rate set by the singular-value decay of the operator.
  • $\mathcal{E}_{\text{approx}}$ is the branch approximation error: how well the branch network approximates the (now finite-dimensional) coefficient map. This decays at standard neural-network approximation rates.

The decomposition is informative because each term has a separate cure. Encoding error: add sensors. Reconstruction error: increase the basis size $p$. Approximation error: widen or deepen the branch network. The bounds are loose in absolute terms but identify the bottleneck for any given operator.
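The reconstruction term can be made concrete for the antiderivative operator used later on this page: discretize the operator as a matrix and read off how its singular values decay, since by Eckart–Young the best rank-$p$ truncation error in spectral norm is the $(p{+}1)$-th singular value. A minimal sketch, with the left-endpoint quadrature rule chosen purely for simplicity:

```python
import numpy as np

# Discretize the antiderivative operator (G u)(y) = integral_0^y u(s) ds
# on an N-point grid as a lower-triangular quadrature matrix: A @ u ~ G u.
N = 200
A = np.tril(np.ones((N, N))) / N     # left-endpoint rule, dx = 1/N

# By Eckart-Young, the best rank-p approximation error (spectral norm)
# equals the (p+1)-th singular value, i.e. s[p] with 0-based indexing.
s = np.linalg.svd(A, compute_uv=False)
for p in (1, 5, 10, 20, 40):
    print(p, s[p] / s[0])            # relative truncation error shrinks with p
```

The decay of `s[p]` is exactly the rate that governs $\mathcal{E}_{\text{rec}}$ for this operator; a trained trunk cannot do better than this SVD baseline at fixed $p$.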

DeepONet vs FNO Trade-offs

The two architectures were designed against the same class of problems and trade differently across three axes.

Computational cost per forward pass. The Fourier Neural Operator costs $O(N \log N)$ per layer for an $N$-point grid via FFT-based global convolution. DeepONet costs $O(p)$ per query point, but full-field evaluation on an $N$-point grid scales as $O(N p)$. For dense full-field outputs at large $N$, FNO is cheaper. For sparse query patterns (e.g., evaluating at scattered sensors, on irregular meshes, or at a few quantities of interest), DeepONet's per-point cost is decisive.

Geometric flexibility. The trunk network accepts arbitrary query coordinates $y$, so DeepONet handles unstructured meshes, irregular domains, and pointwise queries without modification. FNO requires a regular grid because the FFT does. Workarounds (geo-FNO, factorized FNO) exist but lose some of the architectural simplicity. For complex geometries, parameterized domains, or multi-physics coupling at irregular interfaces, DeepONet has the advantage.

Empirical performance on standard benchmarks. Lu et al.'s A comprehensive and fair comparison of two neural operators (CMAME 2022) runs both architectures across regular-grid PDE benchmarks. FNO wins on translation-invariant problems with periodic-like structure (Burgers, Navier-Stokes, Darcy). DeepONet wins when the branch input is naturally low-dimensional (parametric PDEs with a few coefficients) or when the geometry is irregular. Neither dominates universally.

Resolution invariance. FNO is genuinely resolution-invariant in input and output: train at one grid resolution, evaluate at another. DeepONet is resolution-flexible only in the output (the trunk handles arbitrary $y$). The branch input is locked to the training sensor configuration; you cannot evaluate on test inputs sampled at different sensor locations.

Worked Example: Antiderivative Operator

A canonical sanity check from Lu et al. (2021) §3.1: learn the antiderivative operator

$$G : u \mapsto v, \quad v'(y) = u(y), \quad v(0) = 0, \quad y \in [0, 1]$$

so $(G u)(y) = \int_0^y u(s) \, ds$. Training inputs $u$ are drawn from a Gaussian random field with an RBF covariance kernel of length scale 0.2, evaluated at $m = 100$ uniformly spaced sensors on $[0, 1]$. Query points $y$ are sampled uniformly on $[0, 1]$.
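The training inputs can be sampled with a Cholesky factor of the RBF kernel evaluated at the sensor grid. A minimal sketch assuming unit variance and the length scale 0.2 from the setup above; the jitter term is a standard numerical stabilizer, not part of the paper's description:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
xs = np.linspace(0.0, 1.0, m)        # the m = 100 uniform sensors
ell = 0.2                            # RBF length scale from the setup above

# RBF (squared-exponential) covariance evaluated at the sensor grid
K = np.exp(-((xs[:, None] - xs[None, :]) ** 2) / (2.0 * ell**2))
L = np.linalg.cholesky(K + 1e-6 * np.eye(m))  # jitter: numerical fix only

# Each row is one input function u, sampled at the training sensors
u_samples = (L @ rng.standard_normal((m, 1000))).T
print(u_samples.shape)               # (1000, 100)
```

Each row of `u_samples` is one branch input; the matching targets are the antiderivatives of those draws evaluated at the query points.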

Architecture: the branch is a 3-layer MLP $\mathbb{R}^{100} \to \mathbb{R}^{40}$ with ReLU activations and 40 hidden units per layer; the trunk is a 3-layer MLP $\mathbb{R}^1 \to \mathbb{R}^{40}$. Basis size $p = 40$. Train for $10^4$ epochs with Adam at learning rate $10^{-3}$. Lu et al. report relative $L^2$ test error around $10^{-3}$ on held-out input functions, with the dominant residual concentrated near $y = 0$ (where the boundary condition $v(0) = 0$ creates an integrable cusp).

Two diagnostics are worth running on this example. First, plot the learned trunk basis $\{t_1(y), \ldots, t_{40}(y)\}$: for the antiderivative operator, the top trunk modes should resemble polynomial or sigmoidal ramps reflecting the integration kernel. Second, sweep the basis size $p$ from 5 to 80 and watch the error curve: the elbow gives the effective rank of the operator, which for the smooth antiderivative is small.
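Before running either diagnostic, it is worth checking the target itself numerically. For $u(x) = \sin(\pi x)$ the exact output is $(1 - \cos(\pi y))/\pi$ (the same input reused in the first exercise below); a quick trapezoid-rule check confirms the closed form:

```python
import numpy as np

# For u(x) = sin(pi x), the antiderivative is (1 - cos(pi y)) / pi.
N = 1001
xs = np.linspace(0.0, 1.0, N)
u = np.sin(np.pi * xs)

# Cumulative trapezoid rule for integral_0^y u(s) ds at every grid point
dx = xs[1] - xs[0]
Gu = np.concatenate([[0.0], np.cumsum((u[1:] + u[:-1]) * dx / 2.0)])

exact = (1.0 - np.cos(np.pi * xs)) / np.pi
err = np.max(np.abs(Gu - exact))
print(err)                           # quadrature error only, well below 1e-5
```

Any trained model's residual should be compared against this quadrature floor, not against zero.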

Common Confusions

Watch Out

DeepONet is not a neural operator in the strict Kovachki sense

Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, and Anandkumar (JMLR 2023) define a "neural operator" as a parametric map between function spaces with a kernel-integral structure that is intrinsically resolution-invariant in both input and output. FNO satisfies this definition. DeepONet does not: its branch input is a fixed-length vector tied to the training sensor configuration, so input-side resolution invariance fails. DeepONet is better described as a bilinear operator approximator using a learned finite-rank basis. The distinction matters when reading theoretical papers — error bounds for "neural operators" in the Kovachki sense do not automatically apply to DeepONet, and vice versa.

Watch Out

Sensor count and locations are fixed at training time

The branch network expects an $m$-dimensional input vector $(u(x_1), \ldots, u(x_m))$. The number $m$ and the locations $\{x_j\}$ are baked into the trained weights. You cannot evaluate a trained DeepONet on a test input sampled at 200 locations if it was trained on 100, nor on a test input sampled at locations that differ from the training sensors. Resolution flexibility is output-only, mediated by the trunk. To support multiple sensor configurations you must either retrain or use architectural extensions (e.g., DeepONet variants with adaptive sensor encoders, or set-based input encoders).
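The lock-in is visible at the level of plain matrix shapes. A toy sketch, assuming a hypothetical trained first-layer branch weight matrix of shape $(m, \text{hidden})$ with $m = 100$:

```python
import numpy as np

# Stand-in for a trained branch first layer: the input dimension m = 100
# is baked into the weight shape (hypothetical numbers for illustration).
W1 = np.zeros((100, 64))

u_train_sensors = np.zeros(100)      # input sampled at the training sensors
u_other_sensors = np.zeros(200)      # same function, resampled at 200 points

_ = u_train_sensors @ W1             # fine: (100,) @ (100, 64) -> (64,)
try:
    _ = u_other_sensors @ W1         # shape mismatch: (200,) @ (100, 64)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)             # True
```

Interpolating the test input back onto the training sensors is the usual workaround, but it silently changes the input the branch actually sees.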

Watch Out

Stacked vs unstacked DeepONet

Lu et al. (2021) introduced two variants. The stacked version trains $p$ independent branch networks, one per basis coefficient, producing $p$ scalar outputs concatenated into a length-$p$ vector. The unstacked version uses a single shared branch network with $p$ output channels. Unstacked is faster (one forward pass instead of $p$) and is the default in DeepXDE; stacked is occasionally more accurate when basis modes are highly heterogeneous. The original Chen–Chen approximation result is stated for the stacked construction; Lanthaler-Mishra-Karniadakis 2022 covers both. When reading benchmarks, check which variant is being reported.
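The two variants differ only in how the coefficient vector $b(u)$ is produced; a minimal NumPy sketch with random weights (the hidden width, $p$, and single-hidden-layer ReLU here are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m, h, p = 100, 64, 8                 # sensors, hidden width, basis size
u = rng.standard_normal(m)           # one input function at the m sensors

# Unstacked: one shared branch with p output channels, one forward pass.
W1 = rng.standard_normal((m, h))
W2 = rng.standard_normal((h, p))
b_unstacked = np.maximum(u @ W1, 0.0) @ W2                     # shape (p,)

# Stacked: p independent branch networks, each with a scalar output.
nets = [(rng.standard_normal((m, h)), rng.standard_normal(h)) for _ in range(p)]
b_stacked = np.array([np.maximum(u @ Wk, 0.0) @ wk for Wk, wk in nets])

# Either coefficient vector pairs with the same trunk output t(y).
print(b_unstacked.shape, b_stacked.shape)
```

The stacked variant costs $p$ forward passes and $p$ times the parameters for the same-length output, which is why unstacked is the practical default.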

Exercises

ExerciseCore

Problem

Consider the antiderivative operator $G : u \mapsto v$ with $v(y) = \int_0^y u(s) \, ds$ on $[0, 1]$. Take a single test input $u(x) = \sin(\pi x)$, so that $(G u)(y) = (1 - \cos(\pi y))/\pi$. Now consider a degenerate DeepONet with basis size $p = 1$, so $(G_\theta u)(y) = b_1(u) \cdot t_1(y) + b_0$. Sketch the trunk function $t_1(y)$ that minimizes squared error against $(G u)(y)$ on this single test input, and explain why $p = 1$ cannot represent the antiderivative operator across a generic input distribution.

ExerciseAdvanced

Problem

Show that the bilinear part of DeepONet, $b(u)^\top t(y)$, is a finite-rank approximation of the operator $G$. When $G$ is a compact operator on a Hilbert space, the optimal rank-$p$ approximation in operator norm is given by the top $p$ singular triples (the spectral/SVD theorem). State the optimal trunk and branch in terms of the SVD of $G$, and argue why the trained DeepONet basis need not coincide with the singular vectors.

References

Canonical:

  • Chen, T., and Chen, H., "Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems" (IEEE Transactions on Neural Networks 6, 1995), Sections 2-4. The original universal approximation theorem.
  • Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E., "Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators" (Nature Machine Intelligence 3, 2021), arXiv:1910.03193. The DeepONet paper.

Current:

  • Lanthaler, S., Mishra, S., and Karniadakis, G. E., "Error estimates for DeepONets: a deep learning framework in infinite dimensions" (Transactions of Mathematics and Its Applications 6, 2022), Sections 3-5. Quantitative approximation bounds.
  • Lu, L., Meng, X., Cai, S., Mao, Z., Goswami, S., Zhang, Z., and Karniadakis, G. E., "A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data" (Computer Methods in Applied Mechanics and Engineering 393, 2022). DeepONet vs FNO benchmark.
  • Wang, S., Wang, H., and Perdikaris, P., "Learning the solution operator of parametric partial differential equations with physics-informed DeepONets" (Science Advances 7, 2021). PI-DeepONet variant.
  • Jin, P., Meng, S., Lu, L., and Karniadakis, G. E., "MIONet: Learning multiple-input operators via tensor product" (SIAM Journal on Scientific Computing 44, 2022). Multi-input extension.
  • Kovachki, N., Lanthaler, S., and Mishra, S., "On universal approximation and error bounds for Fourier neural operators" (Journal of Machine Learning Research 22, 2021). FNO comparison reference.
  • Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L., "Physics-informed machine learning" (Nature Reviews Physics 3, 2021), Sections 4-5. Survey context placing DeepONet in the broader scientific-ML landscape.

Summary

  • DeepONet realizes Chen and Chen's 1995 universal approximation theorem for nonlinear operators via a branch network (sensor encoder) and a trunk network (query basis) joined by an inner product.
  • The architecture is a bilinear, finite-rank operator approximator with a learned basis; Lanthaler-Mishra-Karniadakis 2022 decomposes its error into encoding, reconstruction, and approximation pieces.
  • Against FNO: DeepONet is mesh-flexible and cheap per query but locked to its training sensor configuration; FNO is grid-bound but resolution-invariant and faster on dense full-field outputs.


Last reviewed: April 18, 2026
