

DeepONet

DeepONet (Lu, Karniadakis et al., 2021) approximates nonlinear operators between function spaces by splitting a network into a branch (encoding the input function at fixed sensors) and a trunk (encoding query coordinates), then taking an inner product. The architecture is the practical realization of Chen and Chen's 1995 universal approximation theorem for operators.


Why This Matters

DeepONet (Lu, Jin, Pang, Zhang, and Karniadakis, Nature Machine Intelligence 2021) is the operator-learning architecture grounded in a theorem that predates it by 26 years: Chen and Chen's 1995 universal approximation theorem for nonlinear operators. The theorem says a single-hidden-layer neural network can approximate any continuous operator $G : V \to C(K')$ between function spaces, provided the input function is sampled at finitely many fixed sensor locations. DeepONet is the modern, deep-learning realization of that construction.

The architecture splits computation in two: a branch network encodes the discretized input function $u$ at sensors $\{x_1, \ldots, x_m\}$ into a coefficient vector, and a trunk network encodes a query coordinate $y$ into a basis vector. Their inner product produces $(G u)(y)$, the value of the target operator's output function at $y$. This branch-trunk split is what gives DeepONet its theoretical interpretability: the trunk learns a basis for the output function space, the branch learns coefficients in that basis.

In the data-driven PDE landscape, DeepONet competes directly with the Fourier Neural Operator and adjacent variants (graph neural operators, MIONet, PI-DeepONet). FNO often wins on regular-grid benchmarks where its $O(N \log N)$ FFT-per-layer cost is decisive; DeepONet wins on geometry-flexible problems and where the branch input is naturally low-dimensional, such as parameterized PDE coefficients or boundary-condition functions sampled sparsely. The Karniadakis group's DeepXDE library codified the architecture and spawned the PI-DeepONet, MIONet, and DeepM&Mnet variants now standard in scientific ML.

The reason to care: DeepONet is the cleanest example of an operator-learning architecture whose approximation guarantees are stated and proved as theorems about operators, not as heuristics about networks. Reading it teaches you what "learning an operator" actually means as a function-space problem.

Mental Model

The branch network produces a vector of $p$ coefficients $b(u) \in \mathbb{R}^p$. The trunk network produces a vector of $p$ basis evaluations $t(y) \in \mathbb{R}^p$. Their inner product $b(u)^\top t(y)$ approximates $(G u)(y)$. Read the trunk as a learned basis $\{\varphi_1, \ldots, \varphi_p\}$ over the output function's domain; read the branch as a learned encoder that maps the input function to expansion coefficients in that basis.

The basis is learned, not fixed. POD truncates onto the top eigenvectors of an empirical covariance; Fourier methods project onto $e^{i k \cdot y}$; spectral element methods project onto Legendre polynomials. DeepONet jointly learns the basis (via the trunk) and the projection map (via the branch) end-to-end, optimizing both for the operator class observed in training data.

Formal Statement

Definition

DeepONet (Branch-Trunk Operator Network)

Let $V \subset C(K)$ be a compact set of input functions on a compact domain $K \subset \mathbb{R}^d$, and let $G : V \to C(K')$ be the target operator producing functions on $K' \subset \mathbb{R}^{d'}$. Fix sensor locations $\{x_1, \ldots, x_m\} \subset K$. A DeepONet is a parametric operator $G_\theta : V \to C(K')$ defined by

$$(G_\theta u)(y) = \sum_{k=1}^{p} b_k(u(x_1), \ldots, u(x_m)) \cdot t_k(y) + b_0$$

where:

  • the branch network $b : \mathbb{R}^m \to \mathbb{R}^p$ is a neural network (typically an MLP or CNN) acting on the sensor-value vector
  • the trunk network $t : \mathbb{R}^{d'} \to \mathbb{R}^p$ is a neural network acting on the query coordinate
  • $b_0 \in \mathbb{R}$ is a learned scalar bias
  • $p$ is the basis size (the number of branch outputs equals the number of trunk outputs)

The output is a scalar $(G_\theta u)(y) \in \mathbb{R}$. Vector-valued operators are handled by replicating the architecture across output channels.

The bias $b_0$ matters: without it, $(G_\theta u)(y)$ vanishes whenever $b(u) = 0$, ruling out affine reconstructions like $u \mapsto u + c$ where $c$ is a nonzero constant target offset. Lu et al. (2021) report consistent improvement when $b_0$ is included.
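The definition above can be sketched directly in NumPy. This is an untrained toy with random weights, just to make the branch-trunk shape flow concrete; `mlp` and `forward` are hypothetical helpers written for this page, not part of any DeepONet library.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP parameters (hypothetical helper, untrained)."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, c) in enumerate(params):
        x = x @ W + c
        if i < len(params) - 1:      # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

m, p = 100, 40                       # sensor count, basis size
branch = mlp([m, 64, p])             # b : R^m -> R^p
trunk = mlp([1, 64, p])              # t : R^{d'} -> R^p with d' = 1
b0 = 0.0                             # the learned scalar bias (untrained here)

xs = np.linspace(0.0, 1.0, m)        # fixed sensor locations
u_at_sensors = np.sin(np.pi * xs)    # one input function, sampled at sensors
ys = np.linspace(0.0, 1.0, 50).reshape(-1, 1)  # 50 query coordinates

b_u = forward(branch, u_at_sensors)  # coefficient vector, shape (p,)
t_y = forward(trunk, ys)             # basis evaluations, shape (50, p)
out = t_y @ b_u + b0                 # (G_theta u)(y) at each query point
print(out.shape)                     # (50,)
```

Note that the branch runs once per input function while the trunk runs once per query point; the inner product is where the two meet.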

Chen–Chen Universal Approximation

Theorem

Chen–Chen 1995 Universal Approximation for Nonlinear Operators

Statement

Let $V$ be a compact subset of $C(K)$ for compact $K \subset \mathbb{R}^d$, let $K' \subset \mathbb{R}^{d'}$ be compact, and let $G : V \to C(K')$ be a continuous operator. For any $\varepsilon > 0$, there exist a positive integer $m$, sensor locations $x_1, \ldots, x_m \in K$, positive integers $p$ and $q$, real coefficients $c_k^i, \xi_k^{ij}, \theta_k^i, \zeta_k \in \mathbb{R}$, and weights $w_k \in \mathbb{R}^{d'}$ such that

$$\left| (G u)(y) - \sum_{k=1}^{p} \underbrace{\left[ \sum_{i=1}^{q} c_k^i \, \sigma\!\left( \sum_{j=1}^{m} \xi_k^{ij} u(x_j) + \theta_k^i \right) \right]}_{\text{branch coefficient } b_k(u)} \cdot \underbrace{\sigma(w_k \cdot y + \zeta_k)}_{\text{trunk basis } t_k(y)} \right| < \varepsilon$$

uniformly in $u \in V$ and $y \in K'$, where $\sigma$ is any non-polynomial Tauber–Wiener activation.

Intuition

Three ingredients combine. First, $V \subset C(K)$ is compact, so finite sensor sampling captures functions to arbitrary accuracy (a Stone–Weierstrass-style density argument). Second, classical universal approximation lets a neural network approximate the now finite-dimensional map from sensor values to expansion coefficients. Third, a separate neural network approximates the target basis functions evaluated at query points. The bilinear pairing reassembles them.

Proof Sketch

Step 1 (sensor discretization). Since $V$ is compact in $C(K)$, the evaluation map $u \mapsto (u(x_1), \ldots, u(x_m))$ is uniformly continuous on $V$ for sufficiently dense sensors, so $u$ is determined up to $\varepsilon$ by its sensor values.

Step 2 (output approximation). The target output function $(G u) \in C(K')$ lies in a compact set (the continuous image of a compact set). Apply classical universal approximation in the $y$-variable: there exist $p$ ridge functions $\sigma(w_k \cdot y + \zeta_k)$ and coefficients depending on $u$ such that the linear combination approximates $(G u)(y)$ uniformly.

Step 3 (branch approximation). The coefficients in Step 2 are continuous functionals of $u$. By Step 1, they are continuous functions of the sensor-value vector $(u(x_1), \ldots, u(x_m)) \in \mathbb{R}^m$. Apply classical universal approximation a second time: a single-hidden-layer network in the sensor variables approximates each coefficient.

Step 4 (combine). The bilinear pairing of the branch (sensor values to coefficients) and trunk (query to basis) recovers $(G u)(y)$ within $\varepsilon$ uniformly.

Why It Matters

This theorem is the theoretical backbone of every branch-trunk operator network. It says the architecture class is dense in the space of continuous operators between function spaces, given enough sensors and basis size. Without this result, DeepONet would be a heuristic; with it, the architecture is the natural instantiation of a 1995 approximation theorem.

Failure Mode

The theorem is non-quantitative: it does not say how $m$, $p$, or network width scale with the target accuracy $\varepsilon$, nor with the smoothness of $G$ or the regularity of the input space $V$. Practical bounds came later (Lanthaler-Mishra-Karniadakis 2022; see the next section). The continuity assumption on $G$ is also material: discontinuous operators (e.g., shock-forming hyperbolic PDE solution maps at the shock) fall outside the theorem's scope.

Quantitative Error Bounds

The Chen–Chen theorem guarantees existence; it does not give rates. Lanthaler, Mishra, and Karniadakis (Transactions of Mathematics and Its Applications 6, 2022) provide the first explicit bounds. For a Lipschitz operator $G$ between Sobolev spaces, the DeepONet approximation error decomposes into three additive pieces:

$$\|G - G_\theta\|_{L^2(\mu)}^2 \leq \mathcal{E}_{\text{enc}}^2 + \mathcal{E}_{\text{rec}}^2 + \mathcal{E}_{\text{approx}}^2$$

where:

  • $\mathcal{E}_{\text{enc}}$ is the encoding error from sensor discretization: it depends on the modulus of continuity of $u \mapsto (u(x_1), \ldots, u(x_m))$ on the input function space and decays as $m \to \infty$ at a rate set by the input-space smoothness.
  • $\mathcal{E}_{\text{rec}}$ is the reconstruction error from the finite-rank trunk basis: this is the analog of singular-value truncation error and decays as $p \to \infty$ at a rate set by the singular-value decay of the operator.
  • $\mathcal{E}_{\text{approx}}$ is the branch approximation error: how well the branch network approximates the (now finite-dimensional) coefficient map. This decays at standard neural-network approximation rates.

The decomposition is informative because each term has a separate cure. Encoding error: add sensors. Reconstruction error: increase the basis size $p$. Approximation error: widen or deepen the branch network. The bounds are loose in absolute terms but identify the bottleneck for any given operator.
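The reconstruction term can be made concrete for the antiderivative operator used later on this page: discretize the operator as a matrix and read off how its singular values decay, since by Eckart–Young the best rank-$p$ truncation error in spectral norm is the $(p{+}1)$-th singular value. A minimal sketch, with the left-endpoint quadrature rule chosen purely for simplicity:

```python
import numpy as np

# Discretize the antiderivative operator (G u)(y) = integral_0^y u(s) ds
# on an N-point grid as a lower-triangular quadrature matrix: A @ u ~ G u.
N = 200
A = np.tril(np.ones((N, N))) / N     # left-endpoint rule, dx = 1/N

# By Eckart-Young, the best rank-p approximation error (spectral norm)
# equals the (p+1)-th singular value, i.e. s[p] with 0-based indexing.
s = np.linalg.svd(A, compute_uv=False)
for p in (1, 5, 10, 20, 40):
    print(p, s[p] / s[0])            # relative truncation error shrinks with p
```

The decay of `s[p]` is exactly the rate that governs $\mathcal{E}_{\text{rec}}$ for this operator; a trained trunk cannot do better than this SVD baseline at fixed $p$.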

DeepONet vs FNO Trade-offs

The two architectures were designed against the same class of problems and trade differently across three axes.

Computational cost per forward pass. The Fourier Neural Operator costs $O(N \log N)$ per layer for an $N$-point grid via FFT-based global convolution. DeepONet costs $O(p)$ per query point, but full-field evaluation on an $N$-point grid scales as $O(N p)$. For dense full-field outputs at large $N$, FNO is cheaper. For sparse query patterns (e.g., evaluating at scattered sensors, on irregular meshes, or at a few quantities of interest), DeepONet's per-point cost is decisive.

Geometric flexibility. The trunk network accepts arbitrary query coordinates $y$, so DeepONet handles unstructured meshes, irregular domains, and pointwise queries without modification. FNO requires a regular grid because the FFT does. Workarounds (geo-FNO, factorized FNO) exist but lose some of the architectural simplicity. For complex geometries, parameterized domains, or multi-physics coupling at irregular interfaces, DeepONet has the advantage.

Empirical performance on standard benchmarks. Lu et al.'s A comprehensive and fair comparison of two neural operators (CMAME 2022) runs both architectures across regular-grid PDE benchmarks. FNO wins on translation-invariant problems with periodic-like structure (Burgers, Navier-Stokes, Darcy). DeepONet wins when the branch input is naturally low-dimensional (parametric PDEs with a few coefficients) or when the geometry is irregular. Neither dominates universally.

Resolution invariance. FNO is genuinely resolution-invariant in input and output: train at one grid resolution, evaluate at another. DeepONet is resolution-flexible only in the output (the trunk handles arbitrary $y$). The branch input is locked to the training sensor configuration; you cannot evaluate on test inputs sampled at different sensor locations.

Worked Example: Antiderivative Operator

A canonical sanity check from Lu et al. (2021) §3.1: learn the antiderivative operator

$$G : u \mapsto v, \quad v'(y) = u(y), \quad v(0) = 0, \quad y \in [0, 1]$$

so $(G u)(y) = \int_0^y u(s) \, ds$. Training inputs $u$ are drawn from a Gaussian random field with an RBF covariance kernel of length scale 0.2, evaluated at $m = 100$ uniformly spaced sensors on $[0, 1]$. Query points $y$ are sampled uniformly on $[0, 1]$.
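The training inputs can be sampled with a Cholesky factor of the RBF kernel evaluated at the sensor grid. A minimal sketch assuming unit variance and the length scale 0.2 from the setup above; the jitter term is a standard numerical stabilizer, not part of the paper's description:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
xs = np.linspace(0.0, 1.0, m)        # the m = 100 uniform sensors
ell = 0.2                            # RBF length scale from the setup above

# RBF (squared-exponential) covariance evaluated at the sensor grid
K = np.exp(-((xs[:, None] - xs[None, :]) ** 2) / (2.0 * ell**2))
L = np.linalg.cholesky(K + 1e-6 * np.eye(m))  # jitter: numerical fix only

# Each row is one input function u, sampled at the training sensors
u_samples = (L @ rng.standard_normal((m, 1000))).T
print(u_samples.shape)               # (1000, 100)
```

Each row of `u_samples` is one branch input; the matching targets are the antiderivatives of those draws evaluated at the query points.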

Architecture: the branch is a 3-layer MLP $\mathbb{R}^{100} \to \mathbb{R}^{40}$ with ReLU activations and 40 hidden units per layer; the trunk is a 3-layer MLP $\mathbb{R}^1 \to \mathbb{R}^{40}$. Basis size $p = 40$. Train for $10^4$ epochs with Adam at learning rate $10^{-3}$. Lu et al. report relative $L^2$ test error around $10^{-3}$ on held-out input functions, with the dominant residual concentrated near $y = 0$ (where the boundary condition $v(0) = 0$ creates an integrable cusp).

Two diagnostics are worth running on this example. First, plot the learned trunk basis $\{t_1(y), \ldots, t_{40}(y)\}$: for the antiderivative operator, the top trunk modes should resemble polynomial or sigmoidal ramps reflecting the integration kernel. Second, sweep the basis size $p$ from 5 to 80 and watch the error curve: the elbow gives the effective rank of the operator, which for the smooth antiderivative is small.
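Before running either diagnostic, it is worth checking the target itself numerically. For $u(x) = \sin(\pi x)$ the exact output is $(1 - \cos(\pi y))/\pi$ (the same input reused in the first exercise below); a quick trapezoid-rule check confirms the closed form:

```python
import numpy as np

# For u(x) = sin(pi x), the antiderivative is (1 - cos(pi y)) / pi.
N = 1001
xs = np.linspace(0.0, 1.0, N)
u = np.sin(np.pi * xs)

# Cumulative trapezoid rule for integral_0^y u(s) ds at every grid point
dx = xs[1] - xs[0]
Gu = np.concatenate([[0.0], np.cumsum((u[1:] + u[:-1]) * dx / 2.0)])

exact = (1.0 - np.cos(np.pi * xs)) / np.pi
err = np.max(np.abs(Gu - exact))
print(err)                           # quadrature error only, well below 1e-5
```

Any trained model's residual should be compared against this quadrature floor, not against zero.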

Common Confusions

Watch Out

DeepONet is not a neural operator in the strict Kovachki sense

Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, and Anandkumar (JMLR 2023) define a "neural operator" as a parametric map between function spaces with a kernel-integral structure that is intrinsically resolution-invariant in both input and output. FNO satisfies this definition. DeepONet does not: its branch input is a fixed-length vector tied to the training sensor configuration, so input-side resolution invariance fails. DeepONet is better described as a bilinear operator approximator using a learned finite-rank basis. The distinction matters when reading theoretical papers — error bounds for "neural operators" in the Kovachki sense do not automatically apply to DeepONet, and vice versa.

Watch Out

Sensor count and locations are fixed at training time

The branch network expects an $m$-dimensional input vector $(u(x_1), \ldots, u(x_m))$. The number $m$ and the locations $\{x_j\}$ are baked into the trained weights. You cannot evaluate a trained DeepONet on a test input sampled at 200 locations if it was trained on 100, nor on a test input sampled at locations that differ from the training sensors. Resolution flexibility is output-only, mediated by the trunk. To support multiple sensor configurations you must either retrain or use architectural extensions (e.g., DeepONet variants with adaptive sensor encoders, or set-based input encoders).
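The lock-in is visible at the level of plain matrix shapes. A toy sketch, assuming a hypothetical trained first-layer branch weight matrix of shape $(m, \text{hidden})$ with $m = 100$:

```python
import numpy as np

# Stand-in for a trained branch first layer: the input dimension m = 100
# is baked into the weight shape (hypothetical numbers for illustration).
W1 = np.zeros((100, 64))

u_train_sensors = np.zeros(100)      # input sampled at the training sensors
u_other_sensors = np.zeros(200)      # same function, resampled at 200 points

_ = u_train_sensors @ W1             # fine: (100,) @ (100, 64) -> (64,)
try:
    _ = u_other_sensors @ W1         # shape mismatch: (200,) @ (100, 64)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)             # True
```

Interpolating the test input back onto the training sensors is the usual workaround, but it silently changes the input the branch actually sees.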

Watch Out

Stacked vs unstacked DeepONet

Lu et al. (2021) introduced two variants. The stacked version trains $p$ independent branch networks, one per basis coefficient, producing $p$ scalar outputs concatenated into a length-$p$ vector. The unstacked version uses a single shared branch network with $p$ output channels. Unstacked is faster (one forward pass instead of $p$) and is the default in DeepXDE; stacked is occasionally more accurate when basis modes are highly heterogeneous. The original Chen–Chen approximation result is stated for the stacked construction; Lanthaler-Mishra-Karniadakis 2022 covers both. When reading benchmarks, check which variant is being reported.
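The two variants differ only in how the coefficient vector $b(u)$ is produced; a minimal NumPy sketch with random weights (the hidden width, $p$, and single-hidden-layer ReLU here are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m, h, p = 100, 64, 8                 # sensors, hidden width, basis size
u = rng.standard_normal(m)           # one input function at the m sensors

# Unstacked: one shared branch with p output channels, one forward pass.
W1 = rng.standard_normal((m, h))
W2 = rng.standard_normal((h, p))
b_unstacked = np.maximum(u @ W1, 0.0) @ W2                     # shape (p,)

# Stacked: p independent branch networks, each with a scalar output.
nets = [(rng.standard_normal((m, h)), rng.standard_normal(h)) for _ in range(p)]
b_stacked = np.array([np.maximum(u @ Wk, 0.0) @ wk for Wk, wk in nets])

# Either coefficient vector pairs with the same trunk output t(y).
print(b_unstacked.shape, b_stacked.shape)
```

The stacked variant costs $p$ forward passes and $p$ times the parameters for the same-length output, which is why unstacked is the practical default.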

Exercises

ExerciseCore

Problem

Consider the antiderivative operator $G : u \mapsto v$ with $v(y) = \int_0^y u(s) \, ds$ on $[0, 1]$. Take a single test input $u(x) = \sin(\pi x)$, so that $(G u)(y) = (1 - \cos(\pi y))/\pi$. Now consider a degenerate DeepONet with basis size $p = 1$, so $(G_\theta u)(y) = b_1(u) \cdot t_1(y) + b_0$. Sketch the trunk function $t_1(y)$ that minimizes squared error against $(G u)(y)$ on this single test input, and explain why $p = 1$ cannot represent the antiderivative operator across a generic input distribution.

ExerciseAdvanced

Problem

Show that the bilinear part of DeepONet, $b(u)^\top t(y)$, is a finite-rank approximation of the operator $G$. When $G$ is a compact operator on a Hilbert space, the optimal rank-$p$ approximation in operator norm is given by the top $p$ singular triples (the spectral/SVD theorem). State the optimal trunk and branch in terms of the SVD of $G$, and argue why the trained DeepONet basis need not coincide with the singular vectors.

References

Canonical:

  • Chen, T., and Chen, H., "Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems" (IEEE Transactions on Neural Networks 6, 1995), Sections 2-4. The original universal approximation theorem.
  • Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E., "Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators" (Nature Machine Intelligence 3, 2021), arXiv:1910.03193. The DeepONet paper.

Current:

  • Lanthaler, S., Mishra, S., and Karniadakis, G. E., "Error estimates for DeepONets: a deep learning framework in infinite dimensions" (Transactions of Mathematics and Its Applications 6, 2022), Sections 3-5. Quantitative approximation bounds.
  • Lu, L., Meng, X., Cai, S., Mao, Z., Goswami, S., Zhang, Z., and Karniadakis, G. E., "A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data" (Computer Methods in Applied Mechanics and Engineering 393, 2022). DeepONet vs FNO benchmark.
  • Wang, S., Wang, H., and Perdikaris, P., "Learning the solution operator of parametric partial differential equations with physics-informed DeepONets" (Science Advances 7, 2021). PI-DeepONet variant.
  • Jin, P., Meng, S., Lu, L., and Karniadakis, G. E., "MIONet: Learning multiple-input operators via tensor product" (SIAM Journal on Scientific Computing 44, 2022). Multi-input extension.
  • Kovachki, N., Lanthaler, S., and Mishra, S., "On universal approximation and error bounds for Fourier neural operators" (Journal of Machine Learning Research 22, 2021). FNO comparison reference.
  • Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L., "Physics-informed machine learning" (Nature Reviews Physics 3, 2021), Sections 4-5. Survey context placing DeepONet in the broader scientific-ML landscape.

Summary

  • DeepONet realizes Chen and Chen's 1995 universal approximation theorem for nonlinear operators via a branch network (sensor encoder) and a trunk network (query basis) joined by an inner product.
  • The architecture is a bilinear, finite-rank operator approximator with a learned basis; Lanthaler-Mishra-Karniadakis 2022 decomposes its error into encoding, reconstruction, and approximation pieces.
  • Against FNO: DeepONet is mesh-flexible and cheap per query but locked to its training sensor configuration; FNO is grid-bound but resolution-invariant and faster on dense full-field outputs.


Last reviewed: April 18, 2026
