
Fourier Neural Operator

The Fourier Neural Operator (Li, Kovachki, Anandkumar et al., ICLR 2021) parameterizes the kernel of an integral operator directly in Fourier space, giving a resolution-invariant architecture for learning maps between function spaces. Canonical baseline for data-driven PDE solvers and the architectural backbone of FourCastNet weather prediction.


Why This Matters

Classical neural networks learn maps $\mathbb{R}^n \to \mathbb{R}^m$ between finite-dimensional vector spaces. A PDE solver, by contrast, computes a map between function spaces: from an initial condition $u_0 \in L^2(\Omega)$ to the solution $u(\cdot, T) \in L^2(\Omega)$ at a later time, or from a coefficient field $a \in L^\infty(\Omega)$ to the corresponding solution of an elliptic boundary-value problem. Discretize on a grid of $N$ points and a CNN can imitate this map, but the learned weights are tied to that grid: train at $64 \times 64$ and the network has no principled meaning at $256 \times 256$.

The Fourier Neural Operator (FNO) of Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, and Anandkumar (ICLR 2021) was the first architecture to break this discretization tie convincingly. The construction is direct: parameterize the kernel of an integral operator in Fourier space, truncate to a fixed number of low-frequency modes, and evaluate with an FFT. The learned weights live in spectral space and are independent of the spatial resolution, so the same trained network applies to any sufficiently fine grid.

On the standard 2D Navier-Stokes benchmark of Li et al. (2021), FNO matched a pseudospectral solver to relative $L^2$ error around $10^{-3}$ at sub-second inference, three orders of magnitude faster than the solver it was trained against. This result rearranged the data-driven PDE landscape: FNO became the default baseline against which neural operators (DeepONet, Geo-FNO, U-Net operators, transformer operators) are benchmarked. The architecture also scaled to global weather: Pathak et al. (2022) trained an Adaptive FNO variant (FourCastNet) on ERA5 reanalysis data and matched the IFS forecast on key surface variables at $10^4 \times$ the throughput.

The conceptual contrast with physics-informed neural networks is sharp. PINNs solve a single PDE instance by encoding the residual into the loss; every new initial condition demands a new optimization. FNO amortizes the cost: train once on a distribution of initial conditions, then evaluate the learned operator instantly on any new instance from that distribution. The tradeoff is that FNO needs labeled solution data (typically from a conventional solver), while PINNs need only the PDE.

Mental Model

Many useful operators on functions can be written as integral operators of the form $(Kv)(x) = \int_\Omega \kappa(x, y)\, v(y)\, dy$. When the kernel is shift-invariant, $\kappa(x, y) = \kappa(x - y)$, this integral is a convolution, and the convolution theorem turns it into pointwise multiplication in the Fourier domain: $\widehat{Kv}(k) = \hat{\kappa}(k) \cdot \hat{v}(k)$. FNO takes this identity as a design principle. Rather than parameterize $\kappa$ in physical space (as a CNN kernel does), it parameterizes the spectral multiplier $\hat{\kappa}(k)$ directly, learns a different multiplier for each Fourier mode up to a truncation cutoff, and applies the operator with two FFTs.
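The convolution-theorem identity underlying this design is easy to verify numerically. A minimal NumPy check (illustrative, not from the paper): a circular convolution computed directly matches pointwise multiplication of Fourier coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
v = rng.standard_normal(N)        # signal on a periodic 1D grid
kappa = rng.standard_normal(N)    # shift-invariant kernel on the same grid

# Direct circular convolution: (kappa * v)[n] = sum_m kappa[m] v[(n - m) mod N]
direct = np.array([sum(kappa[m] * v[(n - m) % N] for m in range(N))
                   for n in range(N)])

# Convolution theorem: pointwise multiplication in the Fourier domain
spectral = np.fft.ifft(np.fft.fft(kappa) * np.fft.fft(v)).real

assert np.allclose(direct, spectral)
```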

The truncation is the crucial engineering choice. Without it, the spectral multiplier on a grid of $N$ points has $N$ independent values per channel pair, making the parameter count grow with resolution. With it, the truncation $k \le k_{\max}$ pins the parameter count to $O(k_{\max}^d \cdot d_v^2)$ in $d$ spatial dimensions with $d_v$ channels, independent of $N$. The same trained weights can be evaluated on any grid of size $N \ge k_{\max}$ by zero-padding or truncating the FFT output appropriately.

Formal Statement

Definition

Fourier Layer

Let $v_t : \Omega \to \mathbb{R}^{d_v}$ be a function-valued hidden state on a domain $\Omega$ with periodic boundary, sampled on an $N$-point grid. A Fourier layer maps $v_t$ to $v_{t+1}$ via

$$v_{t+1}(x) = \sigma\!\left(W\, v_t(x) + \bigl(\mathcal{F}^{-1} R\, \mathcal{F}\, v_t\bigr)(x)\right),$$

where $\mathcal{F}$ is the discrete Fourier transform applied channelwise, $R \in \mathbb{C}^{k_{\max}^d \times d_v \times d_v}$ is a learned complex tensor of spectral multipliers indexed by wavenumber $k$ with $|k| \le k_{\max}$, $W \in \mathbb{R}^{d_v \times d_v}$ is a pointwise linear map (a $1 \times 1$ convolution), and $\sigma$ is a pointwise nonlinearity such as GELU. Modes outside the truncation $|k| > k_{\max}$ are zeroed before the inverse FFT.

The pointwise term $W v_t(x)$ is the residual channel: it carries information across layers without filtering through the spectral cutoff and is essential for representing high-frequency content that the truncation would otherwise discard. Without it, each layer's pre-activation is strictly bandlimited to the lowest $k_{\max}^d$ Fourier modes, and high-frequency content must be regenerated from scratch by the nonlinearity at every layer.
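A single Fourier layer is a few lines of NumPy. The sketch below follows the definition above for 1D inputs; the function name, shapes, and the tanh approximation of GELU are implementation choices, not from the paper.

```python
import numpy as np

def fourier_layer_1d(v, R, W, k_max):
    """sigma(W v + IFFT(R . FFT(v))) on a periodic 1D grid.

    v: (N, d_v) real hidden state;  R: (k_max, d_v, d_v) complex spectral
    multipliers for modes 0..k_max-1;  W: (d_v, d_v) real pointwise map.
    """
    v_hat = np.fft.rfft(v, axis=0)                    # (N//2 + 1, d_v) complex
    out_hat = np.zeros_like(v_hat)
    # Channel-mixing multiplier on the retained low modes; the rest stay zero.
    out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
    spectral = np.fft.irfft(out_hat, n=v.shape[0], axis=0)
    pre = v @ W.T + spectral                          # residual + spectral path
    # GELU nonlinearity (tanh approximation)
    return 0.5 * pre * (1 + np.tanh(np.sqrt(2 / np.pi) * (pre + 0.044715 * pre**3)))

rng = np.random.default_rng(0)
N, d_v, k_max = 64, 8, 6
v = rng.standard_normal((N, d_v))
R = rng.standard_normal((k_max, d_v, d_v)) + 1j * rng.standard_normal((k_max, d_v, d_v))
W = rng.standard_normal((d_v, d_v))
out = fourier_layer_1d(v, R, W, k_max)
assert out.shape == (N, d_v)
```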

The FNO Architecture

A complete FNO is a sandwich. Input functions $a : \Omega \to \mathbb{R}^{d_a}$ are first lifted to the hidden width via a pointwise MLP $P : \mathbb{R}^{d_a} \to \mathbb{R}^{d_v}$, applied independently at each grid point. The lifted function $v_0(x) = P(a(x))$ then passes through $L$ Fourier layers, producing $v_L$. Finally, a pointwise projection $Q : \mathbb{R}^{d_v} \to \mathbb{R}^{d_u}$ collapses the hidden width to the output dimension: $u(x) = Q(v_L(x))$. Typical hyperparameters in the original paper: $L = 4$ Fourier layers, $d_v = 32$ channels, and $k_{\max} = 12$ modes per spatial direction.
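Stacking the pieces gives the full sandwich. In this hedged sketch the lift and projection are single linear maps rather than the paper's MLPs, and the weights are random stand-ins; only the shapes and dataflow track the architecture.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fno_1d(a, P, layers, Q, k_max):
    """Lift -> L Fourier layers -> project, for 1D periodic inputs a: (N, d_a)."""
    v = a @ P.T                                       # pointwise lift to width d_v
    for R, W in layers:
        v_hat = np.fft.rfft(v, axis=0)
        out_hat = np.zeros_like(v_hat)
        out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
        v = gelu(v @ W.T + np.fft.irfft(out_hat, n=v.shape[0], axis=0))
    return v @ Q.T                                    # pointwise projection to d_u

rng = np.random.default_rng(1)
N, d_a, d_v, d_u, L, k_max = 128, 2, 32, 1, 4, 12    # L, d_v, k_max as in the paper
P = rng.standard_normal((d_v, d_a))
layers = [(rng.standard_normal((k_max, d_v, d_v))
           + 1j * rng.standard_normal((k_max, d_v, d_v)),
           rng.standard_normal((d_v, d_v))) for _ in range(L)]
Q = rng.standard_normal((d_u, d_v))
u = fno_1d(rng.standard_normal((N, d_a)), P, layers, Q, k_max)
assert u.shape == (N, d_u)
```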

The compute cost per evaluation on an $N$-point grid is $O(L \cdot N \log N)$, dominated by the $2L$ FFTs (one forward, one inverse per layer). Compare this to a standard CNN with kernel size $k$ in $d$ dimensions: a single layer costs $O(k^d \cdot N)$ per channel pair. For modest $k_{\max}$ and large $N$, FNO is asymptotically more expensive than a small-stencil CNN; the win is not asymptotic complexity but the much larger effective receptive field. A single Fourier layer sees the entire domain, while a CNN needs depth proportional to the diameter divided by the kernel size.

Universal Approximation

Theorem

FNO Universal Approximation (Kovachki, Lanthaler, Mishra 2023)

Statement

Let $\mathcal{G} : H^s(\mathbb{T}^d) \to H^s(\mathbb{T}^d)$ be a continuous operator on Sobolev functions over the torus, and let $K \subset H^s(\mathbb{T}^d)$ be compact. For every $\epsilon > 0$ there exists an FNO $\mathcal{N}_\theta$ with finitely many Fourier layers, finite hidden width, and finite mode cutoff $k_{\max}$ such that

$$\sup_{a \in K} \, \lVert \mathcal{N}_\theta(a) - \mathcal{G}(a) \rVert_{H^s} < \epsilon.$$

Intuition

The compact-set restriction is doing the work. On a compact subset of $H^s$, input functions are uniformly bounded in spectral content, and a continuous operator into $H^s$ outputs functions whose high-frequency tail decays uniformly. So a finite mode cutoff $k_{\max}$ is enough to capture the relevant frequencies, and a finite-depth FNO can interpolate the truncated map on the compact set.

Proof Sketch

The argument has two parts. First, by density of trigonometric polynomials in $H^s(\mathbb{T}^d)$ and continuity of $\mathcal{G}$, both inputs and outputs can be approximated to error $\epsilon/3$ by their truncations to the lowest $k_{\max}^d$ Fourier modes for $k_{\max}$ large enough. Second, the truncated operator is a continuous map between two finite-dimensional spaces (the trigonometric polynomial spaces of degree $k_{\max}$), and the Chen-Chen (1995) universal approximation theorem for operator networks adapts to this finite-dimensional case: a sufficiently wide pointwise MLP composed with the spectral multiplier and inverse FFT realizes any continuous map up to error $\epsilon/3$. The triangle inequality closes the argument.

Why It Matters

This is the operator-learning analog of the Hornik-Cybenko universal approximation theorem for finite-dimensional networks. It justifies FNO as an architecture: in principle, any continuous operator on Sobolev functions can be approximated arbitrarily well. The result tells you what to expect from infinite-data, infinite-compute training; it says nothing about generalization from finite samples or about which operators are efficiently approximable.

Failure Mode

The constants in the approximation bound are exponential in the spatial dimension $d$ and the Sobolev order $s$. The cutoff $k_{\max}$ needed to reach error $\epsilon$ scales like $\epsilon^{-1/s}$. Targets with shocks, discontinuities, or heavy high-wavenumber content (compressible flows with shocks, fully developed 3D turbulence below the Kolmogorov scale) require an impractically large $k_{\max}$, and the FFT-based architecture cannot represent them efficiently. Empirically, FNO degrades on advection-dominated problems where the solution sharpens rather than smooths over time.
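The slow spectral decay of discontinuous targets is easy to demonstrate. In this illustrative NumPy experiment, truncating a smooth function to $k_{\max}$ modes is exact once its bandwidth is covered, while a step profile sheds error only slowly as modes are added:

```python
import numpy as np

N = 1024
x = np.linspace(0, 1, N, endpoint=False)
smooth = np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)   # modes 1 and 3
step = np.where(x < 0.5, 1.0, -1.0)                            # shock-like profile

def truncation_error(f, k_max):
    """Relative L2 error after keeping Fourier modes |k| <= k_max."""
    f_hat = np.fft.rfft(f)
    f_hat[k_max + 1:] = 0
    return np.linalg.norm(f - np.fft.irfft(f_hat, n=N)) / np.linalg.norm(f)

for k_max in (4, 16, 64):
    print(k_max, truncation_error(smooth, k_max), truncation_error(step, k_max))

assert truncation_error(smooth, 16) < 1e-12          # exact past the bandwidth
assert truncation_error(step, 64) > 0.05             # heavy high-wavenumber tail
```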

Resolution Invariance

The headline property of FNO is that the same learned weights $R, W, P, Q$ apply to any spatial resolution $N \ge k_{\max}$. Concretely: train on a $64 \times 64$ grid with $k_{\max} = 12$, then evaluate on a $256 \times 256$ grid by computing the FFT at the new resolution, multiplying the lowest $k_{\max}$ modes by $R$ (and zeroing the rest), and inverse-transforming. No retraining, no interpolation of weights, no architecture change. This is a genuine property of operators rather than discretizations and is the cleanest argument for treating FNO as something more than a CNN with a different kernel parameterization.
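The zero-shot transfer can be seen directly with a single linear spectral layer. This sketch (illustrative: one channel, no nonlinearity, no residual term) applies the same multipliers at two resolutions and checks agreement on the shared grid points:

```python
import numpy as np

def spectral_operator(v, r):
    """Apply fixed spectral multipliers r (modes 0..k_max) at any resolution."""
    k_max = len(r) - 1
    v_hat = np.fft.rfft(v)
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max + 1] = r * v_hat[:k_max + 1]       # retained modes; rest zeroed
    return np.fft.irfft(out_hat, n=len(v))

rng = np.random.default_rng(0)
r = rng.standard_normal(13) + 1j * rng.standard_normal(13)   # k_max = 12

# A bandlimited input sampled on a coarse and a fine grid
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(2 * np.pi * 5 * x)
coarse = spectral_operator(f(np.arange(64) / 64), r)
fine = spectral_operator(f(np.arange(256) / 256), r)

# The same weights evaluated at 256 points reproduce the 64-point output
# wherever the two grids coincide (every 4th fine point).
assert np.allclose(coarse, fine[::4])
```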

In practice the picture is more nuanced. Kovachki et al. (2023, §3.4) and Bartolucci et al. (2024) document that empirical resolution invariance degrades when training data is single-resolution: the network learns spectral biases tied to the training grid (aliasing artifacts, mode-coupling errors at the Nyquist frequency, behavior of $W$ on under-resolved patches) that do not transfer cleanly to other resolutions. The fix is either multi-resolution training (mix grids during training) or alias-free architectural variants. Treat the resolution-invariance claim as a property of the operator the architecture is capable of representing, not as a guarantee about any particular trained instance.

Worked Example: 2D Darcy Flow

Consider the steady-state Darcy equation on the unit square $\Omega = [0,1]^2$ with periodic boundary:

$$-\nabla \cdot \bigl(a(x)\, \nabla u(x)\bigr) = f(x), \qquad x \in \Omega,$$

with a fixed forcing $f$ and a spatially varying diffusion coefficient $a(x) > 0$ drawn from a Gaussian random field. The solution operator $\mathcal{G} : a \mapsto u$ is nonlinear in $a$ (the inverse of an $a$-dependent elliptic operator), continuous on $L^\infty(\Omega) \cap \{a \ge a_{\min}\}$, and is exactly the kind of map FNO targets.

In Li et al. (2021, §5.1), training data is generated by solving the PDE with a second-order finite-difference method on a $241 \times 241$ grid for $N_{\text{train}} = 1024$ samples of $a$. An FNO with $L = 4$ Fourier layers, $d_v = 32$ channels, and $k_{\max} = 12$ modes per direction is trained for 500 epochs with Adam. The learned operator achieves relative $L^2$ test error of about $0.7 \times 10^{-3}$, comparable to the discretization error of the numerical solver itself. Inference takes around 5 ms per sample on a single GPU; the finite-difference solver takes tens of seconds. The cost asymmetry is the entire point of operator learning: pay an upfront training cost to amortize a large number of downstream evaluations.
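Generating the coefficient fields is the lightweight part of the pipeline and can be sketched without a PDE solver. The snippet below samples a periodic Gaussian random field by spectral filtering and thresholds it into a two-valued coefficient; the covariance parameters and the high/low values are illustrative stand-ins, not a claim about the paper's exact setup.

```python
import numpy as np

def sample_grf_2d(N, alpha=2.0, tau=3.0, seed=0):
    """Periodic 2D Gaussian field with spectral density ~ (|k|^2 + tau^2)^(-alpha)."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(N, d=1.0 / N)                 # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    amplitude = (kx**2 + ky**2 + tau**2) ** (-alpha / 2)
    noise = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    return np.fft.ifft2(noise * amplitude).real * N  # real field on the N x N grid

# Threshold into a piecewise-constant diffusion coefficient a(x) > 0
# (illustrative high/low values)
g = sample_grf_2d(241)
a = np.where(g > np.median(g), 12.0, 3.0)
assert a.shape == (241, 241) and a.min() > 0
```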

Common Confusions

Watch Out

FNO is not just a CNN in the Fourier domain

A standard CNN convolves with a fixed-size spatial kernel: parameter count $O(k^d)$ per channel pair, receptive field grows linearly with depth, and the kernel is local in space. FNO learns a full-rank (up to truncation) spectral multiplier of size $O(k_{\max}^d)$ per channel pair: each retained mode gets its own learned channel-mixing matrix, so every output point depends on every input point and the receptive field is the entire domain in a single layer. The two architectures encode different inductive biases. CNNs are biased toward local features; FNO is biased toward smooth global structure with a controlled bandwidth.
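The parameter-count contrast can be made concrete. These are hypothetical helper functions using the counting conventions of this page (complex $R$ counted as two reals per entry, plus the pointwise $W$):

```python
def cnn_layer_params(k, d, channels):
    """Spatial k^d stencil per channel pair: local receptive field."""
    return k**d * channels**2

def fno_layer_params(k_max, d, channels):
    """Complex spectral multiplier per retained mode, plus pointwise W."""
    return 2 * k_max**d * channels**2 + channels**2

d, channels = 2, 32
print("CNN 3x3 layer:", cnn_layer_params(3, d, channels))        # 9216
print("FNO k_max=12 layer:", fno_layer_params(12, d, channels))  # 295936

# FNO pays ~32x more parameters per layer at these settings, but one layer
# already couples every grid point; a 3x3 CNN needs depth ~ N/2 for the
# same coverage on an N-point-wide domain.
```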

Watch Out

Resolution invariance is a property of the architecture, not a guarantee about a trained model

The FNO weights are independent of the grid, so the same parameters can be applied at any resolution above the mode cutoff. This is a real architectural property and distinguishes FNO from CNN-based operators. It is not a guarantee that a network trained on one resolution will generalize cleanly to a much higher resolution: training data at a single grid size induces spectral biases tied to that discretization, and zero-shot transfer to dramatically finer grids often degrades. Multi-resolution training or alias-aware variants are the standard remedy.

Watch Out

Plain FNO requires function values on a regular periodic grid

The FFT in the Fourier layer is a discrete Fourier transform on a uniform grid with periodic boundary; FNO out of the box is not mesh-free, not adaptive, and not directly applicable to functions sampled on irregular point clouds or in domains with non-trivial boundary. Extensions exist: Geo-FNO (Li et al. 2023) learns a coordinate diffeomorphism that maps a non-trivial geometry to a periodic reference domain; Graph Neural Operators (Anandkumar et al. 2020) replace the FFT with message-passing on irregular graphs; spherical FNO (Bonev et al. 2023) replaces the planar FFT with a spherical harmonic transform for global atmospheric data. All of these are responses to the periodic-grid restriction of the original architecture.

Exercises

ExerciseCore

Problem

Consider 1D periodic functions sampled on $N$ grid points and a single Fourier layer with hidden width $d_v$ and mode cutoff $k_{\max}$. Drop the nonlinearity and the residual term $W v$, so the layer is $v_{t+1} = \mathcal{F}^{-1} R\, \mathcal{F} v_t$. Write out the matrix form of this map and show that the parameter count of $R$ is $O(k_{\max} d_v^2)$, independent of $N$.

ExerciseAdvanced

Problem

A linear operator $T : L^2(\mathbb{T}^d) \to L^2(\mathbb{T}^d)$ is shift-invariant if it commutes with all translations: $T \circ \tau_h = \tau_h \circ T$ for every shift $\tau_h$. Show that any bounded shift-invariant operator on $L^2(\mathbb{T}^d)$ can be represented exactly by a single Fourier layer with no truncation ($k_{\max} = N$), no nonlinearity, and no residual term $W$.


Next Topics

  • Navier-Stokes for ML: the canonical PDE benchmark FNO is evaluated against and the setting where operator learning competes head-on with PINN approaches.
  • Physics-Informed Neural Networks: the per-instance optimization alternative to FNO; compare data-driven operator learning against residual-loss training on a single PDE solve.
  • Spectral Theory of Operators: the Hilbert-space machinery underlying the integral-operator view; eigenfunction expansions are the abstract version of the Fourier basis FNO uses concretely.
  • PDE Fundamentals for ML: elliptic, parabolic, and hyperbolic classification and what each implies for the regularity of the solution operator FNO is trying to learn.
  • Fast Fourier Transform: the $O(N \log N)$ algorithm that makes the Fourier layer practical at scale.

Summary. FNO parameterizes the kernel of an integral operator directly in Fourier space, truncates to $k_{\max}$ low-frequency modes, and applies the operator with two FFTs per layer. The architecture is resolution-invariant by construction: the same learned weights apply to any grid above the mode cutoff. Universal approximation holds for continuous operators on Sobolev spaces over the torus, but the constants are exponential in dimension, the architecture struggles on shock-dominated and high-wavenumber problems, and empirical resolution invariance degrades under single-resolution training without alias-free variants.

Last reviewed: April 18, 2026
