ML Methods
Fourier Neural Operator
The Fourier Neural Operator (Li, Kovachki, Anandkumar et al., ICLR 2021) parameterizes the kernel of an integral operator directly in Fourier space, giving a resolution-invariant architecture for learning maps between function spaces. Canonical baseline for data-driven PDE solvers and the architectural backbone of FourCastNet weather prediction.
Why This Matters
Classical neural networks learn maps between finite-dimensional vector spaces. A PDE solver, by contrast, computes a map between function spaces: from an initial condition to the solution at a later time, or from a coefficient field to the corresponding solution of an elliptic boundary-value problem. Discretize on a grid of points and a CNN can imitate this map, but the learned weights are tied to that grid: train at $64 \times 64$ and the network has no principled meaning at $256 \times 256$.
The Fourier Neural Operator (FNO) of Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, Stuart, and Anandkumar (ICLR 2021) was the first architecture to break this discretization tie convincingly. The construction is direct: parameterize the kernel of an integral operator in Fourier space, truncate to a fixed number of low-frequency modes, and evaluate with an FFT. The learned weights live in spectral space and are independent of the spatial resolution, so the same trained network applies to any sufficiently fine grid.
On the standard 2D Navier-Stokes benchmark of Li et al. (2021), FNO matched a pseudospectral solver to relative error around $10^{-2}$ at sub-second inference, three orders of magnitude faster than the solver it was trained against. This result rearranged the data-driven PDE landscape: FNO became the default baseline against which neural operators (DeepONet, Geo-FNO, U-Net operators, transformer operators) are benchmarked. The architecture also scaled to global weather: Pathak et al. (2022) trained an Adaptive FNO variant (FourCastNet) on ERA5 reanalysis data and matched the IFS forecast on key surface variables at orders of magnitude higher throughput.
The conceptual contrast with physics-informed neural networks is sharp. PINNs solve a single PDE instance by encoding the residual into the loss; every new initial condition demands a new optimization. FNO amortizes the cost: train once on a distribution of initial conditions, then evaluate the learned operator instantly on any new instance from that distribution. The tradeoff is that FNO needs labeled solution data (typically from a conventional solver), while PINNs need only the PDE.
Mental Model
Many useful operators on functions can be written as integral operators of the form $(\mathcal{K}u)(x) = \int_D \kappa(x, y)\, u(y)\, dy$. When the kernel is shift-invariant, $\kappa(x, y) = \kappa(x - y)$, this integral is a convolution, and the convolution theorem turns it into pointwise multiplication in the Fourier domain: $\widehat{\kappa * u}(k) = \hat{\kappa}(k)\, \hat{u}(k)$. FNO takes this identity as a design principle. Rather than parameterize $\kappa$ in physical space (as a CNN kernel does), it parameterizes the spectral multiplier $\hat{\kappa}(k)$ directly, learns a different multiplier for each Fourier mode up to a truncation cutoff $k_{\max}$, and applies the operator with two FFTs.
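The convolution-theorem identity that motivates the design can be checked numerically on a periodic grid. A minimal NumPy sketch (random samples stand in for the function and kernel):

```python
import numpy as np

# Numerical check of the convolution theorem on a periodic grid:
# circular convolution in physical space equals pointwise
# multiplication of DFT coefficients in Fourier space.
rng = np.random.default_rng(0)
n = 64
u = rng.standard_normal(n)        # samples of the input function
kappa = rng.standard_normal(n)    # samples of a shift-invariant kernel

# Circular convolution computed directly in physical space.
conv_direct = np.array([
    sum(kappa[(i - j) % n] * u[j] for j in range(n)) for i in range(n)
])

# Same operation via FFT: multiply mode-by-mode, then invert.
conv_fft = np.fft.ifft(np.fft.fft(kappa) * np.fft.fft(u)).real

assert np.allclose(conv_direct, conv_fft)
```

The FFT route costs $O(n \log n)$ instead of $O(n^2)$, which is the efficiency FNO inherits.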
The truncation is the crucial engineering choice. Without it, the spectral multiplier on a grid of $n$ points has $n$ independent values per channel pair, so the parameter count grows with resolution. With it, the parameter count is pinned at $O(k_{\max}^d\, d_v^2)$ in $d$ spatial dimensions with $d_v$ channels, independent of $n$. The same trained weights can be evaluated on any grid of size $n'$ by zero-padding or truncating the FFT output appropriately.
Formal Statement
Fourier Layer
Let $v_t : D \to \mathbb{R}^{d_v}$ be a function-valued hidden state on a domain $D$ with periodic boundary, sampled on an $n$-point grid. A Fourier layer maps $v_t$ to $v_{t+1}$ via
$$v_{t+1}(x) = \sigma\big(W v_t(x) + \mathcal{F}^{-1}(R \cdot \mathcal{F} v_t)(x)\big),$$
where $\mathcal{F}$ is the discrete Fourier transform applied channelwise, $R$ is a learned complex tensor of spectral multipliers indexed by wavenumber $k$ with $|k| \le k_{\max}$, $W \in \mathbb{R}^{d_v \times d_v}$ is a pointwise linear map (a $1 \times 1$ convolution), and $\sigma$ is a pointwise nonlinearity such as GELU. Modes outside the truncation are zeroed before the inverse FFT.
The pointwise term $W v_t(x)$ is the residual channel: it carries information across layers without filtering through the spectral cutoff and is essential for representing high-frequency content that the truncation would otherwise discard. Without it, the network can only output linear combinations of the lowest $k_{\max}$ Fourier modes, which is a strictly bandlimited family.
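A minimal NumPy sketch of a single 1D Fourier layer makes the update concrete. The tensors `R` and `W` here are random stand-ins for learned weights, and the real-FFT convention is an implementation choice:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, applied pointwise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fourier_layer(v, R, W, k_max):
    """One 1D Fourier layer: v -> gelu(W v + F^{-1}(R . F v)).

    v : (n, d_v) real samples of the hidden state on a periodic grid
    R : (k_max, d_v, d_v) complex spectral multipliers for modes 0..k_max-1
    W : (d_v, d_v) real pointwise linear map (the residual channel)
    """
    n, d_v = v.shape
    v_hat = np.fft.rfft(v, axis=0)                  # (n//2 + 1, d_v) complex
    out_hat = np.zeros_like(v_hat)
    # Keep only the lowest k_max modes, mixing channels with R;
    # all higher modes are zeroed (the spectral truncation).
    out_hat[:k_max] = np.einsum("koi,ki->ko", R, v_hat[:k_max])
    spectral = np.fft.irfft(out_hat, n=n, axis=0)   # back to physical space
    return gelu(v @ W.T + spectral)

rng = np.random.default_rng(0)
n, d_v, k_max = 128, 8, 12
v = rng.standard_normal((n, d_v))
R = rng.standard_normal((k_max, d_v, d_v)) + 1j * rng.standard_normal((k_max, d_v, d_v))
W = rng.standard_normal((d_v, d_v))
out = fourier_layer(v, R, W, k_max)
assert out.shape == (n, d_v)
# Parameter count of R is k_max * d_v * d_v, independent of the grid size n.
```

Note that `R` never sees `n`: the same weights apply on any grid fine enough to resolve the first `k_max` modes.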
The FNO Architecture
A complete FNO is a sandwich. Input functions are first lifted to the hidden width $d_v$ via a pointwise MLP $P$, applied channelwise. The lifted function then passes through $T$ Fourier layers, producing $v_T$. Finally, a pointwise projection $Q$ collapses the hidden width to the output dimension. Typical hyperparameters in the original paper: $T = 4$ Fourier layers, a few dozen channels, and $k_{\max} = 12$ to $16$ modes per spatial direction.
The compute cost per evaluation on an $n$-point grid is $O(n \log n)$ per layer, dominated by the FFTs (one forward, one inverse per layer). Compare this to a standard CNN with kernel size $s$ in $d$ dimensions: a single layer costs $O(s^d n)$ per channel pair. For modest $s$ and large $n$, FNO is asymptotically more expensive than a small-stencil CNN; the win is not asymptotic complexity but the much larger effective receptive field. A single Fourier layer sees the entire domain, while a CNN needs depth proportional to the domain diameter divided by the kernel size.
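The lift-layers-project sandwich can be assembled end to end in a few lines. This is an illustrative sketch with random, untrained weights; the parameter names (`P`, `Q`, per-layer `R`, `W`) follow the notation above:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fno_forward(a, params, k_max):
    """Forward pass of a minimal 1D FNO: lift -> T Fourier layers -> project.

    a      : (n, d_in) input function samples on a periodic grid
    params : dict with lift P (d_in, d_v), per-layer (R, W), projection Q (d_v, d_out)
    """
    v = a @ params["P"]                              # pointwise lift to width d_v
    for R, W in params["layers"]:
        v_hat = np.fft.rfft(v, axis=0)
        out_hat = np.zeros_like(v_hat)
        out_hat[:k_max] = np.einsum("koi,ki->ko", R, v_hat[:k_max])
        v = gelu(v @ W.T + np.fft.irfft(out_hat, n=v.shape[0], axis=0))
    return v @ params["Q"]                           # pointwise projection to d_out

rng = np.random.default_rng(0)
n, d_in, d_v, d_out, T, k_max = 128, 1, 16, 1, 4, 12
params = {
    "P": rng.standard_normal((d_in, d_v)) * 0.1,
    "Q": rng.standard_normal((d_v, d_out)) * 0.1,
    "layers": [
        (0.1 * (rng.standard_normal((k_max, d_v, d_v))
                + 1j * rng.standard_normal((k_max, d_v, d_v))),
         0.1 * rng.standard_normal((d_v, d_v)))
        for _ in range(T)
    ],
}
a = np.sin(2 * np.pi * np.arange(n) / n)[:, None]    # a sample input function
u = fno_forward(a, params, k_max)
assert u.shape == (n, d_out)
```

In practice the lift and projection are small MLPs rather than single matrices, but the shape bookkeeping is the same.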
Universal Approximation
FNO Universal Approximation (Kovachki, Lanthaler, Mishra 2021)
Statement
Let $\mathcal{G} : H^s(\mathbb{T}^d) \to H^{s'}(\mathbb{T}^d)$ be a continuous operator between Sobolev spaces over the torus, and let $K \subset H^s(\mathbb{T}^d)$ be compact. For every $\varepsilon > 0$ there exists an FNO $\mathcal{N}$ with finitely many Fourier layers, finite hidden width, and finite mode cutoff such that
$$\sup_{u \in K} \big\| \mathcal{G}(u) - \mathcal{N}(u) \big\|_{H^{s'}} \le \varepsilon.$$
Intuition
The compact-set restriction is doing the work. On a compact subset of $H^s(\mathbb{T}^d)$, input functions are uniformly bounded in spectral content, and a continuous operator into $H^{s'}(\mathbb{T}^d)$ outputs functions whose high-frequency tail decays uniformly. So a finite mode cutoff is enough to capture the relevant frequencies, and a finite-depth FNO can interpolate the truncated map on the compact set.
Proof Sketch
The argument has two parts. First, by density of trigonometric polynomials in $H^s$ and continuity of $\mathcal{G}$, both inputs and outputs can be approximated to error $\varepsilon/3$ by their truncations to the lowest $N$ Fourier modes for $N$ large enough. Second, the truncated operator is a continuous map between two finite-dimensional spaces (the trigonometric polynomial spaces of degree $N$), and the Chen-Chen (1995) universal approximation theorem for operator networks adapts to this finite-dimensional case: a sufficiently wide pointwise MLP composed with the spectral multiplier and inverse FFT realizes any continuous map up to error $\varepsilon/3$. The triangle inequality closes the argument.
Why It Matters
This is the operator-learning analog of the Hornik-Cybenko universal approximation theorem for finite-dimensional networks. It justifies FNO as an architecture: in principle, any continuous operator on Sobolev functions can be approximated arbitrarily well. The result tells you what to expect from infinite-data, infinite-compute training; it says nothing about generalization from finite samples or about which operators are efficiently approximable.
Failure Mode
The constants in the approximation bound are exponential in the spatial dimension $d$ and depend on the Sobolev order $s$. The cutoff $k_{\max}$ needed to reach error $\varepsilon$ scales like $\varepsilon^{-1/s}$ for targets with $H^s$ regularity. Targets with shocks, discontinuities, or heavy high-wavenumber content (compressible flows with shocks, fully developed 3D turbulence below the Kolmogorov scale) require an impractically large $k_{\max}$, and the FFT-based architecture cannot represent them efficiently. Empirically, FNO degrades on advection-dominated problems where the solution sharpens rather than smooths over time.
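The regularity dependence is easy to see numerically: truncating the spectrum of a smooth periodic function loses almost nothing, while truncating a step function (a stand-in for a shock) loses a fixed fraction of the energy. A small NumPy sketch, with the test functions chosen for illustration:

```python
import numpy as np

n = 1024
x = np.linspace(0.0, 1.0, n, endpoint=False)
smooth = np.exp(np.sin(2 * np.pi * x))       # analytic and periodic
step = np.where(x < 0.5, 1.0, -1.0)          # discontinuous ("shock")

def truncation_error(f, k_max):
    """Relative L2 error after keeping only the lowest k_max Fourier modes."""
    f_hat = np.fft.rfft(f)
    f_hat[k_max:] = 0.0                      # the spectral truncation
    f_trunc = np.fft.irfft(f_hat, n=n)
    return np.linalg.norm(f - f_trunc) / np.linalg.norm(f)

err_smooth = truncation_error(smooth, 16)    # exponentially small
err_step = truncation_error(step, 16)        # O(k_max^{-1/2}) in L2, still large
assert err_smooth < 1e-8
assert err_step > 0.05
```

An FNO with `k_max = 16` can represent the smooth target essentially exactly, but no choice of multipliers recovers the high-frequency content the truncation discards from the step.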
Resolution Invariance
The headline property of FNO is that the same learned weights apply to any spatial resolution. Concretely: train on a grid of $n$ points with mode cutoff $k_{\max}$, then evaluate on a grid of $n' \ne n$ points by computing the FFT at the new resolution, multiplying the lowest $k_{\max}$ modes by $R$ (and zeroing the rest), and inverse-transforming. No retraining, no interpolation of weights, no architecture change. This is a genuine property of operators rather than discretizations and is the cleanest argument for treating FNO as something more than a CNN with a different kernel parameterization.
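This zero-shot transfer can be sketched for a single channel in NumPy. The multipliers `R` here are random stand-ins for learned weights; the input is chosen bandlimited so both grids resolve it exactly:

```python
import numpy as np

k_max = 8
rng = np.random.default_rng(0)
R = rng.standard_normal(k_max) + 1j * rng.standard_normal(k_max)  # "learned" multipliers

def spectral_apply(u, R, k_max):
    """Apply the multipliers R to the lowest k_max modes of u (linear spectral part)."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:k_max] = R * u_hat[:k_max]      # higher modes zeroed
    return np.fft.irfft(out_hat, n=len(u))

def f(x):
    # a smooth periodic test input, bandlimited below both Nyquist frequencies
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

x_coarse = np.arange(64) / 64
x_fine = np.arange(256) / 256
out_coarse = spectral_apply(f(x_coarse), R, k_max)   # weights applied at n = 64
out_fine = spectral_apply(f(x_fine), R, k_max)       # same weights at n = 256

# The two evaluations agree wherever the grids share points: same operator,
# two discretizations.
assert np.allclose(out_coarse, out_fine[::4])
```

The cancellation of the DFT's $n$-dependent scaling between `rfft` and `irfft` is what makes the same `R` meaningful at both resolutions.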
In practice the picture is more nuanced. Kovachki et al. (2023, §3.4) and Bartolucci et al. (2024) document that empirical resolution invariance degrades when training data is single-resolution: the network learns spectral biases tied to the training grid (aliasing artifacts, mode-coupling errors at the Nyquist frequency, the pointwise nonlinearity generating frequencies that are under-resolved on the training grid) that do not transfer cleanly to other resolutions. The fix is either multi-resolution training (mix grids during training) or alias-free architectural variants. Treat the resolution-invariance claim as a property of the operator the architecture is capable of representing, not as a guarantee about any particular trained instance.
Worked Example: 2D Darcy Flow
Consider the steady-state Darcy equation on the unit square with periodic boundary:
$$-\nabla \cdot \big(a(x)\, \nabla u(x)\big) = f(x),$$
with a fixed forcing $f$ and a spatially varying diffusion coefficient $a$ drawn from a Gaussian random field. The solution operator $\mathcal{G} : a \mapsto u$ is nonlinear in $a$ (the inverse of an $a$-dependent elliptic operator), continuous on $L^\infty$, and is exactly the kind of map FNO targets.
In Li et al. (2021), training data is generated by solving the PDE with a second-order finite-difference method on a fine reference grid for on the order of a thousand samples of $a$. An FNO with $T = 4$ Fourier layers, $d_v = 32$ channels, and $k_{\max} = 12$ modes per direction is trained for 500 epochs with Adam. The learned operator achieves relative test error of about $10^{-2}$, comparable to the discretization error of the reference solver itself. Inference takes around 5 ms per sample on a single GPU; the conventional solver takes tens of seconds. The cost asymmetry is the entire point of operator learning: pay an upfront training cost to amortize a large number of downstream evaluations.
Common Confusions
FNO is not just a CNN in the Fourier domain
A standard CNN convolves with a fixed-size spatial kernel: $O(s^d)$ parameters per channel pair, a receptive field that grows linearly with depth, and a kernel that is local in space. FNO learns a full-rank (up to truncation) spectral multiplier of size $O(k_{\max}^d)$ per channel pair: each retained Fourier mode is a global functional of the input, so the receptive field is the entire domain in a single layer. The two architectures encode different inductive biases. CNNs are biased toward local features; FNO is biased toward smooth global structure with a controlled bandwidth.
Resolution invariance is a property of the architecture, not a guarantee about a trained model
The FNO weights are independent of the grid, so the same parameters can be applied at any resolution above the mode cutoff. This is a real architectural property and distinguishes FNO from CNN-based operators. It is not a guarantee that a network trained on one resolution will generalize cleanly to a much higher resolution: training data at a single grid size induces spectral biases tied to that discretization, and zero-shot transfer to dramatically finer grids often degrades. Multi-resolution training or alias-aware variants are the standard remedy.
Plain FNO requires function values on a regular periodic grid
The FFT in the Fourier layer is a discrete Fourier transform on a uniform grid with periodic boundary; FNO out of the box is not mesh-free, not adaptive, and not directly applicable to functions sampled on irregular point clouds or in domains with non-trivial boundary. Extensions exist: Geo-FNO (Li et al. 2023) learns a coordinate diffeomorphism that maps a non-trivial geometry to a periodic reference domain; Graph Neural Operators (Anandkumar et al. 2020) replace the FFT with message-passing on irregular graphs; spherical FNO (Bonev et al. 2023) replaces the planar FFT with a spherical harmonic transform for global atmospheric data. All of these are responses to the periodic-grid restriction of the original architecture.
Exercises
Problem
Consider 1D periodic functions sampled on $n$ grid points and a single Fourier layer with hidden width $d_v$ and mode cutoff $k_{\max}$. Drop the nonlinearity $\sigma$ and the residual term $W$, so the layer is $v \mapsto \mathcal{F}^{-1}(R \cdot \mathcal{F} v)$. Write out the matrix form of this map and show that the parameter count of $R$ is $O(k_{\max}\, d_v^2)$, independent of $n$.
Problem
A linear operator $T$ is shift-invariant if it commutes with all translations: $T(u(\cdot - s)) = (Tu)(\cdot - s)$ for every shift $s$. Show that any bounded shift-invariant linear operator on $L^2(\mathbb{T})$ can be represented exactly by a single Fourier layer with no truncation (all modes retained), no nonlinearity ($\sigma = \mathrm{id}$), and no residual term ($W = 0$).
Next Topics
- Navier-Stokes for ML: the canonical PDE benchmark FNO is evaluated against and the setting where operator learning competes head-on with PINN approaches.
- Physics-Informed Neural Networks: the per-instance optimization alternative to FNO; compare data-driven operator learning against residual-loss training on a single PDE solve.
- Spectral Theory of Operators: the Hilbert-space machinery underlying the integral-operator view; eigenfunction expansions are the abstract version of the Fourier basis FNO uses concretely.
- PDE Fundamentals for ML: elliptic, parabolic, and hyperbolic classification and what each implies for the regularity of the solution operator FNO is trying to learn.
- Fast Fourier Transform: the algorithm that makes the Fourier layer practical at scale.
Summary. FNO parameterizes the kernel of an integral operator directly in Fourier space, truncates to low-frequency modes, and applies the operator with two FFTs per layer. The architecture is resolution-invariant by construction: the same learned weights apply to any grid above the mode cutoff. Universal approximation holds for continuous operators on Sobolev spaces over the torus, but the constants are exponential in dimension, the architecture struggles on shock-dominated and high-wavenumber problems, and empirical resolution invariance degrades under single-resolution training without alias-free variants.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Fast Fourier Transform (Layer 1)
- Exponential Function Properties (Layer 0A)
- Navier-Stokes for ML (Layer 4)
- PDE Fundamentals for Machine Learning (Layer 1)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Stochastic Differential Equations (Layer 3)
- Brownian Motion (Layer 2)
- Measure-Theoretic Probability (Layer 0B)
- Martingale Theory (Layer 0B)
- Ito's Lemma (Layer 3)
- Stochastic Calculus for ML (Layer 3)
- Functional Analysis Core (Layer 0B)
- Metric Spaces, Convergence, and Completeness (Layer 0A)
- Inner Product Spaces and Orthogonality (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Physics-Informed Neural Networks (Layer 4)
- The Jacobian Matrix (Layer 0A)
- Automatic Differentiation (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Gradient Descent Variants (Layer 1)
- Spectral Theory of Operators (Layer 0B)