
Kolmogorov-Arnold Networks (KANs)

An alternative to MLPs where learnable univariate functions (typically B-splines) sit on edges and pure summation sits on nodes. Motivated by the Kolmogorov-Arnold representation theorem, competitive on small smooth scientific tasks, unproven at frontier scale.


Why This Matters

Every standard neural architecture since the late 1980s fixes the same design: linear weights on edges, fixed nonlinearities (ReLU, tanh, GELU) on nodes. Kolmogorov-Arnold Networks (Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, April 2024, arXiv:2404.19756) invert this layout. Each edge carries its own learnable univariate function, typically a B-spline; each node simply sums its inputs. The universal approximation backing is replaced by the Kolmogorov-Arnold representation theorem from 1957.

[Figure: three panels. Left, an MLP layer: weights on edges, fixed σ on nodes, computing $y_j = \sigma(\sum_i w_{ij} x_i + b_j)$. Center, a KAN layer: learnable $\phi_{ij}$ on edges, summation on nodes, computing $y_j = \sum_i \phi_{ij}(x_i)$. Right, an interactive single KAN edge with draggable knots, plotting $\phi(x) = w \cdot \mathrm{spline}(x) + w \cdot \mathrm{SiLU}(x)$ as a SiLU base plus a spline component.]

The MLP computes a fixed nonlinearity over a weighted sum. The KAN replaces both pieces. Every edge carries its own learnable univariate function $\phi_{ij}$ (a B-spline in the Liu et al. 2024 parameterization), and nodes just add. The right panel shows one such $\phi$, written as a scalar-weighted sum of a spline component and a SiLU base. The construction is grounded in the Kolmogorov-Arnold representation theorem: every continuous function of several variables decomposes into sums and compositions of univariate continuous functions.

KANs are worth understanding for two reasons. First, on small, smooth scientific targets (PDE solutions, symbolic regression, low-dimensional regression) they reach comparable or better accuracy than MLPs at matched parameter counts, and the spline coefficients are directly visualizable, which lets researchers extract closed-form approximations. Second, knowing their limits matters: they have not (as of April 2026) demonstrated a decisive win at language-model scale, and larger comparisons (Yu, Yu, Wang 2024, arXiv:2407.16674) find MLPs still dominate on standard vision and NLP benchmarks at equal compute. Treat KANs as a scientific-ML architecture with sharp interpretability properties, not as a replacement for the transformer MLP.

Mental Model

An MLP layer computes $y = \sigma(Wx + b)$: a linear combination, then a fixed activation. A KAN layer computes $y_j = \sum_i \phi_{j,i}(x_i)$: every input coordinate is passed through its own learnable 1D function, and the outputs are summed. The "weights" and "activations" trade places.

The consequence is that the nonlinearity is not a fixed global choice but a learned, edge-specific shape. A single KAN edge can be linear in one region, sinusoidal in another, and step-like in a third. After training, each $\phi_{j,i}$ is a curve you can plot. Contrast this with an MLP, where the only thing you can plot is a weight scalar.
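The trade of places can be made concrete in a few lines of numpy. This is a toy sketch, not the paper's parameterization: each $\phi_{j,i}$ below is a random cubic polynomial standing in for a B-spline, just to show the computation pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 3, 2
x = rng.normal(size=n_in)

# MLP layer: weighted sum on edges, one fixed nonlinearity on the node.
W, b = rng.normal(size=(n_out, n_in)), rng.normal(size=n_out)
y_mlp = np.tanh(W @ x + b)

# KAN layer: a learnable 1D function on every edge, plain summation on the node.
# Each phi_{j,i} here is a cubic polynomial (a stand-in for the B-spline edge).
coeffs = rng.normal(size=(n_out, n_in, 4))   # 4 polynomial coefficients per edge
y_kan = np.array([sum(np.polyval(coeffs[j, i], x[i]) for i in range(n_in))
                  for j in range(n_out)])

print(y_mlp.shape, y_kan.shape)   # (2,) (2,)
```

Note that the KAN layer has no shared nonlinearity anywhere: all shape comes from the per-edge functions, and the node is a bare sum.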

The Kolmogorov-Arnold Representation Theorem

Theorem

Kolmogorov-Arnold Representation (1957)

Statement

Every continuous function $f: [0,1]^n \to \mathbb{R}$ can be written as

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$

for some continuous outer functions $\Phi_q: \mathbb{R} \to \mathbb{R}$ and continuous inner functions $\phi_{q,p}: [0,1] \to \mathbb{R}$. The number of outer functions is $2n+1$, independent of $f$. The inner functions can be chosen universal: by Sprecher's 1965 refinement, $\phi_{q,p}(x) = \lambda_p\, \phi(x + qa)$ for a single continuous $\phi$ and constants $\lambda_p, a$, so only the outer functions $\Phi_q$ depend on $f$.

Intuition

Multivariate continuous functions reduce to sums and univariate continuous functions. No genuine $n$-ary operation is needed. This resolves Hilbert's 13th problem in the continuous case: any continuous function of many variables is a superposition of continuous functions of one variable and a single binary operation (addition).
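A degenerate but concrete instance of the reduction (far simpler than Kolmogorov's construction): multiplication, an apparently binary operation, is already a superposition of addition and univariate squaring,

```latex
xy = \frac{1}{4}\Bigl( (x + y)^2 - (x - y)^2 \Bigr)
```

Here the inner functions are $\pm$ the identity and the outer functions are $u \mapsto \pm u^2/4$. KART asserts that an analogous collapse, with $2n+1$ outer terms, exists for every continuous $f$, though the inner functions it produces are far wilder than these.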

Proof Sketch

Kolmogorov constructs the inner functions $\phi_{q,p}$ as monotone Hölder-continuous mappings that embed $[0,1]^n$ into $\mathbb{R}$ in a way that separates points well enough for any continuous $f$ to be recovered by suitable $\Phi_q$. Arnold, in closely related work, showed the construction can be made with inner functions independent of $f$, and Sprecher showed they can be derived from a single universal $\phi$. The outer functions $\Phi_q$ are then obtained by a uniform-approximation argument on a compact image set.

Why It Matters

The theorem says the "intrinsic" complexity of a continuous multivariate function lives in $2n+1$ univariate pieces. Any architecture that can fit arbitrary univariate functions and compose via summation is, in principle, universal.

Failure Mode

The inner functions $\phi_{q,p}$ are typically highly non-smooth. Even for a target $f$ with, say, Lipschitz regularity, the inner functions in the Kolmogorov construction can fail to be differentiable, and Girosi and Poggio (1989, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant," Neural Computation 1(4)) argued that this pathology makes the theorem a poor foundation for practical learning. KART guarantees the existence of a representation with continuous component functions; it does not guarantee that the representation is smooth, efficiently parameterizable, or learnable by gradient descent.

From the Theorem to the Architecture

The KAN paper does not claim that Kolmogorov-Arnold representations with smooth splines exist for arbitrary continuous $f$. It instead proposes a generalization: allow the outer and inner functions to be smooth (B-splines) and stack the two-layer structure to arbitrary depth. The resulting object is still a network of univariate edges and summation nodes, but it is no longer tied to the exact $2n+1$ structure of the original theorem.

Under this generalization, the target class is not all continuous functions but the class of functions that admit a smooth KAN decomposition. Practical evidence in Liu et al. 2024 suggests this class covers many scientific targets (PDE solutions, dynamical-system flows, physics formulas) but leaves open what it excludes.

The KAN Layer

Definition

KAN Layer

A KAN layer with input dimension $n_{\text{in}}$ and output dimension $n_{\text{out}}$ is a collection of learnable univariate functions $\phi_{j,i}: \mathbb{R} \to \mathbb{R}$ for $i \in \{1, \ldots, n_{\text{in}}\}$, $j \in \{1, \ldots, n_{\text{out}}\}$, with output

$$y_j = \sum_{i=1}^{n_{\text{in}}} \phi_{j,i}(x_i).$$

Each $\phi_{j,i}$ is parameterized as

$$\phi_{j,i}(x) = w_{j,i}\left(b(x) + \mathrm{spline}_{j,i}(x)\right),$$

where $b(x) = \mathrm{SiLU}(x) = x / (1 + e^{-x})$ is a fixed base, $w_{j,i}$ is a scalar, and

$$\mathrm{spline}_{j,i}(x) = \sum_{\ell} c^{(j,i)}_{\ell}\, B_{\ell}(x)$$

is a cubic B-spline expansion on a grid of $G$ points. Parameters per edge: $G + k$ spline coefficients plus one scalar $w_{j,i}$, with typically $k = 3$ (cubic).
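The definition can be sketched from scratch in numpy. This implements one edge in the Liu et al. parameterization, with the B-spline basis evaluated by the Cox-de Boor recursion; the grid range, initialization scale, and the half-open-interval handling at the right boundary are implementation choices, not from the paper.

```python
import numpy as np

def bspline_basis(x, t, k):
    """Evaluate all order-k B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, float)
    # Degree 0: indicator of each knot span (zero-width spans from repeated
    # boundary knots contribute nothing).
    B = np.array([np.where((x >= t[i]) & (x < t[i + 1]), 1.0, 0.0)
                  for i in range(len(t) - 1)])
    for d in range(1, k + 1):
        nxt = []
        for i in range(len(t) - d - 1):
            left = np.zeros_like(x) if t[i + d] == t[i] else \
                (x - t[i]) / (t[i + d] - t[i]) * B[i]
            right = np.zeros_like(x) if t[i + d + 1] == t[i + 1] else \
                (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
            nxt.append(left + right)
        B = np.array(nxt)
    return B  # shape (G + k, len(x)) for a clamped knot vector

def make_edge(G=5, k=3, lo=-1.0, hi=1.0, seed=0):
    """One KAN edge phi(x) = w * (SiLU(x) + spline(x)): G + k coefficients plus w."""
    rng = np.random.default_rng(seed)
    # Clamped knots: k repeated knots at each end of a G-interval grid.
    t = np.concatenate([np.full(k, lo), np.linspace(lo, hi, G + 1), np.full(k, hi)])
    c = rng.normal(scale=0.1, size=G + k)   # learnable spline coefficients
    w = 1.0                                  # learnable per-edge scalar
    silu = lambda x: x / (1.0 + np.exp(-x))
    return lambda x: w * (silu(x) + c @ bspline_basis(x, t, k))

phi = make_edge()
x = np.linspace(-1.0, 0.99, 7)   # stay inside the grid: x = hi falls outside the half-open spans
print(phi(x))
```

A full layer is then $n_{\text{in}} \cdot n_{\text{out}}$ of these edges summed per output node; real implementations vectorize the basis evaluation across all edges at once.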

Definition

KAN Network

A depth-$L$ KAN is the composition

$$\mathrm{KAN}(x) = \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1(x),$$

with widths $[n_0, n_1, \ldots, n_L]$. Total parameter count is approximately

$$P_{\mathrm{KAN}} = \sum_{\ell=1}^{L} n_\ell\, n_{\ell-1}\, (G + k + 1),$$

compared to $P_{\mathrm{MLP}} = \sum_\ell n_\ell\, n_{\ell-1} + n_\ell$ for an MLP of matched widths. For $G = 5$, $k = 3$, a KAN has roughly $9\times$ the parameters of an equal-width MLP.
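The two counts are easy to tabulate directly from the formulas above; the width list below is an arbitrary small example.

```python
def kan_params(widths, G=5, k=3):
    """G + k spline coefficients plus one scalar w per edge."""
    return sum(widths[l] * widths[l - 1] * (G + k + 1)
               for l in range(1, len(widths)))

def mlp_params(widths):
    """One weight per edge plus a bias per output unit."""
    return sum(widths[l] * widths[l - 1] + widths[l]
               for l in range(1, len(widths)))

widths = [2, 4, 1]
print(kan_params(widths), mlp_params(widths))   # 108 17
```

The per-edge factor is exactly $G + k + 1 = 9$ here; the full-network ratio comes out slightly under $9\times$ because the MLP also carries biases.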

The KAN Approximation Bound

Theorem

KAN Approximation Bound (Liu et al. 2024)

Statement

Let $f$ be expressible as an $L$-layer KAN whose component functions each lie in $C^{m+k+1}$. A KAN with the same architecture and order-$k$ B-splines on $G$-point grids achieves uniform approximation error

$$\|f - \widehat{f}_G\|_{L^\infty} \leq C \cdot G^{-k-1+m},$$

where $C$ depends on the smoothness norms of the component functions and on $L$, but not on the input dimension $n$ (Liu et al. 2024, arXiv:2404.19756, Theorem 2.1).

Intuition

Each edge is a 1D spline approximation, and 1D spline error on smooth targets is well understood: order-$k$ splines on a $G$-point grid achieve $O(G^{-k-1})$ error for $C^{k+1}$ targets (the $m$ shift accounts for higher derivatives). The key claim is that stacking these edges preserves the rate, because the composition only ever touches 1D functions.

Why It Matters

The bound is dimension-independent when the target admits a smooth KAN decomposition. MLP approximation bounds for general $C^s$ functions on $[0,1]^n$ take the form $O(N^{-s/n})$, which degrades badly in high dimension. If the KAN-compositional assumption holds, KANs sidestep that dimensional penalty.

Failure Mode

The assumption "admits a KAN representation with smooth component functions" is strong and not implied by KART. For a generic continuous $f$, the Kolmogorov inner functions are non-smooth; the bound does not apply. Empirically, the rate is observed on smooth scientific targets (Feynman physics formulas, PDE solutions) and is not a general statement about arbitrary functions. Liu et al. themselves frame the theorem as conditional on the target structure.

Writing $N$ for the parameter count (scaling linearly with $G$ at fixed width), the bound translates to $\|f - \widehat{f}_N\|_{L^\infty} = O(N^{-(k+1-m)})$. With cubic splines ($k = 3$) and $m = 0$, this is $O(N^{-4})$, substantially faster than the MLP rate for smooth functions in comparable dimension. This is the source of the paper's "KAN scaling exponent 4" claim. It holds only under the structural assumption stated.
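The 1D mechanism behind the rate can be checked numerically. The sketch below uses scipy's `CubicSpline` interpolant — interpolation rather than trained KAN edges, but the same $O(G^{-4})$ spline phenomenon: doubling the grid should cut the sup-norm error by about $2^4 = 16$ on a smooth target.

```python
import numpy as np
from scipy.interpolate import CubicSpline

f = np.sin
xs = np.linspace(0.0, np.pi, 2000)       # dense evaluation points

errs = []
for G in (10, 20, 40):
    grid = np.linspace(0.0, np.pi, G + 1)
    s = CubicSpline(grid, f(grid))       # cubic spline through the G+1 grid values
    errs.append(np.max(np.abs(s(xs) - f(xs))))

# Successive error ratios should sit near 2**4 = 16 for a C^4 target.
for e0, e1 in zip(errs, errs[1:]):
    print(e0 / e1)
```

Replacing `np.sin` with a merely Lipschitz target (e.g. `np.abs`) destroys the fourth-order ratio, which is the 1D shadow of the failure mode above.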

MLP vs. KAN

| Property | MLP | KAN |
| --- | --- | --- |
| Edge | Scalar weight | Univariate function (spline) |
| Node | Fixed activation then sum | Pure sum |
| Theoretical backing | Universal approximation (Cybenko, Hornik) | Kolmogorov-Arnold representation (1957) |
| Smoothness of representation | Composition of fixed smooth activations | Splines of chosen order $k$ |
| Parameters per edge | 1 | $G + k + 1$ (typically 8-12) |
| Interpretability of a single unit | Weight magnitude | Curve that can be plotted, compared, pruned |
| Training speed per step | Baseline | 5-20x slower (spline evaluation and gradient) |
| Evidence at LLM scale | Mature | No demonstrated win |
| Evidence on scientific targets | Strong baseline | Competitive or better at matched params |

Interpretability and Symbolic Distillation

The feature that most distinguishes KANs from MLPs in practice is the after-training pipeline. Each $\phi_{j,i}$ is a 1D curve. After sparse training, many edges become approximately zero and can be pruned. Surviving edges can be visually inspected, classified against a library of elementary functions ($x$, $x^2$, $\sin x$, $\exp x$, $\log x$, ...), and snapped to the nearest match. The result is a symbolic expression for the trained network.
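A bare-bones version of the snapping step might look like the following. The candidate library and the output-affine fit criterion are illustrative, not the pykan pipeline (which also searches over affine transforms of the input).

```python
import numpy as np

# Hypothetical candidate library of elementary shapes.
library = {
    "x":      lambda x: x,
    "x^2":    lambda x: x ** 2,
    "sin(x)": np.sin,
    "exp(x)": np.exp,
}

def snap(xs, ys):
    """Return the library entry g minimizing ||a*g(x) + b - ys||^2 over a, b."""
    best_name, best_mse = None, np.inf
    for name, g in library.items():
        A = np.stack([g(xs), np.ones_like(xs)], axis=1)
        theta, *_ = np.linalg.lstsq(A, ys, rcond=None)   # fit a, b by least squares
        mse = np.mean((A @ theta - ys) ** 2)
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name, best_mse

xs = np.linspace(-2.0, 2.0, 200)
ys = 3.0 * np.sin(xs) + 0.5          # pretend this curve is a trained edge
print(snap(xs, ys)[0])               # sin(x)
```

The failure mode flagged later in this section shows up here directly: when no library entry drives the residual near zero, the "snap" is a judgment call, not a recovery.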

Liu et al. demonstrate this on Feynman-I symbolic regression tasks: a KAN trained on the relativistic addition-of-velocities formula recovers a sparse network whose edges match $1 + x^2$, division, and square-root shapes, from which the analytical form is reconstructed. KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205) extends this pipeline with multiplication nodes and better pruning heuristics.

MLPs have no direct analog of this: you cannot plot a weight and read off what function it represents. Post-hoc mechanistic interpretability methods are required, and they recover circuits, not equations.

Variants and Extensions

The KAN idea has spawned a family of architectures that swap the per-edge basis or extend the layout. The core object (learnable univariate functions on edges, summation on nodes) is preserved in each.

  • FastKAN (Li, 2024, arXiv:2405.06721): Gaussian RBF basis as an approximation to B-splines; roughly 3x faster training at comparable accuracy.
  • ChebyKAN (SS et al., 2024, arXiv:2405.07200) and JacobiKAN: orthogonal polynomial bases on a fixed interval; avoid grid management and are sometimes more expressive per coefficient.
  • FourierKAN: sinusoidal basis per edge; the natural choice when the target is periodic or when spectral bias of the per-edge function matters.
  • Wav-KAN (Bozorgasl and Chen, 2024, arXiv:2405.12832): wavelet basis; handles multiresolution features better than a single-grid spline.
  • Convolutional KANs: replace conv filters with KAN-style learnable nonlinear filters; early results (Bodner et al., 2024, arXiv:2406.13155) are mixed and show modest gains over equivalent CNNs on small image tasks.
  • KAN-Transformer hybrids: swap MLP blocks inside transformers with KAN blocks. Most published runs report modest or no improvement on standard language benchmarks, at higher compute cost.
  • Temporal KANs for time-series forecasting: apply KAN layers inside recurrent or convolutional time-series architectures; the per-edge function learns a per-lag nonlinearity.
  • MultKAN / KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205): adds multiplication nodes alongside summation nodes, closing the gap with classical scientific-ML representations that need products.

Where KANs Sit in the Architecture Space

KANs are closer to MARS (multivariate adaptive regression splines) and classical basis-function expansions than they are to modern deep nets. The innovation is the layering and gradient-based training, not the use of splines. This makes KANs a spline-based regression architecture with deep-learning infrastructure attached, rather than a neural architecture with new representational primitives.

The honest reading of the empirical record: KANs provide a clean interpretability story and competitive numbers on smooth, low-dimensional, scientific tasks. They do not currently compete with transformers on the tasks that drive compute spending (language, vision, multimodal). The per-edge spline cost is a real constant factor that grows with GG, and the parameter count is inflated.

The Deeper Question KANs Raise

The dominance of the $y = \sigma(Wx + b)$ primitive is partly a historical accident from the Rumelhart-Hinton-Williams 1986 era and partly a genuine inductive bias that happens to suit high-dimensional discrete tasks: matrix multiplies are the operation that silicon and compilers are best at, and discrete tokens need content-addressable lookups more than they need smooth interpolation. There is probably room for architectures that move beyond the linear-then-nonlinearity atom as the unit of computation. KANs are one concrete exploration of that space. Learnable edges force a different question: what should the atomic operation be if you are willing to give up cheap matmul? Whether this particular exploration becomes foundational or stays niche is still being decided.

A useful frame: treat $y = \sigma(Wx + b)$ as a default that is extremely hard to beat on hardware-matched benchmarks, and treat alternatives like KANs as probes that expose which parts of that default are load-bearing. The interpretability property of KANs is not a side effect; it is the direct consequence of making each unit of computation a plottable 1D object. That alone is an argument for keeping them in the toolbox for scientific problems where the target is a physical law, not a token distribution.

Common Confusions

Watch Out

KANs are not universal because of KART

The Kolmogorov-Arnold theorem applies to continuous functions represented with non-smooth inner functions. Practical KANs use smooth B-splines, which cannot realize arbitrary KART representations. KANs are universal in a weaker, MLP-like sense: deep enough and wide enough, they approximate continuous functions. The KART connection is motivational, not a tight theoretical grounding.

Watch Out

The $N^{-4}$ scaling does not apply to arbitrary targets

The $O(N^{-4})$ rate in Liu et al. 2024 assumes the target admits a smooth KAN decomposition. For a generic $f$ on $[0,1]^n$ with only Lipschitz regularity, the rate degrades and the dimension independence is lost. Reports of "better scaling than MLPs" refer to specific smooth scientific benchmarks, not to language or vision.

Watch Out

Learnable activations are not new

Maxout (Goodfellow et al. 2013), adaptive piecewise-linear units (Agostinelli et al. 2015), PReLU (He et al. 2015), and Swish-β (Ramachandran et al. 2017) all allow some form of learned activation. What KANs add is edge-specificity (every edge has its own function) combined with a spline parameterization rich enough to represent arbitrary 1D shapes. The innovation is the placement and parameterization, not the idea of learning an activation.

Watch Out

KANs are slower, not faster

Per training step, a KAN evaluates $G + k$ spline basis functions per edge and their gradients, compared to one multiply-add per edge for an MLP. Implementations in late 2024 and 2025 (FastKAN, efficient-kan repos) narrowed the gap, but KANs remain 3-10x slower per forward-backward pass at equal width. The argument for KANs is parameter efficiency on specific tasks and interpretability, not raw speed.

Watch Out

Pruning to a symbolic form is not guaranteed

The symbolic distillation pipeline in Liu et al. 2024 works well on Feynman-I style formulas because those targets have sparse, elementary structure. For real data, post-training edges often do not snap cleanly to any elementary function, and the "symbolic expression" has to be rounded, truncated, or accepted as a spline. Interpretability is a property of the target, not the architecture.

Summary

  • KANs place learnable univariate functions on edges and plain summation on nodes, inverting the MLP layout.
  • The Kolmogorov-Arnold theorem motivates the architecture but does not justify it directly: classical KART inner functions are non-smooth, while KANs use smooth splines.
  • The approximation bound $O(N^{-4})$ for cubic splines assumes a smooth KAN decomposition of the target; it is a conditional result, not a general scaling law.
  • Wins: interpretability, symbolic distillation, competitive parameter efficiency on smooth scientific tasks.
  • Open: no demonstrated advantage at large-scale language, vision, or multimodal training as of April 2026; training is 3-10x slower than MLPs at equal width.
  • Treat KANs as a scientific-ML architecture with an interpretability story, not as a transformer-MLP replacement.

Exercises

ExerciseCore

Problem

State the Kolmogorov-Arnold representation theorem precisely for a continuous function $f: [0,1]^3 \to \mathbb{R}$. How many outer functions $\Phi_q$ are required, and how many total inner functions $\phi_{q,p}$ appear? Explain why this count does not imply the existence of an efficient learnable architecture.

ExerciseCore

Problem

Consider a KAN with widths $[2, 4, 1]$, cubic splines ($k = 3$) on grids of $G = 5$ points, and an MLP with the same widths and a fixed activation. Count the trainable parameters in each. Where does the gap come from?

ExerciseAdvanced

Problem

The KAN approximation theorem gives $O(G^{-k-1+m})$ error for smooth KAN-decomposable targets. Derive the corresponding rate in terms of total parameter count $N$ at fixed width, and compare to the MLP rate $O(N^{-s/n})$ for $C^s$ functions in dimension $n$. For what $(s, n)$ does an MLP match the KAN rate when $k = 3$, $m = 0$?

ExerciseResearch

Problem

Suppose you are advising a team that wants to train a KAN-based replacement for the MLP blocks inside a transformer LLM. List three specific technical obstacles that would need to be addressed, and for each, state whether current KAN variants (FastKAN, KAN 2.0, etc.) plausibly address it.

References

Canonical:

  • Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, "KAN: Kolmogorov-Arnold Networks" (April 2024, arXiv:2404.19756). Theorem 2.1 for the approximation bound; Sections 2.2-2.5 for the spline parameterization, grid extension, and symbolic distillation pipeline.
  • Kolmogorov, "On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition" (Doklady Akad. Nauk SSSR, 1957).
  • Arnold, "On functions of three variables" (Doklady Akad. Nauk SSSR, 1957). Completion of Hilbert's 13th problem in the continuous case.
  • Sprecher, "On the structure of continuous functions of several variables" (Trans. AMS, 1965). Universal-inner-function refinement of KART.

Classical critique of using KART for neural networks:

  • Girosi, Poggio, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant" (Neural Computation 1(4), 1989).
  • Hecht-Nielsen, "Kolmogorov's Mapping Neural Network Existence Theorem" (ICNN 1987). The original argument for KART relevance that Girosi-Poggio responded to.

Current (KAN variants and empirical evaluations):

  • Liu, Ma, Wang, Matusik, Tegmark, "KAN 2.0: Kolmogorov-Arnold Networks Meet Science" (August 2024, arXiv:2408.10205). MultKAN and improved symbolic fitting.
  • Yu, Yu, Wang, "KAN or MLP: A Fairer Comparison" (July 2024, arXiv:2407.16674). Head-to-head on vision, NLP, and scientific tasks at matched compute.
  • Li, "Kolmogorov-Arnold Networks are Radial Basis Function Networks" (May 2024, arXiv:2405.06721). FastKAN.
  • Sidharth SS, Keerthana AR, Gokul R, Anas KP, "Chebyshev Polynomial-Based Kolmogorov-Arnold Networks: An Efficient Architecture for Nonlinear Function Approximation" (May 2024, arXiv:2405.07200). ChebyKAN.
  • Bozorgasl, Chen, "Wav-KAN: Wavelet Kolmogorov-Arnold Networks" (May 2024, arXiv:2405.12832).

Background on splines and function approximation:

  • de Boor, A Practical Guide to Splines (Revised ed., Springer 2001), Chapters IX-XI for spline approximation rates.
  • DeVore, Lorentz, Constructive Approximation (Springer 1993), Chapter 13 for multivariate approximation lower bounds.

Where to Go Deeper

The original paper (Liu et al. 2024, arXiv:2404.19756) is readable and well-illustrated. Read it first, then read the critical-response papers (Yu, Yu, Wang 2024, arXiv:2407.16674 is the most useful starting point) to calibrate the claims against matched-compute MLP baselines. The pykan reference implementation on GitHub is the cleanest starting point for hands-on work, and a 2D function-fitting exercise on something like $f(x, y) = \exp(\sin(\pi x) + y^2)$ will build the intuition for what the spline edges actually learn in about thirty minutes. For a second pass, compare the symbolic-distillation pipeline in KAN 2.0 (arXiv:2408.10205) against the symbolic regression baselines PySR and AI Feynman to see where KAN wins, ties, or loses on a task that was already competitively served.

Next Topics

  • Physics-informed neural networks: a related scientific-ML architecture where KANs are being tested as drop-in replacements for MLP sub-networks.
  • Mechanistic interpretability: the post-hoc interpretability toolkit for standard networks. KANs offer a cleaner story on small models; mechanistic interpretability remains the only viable path at frontier scale.

Last reviewed: April 2026
