
Kolmogorov-Arnold Networks (KANs)

An alternative to MLPs where learnable univariate functions (typically B-splines) sit on edges and pure summation sits on nodes. Motivated by the Kolmogorov-Arnold representation theorem, competitive on small smooth scientific tasks, unproven at frontier scale.


Why This Matters

Every standard neural architecture since the late 1980s fixes the same design: linear weights on edges, fixed nonlinearities (ReLU, tanh, GELU) on nodes. Kolmogorov-Arnold Networks (Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, April 2024, arXiv:2404.19756) invert this layout. Each edge carries its own learnable univariate function, typically a B-spline; each node simply sums its inputs. The universal approximation backing is replaced by the Kolmogorov-Arnold representation theorem from 1957.

[Figure: three panels. Left, an MLP layer: weights on edges, fixed σ on nodes, computing $y_j = \sigma(\sum_i w_{ij} x_i + b_j)$. Center, a KAN layer: learnable $\phi_{ij}$ on edges, summation on nodes, computing $y_j = \sum_i \phi_{ij}(x_i)$. Right, an interactive single KAN edge with draggable knots, plotting $\phi(x) = w \cdot \mathrm{spline}(x) + w \cdot \mathrm{SiLU}(x)$ as a SiLU base plus a spline component.]

The MLP computes a fixed nonlinearity over a weighted sum. The KAN replaces both pieces. Every edge carries its own learnable univariate function $\phi_{ij}$ (a B-spline in the Liu et al. 2024 parameterization), and nodes just add. The right panel shows one such $\phi$, written as a scalar-weighted sum of a spline component and a SiLU base. The construction is grounded in the Kolmogorov-Arnold representation theorem: every continuous function of several variables decomposes into sums and compositions of univariate continuous functions.

KANs are worth understanding for two reasons. First, on small, smooth scientific targets (PDE solutions, symbolic regression, low-dimensional regression) they reach comparable or better accuracy than MLPs at matched parameter counts, and the spline coefficients are directly visualizable, which lets researchers extract closed-form approximations. Second, knowing their limits matters: they have not (as of April 2026) demonstrated a decisive win at language-model scale, and larger comparisons (Yu, Yu, Wang 2024, arXiv:2407.16674) find MLPs still dominate on standard vision and NLP benchmarks at equal compute. Treat KANs as a scientific-ML architecture with sharp interpretability properties, not as a replacement for the transformer MLP.

Mental Model

An MLP layer computes $y = \sigma(Wx + b)$: a linear combination, then a fixed activation. A KAN layer computes $y_j = \sum_i \phi_{j,i}(x_i)$: every input coordinate is passed through its own learnable 1D function, and the outputs are summed. The "weights" and "activations" trade places.

The consequence is that the nonlinearity is not a fixed global choice but a learned, edge-specific shape. A single KAN edge can be linear in one region, sinusoidal in another, and step-like in a third. After training, each $\phi_{j,i}$ is a curve you can plot. Contrast this with an MLP, where the only thing you can plot is a weight scalar.
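The trade of places can be made concrete in a few lines of numpy. This is a toy sketch, not the paper's parameterization: each $\phi_{j,i}$ below is a random cubic polynomial standing in for a B-spline, just to show the computation pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 3, 2
x = rng.normal(size=n_in)

# MLP layer: weighted sum on edges, one fixed nonlinearity on the node.
W, b = rng.normal(size=(n_out, n_in)), rng.normal(size=n_out)
y_mlp = np.tanh(W @ x + b)

# KAN layer: a learnable 1D function on every edge, plain summation on the node.
# Each phi_{j,i} here is a cubic polynomial (a stand-in for the B-spline edge).
coeffs = rng.normal(size=(n_out, n_in, 4))   # 4 polynomial coefficients per edge
y_kan = np.array([sum(np.polyval(coeffs[j, i], x[i]) for i in range(n_in))
                  for j in range(n_out)])

print(y_mlp.shape, y_kan.shape)   # (2,) (2,)
```

Note that the KAN layer has no shared nonlinearity anywhere: all shape comes from the per-edge functions, and the node is a bare sum.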

The Kolmogorov-Arnold Representation Theorem

Theorem

Kolmogorov-Arnold Representation (1957)

Statement

Every continuous function $f: [0,1]^n \to \mathbb{R}$ can be written as

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$

for some continuous outer functions $\Phi_q: \mathbb{R} \to \mathbb{R}$ and continuous inner functions $\phi_{q,p}: [0,1] \to \mathbb{R}$. The number of outer functions is $2n+1$, independent of $f$. The inner functions can be chosen universal: by Sprecher's 1965 refinement, $\phi_{q,p}(x) = \lambda_p\, \phi(x + qa)$ for a single continuous $\phi$ and constants $\lambda_p, a$, so only the outer functions $\Phi_q$ depend on $f$.

Intuition

Multivariate continuous functions reduce to sums and univariate continuous functions. No genuine $n$-ary operation is needed. This resolves Hilbert's 13th problem in the continuous case: any continuous function of many variables is a superposition of continuous functions of one variable and a single binary operation (addition).
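A degenerate but concrete instance of the reduction (far simpler than Kolmogorov's construction): multiplication, an apparently binary operation, is already a superposition of addition and univariate squaring,

```latex
xy = \frac{1}{4}\Bigl( (x + y)^2 - (x - y)^2 \Bigr)
```

Here the inner functions are $\pm$ the identity and the outer functions are $u \mapsto \pm u^2/4$. KART asserts that an analogous collapse, with $2n+1$ outer terms, exists for every continuous $f$, though the inner functions it produces are far wilder than these.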

Proof Sketch

Kolmogorov constructs the inner functions $\phi_{q,p}$ as monotone Hölder-continuous mappings that embed $[0,1]^n$ into $\mathbb{R}$ in a way that separates points well enough for any continuous $f$ to be recovered by suitable $\Phi_q$. Arnold, in closely related work, showed the construction can be made with inner functions independent of $f$, and Sprecher showed they can be derived from a single universal $\phi$. The outer functions $\Phi_q$ are then obtained by a uniform-approximation argument on a compact image set.

Why It Matters

The theorem says the "intrinsic" complexity of a continuous multivariate function lives in $2n+1$ univariate pieces. Any architecture that can fit arbitrary univariate functions and compose via summation is, in principle, universal.

Failure Mode

The inner functions $\phi_{q,p}$ are typically highly non-smooth. Even for a target $f$ with, say, Lipschitz regularity, the inner functions in the Kolmogorov construction can fail to be differentiable, and Girosi and Poggio (1989, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant," Neural Computation 1(4)) argued that this pathology makes the theorem a poor foundation for practical learning. KART guarantees the existence of a representation with continuous component functions; it does not guarantee that the representation is smooth, efficiently parameterizable, or learnable by gradient descent.

From the Theorem to the Architecture

The KAN paper does not claim that Kolmogorov-Arnold representations with smooth splines exist for arbitrary continuous $f$. It instead proposes a generalization: allow the outer and inner functions to be smooth (B-splines) and stack the two-layer structure to arbitrary depth. The resulting object is still a network of univariate edges and summation nodes, but it is no longer tied to the exact $2n+1$ structure of the original theorem.

Under this generalization, the target class is not all continuous functions but the class of functions that admit a smooth KAN decomposition. Practical evidence in Liu et al. 2024 suggests this class covers many scientific targets (PDE solutions, dynamical-system flows, physics formulas) but leaves open what it excludes.

The KAN Layer

Definition

KAN Layer

A KAN layer with input dimension $n_{\text{in}}$ and output dimension $n_{\text{out}}$ is a collection of learnable univariate functions $\phi_{j,i}: \mathbb{R} \to \mathbb{R}$ for $i \in \{1, \ldots, n_{\text{in}}\}$, $j \in \{1, \ldots, n_{\text{out}}\}$, with output

$$y_j = \sum_{i=1}^{n_{\text{in}}} \phi_{j,i}(x_i).$$

Each $\phi_{j,i}$ is parameterized as

$$\phi_{j,i}(x) = w_{j,i}\left(b(x) + \mathrm{spline}_{j,i}(x)\right),$$

where $b(x) = \mathrm{SiLU}(x) = x / (1 + e^{-x})$ is a fixed base, $w_{j,i}$ is a scalar, and

$$\mathrm{spline}_{j,i}(x) = \sum_{\ell} c^{(j,i)}_{\ell}\, B_{\ell}(x)$$

is a cubic B-spline expansion on a grid of $G$ points. Parameters per edge: $G + k$ spline coefficients plus one scalar $w_{j,i}$, with typically $k = 3$ (cubic).
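The definition can be sketched from scratch in numpy. This implements one edge in the Liu et al. parameterization, with the B-spline basis evaluated by the Cox-de Boor recursion; the grid range, initialization scale, and the half-open-interval handling at the right boundary are implementation choices, not from the paper.

```python
import numpy as np

def bspline_basis(x, t, k):
    """Evaluate all order-k B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, float)
    # Degree 0: indicator of each knot span (zero-width spans from repeated
    # boundary knots contribute nothing).
    B = np.array([np.where((x >= t[i]) & (x < t[i + 1]), 1.0, 0.0)
                  for i in range(len(t) - 1)])
    for d in range(1, k + 1):
        nxt = []
        for i in range(len(t) - d - 1):
            left = np.zeros_like(x) if t[i + d] == t[i] else \
                (x - t[i]) / (t[i + d] - t[i]) * B[i]
            right = np.zeros_like(x) if t[i + d + 1] == t[i + 1] else \
                (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
            nxt.append(left + right)
        B = np.array(nxt)
    return B  # shape (G + k, len(x)) for a clamped knot vector

def make_edge(G=5, k=3, lo=-1.0, hi=1.0, seed=0):
    """One KAN edge phi(x) = w * (SiLU(x) + spline(x)): G + k coefficients plus w."""
    rng = np.random.default_rng(seed)
    # Clamped knots: k repeated knots at each end of a G-interval grid.
    t = np.concatenate([np.full(k, lo), np.linspace(lo, hi, G + 1), np.full(k, hi)])
    c = rng.normal(scale=0.1, size=G + k)   # learnable spline coefficients
    w = 1.0                                  # learnable per-edge scalar
    silu = lambda x: x / (1.0 + np.exp(-x))
    return lambda x: w * (silu(x) + c @ bspline_basis(x, t, k))

phi = make_edge()
x = np.linspace(-1.0, 0.99, 7)   # stay inside the grid: x = hi falls outside the half-open spans
print(phi(x))
```

A full layer is then $n_{\text{in}} \cdot n_{\text{out}}$ of these edges summed per output node; real implementations vectorize the basis evaluation across all edges at once.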

Definition

KAN Network

A depth-$L$ KAN is the composition

$$\mathrm{KAN}(x) = \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1(x),$$

with widths $[n_0, n_1, \ldots, n_L]$. Total parameter count is approximately

$$P_{\mathrm{KAN}} = \sum_{\ell=1}^{L} n_\ell\, n_{\ell-1}\, (G + k + 1),$$

compared to $P_{\mathrm{MLP}} = \sum_\ell n_\ell\, n_{\ell-1} + n_\ell$ for an MLP of matched widths. For $G = 5$, $k = 3$, a KAN has roughly $9\times$ the parameters of an equal-width MLP.
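The two counts are easy to tabulate directly from the formulas above; the width list below is an arbitrary small example.

```python
def kan_params(widths, G=5, k=3):
    """G + k spline coefficients plus one scalar w per edge."""
    return sum(widths[l] * widths[l - 1] * (G + k + 1)
               for l in range(1, len(widths)))

def mlp_params(widths):
    """One weight per edge plus a bias per output unit."""
    return sum(widths[l] * widths[l - 1] + widths[l]
               for l in range(1, len(widths)))

widths = [2, 4, 1]
print(kan_params(widths), mlp_params(widths))   # 108 17
```

The per-edge factor is exactly $G + k + 1 = 9$ here; the full-network ratio comes out slightly under $9\times$ because the MLP also carries biases.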

The KAN Approximation Bound

Theorem

KAN Approximation Bound (Liu et al. 2024)

Statement

Let $f$ be expressible as an $L$-layer KAN whose component functions each lie in $C^{m+k+1}$. A KAN with the same architecture and order-$k$ B-splines on $G$-point grids achieves uniform approximation error

$$\|f - \widehat{f}_G\|_{L^\infty} \leq C \cdot G^{-k-1+m},$$

where $C$ depends on the smoothness norms of the component functions and on $L$, but not on the input dimension $n$ (Liu et al. 2024, arXiv:2404.19756, Theorem 2.1).

Intuition

Each edge is a 1D spline approximation, and 1D spline error on smooth targets is well understood: order-$k$ splines on a $G$-point grid achieve $O(G^{-k-1})$ error for $C^{k+1}$ targets (the $m$ shift accounts for higher derivatives). The key claim is that stacking these edges preserves the rate, because the composition only ever touches 1D functions.

Why It Matters

The bound is dimension-independent when the target admits a smooth KAN decomposition. MLP approximation bounds for general $C^s$ functions on $[0,1]^n$ take the form $O(N^{-s/n})$, which degrades badly in high dimension. If the KAN-compositional assumption holds, KANs sidestep that dimensional penalty.

Failure Mode

The assumption "admits a KAN representation with smooth component functions" is strong and not implied by KART. For a generic continuous $f$, the Kolmogorov inner functions are non-smooth; the bound does not apply. Empirically, the rate is observed on smooth scientific targets (Feynman physics formulas, PDE solutions) and is not a general statement about arbitrary functions. Liu et al. themselves frame the theorem as conditional on the target structure.

Writing $N$ for the parameter count (scaling linearly with $G$ at fixed width), the bound translates to $\|f - \widehat{f}_N\|_{L^\infty} = O(N^{-(k+1-m)})$. With cubic splines ($k = 3$) and $m = 0$, this is $O(N^{-4})$, substantially faster than the MLP rate for smooth functions in comparable dimension. This is the source of the paper's "KAN scaling exponent 4" claim. It holds only under the structural assumption stated.
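The 1D mechanism behind the rate can be checked numerically. The sketch below uses scipy's `CubicSpline` interpolant — interpolation rather than trained KAN edges, but the same $O(G^{-4})$ spline phenomenon: doubling the grid should cut the sup-norm error by about $2^4 = 16$ on a smooth target.

```python
import numpy as np
from scipy.interpolate import CubicSpline

f = np.sin
xs = np.linspace(0.0, np.pi, 2000)       # dense evaluation points

errs = []
for G in (10, 20, 40):
    grid = np.linspace(0.0, np.pi, G + 1)
    s = CubicSpline(grid, f(grid))       # cubic spline through the G+1 grid values
    errs.append(np.max(np.abs(s(xs) - f(xs))))

# Successive error ratios should sit near 2**4 = 16 for a C^4 target.
for e0, e1 in zip(errs, errs[1:]):
    print(e0 / e1)
```

Replacing `np.sin` with a merely Lipschitz target (e.g. `np.abs`) destroys the fourth-order ratio, which is the 1D shadow of the failure mode above.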

MLP vs. KAN

| Property | MLP | KAN |
| --- | --- | --- |
| Edge | Scalar weight | Univariate function (spline) |
| Node | Fixed activation then sum | Pure sum |
| Theoretical backing | Universal approximation (Cybenko, Hornik) | Kolmogorov-Arnold representation (1957) |
| Smoothness of representation | Composition of fixed smooth activations | Splines of chosen order $k$ |
| Parameters per edge | 1 | $G + k + 1$ (typically 8-12) |
| Interpretability of a single unit | Weight magnitude | Curve that can be plotted, compared, pruned |
| Training speed per step | Baseline | 5-20x slower (spline evaluation and gradient) |
| Evidence at LLM scale | Mature | No demonstrated win |
| Evidence on scientific targets | Strong baseline | Competitive or better at matched params |

Interpretability and Symbolic Distillation

The feature that most distinguishes KANs from MLPs in practice is the after-training pipeline. Each $\phi_{j,i}$ is a 1D curve. After sparse training, many edges become approximately zero and can be pruned. Surviving edges can be visually inspected, classified against a library of elementary functions ($x$, $x^2$, $\sin x$, $\exp x$, $\log x$, ...), and snapped to the nearest match. The result is a symbolic expression for the trained network.
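A bare-bones version of the snapping step might look like the following. The candidate library and the output-affine fit criterion are illustrative, not the pykan pipeline (which also searches over affine transforms of the input).

```python
import numpy as np

# Hypothetical candidate library of elementary shapes.
library = {
    "x":      lambda x: x,
    "x^2":    lambda x: x ** 2,
    "sin(x)": np.sin,
    "exp(x)": np.exp,
}

def snap(xs, ys):
    """Return the library entry g minimizing ||a*g(x) + b - ys||^2 over a, b."""
    best_name, best_mse = None, np.inf
    for name, g in library.items():
        A = np.stack([g(xs), np.ones_like(xs)], axis=1)
        theta, *_ = np.linalg.lstsq(A, ys, rcond=None)   # fit a, b by least squares
        mse = np.mean((A @ theta - ys) ** 2)
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name, best_mse

xs = np.linspace(-2.0, 2.0, 200)
ys = 3.0 * np.sin(xs) + 0.5          # pretend this curve is a trained edge
print(snap(xs, ys)[0])               # sin(x)
```

The failure mode flagged later in this section shows up here directly: when no library entry drives the residual near zero, the "snap" is a judgment call, not a recovery.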

Liu et al. demonstrate this on Feynman-I symbolic regression tasks: a KAN trained on the relativistic addition-of-velocities formula recovers a sparse network whose edges match $1 + x^2$, division, and square-root shapes, from which the analytical form is reconstructed. KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205) extends this pipeline with multiplication nodes and better pruning heuristics.

MLPs have no direct analog of this: you cannot plot a weight and read off what function it represents. Post-hoc mechanistic interpretability methods are required, and they recover circuits, not equations.

Variants and Extensions

The KAN idea has spawned a family of architectures that swap the per-edge basis or extend the layout. The core object (learnable univariate functions on edges, summation on nodes) is preserved in each.

  • FastKAN (Li, 2024, arXiv:2405.06721): Gaussian RBF basis as an approximation to B-splines; roughly 3x faster training at comparable accuracy.
  • ChebyKAN (SS et al., 2024, arXiv:2405.07200) and JacobiKAN: orthogonal polynomial bases on a fixed interval; avoid grid management and are sometimes more expressive per coefficient.
  • FourierKAN: sinusoidal basis per edge; the natural choice when the target is periodic or when spectral bias of the per-edge function matters.
  • Wav-KAN (Bozorgasl and Chen, 2024, arXiv:2405.12832): wavelet basis; handles multiresolution features better than a single-grid spline.
  • Convolutional KANs: replace conv filters with KAN-style learnable nonlinear filters; early results (Bodner et al., 2024, arXiv:2406.13155) are mixed and show modest gains over equivalent CNNs on small image tasks.
  • KAN-Transformer hybrids: swap MLP blocks inside transformers with KAN blocks. Most published runs report modest or no improvement on standard language benchmarks, at higher compute cost.
  • Temporal KANs for time-series forecasting: apply KAN layers inside recurrent or convolutional time-series architectures; the per-edge function learns a per-lag nonlinearity.
  • MultKAN / KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205): adds multiplication nodes alongside summation nodes, closing the gap with classical scientific-ML representations that need products.

Where KANs Sit in the Architecture Space

KANs are closer to MARS (multivariate adaptive regression splines) and classical basis-function expansions than they are to modern deep nets. The innovation is the layering and gradient-based training, not the use of splines. This makes KANs a spline-based regression architecture with deep-learning infrastructure attached, rather than a neural architecture with new representational primitives.

The honest reading of the empirical record: KANs provide a clean interpretability story and competitive numbers on smooth, low-dimensional, scientific tasks. They do not currently compete with transformers on the tasks that drive compute spending (language, vision, multimodal). The per-edge spline cost is a real constant factor that grows with GG, and the parameter count is inflated.

The Deeper Question KANs Raise

The dominance of the $y = \sigma(Wx + b)$ primitive is partly a historical accident from the Rumelhart-Hinton-Williams 1986 era and partly a genuine inductive bias that happens to suit high-dimensional discrete tasks: matrix multiplies are the operation that silicon and compilers are best at, and discrete tokens need content-addressable lookups more than they need smooth interpolation. There is probably room for architectures that move beyond the linear-then-nonlinearity atom as the unit of computation. KANs are one concrete exploration of that space. Learnable edges force a different question: what should the atomic operation be if you are willing to give up cheap matmul? Whether this particular exploration becomes foundational or stays niche is still being decided.

A useful frame: treat $y = \sigma(Wx + b)$ as a default that is extremely hard to beat on hardware-matched benchmarks, and treat alternatives like KANs as probes that expose which parts of that default are load-bearing. The interpretability property of KANs is not a side effect; it is the direct consequence of making each unit of computation a plottable 1D object. That alone is an argument for keeping them in the toolbox for scientific problems where the target is a physical law, not a token distribution.

Common Confusions

Watch Out

KANs are not universal because of KART

The Kolmogorov-Arnold theorem applies to continuous functions represented with non-smooth inner functions. Practical KANs use smooth B-splines, which cannot realize arbitrary KART representations. KANs are universal in a weaker, MLP-like sense: deep enough and wide enough, they approximate continuous functions. The KART connection is motivational, not a tight theoretical grounding.

Watch Out

The $N^{-4}$ scaling does not apply to arbitrary targets

The $O(N^{-4})$ rate in Liu et al. 2024 assumes the target admits a smooth KAN decomposition. For a generic $f$ on $[0,1]^n$ with only Lipschitz regularity, the rate degrades and the dimension independence is lost. Reports of "better scaling than MLPs" refer to specific smooth scientific benchmarks, not to language or vision.

Watch Out

Learnable activations are not new

Maxout (Goodfellow et al. 2013), adaptive piecewise-linear units (Agostinelli et al. 2015), PReLU (He et al. 2015), and Swish-β (Ramachandran et al. 2017) all allow some form of learned activation. What KANs add is edge-specificity (every edge has its own function) combined with a spline parameterization rich enough to represent arbitrary 1D shapes. The innovation is the placement and parameterization, not the idea of learning an activation.

Watch Out

KANs are slower, not faster

Per training step, a KAN evaluates $G + k$ spline basis functions per edge and their gradients, compared to one multiply-add per edge for an MLP. Implementations in late 2024 and 2025 (FastKAN, efficient-kan repos) narrowed the gap, but KANs remain 3-10x slower per forward-backward pass at equal width. The argument for KANs is parameter efficiency on specific tasks and interpretability, not raw speed.

Watch Out

Pruning to a symbolic form is not guaranteed

The symbolic distillation pipeline in Liu et al. 2024 works well on Feynman-I style formulas because those targets have sparse, elementary structure. For real data, post-training edges often do not snap cleanly to any elementary function, and the "symbolic expression" has to be rounded, truncated, or accepted as a spline. Interpretability is a property of the target, not the architecture.

Summary

  • KANs place learnable univariate functions on edges and plain summation on nodes, inverting the MLP layout.
  • The Kolmogorov-Arnold theorem motivates the architecture but does not justify it directly: classical KART inner functions are non-smooth, while KANs use smooth splines.
  • The approximation bound $O(N^{-4})$ for cubic splines assumes a smooth KAN decomposition of the target; it is a conditional result, not a general scaling law.
  • Wins: interpretability, symbolic distillation, competitive parameter efficiency on smooth scientific tasks.
  • Open: no demonstrated advantage at large-scale language, vision, or multimodal training as of April 2026; training is 3-10x slower than MLPs at equal width.
  • Treat KANs as a scientific-ML architecture with an interpretability story, not as a transformer-MLP replacement.

Exercises

ExerciseCore

Problem

State the Kolmogorov-Arnold representation theorem precisely for a continuous function $f: [0,1]^3 \to \mathbb{R}$. How many outer functions $\Phi_q$ are required, and how many total inner functions $\phi_{q,p}$ appear? Explain why this count does not imply the existence of an efficient learnable architecture.

ExerciseCore

Problem

Consider a KAN with widths $[2, 4, 1]$, cubic splines ($k = 3$) on grids of $G = 5$ points, and an MLP with the same widths and a fixed activation. Count the trainable parameters in each. Where does the gap come from?

ExerciseAdvanced

Problem

The KAN approximation theorem gives $O(G^{-k-1+m})$ error for smooth KAN-decomposable targets. Derive the corresponding rate in terms of total parameter count $N$ at fixed width, and compare to the MLP rate $O(N^{-s/n})$ for $C^s$ functions in dimension $n$. For what $(s, n)$ does an MLP match the KAN rate when $k = 3$, $m = 0$?

ExerciseResearch

Problem

Suppose you are advising a team that wants to train a KAN-based replacement for the MLP blocks inside a transformer LLM. List three specific technical obstacles that would need to be addressed, and for each, state whether current KAN variants (FastKAN, KAN 2.0, etc.) plausibly address it.

References

Canonical:

  • Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, "KAN: Kolmogorov-Arnold Networks" (April 2024, arXiv:2404.19756). Theorem 2.1 for the approximation bound; Sections 2.2-2.5 for the spline parameterization, grid extension, and symbolic distillation pipeline.
  • Kolmogorov, "On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition" (Doklady Akad. Nauk SSSR, 1957).
  • Arnold, "On functions of three variables" (Doklady Akad. Nauk SSSR, 1957). Completion of Hilbert's 13th problem in the continuous case.
  • Sprecher, "On the structure of continuous functions of several variables" (Trans. AMS, 1965). Universal-inner-function refinement of KART.

Classical critique of using KART for neural networks:

  • Girosi, Poggio, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant" (Neural Computation 1(4), 1989).
  • Hecht-Nielsen, "Kolmogorov's Mapping Neural Network Existence Theorem" (ICNN 1987). The original argument for KART relevance that Girosi-Poggio responded to.

Current (KAN variants and empirical evaluations):

  • Liu, Ma, Wang, Matusik, Tegmark, "KAN 2.0: Kolmogorov-Arnold Networks Meet Science" (August 2024, arXiv:2408.10205). MultKAN and improved symbolic fitting.
  • Yu, Yu, Wang, "KAN or MLP: A Fairer Comparison" (July 2024, arXiv:2407.16674). Head-to-head on vision, NLP, and scientific tasks at matched compute.
  • Li, "Kolmogorov-Arnold Networks are Radial Basis Function Networks" (May 2024, arXiv:2405.06721). FastKAN.
  • Sidharth SS, Keerthana AR, Gokul R, Anas KP, "Chebyshev Polynomial-Based Kolmogorov-Arnold Networks: An Efficient Architecture for Nonlinear Function Approximation" (May 2024, arXiv:2405.07200). ChebyKAN.
  • Bozorgasl, Chen, "Wav-KAN: Wavelet Kolmogorov-Arnold Networks" (May 2024, arXiv:2405.12832).

Background on splines and function approximation:

  • de Boor, A Practical Guide to Splines (Revised ed., Springer 2001), Chapters IX-XI for spline approximation rates.
  • DeVore, Lorentz, Constructive Approximation (Springer 1993), Chapter 13 for multivariate approximation lower bounds.

Where to Go Deeper

The original paper (Liu et al. 2024, arXiv:2404.19756) is readable and well-illustrated. Read it first, then read the critical-response papers (Yu, Yu, Wang 2024, arXiv:2407.16674 is the most useful starting point) to calibrate the claims against matched-compute MLP baselines. The pykan reference implementation on GitHub is the cleanest starting point for hands-on work, and a 2D function-fitting exercise on something like $f(x, y) = \exp(\sin(\pi x) + y^2)$ will build the intuition for what the spline edges actually learn in about thirty minutes. For a second pass, compare the symbolic-distillation pipeline in KAN 2.0 (arXiv:2408.10205) against the symbolic regression baselines PySR and AI Feynman to see where KAN wins, ties, or loses on a task that was already competitively served.

Next Topics

  • Physics-informed neural networks: a related scientific-ML architecture where KANs are being tested as drop-in replacements for MLP sub-networks.
  • Mechanistic interpretability: the post-hoc interpretability toolkit for standard networks. KANs offer a cleaner story on small models; mechanistic interpretability remains the only viable path at frontier scale.

Last reviewed: April 2026
