Beyond LLMs
Kolmogorov-Arnold Networks (KANs)
An alternative to MLPs where learnable univariate functions (typically B-splines) sit on edges and pure summation sits on nodes. Motivated by the Kolmogorov-Arnold representation theorem, competitive on small smooth scientific tasks, unproven at frontier scale.
Why This Matters
Every standard neural architecture since the late 1980s fixes the same design: linear weights on edges, fixed nonlinearities (ReLU, tanh, GELU) on nodes. Kolmogorov-Arnold Networks (Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, April 2024, arXiv:2404.19756) invert this layout. Each edge carries its own learnable univariate function, typically a B-spline; each node simply sums its inputs. The universal approximation backing is replaced by the Kolmogorov-Arnold representation theorem from 1957.
The MLP computes a fixed nonlinearity over a weighted sum. The KAN replaces both. Every edge carries its own learnable univariate function φᵢⱼ (a B-spline in the Liu et al. 2024 parameterization), and nodes just add. The right panel shows one such φ, written as a scalar-weighted sum of a spline component and a SiLU base. Grounded in the Kolmogorov-Arnold representation theorem: every continuous function of several variables decomposes into sums of univariate functions.
KANs are worth understanding for two reasons. On small, smooth scientific targets (PDE solutions, symbolic regression, low-dimensional regression) they reach comparable or better accuracy than MLPs at matched parameter counts, and the spline coefficients are directly visualizable, which lets researchers extract closed-form approximations. They have also not (as of April 2026) demonstrated a decisive win at language-model scale: larger comparisons (Yu, Yu, Wang 2024, arXiv:2407.16674) find MLPs still dominate on standard vision and NLP benchmarks at equal compute. Treat KANs as a scientific-ML architecture with sharp interpretability properties, not as a replacement for the transformer MLP.
Mental Model
An MLP layer computes σ(Wx + b): a linear combination, then a fixed activation. A KAN layer computes x′ⱼ = Σᵢ φᵢⱼ(xᵢ): every input coordinate is passed through its own learnable 1D function, and the outputs are summed. The "weights" and "activations" trade places.
The consequence is that the nonlinearity is not a fixed global choice but a learned, edge-specific shape. A single KAN edge can be linear in one region, sinusoidal in another, and step-like in a third. After training, each φᵢⱼ is a curve you can plot. Contrast this with an MLP, where the only thing you can plot is a weight scalar.
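The swap can be made concrete in a few lines of NumPy. This is an illustrative sketch, not the Liu et al. parameterization: the edge functions below are hand-picked closed forms standing in for trained splines.

```python
import numpy as np

def mlp_layer(x, W, b):
    # MLP: scalar weights on edges, one fixed nonlinearity on nodes.
    return np.tanh(W @ x + b)

def kan_layer(x, edge_fns):
    # KAN: a learnable 1D function on every edge, plain summation on nodes.
    # edge_fns[j][i] is the function on the edge from input i to node j.
    return np.array([sum(f(x[i]) for i, f in enumerate(row)) for row in edge_fns])

# Toy 2-in / 3-out layer. The shapes are hypothetical stand-ins for
# trained splines, chosen only to show per-edge diversity.
edge_fns = [
    [np.sin, np.square],
    [np.abs, np.tanh],
    [lambda t: t, np.exp],
]
x = np.array([0.5, -0.2])
print(kan_layer(x, edge_fns))
print(mlp_layer(x, np.ones((3, 2)), np.zeros(3)))
```

Note that the KAN layer has no weight matrix at all: every parameter lives inside a 1D function.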
The Kolmogorov-Arnold Representation Theorem
Kolmogorov-Arnold Representation (1957)
Statement
Every continuous function f: [0,1]ⁿ → ℝ can be written as
f(x₁, …, xₙ) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} φ_{q,p}(x_p) )
for some continuous outer functions Φ_q and continuous inner functions φ_{q,p}. The number of outer functions is 2n+1, independent of f. The inner functions can be chosen universal: by Sprecher's 1965 refinement, φ_{q,p}(x_p) = λ_p φ(x_p + qa) for a single continuous φ and constants λ_p, a, so only the outer functions depend on f.
Intuition
Multivariate continuous functions reduce to sums and univariate continuous functions. No genuine n-ary operation is needed. This resolves Hilbert's 13th problem in the continuous case: any continuous function of many variables is a superposition of continuous functions of one variable and a single binary operation (addition).
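A concrete instance of the reduction (a standard textbook illustration, not from the KAN paper): multiplication, the archetypal binary operation, is already a superposition of univariate functions and addition.

```latex
xy = \exp\bigl(\ln x + \ln y\bigr) \quad (x, y > 0),
\qquad
xy = \tfrac{1}{4}(x+y)^2 - \tfrac{1}{4}(x-y)^2 \quad (\text{all } x, y).
```

In the second identity, subtraction is addition composed with the univariate map t ↦ −t, and squaring and scaling by ¼ are univariate, so addition is the only multi-argument operation used.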
Proof Sketch
Kolmogorov constructs the inner functions φ_{q,p} as monotone Hölder-continuous mappings that embed [0,1]ⁿ into ℝ in a way that separates points well enough for any continuous f to be recovered by suitable Φ_q. Arnold, building on work by both, showed the construction can be made with inner functions independent of f, and Sprecher showed they can be derived from a single universal φ. The outer functions Φ_q are then obtained by a uniform-approximation argument on a compact image set.
Why It Matters
The theorem says the "intrinsic" complexity of a continuous multivariate function lives in univariate pieces. Any architecture that can fit arbitrary univariate functions and compose via summation is, in principle, universal.
Failure Mode
The inner functions are typically highly non-smooth. For a target f with, say, Lipschitz regularity, the inner functions in the Kolmogorov construction can still fail to be differentiable, and Girosi and Poggio (1989, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant," Neural Computation 1(4)) argued that this pathology makes the theorem a poor foundation for practical learning. KART guarantees existence of a representation in continuous functions; it does not guarantee that the representation is smooth, efficiently parameterizable, or learnable by gradient descent.
From the Theorem to the Architecture
The KAN paper does not claim that Kolmogorov-Arnold representations with smooth splines exist for arbitrary continuous f. It instead proposes a generalization: allow the outer and inner functions to be smooth (B-splines) and stack the two-layer structure to arbitrary depth. The resulting object is still a network of univariate edges and summation nodes, but it is no longer tied to the exact structure of the original theorem.
Under this generalization, the target class is not all continuous functions but the class of functions that admit a smooth KAN decomposition. Practical evidence in Liu et al. 2024 suggests this class covers many scientific targets (PDE solutions, dynamical-system flows, physics formulas) but leaves open what it excludes.
The KAN Layer
KAN Layer
A KAN layer with input dimension n_in and output dimension n_out is a collection of learnable univariate functions φᵢⱼ for i = 1, …, n_in, j = 1, …, n_out, with output
x′ⱼ = Σ_{i=1}^{n_in} φᵢⱼ(xᵢ).
Each φᵢⱼ is parameterized as
φ(x) = w (b(x) + spline(x)),
where b(x) = silu(x) = x / (1 + e^{−x}) is a fixed base, w is a scalar, and
spline(x) = Σ_{m=1}^{G+k} c_m B_m(x)
is an order-k B-spline expansion on a grid with G intervals. Parameters per edge: G + k spline coefficients plus one scalar w, typically k = 3 (cubic).
KAN Network
A depth-L KAN is the composition
KAN(x) = (Φ_L ∘ Φ_{L−1} ∘ ⋯ ∘ Φ_1)(x)
with widths [n₀, n₁, …, n_L]. Total parameter count is approximately
Σ_{l=0}^{L−1} n_l n_{l+1} (G + k + 1),
compared to Σ_{l=0}^{L−1} n_l n_{l+1} for an MLP of matched widths. For G = 5, k = 3, a KAN has roughly (G + k + 1) = 9 times the parameters of an equal-width MLP.
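The counts above can be checked mechanically; the widths and hyperparameters below are arbitrary examples.

```python
def kan_params(widths, G=5, k=3):
    # Each edge carries G + k spline coefficients plus one scalar w.
    per_edge = G + k + 1
    return sum(a * b for a, b in zip(widths, widths[1:])) * per_edge

def mlp_params(widths, biases=True):
    # Each edge carries one scalar weight, plus an optional per-node bias.
    return sum(a * b + (b if biases else 0) for a, b in zip(widths, widths[1:]))

widths = [2, 5, 1]
print(kan_params(widths), mlp_params(widths))  # 135 vs 21
```

The gap is exactly the per-edge factor G + k + 1, which is why matched-parameter comparisons give the KAN a much narrower network.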
The KAN Approximation Bound
KAN Approximation Bound (Liu et al. 2024)
Statement
Let f be expressible as an L-layer KAN whose component functions each lie in C^{k+1}. A KAN with the same architecture and order-k B-splines on G-point grids achieves uniform approximation error
‖f − KAN_G‖_∞ ≤ C G^{−(k+1)},
where C depends on the smoothness norms of the component functions and on k, but not on the input dimension n (Liu et al. 2024, arXiv:2404.19756, Theorem 2.1).
Intuition
Each edge is a 1D spline approximation, and 1D spline error on smooth targets is well understood: order-k splines on a G-point grid achieve error O(G^{−(k+1)}) for C^{k+1} targets (the k+1 exponent reflects the extra derivative of smoothness assumed). The key claim is that stacking these edges preserves the rate, because the composition only ever touches 1D functions.
Why It Matters
The bound is dimension-independent when the target admits a smooth KAN decomposition. MLP approximation bounds for general C^k functions on [0,1]ⁿ take the form O(N^{−k/n}), which degrades badly in high dimension. If the KAN-compositional assumption holds, KANs sidestep that dimensional penalty.
Failure Mode
The assumption "admits a KAN representation with smooth component functions" is strong and not implied by KART. For a generic continuous f, the Kolmogorov inner functions are non-smooth; the bound does not apply. Empirically, the predicted rate is observed on smooth scientific targets (Feynman physics formulas, PDE solutions) and is not a general statement about arbitrary functions. Liu et al. themselves frame the theorem as conditional on the target structure.
Writing N for the parameter count (scaling linearly with G at fixed width and depth), the bound translates to ε ∝ N^{−(k+1)}. With cubic splines (k = 3) this is N^{−4}, substantially faster than the MLP rate N^{−k/n} for smooth functions in comparable dimension. This is the source of the paper's "KAN scaling exponent 4" claim. It holds only under the structural assumption stated.
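The 1D rate underlying this arithmetic is easy to verify numerically: cubic spline interpolation of a smooth function on a G-point grid should lose roughly 2⁴ = 16x error per doubling of G. A sketch using scipy.interpolate.make_interp_spline (interpolation rather than the least-squares fitting used in training, so only the rate is comparable):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

f = np.sin                              # smooth 1D target
xs = np.linspace(0.0, np.pi, 2001)      # dense evaluation grid

def sup_err(G, k=3):
    # Interpolate f on a G-point grid with an order-k spline,
    # then measure the sup-norm error on the dense grid.
    grid = np.linspace(0.0, np.pi, G)
    s = make_interp_spline(grid, f(grid), k=k)
    return float(np.max(np.abs(s(xs) - f(xs))))

for G in (10, 20, 40, 80):
    print(G, sup_err(G))
# Each doubling of G should cut the error by roughly 2**4 = 16.
```

The KAN claim is that this 1D rate survives composition; the MLP has no analogous dimension-free 1D building block.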
MLP vs. KAN
| Property | MLP | KAN |
|---|---|---|
| Edge | Scalar weight | Univariate function (spline) |
| Node | Fixed activation then sum | Pure sum |
| Theoretical backing | Universal approximation (Cybenko, Hornik) | Kolmogorov-Arnold representation (1957) |
| Smoothness of representation | Composition of fixed smooth activations | Splines of chosen order |
| Parameters per edge | 1 | G + k + 1 (typically 8-12) |
| Interpretability of a single unit | Weight magnitude | Curve that can be plotted, compared, pruned |
| Training speed per step | Baseline | 5-20x slower (spline evaluation and gradient) |
| Evidence at LLM scale | Mature | No demonstrated win |
| Evidence on scientific targets | Strong baseline | Competitive or better at matched params |
Interpretability and Symbolic Distillation
The feature that most distinguishes KANs from MLPs in practice is the after-training pipeline. Each φᵢⱼ is a 1D curve. After sparse training, many edges become approximately zero and can be pruned. Surviving edges can be visually inspected, classified against a library of elementary functions (x², sin, exp, log, tanh, ...), and snapped to the nearest match. The result is a symbolic expression for the trained network.
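The snapping step can be sketched as a least-squares template match: for each candidate g in the library, fit c·g(ax + b) + d to samples of the edge curve and keep the lowest residual. This is a simplified stand-in for the paper's symbolic-fitting pipeline; the library, affine template, and scoring below are illustrative choices.

```python
import numpy as np
from scipy.optimize import curve_fit

# Candidate library of elementary shapes (a small illustrative subset).
LIBRARY = {
    "sin": np.sin,
    "exp": np.exp,
    "square": np.square,
    "tanh": np.tanh,
}

def snap(xs, ys):
    """Fit c * g(a*x + b) + d for each library entry g; return the best match."""
    best = None
    for name, g in LIBRARY.items():
        def model(x, a, b, c, d, g=g):
            return c * g(a * x + b) + d
        try:
            p, _ = curve_fit(model, xs, ys, p0=[1.0, 0.0, 1.0, 0.0], maxfev=5000)
        except (RuntimeError, ValueError):
            continue  # this candidate failed to converge
        resid = float(np.mean((model(xs, *p) - ys) ** 2))
        if best is None or resid < best[1]:
            best = (name, resid, p)
    return best

# A "trained edge" that is secretly a scaled sine, standing in for a spline curve.
xs = np.linspace(-2.0, 2.0, 200)
ys = 0.7 * np.sin(1.3 * xs + 0.2) - 0.1
print(snap(xs, ys)[0])
```

Real spline edges rarely match this cleanly; the failure mode discussed under "Common Confusions" below is exactly the case where no library entry fits.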
Liu et al. demonstrate this on Feynman-I symbolic regression tasks: a KAN trained on the relativistic addition-of-velocities formula recovers a sparse network whose edges match simple elementary shapes (division- and square-root-like curves), from which the analytical form is reconstructed. KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205) extends this pipeline with multiplication nodes and better pruning heuristics.
MLPs have no direct analog of this: you cannot plot a weight and read off what function it represents. Post-hoc mechanistic interpretability methods are required, and they recover circuits, not equations.
Variants and Extensions
The KAN idea has spawned a family of architectures that swap the per-edge basis or extend the layout. The core object (learnable univariate functions on edges, summation on nodes) is preserved in each.
- FastKAN (Li, 2024, arXiv:2405.06721): Gaussian RBF basis as an approximation to B-splines; roughly 3x faster training at comparable accuracy.
- ChebyKAN (SS et al., 2024, arXiv:2405.07200) and JacobiKAN: orthogonal polynomial bases on a fixed interval; avoid grid management and are sometimes more expressive per coefficient.
- FourierKAN: sinusoidal basis per edge; the natural choice when the target is periodic or when spectral bias of the per-edge function matters.
- Wav-KAN (Bozorgasl and Chen, 2024, arXiv:2405.12832): wavelet basis; handles multiresolution features better than a single-grid spline.
- Convolutional KANs: replace conv filters with KAN-style learnable nonlinear filters; early results (Bodner et al., 2024, arXiv:2406.13155) are mixed, showing at most modest gains over equivalent CNNs on small image tasks.
- KAN-Transformer hybrids: swap MLP blocks inside transformers with KAN blocks. Most published runs report modest or no improvement on standard language benchmarks, at higher compute cost.
- Temporal KANs for time-series forecasting: apply KAN layers inside recurrent or convolutional time-series architectures; the per-edge function learns a per-lag nonlinearity.
- MultKAN / KAN 2.0 (Liu et al., August 2024, arXiv:2408.10205): adds multiplication nodes alongside summation nodes, closing the gap with classical scientific-ML representations that need products.
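Most of these variants differ only in the per-edge basis. A ChebyKAN-style edge can be sketched with NumPy's Chebyshev helpers; the tanh input squashing is a common convention in such implementations, assumed here rather than taken from any one repo.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def cheby_edge(coeffs):
    """ChebyKAN-style edge: phi(x) = sum_m c_m T_m(tanh(x)).

    tanh squashes inputs into [-1, 1], the natural Chebyshev domain;
    the coefficients c_m play the role of spline coefficients in a
    vanilla KAN edge, with no grid to manage.
    """
    return lambda x: cheb.chebval(np.tanh(x), coeffs)

# Degree-2 edge: phi = T_1 + 0.5 * T_2, applied to tanh(x).
phi = cheby_edge(np.array([0.0, 1.0, 0.5]))
print(phi(np.array([-2.0, 0.0, 2.0])))
```

Swapping the basis changes the inductive bias of each edge (global polynomials vs. local splines vs. periodic sinusoids) while keeping the edge-function-plus-summation layout intact.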
Where KANs Sit in the Architecture Space
KANs are closer to MARS (multivariate adaptive regression splines) and classical basis-function expansions than they are to modern deep nets. The innovation is the layering and gradient-based training, not the use of splines. This makes KANs a spline-based regression architecture with deep-learning infrastructure attached, rather than a neural architecture with new representational primitives.
The honest reading of the empirical record: KANs provide a clean interpretability story and competitive numbers on smooth, low-dimensional, scientific tasks. They do not currently compete with transformers on the tasks that drive compute spending (language, vision, multimodal). The per-edge spline cost is a real constant factor that grows with , and the parameter count is inflated.
The Deeper Question KANs Raise
The dominance of the σ(Wx + b) primitive is partly a historical accident from the Rumelhart-Hinton-Williams 1986 era and partly a genuine inductive bias that happens to suit high-dimensional discrete tasks: matrix multiplies are the operation that silicon and compilers are best at, and discrete tokens need content-addressable lookups more than they need smooth interpolation. There is probably room for architectures that move beyond the linear-then-nonlinearity atom as the unit of computation. KANs are one concrete exploration of that space. Learnable edges force a different question: what should the atomic operation be if you are willing to give up cheap matmul? Whether this particular exploration becomes foundational or stays niche is still being decided.
A useful frame: treat σ(Wx + b) as a default that is extremely hard to beat on hardware-matched benchmarks, and treat alternatives like KANs as probes that expose which parts of that default are load-bearing. The interpretability property of KANs is not a side effect; it is the direct consequence of making each unit of computation a plottable 1D object. That alone is an argument for keeping them in the toolbox for scientific problems where the target is a physical law, not a token distribution.
Common Confusions
KANs are not universal because of KART
The Kolmogorov-Arnold theorem applies to continuous functions represented with non-smooth inner functions. Practical KANs use smooth B-splines, which cannot realize arbitrary KART representations. KANs are universal in a weaker, MLP-like sense: deep enough and wide enough, they approximate continuous functions. The KART connection is motivational, not a tight theoretical grounding.
The N^{-4} scaling does not apply to arbitrary targets
The N^{-4} rate in Liu et al. 2024 assumes the target admits a smooth KAN decomposition. For a generic f on [0,1]ⁿ with only Lipschitz regularity, the rate degrades and the dimension independence is lost. Reports of "better scaling than MLPs" refer to specific smooth scientific benchmarks, not to language or vision.
Learnable activations are not new
Maxout (Goodfellow et al. 2013), adaptive piecewise-linear units (Agostinelli et al. 2015), PReLU (He et al. 2015), and Swish-β (Ramachandran et al. 2017) all allow some form of learned activation. What KANs add is edge-specificity (every edge has its own function) combined with a spline parameterization rich enough to represent arbitrary 1D shapes. The innovation is the placement and parameterization, not the idea of learning an activation.
KANs are slower, not faster
Per training step, a KAN evaluates G + k spline basis functions per edge and their gradients, compared to one multiply-add per edge for an MLP. Implementations in late 2024 and 2025 (FastKAN, efficient-kan repos) narrowed the gap, but KANs remain 3-10x slower per forward-backward pass at equal width. The argument for KANs is parameter efficiency on specific tasks and interpretability, not raw speed.
Pruning to a symbolic form is not guaranteed
The symbolic distillation pipeline in Liu et al. 2024 works well on Feynman-I style formulas because those targets have sparse, elementary structure. For real data, post-training edges often do not snap cleanly to any elementary function, and the "symbolic expression" has to be rounded, truncated, or accepted as a spline. Interpretability is a property of the target, not the architecture.
Summary
- KANs place learnable univariate functions on edges and plain summation on nodes, inverting the MLP layout.
- The Kolmogorov-Arnold theorem motivates the architecture but does not justify it directly: classical KART inner functions are non-smooth, while KANs use smooth splines.
- The N^{-4} approximation bound for cubic splines assumes a smooth KAN decomposition of the target; it is a conditional result, not a general scaling law.
- Wins: interpretability, symbolic distillation, competitive parameter efficiency on smooth scientific tasks.
- Open: no demonstrated advantage at large-scale language, vision, or multimodal training as of April 2026; training is 3-10x slower than MLPs at equal width.
- Treat KANs as a scientific-ML architecture with an interpretability story, not as a transformer-MLP replacement.
Exercises
Problem
State the Kolmogorov-Arnold representation theorem precisely for a continuous function f: [0,1]ⁿ → ℝ. How many outer functions are required, and how many total inner functions appear? Explain why this count does not imply the existence of an efficient learnable architecture.
Problem
Consider a KAN with widths [n₀, n₁, n₂], cubic splines (k = 3) on grids with G intervals, and an MLP with the same widths and a fixed activation. Count the trainable parameters in each. Where does the gap come from?
Problem
The KAN approximation theorem gives error O(G^{-4}) for smooth KAN-decomposable targets with cubic splines. Derive the corresponding rate in terms of total parameter count N at fixed width, and compare to the MLP rate O(N^{-k/n}) for C^k functions in dimension n. For what n does an MLP match the KAN rate when k = 4?
Problem
Suppose you are advising a team that wants to train a KAN-based replacement for the MLP blocks inside a transformer LLM. List three specific technical obstacles that would need to be addressed, and for each, state whether current KAN variants (FastKAN, KAN 2.0, etc.) plausibly address it.
References
Canonical:
- Liu, Wang, Vaidya, Ruehle, Halverson, Soljačić, Hou, Tegmark, "KAN: Kolmogorov-Arnold Networks" (April 2024, arXiv:2404.19756). Theorem 2.1 for the approximation bound; Sections 2.2-2.5 for the spline parameterization, grid extension, and symbolic distillation pipeline.
- Kolmogorov, "On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition" (Doklady Akad. Nauk SSSR, 1957).
- Arnold, "On functions of three variables" (Doklady Akad. Nauk SSSR, 1957). Completion of Hilbert's 13th problem in the continuous case.
- Sprecher, "On the structure of continuous functions of several variables" (Trans. AMS, 1965). Universal-inner-function refinement of KART.
Classical critique of using KART for neural networks:
- Girosi, Poggio, "Representation Properties of Networks: Kolmogorov's Theorem is Irrelevant" (Neural Computation 1(4), 1989).
- Hecht-Nielsen, "Kolmogorov's Mapping Neural Network Existence Theorem" (ICNN 1987). The original argument for KART relevance that Girosi-Poggio responded to.
Current (KAN variants and empirical evaluations):
- Liu, Ma, Wang, Matusik, Tegmark, "KAN 2.0: Kolmogorov-Arnold Networks Meet Science" (August 2024, arXiv:2408.10205). MultKAN and improved symbolic fitting.
- Yu, Yu, Wang, "KAN or MLP: A Fairer Comparison" (July 2024, arXiv:2407.16674). Head-to-head on vision, NLP, and scientific tasks at matched compute.
- Li, "Kolmogorov-Arnold Networks are Radial Basis Function Networks" (May 2024, arXiv:2405.06721). FastKAN.
- Sidharth SS, Keerthana AR, Gokul R, Anas KP, "Chebyshev Polynomial-Based Kolmogorov-Arnold Networks: An Efficient Architecture for Nonlinear Function Approximation" (May 2024, arXiv:2405.07200). ChebyKAN.
- Bozorgasl, Chen, "Wav-KAN: Wavelet Kolmogorov-Arnold Networks" (May 2024, arXiv:2405.12832).
Background on splines and function approximation:
- de Boor, A Practical Guide to Splines (Revised ed., Springer 2001), Chapters IX-XI for spline approximation rates.
- DeVore, Lorentz, Constructive Approximation (Springer 1993), Chapter 13 for multivariate approximation lower bounds.
Where to Go Deeper
The original paper (Liu et al. 2024, arXiv:2404.19756) is readable and well-illustrated. Read it first, then read the critical-response papers (Yu, Yu, Wang 2024, arXiv:2407.16674 is the most useful starting point) to calibrate the claims against matched-compute MLP baselines. The pykan reference implementation on GitHub is the cleanest starting point for hands-on work, and a 2D function-fitting exercise on something like f(x, y) = exp(sin(πx) + y²) will build the intuition for what the spline edges actually learn in about thirty minutes. For a second pass, compare the symbolic-distillation pipeline in KAN 2.0 (arXiv:2408.10205) against the symbolic regression baselines PySR and AI Feynman to see where KAN wins, ties, or loses on a task that was already competitively served.
Next Topics
- Physics-informed neural networks: a related scientific-ML architecture where KANs are being tested as drop-in replacements for MLP sub-networks.
- Mechanistic interpretability: the post-hoc interpretability toolkit for standard networks. KANs offer a cleaner story on small models; mechanistic interpretability remains the only viable path at frontier scale.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Universal Approximation Theorem (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)