Applied ML
Symbolic Regression and Equation Discovery
SINDy sparse identification of dynamics, PySR genetic-programming search, AI Feynman dimensional analysis, Eureqa baselines, Pareto fronts for parsimony, and the regimes where sparse regression succeeds or fails.
Why This Matters
A neural network that fits a dataset well still leaves the underlying law hidden inside its weights. Symbolic regression aims for the opposite end of the interpretability spectrum: search the space of closed-form expressions for one that fits the data and is short enough to read. The output is a formula, not a black box.
The field cracked open in 2009 (Schmidt and Lipson, Eureqa) and matured between 2016 and 2023 with three distinct algorithmic families. Sparse regression in a fixed library (SINDy) is fast and provably consistent under restricted-isometry conditions but limited to the chosen basis. Genetic programming (PySR, Operon) searches expression trees directly, slower but basis-free. Dimensional-analysis pipelines (AI Feynman) exploit physical units to prune the search space dramatically.
For ML, equation discovery sits at the boundary of model selection, interpretability, and physics-informed learning. It is one of few methods that produces a model a human can read, falsify, and extend without retraining.
Core Ideas
SINDy: sparse identification of nonlinear dynamics. Brunton, Proctor, and Kutz (PNAS 113, 2016; arXiv 1509.03580) cast equation discovery as a sparse linear regression. Build a library Θ(X) of candidate functions (polynomials, trigonometric terms, etc.), estimate time derivatives Ẋ from data, and solve Ẋ ≈ Θ(X)Ξ with a sparsity penalty on the coefficient matrix Ξ. The active columns of Ξ identify which library terms appear; the coefficients give their amplitudes. SINDy recovers the Lorenz system, the Navier-Stokes vorticity equation, and many ODEs from clean data. Performance degrades sharply with measurement noise (derivative estimation is noisy) and when the true dynamics fall outside the library.
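The core loop is compact enough to sketch. Below is a minimal, numpy-only version of sequentially thresholded least squares (STLSQ) on a hypothetical linear 2-D system, using exact derivatives to stay in the clean-data regime described above; the system, library, and threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Simulate a linear 2-D system x' = A x on a fine grid (clean data).
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
dt, n = 0.001, 20000
X = np.empty((n, 2))
X[0] = [2.0, 0.0]
for k in range(n - 1):                       # forward Euler, small step
    X[k + 1] = X[k] + dt * (A @ X[k])
dX = X @ A.T                                 # exact derivatives (clean-data assumption)

# Candidate library Theta(X): polynomials up to degree 2.
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones(n), x, y, x * x, x * y, y * y])
names = ["1", "x", "y", "x^2", "xy", "y^2"]

def stlsq(Theta, dX, threshold=0.05, iters=10):
    """Least squares, then repeatedly zero small coefficients and refit."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dX.shape[1]):         # refit each equation on survivors
            big = ~small[:, j]
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dX[:, j], rcond=None)[0]
    return Xi

Xi = stlsq(Theta, dX)
for j, lhs in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{Xi[i, j]:+.2f} {names[i]}" for i in range(len(names)) if Xi[i, j] != 0]
    print(lhs, "=", " ".join(terms))
```

With exact derivatives, the quadratic and constant columns are thresholded away and the surviving coefficients match A; the noise sensitivity the text warns about enters as soon as dX must be estimated from X instead.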
PySR and modern genetic programming. Cranmer (arXiv 2305.01582, 2023) released PySR as a high-performance Julia/Python implementation of genetic-programming symbolic regression with a Pareto front over accuracy and expression complexity. Population-based search mutates and crosses expression trees; the algorithm returns the full Pareto front, leaving the parsimony-versus-fit tradeoff to the user. PySR and Operon (Burlacu et al. 2020) are the current state of the practice for tabular symbolic regression and consistently top the SRBench benchmark.
AI Feynman. Udrescu and Tegmark (Science Advances 6, 2020; arXiv 1905.11481) decompose symbolic regression by exploiting properties physicists rely on: dimensional analysis, separability, symmetry, and smoothness. The pipeline tests for these structural properties on the data, recursively decomposes the regression problem, and falls back to brute-force search on small subproblems. AI Feynman recovered all 100 of the formulas the authors curated from the Feynman Lectures on Physics, where prior methods recovered substantially fewer.
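The dimensional-analysis step can be illustrated in isolation: write each variable's units as an exponent vector, and the nullspace of the resulting dimension matrix yields the dimensionless groups (Buckingham pi) that shrink the search space. The pendulum variables below are a textbook illustration, not drawn from the paper's benchmark.

```python
import numpy as np

# Illustrative pendulum variables: period tau, length l, gravity g, mass m.
# Rows are exponents of (mass, length, time) in each variable's units.
variables = ["tau", "l", "g", "m"]
D = np.array([
    [0, 0, 0, 1],    # mass:   only m carries kg
    [0, 1, 1, 0],    # length: l and g carry metres
    [1, 0, -2, 0],   # time:   tau is s, g is m/s^2
], dtype=float)

# Nullspace vectors of D are exponent combinations whose product is
# dimensionless: D @ v = 0 means all units cancel exactly.
_, s, Vt = np.linalg.svd(D)
rank = int(np.sum(s > 1e-10))
null = Vt[rank:]                  # rows spanning the nullspace
pi = null[0] / null[0][2]         # scale so g's exponent is +1

group = " * ".join(f"{v}^{e:g}" for v, e in zip(variables, pi) if abs(e) > 1e-10)
print("dimensionless group:", group)   # tau^2 * l^-1 * g^1, so tau ~ sqrt(l/g)
```

Mass drops out of the group entirely, which is exactly the kind of structural pruning the pipeline performs before any search begins.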
Pareto fronts and parsimony. Symbolic regression has no single objective. A constant function has minimal complexity and large error; a deep expression tree can interpolate noise. The principled output is the Pareto front of expressions that are not dominated on both axes. Selecting from the front uses an Occam criterion such as MDL, AIC on a complexity-penalized likelihood, or domain-expert judgment. Reporting only the highest-accuracy expression is the classic overfitting trap.
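A sketch of front extraction and one possible Occam selection rule, using made-up (complexity, error) pairs; the log-error-drop-per-complexity score below is similar in spirit to PySR's model selection but is an assumption here, not its exact rule.

```python
import math

# Hypothetical candidates: expression -> (complexity, validation error).
candidates = {
    "c0":                  (1, 4.10),
    "a*x":                 (3, 1.20),
    "a*exp(x)":            (4, 2.00),
    "a*x + b":             (5, 0.90),
    "a*sin(b*x)":          (6, 0.05),
    "a*sin(b*x) + c*x**3": (12, 0.049),
}

def pareto_front(models):
    """Keep expressions not dominated on both complexity and error."""
    return {
        name: (c, e) for name, (c, e) in models.items()
        if not any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                   for c2, e2 in models.values())
    }

front = sorted(pareto_front(candidates).items(), key=lambda kv: kv[1][0])

# Occam selection: biggest drop in log-error per unit of added complexity.
best, best_score = front[0][0], 0.0
for (_, (ca, ea)), (name, (cb, eb)) in zip(front, front[1:]):
    score = (math.log(ea) - math.log(eb)) / (cb - ca)
    if score > best_score:
        best, best_score = name, score

print("front:", [n for n, _ in front])
print("selected:", best)
```

Here "a*exp(x)" is dominated and drops off the front, and the knee of the front ("a*sin(b*x)") is selected rather than the marginally more accurate but twice-as-complex final expression.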
Where sparse regression succeeds and fails. SINDy and its variants succeed when the library contains the true terms, derivatives can be estimated cleanly (low noise, fine sampling), and the dynamics are well-conditioned. They fail when the true equation lies outside the library (no sparsity penalty can recover, say, a sine term from a purely polynomial library), when noise dominates the derivative estimate, or when the system is highly non-normal, so coefficients are weakly identifiable. Genetic programming sidesteps the library problem at substantial compute cost.
Common Confusions
Symbolic regression is not interpretable by default
A 30-node expression tree with nested transcendental functions is no more readable than a small neural network. Interpretability requires actively constraining complexity, either through Pareto-front selection or hard depth caps. PySR's strength is exposing the tradeoff explicitly; the user still chooses where on the front to live.
SINDy success on clean data does not transfer to noisy data
Demonstrations on noise-free Lorenz or Lotka-Volterra trajectories make SINDy look magical. With realistic measurement noise on derivatives, the procedure needs total-variation derivative estimators, weak-form formulations (Messenger and Bortz 2021), or implicit-SINDy variants to remain identifiable. Clean-data benchmarks understate the engineering effort of real applications.
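A small numpy experiment makes the point: differentiating the raw signal amplifies the noise by a factor on the order of 1/Δt, while smoothing first keeps the derivative usable. The global polynomial fit below is a deliberately simple stand-in for the total-variation and weak-form estimators cited above, not a recommendation.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 400)
x_true, dx_true = np.sin(t), np.cos(t)
x_noisy = x_true + rng.normal(0.0, 0.02, t.size)   # 2% measurement noise

# Naive finite differences: the noise, not the signal, dominates.
dx_naive = np.gradient(x_noisy, t)

# Smooth first, then differentiate: degree-9 least-squares polynomial
# as a toy stand-in for TV-regularized / weak-form derivative estimators.
coeffs = P.polyfit(t, x_noisy, 9)
dx_smooth = P.polyval(t, P.polyder(coeffs))

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("naive derivative RMSE   :", rmse(dx_naive, dx_true))
print("smoothed derivative RMSE:", rmse(dx_smooth, dx_true))
```

Even 2% noise ruins the naive derivative while barely touching the smoothed one, which is why the derivative estimator, not the regression, is usually the engineering bottleneck.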
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Lasso Regression (Layer 2)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Sparse Recovery and Compressed Sensing (Layer 4)
- Sub-Gaussian Random Variables (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)