Applied ML
Symbolic Regression and Equation Discovery
SINDy sparse identification of dynamics, PySR genetic-programming search, AI Feynman dimensional analysis, Eureqa baselines, Pareto fronts for parsimony, and the regimes where sparse regression succeeds or fails.
Why This Matters
A neural network that fits a dataset well still leaves the underlying law hidden inside its weights. Symbolic regression aims for the opposite end of the interpretability spectrum: search the space of closed-form expressions for one that fits the data and is short enough to read. The output is a formula, not a black box.
The field cracked open in 2009 (Schmidt and Lipson, Eureqa) and matured between 2016 and 2023 with three distinct algorithmic families. Sparse regression in a fixed library (SINDy) is fast and provably consistent under restricted-isometry conditions but limited to the chosen basis. Genetic programming (PySR, Operon) searches expression trees directly, slower but basis-free. Dimensional-analysis pipelines (AI Feynman) exploit physical units to prune the search space dramatically.
For ML, equation discovery sits at the boundary of model selection, interpretability, and physics-informed learning. It is one of few methods that produces a model a human can read, falsify, and extend without retraining.
Core Ideas
SINDy: sparse identification of nonlinear dynamics. Brunton, Proctor, and Kutz (PNAS 113, 2016; arXiv 1509.03580) cast equation discovery as a sparse linear regression. Build a library Θ(X) of candidate functions (polynomials, trigonometric terms, etc.), estimate time derivatives Ẋ from data, and solve Ẋ ≈ Θ(X)Ξ with a sparsity penalty on the coefficient matrix Ξ. The active columns of Ξ identify which library terms appear; the coefficients give their amplitudes. SINDy recovers the Lorenz system, the Navier-Stokes vorticity equation, and many ODEs from clean data. Performance degrades sharply with measurement noise (derivative estimation is noisy) and when the true dynamics fall outside the library.
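The core loop is compact enough to sketch. Below is a minimal, numpy-only version of sequentially thresholded least squares (STLSQ) on a hypothetical linear 2-D system, using exact derivatives to stay in the clean-data regime described above; the system, library, and threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Simulate a linear 2-D system x' = A x on a fine grid (clean data).
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
dt, n = 0.001, 20000
X = np.empty((n, 2))
X[0] = [2.0, 0.0]
for k in range(n - 1):                       # forward Euler, small step
    X[k + 1] = X[k] + dt * (A @ X[k])
dX = X @ A.T                                 # exact derivatives (clean-data assumption)

# Candidate library Theta(X): polynomials up to degree 2.
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones(n), x, y, x * x, x * y, y * y])
names = ["1", "x", "y", "x^2", "xy", "y^2"]

def stlsq(Theta, dX, threshold=0.05, iters=10):
    """Least squares, then repeatedly zero small coefficients and refit."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dX.shape[1]):         # refit each equation on survivors
            big = ~small[:, j]
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dX[:, j], rcond=None)[0]
    return Xi

Xi = stlsq(Theta, dX)
for j, lhs in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{Xi[i, j]:+.2f} {names[i]}" for i in range(len(names)) if Xi[i, j] != 0]
    print(lhs, "=", " ".join(terms))
```

With exact derivatives, the quadratic and constant columns are thresholded away and the surviving coefficients match A; the noise sensitivity the text warns about enters as soon as dX must be estimated from X instead.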
PySR and modern genetic programming. Cranmer (arXiv 2305.01582, 2023) released PySR as a high-performance Julia/Python implementation of genetic-programming symbolic regression with a Pareto front over accuracy and expression complexity. Population-based search mutates and crosses expression trees; the algorithm returns the full Pareto front, leaving the parsimony-versus-fit tradeoff to the user. PySR and Operon (Burlacu et al. 2020) are the current state of the practice for tabular symbolic regression and consistently top the SRBench benchmark.
AI Feynman. Udrescu and Tegmark (Science Advances 6, 2020; arXiv 1905.11481) decompose symbolic regression by exploiting properties physicists rely on: dimensional analysis, separability, symmetry, and smoothness. The pipeline tests for these structural properties on the data, recursively decomposes the regression problem, and falls back to brute-force search on small subproblems. AI Feynman recovered all 100 of the formulas the authors curated from the Feynman Lectures on Physics, where prior methods recovered substantially fewer.
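The dimensional-analysis step can be illustrated in isolation: write each variable's units as an exponent vector, and the nullspace of the resulting dimension matrix yields the dimensionless groups (Buckingham pi) that shrink the search space. The pendulum variables below are a textbook illustration, not drawn from the paper's benchmark.

```python
import numpy as np

# Illustrative pendulum variables: period tau, length l, gravity g, mass m.
# Rows are exponents of (mass, length, time) in each variable's units.
variables = ["tau", "l", "g", "m"]
D = np.array([
    [0, 0, 0, 1],    # mass:   only m carries kg
    [0, 1, 1, 0],    # length: l and g carry metres
    [1, 0, -2, 0],   # time:   tau is s, g is m/s^2
], dtype=float)

# Nullspace vectors of D are exponent combinations whose product is
# dimensionless: D @ v = 0 means all units cancel exactly.
_, s, Vt = np.linalg.svd(D)
rank = int(np.sum(s > 1e-10))
null = Vt[rank:]                  # rows spanning the nullspace
pi = null[0] / null[0][2]         # scale so g's exponent is +1

group = " * ".join(f"{v}^{e:g}" for v, e in zip(variables, pi) if abs(e) > 1e-10)
print("dimensionless group:", group)   # tau^2 * l^-1 * g^1, so tau ~ sqrt(l/g)
```

Mass drops out of the group entirely, which is exactly the kind of structural pruning the pipeline performs before any search begins.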
Pareto fronts and parsimony. Symbolic regression has no single objective. A constant function has minimal complexity and large error; a deep expression tree can interpolate noise. The principled output is the Pareto front of expressions that are not dominated on both axes. Selecting from the front uses an Occam criterion such as MDL, AIC on a complexity-penalized likelihood, or domain-expert judgment. Reporting only the highest-accuracy expression is the classic overfitting trap.
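A sketch of front extraction and one possible Occam selection rule, using made-up (complexity, error) pairs; the log-error-drop-per-complexity score below is similar in spirit to PySR's model selection but is an assumption here, not its exact rule.

```python
import math

# Hypothetical candidates: expression -> (complexity, validation error).
candidates = {
    "c0":                  (1, 4.10),
    "a*x":                 (3, 1.20),
    "a*exp(x)":            (4, 2.00),
    "a*x + b":             (5, 0.90),
    "a*sin(b*x)":          (6, 0.05),
    "a*sin(b*x) + c*x**3": (12, 0.049),
}

def pareto_front(models):
    """Keep expressions not dominated on both complexity and error."""
    return {
        name: (c, e) for name, (c, e) in models.items()
        if not any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                   for c2, e2 in models.values())
    }

front = sorted(pareto_front(candidates).items(), key=lambda kv: kv[1][0])

# Occam selection: biggest drop in log-error per unit of added complexity.
best, best_score = front[0][0], 0.0
for (_, (ca, ea)), (name, (cb, eb)) in zip(front, front[1:]):
    score = (math.log(ea) - math.log(eb)) / (cb - ca)
    if score > best_score:
        best, best_score = name, score

print("front:", [n for n, _ in front])
print("selected:", best)
```

Here "a*exp(x)" is dominated and drops off the front, and the knee of the front ("a*sin(b*x)") is selected rather than the marginally more accurate but twice-as-complex final expression.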
Where sparse regression succeeds and fails. SINDy and its variants succeed when the library contains the true terms, derivatives can be estimated cleanly (low noise, fine sampling), and the dynamics are well-conditioned. They fail when the true equation lies outside the library (no sparsity penalty can recover, say, a sine term from a purely polynomial library), when noise dominates the derivative estimate, or when the system is highly non-normal, so coefficients are weakly identifiable. Genetic programming sidesteps the library problem at substantial compute cost.
Common Confusions
Symbolic regression is not interpretable by default
A 30-node expression tree with nested transcendental functions is no more readable than a small neural network. Interpretability requires actively constraining complexity, either through Pareto-front selection or hard depth caps. PySR's strength is exposing the tradeoff explicitly; the user still chooses where on the front to live.
SINDy success on clean data does not transfer to noisy data
Demonstrations on noise-free Lorenz or Lotka-Volterra trajectories make SINDy look magical. With realistic measurement noise on derivatives, the procedure needs total-variation derivative estimators, weak-form formulations (Messenger and Bortz 2021), or implicit-SINDy variants to remain identifiable. Clean-data benchmarks understate the engineering effort of real applications.
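A small numpy experiment makes the point: differentiating the raw signal amplifies the noise by a factor on the order of 1/Δt, while smoothing first keeps the derivative usable. The global polynomial fit below is a deliberately simple stand-in for the total-variation and weak-form estimators cited above, not a recommendation.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 400)
x_true, dx_true = np.sin(t), np.cos(t)
x_noisy = x_true + rng.normal(0.0, 0.02, t.size)   # 2% measurement noise

# Naive finite differences: the noise, not the signal, dominates.
dx_naive = np.gradient(x_noisy, t)

# Smooth first, then differentiate: degree-9 least-squares polynomial
# as a toy stand-in for TV-regularized / weak-form derivative estimators.
coeffs = P.polyfit(t, x_noisy, 9)
dx_smooth = P.polyval(t, P.polyder(coeffs))

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("naive derivative RMSE   :", rmse(dx_naive, dx_true))
print("smoothed derivative RMSE:", rmse(dx_smooth, dx_true))
```

Even 2% noise ruins the naive derivative while barely touching the smoothed one, which is why the derivative estimator, not the regression, is usually the engineering bottleneck.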
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Lasso Regression (Layer 2)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Sparse Recovery and Compressed Sensing (Layer 4)
- Sub-Gaussian Random Variables (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)