Singular Learning Theory

Sneiderman, Robby


Singular Learning Theory (SLT), developed by Sumio Watanabe, is the Bayesian asymptotic theory of models whose Fisher information matrix is degenerate at the true parameter. Neural networks, mixture models, and hidden Markov models all fall in this class. The Real Log Canonical Threshold (RLCT) replaces half the parameter count in the Bayes free-energy expansion, and the Local Learning Coefficient (LLC) gives an empirical proxy that the developmental-interpretability community uses to study trained networks.

Advanced · Tier 1 · Current · Supporting · ~35 min

For: ML, Stats, Research

Prerequisites

Bayesian Estimation, KL Divergence, Fisher Information, Asymptotic Statistics


What this is

Singular Learning Theory (SLT) is the Bayesian asymptotic theory of statistical models whose Fisher information matrix is singular at the true parameter. The classical Bayesian Information Criterion (BIC) and the Bernstein-von Mises theorem both assume that this matrix is invertible. For neural networks, mixture models, hidden Markov models, and Bayesian networks, it is not. SLT, developed by Sumio Watanabe across the 2000s and consolidated in his 2009 monograph Algebraic Geometry and Statistical Learning Theory, replaces the half-parameter-count term d/2 in those classical results with a geometric invariant called the Real Log Canonical Threshold (RLCT), written λ. The RLCT comes from the algebraic geometry of the set of true parameters, and on singular models it is strictly smaller than d/2. The diagram below sets out the geometric content of that statement side by side.

Diagram. Why singular models break the classical asymptotic. Left: regular Fisher information, a single quadratic minimum. Middle: singular Fisher information, the true-parameter locus is a curve, not a point. Right: the free-energy expansion slope. The classical BIC slope d/2 over-counts the effective parameter count; the RLCT λ is the correct slope.

Why classical asymptotics break for singular models

A model p(x | θ) is regular at a true parameter θ* if θ* is identifiable and the population negative log-likelihood is locally quadratic around θ*, with a positive-definite Hessian equal to the Fisher information matrix I(θ*). For regular models, the Laplace approximation to the marginal likelihood gives the Schwarz/BIC expansion

$$F_n = n L_n + \frac{d}{2} \log n + O_p(1),$$

where F_n = -log Z_n is the negative log marginal likelihood (the Bayes free energy), L_n is the empirical risk at the maximum-likelihood estimate, and d is the number of parameters. The Bernstein-von Mises theorem turns the same Laplace expansion into the statement that the posterior is asymptotically a Gaussian centered at the MLE with covariance I(θ*)⁻¹ / n.

A model is singular when the Fisher information matrix is degenerate on the set of true parameters. The standard examples are not exotic: a single hidden layer of a neural network already has parameter symmetries that make the Fisher matrix rank-deficient, mixture models with more components than the truth needs have a continuous family of equivalent parametrisations, and hidden Markov models inherit the same redundancy from latent states. On such models the level set {θ : KL(q || p_θ) = 0} is a real-analytic set (an algebraic variety when the KL function is polynomial) that can have components of mixed dimension, and the Laplace approximation does not apply: there is no single quadratic basin to integrate over.
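To make the rank deficiency concrete, here is a minimal sketch on a toy model that is not taken from the source: the regression y = a·b·x + noise, with unit noise variance and E[x²] = 1 assumed purely for illustration. The gradient of the mean with respect to (a, b) is (b·x, a·x), so the Fisher information matrix is proportional to [[b², ab], [ab, a²]], which has determinant zero at every parameter and vanishes entirely at a = b = 0.

```python
def fisher_information(a, b, ex2=1.0, noise_var=1.0):
    """Fisher matrix of the toy model y = a*b*x + N(0, noise_var).

    ex2 is the assumed second moment E[x^2] of the input (hypothetical default).
    """
    scale = ex2 / noise_var
    # Outer product of the mean's gradient (b*x, a*x), averaged over x.
    return [[scale * b * b, scale * a * b],
            [scale * a * b, scale * a * a]]


def det2(matrix):
    """Determinant of a 2x2 matrix given as nested lists."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


for a, b in [(1.0, 2.0), (0.5, 0.0), (0.0, 0.0)]:
    info = fisher_information(a, b)
    print((a, b), info, "det =", det2(info))
```

If the truth is the zero function, the KL function of this toy model is K(a, b) = a²b²/2, whose zero set is the union of the two coordinate axes: two one-dimensional components crossing at the origin, where the Fisher matrix is identically zero.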

The consequence is concrete. On a singular model BIC penalises by (d/2) log n, but the actual Bayes free energy has a strictly smaller log-n coefficient. That coefficient is the RLCT.

The Real Log Canonical Threshold (RLCT)

Pick a prior ϕ(θ) whose support contains the true parameter set W₀ = {θ : K(θ) = 0}, where K is the KL function defined below. Watanabe's central object is the zeta function

$$\zeta(z) = \int K(\theta)^z\, \phi(\theta)\, d\theta, \quad K(\theta) = \mathrm{KL}\bigl(q\,\|\,p_\theta\bigr).$$

This integral converges for Re(z) > 0. Watanabe proved that ζ(z) extends meromorphically to the whole complex plane, with poles on the negative real axis. The Real Log Canonical Threshold λ is minus the largest pole, and its multiplicity m is the order of that pole. On regular models, λ = d/2 and m = 1. On singular models, λ ≤ d/2, and λ can be much smaller when the singular set W₀ is large.

The geometric content is Hironaka's resolution of singularities. Under mild conditions on the model, the KL function K(θ) is real-analytic, so resolution gives a manifold Y and a proper analytic map g : Y → Θ such that K ∘ g is locally a monomial in coordinates on Y. From those monomial exponents and the Jacobian of g, one reads off the poles of ζ(z) directly. Concretely, if the local form is K ∘ g(y) = y₁^{2k₁} ⋯ y_r^{2k_r} and the pulled-back prior, ϕ(g(y)) times the Jacobian determinant of g, is a positive function times |y₁^{h₁} ⋯ y_r^{h_r}|, then the local contribution to λ is

$$\min_i \frac{h_i + 1}{2 k_i},$$

taken over the indices with k_i > 0, and the local multiplicity is the number of indices that attain the minimum.

The global RLCT is the minimum over local charts. This recipe is what makes the RLCT computable in closed form for many algebraically tractable models, including reduced-rank regression, certain mixture models, and small Boltzmann machines. For deep networks, the RLCT is generally not known in closed form, which is the gap the empirical local learning coefficient is built to fill.
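The monomial recipe above can be sketched in a few lines. This is an illustrative helper, not code from Watanabe or any library: the names `local_rlct` and `global_rlct` are hypothetical, and the multiplicity convention (the number of indices attaining the minimum, and the largest such count among the minimising charts) is the standard one, stated here as an assumption since the text only defines m through the pole order.

```python
from fractions import Fraction


def local_rlct(k, h):
    """Return (lambda, m) for one chart.

    Assumes K∘g = prod_i y_i^(2*k_i) and prior times Jacobian ~ prod_i |y_i|^(h_i).
    Coordinates with k_i == 0 do not appear in K and are skipped.
    """
    ratios = [Fraction(hi + 1, 2 * ki) for ki, hi in zip(k, h) if ki > 0]
    if not ratios:
        raise ValueError("K∘g is constant in this chart")
    lam = min(ratios)
    m = sum(1 for r in ratios if r == lam)  # indices attaining the minimum
    return lam, m


def global_rlct(charts):
    """Minimum lambda over charts; m is the largest multiplicity among minimisers."""
    results = [local_rlct(k, h) for k, h in charts]
    lam = min(l for l, _ in results)
    m = max(mult for l, mult in results if l == lam)
    return lam, m


print(local_rlct([1, 1], [0, 0]))  # K = y1^2 y2^2: lambda = 1/2, m = 2
print(local_rlct([1, 3], [0, 0]))  # K = y1^2 y2^6: lambda = 1/6, m = 1
print(local_rlct([1], [2]))        # K = r^2 with Jacobian r^2: lambda = 3/2, m = 1
```

The last call is the regular case in disguise: for K(θ) = θ₁² + θ₂² + θ₃², polar coordinates give K = r² with Jacobian r^{d-1} = r², and the rule returns λ = 3/2 = d/2.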

Definition

For a model p_θ, true distribution q, KL function K(θ) = KL(q || p_θ), and prior ϕ of compact support, the RLCT λ and its multiplicity m are

$$-\lambda = \max \bigl\{ \mathrm{Re}(z) : z\text{ is a pole of } \zeta(z) = \int K(\theta)^z \phi(\theta)\, d\theta \bigr\},$$

with m equal to the order of that largest pole. On regular models λ = d/2 and m = 1. On singular models λ ≤ d/2, with equality only if the Fisher information matrix at θ* is positive-definite.
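Two toy cases, chosen here for illustration and not taken from the source, show the definition at work. Both use a uniform prior on a cube, so the zeta integral can be evaluated by hand for Re(z) > -1/2 and then continued to the rest of the plane.

```latex
% Regular case: d = 1, K(\theta) = \theta^2, uniform prior \phi = 1/2 on [-1, 1].
\zeta(z) = \int_{-1}^{1} |\theta|^{2z} \,\tfrac{1}{2}\, d\theta = \frac{1}{2z + 1},
\qquad \text{pole at } z = -\tfrac{1}{2} \text{ of order } 1
\;\Rightarrow\; \lambda = \tfrac{1}{2} = \tfrac{d}{2},\quad m = 1.

% Singular case: d = 2, K(\theta) = \theta_1^2 \theta_2^2, uniform prior \phi = 1/4 on [-1, 1]^2.
\zeta(z) = \Bigl( \int_{-1}^{1} |\theta|^{2z} \,\tfrac{1}{2}\, d\theta \Bigr)^{2} = \frac{1}{(2z + 1)^{2}},
\qquad \text{pole at } z = -\tfrac{1}{2} \text{ of order } 2
\;\Rightarrow\; \lambda = \tfrac{1}{2} < \tfrac{d}{2} = 1,\quad m = 2.
```

In the second case the integral factorises because K is a product of one-variable terms, so the two simple poles coincide and the multiplicity rises to 2.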

Watanabe's free-energy formula

The asymptotic expansion of the Bayes free energy on singular models replaces the (d/2) log n term of BIC with λ log n, and adds a log log n correction that records the multiplicity. Watanabe's formula reads

$$F_n = n L_n + \lambda \log n - (m - 1) \log \log n + O_p(1).$$

Each term carries a plain-language meaning. The $n L_n$ piece is the data fit at the maximum-likelihood estimate, and is $O(n)$ whether the model is regular or not. Next, $\lambda \log n$ is the volume-of-good-parameters correction. The near-optimal tube $\{\theta : K(\theta) < 1/n\}$ carries prior mass of order $n^{-\lambda}$ (up to a factor $(\log n)^{m-1}$), so a smaller $\lambda$ makes that mass decay more slowly in $n$ and leaves more effective prior volume near $W_0$; that larger volume is exactly what gives a smaller free-energy penalty. Models with smaller $\lambda$ are penalised less because their singular set retains more effective prior volume. Last, $(m - 1)\log\log n$ is the higher-order correction from a pole of order greater than one, and is usually small in practice. On a regular model the formula collapses to BIC: $\lambda = d/2$ recovers the $(d/2)\log n$ penalty and $m = 1$ kills the log-log term.
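The volume scaling of the near-optimal tube can be checked numerically. The sketch below is a Monte Carlo illustration under assumptions of my own choosing: a uniform prior on [-1, 1]², a regular toy function K = θ₁² + θ₂² (λ = d/2 = 1), and a singular toy function K = θ₁²θ₂⁶ (λ = 1/6 by the monomial rule). It estimates the exponent as the slope of log prior mass against log ε between two tube widths; at finite ε the slope only approximates λ.

```python
import math
import random


def tube_mass(K, eps, n_samples=200_000, seed=0):
    """Monte Carlo prior mass of {theta : K(theta) < eps}, uniform prior on [-1, 1]^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        t1 = rng.uniform(-1.0, 1.0)
        t2 = rng.uniform(-1.0, 1.0)
        if K(t1, t2) < eps:
            hits += 1
    return hits / n_samples


def volume_exponent(K, eps_hi, eps_lo):
    """Slope of log(mass) against log(eps) between two tube widths."""
    v_hi = tube_mass(K, eps_hi)
    v_lo = tube_mass(K, eps_lo)
    return (math.log(v_hi) - math.log(v_lo)) / (math.log(eps_hi) - math.log(eps_lo))


def k_regular(t1, t2):
    return t1 ** 2 + t2 ** 2       # single quadratic basin, lambda = 1


def k_singular(t1, t2):
    return t1 ** 2 * t2 ** 6       # zero on both axes, lambda = 1/6


regular_slope = volume_exponent(k_regular, 1e-1, 1e-2)
singular_slope = volume_exponent(k_singular, 1e-4, 1e-6)
print("regular exponent  (theory 1):  ", regular_slope)
print("singular exponent (theory 1/6):", singular_slope)
```

The singular tube keeps far more prior mass at the same ε, which is the geometric reason its free-energy penalty is smaller.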

The same $\lambda$ controls the asymptotic Bayes generalisation error and the cross-validation gap. Watanabe (2010) used this to derive the Widely Applicable Information Criterion (WAIC), which estimates the Bayes generalisation error from a single set of posterior draws using the posterior variance of the log-likelihood, and is asymptotically equivalent to leave-one-out Bayes cross-validation on both regular and singular models. Packages such as ArviZ implement WAIC alongside Pareto-smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO); the singular-asymptotic theory is the reason WAIC remains a valid estimate when the Fisher information is degenerate.
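As a rough sketch of what the WAIC computation involves, the function below takes a matrix of pointwise log-likelihoods (posterior draws by data points) and returns the training loss of the Bayes predictive, the functional-variance correction, and their sum, on the per-observation loss scale used in Watanabe (2010). The function names and the input layout are assumptions for illustration; this is not the ArviZ implementation, and it uses the plain (population) variance over draws.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def waic(loglik):
    """WAIC from loglik[s][i] = log p(x_i | w_s), draw s, data point i.

    Returns (T_n, V_n / n, T_n + V_n / n) on the per-observation loss scale.
    """
    n_draws, n = len(loglik), len(loglik[0])
    train_loss = 0.0       # T_n: minus the mean log Bayes predictive density
    functional_var = 0.0   # V_n: summed posterior variance of the log-likelihood
    for i in range(n):
        column = [loglik[s][i] for s in range(n_draws)]
        train_loss -= log_mean_exp(column)
        mean = sum(column) / n_draws
        functional_var += sum((v - mean) ** 2 for v in column) / n_draws
    train_loss /= n
    return train_loss, functional_var / n, train_loss + functional_var / n


# Tiny hand-made example: 3 posterior draws, 2 data points (made-up numbers).
example = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
print(waic(example))
```

Multiplying the returned WAIC by -n converts it to the expected-log-predictive-density scale that many packages report.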

The headline implication for model selection is that BIC over-penalises singular models. A neural network that fits the data well can have an effective $\lambda$ much smaller than $d/2$, and a singular-aware criterion such as WAIC, sBIC, or a direct LLC estimate will rank it differently than BIC does. This is the formal version of the folk observation that overparameterised networks generalise better than parameter-counting would predict.
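A quick numerical comparison shows the size of the gap. The parameter count, RLCT, and multiplicity below are hypothetical values picked only to illustrate the arithmetic; they do not describe any real network.

```python
import math


def bic_penalty(d, n):
    """Classical (d/2) log n complexity term."""
    return 0.5 * d * math.log(n)


def singular_penalty(lam, m, n):
    """Watanabe's lambda log n - (m - 1) log log n complexity term."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))


n = 10_000
d = 1_000            # hypothetical parameter count
lam, m = 40.0, 1     # hypothetical RLCT and multiplicity

print("BIC penalty:     ", bic_penalty(d, n))
print("singular penalty:", singular_penalty(lam, m, n))
print("regular check:   ", singular_penalty(d / 2, 1, n))  # equals the BIC penalty
```

With these made-up numbers the BIC term is more than ten times the singular one, so a model comparison driven by BIC could reject a model that the free energy itself favours.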

Theorem

Watanabe Free-Energy Expansion

Statement

Let $\lambda$ be the real log canonical threshold of $K(\theta) = \mathrm{KL}(q \,\|\, p_\theta)$ relative to the prior, and let $m$ be its multiplicity. The Bayes free energy $F_n = -\log Z_n$ then satisfies

$$F_n = n L_n + \lambda \log n - (m - 1)\log\log n + O_p(1).$$

On regular models $\lambda = d/2$ and $m = 1$, which recovers the Schwarz/BIC expansion $F_n = n L_n + (d/2)\log n + O_p(1)$.

Intuition

The $\lambda \log n$ term measures how fast prior mass concentrates on the near-optimal set as $n$ grows. On a singular model the set of true parameters is a positive-dimensional variety, so its effective volume shrinks more slowly than the single quadratic basin of a regular model, and $\lambda < d/2$.

Proof Sketch

Since $K$ is real-analytic near the true parameter set, Hironaka's resolution of singularities produces coordinates in which $K$ is locally a monomial. In those coordinates the zeta function $\zeta(z) = \int K(\theta)^z \phi(\theta)\, d\theta$ is read off directly: its largest pole sits at $z = -\lambda$ with multiplicity $m$. Watanabe's Tauberian argument transfers that pole to the large-$n$ behaviour of the free energy, where it becomes the $\lambda \log n - (m - 1)\log\log n$ correction; on a regular model the pole is at $z = -d/2$ with $m = 1$ and the expansion collapses to Schwarz/BIC.

Why It Matters

BIC penalises every model by $(d/2)\log n$. The theorem says the honest penalty is $\lambda \log n$ with $\lambda \le d/2$, so BIC systematically over-penalises singular models such as neural networks and mixtures. A singular-aware score (WAIC, sBIC, or a direct LLC estimate) ranks them differently.

Failure Mode

The expansion breaks when $K$ is not real-analytic, since resolution of singularities no longer applies. It also breaks when the prior puts zero mass on part of $W_0$: the threshold is computed over the support of the prior, so an excluded component of the true set is invisible to the formula. If the truth is not realisable by the model, $K$ has no zero and the expansion around a minimum of $K$ replaces the true-parameter version.


The Local Learning Coefficient (LLC)

The RLCT is a global invariant of the model and the true distribution. In modern practice the object that interpretability researchers actually compute is local: the LLC at a specific trained parameter $w_*$. The construction, due to Lau, Furman, Wang, Murfet, and Wei (2023, arXiv:2308.12108), restricts the zeta integral to a neighbourhood of $w_*$ and reads the local pole. Under regularity conditions on the loss landscape, the local RLCT recovers the global one if $w_*$ lies on the most degenerate component of the singular set.

The estimator the authors propose is what makes the LLC usable for deep networks. Run Stochastic Gradient Langevin Dynamics (SGLD) starting from $w_*$, kept near $w_*$ by a localisation term (in practice a quadratic penalty on $\|w - w_*\|^2$ that approximates the restriction to an $\ell^2$ ball) and tempered so that the stationary distribution is the local Gibbs measure at the chosen inverse temperature. The tempered excess loss $\mathbb{E}_{p_\beta}[n L(w)] - n L(w_*)$ scales like $\lambda / \beta$, so multiplying it by $\beta$ recovers the local RLCT; the estimator evaluates it at the WBIC inverse temperature $\beta = 1/\log n$, where the excess loss matches the $\lambda \log n$ term of the free energy. In practice $\mathbb{E}_{p_\beta}[L(w)]$ is the loss averaged over SGLD draws from $p_\beta$, and the construction is engineering-grade enough to apply to deep linear networks up to 100M parameters, ResNet image models, and small transformer language models. Timaeus and collaborators at the University of Melbourne have published replications and refinements; their write-up at timaeus.co/research/2023-08-23-quantifying-degeneracy is the most accessible entry point.

Definition

For a trained parameter $w_* \in \mathbb{R}^d$, a loss $L$, a prior $\phi$ supported in a neighbourhood $U$ of $w_*$, and an inverse temperature $\beta > 0$, the local Gibbs measure is

$$p_\beta(w \mid U, w_*) \propto \exp\bigl(-\beta \cdot n \cdot L(w)\bigr) \cdot \phi(w) \cdot \mathbf{1}_{U}(w).$$

The tempered excess loss scales like $\lambda / \beta$: under the local Gibbs measure at inverse temperature $\beta$,

$$\mathbb{E}_{p_\beta}\bigl[n L(w)\bigr] - n L(w_*) \approx \frac{\lambda(w_*)}{\beta}.$$

Inverting this relation gives the localized-WBIC estimator of Lau, Furman, Wang, Murfet, and Wei,

$$\hat{\lambda}(w_*) = n\beta\,\bigl(\mathbb{E}_{p_\beta}[L(w)] - L(w_*)\bigr), \qquad \beta = \frac{1}{\log n}.$$

The expectation $\mathbb{E}_{p_\beta}[L(w)]$ is the loss averaged over SGLD draws from $p_\beta$, so the LLC comes from a single tempered chain at the WBIC temperature $\beta = 1/\log n$, not from a regression of $\mathbb{E}_{p_\beta}[n L(w)]$ against $\log\beta$.
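The estimator can be exercised on one-parameter toy losses where the answer is known. The sketch below is not the authors' implementation: it uses full-gradient unadjusted Langevin dynamics as a stand-in for minibatch SGLD, drops the localisation term because the toy losses already confine the chain, and treats the loss as a deterministic function with a nominal sample size n. For a density proportional to exp(-a|w|^p) the expectation of a|w|^p is exactly 1/p, so the estimate should approach 1/2 for L(w) = w² and 1/4 for L(w) = w⁴, up to sampling noise and step-size bias.

```python
import math
import random


def llc_estimate(loss, grad, w_star, n, step=5e-4, n_steps=400_000,
                 burn_in=20_000, seed=0):
    """Estimate the LLC at w_star for a scalar-parameter loss.

    Stand-in sampler: unadjusted Langevin dynamics targeting exp(-n * beta * L(w)).
    All defaults are illustrative choices, not tuned recommendations.
    """
    beta = 1.0 / math.log(n)              # WBIC inverse temperature
    rng = random.Random(seed)
    noise_scale = math.sqrt(step)
    w = w_star
    total, count = 0.0, 0
    for t in range(n_steps):
        # Half-step along the tempered gradient plus Gaussian noise.
        w += -0.5 * step * n * beta * grad(w) + noise_scale * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            total += loss(w)
            count += 1
    mean_loss = total / count             # Monte Carlo estimate of E_beta[L(w)]
    return n * beta * (mean_loss - loss(w_star))


n = 1000  # nominal sample size for the toy problem
lam_quadratic = llc_estimate(lambda w: w ** 2, lambda w: 2.0 * w, 0.0, n)
lam_quartic = llc_estimate(lambda w: w ** 4, lambda w: 4.0 * w ** 3, 0.0, n)
print("quadratic loss, theory 1/2:", lam_quadratic)
print("quartic loss,   theory 1/4:", lam_quartic)
```

The quartic loss has a flatter, more degenerate minimum, and the estimate reflects that by returning roughly half the value of the quadratic case even though both models have one parameter.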

What is machine-verified in FormalSLT today

FormalSLT is a Lean 4 library that formalises the finite-class classical statistical learning route from empirical risk minimisation through Rademacher symmetrisation, VC theory, finite Dudley chaining, algorithmic stability, and a finite-[0,1] PAC-Bayes layer. Its public spine is sorry-free and admit-free and builds under the axiom set [propext, Classical.choice, Quot.sound] only. The live module, theorem, and sorry counts are the badges at the top of the FormalSLT README; the module map and the per-result theorem map list which results are formalised.

The Singular Learning Theory layer described in the rest of this page is not formalised in FormalSLT. Its assumptions-and-nonclaims document is explicit about this: hypothesis classes are finite, losses are bounded, Fisher information machinery is not part of the spine, the resolution-of-singularities step is not formalised, and the singular asymptotic of the Bayes free energy is not stated as a Lean theorem. What is present on the PAC-Bayes side is the finite McAllester/Catoni route, not the singular posterior asymptotic of Watanabe.

One naming collision is worth surfacing in one place. "SLT" in the FormalSLT repository name refers to Statistical Learning Theory in the PAC sense (Vapnik-Chervonenkis 1971 onward), not to Watanabe's Singular Learning Theory. Reasonable readers conflate the two; the repository does not target the singular-asymptotic line of work yet. A formalisation lane for it would need at least the real-analytic Laplace integral, the resolution-of-singularities certificate, and a measurable-supremum layer for the posterior; the open-formalization-problems document tracks the gap.

Open problems

The active research frontier around Watanabe SLT splits into three threads, each with its own concrete questions.

Phase transitions and developmental interpretability. The LLC of a trained network is not a single number across training. Several groups, primarily Timaeus and the DevInterp community, have observed that the LLC of a training run undergoes discrete jumps that line up with structural changes in the network, such as the emergence of induction heads in toy transformers. The open question is which structural changes are reliably detected by LLC jumps and which are not, and whether the LLC of a sub-circuit can be used as a quantitative interpretability score. The DevInterp blog at devinterp.com tracks ongoing work; the refined LLC paper on attention-head specialisation is the most cited follow-up.

Computing the RLCT for realistic models. The closed-form RLCT calculations in Watanabe's monograph cover reduced-rank regression, certain mixture models, and small Boltzmann machines. For modern architectures, the RLCT is not known in closed form, and the best available estimates come from the LLC of trained parameters plus theoretical bounds for specific sub-architectures. Sharper bounds for transformer blocks, residual networks, and Bayesian networks with shared parameters would let SLT make sharper generalisation predictions.

Connections to test-time training and posterior tempering. The free-energy formula prices what happens when the posterior is concentrated around a singular set. Two adjacent areas have started to use that language: test-time training frames adaptation as a posterior update around the deployment-time minimum, and tempered Bayes/PAC-Bayes (related to the PAC-Bayes bounds page) treats the inverse temperature as a free parameter. The open direction is whether the LLC of the adapted parameter is a useful signal for when adaptation should stop, and whether singular-aware PAC-Bayes bounds tighten the regular-model bounds for the trained networks where the LLC is small.

Exercises

ExerciseAdvanced

Problem

A two-parameter model has KL function $K(\theta_1, \theta_2) = \theta_1^2 \, \theta_2^4$ near the origin, already a monomial, and the prior is uniform near the origin so the prior exponents are $h_1 = h_2 = 0$. Using the local monomial rule $\min_i (h_i + 1)/(2 k_i)$, compute the local RLCT $\lambda$ and its multiplicity $m$. How does $\lambda$ compare to the regular-model value $d/2$, and what is the resulting free-energy penalty?

References

Canonical:

Watanabe, S. (2009). Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on Applied and Computational Mathematics 25, Cambridge University Press. ISBN 978-0-521-86467-1. The zeta function and its meromorphic continuation are Chapter 4 ("Zeta function and singular integral"); the free-energy expansion F_n = n L_n + λ log n - (m-1) log log n + O_p(1) is the main asymptotic of Chapter 6 ("Singular learning theory").
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6 (2), 461-464.

Current:

Watanabe, S. (2018). Mathematical Theory of Bayesian Statistics. CRC Press. WAIC and its cross-validation equivalence are Chapter 8 ("Information Criteria").

Frontier:

Lau, E., Furman, Z., Wang, G., Murfet, D., Wei, S. (2023). The Local Learning Coefficient: A Singularity-Aware Complexity Measure. arXiv:2308.12108. Published in AISTATS 2025.
Wang, G., et al. (2024). Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient. arXiv:2410.02984.

Last reviewed: June 4, 2026
