
Modern Generalization

Grokking

Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.

Advanced · Tier 2 · ~50 min

Why This Matters

The 1990s-2010s picture of training said: as you train longer, the model first fits the training data, then starts overfitting. Validation performance degrades. You should stop early. Modern practice (post-Nakkiran et al. 2020) routinely trains to convergence. Grokking is one of several results (alongside double descent and benign overfitting) that invalidated the universal early-stopping story, not a unique violation of it.

[Figure: train and validation accuracy vs. epoch (0 to 1200). Train accuracy saturates early (memorization), validation accuracy stays near chance through a long plateau, then jumps sharply to near 100% (the generalization transition).]

Grokking breaks this classical picture. In certain settings, a model reaches perfect training accuracy quickly, then shows no improvement on validation for many more epochs. Then, long after you would have stopped training, validation accuracy suddenly jumps to near-perfect. The model goes from memorizing to generalizing, and the transition is abrupt.

This matters because it shows that overfitting is not necessarily terminal. Continued training with the right regularization can cause a qualitative change in what the model has learned. The practical implication: your training schedule, weight initialization, and regularization choices can determine whether generalization happens at all.

The Phenomenon

Proposition

Grokking: Delayed Generalization

Statement

On structured algorithmic tasks (modular arithmetic, permutation composition, polynomial evaluation), neural networks trained with SGD exhibit the following training dynamics:

  1. Phase 1 (memorization): Training accuracy reaches 100%, typically within $\sim 10^3$ to $\sim 10^4$ steps in the Power et al. (2022) modular arithmetic settings. Validation accuracy remains near chance.

  2. Phase 2 (plateau): Training loss stays near zero. Validation accuracy stays near chance. This phase lasts $\sim 10^4$ to $\sim 10^6$ steps, strongly dependent on train fraction, weight decay, and optimizer.

  3. Phase 3 (generalization): Validation accuracy rises sharply to near 100% over a relatively short window.

The transition from Phase 2 to Phase 3 is a phase transition: the model's internal representation undergoes a qualitative structural change.
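These phase definitions can be made operational. Below is a minimal sketch (hypothetical function names and thresholds, not from the original papers) that labels logged accuracy curves with the three phases and extracts the grokking time, defined here as the first logged step at which validation accuracy reaches a threshold:

```python
def label_phases(train_acc, val_acc, fit_thresh=0.99, gen_thresh=0.95):
    """Label each logged step with its grokking phase.

    'fitting'     -- train accuracy has not yet reached fit_thresh (Phase 1)
    'plateau'     -- train fit, validation still below gen_thresh (Phase 2)
    'generalized' -- validation accuracy has reached gen_thresh (Phase 3)
    """
    labels, fitted, generalized = [], False, False
    for tr, va in zip(train_acc, val_acc):
        fitted = fitted or tr >= fit_thresh
        generalized = generalized or va >= gen_thresh
        if generalized:
            labels.append("generalized")
        elif fitted:
            labels.append("plateau")
        else:
            labels.append("fitting")
    return labels


def grok_time(val_acc, steps, gen_thresh=0.95):
    """First logged step where validation accuracy reaches gen_thresh, else None."""
    for step, va in zip(steps, val_acc):
        if va >= gen_thresh:
            return step
    return None
```

On a run matching the description above (train fit early, validation jump much later), the long stretch of `'plateau'` labels between the two events is the signature of grokking.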

Intuition

The network first finds a memorization solution that stores training pairs in its weights. This is not a literal lookup table, but a distributed representation that achieves the same input-output mapping via high-dimensional geometry. This solution has high weight norm because it stores $n$ unrelated mappings. With weight decay applying constant pressure toward smaller weights, the network is slowly pushed toward lower-norm solutions. Eventually, it crosses a threshold where a compact, generalizing algorithm (e.g., the modular arithmetic circuit) becomes lower-loss than the memorization solution. The transition is sudden because the two solution types are structurally different. There is no smooth interpolation between a distributed memorization and a compact algorithm.

Why It Matters

Grokking shows that optimization time can change what a network knows, not just how well it fits. This has implications for: training schedule design (early stopping may prevent generalization), regularization theory (weight decay is not just preventing overfitting, it is actively shaping the solution landscape), and mechanistic interpretability (you can study how networks transition from memorized to algorithmic representations).

Failure Mode

Grokking is easiest to demonstrate on small, highly structured tasks (modular arithmetic with $\leq 100$ elements). It is unclear how much this phenomenon extends to large-scale, naturalistic training. Some evidence suggests analogous phenomena in larger models, but the clean phase-transition signature becomes less sharp. Do not assume all training runs will exhibit grokking if you just train long enough.

What Drives the Transition

Proposition

Weight Decay as a Sufficient Driver of Grokking

Statement

In the Power et al. (2022) setting, adding $\ell_2$ weight decay is a sufficient mechanism to produce grokking. With weight decay, the effective loss is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \|w\|^2$$

After the memorization phase ($\mathcal{L}_{\text{data}} \approx 0$), the dominant gradient signal comes from the weight decay term, which pushes the weights toward the origin. This continuous pressure compresses the representation until the network transitions to a lower-norm solution that generalizes.

Empirical observation (not a theorem): Power et al. (2022) report that the grokking time $T_{\text{grok}}$ scales roughly as $T_{\text{grok}} \propto 1/\lambda$ for small $\lambda$ in this setting. This is empirical; no closed-form derivation exists. Alternative framings make the scaling a consequence of other quantities: Liu et al. (2023, "Omnigrok") cast grokking as governed by the ratio of weight norm to a critical value; Varma et al. (2023) cast it as a circuit efficiency ratio between memorizing and generalizing circuits.
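The $1/\lambda$ trend has a simple caricature. Once the data loss is near zero, each plain gradient step multiplies the weights by roughly $(1 - \eta\lambda)$, so shrinking the norm by any fixed factor takes about $\ln(1/\text{ratio})/(\eta\lambda)$ steps, i.e. time $\propto 1/\lambda$. A toy sketch with hypothetical constants (this models only the compression clock, not the basin switch itself):

```python
def steps_to_shrink(lam, lr=1e-3, ratio=0.1):
    """Steps of pure weight decay, w <- (1 - lr*lam) * w, needed to shrink
    the weight norm to `ratio` of its initial value.
    Closed form: about ln(1/ratio) / (lr * lam) steps, so time ~ 1/lam."""
    w, steps = 1.0, 0
    while w > ratio:
        w *= 1.0 - lr * lam
        steps += 1
    return steps


# Halving the weight decay coefficient roughly doubles the compression time.
t_strong = steps_to_shrink(lam=1.0)
t_weak = steps_to_shrink(lam=0.5)
```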

Weight decay is sufficient, not necessary. Later work shows grokking can arise from other sources of compression or from a lazy-to-rich transition without explicit $\ell_2$ weight decay. See the Failure Mode and the "When Grokking Happens" table for scope.

Intuition

Weight decay is doing implicit architecture search. After memorization, the loss landscape near the memorizing solution looks flat (training loss is zero regardless of small weight changes). But the weight decay gradient keeps pulling the weights inward. Over time, this compression forces the network out of the memorization basin and into a generalization basin. Stronger weight decay means faster compression and faster grokking. Too much weight decay prevents memorization in the first place.

The deeper story is that some implicit or explicit force toward lower-complexity representations is what drives the transition. $\ell_2$ weight decay is the cleanest and most studied version, but it is not the only one.

Why It Matters

This connects grokking to the broader theory of implicit bias. Gradient descent has specific biases: max-margin for logistic loss on separable data (Soudry et al. 2018), min-norm for squared loss. On algorithmic tasks with weight decay (Power 2022, Nanda 2023), the min-norm interpolator correlates with the generalizing algorithm rather than the memorizing solution. This correlation is task-dependent: for natural data, the min-norm interpolator is generally NOT the generalizing algorithm. Grokking on modular arithmetic is a concrete case where the implicit bias aligns with the task's compositional structure.
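The min-norm bias for squared loss can be checked directly in a toy setting (hypothetical sizes and seed): on an underdetermined linear system, gradient descent from zero initialization converges to the minimum-norm interpolator given by the pseudoinverse, because the iterates never leave the row space of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))  # 5 equations, 20 unknowns: underdetermined
y = rng.standard_normal(5)

# Gradient descent on squared loss, starting from zero initialization.
w = np.zeros(20)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolator, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y
assert np.allclose(w, w_min_norm, atol=1e-6)
```

Among the infinitely many interpolators of this system, gradient descent picks out exactly the lowest-norm one; grokking on modular arithmetic is the case where that kind of bias happens to coincide with the generalizing algorithm.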

Failure Mode

Necessity of weight decay is an overclaim. Kumar, Bordelon, Gershman, Pehlevan ("Grokking as the Transition from Lazy to Rich Training Dynamics," arXiv 2310.06110, 2023) show grokking in two-layer MLPs without explicit $\ell_2$ weight decay, driven instead by a transition from the lazy (NTK-like) to rich (feature-learning) regime. Other work demonstrates grokking induced by non-$\ell_2$ regularizers (e.g., sparsity penalties, gradient noise) and argues that in some overparameterized settings grokking can appear without any explicit regularization at all, with the implicit bias of SGD providing the compression.

Separately, the $1/\lambda$ scaling is approximate and task-dependent. For some tasks, even strong weight decay does not produce grokking if the generalizing solution requires large weights in specific directions.

Mechanistic View

Recent work (Nanda et al., 2023, arXiv 2301.05217) has opened up the internal mechanics of grokking on modular addition. This work builds on the same mechanistic interpretability techniques used to identify induction heads in transformers. Before the transition, the network uses a memorization circuit stored in weights. After the transition, the network implements a sparse trigonometric circuit: for each frequency $\omega$ in a small learned set of key frequencies (typically $\sim 5$ to $6$ out of $d_{\text{model}}$), the network computes $\cos(\omega(a+b))$ and $\sin(\omega(a+b))$ via the identity $\cos(A+B) = \cos(A)\cos(B) - \sin(A)\sin(B)$, combining $[\cos(\omega a), \sin(\omega a)]$ and $[\cos(\omega b), \sin(\omega b)]$. The representation is sparse: only a handful of frequencies are used, not a general DFT.
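The trigonometric circuit can be replayed outside any network. The sketch below uses a hypothetical set of key frequencies (not the ones any particular trained model learns) and shows that the angle-addition identity plus a cosine readout over candidate answers computes $(a+b) \bmod p$ exactly:

```python
import numpy as np

p = 97
freqs = np.array([3, 8, 21, 36, 44])  # hypothetical key frequencies

def mod_add_via_trig(a, b):
    """Compute (a + b) mod p the way the generalizing circuit does:
    build cos/sin of w*(a+b) from the angle-addition identity, then score
    every candidate c by cos(w*(a+b-c)) summed over the key frequencies."""
    w = 2 * np.pi * freqs / p
    # cos(A+B) = cos A cos B - sin A sin B; sin(A+B) = sin A cos B + cos A sin B
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    c = np.arange(p)
    # logits[c] = sum_w cos(w*(a+b) - w*c); peaks exactly at c = (a+b) mod p
    # because p is prime, so freq*(a+b-c) = 0 (mod p) forces a+b-c = 0 (mod p).
    logits = (cos_ab[None, :] * np.cos(np.outer(c, w))
              + sin_ab[None, :] * np.sin(np.outer(c, w))).sum(axis=1)
    return int(np.argmax(logits))

assert all(mod_add_via_trig(a, b) == (a + b) % p
           for a in range(0, p, 7) for b in range(0, p, 11))
```

In the trained networks, roughly speaking, the embedding supplies the per-token cos/sin features, the intermediate layers compute the products, and the unembedding performs the readout over candidates $c$.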

Key observations:

  • The transition involves a competition between two circuits that coexist during the plateau phase
  • Weight decay gradually weakens the memorization circuit while the generalization circuit grows
  • Varma et al. (2023) frame the competition as circuit efficiency under a weight budget: the memorization circuit has high weight norm because it stores $n$ input-output pairs; the generalization circuit has lower norm because it is a compact algorithm. Weight decay favors the lower-norm circuit. Once the generalization circuit is accurate on training inputs, memorization outputs become interference rather than helpful signal. This is efficiency competition under a weight budget, not neuronal-style mutual inhibition.

This is one of the clearest examples of mechanistic interpretability revealing how learned algorithms emerge.

When Grokking Happens and When It Does Not

| Condition | Grokking likely? | Why |
| --- | --- | --- |
| Small structured dataset, weight decay | Yes | Classic setting from Power et al. (2022) |
| Large naturalistic dataset, weight decay | Sometimes | Harder to detect; may happen but be masked by noise |
| Two-layer MLP, no weight decay (Kumar 2024 setting) | Sometimes | Lazy-to-rich transition can produce grokking without $\ell_2$ decay |
| No weight decay, no other compression pressure | Rare | The default assumption; easiest to break if another regularizer (sparsity, gradient noise, dropout, small-init implicit bias) is present |
| Dataset too small for memorization | No | Model cannot memorize, so Phase 1 does not occur |
| Very high weight decay | No | Prevents memorization entirely; model underfits from the start |
| Dropout instead of weight decay | Unclear | Published grokking results primarily use AdamW; dropout-only effects on grokking are underexplored |

Common Confusions

Watch Out

Grokking is not just late convergence

In standard training, validation loss may improve slowly after training loss converges. This is smooth, gradual improvement. Grokking is specifically the sudden phase transition after a long plateau. The sharpness of the transition is the distinguishing feature. If validation accuracy improves smoothly, that is normal training dynamics, not grokking.
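One crude way to quantify the distinction (an illustrative heuristic, not a standard metric): compare the largest single-interval gain in validation accuracy to the total gain over the run.

```python
def transition_sharpness(val_acc):
    """Fraction of the total validation-accuracy gain that occurs in the
    single largest jump between consecutive logged points.
    Near 1 -> one sudden jump (grokking-like); small -> smooth improvement."""
    gains = [b - a for a, b in zip(val_acc, val_acc[1:])]
    total = val_acc[-1] - val_acc[0]
    return max(gains) / total if total > 0 else 0.0

smooth = [0.1, 0.3, 0.5, 0.7, 0.9]        # steady improvement -> 0.25
grokked = [0.02, 0.02, 0.03, 0.03, 0.95]  # plateau then jump -> ~0.99
```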

Watch Out

Grokking is not evidence that you should always train longer

The conditions for grokking are specific: overparameterized model, structured task, appropriate regularization. On most practical tasks, training past convergence with standard hyperparameters does not produce grokking. It produces memorization. The lesson is that regularization can change the qualitative outcome of training, not that more epochs are universally better.

Watch Out

Grokking does not mean weight decay always helps

Weight decay enables grokking in specific settings, but it can also hurt. If the optimal solution requires large weights (e.g., sparse features with high activation), strong weight decay may prevent the network from ever finding it. The grokking story is about the interaction between weight decay, model capacity, and task structure.

Watch Out

Weight decay is sufficient, not necessary

A common reading of Power et al. (2022) is "no weight decay, no grokking." That overstates what the paper shows. Power et al. demonstrate that in their modular-arithmetic setting $\ell_2$ weight decay reliably produces grokking; they do not prove it is the only route. Kumar et al. (2024) produce grokking in two-layer MLPs without explicit weight decay by controlling the lazy-to-rich transition directly; other papers induce grokking via non-$\ell_2$ regularizers (sparsity, gradient noise) or via the implicit bias of small initialization. The mechanistic picture is better stated as "some form of compression pressure drives the transition," with weight decay as the best-studied example.

Exercises

ExerciseCore

Problem

Define the grokking time $T_{\text{grok}}$ as the step at which validation accuracy first exceeds 95% (the operational definition used in Power et al. 2022). You train a small transformer on modular addition mod 97 with 50% of the data as training data. Training accuracy reaches 100% at step 1,000. At step 10,000, validation accuracy is still 30%. At step 50,000, validation accuracy jumps to 95%. Sketch the expected train loss, train accuracy, and validation accuracy curves. What is $T_{\text{grok}}$?

ExerciseAdvanced

Problem

Using the operational definition above (grokking time $T_{\text{grok}}$ = step at which validation accuracy first exceeds 95%), give a heuristic argument for why $T_{\text{grok}}$ should scale approximately as $1/\lambda$ with weight decay coefficient $\lambda$ in the Power et al. (2022) setting. What determines the proportionality constant? (This is an empirical scaling, not a theorem.)

References

Canonical:

  • Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (ICLR 2022 MATH-AI workshop, arXiv 2201.02177). The original discovery, in the $\ell_2$ weight-decay setting.
  • Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023). Internal mechanics of the transition on modular addition.

Extensions and reframings:

  • Liu et al., "Omnigrok: Grokking Beyond Algorithmic Data" (ICLR 2023). Extensions to broader settings; argues grokking is controlled by the ratio of weight norm to a critical value, not weight decay per se.
  • Kumar, Bordelon, Gershman, Pehlevan, "Grokking as the Transition from Lazy to Rich Training Dynamics" (arXiv 2310.06110, 2023). Produces grokking in two-layer MLPs without explicit weight decay, via the lazy (NTK-like) to rich (feature-learning) transition. Direct counterexample to the "weight decay is necessary" reading.
  • Thilak et al., "The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon" (2022). Adaptive-optimizer route to grokking.
  • Varma et al., "Explaining grokking through circuit efficiency" (2023). Frames the transition as a competition between memorization and generalization circuits with different efficiencies.

Context (not grokking-specific, but relevant for implicit bias):

  • Chizat & Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018) and "Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss" (COLT 2020). Lazy vs. rich regime foundations.
  • Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data" (JMLR 2018). Max-margin bias of GD without explicit regularization.

Next Topics

  • Double descent: another violation of classical learning theory; models improve past the interpolation threshold
  • Open problems in ML theory: grokking connects to the broader question of why overparameterized networks generalize

Last reviewed: April 2026
