
Modern Generalization

Grokking

Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.

Advanced · Tier 2 · ~50 min

Why This Matters

The 1990s-2010s picture of training said: as you train longer, the model first fits the training data, then starts overfitting. Validation performance degrades. You should stop early. Modern practice (post-Nakkiran et al. 2020) routinely trains to convergence. Grokking is one of several results (alongside double descent and benign overfitting) that invalidated the universal early-stopping story, not a unique violation of it.

[Figure: train and validation accuracy vs. epoch (0 to 1200). Train accuracy saturates early (memorization), validation accuracy stays near chance through a long plateau, then jumps sharply to near 100% (the generalization transition).]

Grokking breaks this classical picture. In certain settings, a model reaches perfect training accuracy quickly, then shows no improvement on validation for many more epochs. Then, long after you would have stopped training, validation accuracy suddenly jumps to near-perfect. The model goes from memorizing to generalizing, and the transition is abrupt.

This matters because it shows that overfitting is not necessarily terminal. Continued training with the right regularization can cause a qualitative change in what the model has learned. The practical implication: your training schedule, weight initialization, and regularization choices can determine whether generalization happens at all.

The Phenomenon

Proposition

Grokking: Delayed Generalization

Statement

On structured algorithmic tasks (modular arithmetic, permutation composition, polynomial evaluation), neural networks trained with SGD exhibit the following training dynamics:

  1. Phase 1 (memorization): Training accuracy reaches 100%, typically within $\sim 10^3$ to $\sim 10^4$ steps in the Power et al. (2022) modular arithmetic settings. Validation accuracy remains near chance.

  2. Phase 2 (plateau): Training loss stays near zero. Validation accuracy stays near chance. This phase lasts $\sim 10^4$ to $\sim 10^6$ steps, strongly dependent on train fraction, weight decay, and optimizer.

  3. Phase 3 (generalization): Validation accuracy rises sharply to near 100% over a relatively short window.

The transition from Phase 2 to Phase 3 is a phase transition: the model's internal representation undergoes a qualitative structural change.
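These phase definitions can be made operational. Below is a minimal sketch (hypothetical function names and thresholds, not from the original papers) that labels logged accuracy curves with the three phases and extracts the grokking time, defined here as the first logged step at which validation accuracy reaches a threshold:

```python
def label_phases(train_acc, val_acc, fit_thresh=0.99, gen_thresh=0.95):
    """Label each logged step with its grokking phase.

    'fitting'     -- train accuracy has not yet reached fit_thresh (Phase 1)
    'plateau'     -- train fit, validation still below gen_thresh (Phase 2)
    'generalized' -- validation accuracy has reached gen_thresh (Phase 3)
    """
    labels, fitted, generalized = [], False, False
    for tr, va in zip(train_acc, val_acc):
        fitted = fitted or tr >= fit_thresh
        generalized = generalized or va >= gen_thresh
        if generalized:
            labels.append("generalized")
        elif fitted:
            labels.append("plateau")
        else:
            labels.append("fitting")
    return labels


def grok_time(val_acc, steps, gen_thresh=0.95):
    """First logged step where validation accuracy reaches gen_thresh, else None."""
    for step, va in zip(steps, val_acc):
        if va >= gen_thresh:
            return step
    return None
```

On a run matching the description above (train fit early, validation jump much later), the long stretch of `'plateau'` labels between the two events is the signature of grokking.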

Intuition

The network first finds a memorization solution that stores training pairs in its weights. This is not a literal lookup table, but a distributed representation that achieves the same input-output mapping via high-dimensional geometry. This solution has high weight norm because it stores $n$ unrelated mappings. With weight decay applying constant pressure toward smaller weights, the network is slowly pushed toward lower-norm solutions. Eventually, it crosses a threshold where a compact, generalizing algorithm (e.g., the modular arithmetic circuit) becomes lower-loss than the memorization solution. The transition is sudden because the two solution types are structurally different. There is no smooth interpolation between a distributed memorization and a compact algorithm.

Why It Matters

Grokking shows that optimization time can change what a network knows, not just how well it fits. This has implications for: training schedule design (early stopping may prevent generalization), regularization theory (weight decay is not just preventing overfitting, it is actively shaping the solution landscape), and mechanistic interpretability (you can study how networks transition from memorized to algorithmic representations).

Failure Mode

Grokking is easiest to demonstrate on small, highly structured tasks (modular arithmetic with $\leq 100$ elements). It is unclear how much this phenomenon extends to large-scale, naturalistic training. Some evidence suggests analogous phenomena in larger models, but the clean phase-transition signature becomes less sharp. Do not assume all training runs will exhibit grokking if you just train long enough.

What Drives the Transition

Proposition

Weight Decay as a Sufficient Driver of Grokking

Statement

In the Power et al. (2022) setting, adding $\ell_2$ weight decay is a sufficient mechanism to produce grokking. With weight decay, the effective loss is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \|w\|^2$$

After the memorization phase ($\mathcal{L}_{\text{data}} \approx 0$), the dominant gradient signal comes from the weight decay term, which pushes the weights toward the origin. This continuous pressure compresses the representation until the network transitions to a lower-norm solution that generalizes.

Empirical observation (not a theorem): Power et al. (2022) report that the grokking time $T_{\text{grok}}$ scales roughly as $T_{\text{grok}} \propto 1/\lambda$ for small $\lambda$ in this setting. This is empirical; no closed-form derivation exists. Alternative framings make the scaling a consequence of other quantities: Liu et al. (2023, "Omnigrok") cast grokking as governed by the ratio of weight norm to a critical value; Varma et al. (2023) cast it as a circuit efficiency ratio between memorizing and generalizing circuits.
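The $1/\lambda$ trend has a simple caricature. Once the data loss is near zero, each plain gradient step multiplies the weights by roughly $(1 - \eta\lambda)$, so shrinking the norm by any fixed factor takes about $\ln(1/\text{ratio})/(\eta\lambda)$ steps, i.e. time $\propto 1/\lambda$. A toy sketch with hypothetical constants (this models only the compression clock, not the basin switch itself):

```python
def steps_to_shrink(lam, lr=1e-3, ratio=0.1):
    """Steps of pure weight decay, w <- (1 - lr*lam) * w, needed to shrink
    the weight norm to `ratio` of its initial value.
    Closed form: about ln(1/ratio) / (lr * lam) steps, so time ~ 1/lam."""
    w, steps = 1.0, 0
    while w > ratio:
        w *= 1.0 - lr * lam
        steps += 1
    return steps


# Halving the weight decay coefficient roughly doubles the compression time.
t_strong = steps_to_shrink(lam=1.0)
t_weak = steps_to_shrink(lam=0.5)
```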

Weight decay is sufficient, not necessary. Later work shows grokking can arise from other sources of compression or from a lazy-to-rich transition without explicit $\ell_2$ weight decay. See the Failure Mode and the "When Grokking Happens" table for scope.

Intuition

Weight decay is doing implicit architecture search. After memorization, the loss landscape near the memorizing solution looks flat (training loss is zero regardless of small weight changes). But the weight decay gradient keeps pulling the weights inward. Over time, this compression forces the network out of the memorization basin and into a generalization basin. Stronger weight decay means faster compression and faster grokking. Too much weight decay prevents memorization in the first place.

The deeper story is that some implicit or explicit force toward lower-complexity representations is what drives the transition. $\ell_2$ weight decay is the cleanest and most studied version, but it is not the only one.

Why It Matters

This connects grokking to the broader theory of implicit bias. Gradient descent has specific biases: max-margin for logistic loss on separable data (Soudry et al. 2018), min-norm for squared loss. On algorithmic tasks with weight decay (Power 2022, Nanda 2023), the min-norm interpolator correlates with the generalizing algorithm rather than the memorizing solution. This correlation is task-dependent: for natural data, the min-norm interpolator is generally NOT the generalizing algorithm. Grokking on modular arithmetic is a concrete case where the implicit bias aligns with the task's compositional structure.
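The min-norm bias for squared loss can be checked directly in a toy setting (hypothetical sizes and seed): on an underdetermined linear system, gradient descent from zero initialization converges to the minimum-norm interpolator given by the pseudoinverse, because the iterates never leave the row space of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))  # 5 equations, 20 unknowns: underdetermined
y = rng.standard_normal(5)

# Gradient descent on squared loss, starting from zero initialization.
w = np.zeros(20)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolator, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y
assert np.allclose(w, w_min_norm, atol=1e-6)
```

Among the infinitely many interpolators of this system, gradient descent picks out exactly the lowest-norm one; grokking on modular arithmetic is the case where that kind of bias happens to coincide with the generalizing algorithm.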

Failure Mode

Necessity of weight decay is an overclaim. Kumar, Bordelon, Gershman, Pehlevan ("Grokking as the Transition from Lazy to Rich Training Dynamics," arXiv 2310.06110, 2023) show grokking in two-layer MLPs without explicit $\ell_2$ weight decay, driven instead by a transition from the lazy (NTK-like) to rich (feature-learning) regime. Other work demonstrates grokking induced by non-$\ell_2$ regularizers (e.g., sparsity penalties, gradient noise) and argues that in some overparameterized settings grokking can appear without any explicit regularization at all, with the implicit bias of SGD providing the compression.

Separately, the $1/\lambda$ scaling is approximate and task-dependent. For some tasks, even strong weight decay does not produce grokking if the generalizing solution requires large weights in specific directions.

Mechanistic View

Recent work (Nanda et al., 2023, arXiv 2301.05217) has opened up the internal mechanics of grokking on modular addition. This work builds on the same mechanistic interpretability techniques used to identify induction heads in transformers. Before the transition, the network uses a memorization circuit stored in weights. After the transition, the network implements a sparse trigonometric circuit: for each frequency $\omega$ in a small learned set of key frequencies (typically $\sim 5$ to $6$ out of $d_{\text{model}}$), the network computes $\cos(\omega(a+b))$ and $\sin(\omega(a+b))$ via the identity $\cos(A+B) = \cos(A)\cos(B) - \sin(A)\sin(B)$, combining $[\cos(\omega a), \sin(\omega a)]$ and $[\cos(\omega b), \sin(\omega b)]$. The representation is sparse: only a handful of frequencies are used, not a general DFT.
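The trigonometric circuit can be replayed outside any network. The sketch below uses a hypothetical set of key frequencies (not the ones any particular trained model learns) and shows that the angle-addition identity plus a cosine readout over candidate answers computes $(a+b) \bmod p$ exactly:

```python
import numpy as np

p = 97
freqs = np.array([3, 8, 21, 36, 44])  # hypothetical key frequencies

def mod_add_via_trig(a, b):
    """Compute (a + b) mod p the way the generalizing circuit does:
    build cos/sin of w*(a+b) from the angle-addition identity, then score
    every candidate c by cos(w*(a+b-c)) summed over the key frequencies."""
    w = 2 * np.pi * freqs / p
    # cos(A+B) = cos A cos B - sin A sin B; sin(A+B) = sin A cos B + cos A sin B
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    c = np.arange(p)
    # logits[c] = sum_w cos(w*(a+b) - w*c); peaks exactly at c = (a+b) mod p
    # because p is prime, so freq*(a+b-c) = 0 (mod p) forces a+b-c = 0 (mod p).
    logits = (cos_ab[None, :] * np.cos(np.outer(c, w))
              + sin_ab[None, :] * np.sin(np.outer(c, w))).sum(axis=1)
    return int(np.argmax(logits))

assert all(mod_add_via_trig(a, b) == (a + b) % p
           for a in range(0, p, 7) for b in range(0, p, 11))
```

In the trained networks, roughly speaking, the embedding supplies the per-token cos/sin features, the intermediate layers compute the products, and the unembedding performs the readout over candidates $c$.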

Key observations:

  • The transition involves a competition between two circuits that coexist during the plateau phase
  • Weight decay gradually weakens the memorization circuit while the generalization circuit grows
  • Varma et al. (2023) frame the competition as circuit efficiency under a weight budget: the memorization circuit has high weight norm because it stores $n$ input-output pairs; the generalization circuit has lower norm because it is a compact algorithm. Weight decay favors the lower-norm circuit. Once the generalization circuit is accurate on training inputs, memorization outputs become interference rather than helpful signal. This is efficiency competition under a weight budget, not neuronal-style mutual inhibition.

This is one of the clearest examples of mechanistic interpretability revealing how learned algorithms emerge.

When Grokking Happens and When It Does Not

| Condition | Grokking likely? | Why |
| --- | --- | --- |
| Small structured dataset, weight decay | Yes | Classic setting from Power et al. (2022) |
| Large naturalistic dataset, weight decay | Sometimes | Harder to detect; may happen but be masked by noise |
| Two-layer MLP, no weight decay (Kumar 2024 setting) | Sometimes | Lazy-to-rich transition can produce grokking without $\ell_2$ decay |
| No weight decay, no other compression pressure | Rare | The default assumption; easiest to break if another regularizer (sparsity, gradient noise, dropout, small-init implicit bias) is present |
| Dataset too small for memorization | No | Model cannot memorize, so Phase 1 does not occur |
| Very high weight decay | No | Prevents memorization entirely; model underfits from the start |
| Dropout instead of weight decay | Unclear | Published grokking results primarily use AdamW; dropout-only effects on grokking are underexplored |

Common Confusions

Watch Out

Grokking is not just late convergence

In standard training, validation loss may improve slowly after training loss converges. This is smooth, gradual improvement. Grokking is specifically the sudden phase transition after a long plateau. The sharpness of the transition is the distinguishing feature. If validation accuracy improves smoothly, that is normal training dynamics, not grokking.
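One crude way to quantify the distinction (an illustrative heuristic, not a standard metric): compare the largest single-interval gain in validation accuracy to the total gain over the run.

```python
def transition_sharpness(val_acc):
    """Fraction of the total validation-accuracy gain that occurs in the
    single largest jump between consecutive logged points.
    Near 1 -> one sudden jump (grokking-like); small -> smooth improvement."""
    gains = [b - a for a, b in zip(val_acc, val_acc[1:])]
    total = val_acc[-1] - val_acc[0]
    return max(gains) / total if total > 0 else 0.0

smooth = [0.1, 0.3, 0.5, 0.7, 0.9]        # steady improvement -> 0.25
grokked = [0.02, 0.02, 0.03, 0.03, 0.95]  # plateau then jump -> ~0.99
```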

Watch Out

Grokking is not evidence that you should always train longer

The conditions for grokking are specific: overparameterized model, structured task, appropriate regularization. On most practical tasks, training past convergence with standard hyperparameters does not produce grokking. It produces memorization. The lesson is that regularization can change the qualitative outcome of training, not that more epochs are universally better.

Watch Out

Grokking does not mean weight decay always helps

Weight decay enables grokking in specific settings, but it can also hurt. If the optimal solution requires large weights (e.g., sparse features with high activation), strong weight decay may prevent the network from ever finding it. The grokking story is about the interaction between weight decay, model capacity, and task structure.

Watch Out

Weight decay is sufficient, not necessary

A common reading of Power et al. (2022) is "no weight decay, no grokking." That overstates what the paper shows. Power et al. demonstrate that in their modular-arithmetic setting $\ell_2$ weight decay reliably produces grokking; they do not prove it is the only route. Kumar et al. (2024) produce grokking in two-layer MLPs without explicit weight decay by controlling the lazy-to-rich transition directly; other papers induce grokking via non-$\ell_2$ regularizers (sparsity, gradient noise) or via the implicit bias of small initialization. The mechanistic picture is better stated as "some form of compression pressure drives the transition," with weight decay as the best-studied example.

Exercises

ExerciseCore

Problem

Define the grokking time $T_{\text{grok}}$ as the step at which validation accuracy first exceeds 95% (the operational definition used in Power et al. 2022). You train a small transformer on modular addition mod 97 with 50% of the data as training data. Training accuracy reaches 100% at step 1,000. At step 10,000, validation accuracy is still 30%. At step 50,000, validation accuracy jumps to 95%. Sketch the expected train loss, train accuracy, and validation accuracy curves. What is $T_{\text{grok}}$?

ExerciseAdvanced

Problem

Using the operational definition above (grokking time $T_{\text{grok}}$ = step at which validation accuracy first exceeds 95%), give a heuristic argument for why $T_{\text{grok}}$ should scale approximately as $1/\lambda$ with weight decay coefficient $\lambda$ in the Power et al. (2022) setting. What determines the proportionality constant? (This is an empirical scaling, not a theorem.)

References

Canonical:

  • Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (ICLR 2022 MATH-AI workshop, arXiv 2201.02177). The original discovery, in the $\ell_2$ weight-decay setting.
  • Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023). Internal mechanics of the transition on modular addition.

Extensions and reframings:

  • Liu et al., "Omnigrok: Grokking Beyond Algorithmic Data" (ICLR 2023). Extensions to broader settings; argues grokking is controlled by the ratio of weight norm to a critical value, not weight decay per se.
  • Kumar, Bordelon, Gershman, Pehlevan, "Grokking as the Transition from Lazy to Rich Training Dynamics" (arXiv 2310.06110, 2023). Produces grokking in two-layer MLPs without explicit weight decay, via the lazy (NTK-like) to rich (feature-learning) transition. Direct counterexample to the "weight decay is necessary" reading.
  • Thilak et al., "The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon" (2022). Adaptive-optimizer route to grokking.
  • Varma et al., "Explaining grokking through circuit efficiency" (2023). Frames the transition as a competition between memorization and generalization circuits with different efficiencies.

Context (not grokking-specific, but relevant for implicit bias):

  • Chizat & Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018) and "Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss" (COLT 2020). Lazy vs. rich regime foundations.
  • Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data" (JMLR 2018). Max-margin bias of GD without explicit regularization.

Next Topics

  • Double descent: another violation of classical learning theory; models improve past the interpolation threshold
  • Open problems in ML theory: grokking connects to the broader question of why overparameterized networks generalize

Last reviewed: April 2026
