Modern Generalization
Grokking
Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.
Why This Matters
The 1990s-2010s picture of training said: as you train longer, the model first fits the training data, then starts overfitting. Validation performance degrades. You should stop early. Modern practice (post-Nakkiran et al. 2020) routinely trains to convergence. Grokking is one of several results (alongside double descent and benign overfitting) that invalidated the universal early-stopping story, not a unique violation of it.
Grokking violates this. In certain settings, a model reaches perfect training accuracy quickly, then shows no improvement on validation for many more epochs. Then, long after you would have stopped training, validation accuracy suddenly jumps to near-perfect. The model goes from memorizing to generalizing, and the transition is abrupt.
This matters because it shows that overfitting is not necessarily terminal. Continued training with the right regularization can cause a qualitative change in what the model has learned. The practical implication: your training schedule, weight initialization, and regularization choices can determine whether generalization happens at all.
The Phenomenon
Grokking: Delayed Generalization
Statement
On structured algorithmic tasks (modular arithmetic, permutation composition, polynomial evaluation), neural networks trained with standard gradient-based optimizers (AdamW in the original experiments) exhibit the following training dynamics:
- Phase 1 (memorization): Training accuracy reaches 100% quickly, typically within the first few thousand steps in the Power et al. (2022) modular arithmetic settings. Validation accuracy remains near chance.
- Phase 2 (plateau): Training loss stays near zero. Validation accuracy stays near chance. This phase can last from thousands to over a million additional steps, strongly dependent on train fraction, weight decay, and optimizer.
- Phase 3 (generalization): Validation accuracy rises sharply to near 100% over a relatively short window.
The transition from Phase 2 to Phase 3 is a phase transition: the model's internal representation undergoes a qualitative structural change.
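The three phases can be read off logged accuracy curves. A minimal sketch of the operational measurement (the threshold, logging interval, and synthetic curves are illustrative assumptions, not data from the papers):

```python
import numpy as np

def grok_time(val_acc, steps, threshold=0.9):
    """Return the first logged step at which validation accuracy exceeds
    `threshold`, or None if it never does (an operational definition of
    the grokking time t_grok)."""
    above = np.flatnonzero(np.asarray(val_acc) > threshold)
    return None if above.size == 0 else steps[above[0]]

# Synthetic curves mimicking the three phases: fast memorization,
# long plateau at chance-level validation accuracy, then a sharp jump.
steps = np.arange(0, 60_000, 1_000)
train_acc = np.where(steps >= 1_000, 1.0, steps / 1_000)   # Phase 1
val_acc = np.where(steps >= 50_000, 0.95, 0.30)            # Phases 2-3

print(grok_time(val_acc, steps))  # 50000
```

The sharpness of the Phase 2 to Phase 3 jump is what makes a single threshold crossing a meaningful summary statistic; for a smoothly improving curve the measured time would depend heavily on the threshold chosen.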
Intuition
The network first finds a memorization solution that stores training pairs in its weights. This is not a literal lookup table, but a distributed representation that achieves the same input-output mapping via high-dimensional geometry. This solution has high weight norm because it stores unrelated mappings. With weight decay applying constant pressure toward smaller weights, the network is slowly pushed toward lower-norm solutions. Eventually, it crosses a threshold where a compact, generalizing algorithm (e.g., the modular arithmetic circuit) becomes lower-loss than the memorization solution. The transition is sudden because the two solution types are structurally different. There is no smooth interpolation between a distributed memorization and a compact algorithm.
Why It Matters
Grokking shows that optimization time can change what a network knows, not just how well it fits. This has implications for: training schedule design (early stopping may prevent generalization), regularization theory (weight decay is not just preventing overfitting, it is actively shaping the solution landscape), and mechanistic interpretability (you can study how networks transition from memorized to algorithmic representations).
Failure Mode
Grokking is easiest to demonstrate on small, highly structured tasks (modular arithmetic with on the order of 100 elements, e.g., addition mod 97). It is unclear how much this phenomenon extends to large-scale, naturalistic training. Some evidence suggests analogous phenomena in larger models, but the clean phase-transition signature becomes less sharp. Do not assume all training runs will exhibit grokking if you just train long enough.
What Drives the Transition
Weight Decay as a Sufficient Driver of Grokking
Statement
In the Power et al. (2022) setting, adding weight decay is a sufficient mechanism to produce grokking. With weight decay coefficient λ, the effective loss is:

L_eff(θ) = L_train(θ) + (λ/2) ‖θ‖²
After the memorization phase (L_train ≈ 0), the dominant gradient signal comes from the weight decay term, which pushes the weights toward the origin. This continuous pressure compresses the representation until the network transitions to a lower-norm solution that generalizes.
Empirical observation (not a theorem): Power et al. (2022) report that the grokking time scales roughly as 1/λ for small λ in this setting. This is empirical; no closed-form derivation exists. Alternative framings make the scaling a consequence of other quantities: Liu et al. (2023, "Omnigrok") cast grokking as governed by the ratio of weight norm to a critical value; Varma et al. (2023) cast it as a circuit efficiency ratio between memorizing and generalizing circuits.
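The 1/λ heuristic can be seen in a toy model. If, during the plateau, the dominant update is the weight-decay term, the weight norm shrinks geometrically, so the number of steps needed to reach any fixed critical norm scales as 1/λ. A minimal simulation (the initial norm, critical norm, and learning rate are illustrative assumptions):

```python
import numpy as np

def steps_to_critical_norm(lam, norm0=10.0, norm_crit=1.0, lr=0.01):
    """Simulate pure weight-decay updates w <- w - lr * lam * w and
    count steps until the weight norm drops below `norm_crit`."""
    norm, t = norm0, 0
    while norm > norm_crit:
        norm *= (1 - lr * lam)
        t += 1
    return t

t_small = steps_to_critical_norm(lam=0.1)
t_large = steps_to_critical_norm(lam=1.0)
# 10x larger weight decay -> roughly 10x fewer steps to the critical norm
print(t_small, t_large, t_small / t_large)
```

This is only the decay-dominated caricature; in a real run the task gradient interacts with the decay term, which is why the observed scaling is approximate and task-dependent.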
Weight decay is sufficient, not necessary. Later work shows grokking can arise from other sources of compression or from a lazy-to-rich transition without explicit weight decay. See the FailureMode and the "When Grokking Happens" table for scope.
Intuition
Weight decay is doing implicit architecture search. After memorization, the loss landscape near the memorizing solution looks flat (training loss is zero regardless of small weight changes). But the weight decay gradient keeps pulling the weights inward. Over time, this compression forces the network out of the memorization basin and into a generalization basin. Stronger weight decay means faster compression and faster grokking. Too much weight decay prevents memorization in the first place.
The deeper story is that some implicit or explicit force toward lower-complexity representations is what drives the transition. L2 weight decay is the cleanest and most studied version, but it is not the only one.
Why It Matters
This connects grokking to the broader theory of implicit bias. Gradient descent has specific biases: max-margin for logistic loss on separable data (Soudry et al. 2018), min-norm for squared loss. On algorithmic tasks with weight decay (Power 2022, Nanda 2023), the min-norm interpolator correlates with the generalizing algorithm rather than the memorizing solution. This correlation is task-dependent: for natural data, the min-norm interpolator is generally NOT the generalizing algorithm. Grokking on modular arithmetic is a concrete case where the implicit bias aligns with the task's compositional structure.
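The min-norm bias for squared loss can be made concrete with an underdetermined linear system: among infinitely many interpolating solutions, the pseudoinverse picks the one with smallest L2 norm, which is what gradient descent from zero initialization converges to. A numpy sketch (the random data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 examples, 20 parameters: underdetermined
y = rng.normal(size=5)

# Minimum-norm interpolator (the implicit-bias solution for squared loss
# under gradient descent from zero initialization).
w_min = np.linalg.pinv(X) @ y

# Any other interpolator: add a component from the null space of X,
# here the projection of the first basis vector onto that null space.
e1 = np.eye(20)[0]
null_dir = e1 - np.linalg.pinv(X) @ (X @ e1)
w_other = w_min + null_dir

# Both fit the training data exactly...
assert np.allclose(X @ w_min, y) and np.allclose(X @ w_other, y)
# ...but the min-norm solution has strictly smaller norm.
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))  # True
```

The grokking-specific claim is then about the task, not the optimizer: on modular arithmetic with weight decay, the low-norm interpolator happens to be the generalizing algorithm, whereas on natural data it generally is not.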
Failure Mode
Necessity of weight decay is an overclaim. Kumar, Bordelon, Gershman, Pehlevan ("Grokking as the Transition from Lazy to Rich Training Dynamics," arXiv 2310.06110, 2023) show grokking in two-layer MLPs without explicit weight decay, driven instead by a transition from the lazy (NTK-like) to rich (feature-learning) regime. Other work demonstrates grokking induced by non-L2 regularizers (e.g., sparsity penalties, gradient noise) and argues that in some overparameterized settings grokking can appear without any explicit regularization at all, with the implicit bias of SGD providing the compression.
Separately, the 1/λ scaling is approximate and task-dependent. For some tasks, even strong weight decay does not produce grokking if the generalizing solution requires large weights in specific directions.
Mechanistic View
Recent work (Nanda et al., 2023, arXiv 2301.05217) has opened up the internal mechanics of grokking on modular addition. This work builds on the same mechanistic interpretability techniques used to identify induction heads in transformers. Before the transition, the network uses a memorization circuit stored in its weights. After the transition, the network implements a sparse trigonometric circuit: for each frequency ω_k = 2πk/p in a small learned set of key frequencies (typically 5-6 out of the roughly p/2 available), the network computes cos(ω_k(a+b)) and sin(ω_k(a+b)) via the angle-addition identity cos(ω_k(a+b)) = cos(ω_k a)cos(ω_k b) − sin(ω_k a)sin(ω_k b) (and its sin analogue), combining the cos(ω_k a), sin(ω_k a), cos(ω_k b), and sin(ω_k b) components computed at the embedding. The representation is sparse: only a handful of frequencies are used, not a general DFT.
Key observations:
- The transition involves a competition between two circuits that coexist during the plateau phase
- Weight decay gradually weakens the memorization circuit while the generalization circuit grows
- Varma et al. (2023) frame the competition as circuit efficiency under a weight budget: the memorization circuit has high weight norm because it stores input-output pairs; the generalization circuit has lower norm because it is a compact algorithm. Weight decay favors the lower-norm circuit. Once the generalization circuit is accurate on training inputs, memorization outputs become interference rather than helpful signal. This is efficiency competition under a weight budget, not neuronal-style mutual inhibition.
This is one of the clearest examples of mechanistic interpretability revealing how learned algorithms emerge.
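The generalizing circuit's output rule can be checked numerically: if the logit for candidate answer c is a sum of cos(ω_k(a+b−c)) over a few key frequencies ω_k = 2πk/p, every term equals 1 exactly when c ≡ a+b (mod p), so the argmax recovers modular addition. A numpy sketch (the particular frequency set is an illustrative assumption, not the frequencies a trained network would learn):

```python
import numpy as np

p = 97
key_freqs = np.array([3, 14, 41])     # a few key frequencies (illustrative)
omegas = 2 * np.pi * key_freqs / p

def logits(a, b):
    """Logit for each candidate answer c: sum_k cos(omega_k * (a + b - c)).
    Every cosine equals 1 exactly when c = (a + b) mod p, so the argmax
    implements modular addition."""
    c = np.arange(p)
    return np.cos(np.outer(a + b - c, omegas)).sum(axis=1)

a, b = 35, 81
pred = int(np.argmax(logits(a, b)))
print(pred, (a + b) % p)  # both 19
```

Because p is prime, any nonzero residue a+b−c leaves at least one cosine strictly below 1, so the peak at the correct answer is unique; this is why a handful of frequencies suffices and a full DFT is unnecessary.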
When Grokking Happens and When It Does Not
| Condition | Grokking likely? | Why |
|---|---|---|
| Small structured dataset, weight decay | Yes | Classic setting from Power et al. (2022) |
| Large naturalistic dataset, weight decay | Sometimes | Harder to detect; may happen but be masked by noise |
| Two-layer MLP, no weight decay (Kumar et al. 2023 setting) | Sometimes | Lazy-to-rich transition can produce grokking without decay |
| No weight decay, no other compression pressure | Rare | The default assumption; easiest to break if another regularizer (sparsity, gradient noise, dropout, small-init implicit bias) is present |
| Model too small to memorize the training set | No | Model cannot memorize, so the "Phase 1" does not occur |
| Very high weight decay | No | Prevents memorization entirely; model underfits from the start |
| Dropout instead of weight decay | Unclear | Published grokking results primarily use AdamW; dropout-only effects on grokking are underexplored |
Common Confusions
Grokking is not just late convergence
In standard training, validation loss may improve slowly after training loss converges. This is smooth, gradual improvement. Grokking is specifically the sudden phase transition after a long plateau. The sharpness of the transition is the distinguishing feature. If validation accuracy improves smoothly, that is normal training dynamics, not grokking.
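One way to make "sudden" operational: compare the largest single-window jump in validation accuracy against the total improvement. A grokking-style curve concentrates most of its improvement in one short window; normal late convergence spreads it out. A minimal sketch (the window size and synthetic curves are illustrative assumptions):

```python
import numpy as np

def max_jump_fraction(val_acc, window=5):
    """Fraction of the total validation-accuracy improvement that occurs
    within the single best window of `window` consecutive logging steps."""
    acc = np.asarray(val_acc, dtype=float)
    total = acc.max() - acc[0]
    jumps = acc[window:] - acc[:-window]
    return jumps.max() / total

# Sudden transition after a long plateau (grokking-like)...
grok = [0.3] * 50 + [0.95] * 10
# ...versus slow, steady improvement (normal late convergence).
smooth = list(np.linspace(0.3, 0.95, 60))

print(max_jump_fraction(grok), max_jump_fraction(smooth))  # ~1.0 vs ~0.08
```

A statistic like this is only a heuristic (it is sensitive to logging frequency and noise), but it captures the qualitative distinction: the grokking curve scores near 1, the smooth curve near window/num_steps.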
Grokking is not evidence that you should always train longer
The conditions for grokking are specific: overparameterized model, structured task, appropriate regularization. On most practical tasks, training past convergence with standard hyperparameters does not produce grokking. It produces memorization. The lesson is that regularization can change the qualitative outcome of training, not that more epochs are universally better.
Grokking does not mean weight decay always helps
Weight decay enables grokking in specific settings, but it can also hurt. If the optimal solution requires large weights (e.g., sparse features with high activation), strong weight decay may prevent the network from ever finding it. The grokking story is about the interaction between weight decay, model capacity, and task structure.
Weight decay is sufficient, not necessary
A common reading of Power et al. (2022) is "no weight decay, no grokking." That overstates what the paper shows. Power et al. demonstrate that in their modular-arithmetic setting weight decay reliably produces grokking; they do not prove it is the only route. Kumar et al. (2023) produce grokking in two-layer MLPs without explicit weight decay by controlling the lazy-to-rich transition directly; other papers induce grokking via non-L2 regularizers (sparsity, gradient noise) or via the implicit bias of small initialization. The mechanistic picture is better stated as "some form of compression pressure drives the transition," with weight decay as the best-studied example.
Exercises
Problem
Define the grokking time t_grok as the step at which validation accuracy first exceeds 90% (an operational threshold in the style of Power et al. 2022). You train a small transformer on modular addition mod 97 with 50% of the data as training data. Training accuracy reaches 100% at step 1,000. At step 10,000, validation accuracy is still 30%. At step 50,000, validation accuracy jumps to 95%. Sketch the expected train loss, train accuracy, and validation accuracy curves. What is t_grok?
Problem
Using the operational definition above (grokking time = step at which validation accuracy first exceeds 90%), give a heuristic argument for why t_grok should scale approximately as 1/λ with weight decay coefficient λ in the Power et al. (2022) setting. What determines the proportionality constant? (This is an empirical scaling, not a theorem.)
References
Canonical:
- Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (ICLR 2022 MATH-AI workshop, arXiv 2201.02177). The original discovery, in the weight-decay setting.
- Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023). Internal mechanics of the transition on modular addition.
Extensions and reframings:
- Liu et al., "Omnigrok: Grokking Beyond Algorithmic Data" (ICLR 2023). Extensions to broader settings; argues grokking is controlled by the ratio of weight norm to a critical value, not weight decay per se.
- Kumar, Bordelon, Gershman, Pehlevan, "Grokking as the Transition from Lazy to Rich Training Dynamics" (arXiv 2310.06110, 2023). Produces grokking in two-layer MLPs without explicit weight decay, via the transition from the lazy (NTK-like) to the rich (feature-learning) regime. Direct counterexample to the "weight decay is necessary" reading.
- Thilak et al., "The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon" (2022). Adaptive-optimizer route to grokking.
- Varma et al., "Explaining grokking through circuit efficiency" (2023). Frames the transition as a competition between memorization and generalization circuits with different efficiencies.
Context (not grokking-specific, but relevant for implicit bias):
- Chizat & Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018) and "Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss" (COLT 2020). Lazy vs. rich regime foundations.
- Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data" (JMLR 2018). Max-margin bias of GD without explicit regularization.
Next Topics
- Double descent: another violation of classical learning theory; models improve past the interpolation threshold
- Open problems in ML theory: grokking connects to the broader question of why overparameterized networks generalize
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Regularization Theory (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Common Probability Distributions (Layer 0A)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Stochastic Gradient Descent Convergence (Layer 2)
- Gradient Descent Variants (Layer 1)
- Implicit Bias and Modern Generalization (Layer 4)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- VC Dimension (Layer 2)
- Rademacher Complexity (Layer 3)