Modern Generalization
Open Problems in ML Theory
A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.
Why This Matters
Machine learning works spectacularly well. We do not understand why.
This is not a rhetorical statement. The central results of classical learning theory (VC dimension, Rademacher complexity, PAC-Bayes) predict that overparameterized neural networks should overfit catastrophically. They do not. This gap between theory and practice is the defining intellectual challenge of modern ML theory.
This topic is a map of the frontier. Each open problem is presented with what is known, what is not, and where the field is heading. If you want to do research in ML theory, one of these problems is likely where you will start.
Mental Model
Classical learning theory says: complexity of the hypothesis class determines generalization. More parameters means worse generalization, unless you have proportionally more data. Modern deep learning violates this: models with millions or billions of parameters generalize well with relatively little data.
Something is controlling generalization that our current theory does not capture. Finding that something is the meta-problem that unifies all the open problems below.
Problem 1: Why Do Overparameterized Networks Generalize?
Rethinking Generalization (Zhang et al. 2017)
Statement
Standard neural network architectures (CNNs, ResNets) can fit a training set of images with completely random labels, achieving zero training error. The same architectures, with the same hyperparameters, generalize well on real labels.
This means: any uniform convergence bound B(H, n) that depends only on the hypothesis class H and the sample size n must be vacuous (close to 1), because H contains both the good real-label function and the random-label function.
Intuition
The network architecture can represent garbage (random labels) just as easily as signal (real labels). So the architecture alone cannot explain generalization. Something about the training procedure (SGD, its initialization, the learning rate schedule) must be selecting good solutions from the vast space of interpolating solutions. This is the implicit bias of SGD.
Why It Matters
This result forced the field to abandon hypothesis-class-based generalization theory for deep learning and look instead at algorithm-dependent explanations. Every open problem on this page traces back to this observation.
Failure Mode
This is an experimental observation, not a mathematical theorem. It tells you what the right explanation is not (hypothesis class complexity alone), not what it is.
What is known. The Zhang et al. (2017) random labels experiment showed that neural networks can memorize random labels (zero training error on noise) while also generalizing well on real data. This means the hypothesis class itself (all functions representable by the network) is too rich. Generalization must come from the training algorithm, not the architecture alone.
What is partially understood. Several partial explanations exist:
- Implicit bias of gradient descent: for linear models, gradient descent converges to the minimum-norm solution. For deep networks, gradient descent seems to prefer "simple" functions, but formalizing "simple" is hard.
- Double descent: test error first decreases, then increases (classical), then decreases again as model size grows past the interpolation threshold. This is well documented empirically but the mechanism is not fully understood for deep networks.
- Benign overfitting: in some settings, interpolating the training data (including noise) is compatible with good generalization. Proven for linear regression and some kernel methods, but not for deep networks in generality.
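The implicit-bias claim for linear models can be checked directly: gradient descent from zero initialization on an underdetermined least-squares problem converges to the minimum-norm interpolator. A minimal sketch on synthetic data (dimensions and step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                       # fewer samples than parameters:
X = rng.normal(size=(n, d))         # infinitely many interpolating solutions
y = rng.normal(size=n)

w = np.zeros(d)                     # zero initialization matters here
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y  # the minimum-norm interpolator

print(np.allclose(X @ w, y))        # both solutions interpolate: True
print(np.allclose(w, w_min_norm))   # GD found the minimum-norm one: True
```

The geometric reason: starting from zero, every gradient lies in the row space of X, so the iterates stay in that subspace, whose unique interpolating point is the pseudoinverse solution.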
What is not known. There is no generalization bound for deep neural networks trained with SGD on realistic data that is both (a) non-vacuous and (b) does not require assumptions that are impossible to verify in practice. PAC-Bayes bounds come closest, but they require careful construction of a posterior distribution and often give loose numerical values.
Who is working on it. Essentially everyone in ML theory. Key groups include those at Princeton (Arora, Ma), MIT (Moitra, Bresler), Stanford (Liang, Zou), Hebrew University (Shalev-Shwartz), and many others.
Problem 2: What Is the Right Complexity Measure?
What is known. Classical measures fail for deep learning:
- Parameter count: networks with far more parameters than training examples generalize well, so raw parameter count is not the right measure
- VC dimension: scales with parameter count, gives vacuous bounds
- Rademacher complexity: same problem for overparameterized networks
- PAC-Bayes: gives non-vacuous bounds in some cases, but requires choosing a prior and posterior carefully
Candidate measures that partially work:
- Norm-based bounds: controlling the product of layer norms (e.g. the product of spectral norms of the weight matrices) gives tighter bounds, but these are still loose and depend on depth in ways that do not match empirical generalization
- Compression-based bounds: if a network can be compressed to a smaller representation without losing accuracy, the compressed size determines generalization. This is theoretically clean but the right notion of compression is debated
- Fisher-Rao norm: a Riemannian geometry approach that accounts for the parameterization of the function, not just the function itself
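As a concrete instance of a norm-based quantity, the product of layer spectral norms (one ingredient in bounds such as Bartlett et al. 2017) can be computed for a toy MLP. This sketch computes only the measure on random weights, not a full generalization bound:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 3-layer MLP weights (input dim 32, hidden widths 64 and 32, scalar output).
layers = [rng.normal(size=(64, 32)) / np.sqrt(32),
          rng.normal(size=(32, 64)) / np.sqrt(64),
          rng.normal(size=(1, 32)) / np.sqrt(32)]

def spectral_norm_product(weights):
    # Largest singular value of each weight matrix, multiplied together.
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0]
                          for W in weights]))

print(spectral_norm_product(layers))
```

The measure is cheap to evaluate, which is why it is popular; the open question is whether any such quantity tracks generalization across settings.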
What is not known. No single complexity measure has been shown to (a) predict generalization across architectures, datasets, and training procedures, and (b) have a rigorous theoretical justification. The field suspects the answer involves both the model and the training algorithm jointly, not the model alone.
Who is working on it. Neyshabur, Bartlett, Dziugaite, Roy, Arora, among others. The Predicting Generalization in Deep Learning (PGDL) competition benchmarked many proposed measures and found no clear winner.
Problem 3: Feature Learning Beyond NTK
What is known. The Neural Tangent Kernel (NTK) regime provides a mathematically tractable theory of neural network training: in the infinite-width limit, network training is equivalent to kernel regression with a fixed kernel (the NTK). This gives convergence guarantees and generalization bounds.
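At finite width, the empirical NTK is just the Gram matrix of parameter gradients, K(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩, evaluated here at random initialization. A minimal sketch for a one-hidden-layer ReLU network with scalar inputs, using finite differences for the gradients (the theorem itself concerns the infinite-width limit; this only shows what the object is):

```python
import numpy as np

rng = np.random.default_rng(0)
width = 256

# f(x) = a . relu(w * x) / sqrt(width) for scalar input x.
theta = rng.normal(size=2 * width)

def f(theta, x):
    w, a = theta[:width], theta[width:]
    return a @ np.maximum(w * x, 0.0) / np.sqrt(width)

def grad_f(theta, x, eps=1e-5):
    # Central finite differences for d f / d theta (slow but dependency-free).
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        t_hi, t_lo = theta.copy(), theta.copy()
        t_hi[i] += eps
        t_lo[i] -= eps
        g[i] = (f(t_hi, x) - f(t_lo, x)) / (2 * eps)
    return g

def ntk(x1, x2):
    # Empirical NTK at the current parameters: K = <grad f(x1), grad f(x2)>.
    return grad_f(theta, x1) @ grad_f(theta, x2)

print(ntk(1.0, 2.0), ntk(1.0, 1.0))  # symmetric kernel, positive diagonal
```

In the NTK regime this kernel stays (approximately) fixed during training, so training reduces to kernel regression with it.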
The problem with NTK. The NTK regime describes networks that do not learn features: the kernel is fixed at initialization. But feature learning, the ability of the network to learn useful intermediate representations, is widely believed to be the key advantage of deep learning over kernel methods.
Partial progress:
- Mean-field theory: treats neurons as particles in a probability distribution and studies the evolution of this distribution during training. Allows feature learning but is mathematically harder than NTK.
- Tensor programs (Yang, 2020-2024): a framework that tracks how random tensors evolve during training, unifying NTK, mean-field, and other limits
- Rich regime vs lazy regime: networks can be in a regime where features change substantially (rich) or barely change (lazy/NTK). The rich regime is where feature learning happens, but it is much harder to analyze
What is not known. There is no rigorous theory of what features deep networks learn on natural data, why certain architectures learn better features than others, or how feature quality relates to generalization. The fundamental question: what makes a learned feature "good"?
Who is working on it. Yang (tensor programs), Mei and Montanari (mean field), Bach and Chizat (lazy vs rich regimes), Allen-Zhu and Li (feature learning provably helps).
Problem 4: Why Do Scaling Laws Hold?
What is known. Empirical scaling laws (Kaplan et al. 2020, Hoffmann et al. 2022) show that test loss follows a power law in model size, dataset size, and compute:
L(N) = (N_c / N)^α_N, where N is model size, N_c is a fitted constant, and α_N ≈ 0.076 for language models. Similar power laws hold for dataset size and compute budget. These laws hold over many orders of magnitude with remarkable precision.
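Scaling exponents are estimated in practice by linear regression in log-log space. A sketch on synthetic, noiseless losses generated with the model-size exponent reported by Kaplan et al.:

```python
import numpy as np

alpha_true, c = 0.076, 5.0                 # Kaplan et al.'s model-size exponent
N = np.logspace(6, 10, 20)                 # model sizes from 1e6 to 1e10
L = c * N ** (-alpha_true)                 # noiseless synthetic losses

# log L = log c - alpha * log N, so ordinary least squares on the logs
# recovers the exponent as minus the slope.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat = -slope

print(alpha_hat)  # recovers 0.076
```

Real loss curves are noisy and may include an irreducible-loss offset, which makes the fit (and the claim of a clean power law) considerably more delicate than this sketch suggests.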
What is partially understood:
- Power laws in test loss as a function of dataset size are predicted by classical learning theory for certain model classes (bias-variance decomposition gives polynomial rates)
- For simple models (linear regression, kernel regression), scaling exponents can be computed exactly and depend on the spectrum of the data covariance
- Hutter (2021) proposed a toy "learning curve theory" in which power laws in dataset size arise from memorizing discrete features whose frequencies follow a Zipf distribution
What is not known. No theory predicts the specific exponents observed for large language models from first principles. We do not know: (a) why the exponents are what they are, (b) whether the laws will continue to hold at larger scales, (c) what determines the irreducible loss L_∞, (d) why the relationship is so clean (a simple power law rather than something more complex).
Who is working on it. Scaling laws are studied by Kaplan, McCandlish, Henighan (OpenAI), Hoffmann et al. (DeepMind/Chinchilla), Sharma and Kaplan, Bahri et al. (Google), and many academic groups.
Problem 5: Are Emergent Abilities Real?
What is known. Wei et al. (2022) documented "emergent abilities": capabilities that are absent in small models but appear suddenly as models scale past a critical size. Examples included multi-step arithmetic, chain-of-thought reasoning, and certain benchmark tasks that jump from near-zero to high accuracy at a specific model scale.
The controversy. Schaeffer et al. (2023) argued that emergence is a mirage caused by nonlinear evaluation metrics. When you use linear or continuous metrics instead of discontinuous ones (like exact-match accuracy), the improvement is smooth and visible at all scales: there is no sharp phase transition.
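The metric argument can be reproduced in miniature: if per-token accuracy p improves smoothly with scale, exact-match accuracy on a k-token answer is p^k, which looks like a sudden jump on a linear plot. The numbers below are toy values chosen for illustration:

```python
import numpy as np

scale = np.linspace(0.0, 1.0, 11)   # abstract "model scale"
p = 0.5 + 0.5 * scale               # per-token accuracy improves smoothly
k = 20                              # answer length in tokens
exact_match = p ** k                # all k tokens must be correct at once

for s, pt, em in zip(scale, p, exact_match):
    print(f"scale={s:.1f}  per-token={pt:.2f}  exact-match={em:.4f}")
```

Exact match is essentially zero until p is close to 1, then shoots up, even though the underlying quantity improved linearly the whole time.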
What is genuinely unresolved:
- Even if the metric argument explains some cases, do genuinely qualitative capability jumps exist? Can a sufficiently large model do something structurally different from what smaller models do, or is it always gradual improvement?
- What is the right definition of "emergence" for ML systems? Is it a phase transition in a statistical mechanics sense, or something else?
- If emergence is real, can we predict when it will happen? This matters enormously for AI safety: unpredictable capability jumps make it hard to assess risks before deployment.
Who is working on it. Wei et al. (Google), Schaeffer, Miranda, Koyejo (Stanford), Arora et al. (Princeton), and the AI safety community broadly.
Problem 6: Transformer-Specific Theory
What is known. Transformers dominate modern deep learning, but most theoretical results apply to generic neural networks (MLPs, ResNets) or to simplified models. Transformer-specific theory is sparse.
Partial results:
- Expressive power: transformers can simulate Turing machines (given enough layers and precision), but this does not explain what they learn in practice
- In-context learning: transformers can learn to implement learning algorithms (like gradient descent) in their forward pass. Garg et al. (2022) showed this empirically; Akyurek et al. (2023) and Von Oswald et al. (2023) analyzed the mechanism
- Attention patterns: some work characterizes what attention heads learn (induction heads, positional heads), but a complete taxonomy is missing
- Context length generalization: why transformers struggle to generalize to sequence lengths longer than those seen in training, and whether positional encoding design can fix this
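The simplest version of the in-context-learning-as-gradient-descent observation: one GD step on in-context linear-regression examples, starting from w = 0, produces exactly the prediction that a single linear-attention head can compute. A toy sketch, not the full construction of Von Oswald et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))        # in-context examples
y = rng.normal(size=n)             # their labels
x_q = rng.normal(size=d)           # query point
eta = 0.1                          # learning rate / attention scale

# One gradient step on (1/2) sum_i (w . x_i - y_i)^2 starting from w = 0:
w_one_step = eta * X.T @ y
pred_gd = w_one_step @ x_q

# Linear attention: values y_i weighted by unnormalized scores <x_i, x_q>.
pred_attn = eta * (y @ (X @ x_q))

print(np.isclose(pred_gd, pred_attn))  # the two predictions coincide: True
```

The identity holds by rearranging the inner products; the open question is whether trained transformers actually implement something like this, and why.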
What is not known:
- Why does the transformer architecture work so much better than alternatives for language? Is it the attention mechanism, the residual stream, the layer normalization, or the combination?
- What is the right theoretical model of a transformer? Is it a kernel machine, a dynamical system, a program executor, or something new?
- Can we characterize the functions that transformers can learn efficiently (not just represent)?
- Why do transformers exhibit mesa-optimization (learning to implement optimization algorithms internally)?
Who is working on it. Elhage, Olsson et al. (Anthropic, mechanistic interpretability), Garg, Brown (in-context learning), Edelman, Goel, Kakade (expressivity), Jelassi, Li (optimization), and many others.
Problem 7: Theory of Post-Training
What is known. Modern LLMs go through multiple training stages: pretraining (next-token prediction on large corpora), supervised fine-tuning (SFT on curated instruction-response pairs), reinforcement learning from human feedback (RLHF or variants), and sometimes additional stages like DPO, constitutional AI, or iterative refinement.
What is barely understood:
- Why does RLHF work? The reward model is a noisy proxy for human preferences, trained on limited comparison data, yet it dramatically improves model behavior. Why does optimizing a proxy reward not cause catastrophic reward hacking more often?
- What does fine-tuning change? Does SFT teach new knowledge or just change the format/style of outputs? Evidence suggests the latter (knowledge is mostly from pretraining), but this is not proven.
- Alignment tax: how much capability (if any) is lost through safety training? Is there a fundamental tradeoff between helpfulness and safety, or can both be improved simultaneously?
- Why does DPO work? Direct Preference Optimization bypasses the reward model entirely and optimizes the policy directly on preference data. It works surprisingly well despite being a much simpler procedure than RLHF. The theoretical justification exists (it optimizes the same objective as RLHF under certain assumptions), but why it works as well as it does in practice is not fully understood.
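The DPO objective itself is short enough to state as code: a logistic loss on the difference of log-probability ratios between the chosen and rejected responses, measured against a frozen reference policy. Scalar log-probabilities stand in for real model outputs; beta is the usual temperature hyperparameter:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # beta * (log-ratio on the chosen response - log-ratio on the rejected one)
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy matches the reference the loss is log 2; it falls as the
# policy puts relatively more probability on the chosen response.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The theoretical puzzle is not the formula but why this direct objective matches RLHF's reward-model-plus-RL pipeline so closely in practice.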
What is not known:
- Is there a theory of what post-training should look like? Are the current stages (pre-train, SFT, RLHF) the right decomposition, or is there a more principled pipeline?
- How do we think about the interaction between pretraining data and post-training objectives? The model's capabilities are bounded by pretraining, but post-training can elicit or suppress them.
- Can we formalize what it means for a model to be "aligned" in a way that admits theoretical analysis?
Who is working on it. Anthropic (constitutional AI, alignment theory), OpenAI (RLHF theory), DeepMind (reward modeling), Rafailov et al. (DPO theory at Stanford), and the broader alignment research community.
Common Confusions
These are not just hard problems. They are genuinely open
An open problem in ML theory is not a hard engineering challenge. It is a question where the research community does not have a satisfactory answer and where leading researchers actively disagree about what the answer might be. Progress on any of these would be a significant contribution.
Empirical evidence is not a theoretical explanation
Observing that large models generalize, that scaling laws hold, or that RLHF improves model behavior is not a theoretical understanding. Theory requires identifying the mechanism, stating precise assumptions, and proving that the mechanism produces the observed behavior under those assumptions. We have strong empirical evidence for many phenomena but theoretical explanations for very few.
The NTK is not wrong. It is incomplete
The Neural Tangent Kernel regime provides valid and rigorous results for a specific limit (infinite width, small learning rate). The limitation is that this regime does not capture feature learning, which is the practically important regime. NTK theory is a correct description of a setting that does not fully match practice.
Summary
- The central mystery: overparameterized networks generalize, and we do not fully know why
- No existing complexity measure reliably predicts generalization for deep networks
- NTK theory is rigorous but does not capture feature learning
- Scaling laws are empirically robust but theoretically unexplained
- Whether emergent abilities are real or a measurement artifact is debated
- Transformer-specific theory is in its infancy
- Post-training (RLHF, DPO, SFT) works but lacks theoretical foundations
- Each of these problems is an active area of research with many groups contributing
Exercises
Problem
The Zhang et al. (2017) random labels experiment shows that a neural network can fit random labels perfectly. Explain why this breaks classical generalization theory, and describe two different theoretical frameworks that attempt to resolve the contradiction.
Problem
Explain the difference between the "lazy" (NTK) and "rich" (feature learning) regimes of neural network training. What determines which regime a network is in, and why does it matter for theory?
Problem
Scaling laws show L(N) ∝ N^(-α_N) with α_N ≈ 0.076 for language models. What would it mean if α_N were much larger or much smaller? What determines α_N?
References
Canonical:
- Zhang et al., "Understanding Deep Learning Requires Rethinking Generalization" (ICLR 2017)
- Neyshabur et al., "Exploring Generalization in Deep Learning" (NeurIPS 2017)
Current:
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Wei et al., "Emergent Abilities of Large Language Models" (2022)
- Schaeffer et al., "Are Emergent Abilities of Large Language Models a Mirage?" (NeurIPS 2023)
- Yang, "Tensor Programs" series (2020-2024)
- Neel Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023)
Next Topics
This is the frontier. Every problem listed above is an active research direction with open positions for new researchers.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Implicit Bias and Modern Generalization (Layer 4)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- VC Dimension (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Rademacher Complexity (Layer 3)
- Scaling Laws (Layer 4)