Optimizer Theory: SGD, Adam, and Muon
3 questions · Advanced · Difficulty 7/10

Question 1 of 3 · Spot the error
Reddi et al. (2018) showed that Adam can diverge on simple convex problems where SGD converges. What is the root cause of Adam's divergence?
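For reference, the quantities named in the options come from the Adam update (Kingma & Ba, 2015):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2,
$$
$$
x_{t+1} = x_t - \frac{\alpha}{\sqrt{\hat v_t} + \epsilon}\,\hat m_t, \qquad
\hat m_t = \frac{m_t}{1-\beta_1^t}, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}.
$$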
A. The EMA denominator v_t can shrink between rare large gradients, spiking the effective learning rate
B. The momentum term m_t overshoots the optimum because bias correction amplifies early gradients
C. Adam uses no Hessian information, unlike true second-order methods that would converge here
D. The divergence is caused by a missing learning-rate warmup, which a warmup schedule would fix
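Below is a minimal sketch of the counterexample behind this question, following the Theorem 1 setup of Reddi et al. (2018): a period-3 gradient stream on [-1, 1] with one rare large gradient C per cycle, beta1 = 0, and beta2 = 1/(1 + C^2). The specific values of C, the learning rate, and the step count are illustrative choices, not taken from the paper.

```python
import numpy as np

# Reddi et al. (2018), Theorem 1 setup (sketch): f_t(x) = C*x when t % 3 == 1,
# else f_t(x) = -x, on the interval [-1, 1]. The average gradient is
# (C - 2)/3 > 0, so the optimum is x = -1. C = 10 and lr = 0.1 are
# illustrative choices, not values from the paper.
C = 10.0
beta1, beta2, eps = 0.0, 1.0 / (1.0 + C**2), 1e-8
lr, steps = 0.1, 30_000

x_adam, m, v = 0.0, 0.0, 0.0
x_sgd = 0.0
for t in range(1, steps + 1):
    g = C if t % 3 == 1 else -1.0
    # Adam: EMAs, bias correction, update, projection onto [-1, 1].
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x_adam = np.clip(
        x_adam - (lr / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps), -1.0, 1.0
    )
    # Projected SGD with the same decaying step size, for comparison.
    x_sgd = np.clip(x_sgd - (lr / np.sqrt(t)) * g, -1.0, 1.0)

print(f"Adam: x = {x_adam:+.3f}  (drifts to +1, the worst point)")
print(f"SGD:  x = {x_sgd:+.3f}  (reaches -1, the optimum)")
```

Running this drives Adam to the constraint boundary x = +1 while SGD settles at the optimum x = -1: between the rare C gradients, v_t decays back toward 1, so the frequent -1 gradients take full-sized steps, while the rare corrective gradient C is damped by the spike in v_t it causes.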