References

References by role.

Last updated: April 23, 2026

TheoremPath assigns each source a specific job: supplying definitions, assumptions, proof tools, examples, implementation details, paper context, or practice checks. Topic pages link claims, examples, exercises, and technical notes to the sources behind them.

Tags mark how a source is used: core sources carry the main path, reference sources support recurring topics, and advanced sources appear when a page needs specialized machinery.

Core spine

  • Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

    core

    Learnability, PAC learning, ERM, uniform convergence, VC dimension, stability, online learning, and Rademacher complexity.

    Chapters 2-6, 13, 21, 26-28.

  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

    core

    Classical statistical ML, bias-variance, regularization, model assessment, trees, boosting, SVMs, random forests, and high-dimensional classical methods.

    Chapters 2-4, 7, 9, 10, 12, 15, 18.

  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

    core

    Neural-network foundations, optimization, regularization, feedforward nets, convolutional nets, sequence models, and selected generative modeling topics.

    Chapters 2-8, with 9, 10, 14, and 20 used selectively.

  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

    core

    Sub-Gaussian and sub-exponential variables, concentration, random vectors, Johnson-Lindenstrauss, epsilon nets, and random matrices.

    Chapters 2-5, with 6-8 and 10 used for advanced geometry and sparsity.

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

    core

    Sparse models, covariance estimation, matrix estimation, minimax rates, lower bounds, and kernel methods.

    Chapters 2, 4-9, 12, 13, 15.

  • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

    core

    Convexity, duality, KKT conditions, conditioning, optimization geometry, and the optimization language behind ridge, lasso, SVMs, and proof techniques.

    Chapters 2-5, 9, and Appendix A.

Modern deep learning, NLP, and language modeling

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.

    reference

    Modern deep-learning explanations, including contemporary architectures and a practical bridge from mathematical notation to model behavior.

  • Bishop, C. M. and Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer.

    reference

    Modern conceptual deep-learning reference for architectures, representation learning, latent-variable models, flows, transformers, and graph neural networks.

  • Jurafsky, D. and Martin, J. H. Speech and Language Processing. 3rd ed. draft.

    reference

    NLP and language-modeling spine: tokens, n-grams, classification, embeddings, neural networks, transformers, LLMs, generation, and evaluation.

  • Implementation-heavy language-modeling path: tokenization, transformer construction, training, profiling, data cleaning, scaling laws, inference, and post-training topics.

Support references

  • Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. 2nd ed. MIT Press.

    reference

    Used to cross-check PAC learning, VC dimension, Rademacher complexity, kernels, and online learning.

  • Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

    advanced

    Used for concentration tools beyond the basic high-dimensional probability route.

  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. MIT Press.

    reference

    Used for MDPs, Bellman equations, dynamic programming, temporal-difference learning, and policy gradients.

  • Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. 2nd ed. Wiley.

    reference

    Used for entropy, KL divergence, mutual information, Fano-style arguments, and information-theoretic framing.

  • Casella, G. and Berger, R. L. (2002). Statistical Inference. 2nd ed. Duxbury.

    reference

    Used for likelihood, sufficiency, estimation, testing, confidence intervals, and classical inference.

  • Durrett, R. (2019). Probability: Theory and Examples. 5th ed. Cambridge University Press.

    advanced

    Used when measure-theoretic probability, convergence, and martingale arguments need more rigor.

Targeted references

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

    reference

    Used selectively for probabilistic ML, EM, mixture models, variational inference, Gaussian processes, and graphical models.

  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

    reference

    Used for probabilistic modeling, supervised learning, Bayesian methods, and graphical-model foundations.

  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press.

    advanced

    Used for latent-variable models, approximate inference, graphical models, and deep generative modeling.

  • Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.

    advanced

    Used for historical and technical context on VC theory and structural risk minimization.

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

    advanced

    Used when asymptotic theory matters directly.

  • Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.

    advanced

    Used for nonparametric estimation topics.

  • Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

    advanced

    Used for deeper MDP theory.

  • Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

    advanced

    Used when graphical-model structure is central.

Numerics, causality, and sequential decisions

Causal sources are conditional, not universal background. Hernán and Robins is the default for intervention and estimation questions; Pearl is used when graphical or structural identification is central.

Systems and practice

  • Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly.

    reference

    Used for production ML, data pipelines, evaluation, deployment, monitoring, and iteration.

  • Huyen, C. (2025). AI Engineering. O'Reilly.

    reference

    Used for LLM applications, evaluation, retrieval, tool-use systems, deployment, and engineering tradeoffs.

Paper-first LLM and post-training layer

Modern language-modeling and post-training pages use primary papers where textbooks lag. Papers are cited where their claims are used, so the citation stays close to the architecture, objective, benchmark, or implementation detail it supports.

  • Architecture: Transformer architecture, attention variants, positional methods, sequence-length scaling, and mixture-of-experts designs are cited where they change the model claim.
  • Systems: Efficient attention, memory movement, kernels, inference systems, KV-cache design, quantization, batching, and latency-throughput tradeoffs are cited where computation changes the result in practice.
  • Data and evaluation: Data filtering, deduplication, contamination, benchmark validity, metric scope, and evaluation protocol papers are cited where evidence quality is part of the claim.
  • Post-training: Supervised fine-tuning, preference optimization, RLHF, DPO-style objectives, reasoning-oriented post-training, reward models, and verifiers are cited where the page states the objective and evaluated setting.

Source standards

  • A source should anchor real pages, exercises, diagnostics, source checks, or implementation notes. If it does not change what gets built, it does not belong here.
  • Topic pages should distinguish theorem sources, implementation sources, systems sources, benchmark or evaluation sources, and historical or contextual sources.
  • Advanced pages should make assumptions explicit, name the object being optimized or estimated, state a boundary case or failure mode, and include a diagnostic or exercise that checks use rather than recognition.