References
References by role.
Last updated: April 23, 2026
TheoremPath uses sources for specific work: definitions, assumptions, proof tools, examples, implementation details, paper context, and practice checks. Topic pages link claims, examples, exercises, and technical notes to the sources behind them.
Tags mark how a source is used: core sources carry the main path, reference sources support recurring topics, and advanced sources appear when a page needs specialized machinery.
Core spine
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
core: Learnability, PAC learning, ERM, uniform convergence, VC dimension, stability, online learning, and Rademacher complexity.
Chapters 2-6, 13, 21, 26-28.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
core: Classical statistical ML, bias-variance, regularization, model assessment, trees, boosting, SVMs, random forests, and high-dimensional classical methods.
Chapters 2-4, 7, 9, 10, 12, 15, 18.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
core: Neural-network foundations, optimization, regularization, feedforward nets, convolutional nets, sequence models, and selected generative modeling topics.
Chapters 2-8, with 9, 10, 14, and 20 used selectively.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
core: Sub-Gaussian and sub-exponential variables, concentration, random vectors, Johnson-Lindenstrauss, epsilon nets, and random matrices.
Chapters 2-5, with 6-8 and 10 used for advanced geometry and sparsity.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
core: Sparse models, covariance estimation, matrix estimation, minimax rates, lower bounds, and kernel methods.
Chapters 2, 4-9, 12, 13, 15.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
core: Convexity, duality, KKT conditions, conditioning, optimization geometry, and the optimization language behind ridge, lasso, SVMs, and proof techniques.
Chapters 2-5, 9, and Appendix A.
Modern deep learning, NLP, and language modeling
Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.
reference: Modern deep-learning explanations, including contemporary architectures and a practical bridge from mathematical notation to model behavior.
Bishop, C. M. and Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer.
reference: Modern conceptual deep-learning reference for architectures, representation learning, latent-variable models, flows, transformers, and graph neural networks.
Jurafsky, D. and Martin, J. H. Speech and Language Processing. 3rd ed. draft.
reference: NLP and language-modeling spine: tokens, n-grams, classification, embeddings, neural networks, transformers, LLMs, generation, and evaluation.
Implementation-heavy language-modeling path: tokenization, transformer construction, training, profiling, data cleaning, scaling laws, inference, and post-training topics.
Support references
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. 2nd ed. MIT Press.
reference: Used to cross-check PAC learning, VC dimension, Rademacher complexity, kernels, and online learning.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
advanced: Used for concentration tools beyond the basic high-dimensional probability route.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. MIT Press.
reference: Used for MDPs, Bellman equations, dynamic programming, temporal-difference learning, and policy gradients.
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. 2nd ed. Wiley.
reference: Used for entropy, KL divergence, mutual information, Fano-style arguments, and information-theoretic framing.
Casella, G. and Berger, R. L. (2002). Statistical Inference. 2nd ed. Duxbury.
reference: Used for likelihood, sufficiency, estimation, testing, confidence intervals, and classical inference.
Durrett, R. (2019). Probability: Theory and Examples. 5th ed. Cambridge University Press.
advanced: Used when measure-theoretic probability, convergence, and martingale arguments need more rigor.
Targeted references
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
reference: Used selectively for probabilistic ML, EM, mixture models, variational inference, Gaussian processes, and graphical models.
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
reference: Used for probabilistic modeling, supervised learning, Bayesian methods, and graphical-model foundations.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press.
advanced: Used for latent-variable models, approximate inference, graphical models, and deep generative modeling.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
advanced: Used for historical and technical context on VC theory and structural risk minimization.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
advanced: Used when asymptotic theory matters directly.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
advanced: Used for nonparametric estimation topics.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
advanced: Used for deeper MDP theory.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
advanced: Used when graphical-model structure is central.
Numerics, causality, and sequential decisions
Used for gradient methods, line search, trust regions, quasi-Newton methods, constrained optimization, convergence behavior, and the numerical side of ML training and estimation.
Used for conditioning, stability, least squares, QR, SVD, eigendecompositions, Krylov methods, PCA, embeddings, and large-scale matrix computation.
Used for stochastic, adversarial, linear, and contextual bandits; regret; confidence bounds; Thompson sampling; and the bridge from supervised prediction to reinforcement learning.
Hernán, M. A. and Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC.
reference: Used when a topic is about interventions rather than prediction: potential outcomes, confounding, identification, weighting, g-methods, transportability, and applied causal reasoning.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
advanced: Used when directed acyclic graphs, structural causal models, do-calculus, mediation, or formal identifiability arguments are central.
Causal sources are conditional, not universal background. Hernán and Robins is the default for intervention and estimation questions; Pearl is used when graphical or structural identification is central.
Systems and practice
Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly.
reference: Used for production ML, data pipelines, evaluation, deployment, monitoring, and iteration.
Huyen, C. (2025). AI Engineering. O'Reilly.
reference: Used for LLM applications, evaluation, retrieval, tool-use systems, deployment, and engineering tradeoffs.
Paper-first LLM and post-training layer
Modern language-modeling and post-training pages use primary papers where textbooks lag. Papers are cited where their claims are used, so the citation stays close to the architecture, objective, benchmark, or implementation detail it supports.
- Architecture: Transformer architecture, attention variants, positional-encoding methods, sequence-length scaling, and mixture-of-experts designs are cited where they change the model claim.
- Systems: Efficient attention, memory movement, kernels, inference systems, KV-cache design, quantization, batching, and latency-throughput tradeoffs are cited where computation changes the result in practice.
- Data and evaluation: Data filtering, deduplication, contamination, benchmark validity, metric scope, and evaluation protocol papers are cited where evidence quality is part of the claim.
- Post-training: Supervised fine-tuning, preference optimization, RLHF, DPO-style objectives, reasoning-oriented post-training, reward models, and verifiers are cited where the page states the objective and evaluated setting.
Source standards
- A source should anchor real pages, exercises, diagnostics, source checks, or implementation notes. If it does not change what gets built, it does not belong here.
- Topic pages should distinguish theorem sources, implementation sources, systems sources, benchmark or evaluation sources, and historical or contextual sources.
- Advanced pages should make assumptions explicit, name the object being optimized or estimated, state a boundary case or failure mode, and include a diagnostic or exercise that checks use rather than recognition.