References

References by role.

Last updated: April 23, 2026

TheoremPath assigns each source a specific job: supplying definitions, assumptions, proof tools, examples, implementation details, paper context, or practice checks. Topic pages link claims, examples, exercises, and technical notes to the sources behind them.

Tags mark how a source is used: core sources carry the main path, reference sources support recurring topics, and advanced sources appear when a page needs specialized machinery.

Core spine

  • Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

    core

    Learnability, PAC learning, ERM, uniform convergence, VC dimension, stability, online learning, and Rademacher complexity.

    Chapters 2-6, 13, 21, 26-28.

  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

    core

    Classical statistical ML, bias-variance, regularization, model assessment, trees, boosting, SVMs, random forests, and high-dimensional classical methods.

    Chapters 2-4, 7, 9, 10, 12, 15, 18.

  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

    core

    Neural-network foundations, optimization, regularization, feedforward nets, convolutional nets, sequence models, and selected generative modeling topics.

    Chapters 2-8, with 9, 10, 14, and 20 used selectively.

  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

    core

    Sub-Gaussian and sub-exponential variables, concentration, random vectors, Johnson-Lindenstrauss, epsilon nets, and random matrices.

    Chapters 2-5, with 6-8 and 10 used for advanced geometry and sparsity.

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

    core

    Sparse models, covariance estimation, matrix estimation, minimax rates, lower bounds, and kernel methods.

    Chapters 2, 4-9, 12, 13, 15.

  • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

    core

    Convexity, duality, KKT conditions, conditioning, optimization geometry, and the optimization language behind ridge, lasso, SVMs, and proof techniques.

    Chapters 2-5, 9, and Appendix A.

Modern deep learning, NLP, and language modeling

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.

    reference

    Modern deep-learning explanations, including contemporary architectures and a practical bridge from mathematical notation to model behavior.

  • Bishop, C. M. and Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer.

    reference

    Modern conceptual deep-learning reference for architectures, representation learning, latent-variable models, flows, transformers, and graph neural networks.

  • Jurafsky, D. and Martin, J. H. Speech and Language Processing. 3rd ed. draft.

    reference

    NLP and language-modeling spine: tokens, n-grams, classification, embeddings, neural networks, transformers, LLMs, generation, and evaluation.

  • Implementation-heavy language-modeling path: tokenization, transformer construction, training, profiling, data cleaning, scaling laws, inference, and post-training topics.

Support references

  • Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of Machine Learning. 2nd ed. MIT Press.

    reference

    Used to cross-check PAC learning, VC dimension, Rademacher complexity, kernels, and online learning.

  • Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

    advanced

    Used for concentration tools beyond the basic high-dimensional probability route.

  • Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. MIT Press.

    reference

    Used for MDPs, Bellman equations, dynamic programming, temporal-difference learning, and policy gradients.

  • Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. 2nd ed. Wiley.

    reference

    Used for entropy, KL divergence, mutual information, Fano-style arguments, and information-theoretic framing.

  • Casella, G. and Berger, R. L. (2002). Statistical Inference. 2nd ed. Duxbury.

    reference

    Used for likelihood, sufficiency, estimation, testing, confidence intervals, and classical inference.

  • Durrett, R. (2019). Probability: Theory and Examples. 5th ed. Cambridge University Press.

    advanced

    Used when measure-theoretic probability, convergence, and martingale arguments need more rigor.

Targeted references

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

    reference

    Used selectively for probabilistic ML, EM, mixture models, variational inference, Gaussian processes, and graphical models.

  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

    reference

    Used for probabilistic modeling, supervised learning, Bayesian methods, and graphical-model foundations.

  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press.

    advanced

    Used for latent-variable models, approximate inference, graphical models, and deep generative modeling.

  • Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.

    advanced

    Used for historical and technical context on VC theory and structural risk minimization.

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

    advanced

    Used when asymptotic theory matters directly.

  • Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.

    advanced

    Used for nonparametric estimation topics.

  • Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

    advanced

    Used for deeper MDP theory.

  • Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

    advanced

    Used when graphical-model structure is central.

Numerics, causality, and sequential decisions

Causal sources are conditional, not universal background. Hernán and Robins is the default for intervention and estimation questions; Pearl is used when graphical or structural identification is central.

Systems and practice

  • Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly.

    reference

    Used for production ML, data pipelines, evaluation, deployment, monitoring, and iteration.

  • Huyen, C. (2025). AI Engineering. O'Reilly.

    reference

    Used for LLM applications, evaluation, retrieval, tool-use systems, deployment, and engineering tradeoffs.

Paper-first LLM and post-training layer

Modern language-modeling and post-training pages use primary papers where textbooks lag. Papers are cited where their claims are used, so the citation stays close to the architecture, objective, benchmark, or implementation detail it supports.

  • Architecture: Transformer architecture, attention variants, positional methods, sequence-length scaling, and mixture-of-experts designs are cited where they change the model claim.
  • Systems: Efficient attention, memory movement, kernels, inference systems, KV-cache design, quantization, batching, and latency-throughput tradeoffs are cited where computation changes the result in practice.
  • Data and evaluation: Data filtering, deduplication, contamination, benchmark validity, metric scope, and evaluation protocol papers are cited where evidence quality is part of the claim.
  • Post-training: Supervised fine-tuning, preference optimization, RLHF, DPO-style objectives, reasoning-oriented post-training, reward models, and verifiers are cited where the page states the objective and evaluated setting.

Source standards

  • A source should anchor real pages, exercises, diagnostics, source checks, or implementation notes. If it does not change what gets built, it does not belong here.
  • Topic pages should distinguish theorem sources, implementation sources, systems sources, benchmark or evaluation sources, and historical or contextual sources.
  • Advanced pages should make assumptions explicit, name the object being optimized or estimated, state a boundary case or failure mode, and include a diagnostic or exercise that checks use rather than recognition.