Introduction
I put this collection together after spending a lot of time reading what I think are some of the best books on AI, machine learning, deep learning, probabilistic modeling, optimization, reinforcement learning, transformers, LLMs, validation, and fairness. I want to share this with the community for one simple reason: I want to give people a structured path through the books that actually help them understand things deeply, instead of sending them through random courses, disconnected tutorials, and fragmented content.
This collection is my attempt to gather the most important resources I would recommend to anyone who wants to master AI and ML properly: from mathematics and probability, to statistical learning theory, to deep learning, to transformers and LLM engineering, to decision making, robustness, fairness, and causality. Each book is on this list for a reason. Together they form a full ladder from the basics to advanced research-level thinking, covering both theory and practice, both clear mathematics and sound engineering judgment, both prediction and decision making.
I'm sharing this as a community resource for people who want a real plan. I don't think you should treat these books as separate references; I think you should treat them as one curriculum. If someone follows the roadmap below with patience and consistency, they can build a much stronger foundation than scattered online courses typically provide.
Why this collection exists
Most people who enter AI and ML do one of three things:
- They jump straight into model-building without mathematics.
- They consume many short courses but never develop deep conceptual structure.
- They know pieces of the field but not the full map.
I wanted to solve exactly that problem.
This collection is for readers who want:
- a mathematically grounded path into ML,
- a principled understanding of probability and inference,
- real clarity on optimization and generalization,
- a modern understanding of deep learning and transformers,
- a path into RL, uncertainty, validation, and causality,
- and a serious alternative to random fragmented learning.
Book-by-book explanation
Mathematics for Machine Learning
For those who wish to overcome their fear of the math involved in machine learning, I think this is one of the best starting points. It covers the mathematical foundations that matter most for machine learning: linear algebra, analytic geometry, matrix decompositions, calculus, optimization, probability, and statistics. It then connects those tools directly to fundamental machine learning techniques like SVMs, PCA, linear regression, and Gaussian mixture models. It is therefore perfect as a "mathematical bridge book" between pure mathematics and real machine learning.
Foundations of Machine Learning
This is one of the collection's core theory books. It develops the PAC learning framework, learnability, generalization theory, complexity measures, and formal guarantees. In my opinion, it is the book that teaches readers how to think rigorously about what it means for a model to learn, to generalize, and to be statistically justified. This is a theory-first foundation rather than a practice-first book.
Understanding Machine Learning: From Theory to Algorithms
This book is a great addition to Foundations of Machine Learning. It focuses on how those principles become algorithms while providing a principled explanation of the concepts underlying learning theory. ERM, convexity, stability, stochastic gradient descent, neural networks, structured output learning, and theoretical concepts such as compression-based bounds and PAC-Bayes are all covered. It is among the best "bridge books" between rigorous theory and algorithmic implementation, in my opinion.
Pattern Recognition and Machine Learning
This is one of the classic books on probabilistic machine learning. I have included it because it develops probabilistic intuition at a very deep level: Bayesian methods, graphical models, approximate inference, latent-variable models, kernel methods, and probabilistic pattern recognition. For readers who wish to understand machine learning through the lenses of uncertainty, density modeling, and Bayesian reasoning, it remains among the best texts.
Machine Learning: A Probabilistic Perspective
This is one of the most extensive and thorough ML books in the collection. It covers the fundamentals of machine learning by combining background math, probability, optimization, linear models, latent-variable models, approximate inference, graphical models, kernel methods, and deep learning into a single probabilistic language. It's among the best "encyclopedic" references for machine learning, in my opinion.
Probabilistic Machine Learning: An Introduction
This is the modern introduction to probabilistic machine learning. It covers the foundational topics: probability, multivariate models, statistics, decision theory, information theory, linear algebra, optimization, neural networks, trees, ensembles, clustering, dimensionality reduction, and learning with fewer labels. It is, in my opinion, one of the greatest contemporary resources for readers seeking a clear and organized probabilistic perspective on the field.
Probabilistic Machine Learning: Advanced Topics
This is the advanced sequel, and one of the most significant books in the entire collection. It includes advanced inference, Bayesian statistics, graphical models, filtering and smoothing, variational inference, Monte Carlo techniques, Bayesian neural networks, Gaussian processes, distribution shift, generative models, representation learning, interpretability, decision making, reinforcement learning, and causality. This book, in my opinion, shows probabilistic machine learning maturing into a comprehensive theory of prediction, generation, discovery, and action.
Information Theory, Inference, and Learning Algorithms
This book is special because it sharpens the way we think about entropy, coding, inference, message passing, learning, and compression. It is one of the books that most strongly teaches the unity between information theory and machine learning. Even when it is not directly used in applied pipelines, it changes the way a reader thinks about uncertainty, model selection, and representation.
Convex Optimization
This is the canonical reference for convex sets, convex functions, duality, KKT conditions, and optimization theory. I include it because optimization is not just a tool in ML; it is one of the structural cores of the field. Readers who understand this book well gain a much deeper command over learning objectives, constrained optimization, regularization, dual methods, and algorithmic guarantees.
Algorithms for Optimization
This book complements Convex Optimization by being more algorithm-focused and method-oriented. It covers gradients, bracketing, local descent, first-order methods, second-order methods, stochastic methods, population methods, constrained optimization, duality, LP, QP, and ADMM. I see it as one of the most useful optimization books for practitioners who want algorithmic understanding instead of only theory.
Deep Learning with Python
This is a practical and highly accessible deep learning book. It introduces neural networks, Keras, TensorFlow workflows, computer vision, text, timeseries, generative modeling, and real-world deep learning practices. I include it because it helps readers move from mathematical foundations into real neural network intuition and implementation.
Understanding Deep Learning
This is one of the clearest modern books on deep learning fundamentals. It explains supervised learning, neural networks, deep architectures, loss functions, fitting models, gradient descent, stochastic optimization, initialization, and modern deep learning concepts with unusual clarity. I consider it one of the best deep learning books for readers who want concept-first understanding without sacrificing mathematical structure.
The Principles of Deep Learning Theory
This is a research-oriented theory book focused on initialization, criticality, Gaussian-process limits, finite- and infinite-width analysis, Bayesian inference in neural networks, and the effective theory viewpoint. I include it because it pushes readers beyond practical deep learning into the theory frontier. It is not the first DL book to read, but it is one of the most valuable advanced theory books.
Natural Language Processing with Transformers
This book is one of the strongest practical guides to modern NLP and transformer-based systems. It covers pretrained transformers, Hugging Face workflows, multilingual models, question answering, generation, fine-tuning, deployment, and practical engineering considerations. I consider it essential for moving from general deep learning into modern NLP and transformer applications.
Build a Large Language Model (From Scratch)
This book is valuable because it teaches LLMs by construction. It walks through data preparation, attention, architecture, pretraining, evaluation, loading pretrained weights, and fine-tuning for classifiers and assistants. I include it because building an LLM from scratch is one of the best ways to deeply understand what an LLM really is.
AI Engineering Guidebook
This book focuses on system design patterns for LLMs, RAG, agents, prompt engineering, fine-tuning, and deployment-oriented engineering. I include it because knowing models is not enough; real AI work today also requires understanding inference pipelines, retrieval, orchestration, local deployment, and product architecture.
Reinforcement Learning: An Introduction
This is the foundational RL book. It covers bandits, Markov decision processes, value functions, dynamic programming, Monte Carlo methods, temporal-difference learning, policy gradients, and more. I recommend it as the first serious RL book in the roadmap.
Algorithms for Decision Making
This book connects probabilistic reasoning, inference, utility, MDPs, planning, policy search, policy gradients, actor-critic methods, and policy validation. I see it as a beautiful bridge between probability, planning, and reinforcement learning.
Multi-Agent Reinforcement Learning
This is the modern extension into multi-agent systems and MARL. It combines game-theoretic foundations with contemporary learning methods. I recommend it only after strong single-agent RL understanding, because it adds strategic interaction, coordination, and game structure on top of standard RL ideas.
Algorithms for Validation
This is one of the most important books for trustworthy AI and safety-oriented ML. It covers validation, system modeling, property specification, falsification, reachability, failure estimation, explainability, and runtime monitoring. I include it because average benchmark performance is not enough for serious AI systems.
Fairness and Machine Learning
This book addresses fairness, legitimacy of automated decision making, classification, non-discrimination criteria, and the sociotechnical limits of purely observational fairness frameworks. I include it because real-world ML needs both mathematical and institutional responsibility.
Recommended roadmap
Level 0: Mathematics and optimization foundation
I recommend starting here if the reader wants to build from first principles:
- Mathematics for Machine Learning
- Convex Optimization
- Algorithms for Optimization
This stage builds the language of vectors, matrices, eigendecompositions, gradients, Hessians, constraints, and optimization geometry. Without this stage, later ML understanding often becomes shallow.
Level 1: Core machine learning foundation
After math, I recommend:
- Probabilistic Machine Learning: An Introduction
- Understanding Machine Learning
- Foundations of Machine Learning
This stage gives the reader the foundations of probability, learning theory, empirical risk minimization, statistical thinking, and formal generalization.
Level 2: Probabilistic depth
Then I recommend:
- Pattern Recognition and Machine Learning
- Machine Learning: A Probabilistic Perspective
- Probabilistic Machine Learning: Advanced Topics
- Information Theory, Inference, and Learning Algorithms
This stage builds a complete mature probabilistic worldview. Here the reader learns to think in terms of posterior inference, latent variables, uncertainty, graphical structure, divergence measures, and information flow.
Level 3: Deep learning
Then I recommend:
- Deep Learning with Python
- Understanding Deep Learning
- The Principles of Deep Learning Theory
This stage moves from practice to conceptual depth to research theory. Readers first build intuition, then clean understanding, then more advanced theoretical maturity.
Level 4: NLP, transformers, and LLMs
Then:
- Natural Language Processing with Transformers
- Build a Large Language Model (From Scratch)
- AI Engineering Guidebook
This stage turns deep learning understanding into transformer fluency and finally into LLM systems engineering.
Level 5: Decision making and RL
Then:
- Reinforcement Learning: An Introduction
- Algorithms for Decision Making
- Multi-Agent Reinforcement Learning
This stage extends the reader from prediction into sequential decision making, planning, policy optimization, and strategic multi-agent interaction.
Level 6: Trustworthy and societally grounded AI
Finally:
- Algorithms for Validation
- Fairness and Machine Learning
This stage matters because building a model is not the same as building a safe, robust, interpretable, or fair system.
My master notes and core concepts after reading these books
Below I am not trying to rewrite the books in full. I am only extracting, as master notes, the most important concepts and master formulas that I believe form the deep structure across the whole collection.
1. Learning as optimization
A very large fraction of ML can be written as:

$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big) + \lambda \, \Omega(\theta)$$

where:
- $f_\theta$ is the model with parameters $\theta$,
- $\ell$ is the loss on a single example,
- $\Omega$ is a regularizer with strength $\lambda$.

This one template covers linear models, logistic regression, neural networks, transformers, many probabilistic models, and even parts of reinforcement learning through surrogate objectives.
The idea is that machine learning is more than just "fitting data." It optimizes a tradeoff between fitting the observed data and controlling model complexity.
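As one concrete instance of this template (a synthetic dataset, squared loss, and an L2 regularizer are illustrative choices, using NumPy), plain gradient descent on the regularized empirical risk looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical design matrix
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

lam, lr = 0.01, 0.1                      # regularization strength, step size
w = np.zeros(3)
for _ in range(500):
    # gradient of (1/n)||Xw - y||^2 + lam * ||w||^2
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
    w -= lr * grad

print(np.round(w, 2))                    # close to true_w, slightly shrunk
```

Swapping in a different loss or penalty changes the model family, but the optimization template stays the same.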
2. Empirical risk, expected risk, and generalization
The true objective is not training performance but expected performance on the underlying data distribution:

$$R(f) = \mathbb{E}_{(x, y) \sim p}\big[\ell(f(x), y)\big]$$

Since $p(x, y)$ is unknown, we instead minimize empirical risk:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)$$

and perform ERM:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f)$$

The central question of learning theory is then: when is the generalization gap $R(\hat{f}) - \hat{R}(\hat{f})$ small?
PAC learning, VC dimension, stability, margins, and Rademacher complexity all become significant at this point. The field is about justified generalization rather than merely fitting.
3. Probability as the language of uncertainty
Probability is the language that ties together Bayesian reasoning, inference, decision theory, graphical models, latent-variable models, generative modeling, and uncertainty-aware prediction.
Basic probability identities:

$$p(x) = \sum_{y} p(x, y) \qquad \text{(sum rule)}$$

$$p(x, y) = p(y \mid x) \, p(x) \qquad \text{(product rule)}$$

The most important conceptual objects are:
- the prior $p(\theta)$,
- the likelihood $p(\mathcal{D} \mid \theta)$,
- the posterior $p(\theta \mid \mathcal{D})$,
- the evidence $p(\mathcal{D})$.

Bayesian updating then becomes:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}$$
This pattern appears across PRML, Murphy's books, MacKay, graphical models, Bayesian neural networks, filtering, and causal inference.
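A minimal numeric sketch of this updating pattern (a made-up coin with three candidate biases and a uniform prior):

```python
from math import comb

thetas = [0.3, 0.5, 0.7]                 # candidate coin biases (made up)
prior = [1 / 3, 1 / 3, 1 / 3]            # uniform prior p(theta)
heads, flips = 8, 10                     # observed data D

# likelihood p(D | theta) under a Binomial model
lik = [comb(flips, heads) * t**heads * (1 - t)**(flips - heads) for t in thetas]
evidence = sum(l * p for l, p in zip(lik, prior))            # p(D)
posterior = [l * p / evidence for l, p in zip(lik, prior)]   # p(theta | D)

print([round(p, 3) for p in posterior])  # mass concentrates on theta = 0.7
```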
4. Likelihood, MLE, and MAP
Given data $\mathcal{D} = \{x_1, \dots, x_n\}$, the likelihood is:

$$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

Taking logs:

$$\log p(\mathcal{D} \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

Maximum likelihood estimation is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \log p(\mathcal{D} \mid \theta)$$

Maximum a posteriori estimation adds a prior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid \mathcal{D})$$

Equivalently:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \big[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \big]$$

This is the deep bridge between Bayesian reasoning and regularization: the log-prior acts as a penalty term, so, for example, a Gaussian prior on $\theta$ corresponds to L2 regularization.
5. Linear regression as the prototype
The simplest but most important predictive model is:

$$\hat{y} = w^\top x$$

with squared loss:

$$L(w) = \frac{1}{n} \, \| Xw - y \|^2$$

Closed-form solution:

$$w^* = (X^\top X)^{-1} X^\top y$$

Ridge regression:

$$w^* = (X^\top X + \lambda I)^{-1} X^\top y$$
This model matters because it teaches many of the core ideas of the entire field in a clean setting:
- projection geometry,
- Gaussian-noise interpretation,
- bias-variance tradeoff,
- regularization,
- conditioning and numerical stability,
- Bayesian linear modeling.
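A quick NumPy sketch of the closed-form solutions above (synthetic data; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))             # synthetic features
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=50)

# OLS: solve (X^T X) w = X^T y rather than forming the inverse explicitly
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# ridge: (X^T X + lam I) w = X^T y -- the penalty shrinks the solution
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(np.round(w_ols, 2), np.round(w_ridge, 2))
```

The ridge solution always has smaller norm than the OLS solution, which is the regularization effect in its cleanest form.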
6. Logistic regression and classification
Binary logistic regression models class probabilities using the sigmoid:

$$p(y = 1 \mid x) = \sigma(w^\top x), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

The negative log-likelihood is the binary cross-entropy loss:

$$L(w) = -\sum_{i=1}^{n} \big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \big]$$

Multiclass logistic regression uses softmax:

$$p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j} \exp(w_j^\top x)}$$

with loss:

$$L = -\sum_{i=1}^{n} \log p(y_i \mid x_i)$$
This is one of the most important bridges in all of ML because it connects probability, classification, linear models, gradient-based optimization, and neural network output layers.
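A small sketch of these loss computations (the logits are hand-picked for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())              # shift logits for numerical stability
    return e / e.sum()

# binary case: true label y = 1, logit z -> BCE reduces to -log(sigmoid(z))
z = 2.0
bce = -np.log(sigmoid(z))

# multiclass case: true class is 0 -> CE = -log(softmax(logits)[0])
logits = np.array([2.0, 0.5, -1.0])
ce = -np.log(softmax(logits)[0])

print(round(float(bce), 3), round(float(ce), 3))
```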
7. The exponential family
Many widely used distributions can be written as:

$$p(x \mid \eta) = h(x) \exp\big( \eta^\top T(x) - A(\eta) \big)$$

where:
- $\eta$ is the natural parameter,
- $T(x)$ is the sufficient statistic,
- $A(\eta)$ is the log-partition function,
- $h(x)$ is the base measure.
This family matters because it unifies Bernoulli, Gaussian, Poisson, categorical, and many more. It also lies underneath GLMs, conjugacy, variational inference, message passing, and natural gradients.
8. Information theory as a master layer
Entropy:

$$H(p) = -\sum_{x} p(x) \log p(x)$$

Cross-entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Mutual information:

$$I(X; Y) = D_{\mathrm{KL}}\big( p(x, y) \,\|\, p(x) \, p(y) \big)$$
These ideas are not peripheral. They are central across the books:
- cross-entropy is the standard classification loss,
- KL is central in VI, distillation, and approximate Bayes,
- entropy measures uncertainty and exploration,
- mutual information appears in representation learning and bottleneck methods,
- coding and compression ideas connect learning to information structure.
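These quantities are easy to compute directly; a small NumPy sketch for two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # "true" distribution (made up)
q = np.array([0.4, 0.4, 0.2])            # model distribution (made up)

H_p = -np.sum(p * np.log(p))             # entropy H(p)
H_pq = -np.sum(p * np.log(q))            # cross-entropy H(p, q)
kl = np.sum(p * np.log(p / q))           # KL(p || q)

# the identity H(p, q) = H(p) + KL(p || q) ties the three together
print(round(float(H_p), 4), round(float(H_pq), 4), round(float(kl), 4))
```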
9. Bias–variance tradeoff
In a simplified regression view:

$$\mathbb{E}\big[ (y - \hat{f}(x))^2 \big] = \mathrm{Bias}^2\big[\hat{f}(x)\big] + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$
This gives one of the most important conceptual lessons in ML:
- small models underfit because of high bias,
- very flexible models can overfit because of high variance,
- regularization and inductive bias control the tradeoff.
This pattern reappears in linear models, trees, kernels, ensembles, and deep neural networks.
10. Convexity and optimization geometry
A function is convex if:

$$f\big(\lambda x + (1 - \lambda) y\big) \le \lambda f(x) + (1 - \lambda) f(y), \qquad \lambda \in [0, 1]$$

For convex differentiable functions:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$

Convexity matters because local minima are global minima, optimization is more stable, and duality becomes powerful.

For constrained optimization:

$$\min_{x} f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; h_j(x) = 0$$

the Lagrangian is:

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i} \lambda_i f_i(x) + \sum_{j} \nu_j h_j(x)$$

and KKT conditions become fundamental: stationarity, primal feasibility, dual feasibility ($\lambda_i \ge 0$), and complementary slackness ($\lambda_i f_i(x) = 0$).
These ideas sit underneath SVMs, constrained estimation, dual optimization, and many optimization-based ML methods.
11. Gradient descent, SGD, momentum, and Adam
Standard gradient descent:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

Mini-batch stochastic gradient descent:

$$\theta_{t+1} = \theta_t - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta_t)$$

Momentum:

$$v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

Adam:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
These are the workhorse update rules behind modern deep learning and a large fraction of practical ML.
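A minimal Adam implementation following the update rules above (the quadratic objective and hyperparameters are illustrative choices):

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = np.zeros_like(theta)             # first-moment (mean) estimate
    v = np.zeros_like(theta)             # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)       # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# sanity check: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
theta = adam_minimize(lambda th: 2 * th, np.array([5.0, -3.0]))
print(np.round(theta, 3))                # ends up near the minimum at 0
```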
12. Neural networks as compositional function approximators
A feedforward neural network composes affine maps and nonlinearities:

$$f(x) = W_L \, \sigma\big( W_{L-1} \, \sigma( \cdots \sigma(W_1 x + b_1) \cdots ) + b_{L-1} \big) + b_L$$
The real power of deep learning is not just "many parameters." It is hierarchical representation learning through compositional structure.
Backpropagation is the chain rule applied efficiently across this composition:

$$\frac{\partial L}{\partial W_l} = \frac{\partial L}{\partial h_L} \, \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_{l+1}}{\partial h_l} \, \frac{\partial h_l}{\partial W_l}$$
13. Initialization and trainability
Deep networks are sensitive to activation and gradient scales. Good initialization helps preserve signal and gradient flow.
Xavier / Glorot initialization:

$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

He initialization:

$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{\text{in}}}$$
This matters because training failure often comes not from the optimizer alone, but from bad signal propagation through depth. That is one reason the theory books on deep learning pay so much attention to initialization, criticality, and scaling.
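A small NumPy experiment illustrating this point (layer width, depth, and the ReLU stack are arbitrary choices): He-scaled weights keep the activation scale roughly stable, while unit-variance weights blow it up exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 20                       # layer width and depth (arbitrary)
x = rng.normal(size=n)

def final_std(weight_std):
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(n, n))
        h = np.maximum(0.0, W @ h)       # ReLU layer
    return h.std()

he = final_std(np.sqrt(2.0 / n))         # He scaling: Var(W_ij) = 2 / n_in
naive = final_std(1.0)                   # unit-variance weights
print(round(float(he), 2), naive)        # he stays O(1); naive explodes
```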
14. Variational inference and the ELBO
When exact posterior inference is intractable, variational inference approximates it with a tractable family $q_\phi(z)$.

The key identity is:

$$\log p(x) = \mathrm{ELBO}(\phi) + D_{\mathrm{KL}}\big( q_\phi(z) \,\|\, p(z \mid x) \big)$$

where the evidence lower bound is:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\big[ \log p(x, z) - \log q_\phi(z) \big]$$

Equivalently:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\big[ \log p(x \mid z) \big] - D_{\mathrm{KL}}\big( q_\phi(z) \,\|\, p(z) \big)$$

Since KL is nonnegative, maximizing the ELBO makes $q_\phi(z)$ closer to the true posterior.
This one framework powers:
- latent-variable models,
- VAEs,
- amortized inference,
- Bayesian deep learning,
- large-scale approximate Bayesian methods.
15. Gaussian processes and function-space thinking
A Gaussian process defines a distribution over functions:

$$f \sim \mathcal{GP}\big( m(x), \, k(x, x') \big)$$

where $m(x)$ is the mean function and $k(x, x')$ is the kernel.
The conceptual leap here is powerful: instead of putting uncertainty over parameters, I can put uncertainty directly over functions.
Gaussian processes matter because they teach:
- uncertainty-aware prediction,
- the role of kernels,
- Bayesian function-space inference,
- the relationship between infinite-width networks and kernel limits.
16. Bayesian neural networks and predictive uncertainty
In Bayesian neural networks, I put a posterior over parameters:

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta)$$

and then predictive uncertainty becomes:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta) \, p(\theta \mid \mathcal{D}) \, d\theta$$
This is one of the cleanest ways to represent epistemic uncertainty in neural prediction. Approximation strategies include VI, Laplace approximations, MCMC, dropout-based approximations, and deep ensembles.
17. Attention and transformers
The core transformer mechanism is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V$$

Self-attention uses:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

Multi-head attention computes several attention maps in parallel:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \, W_O$$
The key conceptual lesson is that attention lets a model dynamically route information based on relevance rather than fixed local structure. This is one reason transformers generalized so strongly across text, vision, and multimodal systems.
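A NumPy sketch of single-head scaled dot-product attention (random inputs and projection matrices, no masking):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) relevance logits
    weights = softmax(scores, axis=-1)   # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, weights.shape)          # (4, 8) (4, 4)
```

Each output row is a relevance-weighted mixture of the value vectors, which is the "dynamic routing" described above.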
18. Language modeling and LLMs
Autoregressive language modeling factors a sequence as:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

Training objective:

$$L(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Perplexity:

$$\mathrm{PPL} = \exp\!\left( \frac{L(\theta)}{T} \right)$$
This is the master probabilistic formulation behind GPT-style pretraining. LLMs are best understood as large autoregressive probabilistic sequence models trained at scale.
19. Fine-tuning and distillation
Supervised fine-tuning objective:

$$L_{\text{SFT}}(\theta) = -\sum_{(x, y)} \log p_\theta(y \mid x)$$

Knowledge distillation:

$$L_{\text{KD}} = D_{\mathrm{KL}}\big( p_{\text{teacher}}(y \mid x; T) \,\|\, p_{\text{student}}(y \mid x; T) \big)$$

with a temperature $T$ softening both distributions, often mixed with the standard cross-entropy on ground-truth labels.
The conceptual point is that a student model can learn not only ground-truth labels but also the teacher's richer soft distribution over outputs.
20. Retrieval-augmented generation and AI engineering
In a simplified RAG pipeline:

$$z = \mathrm{Retrieve}(x, \text{index}), \qquad y = \mathrm{LLM}(x, z)$$
This shows the core architecture idea:
- model weights contain parametric memory,
- the retriever/index provides nonparametric memory.
Modern LLM systems often perform best when both are combined. That is why AI engineering is not just about model size or fine-tuning, but also about retrieval, orchestration, evaluation, and system design.
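A toy sketch of the retrieval step (bag-of-words vectors and cosine similarity stand in for a real embedding model and vector index; the documents are made up):

```python
import numpy as np

docs = [
    "the transformer uses attention to route information",
    "gaussian processes put a prior over functions",
    "value iteration solves small markov decision processes",
]
vocab = sorted({w for d in docs for w in d.split()})

def embed(text):
    # bag-of-words embedding over the document vocabulary, L2-normalized
    counts = np.array([text.split().count(w) for w in vocab], dtype=float)
    return counts / (np.linalg.norm(counts) + 1e-12)

query = "how does attention route information"
sims = [float(embed(query) @ embed(d)) for d in docs]
context = docs[int(np.argmax(sims))]     # nonparametric memory: the top hit
prompt = f"Context: {context}\nQuestion: {query}"
print(prompt)
```

A production system would swap in a learned embedding model and an approximate nearest-neighbor index, but the retrieve-then-prompt structure is the same.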
21. Markov decision processes and RL
A Markov decision process is:

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

Return:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$

State-value function:

$$V^\pi(s) = \mathbb{E}_\pi\big[ G_t \mid s_t = s \big]$$

Action-value function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\big[ G_t \mid s_t = s, \, a_t = a \big]$$

Bellman expectation equation:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[ R(s, a) + \gamma V^\pi(s') \big]$$

Bellman optimality equation:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \big[ R(s, a) + \gamma V^*(s') \big]$$
These equations form the mathematical backbone of RL and sequential decision making.
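These backups can be run directly; a sketch of value iteration on a made-up 2-state, 2-action MDP:

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] transition probabilities and R[s, a] expected rewards (made up)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)              # Q(s,a) = R(s,a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)                # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)
print(np.round(V, 3), policy)            # V* and the greedy policy
```

Because the backup is a gamma-contraction, the loop converges to the unique fixed point V* regardless of the starting values.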
22. Temporal-difference learning and policy gradients
TD value update:

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$$

Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$$

Policy gradient objective:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\big[ G_0 \big]$$

REINFORCE gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]$$
These formulas explain the main split in RL:
- value-based learning,
- policy-based learning,
- actor-critic hybrids.
23. Exploration vs exploitation
In bandits and RL, the system must balance using what it knows and discovering what it does not know.
One canonical exploration rule is UCB:

$$a_t = \arg\max_{a} \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{N_a}} \right]$$
This captures a deep principle: act according to both current value estimate and uncertainty bonus.
Another key idea is Thompson sampling: sample from the posterior and act optimally under the sample. This naturally links Bayesian uncertainty to exploration.
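A small simulation of UCB on a made-up 3-armed Bernoulli bandit:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])        # true arm means (unknown to the agent)
T = 5000

counts = np.ones(3)                      # pull each arm once to initialize
sums = (rng.random(3) < means).astype(float)

for t in range(3, T):
    # empirical mean plus an uncertainty bonus that shrinks with visits
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
    a = int(np.argmax(ucb))
    sums[a] += float(rng.random() < means[a])
    counts[a] += 1

print(counts)                            # pulls concentrate on the 0.8 arm
```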
24. Decision theory
Prediction is not enough; action depends on utility.
Expected utility principle:

$$a^* = \arg\max_{a} \; \mathbb{E}_{p(y \mid x)}\big[ U(a, y) \big]$$

or equivalently with losses:

$$a^* = \arg\min_{a} \; \mathbb{E}_{p(y \mid x)}\big[ \ell(a, y) \big]$$
This is one of the deepest ideas in the collection. Many practical systems fail because they optimize prediction accuracy without explicitly reasoning about utility, cost, risk, and downstream decisions.
25. Distribution shift and robustness
One of the most important advanced lessons in modern ML is that train and test distributions often differ.
Under covariate shift:

$$p_{\text{train}}(x) \ne p_{\text{test}}(x), \qquad p_{\text{train}}(y \mid x) = p_{\text{test}}(y \mid x)$$

and corrections use importance weights $w(x) = p_{\text{test}}(x) / p_{\text{train}}(x)$.
This leads to reweighting strategies, adaptation, and robust training ideas.
Modern ML needs to account for:
- covariate shift,
- label shift,
- domain adaptation,
- continual learning,
- OOD detection,
- adversarial examples.
26. Generative modeling taxonomy
Modern generative AI can be seen through several major families:
Variational autoencoders
Autoregressive models
Normalizing flows
Diffusion models
Forward corruption:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big( \sqrt{1 - \beta_t} \, x_{t-1}, \; \beta_t I \big)$$

Learned reverse denoising:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big( \mu_\theta(x_t, t), \; \Sigma_\theta(x_t, t) \big)$$
GANs
A generator and discriminator are trained in opposition through adversarial objectives.
The key lesson is that generative modeling is not one technique. It is an ecosystem of probabilistic modeling strategies with different tradeoffs in likelihood, sample quality, inference, and training stability.
27. Representation learning
Representation learning is about learning a mapping $z = f(x)$ such that useful structure is preserved and nuisance variation is compressed.
This can be done through:
- supervised learning,
- self-supervised learning,
- generative modeling,
- multiview learning,
- bottleneck objectives.
A strong representation is not just a compressed vector; it is a structure-preserving abstraction that improves downstream learning and transfer.
28. Validation and rare-event thinking
Robust system evaluation requires more than aggregate accuracy.
For rare failure estimation, importance sampling plays a key role:

$$P(\text{failure}) = \mathbb{E}_{q(x)}\!\left[ \mathbb{1}[\text{failure}(x)] \, \frac{p(x)}{q(x)} \right]$$
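A quick sketch of this estimator for a synthetic rare event, P(X > 4) with X ~ N(0, 1), sampling from a shifted proposal:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(loc=4.0, size=n)           # samples from proposal q = N(4, 1)
log_w = -0.5 * x**2 + 0.5 * (x - 4.0)**2  # log p(x) - log q(x)
est = np.mean((x > 4.0) * np.exp(log_w))  # weighted indicator average

exact = 0.5 * (1 - erf(4 / sqrt(2)))      # true N(0,1) tail probability
print(f"{est:.3e} vs exact {exact:.3e}")
```

Naive Monte Carlo with the same budget would see only a handful of events near this probability (about 3 in 10^-5 territory); shifting the proposal into the failure region makes the estimate sharp.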
The larger lesson is that trustworthy AI must account for:
- rare failures,
- adversarial behavior,
- reachability of unsafe states,
- runtime monitoring,
- property violation,
- explainability and post-deployment safety.
29. Fairness criteria
Three central statistical fairness notions are:

Independence: $\hat{Y} \perp A$

Separation: $\hat{Y} \perp A \mid Y$

Sufficiency: $Y \perp A \mid \hat{Y}$
A major lesson from the fairness literature is that these criteria are generally not simultaneously satisfiable except under special conditions. That means fairness is not just a matter of choosing one formula; it requires thinking carefully about goals, institutions, social context, and the limits of observational criteria.
30. Causality
Causal reasoning asks not only what is associated, but what would happen under intervention.
The intervention notation is:

$$p\big(y \mid \mathrm{do}(x)\big)$$

When backdoor adjustment is valid:

$$p\big(y \mid \mathrm{do}(x)\big) = \sum_{z} p(y \mid x, z) \, p(z)$$
Causality matters because prediction alone cannot answer interventional questions, policy questions, or many scientific questions. This is one of the most important distinctions between pattern recognition and genuine decision-support intelligence.
That is why I recommend these books as a serious alternative to random course-hopping. Together they form a complete path from fundamentals to advanced AI understanding.