Datasets:

Ujjwal-Tyagi
/

ai-ml-foundations-book-collection

Tasks:

Modalities:

Document

Languages:

English

Size:

< 1K

Tags:

agent

artificial-intelligence

machine-learning

deep-learning

Libraries:

Datasets

License:


Introduction

I put this collection together after spending a lot of time reading what I think are some of the best books on AI, machine learning, deep learning, probabilistic modeling, optimization, reinforcement learning, transformers, LLMs, validation, and fairness. I want to share this with the community for one simple reason: I want to give people a structured path through the books that actually help them understand things deeply, instead of sending them through random courses, disconnected tutorials, and fragmented content.

This collection is my attempt to gather the most important resources I would recommend to anyone who wants to master AI and ML properly: from mathematics and probability, to statistical learning theory, to deep learning, to transformers and LLM engineering, to decision making, robustness, fairness, and causality. Each book is on this list for a reason. Together they form a full ladder from the basics to research-level thinking, covering theory and practice, rigorous math and sound engineering judgment, and both prediction and decision making.

I'm sharing this as a community resource for people who want a real plan. I don't think you should treat these books as separate references; I think you should treat them as parts of one curriculum. Someone who follows the roadmap below with patience and consistency can build a much stronger foundation than most scattered online courses offer.

Why this collection exists

Most people who enter AI and ML do one of three things:

They jump straight into model-building without mathematics.
They consume many short courses but never develop deep conceptual structure.
They know pieces of the field but not the full map.

I wanted to solve exactly that problem.

This collection is for readers who want:

a mathematically grounded path into ML,
a principled understanding of probability and inference,
real clarity on optimization and generalization,
a modern understanding of deep learning and transformers,
a path into RL, uncertainty, validation, and causality,
and a serious alternative to random fragmented learning.

Book-by-book explanation

Mathematics for Machine Learning

For those who wish to overcome their fear of the math involved in machine learning, I think this is one of the best starting points. It connects the precise mathematical underpinnings—linear algebra, analytic geometry, matrix decompositions, calculus, optimization, probability, and statistics—that are most important for machine learning. Additionally, it establishes a direct connection between those tools and fundamental machine learning techniques like SVMs, PCA, linear regression, and Gaussian mixture models. It is therefore perfect as a "mathematical bridge book" between real machine learning and pure mathematics.

Foundations of Machine Learning

This is one of the collection's key theory books. It develops the PAC learning framework, generalization theory, learnability, complexity measures, and formal guarantees. In my opinion, it is the book that teaches readers to reason rigorously about what it means for a model to learn, to generalize, and to be statistically justified. It is a theory-first foundation rather than a practice-first book.

Understanding Machine Learning: From Theory to Algorithms

This book is a great addition to Foundations of Machine Learning. It focuses on how those principles become algorithms while providing a principled explanation of the concepts underlying learning theory. ERM, convexity, stability, stochastic gradient descent, neural networks, structured output learning, and theoretical concepts such as compression-based bounds and PAC-Bayes are all covered. It is among the best "bridge books" between rigorous theory and algorithmic implementation, in my opinion.

Pattern Recognition and Machine Learning

This is one of the classic books on probabilistic machine learning. I have included it because it develops probabilistic intuition at a very deep level: Bayesian methods, graphical models, approximate inference, latent-variable models, kernel methods, and probabilistic pattern recognition. For readers who want to understand machine learning through the lenses of uncertainty, density modeling, and Bayesian reasoning, it remains among the best texts.

Machine Learning: A Probabilistic Perspective

This is one of the collection's most extensive and thorough ML books. It covers the fundamentals of machine learning in a single probabilistic language, combining background math, probability, optimization, linear models, latent-variable models, approximate inference, graphical models, kernel methods, and deep learning. It's among the best "encyclopedic" references for machine learning, in my opinion.

Probabilistic Machine Learning: An Introduction

This is the modern introduction to probabilistic machine learning. It covers foundational topics including probability, multivariate models, statistics, decision theory, information theory, linear algebra, optimization, neural networks, trees, ensembles, clustering, dimensionality reduction, and learning with fewer labels. It is, in my opinion, one of the best contemporary resources for readers seeking a clear and organized probabilistic view of the field.

Probabilistic Machine Learning: Advanced Topics

This is the advanced sequel and one of the most significant books in the entire collection. It covers advanced inference, Bayesian statistics, graphical models, filtering and smoothing, variational inference, Monte Carlo techniques, Bayesian neural networks, Gaussian processes, distribution shift, generative models, representation learning, interpretability, decision making, reinforcement learning, and causality. In my opinion, this is the book where probabilistic machine learning becomes a comprehensive theory of prediction, generation, discovery, and action.

Information Theory, Inference, and Learning Algorithms

This book is special because it sharpens the way we think about entropy, coding, inference, message passing, learning, and compression. It is one of the books that most strongly teaches the unity between information theory and machine learning. Even when it is not directly used in applied pipelines, it changes the way a reader thinks about uncertainty, model selection, and representation.

Convex Optimization

This is the canonical reference for convex sets, convex functions, duality, KKT conditions, and optimization theory. I include it because optimization is not just a tool in ML; it is one of the structural cores of the field. Readers who understand this book well gain a much deeper command over learning objectives, constrained optimization, regularization, dual methods, and algorithmic guarantees.

Algorithms for Optimization

This book complements Convex Optimization by being more algorithm-focused and method-oriented. It covers gradients, bracketing, local descent, first-order methods, second-order methods, stochastic methods, population methods, constrained optimization, duality, LP, QP, and ADMM. I see it as one of the most useful optimization books for practitioners who want algorithmic understanding instead of only theory.

Deep Learning with Python

This is a practical and highly accessible deep learning book. It introduces neural networks, Keras, TensorFlow workflows, computer vision, text, timeseries, generative modeling, and real-world deep learning practices. I include it because it helps readers move from mathematical foundations into real neural network intuition and implementation.

Understanding Deep Learning

This is one of the clearest modern books on deep learning fundamentals. It explains supervised learning, neural networks, deep architectures, loss functions, fitting models, gradient descent, stochastic optimization, initialization, and modern deep learning concepts with unusual clarity. I consider it one of the best deep learning books for readers who want concept-first understanding without sacrificing mathematical structure.

The Principles of Deep Learning Theory

This is a research-oriented theory book focused on initialization, criticality, Gaussian-process limits, finite- and infinite-width analysis, Bayesian inference in neural networks, and the effective theory viewpoint. I include it because it pushes readers beyond practical deep learning into the theory frontier. It is not the first DL book to read, but it is one of the most valuable advanced theory books.

Natural Language Processing with Transformers

This book is one of the strongest practical guides to modern NLP and transformer-based systems. It covers pretrained transformers, Hugging Face workflows, multilingual models, question answering, generation, fine-tuning, deployment, and practical engineering considerations. I consider it essential for moving from general deep learning into modern NLP and transformer applications.

Build a Large Language Model (From Scratch)

This book is valuable because it teaches LLMs by construction. It walks through data preparation, attention, architecture, pretraining, evaluation, loading pretrained weights, and fine-tuning for classifiers and assistants. I include it because building an LLM from scratch is one of the best ways to deeply understand what an LLM really is.

AI Engineering Guidebook

This book focuses on system design patterns for LLMs, RAG, agents, prompt engineering, fine-tuning, and deployment-oriented engineering. I include it because knowing models is not enough; real AI work today also requires understanding inference pipelines, retrieval, orchestration, local deployment, and product architecture.

Reinforcement Learning: An Introduction

This is the foundational RL book. It covers bandits, Markov decision processes, value functions, dynamic programming, Monte Carlo methods, temporal-difference learning, policy gradients, and more. I recommend it as the first serious RL book in the roadmap.

Algorithms for Decision Making

This book connects probabilistic reasoning, inference, utility, MDPs, planning, policy search, policy gradients, actor-critic methods, and policy validation. I see it as a beautiful bridge between probability, planning, and reinforcement learning.

Multi-Agent Reinforcement Learning

This is the modern extension into multi-agent systems and MARL. It combines game-theoretic foundations with contemporary learning methods. I recommend it only after strong single-agent RL understanding, because it adds strategic interaction, coordination, and game structure on top of standard RL ideas.

Algorithms for Validation

This is one of the most important books for trustworthy AI and safety-oriented ML. It covers validation, system modeling, property specification, falsification, reachability, failure estimation, explainability, and runtime monitoring. I include it because average benchmark performance is not enough for serious AI systems.

Fairness and Machine Learning

This book addresses fairness, legitimacy of automated decision making, classification, non-discrimination criteria, and the sociotechnical limits of purely observational fairness frameworks. I include it because real-world ML needs both mathematical and institutional responsibility.

Recommended roadmap

Level 0: Mathematics and optimization foundation

I recommend starting here if the reader wants to build from first principles:

Mathematics for Machine Learning
Convex Optimization
Algorithms for Optimization

This stage builds the language of vectors, matrices, eigendecompositions, gradients, Hessians, constraints, and optimization geometry. Without this stage, later ML understanding often becomes shallow.

Level 1: Core machine learning foundation

After math, I recommend:

Probabilistic Machine Learning: An Introduction
Understanding Machine Learning
Foundations of Machine Learning

This stage gives the reader the foundations of probability, learning theory, empirical risk minimization, statistical thinking, and formal generalization.

Level 2: Probabilistic depth

Then I recommend:

Pattern Recognition and Machine Learning
Machine Learning: A Probabilistic Perspective
Probabilistic Machine Learning: Advanced Topics
Information Theory, Inference, and Learning Algorithms

This stage builds a mature probabilistic worldview. Here the reader learns to think in terms of posterior inference, latent variables, uncertainty, graphical structure, divergence measures, and information flow.

Level 3: Deep learning

Then I recommend:

Deep Learning with Python
Understanding Deep Learning
The Principles of Deep Learning Theory

This stage moves from practice to conceptual depth to research theory. Readers first build intuition, then clean understanding, then more advanced theoretical maturity.

Level 4: NLP, transformers, and LLMs

Then:

Natural Language Processing with Transformers
Build a Large Language Model (From Scratch)
AI Engineering Guidebook

This stage turns deep learning understanding into transformer fluency and finally into LLM systems engineering.

Level 5: Decision making and RL

Then:

Reinforcement Learning: An Introduction
Algorithms for Decision Making
Multi-Agent Reinforcement Learning

This stage extends the reader from prediction into sequential decision making, planning, policy optimization, and strategic multi-agent interaction.

Level 6: Trustworthy and societally grounded AI

Finally:

Algorithms for Validation
Fairness and Machine Learning

This stage matters because building a model is not the same as building a safe, robust, interpretable, or fair system.

My master notes and core concepts from these books

Below I am not trying to rewrite the books in full. These master notes only extract the most important concepts and formulas that I believe form the deep structure across the whole collection.

1. Learning as optimization

A very large fraction of ML can be written as:

$\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell\big(f_{\theta}(x_i),y_i\big)+\lambda\Omega(\theta)$

where:

$f_\theta$ is the model,
$\ell$ is the loss function,
$\Omega(\theta)$ is a regularizer,
$\lambda$ controls complexity.

Linear models, logistic regression, neural networks, transformers, numerous probabilistic models, and even parts of reinforcement learning (through surrogate objectives) all fit this one template.

The idea is that machine learning is more than just "fitting data." It involves optimizing a tradeoff between managing model complexity and fitting the observed data.

2. Empirical risk, expected risk, and generalization

The true objective is not training performance but expected performance on the underlying data distribution:

$R(f)=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x),y)]$

Since $\mathcal{D}$ is unknown, we instead minimize empirical risk:

$\hat{R}_n(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),y_i)$

and perform ERM:

$\hat{f}=\arg\min_{f\in\mathcal{F}}\hat{R}_n(f)$

The central question of learning theory is then:

$\text{How close is }\hat{R}_n(f)\text{ to }R(f)?$

PAC learning, VC dimension, stability, margins, and Rademacher complexity all become significant at this point. The field is about justified generalization rather than merely fitting.

3. Probability as the language of uncertainty

Probability is the language that ties together Bayesian reasoning, inference, decision theory, graphical models, latent-variable models, generative modeling, and uncertainty-aware prediction.

Basic probability identities:

$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}$

$P(x)=\sum_zP(x,z)\quad\text{or}\quad p(x)=\int p(x,z)\,dz$

$p(x,z)=p(x\mid z)p(z)$

The most important conceptual objects are:

prior: $p(z)$
likelihood: $p(x\mid z)$
posterior: $p(z\mid x)$
evidence / marginal likelihood: $p(x)$

Bayesian updating then becomes:

$p(z\mid x)=\frac{p(x\mid z)p(z)}{p(x)}$

This pattern appears across PRML, Murphy's books, MacKay, graphical models, Bayesian neural networks, filtering, and causal inference.

4. Likelihood, MLE, and MAP

Given data $D=\{(x_i,y_i)\}_{i=1}^{n}$ of independent examples, the likelihood is:

$\mathcal{L}(\theta)=p(D\mid\theta)=\prod_{i=1}^{n}p(y_i\mid x_i,\theta)$

Taking logs:

$\log\mathcal{L}(\theta)=\sum_{i=1}^{n}\log p(y_i\mid x_i,\theta)$

Maximum likelihood estimation is:

$\hat{\theta}_{\text{MLE}}=\arg\max_{\theta}\log\mathcal{L}(\theta)$

Maximum a posteriori estimation adds a prior:

$\hat{\theta}_{\text{MAP}}=\arg\max_{\theta}\left[\log p(D\mid\theta)+\log p(\theta)\right]$

Equivalently:

$\hat{\theta}_{\text{MAP}}=\arg\min_{\theta}\left[-\log p(D\mid\theta)-\log p(\theta)\right]$

This is the deep bridge between Bayesian reasoning and regularization:

Gaussian prior $\Rightarrow$ $L_2$ regularization
Laplace prior $\Rightarrow$ $L_1$ regularization
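
To make the Gaussian-prior case concrete, here is a minimal sketch under assumptions that are mine, not the books': observations $y_i\sim\mathcal{N}(\theta,\sigma^2)$ with a prior $\theta\sim\mathcal{N}(0,\tau^2)$. Setting the derivative of the negative log-posterior to zero gives $\hat{\theta}_{\text{MAP}}=\sum_i y_i/(n+\sigma^2/\tau^2)$, so the prior acts like an $L_2$ penalty with strength $\sigma^2/\tau^2$. The function names are illustrative.

```python
def mle_mean(ys):
    """Maximum likelihood estimate of a Gaussian mean: the sample average."""
    return sum(ys) / len(ys)


def map_mean(ys, sigma2=1.0, tau2=1.0):
    """MAP estimate with prior theta ~ N(0, tau2) and noise variance sigma2.

    The ratio sigma2 / tau2 plays the role of an L2 regularization strength.
    """
    lam = sigma2 / tau2
    return sum(ys) / (len(ys) + lam)


ys = [2.0, 4.0, 6.0]
theta_mle = mle_mean(ys)                        # 12 / 3
theta_map = map_mean(ys, sigma2=1.0, tau2=1.0)  # 12 / (3 + 1), shrunk toward 0
```

A tighter prior (smaller `tau2`) shrinks the estimate further toward zero, which is the same behavior as a larger ridge penalty.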

5. Linear regression as the prototype

The simplest but most important predictive model is:

$\hat{y}=w^\top x+b$

with squared loss:

$\mathcal{L}(w,b)=\frac{1}{n}\sum_{i=1}^{n}(y_i-w^\top x_i-b)^2$

Closed-form solution:

$\hat{w}=(X^\top X)^{-1}X^\top y$

Ridge regression:

$\hat{w}=(X^\top X+\lambda I)^{-1}X^\top y$

This model matters because it teaches many of the core ideas of the entire field in a clean setting:

projection geometry,
Gaussian-noise interpretation,
bias-variance tradeoff,
regularization,
conditioning and numerical stability,
Bayesian linear modeling.
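
As a small illustration of the closed-form solutions above, the sketch below fits a one-feature model. It assumes, for simplicity, that the intercept is left unpenalized and that the ridge penalty is applied to the un-normalized sum of squares, so the slope becomes $S_{xy}/(S_{xx}+\lambda)$ on centered data. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys, lam=0.0):
    """Least-squares (lam = 0) or ridge (lam > 0) fit of y = w * x + b.

    Works on centered data so the intercept b is not penalized.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = s_xy / (s_xx + lam)
    b = y_bar - w * x_bar
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # exactly y = 2x + 1
w_ols, b_ols = fit_line(xs, ys)            # recovers slope 2, intercept 1
w_ridge, b_ridge = fit_line(xs, ys, 5.0)   # slope shrinks to 10 / (5 + 5) = 1
```

The ridge fit trades a worse fit on the training points for a smaller slope, which is the regularization tradeoff in its simplest form.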

6. Logistic regression and classification

Binary logistic regression models class probabilities using the sigmoid:

$P(y=1\mid x)=\sigma(w^\top x+b)$

$\sigma(z)=\frac{1}{1+e^{-z}}$

The negative log-likelihood is the binary cross-entropy loss:

$\mathcal{L}(w,b)=-\sum_{i=1}^{n}\left[y_i\log\hat{p}_i+(1-y_i)\log(1-\hat{p}_i)\right]$

Multiclass logistic regression uses softmax:

$P(y=k\mid x)=\frac{e^{z_k}}{\sum_je^{z_j}}$

with loss:

$\mathcal{L}=-\sum_{i=1}^{n}\log P(y_i\mid x_i)$

This is one of the most important bridges in all of ML because it connects probability, classification, linear models, gradient-based optimization, and neural network output layers.
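
The formulas in this section can be sketched in a few lines of plain Python. The function names are illustrative, and the max-subtraction inside `softmax` is a standard numerical-stability trick that is not part of the formula itself. The last lines check that a two-class softmax with logits $(z, 0)$ reduces to the sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)                                # subtract the max for stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(ys, ps):
    """Negative log-likelihood of labels ys under predicted probabilities ps."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps))


p_sigmoid = sigmoid(1.3)
p_softmax = softmax([1.3, 0.0])[0]             # same value as p_sigmoid
loss = binary_cross_entropy([1, 0], [0.5, 0.5])  # 2 * log(2)
```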

7. The exponential family

Many widely used distributions can be written as:

$p(x\mid\eta)=h(x)\exp\big(\eta^\top T(x)-A(\eta)\big)$

where:

$T(x)$ are sufficient statistics,
$\eta$ are natural parameters,
$A(\eta)$ is the log-partition function.

This family matters because it unifies Bernoulli, Gaussian, Poisson, categorical, and many more. It also lies underneath GLMs, conjugacy, variational inference, message passing, and natural gradients.

8. Information theory as a master layer

Entropy:

$H(X)=-\sum_xp(x)\log p(x)$

Cross-entropy:

$H(p,q)=-\sum_xp(x)\log q(x)$

KL divergence:

$D_{\mathrm{KL}}(p\|q)=\sum_xp(x)\log\frac{p(x)}{q(x)}$

Mutual information:

$I(X;Y)=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

These ideas are not peripheral. They are central across the books:

cross-entropy is the standard classification loss,
KL is central in VI, distillation, and approximate Bayes,
entropy measures uncertainty and exploration,
mutual information appears in representation learning and bottleneck methods,
coding and compression ideas connect learning to information structure.
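
The three discrete quantities above are tied together by the identity $H(p,q)=H(p)+D_{\mathrm{KL}}(p\|q)$. The sketch below checks it numerically on two made-up distributions; the function names and the example values are assumptions for illustration only.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
# Cross-entropy = entropy + KL, so minimizing cross-entropy over q
# is the same as minimizing KL(p || q) when p is fixed.
identity_gap = h_pq - (h_p + kl_pq)
```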

9. Bias–variance tradeoff

In a simplified regression view:

$\mathbb{E}\big[(y-\hat{f}(x))^2\big]=\text{Bias}^2+\text{Variance}+\text{Noise}$

This gives one of the most important conceptual lessons in ML:

small models underfit because of high bias,
very flexible models can overfit because of high variance,
regularization and inductive bias control the tradeoff.

This pattern reappears in linear models, trees, kernels, ensembles, and deep neural networks.

10. Convexity and optimization geometry

A function $f$ is convex if, for all $x$ and $y$ in its domain:

$f(\theta x+(1-\theta)y)\leq\theta f(x)+(1-\theta)f(y)\quad\forall\theta\in[0,1]$

For convex differentiable functions:

$f(y)\geq f(x)+\nabla f(x)^\top(y-x)$

Convexity matters because local minima are global minima, optimization is more stable, and duality becomes powerful.

For constrained optimization:

$\min_xf(x)\quad\text{s.t. }g_i(x)\leq0,\;h_j(x)=0$

the Lagrangian is:

$\mathcal{L}(x,\lambda,\nu)=f(x)+\sum_i\lambda_ig_i(x)+\sum_j\nu_jh_j(x)$

and KKT conditions become fundamental:

$g_i(x^\star)\leq0,\quad h_j(x^\star)=0$

$\lambda_i^\star\geq0$

$\lambda_i^\star g_i(x^\star)=0$

$\nabla_x\mathcal{L}(x^\star,\lambda^\star,\nu^\star)=0$

These ideas sit underneath SVMs, constrained estimation, dual optimization, and many optimization-based ML methods.
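
A tiny worked example may help show how the KKT conditions are used in practice. The problem is my own illustrative choice, not one taken from the books: minimize $x^2$ subject to $x\geq1$, written as $g(x)=1-x\leq0$.

$\mathcal{L}(x,\lambda)=x^2+\lambda(1-x)$

Stationarity gives:

$\nabla_x\mathcal{L}=2x-\lambda=0\;\Rightarrow\;\lambda=2x$

Complementary slackness requires $\lambda(1-x)=0$. If $\lambda=0$, then $x=0$, which violates $1-x\leq0$. So the constraint must be active:

$x^\star=1,\qquad\lambda^\star=2\geq0$

All four conditions hold, and because the problem is convex, this point is the global minimum. The multiplier $\lambda^\star=2$ also measures how fast the optimal value would change if the constraint were relaxed.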

11. Gradient descent, SGD, momentum, and Adam

Standard gradient descent:

$\theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}(\theta_t)$

Mini-batch stochastic gradient descent:

$\theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}_{\mathcal{B}_t}(\theta_t)$

Momentum:

$v_{t+1}=\beta v_t+\nabla_\theta\mathcal{L}(\theta_t),\qquad\theta_{t+1}=\theta_t-\eta v_{t+1}$

Adam:

$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$

$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$

$\hat{m}_t=\frac{m_t}{1-\beta_1^t},\qquad\hat{v}_t=\frac{v_t}{1-\beta_2^t}$

$\theta_t=\theta_{t-1}-\eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$

These are the workhorse update rules behind modern deep learning and a large fraction of practical ML.
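
The Adam equations above translate almost line by line into code. This is a minimal scalar sketch, assuming the commonly used default values for $\beta_1$, $\beta_2$, and $\epsilon$ and an illustrative objective $(\theta-3)^2$; the function name `adam_minimize` is hypothetical.

```python
import math


def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run scalar Adam for a fixed number of steps and return the final theta."""
    m = 0.0  # first-moment (mean) estimate
    v = 0.0  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Gradient of (theta - 3)^2 is 2 * (theta - 3); the minimizer is theta = 3.
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Dropping the bias correction and the division by $\sqrt{\hat{v}_t}$ recovers an exponentially averaged form of momentum, which is one way to see how the four update rules relate.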

12. Neural networks as compositional function approximators

A feedforward neural network composes affine maps and nonlinearities:

$h^{(1)}=\phi(W^{(1)}x+b^{(1)})$

$h^{(l)}=\phi(W^{(l)}h^{(l-1)}+b^{(l)})$

$\hat{y}=W^{(L)}h^{(L-1)}+b^{(L)}$

The real power of deep learning is not just "many parameters." It is hierarchical representation learning through compositional structure.

Backpropagation is the chain rule applied efficiently across this composition:

$\frac{\partial\mathcal{L}}{\partial x}=\frac{\partial\mathcal{L}}{\partial f}\frac{\partial f}{\partial g}\frac{\partial g}{\partial h}\frac{\partial h}{\partial x}$

For a dense layer $z=Wx+b$, $a=\phi(z)$:

$\frac{\partial\mathcal{L}}{\partial W}=\delta x^\top$

$\delta=\frac{\partial\mathcal{L}}{\partial z}=\frac{\partial\mathcal{L}}{\partial a}\odot\phi'(z)$
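
One way to build trust in these formulas is a finite-difference gradient check. The sketch below assumes a single output unit with $\phi=\tanh$ and a squared-error loss $\tfrac{1}{2}(a-y)^2$; these choices and all names are illustrative.

```python
import math


def forward(W, b, x, y):
    """Return (loss, activation) for one tanh unit with squared-error loss."""
    z = sum(w * xi for w, xi in zip(W, x)) + b
    a = math.tanh(z)
    return 0.5 * (a - y) ** 2, a


def backward(W, b, x, y):
    """Analytic gradient: delta = dL/da * phi'(z), and dL/dW = delta * x^T."""
    _, a = forward(W, b, x, y)
    delta = (a - y) * (1.0 - a * a)    # tanh'(z) = 1 - tanh(z)^2
    return [delta * xi for xi in x]


def numeric_grad(W, b, x, y, h=1e-6):
    """Central finite differences with respect to each weight."""
    grads = []
    for j in range(len(W)):
        w_plus = list(W)
        w_minus = list(W)
        w_plus[j] += h
        w_minus[j] -= h
        diff = forward(w_plus, b, x, y)[0] - forward(w_minus, b, x, y)[0]
        grads.append(diff / (2 * h))
    return grads


W, b, x, y = [0.3, -0.2], 0.1, [1.5, -0.7], 0.4
analytic = backward(W, b, x, y)
numeric = numeric_grad(W, b, x, y)
max_error = max(abs(g1 - g2) for g1, g2 in zip(analytic, numeric))
```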

13. Initialization and trainability

Deep networks are sensitive to activation and gradient scales. Good initialization helps preserve signal and gradient flow.

Xavier / Glorot initialization:

$\mathrm{Var}(W)\approx\frac{2}{n_{\text{in}}+n_{\text{out}}}$

He initialization:

$\mathrm{Var}(W)\approx\frac{2}{n_{\text{in}}}$

This matters because training failure often comes not from the optimizer alone, but from bad signal propagation through depth. That is one reason the theory books on deep learning pay so much attention to initialization, criticality, and scaling.

14. Variational inference and the ELBO

When exact posterior inference is intractable, variational inference approximates it with a tractable family $q(z)$.

The key identity is:

$\log p(x)=\mathcal{L}(q)+D_{\mathrm{KL}}\big(q(z)\,\|\,p(z\mid x)\big)$

where the evidence lower bound is:

$\mathcal{L}(q)=\mathbb{E}_{q(z)}[\log p(x,z)-\log q(z)]$

Equivalently:

$\mathcal{L}(q)=\mathbb{E}_{q(z)}[\log p(x\mid z)]-D_{\mathrm{KL}}(q(z)\,\|\,p(z))$

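Since $\log p(x)$ does not depend on $q$ and KL is nonnegative, the ELBO is a lower bound on $\log p(x)$, and maximizing it is equivalent to minimizing the KL divergence from $q(z)$ to the true posterior.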

This one framework powers:

latent-variable models,
VAEs,
amortized inference,
Bayesian deep learning,
large-scale approximate Bayesian methods.
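
The decomposition of $\log p(x)$ into ELBO plus KL can be verified exactly on a toy model where everything is computable. The sketch below assumes a binary latent variable with made-up prior and likelihood values for one fixed observation; all numbers and names are illustrative.

```python
import math

prior = [0.6, 0.4]        # p(z) for z in {0, 1}
likelihood = [0.2, 0.7]   # p(x | z) for the single observed x

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                     # p(x)
posterior = [j / evidence for j in joint]                 # p(z | x)


def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete q."""
    return sum(qz * (math.log(j) - math.log(qz))
               for qz, j in zip(q, joint) if qz > 0)


def kl(q, p):
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)


q = [0.5, 0.5]
gap = math.log(evidence) - (elbo(q) + kl(q, posterior))  # should be ~0
best = elbo(posterior)   # the bound is tight when q equals the posterior
```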

15. Gaussian processes and function-space thinking

A Gaussian process defines a distribution over functions:

$f\sim\mathcal{GP}(m(x),k(x,x'))$

where $m(x)$ is the mean function and $k(x,x')$ is the kernel.

The conceptual leap here is powerful: instead of putting uncertainty over parameters, I can put uncertainty directly over functions.

Gaussian processes matter because they teach:

uncertainty-aware prediction,
the role of kernels,
Bayesian function-space inference,
the relationship between infinite-width networks and kernel limits.

16. Bayesian neural networks and predictive uncertainty

In Bayesian neural networks, I put a posterior over parameters:

$p(\theta\mid\mathcal{D})$

and then predictive uncertainty becomes:

$p(y_\ast\mid x_\ast,\mathcal{D})=\int p(y_\ast\mid x_\ast,\theta)\,p(\theta\mid\mathcal{D})\,d\theta$

This is one of the cleanest ways to represent epistemic uncertainty in neural prediction. Approximation strategies include VI, Laplace approximations, MCMC, dropout-based approximations, and deep ensembles.

17. Attention and transformers

The core transformer mechanism is scaled dot-product attention:

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

Self-attention uses:

$Q=XW^Q,\qquad K=XW^K,\qquad V=XW^V$

Multi-head attention computes several attention maps in parallel:

$\mathrm{MHA}(Q,K,V)=\mathrm{Concat}(\text{head}_1,\dots,\text{head}_H)W^O$

The key conceptual lesson is that attention lets a model dynamically route information based on relevance rather than fixed local structure. This is one reason transformers generalized so strongly across text, vision, and multimodal systems.
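
Scaled dot-product attention is small enough to write out with plain lists. This is a readability-first sketch, not an efficient implementation: matrices are lists of rows, there is no masking or batching, and the function names are my own.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)      # one weight per key/value pair
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention([[0.0, 0.0]], K, V)    # zero query: averages the values
focused = attention([[50.0, 0.0]], K, V)   # query aligned with the first key
```

The two calls show the routing behavior: a query with no preference averages all values, while a query that strongly matches one key returns essentially that key's value.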

18. Language modeling and LLMs

Autoregressive language modeling factors a sequence as:

$P(x_{1:T})=\prod_{t=1}^{T}P(x_t\mid x_{<t})$

Training objective:

$\mathcal{L}_{\text{LM}}=-\sum_{t=1}^{T}\log P_\theta(x_t\mid x_{<t})$

Perplexity:

$\mathrm{PPL}=\exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P_\theta(x_t\mid x_{<t})\right)$

This is the master probabilistic formulation behind GPT-style pretraining. LLMs are best understood as large autoregressive probabilistic sequence models trained at scale.
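
Perplexity is easy to compute once the per-token probabilities are available. The sketch below assumes those probabilities have already been produced by some model (the list is a made-up input), and checks the useful sanity case that a uniform guess over four tokens yields a perplexity of four.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


ppl_uniform = perplexity([0.25] * 8)       # uniform over 4 choices
ppl_confident = perplexity([0.9, 0.8, 0.95])
```

Perplexity can be read as an effective branching factor: the model is as uncertain as if it were choosing uniformly among that many tokens at each step.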

19. Fine-tuning and distillation

Supervised fine-tuning objective:

$\mathcal{L}_{\text{SFT}}=-\sum_{(x,y)}\log P_\theta(y\mid x)$

Knowledge distillation:

$\mathcal{L}_{\text{KD}}=(1-\alpha)\mathcal{L}_{\text{hard}}+\alpha T^2 D_{\mathrm{KL}}\!\left(p_T^{\text{teacher}}\,\|\,p_T^{\text{student}}\right)$

The conceptual point is that a student model can learn not only ground-truth labels but also the teacher's richer soft distribution over outputs.

20. Retrieval-augmented generation and AI engineering

In a simplified RAG pipeline:

$d^\star=\arg\max_{d\in\mathcal{D}}\mathrm{sim}(q,d)$

$P(y\mid q)\approx P(y\mid q,d^\star)$

This shows the core architecture idea:

model weights contain parametric memory,
the retriever/index provides nonparametric memory.

Modern LLM systems often perform best when both are combined. That is why AI engineering is not just about model size or fine-tuning, but also about retrieval, orchestration, evaluation, and system design.

21. Markov decision processes and RL

A Markov decision process is:

$(\mathcal{S},\mathcal{A},P,R,\gamma)$

Return:

$G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$

State-value function:

$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]$

Action-value function:

$Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a]$

Bellman expectation equation:

$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V^\pi(s')\big]$

Bellman optimality equation:

$V^\star(s)=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V^\star(s')\big]$

These equations form the mathematical backbone of RL and sequential decision making.
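
Applying the Bellman optimality equation repeatedly as an update gives value iteration. The sketch below runs it on a made-up two-state MDP with deterministic transitions; the MDP, the discount $\gamma=0.5$, and the data layout (a list of `(probability, next_state, reward)` tuples per action) are all illustrative assumptions.

```python
def value_iteration(mdp, gamma, iters=100):
    """Repeatedly apply the Bellman optimality backup to every state."""
    values = {s: 0.0 for s in mdp}
    for _ in range(iters):
        new_values = {}
        for s, actions in mdp.items():
            new_values[s] = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        values = new_values
    return values


# mdp[state][action] = list of (probability, next_state, reward)
mdp = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
v_star = value_iteration(mdp, gamma=0.5)
# Staying in state 1 forever is worth 2 / (1 - 0.5) = 4,
# and moving there from state 0 is worth 1 + 0.5 * 4 = 3.
```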

22. Temporal-difference learning and policy gradients

TD value update:

$V(S_t)\leftarrow V(S_t)+\alpha\Big(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\Big)$

Q-learning:

$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha\Big(R_{t+1}+\gamma\max_aQ(S_{t+1},a)-Q(S_t,A_t)\Big)$

Policy gradient objective:

$J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$

REINFORCE gradient:

$\nabla_\theta J(\theta)=\mathbb{E}\!\left[\sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t\right]$

These formulas explain the main split in RL:

value-based learning,
policy-based learning,
actor-critic hybrids.

23. Exploration vs exploitation

In bandits and RL, the system must balance using what it knows and discovering what it does not know.

One canonical exploration rule is UCB:

$A_t=\arg\max_a\left[\hat{Q}_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right]$

This captures a deep principle: act according to both current value estimate and uncertainty bonus.
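
The UCB rule can be sketched as a single selection function. Trying every untested arm once before applying the formula is a common convention that I am assuming here to avoid dividing by zero; the function name and the example numbers are illustrative.

```python
import math


def ucb_action(q_hat, counts, t, c=1.0):
    """Pick the arm maximizing value estimate plus an uncertainty bonus."""
    for arm, n in enumerate(counts):
        if n == 0:                     # assumption: try each arm once first
            return arm
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_hat, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


q_hat = [0.5, 0.4]     # arm 0 looks slightly better on average
counts = [100, 1]      # but arm 1 has barely been tried
explore = ucb_action(q_hat, counts, t=101, c=1.0)   # bonus favors arm 1
exploit = ucb_action(q_hat, counts, t=101, c=0.0)   # no bonus: greedy arm 0
```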

Another key idea is Thompson sampling: sample from the posterior and act optimally under the sample. This naturally links Bayesian uncertainty to exploration.

24. Decision theory

Prediction is not enough; action depends on utility.

Expected utility principle:

$a^\star=\arg\max_a\mathbb{E}[U\mid a]$

or equivalently with losses:

$a^\star=\arg\min_a\mathbb{E}[L(a,\theta)\mid x]$

This is one of the deepest ideas in the collection. Many practical systems fail because they optimize prediction accuracy without explicitly reasoning about utility, cost, risk, and downstream decisions.

25. Distribution shift and robustness

One of the most important advanced lessons in modern ML is that train and test distributions often differ.

Under covariate shift:

$R(f)=\mathbb{E}_{(x,y)\sim p_{\text{test}}}[\ell(f(x),y)]=\mathbb{E}_{(x,y)\sim p_{\text{train}}}\!\left[\frac{p_{\text{test}}(x)}{p_{\text{train}}(x)}\ell(f(x),y)\right]$

This leads to reweighting strategies, adaptation, and robust training ideas.

Modern ML needs to account for:

covariate shift,
label shift,
domain adaptation,
continual learning,
OOD detection,
adversarial examples.

26. Generative modeling taxonomy

Modern generative AI can be seen through several major families:

Diffusion models

Forward corruption:

$q(x_t\mid x_{t-1})$

Learned reverse denoising:

$p_\theta(x_{t-1}\mid x_t)$

GANs

A generator and discriminator are trained in opposition through adversarial objectives.

The key lesson is that generative modeling is not one technique. It is an ecosystem of probabilistic modeling strategies with different tradeoffs in likelihood, sample quality, inference, and training stability.

27. Representation learning

Representation learning is about learning $z=f_\theta(x)$ such that useful structure is preserved and nuisance variation is compressed.

This can be done through:

supervised learning,
self-supervised learning,
generative modeling,
multiview learning,
bottleneck objectives.

A strong representation is not just a compressed vector; it is a structure-preserving abstraction that improves downstream learning and transfer.

28. Validation and rare-event thinking

Robust system evaluation requires more than aggregate accuracy.

For rare failure estimation, importance sampling plays a key role:

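$\hat{p}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[x_i\in\mathcal{F}]\frac{p(x_i)}{q(x_i)},\qquad x_i\sim q$

where $\mathcal{F}$ is the failure set, $p$ is the nominal distribution, and $q$ is a proposal distribution chosen to make failures more likely.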

The larger lesson is that trustworthy AI must account for:

rare failures,
adversarial behavior,
reachability of unsafe states,
runtime monitoring,
property violation,
explainability and post-deployment safety.
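
To see why importance sampling matters for rare failures, consider estimating $P(X>4)$ for a standard normal $X$, which is about $3\times10^{-5}$. Naive sampling with 20,000 draws would usually see no failures at all. The sketch below assumes a proposal $\mathcal{N}(4,1)$ centered on the failure boundary; the setup and names are my own illustration, not an example from the book.

```python
import math
import random


def rare_event_prob(threshold=4.0, n=20000, seed=0):
    """Importance-sampling estimate of P(X > threshold) for X ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)          # sample from the proposal q
        if x > threshold:                      # indicator of the failure set
            # Density ratio p(x) / q(x) for p = N(0, 1), q = N(threshold, 1)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


estimate = rare_event_prob()
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form Gaussian tail
relative_error = abs(estimate - exact) / exact
```

Roughly half of the proposal samples land in the failure set, and the density ratio corrects for the oversampling, so the estimate stays accurate even though the event is rare under the nominal distribution.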

29. Fairness criteria

Three central statistical fairness notions are:

Independence

$\hat{Y}\perp A$

Separation

$\hat{Y}\perp A\mid Y$

Sufficiency

$Y\perp A\mid\hat{Y}$

A major lesson from the fairness literature is that these criteria are generally not simultaneously satisfiable except under special conditions. That means fairness is not just a matter of choosing one formula; it requires thinking carefully about goals, institutions, social context, and the limits of observational criteria.
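
A small made-up dataset shows how a classifier can satisfy one criterion while violating another. In the sketch below each record is `(group, true_label, prediction)`; the data and function names are hypothetical. Comparing true positive rates is only one half of separation (false positive rates must match as well), so this is a partial check.

```python
def rate(values):
    return sum(values) / len(values)


def selection_rate(records, group):
    """P(prediction = 1 | group): compared across groups for independence."""
    return rate([pred for a, y, pred in records if a == group])


def true_positive_rate(records, group):
    """P(prediction = 1 | label = 1, group): one half of separation."""
    return rate([pred for a, y, pred in records if a == group and y == 1])


records = [
    (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 0, 0),
]
independence_gap = abs(selection_rate(records, 0) - selection_rate(records, 1))
tpr_gap = abs(true_positive_rate(records, 0) - true_positive_rate(records, 1))
```

Both groups are selected at the same rate, so independence holds, yet qualified members of group 0 are accepted only half as often as those of group 1.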

30. Causality

Causal reasoning asks not only what is associated, but what would happen under intervention.

The intervention notation is:

$p(y\mid do(x))$

When backdoor adjustment is valid:

$p(y\mid do(x))=\sum_zp(y\mid x,z)p(z)$

Causality matters because prediction alone cannot answer interventional questions, policy questions, or many scientific questions. This is one of the most important distinctions between pattern recognition and genuine decision-support intelligence.
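
The gap between conditioning and intervening can be shown with a toy discrete model. The sketch below assumes a single binary confounder $z$ that influences both treatment $x$ and outcome $y$, with made-up probability tables; it compares the observational $p(y=1\mid x=1)$ with the backdoor-adjusted $p(y=1\mid do(x=1))$.

```python
p_z = {0: 0.5, 1: 0.5}               # p(z)
p_x1_given_z = {0: 0.2, 1: 0.8}      # p(x = 1 | z)
p_y1_given_x1_z = {0: 0.3, 1: 0.7}   # p(y = 1 | x = 1, z)


def interventional():
    """Backdoor adjustment: sum_z p(y | x, z) p(z)."""
    return sum(p_y1_given_x1_z[z] * p_z[z] for z in p_z)


def observational():
    """Plain conditioning: sum_z p(y | x, z) p(z | x)."""
    p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
    return sum(p_y1_given_x1_z[z] * p_x1_given_z[z] * p_z[z] / p_x1
               for z in p_z)


p_do = interventional()    # weights z by its population frequency
p_obs = observational()    # weights z by how often it co-occurs with x = 1
```

The observational value is inflated because units with $z=1$ are both more likely to receive $x=1$ and more likely to have $y=1$; the adjustment removes that confounded path.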

That is why I recommend these books as a serious alternative to random course-hopping. Together they form a complete path from fundamentals to advanced AI understanding.
