Math & Statistics

Basic Theory Notebook

As an AI researcher, coding skills aside, you really need a solid background in math and statistics. Nobody knows everything, but you need enough intuition to understand how an algorithm works and where its theoretical guarantees come from.

Books and Background

  • Casella for bounds, distributions, and where random variables come from.
  • Strang for linear algebra.
  • Bishop, Statistical Learning, and Boyd for ML and optimization foundations.
  • Linear algebra and its essence, plus probability, statistics, and calculus, are non-negotiable refreshers.

From experience, that background is the minimum needed to understand what is really going on in ML rather than just copying recipes.

What This Page Covers

  • Expected value, variance, and covariance
  • Common families of distributions
  • Useful inequalities
  • Convergence of sequences of random variables
  • Regression

References in the original notes included 36705 Notes by Siva Balakrishnan and 36410 Notes by Siva Balakrishnan.

Expected Value, Variance, and Covariance

These are the basics you should be able to rederive quickly if you claim to know statistics.

Covariance and Correlation derivation

  • Linearity of expectation: $$\mathbb{E}\left(\sum_{j=1}^k c_j g_j(X)\right)=\sum_{j=1}^k c_j \mathbb{E}(g_j(X))$$
  • Variance: $${\sf Var}(X)=\mathbb{E}((X-\mu)^2)=\mathbb{E}(X^2)-\mu^2$$
  • Covariance: $${\sf Cov}(X,Y)=\mathbb{E}(XY)-\mu_X\mu_Y$$ and correlation stays between $-1$ and $1$.
  • Conditional expectation, the law of total expectation, and the law of total variance are basic tools that keep coming back.
  • Sampling reminders matter too: the sample mean, the sample variance with its $n-1$ correction, inverse-CDF sampling (as in LeetCode 528), sample to draw, sampling in 2D, and getting reparameterization right.
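The inverse-CDF reminder above can be sketched in a few lines. This is an illustrative sketch in the spirit of LeetCode 528 (Random Pick with Weight), not code from the original notes; the name `weighted_pick` is mine. The idea: the prefix sums of the weights form a discrete CDF, so drawing a uniform and binary-searching it inverts that CDF.

```python
import bisect
import itertools
import random

def weighted_pick(weights, rng=random.random):
    """Inverse-CDF sampling: return index i with probability w[i] / sum(w).

    Sketch of the LeetCode 528 idea (name `weighted_pick` is hypothetical):
    build the unnormalized discrete CDF via prefix sums, draw
    U ~ Uniform(0, total), and binary-search for the first prefix
    sum strictly greater than U.
    """
    prefix = list(itertools.accumulate(weights))  # unnormalized discrete CDF
    u = rng() * prefix[-1]                        # uniform on [0, total)
    return bisect.bisect_right(prefix, u)

random.seed(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[weighted_pick([1, 2, 7])] += 1
print([c / 10_000 for c in counts])  # roughly [0.1, 0.2, 0.7]
```

For production code you would cache the prefix sums once instead of rebuilding them per draw, which is exactly the class-based setup LeetCode 528 asks for.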

Distributions

  • Normal, chi-squared, Bernoulli, binomial, Poisson, exponential, multinomial, and gamma all appear throughout the original notes.
  • One reason to keep these close is that so many estimators and concentration arguments reduce back to these standard families.
  • The moment generating function is a useful unifying tool: two random variables whose mgfs agree on a neighborhood of zero have the same distribution.
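One way to build intuition for the mgf as a distributional fingerprint is to compare a Monte-Carlo estimate of $\mathbb{E}(e^{tX})$ against the closed form. The sketch below (my own illustration, with an assumed helper name `empirical_mgf`) does this for a Bernoulli($p$), whose mgf is $1-p+pe^t$:

```python
import math
import random

def empirical_mgf(samples, t):
    """Monte-Carlo estimate of M_X(t) = E[exp(t X)]."""
    return sum(math.exp(t * x) for x in samples) / len(samples)

random.seed(1)
p, t = 0.3, 0.5
# 50k Bernoulli(p) draws
xs = [1 if random.random() < p else 0 for _ in range(50_000)]
closed_form = 1 - p + p * math.exp(t)  # Bernoulli(p) mgf at t
print(empirical_mgf(xs, t), closed_form)  # the two should nearly agree
```

The same comparison works for any of the standard families above whenever the mgf exists near zero, which is why so many concentration arguments route through it.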

Useful Inequalities

The original notes link to an inequality cheatsheet, a concentration-inequality video, and concentration-inequality slides.

  • Markov and Chebyshev are the elementary starting points.
  • Chernoff, Hoeffding, and Bernstein help formalize concentration near the mean and in the tails.
  • Jensen, union bounds, and McDiarmid are worth remembering because they appear everywhere.
  • The theory notes also keep reminders on sub-Gaussian intuition, Lipschitz concentration, and U-statistics.

Convergence

  • Almost sure convergence
  • Convergence in probability and the weak law of large numbers
  • Convergence in quadratic mean
  • Convergence in distribution

The practical stack is still: know the definitions, know the intuition, and know what consistency means when you talk about an estimator.
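The weak law of large numbers can be made concrete by estimating $P(|\bar{X}_n - \mu| > \varepsilon)$ for growing $n$ and watching it shrink. A minimal sketch, using Uniform(0,1) draws (so $\mu = 0.5$) and a Monte-Carlo loop of my own choosing:

```python
import random

random.seed(3)
mu, eps, trials = 0.5, 0.05, 2_000

def deviation_prob(n):
    """Monte-Carlo estimate of P(|X̄_n - mu| > eps) for Uniform(0,1) draws."""
    bad = sum(
        abs(sum(random.random() for _ in range(n)) / n - mu) > eps
        for _ in range(trials)
    )
    return bad / trials

probs = [deviation_prob(n) for n in (10, 100, 1000)]
print(probs)  # shrinks toward 0 as n grows
```

This is exactly convergence in probability: for every fixed $\varepsilon$ the deviation probability goes to zero, which is weaker than almost sure convergence but implied by convergence in quadratic mean via Chebyshev.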

Regression