
Basic Theory Notebook and pointers

Intro

As a researcher working on A.I. (apologies for using the buzzword), one really has to be equipped with a solid background in math and stats, on top of coding skills. Nobody, except perhaps senior PhDs in math or stats, can claim to know everything, but at the very least we need the basic intuitions to understand how an algorithm works, or where a theoretical guarantee comes from. Here are the 7 books I have read.

  1. The Casella book is a must; it is where you learn your bounds, your distributions, and where random variables come from.
  2. Strang's Linear Algebra doesn't need much explanation from me.
  3. This "Green book" is really a good pointer to all the basics.
  4. The Bishop book is the bible of ML.
  5. The Boyd book is probably the most practical optimization book out there; the Rockafellar one is too hard for me.
  6. The Statistical Learning book is also famously fundamental.
  7. The Goodfellow book is new, but not necessarily the best-written one.
From my experience, we need at least the following background to understand what's going on:
  • Linear Algebra and its Essence
  • Probability
  • Statistics
  • Calculus (Of course)
  • As a result, before I go on the job market, there is a ton of work for me to do to keep my memory fresh. Here's what this page covers:
    1. Expected Value, Variance and Covariance
    2. Common Families of Distributions
    3. Useful Inequalities
    4. Convergence of Sequences
    5. Regression
    References used in this notebook:
  • 36705 Notes by Siva Balakrishnan
  • 36410 Notes by Siva Balakrishnan

    Expected Value, Variance and Covariance

    If you claim you know stats and you flounder at any of the questions below... great embarrassment.

    Covariance and Correlation derivation
  • Linearity of Expectations $$ \mathbb{E} ( \sum_{j=1}^k c_j g_j(X) )= \sum_{j=1}^k c_j \mathbb{E} (g_j(X))$$
  • Variance: $\sigma^2={\sf Var}\left(X\right) = \mathbb{E}\left((X-\mu)^2\right) = \mathbb{E}\left(X^2\right) - \mu^2,$ where $\mu = \mathbb{E}(X)$.
    If $X_1, \ldots, X_n$ are independent then: $ {\sf Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_i a_i^2 {\sf Var}\left(X_i\right). $
  • The covariance is $${\sf Cov}(X,Y) = \mathbb{E}((X-\mu_X)(Y-\mu_Y)) = \mathbb{E}(XY)- \mu_X \mu_Y $$ and the correlation is $\rho(X,Y)= {\sf Cov}(X,Y)/(\sigma_X\sigma_Y)$. Recall that $-1 \leq \rho(X,Y) \leq 1$.
  • The conditional expectation of $Y$ given $X$ is the random variable $\mathbb{E}(Y|X)$ whose value, when $X=x$ is $$ \mathbb{E}(Y|X=x) = \int y \ p(y|x)dy $$ where $p(y|x) = \frac{p(x,y)}{p(x)}$.
  • Law of Total Expectation or Law of Iterated Expectation : $$ \mathbb{E}(Y)= \mathbb{E}\bigl[\mathbb{E}(Y|X)\bigr] = \int \mathbb{E}(Y|X=x) p_X(x)dx. $$ The Law of Total Variance is $$ {\sf Var}(Y) = {\sf Var}\bigl[ \mathbb{E}(Y|X)\bigr] + \mathbb{E}\bigl[ {\sf Var}(Y|X)\bigr]. $$
  • Sampling:
    The sample mean is $$ \widehat{\mu}_n = \frac{1}{n}\sum_i X_i. $$ Why does the sample variance have a divisor of $(n-1)$? $$ \widehat{\sigma}^2_n= \frac{1}{n-1} \sum_i (X_i- \widehat{\mu}_n)^2. $$ The sampling distribution of $\widehat{\mu}_n$ is $$ G_n(t) = \mathbb{P}(\widehat{\mu}_n \leq t). $$ LeetCode 528 is about inverse-CDF sampling: draw $U_1,\cdots,U_n$ independently with distribution $U[0,1]$ and set $X_i = F^{-1}(U_i)$ (see the inverse-CDF sketch right after this list).
    Sample to draw
    Reparameterization trick done right
  • Sampling in 2D example from twitter
  • Gibbs Sampling: The idea in Gibbs sampling is that we start in some state $(x_1, \ldots, x_n)$ and run the following algorithm (see the Gibbs sketch after this list):
    1. Choose a coordinate to update, uniformly at random, i.e. with probability $\frac{1}{n}$ we select coordinate $j$.
    2. Update its value by sampling from the conditional distribution keeping all other coordinates fixed, i.e. we sample a new value for $x_j$: $x_j \sim \pi(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)$.
  • The moment generating function (mgf) is $$ M_X(t) = \mathbb{E}\left(e^{tX}\right). $$ If $M_X(t)=M_Y(t)$ for all $t$ in an interval around 0 then $X\stackrel{d}{=}Y.$ We can take derivatives of the mgf with respect to $t$ and evaluate at $t = 0$, i.e. we have that $ M^{(n)}_X(t)|_{t=0} = \mathbb{E}\left(X^n\right). $
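Below is a minimal sketch of the inverse-CDF (inverse transform) sampling idea referenced in the sampling bullet above. The exponential target, its scale, and the sample size are arbitrary illustration choices, not anything prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF sampling for Exponential(beta):
# F(x) = 1 - exp(-x / beta)  =>  F^{-1}(u) = -beta * log(1 - u).
beta = 2.0                     # arbitrary scale for illustration
n = 100_000
u = rng.uniform(size=n)        # U_1, ..., U_n ~ U[0, 1], independent
x = -beta * np.log(1.0 - u)    # X_i = F^{-1}(U_i)

# Sanity check: an Exponential(beta) has mean beta and variance beta^2.
print(x.mean(), x.var())       # ~2.0 and ~4.0
```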
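And here is a minimal Gibbs sampler in the same spirit as the algorithm above, for a toy target: a bivariate standard Gaussian with correlation rho, whose full conditionals are themselves Gaussian. The correlation, chain length, and starting state are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditional: X_j | X_other = x  ~  N(rho * x, 1 - rho^2).
rho = 0.8
n_steps = 50_000
x = np.zeros(2)                      # arbitrary starting state
samples = np.empty((n_steps, 2))

for t in range(n_steps):
    j = rng.integers(2)              # pick a coordinate uniformly at random
    x[j] = rng.normal(rho * x[1 - j], np.sqrt(1 - rho**2))
    samples[t] = x

# After burn-in, the empirical correlation should be close to rho.
print(np.corrcoef(samples[1_000:].T)[0, 1])
```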
    Common Families of Distributions

  • Normal (Gaussian) $X\sim N(\mu,\sigma^2)$ if $$ p(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}. $$ If $X \in \mathbb{R}^d$ then $X\sim N(\mu,\Sigma)$ if $$ p(x) = \frac{1}{ (2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( - \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right). $$ Then $\mathbb{E}(X) = \mu$ and ${\sf cov}(X) = \Sigma$. The moment generating function is $$ M(t) = \exp\left( \mu^T t + \frac{t^T \Sigma t}{2}\right). $$
  • Chi-squared $X\sim \chi^2_p$ if $X = \sum_{j=1}^p Z_j^2$ where $Z_1,\ldots, Z_p\sim N(0,1)$. Non-central chi-squared: $X \sim \chi^2_1(\mu^2)$ if $X = Z^2$ where $Z \sim N(\mu,1)$.
  • Bernoulli. $X\sim {\rm Bernoulli}(\theta)$ if $\mathbb{P}(X=1)=\theta$ and $\mathbb{P}(X=0)=1-\theta$ and hence $$ p(x)= \theta^x (1-\theta)^{1-x}\ \ \ \ \ \ x=0,1. $$
  • Binomial. $X\sim {\rm Binomial}(n,\theta)$ if $$ p(x)=\mathbb{P}(X=x) = \binom{n}{x} \theta^x (1-\theta)^{n-x} \ \ \ \ \ \ x\in\{0,1,\ldots, n\}. $$
  • Poisson $X\sim {\rm Poisson}(\lambda)$ if $P(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}$, $x=0,1,2,\ldots$. Then $\mathbb{E}\left(X\right) = {\sf Var}\left(X\right) = \lambda$ and $M_X(t) = e^{\lambda({e^t}-1)}$. We can use the mgf to show: if $X_1 \sim {\rm Poisson}(\lambda_1)$ and $X_2 \sim {\rm Poisson}(\lambda_2)$ are independent, then $Y=X_1 + X_2 \sim {\rm Poisson} (\lambda_1 + \lambda_2)$ (a quick simulation check appears after this list).
  • Exponential $X\sim {\rm exp}(\beta)$ if $p_X(x) = \frac{1}{\beta} e^{-x/\beta}$, $x>0$. Note that ${\rm exp}(\beta)=\Gamma(1,\beta)$.
  • Multinomial. The multivariate version of a Binomial is called a Multinomial. Consider drawing a ball from an urn that has balls of $k$ different colors labeled "color 1, color 2, $\ldots$, color $k$." Let $p=(p_1,p_2,\ldots,p_k)$ where $\sum_j p_j =1$ and $p_j$ is the probability of drawing color $j$. Draw $n$ balls from the urn (independently and with replacement) and let $X=(X_1,X_2,\ldots,X_k)$ be the count of the number of balls of each color drawn. We say that $X$ has a Multinomial $(n,p)$ distribution. The pmf is $$ p(x) = {n \choose {x_1,\ldots,x_k}} p_1^{x_1}\ldots p_k^{x_k}. $$
  • Gamma $X \sim \Gamma(\alpha,\beta)$ if $$ p_X(x) = \frac 1{\Gamma(\alpha) \beta^\alpha} x^{\alpha-1} e^{-x/\beta} $$ for $x>0$ where $\Gamma(\alpha) = \int_0^\infty \frac 1{\beta^\alpha} x^{\alpha-1} e^{-x/\beta} dx$.
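As a quick sanity check on two of the facts above (the chi-squared construction from squared standard normals, and the Poisson additivity shown via the mgf), here is a small simulation sketch; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Chi-squared with p degrees of freedom: sum of p squared N(0,1) draws.
# Its mean is p and its variance is 2p.
p = 5
chi2 = (rng.normal(size=(n, p)) ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())   # ~5 and ~10

# Poisson additivity: Poisson(l1) + Poisson(l2) behaves like Poisson(l1 + l2),
# so its mean and variance should both be close to l1 + l2.
l1, l2 = 1.5, 2.5
y = rng.poisson(l1, size=n) + rng.poisson(l2, size=n)
print(y.mean(), y.var())         # both ~4.0
```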
    Useful Inequalities

      Inequality Cheatsheet
      Concentration Inequality short video
      Concentration Inequality Slides
    1. Markov's Inequality. The most elementary tail bound is Markov's inequality, which asserts that for a non-negative random variable $X \geq 0$ (it applies only to non-negative random variables!), \begin{align*} \mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}. \end{align*}
    2. Chebyshev's Inequality. Chebyshev's inequality states that for a random variable $X$ with ${\sf Var}(X) = \sigma^2$: \begin{align*} \mathbb{P}( |X - \mathbb{E}[X]| \geq k \sigma) \leq \frac{1}{k^2}~~~~\forall k \geq 0. \end{align*} Proof: Chebyshev's inequality is an immediate consequence of Markov's inequality: \begin{align*} \mathbb{P}( |X - \mathbb{E}[X]| \geq k \sigma) = \mathbb{P}( |X - \mathbb{E}[X]|^2 \geq k^2 \sigma^2) \leq \frac{\mathbb{E}(|X - \mathbb{E}[X]|^2)}{k^2 \sigma^2} = \frac{1}{k^2}. \end{align*} Good enough for constant-probability statements: things are usually within a few standard deviations of the mean. When applied to an average, it only requires pairwise independence.
    3. Chernoff's Bound. Define $\mu = \mathbb{E}[X]$. For any $t > 0$, we have that \begin{align*} \mathbb{P}((X - \mu) > u) = \mathbb{P}( \exp(t(X - \mu)) > \exp(t u) ) \leq \frac{ \mathbb{E}[\exp(t(X - \mu))]}{\exp(t u)}. \end{align*} Now $t$ is a parameter we can choose to get a tight upper bound, i.e. we can write this bound as \begin{align*} \mathbb{P}((X - \mu) > u) \leq \inf_{0 \leq t \leq b} \exp (-t(u + \mu)) \mathbb{E}[\exp(tX)], \end{align*} where $[0, b]$ is an interval on which the mgf is finite. This bound is known as Chernoff's bound.
    4. Sub-Gaussian random variables. Formally, a random variable $X$ with mean $\mu$ is sub-Gaussian (tails decay at least as fast as a Gaussian's) if there exists a positive number $\sigma$ such that \begin{align*} \mathbb{E}[\exp(t(X - \mu))] \leq \exp (\sigma^2 t^2/2), \end{align*} for all $t \in \mathbb{R}$. Gaussian random variables with variance $\sigma^2$ satisfy the above condition with equality, so a $\sigma$-sub-Gaussian random variable basically just has an mgf that is dominated by that of a Gaussian with variance $\sigma^2$.
    5. Jensen's inequality: Jensen's inequality states that for a convex function $g: \mathbb{R} \mapsto \mathbb{R}$ we have that, \begin{align*} \mathbb{E}[g(X)] \geq g(\mathbb{E}[X]). \end{align*} If $g$ is concave then the reverse inequality holds.
    6. Hoeffding's Bound. This in turn yields Hoeffding's bound. Suppose that $X_1,\ldots, X_n$ are independent, identically distributed, bounded random variables, with $a \leq X_i \leq b$ for all $i$; then \begin{align*} \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right| \geq t \right) \leq 2 \exp \left( - \frac{2nt^2}{(b-a)^2} \right). \end{align*} This is a two-sided exponential tail inequality for averages of bounded random variables (a quick numerical comparison against Markov and Chebyshev appears after this list).
    7. Bernstein's Inequality. Suppose we have $X_1,\ldots,X_n$ which are i.i.d. from a distribution with mean $\mu$, bounded support $[a,b]$, and variance $\mathbb{E}[(X-\mu)^2] = \sigma^2$. Then \begin{align*} \mathbb{P}(| \widehat{\mu} - \mu| \geq t) \leq 2 \exp\left( - \frac{nt^2}{2(\sigma^2 + (b-a) t)} \right), \end{align*} where $\widehat{\mu}$ is the sample mean. Roughly, this inequality says that with probability at least $1 - \delta$, \begin{align*} |\widehat{\mu} - \mu| \leq 4 \sigma \sqrt{ \frac{\ln (2/\delta)}{n}} + \frac{ 4(b-a) \ln(2/\delta)}{n}. \end{align*} Intuition: it has both a sub-Gaussian part (near the mean) and a sub-exponential part (in the far tails), giving "Poisson-like" bounds.
    8. The union bound. This is also known as Boole's inequality. It says that if we have events $A_1,\ldots,A_n$ then \begin{align*} \mathbb{P}\left( \bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n \mathbb{P}(A_i). \end{align*}
    9. Levy's inequality / Tsirelson's inequality: concentration of Lipschitz functions of Gaussian random variables. Suppose $f: \mathbb{R}^n \mapsto \mathbb{R}$ is $L$-Lipschitz, i.e. \begin{align*} |f(X_1,\ldots,X_n) - f(Y_1,\ldots,Y_n)| \leq L \sqrt{\sum_{i=1}^n (X_i - Y_i)^2}, \end{align*} for all $X_1,\ldots,X_n,Y_1,\ldots,Y_n \in \mathbb{R}$. For such functions we have that if $X_1,\ldots,X_n \sim N(0,1)$ then \begin{align*} \mathbb{P}(|f(X_1,\ldots,X_n) - \mathbb{E}[f(X_1,\ldots,X_n)] | \geq t) \leq 2 \exp \left( - \frac{t^2}{2L^2} \right). \end{align*}
    10. $U$-statistics. A $U$-statistic is defined by a kernel, which is just a function of two random variables, i.e. $g: \mathbb{R}^2 \mapsto \mathbb{R}$. The $U$-statistic is then given as \begin{align*} U(X_1,\ldots,X_n) := \frac{1}{ {n \choose 2} } \sum_{j < k} g(X_j, X_k). \end{align*} For example, the variance corresponds to the kernel $g(X_j, X_k) = \frac{1}{2}(X_j-X_k)^2$, and the mean absolute deviation to $g(X_j, X_k) = |X_j-X_k|$ (a small check of the variance kernel appears after this list).
    11. McDiarmid's Inequality. Formally, we have independent RVs $X_1,\ldots,X_n$, where each $X_i \in \mathbb{R}$. We have a function $f: \mathbb{R}^n \mapsto \mathbb{R}$ that satisfies the property that \begin{align*} |f(x_1,\ldots,x_n) - f(x_1,\ldots,x_{k-1}, x'_k, x_{k+1},\ldots,x_n)| \leq L_k, \end{align*} for every $x \in \mathbb{R}^n$ and $x'_k \in \mathbb{R}$, i.e. the function changes by at most $L_k$ if its $k$-th coordinate is changed. This is known as the bounded difference condition. If the random variables $X_1,\ldots, X_n$ are independent, then for all $t \geq 0$ \begin{align*} \mathbb{P}(|f(X_1,\ldots,X_n) - \mathbb{E}[f(X_1,\ldots,X_n)] | \geq t) \leq 2 \exp \left( - \frac{2t^2}{\sum_{k=1}^n L_k^2} \right). \end{align*} For bounded $U$-statistics, i.e. if $g(X_i, X_j) \leq b$, we can apply McDiarmid's inequality to obtain a concentration bound. Note that since each random variable $X_i$ participates in $(n-1)$ terms, we have that \begin{align*} |U(X_1,\ldots,X_n) - U(X_1,\ldots,X_i',\ldots, X_n)| \leq \frac{1}{{n \choose 2}} (n-1)(2b) = \frac{4b}{n}, \end{align*} so that McDiarmid's inequality tells us that \begin{align*} \mathbb{P}(|U(X_1,\ldots,X_n) - \mathbb{E}[U(X_1,\ldots,X_n)] | \geq t) \leq 2 \exp (-nt^2/(8b^2)). \end{align*}
    12. Pinsker's Inequality is cancelled
      The Johnson–Lindenstrauss (JL) lemma states that a set of points in a high-dimensional space can be mapped into a space of much lower dimension while approximately preserving pairwise distances; in other words, you can efficiently transform many very big things into many much smaller things, and it's OK.
      Deviation from the mean
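As mentioned in the Hoeffding item, here is a small numerical comparison of the Markov, Chebyshev, and Hoeffding bounds against the simulated tail probability of a Bernoulli sample mean. The choice of Bernoulli(0.5), n = 100, and deviation t = 0.1 is arbitrary, and the Markov bound plugs in a Monte Carlo estimate of the first absolute moment since we would not normally know it.

```python
import numpy as np

rng = np.random.default_rng(0)

theta, n, t = 0.5, 100, 0.1        # Bernoulli parameter, sample size, deviation
n_trials = 200_000

# Simulate many sample means and measure how often they deviate by >= t.
means = rng.binomial(n, theta, size=n_trials) / n
dev = np.abs(means - theta)
empirical = np.mean(dev >= t)

# Markov on the non-negative variable |mean - theta| (E|D| estimated by simulation).
markov = dev.mean() / t
# Chebyshev: Var(sample mean) = theta * (1 - theta) / n.
chebyshev = theta * (1 - theta) / (n * t**2)
# Hoeffding with a = 0, b = 1: 2 * exp(-2 n t^2).
hoeffding = 2 * np.exp(-2 * n * t**2)

print(f"empirical  {empirical:.4f}")
print(f"markov     {markov:.4f}")
print(f"chebyshev  {chebyshev:.4f}")
print(f"hoeffding  {hoeffding:.4f}")
```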
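And for the U-statistic item, a minimal check that the kernel g(x, y) = (x - y)^2 / 2, averaged over all pairs, recovers the usual unbiased sample variance; the normal sample here is just an arbitrary test input.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)   # arbitrary test data

# U-statistic with kernel g(a, b) = 0.5 * (a - b)^2, averaged over all pairs j < k.
u_stat = np.mean([0.5 * (x[j] - x[k]) ** 2
                  for j, k in combinations(range(len(x)), 2)])

# It coincides with the unbiased sample variance (divisor n - 1), up to float error.
print(u_stat, x.var(ddof=1))
```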

    Convergence of Sequences

    1. Almost sure convergence: The naive requirement would be \begin{align*} \lim_{n \rightarrow \infty} X_n = X. \end{align*} The correct analogue turns out to be to require \begin{align*} \mathbb{P}\left(\lim_{n \rightarrow \infty} X_n = X\right) = 1. \end{align*}
    2. Convergence in probability: A sequence of random variables $X_1,\ldots,X_n$ converges in probability to a random variable $X$ if for every $\epsilon > 0$ we have that \begin{align*} \lim_{n \rightarrow \infty} \mathbb{P}( |X_n - X| \geq \epsilon) = 0. \end{align*} Example: Weak Law of Large Numbers. Suppose that $Y_1,\ldots,Y_n$ are i.i.d. with $\mathbb{E}[Y_i] = \mu$ and $\text{Var}(Y_i) = \sigma^2 < \infty$. Define, for $i \in \{1,\ldots,n\}$, \begin{align*} X_i = \frac{1}{i} \sum_{j = 1}^i Y_j. \end{align*} The WLLN says that the sequence $X_1,X_2,\ldots$ converges in probability to $\mu$ (a small simulation appears after this list). Proof: The proof is simply an application of Chebyshev's inequality. We note that by Chebyshev's inequality, \begin{align*} \mathbb{P}(|X_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n \epsilon^2}. \end{align*} This in turn implies that \begin{align*} \lim_{n \rightarrow \infty} \mathbb{P}(|X_n - \mu| \geq \epsilon) = 0, \end{align*} as desired. Usually we will construct an estimator $\widehat{\theta}_n$ for some quantity $\theta^*$. We will then say that the estimator is consistent if the sequence of RVs $\widehat{\theta}_n$ converges in probability to $\theta^*$.
    3. Convergence in Quadratic Mean: \begin{align*} \mathbb{E}(X_n - X)^2 \rightarrow 0, \end{align*} as $n \rightarrow \infty$.
    4. Convergence in Distribution: We say that a sequence converges to $X$ in distribution if: \begin{align*} \lim_{n \rightarrow \infty} F_{X_n}(t) = F_X(t), \end{align*} for all points $t$ where the CDF $F_X$ is continuous.
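As referenced in the WLLN item, here is a tiny simulation of convergence in probability: for a few increasing values of n, estimate P(|X_n - mu| >= eps) by Monte Carlo and watch it shrink. The exponential population, eps = 0.1, and the trial count are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, eps, n_trials = 1.0, 0.1, 10_000   # population mean, tolerance, Monte Carlo trials

# Y_i ~ Exponential with mean 1; X_n is the average of the first n draws.
for n in (10, 100, 1_000):
    means = rng.exponential(scale=mu, size=(n_trials, n)).mean(axis=1)
    prob = np.mean(np.abs(means - mu) >= eps)
    print(f"n = {n:>5}:  P(|X_n - mu| >= {eps}) ~ {prob:.4f}")
```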

    5 Ways to Do Linear Regression

    1. Least Squares method, its derivation, and an intuitive example.
    2. Matrix formulation for multiple linear regression, which requires understanding matrix derivatives.
    3. Projection Matrix
    4. Weighted Least Square:
      We would like to fit a linear regression model to the dataset $ \mathcal{D} = \left\{\left(\mathbf{x}^{(1)},y^{(1)}\right), \left(\mathbf{x}^{(2)},y^{(2)}\right),\cdots, \left(\mathbf{x}^{(N)},y^{(N)}\right)\right\} $ with $\mathbf{x}^{(i)} \in \mathbb{R}^M$ by minimizing the ordinary least squares (OLS) objective function: $ J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^N\left(y^{(i)} - \sum_{j=1}^M w_j x_j^{(i)}\right)^2 $ Specifically, we solve for each linear regression coefficient $w_k, 1\leq k\leq M$, by deriving an expression for $w_k$ from the critical point $\frac{\partial J(\mathbf{w})}{\partial w_k} = 0$. The resulting expression for each $w_k$ is: $w_k = \frac{\sum_{i=1}^N x_k^{(i)}\left(y^{(i)}-\sum_{j=1,j\neq k}^M w_j x_j^{(i)}\right)}{\sum_{i=1}^N \left(x_k^{(i)}\right)^2}$
      The weighted least squares estimate, with observation weights collected in a diagonal matrix $W$, is $\hat{\beta}_{WLS} = \underset{\beta}{\operatorname{argmin}}{\sum\limits_{i=1}^n{W_{ii}\,\epsilon_i^2}} = (X^T WX)^{-1}X^T WY$ (a numerical check of both expressions appears right after this list).
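To make items 2 and 4 concrete, here is a hedged sketch on synthetic data (no intercept, arbitrary weights): it solves OLS by the normal equations, verifies that the coordinate-wise expression derived above holds at the solution, and computes the weighted least squares estimate in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N examples, M features, no intercept for simplicity.
N, M = 200, 3
X = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# OLS via the normal equations: w = (X^T X)^{-1} X^T y.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Coordinate-wise expression from dJ/dw_k = 0:
#   w_k = sum_i x_k^(i) * (y^(i) - sum_{j != k} w_j x_j^(i)) / sum_i (x_k^(i))^2.
# At the OLS solution, each coordinate already satisfies its own equation.
k = 0
partial_residual = y - X @ w_ols + X[:, k] * w_ols[k]   # leaves out feature k
w_k_check = X[:, k] @ partial_residual / (X[:, k] @ X[:, k])
print(w_ols[k], w_k_check)                               # these two agree

# Weighted least squares: beta = (X^T W X)^{-1} X^T W y with diagonal W.
W = np.diag(rng.uniform(0.5, 2.0, size=N))
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)
```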
    There is so much more to review; updating...

