Machine Learning Pointers
Pointers, derivations, and reminders on classical machine learning topics I keep coming back to.
MLE and MAP
$$\theta_{MLE} = \arg\max_{\theta} P(D | \theta) = \arg\max_{\theta} \prod_i P(x_i | \theta)$$
To avoid numerical underflow, I would rather work in log space: the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its log.
$$\theta_{MLE} = \arg\max_{\theta} \log P(X | \theta) = \arg\max_{\theta} \sum_i \log P(x_i | \theta)$$
For MAP:
$$P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)} \propto P(X | \theta)P(\theta)$$
$$\theta_{MAP} = \arg\max_{\theta} \log P(X | \theta) + \log P(\theta)$$
- Example: for Gaussian random variables, the log likelihood reduces to the average squared deviation, so the MLE for the variance is $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2.$$
- Example: for Bernoulli random variables, MLE gives the sample mean, $$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}X_i.$$
- These derivations are basic, but useful enough that I still like keeping them written down.
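As a quick sanity check, these closed-form estimates can be verified numerically against synthetic data. A minimal sketch in NumPy (the true parameters, sample size, and helper name are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli MLE: theta_hat is just the sample mean of the 0/1 outcomes.
x_bern = rng.binomial(1, 0.3, size=10_000)
theta_hat = x_bern.mean()

# Gaussian MLE: mu_hat is the sample mean, sigma2_hat the average squared
# deviation (note the 1/n denominator, not the unbiased 1/(n-1)).
x_gauss = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat = x_gauss.mean()
sigma2_hat = np.mean((x_gauss - mu_hat) ** 2)

def gaussian_log_likelihood(x, mu, sigma2):
    # Sum of log densities: working in log space avoids the underflow
    # of multiplying thousands of small probabilities together.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))
```

The closed-form estimates should attain a higher log-likelihood on the sample than any other parameter setting, which is easy to spot-check.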
Logistic Regression
- The sigmoid $$\frac{\exp(w^Tx)}{1+\exp(w^Tx)} = \frac{1}{1+\exp(-w^Tx)}$$ can also be viewed as a special case of softmax with only two classes.
- $$\frac{1}{1+\exp(-w^Tx)} \ge \frac{1}{2} \iff w^Tx \ge 0,$$ hence the decision boundary is linear in the binary case.
- Logistic regression gives an unconstrained, smooth objective. It also gives calibrated probabilities that can be interpreted as confidence in a decision.
- As an optimization problem, $\ell_2$-, $\ell_1$-, and elastic-net-regularized logistic regression are all worth remembering, because MAP estimation with a Gaussian, Laplace, or mixed prior on $w$ corresponds exactly to these regularizers.
SVM
- TL;DR: SVM finds a maximum-margin separating hyperplane, with a soft margin that tolerates some violations. If the data are not linearly separable in the input space, kernels implicitly map them into a higher-dimensional space where they may be.
- Understanding RKHS is the basis of understanding kernels.
- The primal view is useful because it shows the role of slack variables and the penalty term $C$ directly. The original note pointed to the SVC primal problem.
- The dual view highlights that the training vectors are implicitly mapped into a higher-dimensional space by the kernel function, for example the RBF kernel.
- The famous kernel trick is exactly why the dual formulation is worth understanding.
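The kernel trick can be illustrated without a full dual SVM solver: a kernel perceptron uses the same kernel-expansion decision function $f(x) = \sum_i \alpha_i y_i k(x_i, x)$, so only kernel evaluations are ever needed. A sketch on XOR-style data, which no linear boundary separates (the data and $\gamma$ are illustrative choices, and the perceptron stands in for the SVM dual):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2): an inner product in an
    # infinite-dimensional feature space that is never built explicitly.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# XOR-style data: not linearly separable in the input space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

# Kernel perceptron: on each mistake, bump that point's coefficient.
# The decision function only touches the Gram matrix K, never the
# (implicit) high-dimensional features -- the kernel trick.
K = rbf_kernel(X, X, gamma=2.0)
alpha = np.zeros(len(X))
for _ in range(20):
    for i in range(len(X)):
        if y[i] * ((alpha * y) @ K[:, i]) <= 0:
            alpha[i] += 1.0

preds = np.sign(K @ (alpha * y))
```

Swapping in a different kernel changes the feature space without touching the rest of the algorithm, which is exactly why the dual formulation matters.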
Fully Connected Nets and Convolutional Nets
The original notebook also kept reminders on fully connected networks and convolutional nets.
- For deep fully connected nets, writing out the derivative chain directly makes vanishing gradients painfully obvious.
- For convolutional nets, I still like the view that a convolutional layer is just the discrete version of the convolution operation from mathematics (a sliding weighted sum), and many image-processing kernels become intuitive once you stare at them long enough.
- Implementation notes, CS231n, lecture notes, and the MIT video were some of the most useful references in the original notebook.
- I also kept links for Gaussian derivatives, Gaussian smoothing, convolution as a matrix-vector product, and a proper implementation of convolution.
- Convolution also ties surprisingly well back to coding questions like image overlap, which is one reason I kept those notes connected.
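Both views above fit in a few lines: a direct "valid" 2-D convolution (really cross-correlation, as deep learning frameworks define it), and the same operation rewritten as an im2col matrix-vector product. The image and box filter are toy choices; note that smoothing a linear ramp leaves it unchanged, which makes the result easy to check:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Direct "valid" 2-D cross-correlation (no kernel flip), the
    # operation deep learning frameworks call convolution.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 box filter is plain smoothing: it averages each neighborhood.
image = np.arange(25, dtype=float).reshape(5, 5)
box = np.full((3, 3), 1.0 / 9.0)
smoothed = conv2d_valid(image, box)

# Matrix-vector view (im2col): unroll each patch into a row, and the
# whole convolution becomes a single matrix-vector product.
patches = np.array([image[i:i + 3, j:j + 3].ravel()
                    for i in range(3) for j in range(3)])
as_matvec = (patches @ box.ravel()).reshape(3, 3)
```

The im2col form is how convolution is often lowered to matrix multiplication in practice, and the patch-extraction loop is also essentially the image-overlap coding question mentioned above.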
There are still many details here that need periodic review. I prefer to keep this page notebook-like rather than make it sound more polished than it really is.