
Machine Learning Pointers

Pointers, derivations, and reminders on classical machine learning topics I keep coming back to.

MLE and MAP

$$\theta_{MLE} = \arg\max_{\theta} P(D | \theta) = \arg\max_{\theta} \prod_i P(x_i | \theta)$$

To avoid numerical underflow, I work in log space: the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its log.

$$\theta_{MLE} = \arg\max_{\theta} \log P(X | \theta) = \arg\max_{\theta} \sum_i \log P(x_i | \theta)$$

For MAP:

$$P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)} \propto P(X | \theta)P(\theta)$$

$$\theta_{MAP} = \arg\max_{\theta} \log P(X | \theta) + \log P(\theta)$$
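
To make the prior term concrete: with a zero-mean Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, we have $\log P(\theta) = -\frac{\|\theta\|^2}{2\tau^2} + \text{const}$, so

$$\theta_{MAP} = \arg\max_{\theta} \log P(X \mid \theta) - \frac{1}{2\tau^2}\|\theta\|^2,$$

which is exactly $\ell_2$-regularized maximum likelihood; a Laplace prior gives the $\ell_1$ penalty the same way.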

  • Example: for Gaussian random variables, maximizing the log likelihood reduces to minimizing the average squared deviation, so the MLE for the variance is $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2,$$ with $\hat{\mu}$ the sample mean (note the $1/n$, not $1/(n-1)$: the MLE is biased).
  • Example: for Bernoulli random variables, MLE gives $$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}X_i.$$
  • These derivations are basic, but useful enough that I still like keeping them written down.
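
A quick numerical sanity check of both examples (an illustrative sketch; the sample sizes, seed, and grid are arbitrary). The grid scan works in log space, as above, to avoid multiplying a hundred thousand probabilities together:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli MLE: the sample mean maximizes the log likelihood.
x = rng.binomial(1, 0.3, size=100_000)
theta_hat = x.mean()

# Verify by scanning the log likelihood over a grid of candidate thetas.
grid = np.linspace(0.01, 0.99, 981)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
theta_grid = grid[np.argmax(loglik)]

# Gaussian MLE for the variance: mean squared deviation from the sample mean.
y = rng.normal(2.0, 1.5, size=100_000)
sigma2_hat = ((y - y.mean()) ** 2).mean()

print(theta_hat, theta_grid)  # both close to 0.3
print(sigma2_hat)             # close to 1.5**2 = 2.25
```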

Logistic Regression

Original video reference

  • $$\frac{\exp(w^Tx)}{1+\exp(w^Tx)} = \frac{1}{1+\exp(-w^Tx)}$$ and this can also be viewed as a special case of softmax with only two classes.
  • $$\frac{1}{1+\exp(-w^Tx)} \ge \frac{1}{2} \rightarrow w^Tx \ge 0,$$ hence the decision boundary is linear in the binary case.
  • Logistic regression gives an unconstrained, smooth objective. It also gives calibrated probabilities that can be interpreted as confidence in a decision.
  • As an optimization problem, $\ell_2$, $\ell_1$, and elastic-net regularized logistic regression are all worth remembering, because MAP estimation corresponds naturally to regularization: a Gaussian prior on $w$ gives the $\ell_2$ penalty, a Laplace prior gives $\ell_1$.
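
The smooth, unconstrained objective can be minimized with plain gradient descent. A minimal numpy sketch of $\ell_2$-regularized logistic regression (function names, hyperparameters, and the toy data are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, lr=0.1, steps=2000):
    """Gradient descent on the l2-regularized negative log likelihood:
    the MAP objective under a Gaussian prior on w."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        p = sigmoid(X @ w)                    # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w    # data term + l2 penalty gradient
        w -= lr * grad
    return w

# Toy data: two Gaussian blobs. The learned boundary w^T x = 0 is linear,
# matching the threshold-at-1/2 argument above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.r_[np.zeros(200), np.ones(200)]
w = fit_logreg_l2(X, y)
acc = ((sigmoid(X @ w) >= 0.5) == y).mean()
```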

SVM

  • TL;DR: SVM finds the maximum-margin separating hyperplane; with a soft margin, slack variables allow some points to violate it. If the data are not linearly separable, kernels implicitly map them to a higher-dimensional space where they may be.
  • Understanding RKHS is the basis of understanding kernels.
  • The primal view is useful because it shows the role of slack variables and the penalty term $C$ directly. The original note pointed to the SVC primal problem.
  • The dual view highlights that the training vectors are implicitly mapped into a higher-dimensional space by the kernel function, for example the RBF kernel.
  • The famous kernel trick is exactly why the dual formulation is worth understanding.
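
One way to see the kernel trick concretely (a toy sketch, not the SVM solver itself): the degree-2 polynomial kernel $(x^Tz + 1)^2$ equals an inner product under an explicit 6-dimensional feature map $\varphi$, but evaluating the kernel never forms $\varphi$:

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly2_kernel(x, z):
    return (x @ z + 1.0) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

# Same number either way: the kernel evaluates the inner product in the
# 6-dimensional feature space without ever constructing phi explicitly.
lhs = poly2_kernel(x, z)
rhs = poly2_features(x) @ poly2_features(z)
```

The RBF kernel pushes this to the limit: its implicit feature space is infinite-dimensional, so the dual formulation with kernel evaluations is the only practical route.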

Fully Connected Nets and Convolutional Nets

The original notebook also kept reminders on fully connected networks and convolutional nets.

There are still many details here that need periodic review. I prefer to keep this page notebook-like rather than make it sound more polished than it really is.
