- MLE: \begin{align} \theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \prod_i P(x_i \vert \theta) \end{align} To avoid numerical underflow, we instead work in log space: the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its log. \begin{align} \theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_i P(x_i \vert \theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log P(x_i \vert \theta) \end{align} MAP: by Bayes' rule, \begin{align} P(\theta \vert X) &= \frac{P(X \vert \theta) P(\theta)}{P(X)} \\[10pt] &\propto P(X \vert \theta) P(\theta) \end{align} so \begin{align} \theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) P(\theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) + \log P(\theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_i P(x_i \vert \theta) + \log P(\theta) \\[10pt] &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log P(x_i \vert \theta) + \log P(\theta) \end{align}
- Example: $n$ random variables $x_i$ drawn from Gaussian $\mu, \sigma$: $$\log\left(\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ \frac{-(x_i - \mu)^2}{2\sigma^2}\right\}\right) = -n \log(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}{(x_i - \mu)^2}$$ MLE for the variance: $$\nabla_{\sigma}\ell (x_1, x_2, \cdots, x_n \vert \mu, \sigma^2) = 0 $$ $$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{(x_i - \mu)^2}$$
- Example: $n$ random variables $x_i$ drawn from Bernoulli $\theta$; the log-likelihood is $$\left(\sum_{i=1}^{n}x_i\right)\log(\theta)+\left(n - \sum_{i=1}^{n}x_i\right)\log(1-\theta)$$ MLE for $\theta$: $$\nabla_{\theta}\ell (x_1, x_2, \cdots, x_n \vert \theta) = 0 $$ $$ \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$$
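A quick numerical sanity check of the two closed-form MLEs above (a minimal sketch using numpy; the parameter values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Gaussian: with mu known, the MLE for the variance is the mean squared deviation from mu.
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, size=n)
sigma2_mle = np.mean((x - mu) ** 2)
print(sigma2_mle, sigma ** 2)        # both close to 9.0

# Bernoulli: the MLE for theta is simply the sample mean.
theta = 0.3
x = rng.binomial(1, theta, size=n)
theta_mle = np.mean(x)
print(theta_mle, theta)              # both close to 0.3
```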

- $$\frac{\exp(w^Tx)}{1+\exp(w^Tx)} = \frac{1}{1+\exp{(-w^Tx)}}$$ This can also be viewed as a special case of softmax with only 2 classes: $$\frac{1}{1+\exp{(-w^Tx)}} \geq \frac{1}{2} \iff w^Tx \geq 0 $$ hence the decision boundary is linear in the binary case.
- LR gives us an unconstrained, smooth objective, and it outputs calibrated probabilities that can be interpreted as confidence in a decision.
- LR is a linear classifier: the decision boundary is a hyperplane. With quadratic features the boundary can be an ellipse, and regularization will enlarge such an ellipse.
- As an optimization problem, binary class $\ell_2$ penalized logistic regression minimizes the following cost function: $\min_{w, c} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1)$. Similarly, $\ell_1$ regularized logistic regression solves the following optimization problem: $\min_{w, c} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1)$. Elastic-Net regularization is a combination of $\ell_1$ and $\ell_2$, and minimizes the following cost function: $\min_{w, c} \frac{1 - \rho}{2}w^T w + \rho \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1)$, where $\rho$ controls the strength of $\ell_1$ regularization vs. $\ell_2$ regularization (it corresponds to the l1_ratio parameter). No regularization amounts to setting $C$ to a very high value. Maximum a posteriori estimation corresponds to regularization: the penalty term plays the role of the negative log prior $-\log P(\theta)$.
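The same three penalties map directly onto scikit-learn's LogisticRegression; a minimal sketch (the dataset and parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Larger C means weaker regularization, matching the cost functions above.
l2 = LogisticRegression(penalty='l2', C=1.0)                      # Gaussian prior (MAP view)
l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')  # Laplace prior, sparse weights
en = LogisticRegression(penalty='elasticnet', C=1.0,
                        solver='saga', l1_ratio=0.5)              # rho corresponds to l1_ratio

for name, clf in [('l2', l2), ('l1', l1), ('elasticnet', en)]:
    clf.fit(X, y)
    print(name, int((clf.coef_ == 0).sum()), 'zero weights')
```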

- TL;DR: SVM is about finding a soft-margin separating hyperplane that best separates the data. If the data are not separable in the original feature space, we can implicitly move to a higher-dimensional space by using kernels.
- Understanding RKHS is the basis of understanding Kernels.
- SVM Primal and Dual derivation.
- SVC solves the following primal problem:

\begin{align}\begin{aligned}\min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n}
\zeta_i\\\begin{split}\textrm {subject to } & y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,\\
& \zeta_i \geq 0, i=1, ..., n\end{split}\end{aligned}\end{align}
Intuitively, we're trying to maximize the margin (by minimizing $||w||^2 = w^Tw$), while
incurring a penalty when a sample is misclassified or falls within the margin boundary.
Ideally, the value $y_i (w^T \phi (x_i) + b)$ would be $\geq 1$ for all samples, which indicates a
perfect prediction.
But problems are usually not perfectly separable with a hyperplane, so we allow some
samples to be at a distance $\zeta_i$ (slack variable) from their correct margin boundary.
The penalty term $C$ controls the strength of this penalty, and as a result, acts as an inverse
regularization parameter. Basically, the smaller $C$ is, the more slack the SVM tolerates,
allowing more misclassification.
For very tiny values of $C$, you should expect misclassified examples, often even if your training
data is linearly separable (also see note below).

The dual problem to the primal is:
\begin{align}\begin{aligned}\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T
\alpha\\\begin{split}
\textrm {subject to } & y^T \alpha = 0 \\
& 0 \leq \alpha_i \leq C, i=1, ..., n\end{split}\end{aligned}\end{align}
where $e$ is the vector of all ones, and $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij}
\equiv y_i y_j K(x_i, x_j)$,
where $K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$ is the kernel.
The terms $\alpha_i$ are called the dual coefficients, and they are upper-bounded by $C$.
This dual representation highlights the fact that training vectors are implicitly mapped into a
higher (maybe infinite) dimensional space by the kernel function, e.g. the RBF kernel.

Once the optimization problem is solved,
the output of decision_function for a given sample $x$ is $\sum_{i\in SV} y_i \alpha_i K(x_i, x)
+ b$, and the predicted class corresponds to its sign.
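To see this concretely, here is a minimal sketch (not the library's internals) that reconstructs decision_function from the fitted dual coefficients; note that scikit-learn stores $y_i \alpha_i$ in dual_coef_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma=0.1).fit(X, y)

x_new = X[:3]
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.1)    # K(x_i, x) for each support vector x_i
manual = clf.dual_coef_ @ K + clf.intercept_               # sum_i (y_i alpha_i) K(x_i, x) + b
print(np.allclose(manual.ravel(), clf.decision_function(x_new)))  # True
```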

The primal problem can be equivalently formulated as
$\min_ {w, b} \frac{1}{2} w^T w + C \sum_{i=1}^{n}\max(0, 1 - y_i (w^T \phi(x_i) + b)),$
where we make use of the hinge loss.
This is the form that is directly optimized by LinearSVC, but unlike the dual form, this one
does not involve inner products between samples, so the
famous kernel trick cannot be applied.
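A minimal sketch of this hinge-loss primal objective, evaluated at the solution returned by LinearSVC (the helper below is illustrative, not part of scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def primal_objective(w, b, X, y, C):
    margins = y * (X @ w + b)                              # y_i (w^T x_i + b)
    return 0.5 * w @ w + C * np.maximum(0, 1 - margins).sum()

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y_pm = 2 * y - 1                                           # hinge loss expects labels in {-1, +1}
clf = LinearSVC(C=1.0, loss='hinge', max_iter=10_000).fit(X, y)
print(primal_objective(clf.coef_.ravel(), clf.intercept_[0], X, y_pm, C=1.0))
```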

- Example: $N$ layer neural network that takes a scalar input $x$, where each layer consists of a single sigmoid neuron. With $o_0 = x$, $$s_i = w_i o_{i-1} + b_i$$ $$o_i = \sigma(s_i)$$ $$\frac{\partial{o_N}}{\partial{w_1}} = \frac{\partial{o_N}}{\partial{o_{N-1}}}\frac{\partial{o_{N-1}}}{\partial{o_{N-2}}} \cdots \frac{\partial{o_1}}{\partial{w_1}} = \frac{\partial{o_1}}{\partial{w_1}} \prod_{i=2}^{N}\frac{\partial{o_{i}}}{\partial{o_{i-1}}} = \sigma'(s_1)x \prod_{i=2}^{N}\sigma'(s_i)w_i \leq x\left(\frac{1}{4}\right)^N$$ since $\sigma'(\cdot) \leq \frac{1}{4}$ for the sigmoid; the last bound additionally assumes $\vert w_i \vert \leq 1$. The gradient with respect to early-layer weights therefore vanishes geometrically with depth.
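A small numerical illustration of this bound (a minimal sketch; the weights are drawn uniformly from $[-1, 1]$ so that the $\vert w_i \vert \leq 1$ assumption holds):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_w1(x, w, b):
    """d o_N / d w_1 for a chain of single-neuron sigmoid layers."""
    o, s_list = x, []
    for wi, bi in zip(w, b):
        s = wi * o + bi
        s_list.append(s)
        o = sigmoid(s)
    # d o_1 / d w_1 = sigma'(s_1) * x, then multiply the factors d o_i / d o_{i-1} = sigma'(s_i) * w_i
    d = sigmoid(s_list[0]) * (1 - sigmoid(s_list[0])) * x
    for wi, si in zip(w[1:], s_list[1:]):
        d *= sigmoid(si) * (1 - sigmoid(si)) * wi
    return d

rng = np.random.default_rng(0)
for N in (2, 10, 50):
    w, b = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)
    print(N, grad_w1(1.0, w, b))   # magnitude collapses geometrically as N grows
```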

- Implementation of an example Convolutional Network from Scratch.
- De facto Convolution resources and a lecture.
Key takeaway of the Conv Layer:

- Accepts a volume of size $W_1 \times H_1 \times D_1$.
- Requires four hyperparameters: the number of filters $K$, their spatial extent $F$, the stride $S$, and the amount of zero padding $P$.
- Produces a volume of size $W_2 \times H_2 \times D_2$, where:
  - $W_2 = (W_1 - F + 2P)/S + 1$
  - $H_2 = (H_1 - F + 2P)/S + 1$ (i.e. width and height are computed equally by symmetry)
  - $D_2 = K$
- With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
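A quick check of these output-size and parameter-count formulas (a minimal sketch; the helper name is ours and the example numbers are the classic AlexNet-style first layer):

```python
def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Output volume size and parameter count for a conv layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    n_params = (F * F * D1) * K + K          # shared weights per filter, plus K biases
    return (W2, H2, D2), n_params

# 227x227x3 input, 96 filters of size 11x11, stride 4, no padding -> 55x55x96, 34944 params
print(conv_output_shape(227, 227, 3, K=96, F=11, S=4, P=0))
```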

In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.
- Wonderful Video:
\begin{bmatrix}
-0.5 & -1.0 & -0.5\\
-1.0 & 7.0 & -1.0\\
-0.5 & -1.0 & -0.5\\
\end{bmatrix} This matrix is actually a sharpening kernel; its entries sum to 1, so it does not change the overall brightness of the image!
$$\begin{bmatrix} 1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1 \end{bmatrix} = \begin{bmatrix} 1\\ 2\\ 1 \end{bmatrix}\begin{bmatrix} 1 & 0 & -1 \end{bmatrix} = \left(\begin{bmatrix} 1\\ 1 \end{bmatrix} * \begin{bmatrix} 1\\ 1 \end{bmatrix}\right)\left(\begin{bmatrix} 1 & -1 \end{bmatrix} * \begin{bmatrix} 1 & 1 \end{bmatrix}\right)$$
This is the Sobel filter (vertical edge detection), since it computes the horizontal gradient
$G_x$. Interestingly, all the elements sum up to 0, so it gives zero response on constant regions.
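A minimal sketch (using scipy) checking this separability numerically: convolving with the full 3x3 Sobel kernel gives the same result as convolving with the column $[1, 2, 1]^T$ and then the row $[1, 0, -1]$:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
img = rng.random((32, 32))

sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]])
col = np.array([[1], [2], [1]])
row = np.array([[1, 0, -1]])

full = convolve2d(img, sobel, mode='valid')
separable = convolve2d(convolve2d(img, col, mode='valid'), row, mode='valid')
print(np.allclose(full, separable))   # True: two cheap 1-D passes replace one 2-D pass
```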

Remember: $$ (k_0 + k_1x + k_2x^2)(a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + a_6x^6 + \cdots)$$ the coefficient of the $x^3$ term is $(k_2a_1 + k_1a_2 + k_0a_3)$. Why?

Let's first flip the kernel: $(k_2x^2 + k_1x + k_0)$ and let it march through: $(a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + a_6x^6 + \cdots)$

$(k_0a_0)x^0 + $

$(k_1a_0 + k_0a_1)x^1 + $

$(k_2a_0+ k_1a_1 + k_0a_2)x^2 + $

$(k_2a_1+ k_1a_2 + k_0a_3)x^3 + $

$(k_2a_2+ k_1a_3 + k_0a_4)x^4 + $

$(k_2a_3+ k_1a_4 + k_0a_5)x^5 + $

$(k_2a_4+ k_1a_5 + k_0a_6)x^6 + \cdots$
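This coefficient pattern is exactly discrete convolution, which numpy computes directly; a minimal sketch with arbitrary example coefficients:

```python
import numpy as np

k = np.array([1, 2, 3])               # k_0 + k_1 x + k_2 x^2
a = np.array([4, 5, 6, 7, 8, 9, 10])  # a_0 + a_1 x + ... + a_6 x^6 (truncated)

coeffs = np.convolve(k, a)            # coeffs[m] = sum_{i+j=m} k_i * a_j
print(coeffs[3], k[2]*a[1] + k[1]*a[2] + k[0]*a[3])   # both give the x^3 coefficient: 34
```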

replace $x$ with $e^{2\pi if}$, and $(a_0 + a_1(e^{2\pi if}) + a_2(e^{2\pi if})^2 + a_3(e^{2\pi if})^3 + a_4(e^{2\pi if})^4 + a_5(e^{2\pi if})^5 + a_6(e^{2\pi if})^6 + \cdots)$ becomes a Fourier series.

This is the exact convolving process mentioned above: $f * g$. This material is all referenced from the awesome MIT course: Computational Thinking.
- Convolve with Gaussian $G^s(x,y) = \frac{1}{2\pi s^2}\exp\left(-\frac{x^2+y^2}{2s^2}\right)$
- Gaussian Smoothing (This guy also seems to have a lot to offer on his page, especially about Fourier)
- Convolution as Matrix Vector product: fascinating facts about how to actually implement convolution and its derivatives efficiently using the adjoint matrix of $W$, why we actually need the vector-Jacobian product $\vec{v}^T\frac{\partial \mathrm{conv}(x, W)}{\partial W}$, and the im2col operation. Also covers how hardware acceleration can speed up convolution using tiling. Here's a Proper Implementation of the forward pass of a conv net.
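Here is a minimal im2col sketch of our own (not the linked implementation): unrolling input patches into columns turns the convolution into a single matrix-vector product.

```python
import numpy as np

def im2col(x, F, S=1):
    """x: (H, W) single-channel image; F: filter size; S: stride. Returns (F*F, n_patches)."""
    H, W = x.shape
    out_h, out_w = (H - F) // S + 1, (W - F) // S + 1
    cols = np.empty((F * F, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i*S:i*S+F, j*S:j*S+F].ravel()
    return cols, (out_h, out_w)

rng = np.random.default_rng(0)
x, w = rng.random((6, 6)), rng.random((3, 3))          # image and one filter

cols, (out_h, out_w) = im2col(x, F=3)
out = (w.ravel() @ cols).reshape(out_h, out_w)         # conv-layer style cross-correlation

# cross-check against a direct sliding-window computation
direct = np.array([[(x[i:i+3, j:j+3] * w).sum() for j in range(out_w)] for i in range(out_h)])
print(np.allclose(out, direct))                        # True
```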
- Convolution is essentially the image overlap question, LeetCode 835.
