Support Vector Machine
- 1: SVM Intro
- 2: Hard Margin SVM
- 3: Soft Margin SVM
- 4: Primal Dual Equivalence
- 5: Kernel Trick
- 6: RBF Kernel
- 7: Support Vector Regression
1 - SVM Intro
⭐️We have two classes of points (e.g. Cats 😸 vs. Dogs 🐶) that can be separated by a straight line.
👉 Many such lines exist!
💡SVM asks: “Which line is the safest?”

💡Think of the decision boundary as the center-line of a highway 🛣️.
SVM tries to make this highway 🛣️ as wide as possible without hitting any ‘buildings’ 🏡 (data points) on either side.

2 - Hard Margin SVM
Data is perfectly linearly separable, i.e., there must exist a hyperplane that can perfectly separate the data into two distinct classes without any misclassification.
No noise or outliers that fall within the margin or on the wrong side of the decision boundary. Note: Even a single outlier can prevent the algorithm from finding a valid solution or drastically affect the boundary’s position, leading to poor generalization.

- 🐎 Distance(signed) of a hyperplane from origin = \(\frac{-w_0}{\|w\|}\)
- 🦣 Margin width = distance(\(\pi^+, \pi^-\))
- = \(\frac{(1-w_0) - (-1 -w_0)}{\|w\|}\) = \(\frac{1-\cancel{w_0} + 1 + \cancel{w_0}}{\|w\|}\)
- distance(\(\pi^+, \pi^-\)) = \(\frac{2}{\|w\|}\)
Figure: Distance of Hyperplane from Origin

- Separating hyperplane \(\pi\) is exactly equidistant from \(\pi^+\) and \(\pi^-\).
- We want to maximize the margin between +ve(🐶) and -ve (😸) points.
👉Combining the above two constraints:
\[y_{i}(w^{T}x_{i}+w_{0}) \ge 1, ~ \forall i = 1,2,\dots, n\]
👉To maximize the margin, we must minimize \(\|w\|\), since distance(\(\pi^+, \pi^-\)) = \(\frac{2}{\|w\|}\). This gives the Hard-Margin Primal problem:
\[\min_{w, w_0} \frac{1}{2}\|w\|^2\]
such that, \(y_i(w^Tx_i + w_0) \ge 1 ~ \forall i = 1,2,\dots, n\)
Note: Hard margin SVM will not work if the data has a single outlier or slight overlap.
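💻 A minimal sketch of the hard-margin setup (assuming scikit-learn is available; the toy points and variable names are made up for illustration). Since the library only exposes the soft-margin formulation, a very large C is used here to approximate a hard margin on separable data:
```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, linearly separable clusters (toy data for illustration only)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C makes the soft-margin SVM behave (almost) like a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", w0 =", w0)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```
The printed margin width is exactly the \(\frac{2}{\|w\|}\) quantity derived above.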
3 - Soft Margin SVM
💡Imagine the margin is a fence 🌉.
- Hard Margin: the fence is made of steel. Nothing can cross it.
- Soft Margin: the fence is made of rubber (porous). Some points can ‘push’ into the margin or even cross over to the wrong side, but we charge them a penalty 💵 for doing so.
Distance from decision boundary:
- Distance of positive labelled points must be \(\ge 1\)
- But, the distance of the noise 📢 points (actually positive points) \(x_1, x_2 ~\&~ x_3\) is \(< 1\)
⚔️ So, we introduce a slack variable or allowance for error term, \(\xi_i\) (pronounced ‘xi’) for every single data point.
\[y_i(w^Tx_i + w_0) \ge 1 - \xi_i, ~ \forall i = 1,2,\dots, n\]
\[\implies \xi_i \ge 1 - y_i(w^Tx_i + w_0), \quad \text{and also } \xi_i \ge 0\]
\[\text{So, } \xi_{i}=\max(0,\, 1-y_{i}(w^Tx_i + w_0))\]
Note: The above error term is also called ‘Hinge Loss’.

Figure: Hinge Loss

- \(\xi_i = 0\) : Correctly classified and outside (or on) the margin.
- \(0 < \xi_i \le 1 \) : Within the margin but on the correct side of the decision boundary.
- \(\xi_i > 1\): On the wrong side of the decision boundary (misclassified).
e.g.: Since the noise 📢 points are +ve (\(y_i=1\)) labeled: \(\xi_{i}=\max(0,\, 1-f(x_{i}))\)
- \(x_1 ~ (d=+0.5)\): \(\xi_1=\max(0, 1-0.5) = 0.5\)
- \(x_2 ~ (d=-0.5)\): \(\xi_2=\max(0, 1-(-0.5))= 1.5\)
- \(x_3 ~ (d=-1.5)\): \(\xi_3=\max(0, 1-(-1.5)) = 2.5\)
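💻 A quick sketch reproducing the slack values above (assuming, as in the example, that \(f(x_i)\) equals the signed distance \(d\); the helper name `hinge_slack` is just for illustration):
```python
# Slack (hinge loss) for the three noisy +ve (y=1) points from the example above,
# taking f(x_i) to be the signed distance d from the decision boundary.
def hinge_slack(y, f_x):
    return max(0.0, 1.0 - y * f_x)

for name, d in [("x1", +0.5), ("x2", -0.5), ("x3", -1.5)]:
    print(name, "-> xi =", hinge_slack(y=1, f_x=d))
# x1 -> xi = 0.5, x2 -> xi = 1.5, x3 -> xi = 2.5
```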

The Soft-Margin Primal objective adds the total slack, weighted by a hyper-parameter ‘C’, to the margin term:
\[\min_{w, w_0, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i\]
Subject to constraints:
- \(y_i(w^T x_i + w_0) \geq 1 - \xi_i\): The ‘softened’ margin constraint.
- \(\xi_i \geq 0\): Slack/Error cannot be negative.
Note: We use a hyper-parameter ‘C’ to control the trade-off.
- Large ‘C’: Over-Fitting; misclassifications are expensive 💰. The model tries to keep the errors as low as possible.
- Small ‘C’: Under-Fitting; margin width is more important than individual errors. The model will ignore outliers/noise to get a ‘cleaner’ (wider) boundary.
Note: SVM is just L2-regularized Hinge-Loss minimization, in the same way that Logistic Regression minimizes (regularized) Log-Loss.
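💻 A hedged sketch of the ‘C’ trade-off using scikit-learn’s SVC on synthetic, slightly overlapping data (the dataset and the exact counts are illustrative only):
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, slightly overlapping two-class data (illustration only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(clf.coef_[0]):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
# Small C -> wider margin, more support vectors (under-fitting tendency)
# Large C -> narrower margin, fewer support vectors (over-fitting tendency)
```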

4 - Primal Dual Equivalence
The Primal form is intuitive but computationally expensive in high dimensions, because a non-linear boundary requires explicitly creating polynomial features.
➡️ Number of new features = \({d+p \choose p}\), which grows roughly as \(O(d^{p})\)
- d= number of dimensions (features)
- p = degree of polynomial
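💻 To see the blow-up concretely, a tiny sketch that just evaluates \({d+p \choose p}\) for a few values (the numbers follow directly from the formula above):
```python
from math import comb

# Number of polynomial features of degree <= p in d dimensions: C(d+p, p)
for d in (10, 100, 1000):
    for p in (2, 3):
        print(f"d={d:>4}, p={p}: {comb(d + p, p):,} features")
# e.g. d=100, p=3 already gives 176,851 features; d=1000, p=3 gives ~167 million
```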
Note: The Dual form is what enables the Kernel Trick 🪄.
⭐️ Recall the Soft-Margin Primal:
\[\min_{w, w_0, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i\]
Subject to constraints:
- \(y_i(w^T x_i + w_0) \geq 1 - \xi_i\): The ‘softened’ margin constraint.
- \(\xi_i \geq 0\): Slack/Error cannot be negative.
⭐️ We start with the Soft-Margin Primal objective and incorporate its constraints using Lagrange Multipliers \((\alpha_i, \mu_i \geq 0)\)
\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]
Note: Each inequality constraint must be written in the \(\le 0\) form before being added to the Lagrangian.
👉The Lagrangian function has two competing objectives:
- Minimization: We want to minimize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t. primal variables (\(w, w_0, \xi_i\)) to find the hyperplane with the largest possible margin.
- Maximization: We want to maximize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t dual variables (\(\alpha_i, \mu_i\) ) to ensure all training constraints are satisfied.
Note: A point that is simultaneously a minimum for one set of variables and a maximum for another is, by definition, a saddle point.
👉To find the Dual, we find the saddle point by taking partial derivatives with respect to the primal variables \((w, w_0, \xi)\) and equating them to 0.
\[\frac{\partial L}{\partial w} = 0 \implies \mathbf{w = \sum_{i=1}^n \alpha_i y_i x_i}\]
\[\frac{\partial L}{\partial w_0} = 0 \implies \mathbf{\sum_{i=1}^n \alpha_i y_i = 0}\]
\[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \mu_i = 0 \implies \mathbf{0 \leq \alpha_i \leq C}\]
👉 Substituting these back into the Lagrangian eliminates the primal variables and gives the Dual problem:
\[\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)\]
subject to: \(0 \leq \alpha_i \leq C\) and \(\sum \alpha_i y_i = 0\)
- \(\alpha_i\)= 0, for non support vectors (correct side)
- \(0 < \alpha_i < C\), for free support vectors (exactly on the margin)
- \(\alpha_i = C\), for bounded support vectors (misclassified or inside margin)
Note: Sequential Minimal Optimization (SMO) algorithm is used to find optimal \(\alpha_i\) values.
🎯To classify an unseen point \(x_q\): \(f(x_q) = \text{sign}(w^T x_q + w_0)\)
✅ From the KKT stationarity condition, we know: \(\mathbf{w} = \sum_{i=1}^n \alpha_i y_i x_i\)
👉 Substituting this into the equation:
\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]
Note: Even if you have 1 million training points, if only 500 are support vectors, the summation only runs for 500 terms.
All other points have \(\alpha_i = 0\) and vanish.
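💻 A sketch of this dual prediction formula using attributes scikit-learn’s SVC actually exposes (`dual_coef_` stores \(\alpha_i y_i\) for the support vectors only; the toy data is synthetic):
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # synthetic data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_q = X[:5]                                  # a few 'query' points
# dual_coef_[0] holds alpha_i * y_i for the support vectors only
scores = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_q.T) + clf.intercept_[0]

# The sum runs only over the support vectors, yet matches decision_function exactly
print(np.allclose(scores, clf.decision_function(x_q)))        # True
print("support vectors used:", len(clf.support_vectors_), "of", len(X))
```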
5 - Kernel Trick
👉If our data is not linearly separable in its original space, we can map it to a higher-dimensional feature space (where \(D \gg d\)) using a transformation function \(\phi(x)\).

- It is a bridge between the Dual formulation and the geometry of high-dimensional spaces.
- It is a way to manipulate inner product spaces without the computational cost 💰 of explicit transformation.
💡Actual values of the input vectors \(x_i\) and \(x_j\) never appear in isolation; they only appear as an inner product.
\[ \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]
subject to: \(0 \leq \alpha_i \leq C\) and \(\sum \alpha_i y_i = 0\)
\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]
👉The ‘shape’ of the decision boundary is entirely determined by how similar points are to one another, not by their absolute coordinates.
So we define a function called ‘Kernel Function’.
The ‘Kernel Trick’ 🪄 is an optimization that replaces the dot product of a high-dimensional mapping with a function of the dot product in the original space.
\[K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\]
💡How it works?
Instead of mapping to the higher dimension, \(x_i \rightarrow \phi(x_i)\), \(x_j \rightarrow \phi(x_j)\), and calculating the dot product there, we simply compute \(K(x_i, x_j)\) directly in the original input space.
👉The ‘Kernel Function’ gives the same result as mapping the vectors to a higher dimension and taking the dot product there.
Note: For \(K(x_i, x_j)\) to be a valid kernel, it must satisfy Mercer’s Condition (in practice: the kernel/Gram matrix it produces must be symmetric and positive semi-definite).
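💻 A small sketch checking this property for the plain linear kernel on random data (illustration only; any valid kernel should pass the same check):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random points in 3-D

# Gram matrix for the plain linear kernel K(x, z) = x^T z
K = X @ X.T

# Mercer's Condition in practice: K must be symmetric positive semi-definite
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 (up to rounding error)
```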
Below is an example for a quadratic Kernel function in 2D that is equivalent to mapping the vectors to 3D and taking a dot product in 3D.
\[K(x, z) = (x^T z)^2\]
\[(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2\]
The output of the above quadratic kernel function is equivalent to the explicit dot product of the two mapped vectors in 3D:
\[\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T\]
\[\phi(z) = [z_1^2, \sqrt{2}z_1z_2, z_2^2]^T\]
\[\phi (x)\cdot \phi (z)=x_{1}^{2}z_{1}^{2}+2x_{1}x_{2}z_{1}z_{2}+x_{2}^{2}z_{2}^{2}\]
- Computational Efficiency: Avoids the ‘combinatorial blowup’ 💥 of generating thousands of interaction features manually.
- Memory Savings: No need to store 💾 or process high-dimensional coordinates, only the scalar result of the kernel function.
- Designing a special-purpose, domain-specific kernel is very hard.
- Basically, we are trying to replace feature engineering with kernel design.
- Note: Deep learning does feature engineering implicitly for us.
- Runtime complexity depends on number of support vectors, whose count is not easy to control.
Note: Prediction time complexity ⏰ = \(O(n_{SV}\times d)\), whereas for a linear SVM it is \(O(d)\).
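💻 A quick numerical check of the quadratic-kernel identity above, using the explicit 3-D map \(\phi\) from the example (the toy vectors are made up):
```python
import numpy as np

def phi(v):
    # Explicit 3-D feature map for the quadratic kernel in 2-D
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_trick    = (x @ z) ** 2        # kernel computed in the original 2-D space
k_explicit = phi(x) @ phi(z)     # dot product after mapping to 3-D

print(k_trick, k_explicit)              # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
print(np.isclose(k_trick, k_explicit))  # True
```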
6 - RBF Kernel
- Unlike the polynomial kernel, which looks at global 🌎 interactions, the RBF kernel \(K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)=\exp(-\gamma\|x-z\|^2)\), with \(\gamma = \frac{1}{2\sigma^2}\), acts like a similarity measure.
- If ‘x’ and ‘z’ are identical \(K(x,z)=1\).
- As they move further apart in Euclidean space, the value decays exponentially towards 0.
- If \(x \approx z\) (very close), \(K(x,z) \approx 1\)
- If ‘x’, ‘z’ are far apart, \(K(x,z) \approx 0\)
Note: Kernel function is the measure of similarity or closeness.
Take \(\gamma = 1\) (i.e., \(2\sigma^2 = 1\)) for simplicity, and expand the squared Euclidean distance: \(\|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2x^Tz\). Then:
\[K(x, z) = \exp(-( \|x\|^2 + \|z\|^2 - 2x^T z )) = \exp(-\|x\|^2) \exp(-\|z\|^2) \exp(2x^T z)\]
Using the Taylor expansion \(e^u= \sum_{n=0}^{\infty} \frac{u^n}{n!}\):
\[\exp(2x^T z) = \sum_{n=0}^{\infty} \frac{(2x^T z)^n}{n!} = 1 + \frac{2x^T z}{1!} + \frac{(2x^T z)^2}{2!} + \dots + \frac{(2x^T z)^n}{n!} + \dots\]
\[K(x, z) = e^{-\|x\|^2} e^{-\|z\|^2} \left( \sum_{n=0}^{\infty} \frac{2^n (x^T z)^n}{n!} \right)\]
💡If we expand each \((x^T z)^n\) term, it represents the dot product of all possible n-th order polynomial features.
👉Thus, the implicit feature map is:
\[\phi(x) = e^{-\|x\|^2} \left[ 1, \sqrt{\frac{2}{1!}}x, \sqrt{\frac{2^2}{2!}}(x \otimes x), \dots, \sqrt{\frac{2^n}{n!}}(\underbrace{x \otimes \dots \otimes x}_{n \text{ times}}), \dots \right]^T\]
- Important: The tensor product \(x\otimes x\) creates a vector (or matrix) containing all combinations of the features, e.g. if \(x=[x_{1},x_{2}]\), then \(x\otimes x=[x_{1}^{2},x_{1}x_{2},x_{2}x_{1},x_{2}^{2}]\).
Note: Because the Taylor series has an infinite number of terms, feature map has an infinite number of dimensions.
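💻 A quick numerical check of the expansion above (scale factor taken as 1, as in the derivation; the vectors and the truncation depth N are arbitrary choices):
```python
import numpy as np
from math import factorial

x = np.array([0.3, -0.2])
z = np.array([0.1, 0.4])

exact = np.exp(-np.sum((x - z) ** 2))     # RBF kernel with scale factor 1

# Truncate the Taylor series of exp(2 x.z) -> a finite slice of the infinite feature map
N = 10
series = sum((2 * (x @ z)) ** n / factorial(n) for n in range(N + 1))
approx = np.exp(-(x @ x)) * np.exp(-(z @ z)) * series

print(exact, approx)                      # agree to many decimal places already at N=10
```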
- High Gamma (low \(\sigma\)): Over-Fitting
- Makes the kernel so ‘peaky’ that each support vector only influences its immediate neighborhood.
- Decision boundary becomes highly irregular, ‘wrapping’ tightly around individual data points to ensure they are classified correctly.
- Low Gamma (high \(\sigma\)): Under-Fitting
- The Gaussian bumps are wide and flat.
- Decision boundary becomes very smooth, essentially behaving more like a linear or low-degree polynomial classifier.
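💻 A hedged sketch of the gamma trade-off with scikit-learn’s RBF SVC (the moons dataset and the resulting accuracy numbers are only illustrative):
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train acc = {clf.score(X_tr, y_tr):.2f}, "
          f"test acc = {clf.score(X_te, y_te):.2f}")
# Very high gamma -> near-perfect train accuracy, worse test accuracy (over-fitting)
# Very low gamma  -> smooth, almost linear boundary (under-fitting)
```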
7 - Support Vector Regression
👉Imagine a ‘tube’ of radius \(\epsilon\) surrounding the regression line.
- Points inside the tube are considered ‘correct’ and incur zero penalty.
- Points outside the tube are penalized based on their distance from the tube’s boundary.
👉SVR ignores errors as long as they are within a certain distance (\(\epsilon\)) from the true value.
🎯This makes SVR inherently robust to noise and outliers, as it does not try to fit every single point perfectly, only those that ‘matter’ to the structure of the data.
Note: Standard regression (like OLS) tries to minimize the squared error between the prediction and every data point.
The SVR Primal objective:
\[\min_{w, w_0, \xi, \xi^*} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n (\xi_i + \xi_i^*)\]
Subject to constraints:
- \(y_i - (w^T x_i + w_0) \leq \epsilon + \xi_i\): (Upper boundary)
- \((w^T x_i + w_0) - y_i \leq \epsilon + \xi_i^*\): (Lower boundary)
- \(\xi_i, \xi_i^* \geq 0\): (Slack/Error cannot be negative)
Terms:
- Epsilon(\(\epsilon\)): The width of the tube. Increasing ‘\(\epsilon\)’ results in fewer support vectors and a smoother (flatter) model.
- Slack Variables (\(\xi_i, \xi_i^*\)): How far a point lies outside the upper and lower boundaries of the tube.
- C: The trade-off between the flatness of the model and the extent to which deviations larger than \(\epsilon\) are tolerated.
SVR uses a specific loss function, the ‘\(\epsilon\)-insensitive loss’, which is 0 when the error is \(< \epsilon\).
\[L_\epsilon(y, f(x)) = \max(0, |y - f(x)| - \epsilon)\]
- The solution becomes sparse, because the loss is zero for points inside the tube.
- Only the Support Vectors, i.e., points outside or on the boundary of the tube, have non-zero Lagrange multipliers (\(\alpha_i\)).
Note: \(\epsilon=0.1\) default value in scikit-learn.
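💻 A small sketch of the \(\epsilon\)-insensitive loss (the predicted values are made up; `epsilon_insensitive_loss` is just an illustrative helper):
```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    # Zero inside the tube, linear outside it
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.00, 1.00, 1.00, 1.00])
y_pred = np.array([1.05, 0.92, 1.30, 0.60])
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1))
# [0.  0.  0.2 0.3]  -> the first two points sit inside the tube and cost nothing
```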
The corresponding Dual problem is solved for the multipliers \(\alpha_i, \alpha_i^*\), subject to:
- \(\sum_{i=1}^n (\alpha_i - \alpha_i^*) = 0\)
- \(0 \leq \alpha_i, \alpha_i^* \leq C\)
- \(\alpha_i = \alpha_i^* = 0\): point is inside the tube.
- \(|\alpha_i - \alpha_i^*| > 0\) : support vectors; points on or outside the tube.
Note: \(\alpha_i , \alpha_i^* \) cannot both be non-zero for the same point; a point cannot be simultaneously above and below the tube.
- 👉 For non-linear SVR we replace the dot product \(x_i^T x_j\) with the kernel function \(K(x_i, x_j)\).
- ✅ Model needs to store only the support vectors, i.e., points where \(|\alpha_i - \alpha_i^*| > 0\).
- ⭐️ \(\xi_i = 0\) for a point that lies exactly on the boundary (a free support vector with \(0 < \alpha_i < C\)), so we can use it to calculate the bias (\(w_0\)): \(w_0 = y_i - \epsilon - \sum_j (\alpha_j - \alpha_j^*)\, x_j^T x_i\) (similarly with \(+\epsilon\) for a point on the lower boundary).
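💻 A minimal SVR sketch with scikit-learn showing the effect of the tube width \(\epsilon\) on the number of support vectors (synthetic 1-D data; counts will vary):
```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)       # noisy sine curve

# A wider tube (larger epsilon) ignores more points -> fewer support vectors
for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: support vectors = {len(svr.support_)} of {len(X)}")
```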