Primal Dual Equivalence
The Primal form is intuitive but computationally expensive in high dimensions.
➡️ Number of new features = \({d+p \choose p}\), which grows roughly as \(O(d^{p})\), where:
- d = number of dimensions (features)
- p = degree of the polynomial
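To get a feel for this growth, here is a minimal Python sketch (the values of d and p are arbitrary; the count \({d+p \choose p}\) matches what sklearn's PolynomialFeatures produces when a bias term is included):

```python
from math import comb

# Number of monomials of degree <= p in d variables: C(d + p, p).
# Illustrative values of d and p only.
for d in (10, 100, 1000):
    for p in (2, 3):
        print(f"d={d:>4}, p={p}: {comb(d + p, p):,} polynomial features")
```

Even at p = 3, a 1,000-dimensional input already expands into over a hundred million features, which is why we avoid materialising them in the Primal.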
Note: The Dual form is what enables the Kernel Trick 🪄.
The Soft-Margin Primal problem is:
\[\min_{w,\, w_0,\, \xi}\ \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i\]
Subject to constraints:
- \(y_i(w^T x_i + w_0) \geq 1 - \xi_i\): The ‘softened’ margin constraint (illustrated numerically in the sketch below).
- \(\xi_i \geq 0\): Slack/Error cannot be negative.
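As a quick numeric illustration (made-up points and a hand-picked \(w, w_0\), not a trained model), the smallest slack that satisfies the first constraint is \(\xi_i = \max(0,\ 1 - y_i(w^T x_i + w_0))\):

```python
import numpy as np

# Assumed toy data and an arbitrary, hand-picked hyperplane (not trained).
X = np.array([[2.0, 2.0], [1.0, -1.5], [-1.0, -1.0], [0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0

margins = y * (X @ w + w0)             # y_i (w^T x_i + w_0)
xi = np.maximum(0.0, 1.0 - margins)    # smallest feasible slack, always >= 0
for m, s in zip(margins, xi):
    print(f"margin = {m:+.1f}  ->  xi = {s:.1f}")
```

Points with margin \(\geq 1\) need no slack; misclassified points need \(\xi_i > 1\).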
⭐️ We start with the Soft-Margin Primal objective and incorporate its constraints using Lagrange Multipliers \((\alpha_i, \mu_i \geq 0)\)
\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]
Note: Each inequality constraint must first be rewritten in the form \(g_i \leq 0\) (e.g. \(1 - \xi_i - y_i(w^T x_i + w_0) \leq 0\)) before it is added to the Lagrangian.
👉 The Lagrangian function has two competing objectives:
- Minimization: We want to minimize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t. the primal variables (\(w, w_0, \xi_i\)) to find the hyperplane with the largest possible margin.
- Maximization: We want to maximize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t. the dual variables (\(\alpha_i, \mu_i\)) to ensure all training constraints are satisfied.
Note: A point that is simultaneously a minimum for one set of variables and a maximum for another is, by definition, a saddle point.
👉 To find the Dual, we find the saddle point by taking partial derivatives with respect to the primal variables \((w, w_0, \xi)\) and equating them to 0.
\[\frac{\partial L}{\partial w} = 0 \implies \mathbf{w = \sum_{i=1}^n \alpha_i y_i x_i}\]
\[\frac{\partial L}{\partial w_0} = 0 \implies \mathbf{\sum_{i=1}^n \alpha_i y_i = 0}\]
\[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \mu_i = 0 \implies \mathbf{0 \leq \alpha_i \leq C}\]
Substituting these back into \(L\) eliminates \(w, w_0, \xi\) and yields the Dual problem:
\[\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j)\]
subject to: \(0 \leq \alpha_i \leq C\) and \(\sum_{i=1}^n \alpha_i y_i = 0\)
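The sketch below (assumed toy data; a generic SLSQP solver rather than a dedicated SVM routine) solves this Dual numerically and recovers \(w\) from the stationarity condition:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))]
y = np.r_[-np.ones(10), np.ones(10)]
C = 1.0

K = X @ X.T                                   # linear kernel: K_ij = x_i^T x_j
Q = (y[:, None] * y[None, :]) * K             # Q_ij = y_i y_j K_ij

def neg_dual(a):                              # minimize the negated dual objective
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, C)] * len(y),      # 0 <= alpha_i <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
print("support vector indices:", np.flatnonzero(alpha > 1e-6))
```

A general-purpose solver like this is fine for 20 points but scales poorly; dedicated SVM solvers exploit the structure of this QP.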
At the optimal solution, each \(\alpha_i\) falls into one of three cases:
- \(\alpha_i = 0\), for non-support vectors (on the correct side, strictly outside the margin)
- \(0 < \alpha_i < C\), for free support vectors (exactly on the margin)
- \(\alpha_i = C\), for bounded support vectors (misclassified or inside the margin)
Note: The Sequential Minimal Optimization (SMO) algorithm is used in practice to find the optimal \(\alpha_i\) values.
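For intuition, here is a short sketch using sklearn's SVC (backed by libsvm, which implements an SMO-type solver). Its dual_coef_ attribute stores \(y_i \alpha_i\) for the support vectors only, so the absolute value recovers \(\alpha_i\):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed toy data: two overlapping blobs, so some slack is needed.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()        # alpha_i of the support vectors
print("support vectors:", len(clf.support_), "of", len(X), "points")
print("bounded SVs (alpha_i = C):", int(np.sum(np.isclose(alpha, clf.C))))
print("free SVs (0 < alpha_i < C):", int(np.sum(~np.isclose(alpha, clf.C))))
```

All remaining points have \(\alpha_i = 0\) and are simply not stored.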
🎯 To classify an unseen point \(x_q\): \(f(x_q) = \text{sign}(w^T x_q + w_0)\)
✅ From the KKT stationarity condition, we know: \(\mathbf{w} = \sum_{i=1}^n \alpha_i y_i x_i\)
👉 Substituting this into the decision function:
\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]
Note: Even if you have 1 million training points, if only 500 are support vectors, the summation only runs over 500 terms.
All other points have \(\alpha_i = 0\) and vanish.
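A minimal check of this sparsity (reusing an assumed toy setup like the one above): the decision value computed from the stored support vectors alone matches SVC's own decision_function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_q = np.array([[0.0, 2.0]])                  # assumed query point
# sum_i alpha_i y_i (x_i^T x_q) + w_0, using only the stored support vectors
score = (clf.dual_coef_ @ (clf.support_vectors_ @ x_q.T) + clf.intercept_).item()
print(score, clf.decision_function(x_q)[0])   # identical values
print("prediction:", int(np.sign(score)))
```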