Machine Learning

End of Section

2.1.1 - Meaning of 'Linear'

Meaning of ‘Linear’ in Linear Regression

Why Linear Regression is NOT 'Linear' ? | Meaning of 'Linear' in Linear Regression | Explained

What is the meaning of “linear” in Linear Regression ?

Equation of a line is of the form $y = mx + c$.
To represent a line in 2D space, we need 2 things:

m = slope or direction of the line
c = y-intercept or distance from the origin

images/machine_learning/supervised/linear_regression/line.png

Similarly,
A hyperplane is a lower (d-1) dimensional sub-space that divides a d-dimensional space into 2 distinct parts. Equation of a hyperplane:

\[y = w_1x_1 + w_2x_2+ \dots + w_nx_n + w_0 \\[5pt] \implies y = w^Tx + w_0\]

Here, ‘y’ is expressed as a linear combination of parameters - $ w_0, w_1, w_2, \dots, w_n $
Hence - Linear means the model is ‘linear’ with respect to its parameters NOT the variables.
Read more about Hyperplane

images/machine_learning/supervised/linear_regression/hyperplane.png

Polynomial Features ✅

\[ y = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^3 + w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_0\]

where, $x_3 = x_1^2 ~ and ~ x_4 = x_2^3 $
$x_3 ~ and ~ x_4 $ are just 2 new (polynomial) variables.
And, ‘y’ is still a linear combination of parameters: $w_0, w_1, \dots w_4$

images/machine_learning/supervised/linear_regression/hypersurface.png

Non-Linear Features ✅

\[ y = w_1log(x) + w_2\sqrt{x}+ w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_0\]

where, $x_1 = log(x) ~ and ~ x_2 = \sqrt{x} $
$x_1 ~ and ~ x_2 $ are are transformations of variable $x$.
And, ‘y’ is still a linear combination of parameters: $w_0, w_1, ~and~ w_2$

images/machine_learning/supervised/linear_regression/non_linear_features.png

Non-Linear Parameters ❌

\[ y = x_1^{w_1} + x_2^{w_2} + w_0 \]

Even if we take log, we get:

\[log(y) = w_1log(x_1) + w_2log(x_2) + log(w_0)\]

here, $log(w_0) $ parameter is NOT linear.

images/machine_learning/supervised/linear_regression/exponential.png

Importance of Linearity

Linearity in parameters allows to use Ordinary Least Squares (OLS) to find the best-fit coefficients by solving a set of linear equations, making estimation straightforward.

Previous: Types of ML Next: Meaning of 'Regression

End of Section

2.1.2 - Meaning of 'Regression'

Meaning of ‘Regression’ in Linear Regression

Why Linear Regression is NOT 'Regression' ? | Meaning of 'Regression' in Linear Regression

What is the meaning of “Regression” in Linear Regression ?

Regression = Going Back

Regression has a very specific historical origin that is different from its current statistical meaning.

Sir Francis Galton (19th century), cousin of Charles Darwin, coined 🪙 this term.

Observation:
Galton observed that -
the children 👶 of unusually tall ⬆️ parents 🧑‍🧑‍🧒‍🧒, tended to be shorter ⬇️ than their parents 🧑‍🧑‍🧒‍🧒,
and children 👶 of unusually short ⬇️ parents 🧑‍🧑‍🧒‍🧒, tended to be taller ⬆️ than their parents 🧑‍🧑‍🧒‍🧒.

Galton named this biological tendency - ‘regression towards mediocrity/mean’.

Galton used method of least squares to model this relationship, by fitting a line to the data 📊.

Regression = Fitting a Line
Over time ⏳, the name ‘regression’ got permanently attached to the method of fitting line to the data 📊.

Today in statistics and machine learning, ‘regression’ universally refers to the method of finding the
‘line of best fit’ for a set of data points, NOT the concept of ‘regressing towards the mean’.

images/machine_learning/supervised/linear_regression/line_of_best_fit.png

Previous: Meaning of 'Linear Next: Linear Regression

End of Section

2.1.3 - Linear Regression

Linear Regression

Linear Regression | Dimensions of Fitting Hyperplane | Explained with Example

Predict Salary

Let’s understand linear regression using an example to predict salary.

Predict the salary 💰 of an IT employee, based on various factors, such as, years of experience, domain, role, etc.

images/machine_learning/supervised/linear_regression/salary_prediction.png

Let’s start with a simple problem and predict the salary using only one input feature.

Goal 🎯 : Find the line of best fit.

Plot: Salary vs Years of Experience

\[y = mx + c = w_1x + w_0\]

Slope = $m = w_1 $
Intercept = $c = w_0$

images/machine_learning/supervised/linear_regression/salary_yoe.png

Similarly, if we include other factors/features impacting the salary 💰, such as, domain, role, etc, we get an equation of a fitting hyperplane:

\[y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0\]

where,

\[ \begin{align*} x_1 &= \text{Years of Experience} \\ x_2 &= \text{Domain (Tech, BFSI, Telecom, etc.)} \\ x_3 &= \text{Role (Dev, Tester, DevOps, ML, etc.)} \\ x_d &= d^{th} ~ feature \\ w_0 &= \text{Salary of 0 years experience} \\ \end{align*} \]

What is the dimensions of the fitting hyperplane?

Space = ’d’ features + 1 target variable = ’d+1’ dimensions
In a ’d+1’ dimensional space, we try to fit a ’d’ dimensional hyperplane.

images/machine_learning/supervised/linear_regression/fitting_hyperplane.png

Parameters/Weights of the Model

Let, data = $ {(x_i, y_i)}_{i=1}^N ; ~ x_i \in R^d , y_i \in R^d$
where, N = number of training samples.

images/machine_learning/supervised/linear_regression/linear_regression_data.png

Note: Fitting hyperplane ($y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0$) is the model.
Objective 🎯: find the parameters/weights ($w_0, w_1, w_2, \dots w_d $) of the model.

$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d x 1}}$ $ \mathbf{x_i} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_d} \end{bmatrix}_{\text{d x 1}} $ $ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}_{\text{n x 1}} $ $ X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{id} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} _{\text{n x d}} $
Prediction:

\[ \hat{y_i} = w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0 \]

\[ i^{th} \text{ Prediction } = \hat{y_i} = x_i^Tw + w_0\]

Error = Actual - Predicted

\[ \epsilon_i = y_i - \hat{y_i}\]

Goal 🎯: Minimize error between actual and predicted.

Loss 💸 Function

We can quantify the error for a single data point in following ways:

Absolute error = $|y_i - \hat{y_i}|$
Squared error = $(y_i - \hat{y_i})^2$

Issues with Absolute Value function

Not differentiable at x=0, required for gradient descent.
Constant gradient, i.e $\pm 1$, model learns at same rate, whether the error is large or small.

Cost 💰 Function

Average loss across all data points.

Mean Squared Error (MSE) =

\[ J(w) = \frac{1}{n} \sum_{i=1}^N (y_i - \hat{y_i})^2 \]

images/machine_learning/supervised/linear_regression/mean_squared_error.png

Optimization

\[ \begin{align*} \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^N (y_i - \hat{y_i})^2 \\ &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^N (y_i - (x_i^Tw + w_0))^2 \\ &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^N (y_i - (w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0))^2 \\ \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^N y_i^2 + w_0^2 + w_1^2x_{i_1}^2 + w_2^2x_{i_2}^2 + \dots + w_d^2x_{i_d}^2 + \dots \\ \end{align*} \]

The above equation is quadratic in $w_0, w_1, w_2, \dots w_d $.

Below is an image of a Paraboloid in 3D, similarly we will have a Paraboloid in ’d’ dimensions.

images/machine_learning/supervised/linear_regression/paraboloid.png

Find the Minima

In order to find the minima of the cost function we need to take its derivative w.r.t weights and equate to 0.

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w_0}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_1}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_2}} = 0 \\ \vdots \\ \frac{\partial{J(w)}}{\partial{w_d}} = 0 \\ \end{align*} \]

We have ‘d+1’ linear equations to solve for ‘d+1’ weights $w_0, w_1, w_2, \dots , w_d$.

But solving ‘d+1’ system of linear equations (called the ’normal equations’) is tedious and NOT used for practical purposes.

Parameters of Linear Regression | Issues with Absolute Error | Loss Function | Cost Function

Matrix Form of Cost Function

\[ J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \]

\[ J(w) = \frac{1}{n} (y - Xw)^2 \]

where, $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d x 1}}$ $ \mathbf{x_i} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_d} \end{bmatrix}_{\text{d x 1}} $ $ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}_{\text{n x 1}} $ $ X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{id} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} _{\text{n x d}} $
Prediction:

\[ \hat{y_i} = w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0 \]

Let’s expand the cost function J(w):

\[ \begin{align*} J(w) &= \frac{1}{n} (y - Xw)^2 \\ &= \frac{1}{n} (y - Xw)^T(y - Xw) \\ &= \frac{1}{n} (y^T - w^TX^T)(y - Xw) \\ J(w) &= \frac{1}{n} (y^Ty - w^TX^Ty - y^TXw + w^TX^TXw) \end{align*} \]

Since,$w^TX^Ty$, is a scalar, so it is equal to its transpose.

\[ w^TX^Ty = (w^TX^Ty)^T = y^TXw\]

\[ J(w) = \frac{1}{n} (y^Ty - y^TXw - y^TXw + w^TX^TXw)\]

\[ J(w) = \frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw) \]

Note: $X^2 = X^TX$ and $(AB)^T = B^TA^T$

Normal Equation

To find the minimum, take the derivative of cost function J(w) w.r.t ‘w’, and equate to 0 vector.

\[\frac{\partial{J(w)}}{\partial{w}} = \vec{0}\]\[ \begin{align*} &\frac{\partial{[\frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw)]}}{\partial{w}} = 0\\ & \implies 0 - 2X^Ty + (X^TX + X^TX)w = 0 \\ & \implies \cancel{2}X^TXw = \cancel{2} X^Ty \\ & \therefore \mathbf{w} = (X^TX)^{-1}X^T\mathbf{y} \end{align*} \]

Note: $\frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a$ and $\frac{\partial{(\mathbf{x}^TA\mathbf{x})}}{\partial{\mathbf{x}}} = (A + A^T)\mathbf{x}$

This is the closed-form solution of normal equations.

Issues with Normal Equation

Inverse may NOT exist (non-invertible).
Time complexity of calculating the inverse is O(n^3).

Pseudo Inverse

If the inverse does NOT exist then we can use the approximation of the inverse, also called Pseudo Inverse or Moore Penrose Inverse ($A^+$).

Moore Penrose Inverse ( $A^+$) is calculated using Singular Value Decomposition (SVD).

SVD of $A = U \Sigma V^T$

Pseudo Inverse $A^+ = V \Sigma^+ U^T$

Where, $\Sigma^+$ is a transpose of $\Sigma$ with reciprocals of non-zero singular values on its diagonals.
e.g:

\[ \Sigma = \begin{bmatrix} 5 & 0 & 0 \\\\ 0 & 2 & 0 \end{bmatrix} \]

\[ \Sigma ^{+}=\left[\begin{matrix}1/5&0\\ 0&1/2\\ 0&0\end{matrix}\right]=\left[\begin{matrix}0.2&0\\ 0&0.5\\ 0&0\end{matrix}\right] \]

Note: Time Complexity = O(mn^2)

Normal Equation | Issues with Inverse | Pseudo Inverse | Explained with Example

Previous: Meaning of 'Regression Next: Probabilistic Interpretation

End of Section

2.1.4 - Probabilistic Interpretation

Probabilistic Interpretation of Linear Regression

Probabilistic Interpretation of Linear Regression | MLE | Probability vs Likelihood | Explained

Probabilistic Interpretation

Explains why we use ordinary least squares error to find the model weights/parameters.

Model Assumptions

Error = Random Noise, Un-modeled effects

\[ \begin{align*} \epsilon_i = y_i - \hat{y_i} \\ \implies y_i = \hat{y_i} + \epsilon_i \\ \because \hat{y_i} = x_i^Tw \\ \therefore y_i = x_i^Tw + \epsilon_i \\ \end{align*} \]

Actual value($y_i$) = Deterministic linear predictor($x_i^Tw$) + Error term($\epsilon_i$)

Error Assumptions

Independent and Identically Distributed (I.I.D):
Each error term is independent of others.
Normal (Gaussian) Distributed:
Error follows a normal distribution with mean = 0 and a constant variance, .

This implies that the target variable itself is a random variable, normally distributed around the linear relationship.

\[(y_{i}|x_{i};w)∼N(x_{i}^{T}w,\sigma^{2})\]

images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_04_01.png

images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_05_01.png

Why is Error terms distribution considered to be Gaussian ?

Central Limit Theorem (CLT) states that for a sequence of I.I.D random variables, the distribution of the sample mean(sum) approximates to a normal distribution, regardless of the original population distribution.

images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_07_01.png

Probability Vs Likelihood

Probability (Forward View):
Quantifies the chance of observing a specific outcome given a known, fixed model.
Likelihood (Backward/Inverse View):
Inverse concept used for inference (working backward from results to causes). It is a function of the parameters and measures how ‘likely’ a specific set of parameters makes the observed data appear.

Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.'

The goal of the probabilistic interpretation is to find the parameters ‘w’ that maximize the probability (likelihood) of observing the given dataset.

Assumption: Training data is I.I.D.

\[ \begin{align*} Likelihood &= \mathcal{L}(w) \\ \mathcal{L}(w) &= p(y|x;w) \\ &= \prod_{i=1}^N p(y_i| x_i; w) \\ &= \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \end{align*} \]

Issue with Likelihood

Maximizing the likelihood function is mathematically complex due to the product term and the exponential function.

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function.

Solution: Log Likelihood

\[ \begin{align*} log \mathcal{L}(w) &= log \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \\ &= \sum_{i=1}^N log(\frac{1}{\sigma\sqrt{2\pi}}) + log (e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}}) \\ log \mathcal{L}(w) &= Nlog(\frac{1}{\sigma\sqrt{2\pi}}) - \sum_{i=1}^N \frac{(y_i-x_i^Tw)^2}{2\sigma^2} \\ \end{align*} \]

Note: The first term is constant w.r.t ‘w'.

So, we need to find parameters ‘w’ that maximize the log likelihood.

\[ \begin{align*} log \mathcal{L}(w) & \propto -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i-x_i^Tw)^2 \\ & \because \frac{1}{2\sigma^2} \text{ is constant} \\ log \mathcal{L}(w) & \propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \end{align*} \]

Ordinary Least Squares

\[ \begin{align*} log \mathcal{L}(w) &\propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \underset{w}{\mathrm{max}}\ -\sum_{i=1}^N (y_i-x_i^Tw)^2 &= \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i-x_i^Tw)^2 \end{align*} \]

Maximizing the log-likelihood is equivalent to minimizing the sum of squared errors, which is the exact objective of the ordinary least squares (OLS) method.

\[ \underset{w}{\mathrm{min}}\ J(w) = \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i - x_i^Tw)^2 \]

Previous: Linear Regression Next: Convex Function

End of Section

2.1.5 - Convex Function

Convex Function

Convex Function | Why Mean Squared Error Function is Convex ? | Explained

Convexity

Refers to a property of a function where a line segment connecting any two points on its graph lies above or on the graph itself.

A convex function is curved upwards.

Is MSE Cost function Convex ? YES ✅

MSE cost function J(w) is convex because its Hessian (H) is always positive semi definite.

\[\nabla J(\mathbf{w})=\frac{1}{n}\mathbf{X}^{T}(\mathbf{Xw}-\mathbf{y})\]

\[\mathbf{H}=\frac{\partial (\nabla J(\mathbf{w}))}{\partial \mathbf{w}^{T}}=\frac{1}{n}\mathbf{X}^{T}\mathbf{X}\]

\[\therefore \mathbf{H} = \nabla^2J(w) = \frac{1}{n} \mathbf{X}^{T}\mathbf{X}\]

images/machine_learning/supervised/linear_regression/convex_function/slide_05_01.png

MSE: Positive Semi Definite (Proof)

A matrix H is positive semi-definite if and only if for any non-zero vector ‘z’, the quadratic form $z^THz \ge 0$.

For the Hessian of J(w),

\[ z^THz = z^T(\frac{1}{n}X^TX)z = \frac{1}{n}(Xz)^T(Xz) \]

$(Xz)^T(Xz) = \lVert Xz \rVert^2$ = squared L2 norm (magnitude) of the vector

Note: The squared norm of any real-valued vector is always $\ge 0$.

Previous: Probabilistic Interpretation Next: Gradient Descent

End of Section

2.1.6 - Gradient Descent

Gradient Descent

Gradient Descent Algorithm | MSE Gradient Calculation | Learning Rate | Explained with Examples

Goal 🎯

Minimize the cost 💰function.

\[J(w)=\frac{1}{2n}(y-Xw)^{2}\]

Note: The 1/2 term is included simply to make the derivative cleaner (it cancels out the 2 from the square).

Issues with Normal Equation

Normal Equation (Closed-form solution) jumps straight to the optimal point in one step.

\[w=(X^{T}X)^{-1}X^{T}y\]

But it is not always feasible and computationally expensive 💰(due to inverse calculation 🧮)

Gradient Descent 🎢

An iterative optimization algorithm slowly nudges parameters ‘w’ towards a value that minimize the cost💰 function.

Algorithm ⚙️

Initialize the weights/parameters with random values.
Calculate the gradient of the cost function at current parameter values.
Update the parameters using the gradient. \[ w_{new} = w_{old} - \eta \frac{\partial{J(w)}}{\partial{w_{old}}} \] $ \eta $ = learning rate or step size to take for each parameter update.
Repeat 🔁 steps 2 and 3 iteratively until convergence (to minima).

images/machine_learning/supervised/linear_regression/gradient_descent/slide_07_01.png

Gradient Calculation

\[ \begin{align*} &J(w) = \frac{1}{2n} (y - Xw)^2 \\ &\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{(\frac{1}{2n} (y - Xw)^2)}}{\partial{w}} \end{align*} \]

Applying chain rule:

\[ \begin{align*} &\text{Let } u = (y - Xw) \\ &\text{Derivative of } u^2 \text{ w.r.t 'w' }= 2u.\frac{\partial{u}}{\partial{w}} \\ \frac{\partial{(\frac{1}{2n} (y - Xw)^2)}}{\partial{w}} &= \frac{1}{\cancel2n}.\cancel2(y - Xw).\frac{\partial{(y - Xw)}}{\partial{w}} \\ &= \frac{1}{n}(y - Xw).X^T.(-1) \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= \frac{1}{n}X^T(Xw - y) \end{align*} \]

Note: $\frac{∂(a^{T}x)}{∂x}=a$

Update parameter using gradient:

\[ w_{new} = w_{old} - \eta'. X^T(Xw - y) \]

Learning Rate

Most important hyper parameter of gradient descent.
Dictates the size of the steps taken down the cost function surface.

Small $\eta$ ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_01.png

Large $\eta$ ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_02.png

Learning Rate Techniques

Learning Rate Schedule:
The learning rate is decayed (reduced) over time.
Large steps initially and fine-tuning near the minimum, e.g., step decay or exponential decay.
Adaptive Learning Rate Methods:
Automatically adjust the learning rate for each parameter ‘w’ based on the history of gradients.
Preferred in modern deep learning as they require less manual tuning, e.g., Adagrad, RMSprop, and Adam.

Types of Gradient Descent 🎢

Batch, Stochastic, Mini-Batch are classified by data usage for gradient calculation in each iteration.

Batch Gradient Descent (BGD): Entire Dataset
Stochastic Gradient Descent (SGD): Random Point
Mini-Batch Gradient Descent (MBGD): Subset

Batch Gradient Descent 🎢 (BGD)

Computes the gradient using all the data points in the dataset for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(average of all ’n’ gradients)}\]

🔑Key Points:

Slow 🐢 steps towards convergence, i.e, TC = O(n).
Smooth, direct path towards minima.
Minimum number of steps/iterations.
Not suitable for large datasets; impractical for Deep Learning, as n = millions/billions.

Stochastic Gradient Descent 🎢 (SGD)

Uses only 1 data point selected randomly from dataset to compute gradient for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(gradient at any random data point)}\]

🔑Key Points:

Computationally fastest 🐇 per step; TC = O(1).
Highly noisy, zig-zag path to minima.
High variance in gradient estimation makes path to minima volatile, requiring a careful decay of learning rate to ensure convergence to minima.

Mini Batch Gradient Descent 🎢 (MBGD)

Uses small randomly selected subsets of dataset, called mini-batch, (1<k<n), to compute gradient for parameter update in each iteration. \[w_{new} = w_{old} - \eta.\text{(average gradient of ‘k' data points)}\]

🔑Key Points:

Moderate time ⏰ consumption per step; TC = O(k<n).
Less noisy, and more reliable convergence than stochastic gradient descent.
More efficient and faster than batch gradient descent.
Standard optimization algorithm for Deep Learning.Note: Vectorization on GPUs allows for parallel processing of mini-batches; also GPUs are the reason for the mini-batch size to be a power of 2.

BGD vs SGD vs Mini-BGD

images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_08_01.png

images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_09_01.png

Types of Gradient Descent | Batch GD | Stochastic GD | Mini-Batch GD | Explained with Examples

Previous: Convex Function Next: Polynomial Regression

End of Section

2.1.7 - Polynomial Regression

Polynomial Regression

Polynomial Regression | Polynomial Features | Overfitting | Underfitting | Explained with Examples

What if our data is more complex than a straight line?

We can use a linear model to fit non-linear data.

Add powers of each feature as new features, then train a linear model on this extended set of features.

Polynomial Regression

Linear: $\hat{y_i} = w_0 + w_1x_{i_1} $

Polynomial (quadratic): $\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2$

Polynomial (n-degree): $\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2 +w_3x_{i_1}^3 + \dots + w_nx_{i_1}^n $

Above polynomial can be re-written as linear equation:

\[\hat{y_i} = w_0 + w_1X_1 + w_2X_2 +w_3X_3 + \dots + w_nX_n \]

where, $X_1 = x_{i_1}, X_2 = x_{i_1}^2, X_3 = x_{i_1}^3, \dots, X_n = x_{i_1}^n$

=> the model is still linear w.r.t to its parameters/weights $w_0, w_1, w_2, \dots , w_n $.

e.g:

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_04_01.png

Strategy to find Polynomial Features

Fit a linear model to the data points.
Plot the errors.
If the variance of errors is too high, then try polynomial features.

Note: Detect and remove outliers from error distribution.

High Degree Polynomial Regression

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_07_01.png

Conclusion

Polynomial model : Over-fitting ❌
Linear model : Under-fitting ❌
Quadratic model: Generalizes best ✅

Previous: Gradient Descent Next: Data Splitting

End of Section

2.1.8 - Data Splitting

Data Splitting

Data Splitting (Train/Validation/Test Split) | Data Leakage | Stratified Sampling | Time Series Data

Why Data-Splitting is Required ?

To avoid over-fitting (memorize), so that, the model generalizes well, improving its performance on unseen data.

Train/Validation/Test Split

Training Data: Learn model parameters (Textbook + Practice problems)
Validation Data: Tune hyper-parameters (Mock tests)
Test Data: Evaluate model performance (Real (final) exam)

Data Leakage

Data leakage occurs when information from the validation or test set is inadvertently used to train 🏃‍♂️ the model.

The model ‘cheats’ by learning to exploit information it should not have access to, resulting in artificially inflated performance metrics during testing 🧪.

Typical Split Ratios

Small datasets(1K-100K): 60/20/20, 70/15/15 or 80/10/10
Large datasets(>1M): 98/1/1 would suffice, as 1% of 1M is still 10K.

Note: There is no fixed rule, its trial and error.

Imbalanced Data

Imbalanced data refers to a dataset where the target classes are represented by an unequal or highly skewed distribution of samples, such that the majority class significantly outnumbers the minority class.

Stratified Sampling

If there is class imbalance in the dataset, (e.g., 95% class A , 5% class B), a random split might result in the validation set having 99% class A.

Solution: Use stratified sampling to ensure class proportions are maintained across all splits (train🏃‍♂️/validation📋/test🧪).

Note: Non-negotiable for imbalanced data.

Time-Series ⏳ Data

In time-series ⏰ data, divide the data chronologically, not randomly, i.e, training data time ⏰ should precede validation data time ⏰.
We always train 🏃‍♂️ on past data to predict future data.

Golden rule: Never look 👀 into the future.

Previous: Polynomial Regression Next: Cross Validation

End of Section

2.1.9 - Cross Validation

Cross Validation

Cross Validation | K-Fold CV | LOOCV in Machine Learning | Explained with Examples

Core Idea 💡

Do not trust one split of the data; validate across many splits, and average the result to reduce randomness and bias.

Note: Two different splits of the same dataset can give very different validation scores.

Cross-validation

Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.

It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.

K-Fold Cross-Validation
Leave-One-Out Cross-Validation (LOOCV)

K-Fold Cross-Validation

Shuffle the dataset randomly (except time-series ⏳).
Split data into k equal subsets(folds).
Iterate through each unique fold, using it as the validation set.
Use remaining k-1 fold for training 🏃‍♂️.
Take an average of the results.Note: Common choice for k=5 or 10.

Iteration 1: [V][T][T][T][T]
Iteration 2: [T][V][T][T][T]
Iteration 3: [T][T][V][T][T]
Iteration 4: [T][T][T][V][T]
Iteration 5: [T][T][T][T][V]

Leave-One-Out Cross-Validation (LOOCV)

Model is trained 🏃‍♂️on all data points except one, and then tested 🧪on that remaining single observation.

LOOCV is an extreme case of k-fold cross-validation, where, k=n (number of data points).

Pros:
Useful for small (<1000) datasets.
Cons:
Computationally 💻 expensive 💰.

Previous: Data Splitting Next: Bias Variance Tradeoff

End of Section

2.1.10 - Bias Variance Tradeoff

Bias Variance Tradeoff

Bias Variance TradeOff in Machine Learning | Underfitting vs Overfitting | Explained with Examples

Total Error

Mean Squared Error (MSE) = $\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2$

Total Error = Bias^2 + Variance + Irreducible Error

Bias = Systematic Error
Variance = Sensitivity to Data
Irreducible Error = Sensor noise, Human randomness

Bias

Systematic error from overly simplistic assumptions or strong opinion in the model.

e.g. House 🏠 prices 💰 = Rs. 10,000 * Area (sq. ft).

Note: This is over simplified view, because it ignores, amenities, location, age, etc.

Variance

Error from sensitivity to small fluctuations 📈 in the data.

e.g. Deep neural 🧠 network trained on a small dataset.

Note: Memorizes everything, including noise.

Say a house 🏠 in XYZ street was sold for very low price 💰.

Reason: Distress selling (outlier), or incorrect entry (noise).

Note: Model will make wrong(lower) price 💰predictions for all houses in XYZ street.

Linear (High Bias), Polynomial(High Variance)

images/machine_learning/supervised/linear_regression/bias_variance_tradeoff/slide_06_01.png

Bias Variance Table

Bias-Variance Trade-Off

Goal 🎯 is to minimize total error.

Find a sweet-spot balance ⚖️ between Bias and Variance.

A good model ‘generalizes’ well, i.e.,

Is not too simple or has a strong opinion.
Does not memorize 🧠 everything in the data, including noise.

Fix 🩹 High Bias (Under-Fitting)

Make model more complex.
- Add more features, add polynomial features.
Decrease Regularization.
Train 🏃‍♂️longer, the model has not yet converged.

Fix 🩹 High Variance (Over-Fitting)

Add more data (most effective).
- Harder to memorize 🧠 1 million examples than 100.
- Use data augmentation, if getting more data is difficult.
Increase Regularization.
Early stopping 🛑, prevents memorization 🧠.
Dropout (DL), randomly kill neurons, prevents co-adaptation.
Use Ensembles.
Averaging reduces variance.

Note: Co-adaptation refers to a phenomenon where neurons in a neural network become highly dependent on each other to detect features, rather than learning independently.

Previous: Cross Validation Next: Regularization

End of Section

2.1.11 - Regularization

Regularization

Regularization (L1/L2/ElasticNet) | Why weights shrink to 0 in L1 Regularization ? | Explained

Over-Fitting

Over-Fitting happens when we have a complex model that creates highly erratic curve to fit every single data point, including the random noise.

Excellent performance on training 🏃‍♂️ data, but poor performance on unseen data.

How to Avoid Over-Fitting ?

Penalty for overly complex models, i.e, models with excessively large or numerous parameters.
Focus on learning general patterns rather than memorizing 🧠 everything, including noise in training 🏃‍♂️data.

Regularization

Set of techniques that prevents over-fitting by adding a penalty term to the model’s loss function.

\[ J_{reg}(w) = J(w) + \lambda.\text{Penalty(w)} \]

$\lambda$ = Regularization strength hyper parameter, bias-variance tradeoff knob.

High $\lambda$: High ⬆️ penalty, forces weights towards 0, simpler model.
Low $\lambda$: Weak ⬇️ penalty, closer to un-regularized model.

Regularization introduces Bias

By intentionally simplifying a model (shrinking weights) to reduce its complexity, which prevents it from overfitting.
Penalty pulls feature weights closer to zero, making the model less faithful representation of the training data’s true complexity.

Common Regularization Techniques

L2 Regularization (Ridge Regression)
L1 Regularization (Lasso Regression)
Elastic Net Regularization
Early Stopping
Dropout (Neural Networks)

L2 Regularization

\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^n w_j^2 \]

Penalty term: $\ell_2$ norm - penalizes large weights quadratically.
Pushes the weights close to 0 (not exactly 0), making models more stable by distributing importance across weights.
Splits feature importance across co-related features.
Use case: Best when most features are relevant and co-related.
Also known as Ridge regression or Tikhonov regularization.

L1 Regularization

\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^n |w_j| \]

Penalty term: $\ell_1$ norm.
Shrinks some weights exactly to 0, effectively performing feature selection, giving sparse solutions.
For a group of highly co-related features, arbitrarily selects one feature and shrinks others to 0.
Use case: Best when using high-dimensional datasets (d»n) where we suspect many features are irrelevant or redundant, or when model interpretability matters.
Also known as Lasso (Least Absolute Shrinkage and Selection Operator) regression.
Computational hack: $\frac{\partial{|w_j|}}{\partial{w_j}} = 0$, since absolute function is not differentiable at 0.

Elastic Net Regularization

\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.((1-\alpha).\sum_{j=1}^n w_j^2 + \alpha.\sum_{j=1}^n |w_j|) \]

Penalty term : linear combination of and norm.
Sparsity(feature-selection) of L1 and stability/grouping effect of L2 regularization.
Use case: Best when we have high dimensional data with correlated features and we want sparse and stable solution.

L1/L2/Elastic Net Regularization Comparison

images/machine_learning/supervised/linear_regression/regularization/slide_12_01.png

Why weights shrink exactly to 0 in L1 regularization but NOT in L2 regularization ?

Because the gradient of L1 penalty (absolute function) is a constant value, i.e, $\pm 1$, this means a constant reduction in weight at each step, making it gradually reach to 0 in finite steps.
Whereas, the derivative of L2 penalty is proportional to the weight ($2w_j$) and as the weight reaches close to 0, the gradient also becomes very small, this means that the weight will become very close to 0, but not exactly equal to 0.

L1 vs L2 Regularization Comparison

images/machine_learning/supervised/linear_regression/regularization/slide_14_01.png

Previous: Bias Variance Tradeoff Next: Regression Metrics

End of Section

2.1.12 - Regression Metrics

Regression Metrics

Previous: Regularization Next: Assumptions of Linear Regression

Regression Metrics

Quantify the difference between the actual values and the predicted values.

Mean Absolute Error(MAE)

MAE = $\frac{1}{n} \sum_{i=1}^n |y_i - \hat{y_i}|$

Treats each error equally.
- Robust to outliers.
Not differentiable x=0.
- Using gradient descent requires computational hack.
Easy to interpret, as same units as target variable.

Mean Squared Error(MSE)

MSE = $\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2$

Heavily penalizes large errors.
- Sensitive to outliers.
Differentiable everywhere.
- Used by gradient descent and most other optimization algorithms.
Difficult to interpret, as it has squared units.

Root Mean Squared Error(RMSE)

RMSE = $\sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2}$

Easy to interpret, as it has same units as target variable.
Useful when we need outlier-sensitivity of MSE but the interpretability of MAE.

R^2 Metric

Measures improvement over mean model.

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y_i})^2} \]

Good R^2 value depends upon the use case, e.g. :

Car 🚗 sale, R =0.8 is good enough.
Cancer 🧪 prediction R 0.95, as life depends on it.

Range of values:

Best value = 1
Baseline value = 0
Worst value = $- \infty$

Note: Example for bad model is all the points lie along x-axis and model predicts y-axis.

Huber Loss

Quadratic for small errors; Linear for large errors.

\[ L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \le \delta \\ \\ \delta (|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise} \end{cases} \]

Robust to outliers.
Differentiable at 0; smooth convergence to minima.
Delta ($\delta$) knob(hyper parameter) to control.
$\delta$ high: MSE
$\delta$ low: MAE

Note: Tune $\delta$: MAE, for outliers > $\delta$; MSE, for small errors < $\delta$.
e.g: = 95th percentile of errors or 1.35$\sigma$ for standard Gaussian data.

Huber loss (Green) and Squared loss (blue)

images/machine_learning/supervised/linear_regression/regression_metrics/slide_09_01.png

End of Section

2.1.13 - Assumptions

Assumptions Of Linear Regression

Assumptions of Linear Regression | Linearity | Normality | Homoscedasticity | Multicollinearity

Assumptions

Linear Regression works reliably only when certain key 🔑 assumptions about the data are met.

Linearity
Independence of Errors (No Auto-Correlation)
Homoscedasticity (Equal Variance)
Normality of Errors
No Multicollinearity
No Endogeneity (Target not correlated with the error term)

Linearity

Relationship between features and target 🎯 is linear in parameters.

Note: Polynomial regression is linear regression.
$y=w_0 +w_1x_1+w_2x_2^2 + w_3x_3^3$

Independence of Errors (No Auto-Correlation)

Residuals (errors) should not have a visible pattern or correlation with one another (most common in time-series ⏰ data).

Risk:
If errors are correlated, standard errors will be underestimated, making variables look ‘statistically significant’ when they are not.

Test:

Durbin–Watson test
Autocorrelation plots (ACF)
Residuals vs time

Homoscedasticity

Constant variance of errors; Var(ϵ|X)=σ

Risk:
Standard errors become biased, leading to unreliable hypothesis tests (t-tests, F-tests).

Test:

Breusch–Pagan test
White test

Fix:

Log transform
Weighted Least Squares(WLS)

Normality of Errors

Error terms should follow a normal distribution; (Required for small datasets.)

Note: Because of Central Limit Theorem, with a large enough sample size, this becomes less critical for estimation.

Risk: Hypothesis testing (calculating p-values and confidence intervals), we assume the error terms follow a normal distribution.

Test:

Q-Q plot
Shapiro-Wilk Test

No Multicollinearity

Features should not be highly correlated with each other.

Risk:

High correlation makes it difficult to determine the unique, individual impact of each feature. This leads to high variance in model parameter estimates, small changes in data cause large swings in parameters.
Model interpretability issues.

Test:

Variance Inflation Factor(VIF)VIF > 5 → concern, VIF > 10 → serious issue

Fix:

PCA
Remove redundant features

No Endogeneity (Exogeneity)

Error term must be uncorrelated with the features; E[ϵ|X] = 0

Risk:

Parameters will be biased and inconsistent.

Test:

Hausman Test
Durbin-Wu-Hausman (DWH) Test

Fix:

Controlled experiments.

Previous: Regression Metrics Next: Binary Classification

End of Section

2.2 - Logistic Regression

Logistic Regression

End of Section

2.2.1 - Binary Classification

Binary Classification

Binary Classification | Sigmoid Function | Why is Logistic Regression called so ?

Binary Classification

Why can’t we use Linear Regression for binary classification ?

Linear regression tries to find the best fit line, but we want to find the line or decision boundary that clearly separates the two classes.

images/machine_learning/supervised/logistic_regression/binary_classification_intro/slide_03_01.tif

Goal 🎯

Find the decision boundary, i.e, the equation of the separating hyperplane.

\[z=w^{T}x+w_{0}\]

Decision Boundary

Value of $z = \mathbf{w^Tx} + w_0$ tells us how far is the point from the decision boundary and on which side.

Note: Weight 🏋️‍♀️ vector ‘w’ is normal/perpendicular to the hyperplane, pointing towards the positive class (y=1).

Distance of Points from Separating Hyperplane

For points exactly on the decision boundary \[z = \mathbf{w^Tx} + w_0 = 0 \]
Positive (+ve) labeled points \[ z = \mathbf{w^Tx} + w_0 > 0 \]
Negative (-ve) labeled points \[ z = \mathbf{w^Tx} + w_0 < 0 \]

Missing Link 🔗

The distance of a point from the hyperplane can range from $-\infty$ to $+ \infty$.
So we need a link 🔗 to transform the geometric distance to probability.

Sigmoid Function (a.k.a Logistic Function)

Maps the output of a linear equation to a value between 0 and 1, allowing the result to be interpreted as a probability.

\[\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\]

If the distance ‘z’ is large and positive, $\hat{y} \approx 1$ (High confidence).
If the distance ‘z’ is 0, $\hat{y} = 0.5$ (Maximum uncertainty).

Why is it called Logistic Regression ?

Because, we use the logistic (sigmoid) function as the ‘link function’🔗 to map 🗺️ the continuous output of the regression into a probability space.

Previous: Assumptions of Linear Regression Next: Log Loss

End of Section

2.2.2 - Log Loss

Log Loss

Log Loss | Logistic Regression | Derivative Calculation | Sigmoid Function Derivative | Explained

Log Loss

Log Loss = $\begin{cases} -log(\hat{y_i}) & \text{if } y_i = 1 \\ \\ -log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases} $

Combining the above 2 conditions into 1 equation gives:

Log Loss = $-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]$

images/machine_learning/supervised/logistic_regression/log_loss/slide_04_01.png

Cost Function

\[J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\]

We need to find the weights 🏋️‍♀️ ‘w’ that minimize the cost 💵 function.

Gradient Descent

Weight update: \[w_{new}=w_{old}-η.\frac{∂J(w)}{∂w_{old}}\]

We need to find the gradient of log loss w.r.t weight ‘w’.

Gradient Calculation

Chain Rule:

\[\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}}\]

Cost Function: $J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]$
Prediction: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
Distance of Point: $z = \mathbf{w^Tx} + w_0$

Cost 💰Function Derivative

\[ J(w) = -\sum [ylog(\hat{y}) + (1-y)log(1-\hat{y})]\]

How loss changes w.r.t prediction ?

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{\hat{y}}} &= - [\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}] \\ &= -[\frac{y- \cancel{y\hat{y}} -\hat{y} + \cancel{y\hat{y}}}{\hat{y}(1-\hat{y})}] \\ \therefore \frac{\partial{J(w)}}{\partial{\hat{y}}} &= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \end{align*} \]

Prediction Derivative

\[ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \]

How prediction changes w.r.t distance ?

\[ \begin{align*} \frac{\partial{\hat{y}}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{z}} = \sigma'(z) \\ \sigma'(z) &= \sigma(z)(1-\sigma(z)) \\ \therefore \frac{\partial{\hat{y}}}{\partial{z}} &= \hat{y}(1-\hat{y}) \end{align*} \]

Sigmoid Derivative

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]\[ \begin{align} &\text {Let } u = 1 + e^{-z} \nonumber \\ &\implies \sigma(z) = \frac{1}{u}, \quad \text{so, } \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{z}} = \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{u}} = -\frac{1}{u^2} = - \frac{1}{(1 + e^{-z})^2} \\ &\text{and } \frac{\partial{u}}{\partial{z}} = \frac{\partial{(1 + e^{-z})}}{\partial{z}} = -e^{-z} \end{align} \]

from equations (1) & (2):

\[ \begin{align*} \because \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \\ \implies \frac{\partial{\sigma(z)}}{\partial{z}} &= - \frac{1}{(1 + e^{-z})^2}. -e^{-z} = \frac{e^{-z}}{(1 + e^{-z})^2} \\ 1 - \sigma(z) & = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} \\ \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{1}{1 + e^{-z}}.\frac{e^{-z}}{1 + e^{-z}} \\ \therefore \frac{\partial{\sigma(z)}}{\partial{z}} &= \sigma(z).(1-\sigma(z)) \end{align*} \]

Distance Derivative

\[z=w^{T}x+w_{0}\]

How distance changes w.r.t weight 🏋️‍♀️ ?

\[ \frac{\partial{z}}{\partial{w}} = \mathbf{x} \]

\[\because \frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a\]

Gradient Calculation (combined)

Chain Rule:

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w}} &= \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}} \\ &= \frac{\hat{y} - y}{\cancel{\hat{y}(1-\hat{y})}}.\cancel{\hat{y}(1-\hat{y})}.x \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= (\hat{y} - y).x \end{align*} \]

Cost 💰Function Derivative

\[\frac{\partial{J(w)}}{\partial{w}} = \sum (\hat{y_i} - y_i).x_i\]

Gradient = Error x Input

Error = $(\hat{y_i}-y_i)$: how far is prediction from the truth?
Input = $x_i$: contribution of specific feature to the error.

Gradient Descent (update)

Weight update:

\[w_{new} = w_{old} - \eta. \sum_{i=1}^n (\hat{y_i} - y_i).x_i\]

Why MSE can NOT be used as Loss Function?

Mean Squared Error (MSE) can not be used to quantify error/loss in binary classification because:

Convexity : MSE combined with Sigmoid is non-convex, so, Gradient Descent can get trapped in local minima.
Penalty: MSE does not appropriately penalize mis-classifications in binary classification.
- e.g: If actual value is class 1 but the model predicts class 0, then MSE = $(1-0)^2 = 1$, which is very low, whereas los loss = $-log(0) = \infty$

Previous: Binary Classification Next: Regularization

End of Section

2.2.3 - Regularization

Regularization in Logistic Regression

What happens to the weights of Logistic Regression if the data is perfectly linearly separable?

The weights 🏋️‍♀️ will tend towards infinity, preventing a stable solution.

The model tries to make probabilities exactly 0 or 1, but the sigmoid function never reaches these limits, leading to extreme weights 🏋️‍♀️ to push probabilities near the extremes.

Distance of Point: $z = \mathbf{w^Tx} + w_0$
Prediction: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
Log loss: $-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})] $

Why is it a problem ?

Overfitting:
Model becomes perfectly accurate on training 🏃‍♂️data but fails to generalize, performing poorly on unseen data.

Solution 🦉

Regularization:
Adds a penalty term to the loss function, discouraging weights 🏋️‍♀️ from becoming too large.

L1 Regularization

\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_1 \sum_{j=1}^n |w_j|}_{\text{L1 Regularization}} \\ \end{align*} \]

L2 Regularization

\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_2 \sum_{j=1}^n w_j^2}_{\text{L2 Regularization}} \\ \end{align*} \]

2.2.4 - Log Odds

Log Odds

Log Odds (Logit) | Relation to Sigmoid Function | Logistic Regression Logit Assumption | Explained

What is the meaning of Odds ?

Odds compare the likelihood of an event happening vs. not happening.

Odds = $\frac{p}{1-p}$

p = probability of success

Log Odds (Logit) Assumption

In logistic regression we assume that Log-Odds (the log of the ratio of positive class to negative class) is a linear function of inputs.

Log-Odds (Logit) = $log_e \frac{p}{1-p}$

Log Odds (Logit)

Log Odds = $log_e \frac{p}{1-p} = z$

\[z=w^{T}x+w_{0}\]

\[ \begin{align*} &log_{e}(\frac{p}{1-p}) = z \\ &⟹\frac{p}{1-p} = e^{z} \\ &\implies p = e^z - p.e^z \\ &\implies p = \frac{e^z}{1+e^z} \\ &\text { divide numerator and denominator by } e^z \\ &\implies p = \frac{1}{1+e^{-z}} \quad \text { i.e, Sigmoid function} \end{align*} \]

Sigmoid Function

Sigmoid function is the inverse of log-odds (logit) function, it converts the log-odds back to probability, and vice versa.

Range of Values

Probability: 0 to 1
Odds: 0 to + $\infty$
Log Odds: -$\infty$ to +$\infty$

Previous: Regularization Next: Probabilistic Interpretation

End of Section

2.2.5 - Probabilistic Interpretation

Probabilistic Interpretation of Logistic Regression

Probabilistic Interpretation of Logistic Regression | Bernoulli Assumption | MLE | Explained

Why do we use Log Loss in Binary Classification?

To understand that let’s have a look 👀at the statistical assumptions.

Bernoulli Assumption

We assume that our target variable ‘y’ follows a Bernoulli distribution, i.e, has exactly 2 outputs success/failure.

P(Y=1|X) = p
P(Y=0|X) = 1- p

Combining above 2 into 1 equation gives:

P(Y=y|X) = $p^y(1-p)^{1-y}$

Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.’

We want to find the weights 🏋️‍♀️‘w’ that maximize the likelihood of seeing the data.

Data, D = $\{ (x_i, y_i)_{i=1}^n , \quad y_i \in \{0,1\} \}$

We do this by maximizing likelihood function.

Likelihood Function

\[\mathcal{L}(w) = \prod_{i=1}^n [p_i^{y_i}(1-p_i)^{1-y_i}]\]

Assumption: Training data is I.I.D.

Problem 🦀

Multiplying many small probabilities is computationally difficult and prone to numerical errors.

Solution🦉

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function.

Log Likelihood Function

\[ \begin{align*} log\mathcal{L}(w) &= \sum_{i=1}^n log [p_i^{y_i}(1-p_i)^{1-y_i}] \\ \therefore log\mathcal{L}(w) &= \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \end{align*} \]

Maximizing the log-likelihood is same as minimizing the negative of log-likelihood.

\[ \begin{align*} \underset{w}{\mathrm{max}}\ log\mathcal{L}(w) &= \underset{w}{\mathrm{min}} - log\mathcal{L}(w) \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= - \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= \text {Log Loss} \end{align*} \]

Inference

Log Loss is not chosen arbitrarily, but it follows directly from Bernoulli assumption and MLE.

Previous: Log Odds Next: KNN Introduction

End of Section

2.3 - K Nearest Neighbors

K Nearest Neighbors (KNN)

End of Section

2.3.1 - KNN Introduction

K Nearest Neighbors Introduction

Introduction to KNN | K-Nearest Neighbors Algorithm | Explained with Examples

Issues with Linear/Logistic Regression

Parametric models:
- Rely on assumption that relationships between data points are linear.
- For polynomial regression we need to find the degree of polynomial.
Training:
- We need to train 🏃‍♂️the model for prediction.

K Nearest Neighbors

Simple: Intuitive way to classify data or predict values by finding similar existing data points (neighbors).
Non-Parametric: Makes no assumptions about the underlying data distribution.
No Training Required: KNN is a ‘lazy learner’, it does not require a formal training 🏃‍♂️ phase.

KNN Algorithm

Given a query point $x_q$ and a dataset, D = {$(x_i,y_i)_{i=1}^n, \quad x_i,y_i \in \mathbb{R}^d$}, the algorithm finds a set of ‘k’ nearest neighbors $\mathcal{N}_k(x_q) \subseteq D$.

Inference:

Choose a value of ‘k’ (hyper-parameter); odd number.
Calculate distance (Euclidean, Cosine etc.) between and every point in dataset and store in a distance list.
Sort the distance list in ascending order; choose top ‘k’ data points.
Make prediction:
- Classification: Take majority vote of ‘k’ nearest neighbors and assign label.
- Regression: Take the mean/median of ‘k’ nearest neighbors.

Note: Store entire dataset.

Time & Space Complexity

Storing Data: Space Complexity: O(nd)
Inference: Time Complexity ⏰: O(nd + nlogn)

Explanation:

Distance to all ’n’ points in ‘d’ dimensions: O(nd)
Sorting all ’n’ data points : O(nlogn)

Note: Brute force 🔨 KNN is unacceptable when ’n’ is very large, say billions.

Previous: Probabilistic Interpretation Next: KNN Optimizations

End of Section

2.3.2 - KNN Optimizations

KNN Optimizations

KNN Optimizations | Standardization | Distance Weighted KNN | Mahalanobis Distance | Explained

Optimizations

Naive KNN needs some improvements to fix some of its drawbacks.

Standardization
Distance-Weighted KNN
Mahalanobis Distance

Standardization

⭐️Say one feature is ‘Annual Income’ (0-1M), and another feature is ‘Years of Experience’ (0-40).

👉The Euclidean distance will be almost entirely dominated by income 💵.

💡So, we do standardization of each feature, such that it has a mean, $\mu$=0 and variance,$\sigma$=1.

\[z=\frac{x-\mu}{\sigma}\]

Distance-Weighted KNN

⭐️Vanilla KNN treats the 1st nearest neighbor and the k-th nearest neighbor as equal.

💡A neighbor that is 0.1units away should have more influence than a neighbor that is 10 units away.

👉We assign weight 🏋️‍♀️ to each neighbor; most common strategy is inverse of squared distance.

\[w_i = \frac{1}{d(x_q, x_i)^2 + \epsilon}\]

Improvements:

Noise/Outlier: Reduces the impact of ‘noise’ or ‘outlier’ (distant neighbors).
Imbalanced Data: Closer points dominate, mitigating impact of imbalanced data.
- e.g: If you have a query point surrounded by 2 very close ‘Class A’ points and 3 distant ‘Class B’ points, weighted 🏋️‍♀️ KNN will correctly pick ‘Class A'.

Mahalanobis Distance

⭐️Euclidean distance makes assumption that all the features are independent and provide unique information.

💡‘Height’ and ‘Weight’ are highly correlated.

👉If we use Euclidean distance, we are effectively ‘double-counting’ the size of the person.

🏇Mahalanobis distance measures distance in terms of standard deviations from the mean, accounting for the covariance between features.

\[d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}\]

$\Sigma$: Covariance matrix of the data

If $\Sigma$ is identity matrix, Mahalanobis distance reduces to Euclidean distance.
If $\Sigma$ is a diagonal matrix, Mahalanobis distance reduces to Normalized Euclidean distance.

Runtime Issue

🦀Naive KNN shifts all computation 💻 to inference time ⏰, and it is very slow.

To find the neighbor for one query, we must touch every single bit of the ‘nxd’ matrix.
If n=10^9,a single query would take seconds, but we need milliseconds.

Advanced Optimizations

Distance Weighted KNN
- K-D Trees (d<20): Recursively partitions space into axis-aligned hyper-rectangles. O(log N ) search.
- Ball Trees : High dimensional data; Haversine distance for geospatial 🌎 data.
Approximate Nearest Neighbors (ANN)
- Locality Sensitive Hashing (LSH): Uses ‘bucketizing’ 🗑️ hashes. Points that are close have a high probability of having the same hash.
- Hierarchical Navigable Small World (HNSW); Graph of vectors; Search is a ‘greedy walk’ across levels.
Product Quantization (Reduce memory 🧠 footprint 👣 of high dimensional vectors)
- ScaNN (Google)
- FAISS (Meta)
Dimensionality Reduction (Mitigate ‘Curse of Dimensionality’)
- PCA

Previous: KNN Introduction Next: Curse Of Dimensionality

End of Section

2.3.3 - Curse Of Dimensionality

Curse Of Dimensionality

Curse of Dimensionality | Cosine Similarity | Normalization | Explained with Examples

Euclidean Distance

While Euclidean distance(L norm) is the most frequently discussed, ‘Curse of Dimensionality’ impacts all Minkowski norms ($L_p$)

\[L_p = (\sum |x_i|^p)^{\frac{1}{p}} \]

Note: ‘Curse of Dimensionality’ is largely a function of the exponent (p) in the distance calculation.

Issues with High Dimensional Data

Coined 🪙 by mathematician John Bellman in the 1960s while studying dynamic programming.

High dimensional data created following challenges:

Distance Concentration
Data Sparsity
Exponential Sample Requirement

Distance Concentration

💡Consider a hypercube in d-dimensions of side length = 1; Volume = $1^d$ = 1
🧊 A smaller inner cube with side length = 1 - $\epsilon$ ; Volume = $(1 -\epsilon)^d$

\[\lim_{d \rightarrow \infty} (1 - \epsilon)^d = 0\]

🧐 This implies that almost all the volume of the high-dimensional cube lies near the ‘crust’.
👉e.g: if $\epsilon$= 0.01, d = 500; Volume of inner cube = $(1 -0.01)^{500}$ = $0.99^{500}$ = 0.006 = 0.6%
🤔Consequently, all points become nearly equidistant, and the concept of ‘nearest’ or ‘neighborhood’ loses its meaning.

Data Sparsity

⭐️The volume of the feature space increases exponentially with each added dimension.

👉To maintain the same data density found in a 1D space with 10 points, we would need $10^{10}$(10 billion) points in 10D space.

💡Because real-world datasets are rarely this large, the data becomes “sparse,” making it difficult to find truly similar neighbors.

Exponential Sample Requirement

⭐️To maintain a reliable result, the amount of training data needed must grow exponentially with the number of dimensions.

👉Without this growth, the model is highly prone to overfitting, where it learns from noise in the ‘sparse’ data rather than actual underlying patterns.

Note: For modern embeddings (often 768 or 1536 dimensions), it is mathematically impossible to collect enough data to ‘fill’ the space.

Solution

Cosine Similarity
Normalization

Cosine Similarity

Cosine similarity measures the cosine of the angle between 2 vectors.

\[\text{cos}(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{\sum_{i=1}^{D} A_i B_i}{\sqrt{\sum_{i=1}^{D} A_i^2} \sqrt{\sum_{i=1}^{D} B_i^2}}\]

Note: Cosine similarity mitigates the ‘curse of dimensionality" problem.

Normalization

⭐️Normalize the vector, i.e, make its length =1, a unit vector.

💡By normalizing, we project all points onto the surface of a unit hypersphere.

We are no longer searching in the ‘empty’ high-dimensional volume of a hypercube.
Now, we are searching on a constrained manifold (the shell).

Note: By normalizing, we move the data from the volume of the D-dimensional space onto the surface of a (D-1)-dimensional hypersphere.

Euclidean Distance Squared of Normalized Vectors:

\[ \begin{align*} \|A - B\|^2 &= (A - B) \cdot (A - B) \\ &= \|A\|^2 + \|B\|^2 - 2(A \cdot B)\\ \because \|A\| &= \|B\| = 1 \\ \|A - B\|^2 &= 1 + 1 - 2\cos(\theta) \\ \therefore \|A - B\|^2 &= 2(1 - \cos(\theta))\\ \end{align*} \]

Note: This formula proves that maximizing ‘Cosine similarity’ is identical to minimizing ‘Euclidean distance’ on the hypersphere.

Previous: KNN Optimizations Next: Bias Variance Tradeoff

End of Section

2.3.4 - Bias Variance Tradeoff

Bias Variance Tradeoff

K Nearest Neighbors (KNN) Bias-Variance Tradeoff | Explained with Examples

KNN Dataset

Let’s use this dataset to understand the impact of number of neighbours ‘k’.

images/machine_learning/supervised/k_nearest_neighbors/bias_variance_tradeoff/slide_01_01.tif

High Bias, Low Variance

👉If ‘k’ is very large, say, k=n,

model simply predicts the majority class of the entire dataset for every query point , i.e, under-fitting.

High Variance, Low Bias

👉If ‘k’ is very small, say, k=1,

model is highly sensitive to noise or outliers, as it looks at only 1 nearest neighbor, i.e, over-fitting.

‘K' Hyper-Parameter Tuning

Let’s plot Error vs ‘K’ neighbors:

images/machine_learning/supervised/k_nearest_neighbors/bias_variance_tradeoff/slide_04_01.tif

Over-Fitting Vs Under-Fitting

Figure 1: k=1, Over-fitting
Figure 2: k=n, Under-fitting
Figure 3: k=11, Lowest Error (Optimum)

Previous: Curse Of Dimensionality Next: Decision Trees Introduction

End of Section

2.4 - Decision Tree

Decision Tree

End of Section

2.4.1 - Decision Trees Introduction

Decision Trees Introduction

Decision Trees in Machine Learning | When to use Decision Trees ? | Explained with Examples

How do we classify the below dataset ?

images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_01_01.tif

💡It can be written as nested 🕸️ if else statements.

e.g: To classify the left bottom corner red points we can write:
👉if (FeatureX1 <1 & FeatureX2 <1)

⭐️Extending the logic for all, we have an if else ladder like below:

images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_03_01.tif

👉Final decision boundaries will be something like below:

images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_04_01.tif

What is a Decision Tree 🌲?

Non-parametric model.
Recursively partitions the feature space.
Top-down, greedy approach to iteratively select feature splits.
Maximize purity of a node, based on metrics, such as, Information Gain 💵 or Gini 🧞‍♂️Impurity.

Note: We can extract the if/else logic of the decision tree and write in C++/Java for better performance.

Computation 💻

Decision Tree 🌲 Analysis

⭐️Building an optimal decision tree 🌲 is a NP-Hard problem.
👉(Time Complexity: Exponential; combinatorial search space)

Pros
- No standardization of data needed.
- Highly interpretable.
- Good runtime performance.
- Works for both classification & regression.
Cons
- Number of dimensions should not be too large. (Curse of dimensionality)
- Overfitting.

When to use Decision Tree 🌲

As base learners in ensembles, such as, bagging(RF), boosting(GBDT), stacking, cascading, etc.
As a baseline, interpretable, model or for quick feature selection.
Runtime performance is important.

Previous: Bias Variance Tradeoff Next: Purity Metrics

End of Section

2.4.2 - Purity Metrics

Purity Metrics

Decision Trees Purity Metrics | Information Gain | Gini Impurity | Explained with Examples

Pure Leaf 🍃 Node ?

Decision trees recursively partition the data based on feature values.

images/machine_learning/supervised/decision_trees/purity_metrics/slide_02_01.tif

Pure Leaf 🍃 Node: Terminal node where every single data point belongs to the same class.

💡Zero Uncertainty.

So, what should be the logic to partition the data at each step or each node ?

The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

So, what is information gain ?
How do we reduce uncertainty ?

Let’s understand few terms first, before we understand information gain.

Entropy

Measure ⏱ of uncertainty, randomness, or impurity in a data.

\[H(S)=-\sum _{i=1}^{n}p_{i}\log(p_{i})\]

Binary Entropy:

images/machine_learning/supervised/decision_trees/purity_metrics/slide_04_01.png

Surprise 😮 Factor

💡Entropy can also be viewed as the ‘average surprise'.

A highly certain event provides little information when it occurs (low surprise).
An unlikely event provides a lot of information (high surprise).

Information Gain💰

⭐️Measures ⏱ the reduction in entropy (uncertainty) achieved by splitting a dataset based on a specific attribute.

\[IG=Entropy(Parent)-\left[\frac{N_{left}}{N_{parent}}Entropy(Child_{left})+\frac{N_{right}}{N_{parent}}Entropy(Child_{right})\right] \]

Note: The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

Gini 🧞‍♂️Impurity

⭐️Measures ⏱ the probability of an element being incorrectly classified if it were randomly labeled according to the distribution of labels in a node.

\[Gini(S)=1-\sum_{i=1}^{n}(p_{i})^{2}\]

Range: 0 (Pure) - 0.5 (Maximum impurity)

Note: Gini is used in libraries like Scikit-Learn (as the default), because it avoids the computationally expensive 💰 log function.

Gini Impurity Vs Entropy

Gini Impurity is a first-order approximation of Entropy.
For most of the real-world cases, choosing one over the other results in the exact same tree structure or negligible differences in accuracy.
When we plot the two functions, they follow nearly identical shapes.

Previous: Decision Trees Introduction Next: Decision Trees For Regression

End of Section

2.4.3 - Decision Trees For Regression

Decision Trees For Regression

Decision Trees for Regression | When to use Decision Trees for Regression ? | Explained

Decision Trees for Regression

Decision Trees can also be used for Regression tasks but using a different metrics.

⭐️Metric:

Mean Squared Error (MSE)
Mean Absolute Error (MAE)

👉Say we have a following dataset, that we need to fit using decision trees:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_02_01.tif

👉Decision trees try to find the decision splits, building step functions that approximate the actual curve, as shown below:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_03_01.tif

👉Internally the decision tree (if else ladder) looks like below:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_04_01.tif

Can we use decision trees for all kinds of regression? Or is there any limitation ?

Decision trees cannot predict values outside the range of the training data, i.e, extrapolation.

Let’s understand the interpolation and extrapolation cases one by one.

Interpolation ✅

⭐️Predicting values within the range of features and targets observed during training 🏃‍♂️.

Trees capture discontinuities perfectly, because they are piece-wise constant.
They do not try to force a smooth line where a ‘jump’ exists in reality.

e.g: Predicting a house 🏡 price 💰 for a 3-BHK home when you have seen 2-BHK and 4-BHK homes in that same neighborhood.

Extrapolation ❌

⭐️Predicting values outside the range of training 🏃‍♂️data.

Problem:
Because a tree outputs the mean of training 🏃‍♂️ samples in a leaf, it cannot predict a value higher than the highest ‘y’ it saw during training 🏃‍♂️.

Flat-Line: Once a feature ‘X’ goes beyond the training boundaries, the tree falls into the same ‘last’ leaf forever.

e.g: Predicting the price 💰 of a house 🏡 in 2026 based on data from 2010 to 2025.

Previous: Purity Metrics Next: Regularization

End of Section

2.4.4 - Regularization

Regularization

Regularization in Decision Trees | Pre Pruning | Post Pruning | Minimal Cost Complexity Pruning

Over-Fitting

⭐️Because trees🌲 are non-parametric and ‘greedy’💰, they will naturally try to grow 📈 until every leaf 🍃 is pure, effectively memorizing noise and outliers rather than learning generalizable patterns.

Tree 🌲 Size

👉As the tree 🌲 grows, the amount of data in each subtree decreases 📉, leading to splits based on statistically insignificant samples.

Regularization Techniques

Pre-Pruning ✂️ &
Post-Pruning ✂️

Pre-Pruning ✂️

⭐️ ‘Early stopping’ heuristics (hyper-parameters).

max_depth: Limits how many levels of ‘if else’ the tree can have; most common.
min_samples_split: A node will only split, if it has at least ‘N’ samples; smooths the model (especially in regression), by ensuring predictions are based on an average of multiple points.
max_leaf_nodes: Limiting the number of leaves reduces the overall complexity of the tree, making it simpler and less likely to memorize the training data’s noise.
min_impurity_decrease: A split is only made if it reduces the impurity (Gini/MSE) by at least a certain threshold.

Max Depth Hyper Parameter Tuning

Below is an example for one of the hyper-parameter’s max_depth tuning.

As we can see below the cross-validation error decreases till depth=6 and after that reduction in error is not so significant.

images/machine_learning/supervised/decision_trees/regularization/slide_05_01.tif

Post-Pruning ✂️

Let the tree🌲 grow to its full depth (overfit) and then ‘collapse’ nodes that provide little predictive value.

Most common algorithm:

Minimal Cost Complexity Pruning

Minimal Cost Complexity Pruning ✂️

💡Define a cost-complexity 💰 measure that penalizes the tree 🌲 for having too many leaves 🍃.

\[R_\alpha(T) = R(T) + \alpha |T|\]

R(T): total misclassification rate (or MSE) of the tree
|T|: number of terminal nodes (leaves)
$\alpha$: complexity parameter (the ‘tax’ 💰 on complexity)

Logic:

If $\alpha$=0, the tree is the original overfit tree.
As $\alpha$ increases 📈, the penalty for having many leaves grows 📈.
To minimize the total cost 💰, the model is forced to prune branches🪾 that do not significantly reduce R(T).
Use cross-validation to find the ‘sweet spot’ $\alpha$ that minimizes validation error.

Previous: Decision Trees For Regression Next: Bagging

End of Section

2.4.5 - Bagging

Bagging

Bagging (Bootstrap Aggregation) | Decision Trees | Explained with Regression/Classification Examples

Issues with Decision Tree ?

A single decision tree is highly sensitive to the specific training dataset.
Small changes, such as, a few different rows or the presence of an outlier, can lead to a completely different tree structure.
Unpruned decision trees often grow until they perfectly classify the training set, essentially ‘memorizing’ noise and outliers, i.e, high variance, rather than finding general patterns.

What does Bagging mean 🤔?

Bagging = ‘Bootstrapped Aggregation’

Bagging 🎒is a parallel ensemble technique that reduces variance (without significantly increasing the bias) by training multiple versions of the same model on different random subsets of data and then combining their results.

Note: Bagging uses deep trees (overfit) and combines them to reduce variance.

Bootstrapping

Bootstrapping = ‘Without external help’

Given a training 🏃‍♂️set D of size ’n’, we create B new training sets D by sampling ’n’ observations from D ‘with replacement'.

Bootstrapped Samples

💡Since, we are sampling ‘with replacement’, so, some data points may be picked multiple times, while others may not be picked at all.

The probability that a specific observation is not selected in a bootstrap sample of size ’n’ is: \[\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e} \approx 0.368\]

🧐This means each tree is trained on roughly 63.2% of the unique data, while the remaining 36.8% (the Out-of-Bag or OOB set) can be used for cross validation.

Aggregation

⭐️Say we train ‘B’ models (base-learners), each with variance $\sigma^2$ .

👉Average variance of ‘B’ models (trees) if all are independent:

\[Var(X)=\frac{\sigma^{2}}{B}\]

👉Since, bootstrap samples are derived from the same dataset, the trees are correlated with some correlation coefficient ‘$\rho$'.

So, the true variance of bagged ensemble is:

\[Var(f_{bag}) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]

$\rho$= 0; independent models, most reduction in variance.
$\rho$= 1; fully correlated models, no improvement in variance.
0<$\rho$<1; As correlation decreases, variance reduces .

Why Bagging is Better than a Single Model?

images/machine_learning/supervised/decision_trees/bagging/slide_09_01.png

Previous: Regularization Next: Random Forest

End of Section

2.4.6 - Random Forest

Random Forest

Random Forest | Decision Trees | Feature Sub-Sampling | Correlation Impact | Explained with Examples

Problem with Bagging

💡If one feature is extremely predictive (e.g., ‘Area’ for house prices), almost every bootstrap tree will split on that feature at the root.

👉This makes the trees(models) very similar, leading to a high correlation ‘$\rho$’.

\[Var(f_{bagging})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Feature Sub Sampling

💡Choose a random subset of ‘m’ features from the total ‘d’ features, reducing the correlation ‘$\rho$’ between trees.

👉By forcing trees to split on ‘sub-optimal’ features, we intentionally increase the variance of individual trees; also the bias is slightly increased (simpler trees).

Standard heuristics:

Classification: $m = \sqrt{d}$
Regression: $m = \frac{d}{3}$

Math of De-Correlation

💡Because ‘$\rho$’ is the dominant factor in the variance of the ensemble when B is large, the overall ensemble variance Var($f_{rf}$) drops significantly lower than standard Bagging.

\[Var(f_{rf})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Over-Fitting

💡A Random Forest will never overfit by adding more trees (B).

It only converges to the limit: ‘$\rho\sigma^2$’.

Overfitting is controlled by:

depth of the individual trees.
size of the feature subset ‘m'.

When to use Random Forest ?

High Dimensionality: 100s or 1000s of features; RF’s feature sampling prevents a few features from masking others.
Tabular Data (with Complex Interactions): Captures non-linear relationships without needing manual feature engineering.
Noisy Datasets: The averaging process makes RF robust to outliers (especially if using min_samples_leaf).
Automatic Validation: Need a quick estimate of generalization error without doing 10-fold CV (via OOB Error).

Previous: Bagging Next: Extra Trees

End of Section

2.4.7 - Extra Trees

Extra Trees

Extremely Randomized (Extra) Trees | Decision Trees | Why Random Splitting gives Performance Gain?

Issue with Decision Trees/Random Forest

💡In a standard Decision Tree or Random Forest, the algorithm searches for the optimal split point (the threshold ’s’) that maximizes Information Gain or minimizes MSE.

👉This search is:

computationally expensive (sort + mid-point) and
tends to follow the noise in the training 🏃‍♂️data.

Adding Randomness

Adding randomness (right kind) in ensemble averaging reduces correlation/variance.

\[Var(f_{bag})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Extremely Randomized (ExtRa) Trees

Random Thresholds: Instead of searching for the best split point (computationally expensive 💰) for a feature, it picks a threshold at random from a uniform distribution between the feature’s local minimum and maximum.
Entire Dataset: Uses entire training dataset (default) for every tree; no bootstrapping.
Random Feature Subsets: Random subset of m<n features is used in each decision tree.

Mathematical Intuition

\[Var(f_{et})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Picking thresholds randomly has two effects:

Structural correlation between trees becomes extremely low.
Individual trees are ‘weaker’ and have higher bias than a standard optimized tree.

👉The massive drop in ‘$\rho$’ often outweighs the slight increase in bias, leading to an overall ensemble that is smoother and more robust to noise than a standard Random Forest.

Note: Extra Trees are almost always grown to full depth, as they may need extra splits to find the same decision boundary.

When to use Extra Trees ?

Performance: Significantly faster to train, as it does not sort data to find optimal split.
Note: If we are working with billions of rows or thousands of features, ET can be 3x to 5x faster than a Random Forest(RF).
Robustness to Noise: By picking thresholds randomly, tends to ‘handle’ the noise more effectively than RF.
Feature Importance: Because ET is so randomized, it often provides more ‘stable’ feature importance scores.

Note: It is less likely to favor a high-cardinality feature (e.g. zip-code) just because it has more potential split points.

Previous: Random Forest Next: Boosting

End of Section

2.4.8 - Boosting

Boosting

Boosting Ensemble Technique | Why is Boosting Better? | Explained with Examples

Intuition 💡

⭐️In Bagging 🎒we trained multiple strong(over-fit, high variance) models (in parallel) and then averaged them out to reduce variance.

💡Similarly, we can train many weak(under-fit, high bias) models sequentially, such that, each new model corrects the errors of the previous ones to reduce bias.

Boosting

⚔️ An ensemble learning approach where multiple ‘weak learners’ (typically simple models like shallow decision trees 🌲 or ‘stumps’) are sequentially combined to create a single strong predictive model.

⭐️The core principle is that each subsequent model focuses 🎧 on correcting the errors made by its predecessors.

Why is Boosting Better ?

👉Boosting generally achieves better predictive performance because it actively reduces bias by learning 📖from ‘past mistakes’, making it ideal for achieving state-of-the-art 🖼️ results.

Popular Boosting Algorithms

AdaBoost(Adaptive Boosting)
Gradient Boosting Machine (GBM)
- XGBoost
- LightGBM (Microsoft)
- CatBoost (Yandex)

Previous: Extra Trees Next: AdaBoost

End of Section

2.4.9 - AdaBoost

AdaBoost

Adaptive Boosting (AdaBoost) Algorithm | Explained with Example

Adaptive Boosting (AdaBoost)

💡Works by increasing 📈 the weight 🏋️‍♀️ of misclassified data points after each iteration, forcing the next weak learner to ‘pay more attention’🚨 to the difficult cases.

⭐️ Commonly used for classification.

Decision Stumps

👉Weak learners are typically ‘Decision Stumps’, i.e, decision trees🌲with a depth of only one (1 split, 2 leaves 🍃).

images/machine_learning/supervised/decision_trees/adaboost/slide_04_01.png

Algorithm

Assign an equal weight 🏋️‍♀️to every data point; $w_i = 1/n$, where ’n’=number of samples.
Build a decision stump that minimizes the weighted classification error.
Calculate total error; $E_m = \Sigma w_i$.
Determine ‘amount of say’, i.e, the weight 🏋️‍♀️ of each stump in final decision. \[\alpha_m = \frac{1}{2}ln\left( \frac{1-E_m}{E_m} \right)\]
- Low error results in a high positive $\alpha$ (high influence).
- 50% error (random guessing) results in an $\alpha = 0$ (no influence).
Update sample weights 🏋️‍♀️.
- Misclassified samples: Weight 🏋️‍♀️ increases by $e^{\alpha_m}$.
- Correctly classified samples: Weight 🏋️‍♀️ decreases by $e^{-\alpha_m}$.
- Normalization: All new weights 🏋️‍♀️ are divided by their total sum so they add up back to 1.
Iterate for a specified number of estimators (n_estimators).

Final Prediction 🎯

👉 To classify a new data point, every stump makes a prediction (+1 or -1).

These are multiplied by their respective ‘amount of say’ $\alpha_m$ and summed.

\[H(x)=sign\sum_{m=1}^{M}\alpha_{m}⋅h_{m}(x)\]

👉 If the total weighted 🏋️‍♀️ sum is positive, the final class is +1; otherwise -1.

Note: Sensitive to outliers; Because AdaBoost aggressively increases weights 🏋️‍♀️ on misclassified points, it may ‘over-focus’ on noisy outliers, hurting performance.

Previous: Boosting Next: Gradient Boosting Machine

End of Section

2.4.10 - Gradient Boosting Machine

Gradient Boosting Machine Introduction

Gradient Boosting Machine (GBM) | Pseudo Residuals | Gradient Descent in Function Space | Explained

Idea 💡

👉GBM fits new models to the ‘residual errors’ (the difference between actual and predicted values) of the previous models.

Gradient Boosting Machine

GBM treats the final model $F_m(x)$ as weighted 🏋️‍♀️ sum of ‘m’ weak learners:

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Weak\ Learner\ }h_{m}(x)}\]

$F_0(x)$: The initial base model (usually a constant).
M: The total number of boosting iterations (number of trees).
$\gamma_{jm}$(Leaf Weight): The optimized value for leaf in tree .
$\nu$(Nu): The Learning Rate or Shrinkage; prevent overfitting.
$\mathbb{I}(x\in R_{jm})$: ‘Indicator Function’; It is 1 if data point falls into leaf of the tree, and 0 otherwise.
$R_{jm}$(Regions): Region of $j_{th}$ leaf in $m_{th}$tree.

Gradient Descent in Function Space

📍In Gradient Descent, we update parameters ‘$\Theta$';

📍In GBM, we update the predictions F(x) themselves.

🦕We move the predictions in the direction of the negative gradient of the loss function L(y, F(x)).

🎯We want to minimize loss:

\[\mathcal{L}(F) = \sum_{i=1}^n L(y_i, F(x_i))\]

✅ In parameter optimization we update weights 🏋️‍♀️:

\[w_{t+1} = w_t - \eta \cdot \nabla_{w}\mathcal{L}(w_t)\]

✅ In gradient boosting, we update the prediction function:

\[F_m(x) = F_{m-1}(x) -\eta \cdot \nabla_F \mathcal{L}(F_{m-1}(x))\]

➡️ The gradient is calculated w.r.t. predictions, not weights.

Pseudo Residuals

In GBM we can use any loss function as long as it is differentiable, such as, MSE, log loss, etc.

Loss(MSE) = $(y_i - F_m(x_i))^2$

\[\frac{\partial L}{F_m(x_i)} = -2 (y-F_m(x_i))\]

\[\implies \frac{\partial L}{F_m(x_i)} \propto - (y-F_m(x_i))\]

👉Pseudo Residual (Error) = - Gradient

Why initial model is Mean model for MSE ?

💡To minimize loss, take derivative of loss function w.r.t ‘$\gamma$’ and equate to 0:

\[F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\]

MSE Loss = $\mathcal{L}(y_i, \gamma) = \sum_{i=1}^n(y_i -\gamma)^2$

\[ \begin{aligned} &\frac{\partial \mathcal{L}(y_i, \gamma)}{\partial \gamma} = -2 \cdot \sum_{i=1}^n(y_i -\gamma) = 0 \\ &\implies \sum_{i=1}^n (y_i -\gamma) = 0 \\ &\implies \sum_{i=1}^n y_i = n.\gamma \\ &\therefore \gamma = \frac{1}{n} \sum_{i=1}^n y_i \end{aligned} \]

Why optimal leaf 🍃value is the ‘Mean' of the residuals for MSE ?

💡To minimize cost, take derivative of cost function w.r.t ‘$\gamma$’ and equate to 0:

Cost Function = $J(\gamma )$

\[J(\gamma )=\sum _{x_{i}\in R_{jm}}\frac{1}{2}(y_{i}-(F_{m-1}(x_{i})+\gamma ))^{2}\]

We know that:

\[ r_{i}=y_{i}-F_{m-1}(x_{i})\]

\[\implies J(\gamma )=\sum _{x_{i}\in R_{jm}}\frac{1}{2}(r_{i}-\gamma )^{2}\]

\[\frac{d}{d\gamma }\sum _{x_{i}\in R_{jm}}\frac{1}{2}(r_{i}-\gamma )^{2}=\sum _{x_{i}\in R_{jm}}-(r_{i}-\gamma )=0\]

\[\implies \sum _{x_{i}\in R_{jm}}\gamma -\sum _{x_{i}\in R_{jm}}r_{i}=0\]

👉Since, $\gamma$ is constant for all $n_j$ samples in the leaf, $\sum _{x_{i}\in R_{jm}}\gamma =n_{j}\gamma $

\[n_{j}\gamma =\sum _{x_{i}\in R_{jm}}r_{i}\]

\[\implies \gamma =\frac{\sum _{x_{i}\in R_{jm}}r_{i}}{n_{j}}\]

Therefore, $\gamma$ = average of all residuals in the leaf.

Note: $R_{jm}$(Regions): Region of $j_{th}$ leaf in $m_{th}$tree.

Previous: AdaBoost Next: GBDT Algorithm

End of Section

2.4.11 - GBDT Algorithm

GBDT Algorithm

Gradient Boosted Decision Trees (GBDT) Algorithm | Step-by-Step Explanation

Gradient Boosted Decision Tree (GBDT)

Gradient Boosted Decision Tree (GBDT) is a decision tree based implementation of Gradient Boosting Machine (GBM).

GBM treats the final model $F_m(x)$ as weighted 🏋️‍♀️ sum of ‘m’ weak learners (decision trees):

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Decision\ Tree\ }h_{m}(x)}\]

$F_0(x)$: The initial base model (usually a constant).
M: The total number of boosting iterations (number of trees).
$\gamma_{jm}$(Leaf Weight): The optimized value for leaf in tree .
$\nu$(Nu): The Learning Rate or Shrinkage; prevent overfitting.
$\mathbb{I}(x\in R_{jm})$: ‘Indicator Function’; It is 1 if data point falls into leaf of the tree, and 0 otherwise.
$R_{jm}$(Regions): Region of $j_{th}$ leaf in $m_{th}$tree.

Algorithm

Step 1: Initialization.
Step 2: Iterative loop 🔁 : for i=1 to m.
2.1: Calculate pseudo residuals ‘$r_{im}$'.
2.2: Fit a regression tree 🌲‘$h_m(x)$'.
2.3:Compute leaf 🍃weights 🏋️‍♀️ ‘$\gamma_{jm}$'.
2.4:Update the model.

Step 1: Initialization

Start with a function that minimizes our loss function;
for MSE its mean.

\[F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\]

MSE Loss = $\mathcal{L}(y_i, \gamma) = \sum_{i=1}^n(y_i -\gamma)^2$

Step 2.1: Calculate pseudo residuals ‘$r_{im}$'

Find the ‘gradient’ of error;
Tells us the direction and magnitude needed to reduce the loss.

\[r_{im}=-\frac{∂L(y_{i},F(x_{i}))}{∂F(x_{i})}_{F(x)=F_{m-1}(x)}\]

Step 2.2: Fit regression tree ‘$h_m(x)$'

Train a tree to predict the residuals ‘$h_m(x)$';

Tree 🌲 partitions the data into leaves 🍃 ($R_{jm}$regions )

Step 2.3: Compute leaf weights ‘$\gamma_{jm}$'

Within each leaf 🍃, we calculate the optimal value ‘$\gamma_{jm}$’ that minimizes the loss for the samples in that leaf 🍃.

\[\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)\]

➡️ The optimal leaf 🍃value is the ‘Mean’(MSE) of the residuals; $\gamma = \frac{\sum r_i}{n_j}$

Step 2.4: Update the model.

Add the new ‘correction’ to the existing model, scaled by the learning rate.

\[F_{m}(x)=F_{m-1}(x)+\nu \cdot \underbrace{\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})}_{h_{m}(x)}\]

$\mathbb{I}(x\in R_{jm})$: ‘Indicator Function’; It is 1 if data point falls into leaf of the tree, and 0 otherwise.

Previous: Gradient Boosting Machine Next: GBDT Example

End of Section

2.4.12 - GBDT Example

GBDT Example

Gradient Boosted Decision Tree (GBDT) Algorithm | Explained with Example

Gradient Boosted Decision Tree (GBDT)

Gradient Boosted Decision Tree (GBDT) is a decision tree based implementation of Gradient Boosting Machine (GBM).

GBM treats the final model $F_m(x)$ as weighted 🏋️‍♀️ sum of ‘m’ weak learners (decision trees):

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Decision\ Tree\ }h_{m}(x)}\]

$F_0(x)$: The initial base model (usually a constant).
M: The total number of boosting iterations (number of trees).
$\gamma_{jm}$(Leaf Weight): The optimized value for leaf in tree .
$\nu$(Nu): The Learning Rate or Shrinkage; prevent overfitting.
$\mathbb{I}(x\in R_{jm})$: ‘Indicator Function’; It is 1 if data point falls into leaf of the tree, and 0 otherwise.
$R_{jm}$(Regions): Region of $j_{th}$ leaf in $m_{th}$tree.

Algorithm

Step 1: Initialization.
Step 2: Iterative loop 🔁 : for i=1 to m.
2.1: Calculate pseudo residuals ‘$r_{im}$'.
2.2: Fit a regression tree 🌲‘$h_m(x)$'.
2.3:Compute leaf 🍃weights 🏋️‍♀️ ‘$\gamma_{jm}$'.
2.4:Update the model.

Predict House Prices

images/machine_learning/supervised/decision_trees/gbdt_example/house_price_table.png

👉Loss = MSE, Learning rate ($\nu$) = 0.5

Solution

Initialization : $F_0(x) = mean(2,4,9) = 5.0$
Iteration 1(m=1):
- 2.1: Calculate residuals ‘$r_{i1}$'
  \[\begin{aligned} r_{11} &= 2-5 = -3.0 \\ r_{21} &= 4-5 = -1.0 \\ r_{31} &= 9-5 = 4.0 \\ \end{aligned} \]
- 2.2: Fit tree($h_1$); Split at X<2150 (midpoint of 1800 and 2500)
- 2.3: Compute leaf weights $\gamma_{j1}$
  - Y-> Leaf 1: Ids 1, 2 ( $\gamma_{11}$= -2.0)
  - N-> Leaf 2: Id 3 ( $\gamma_{21}$= 4.0)
- 2.4: Update predictions ($F_1 = F_0 + 0.5 \cdot \gamma$)
  \[ \begin{aligned} F_1(x_1) &= 5.0 + 0.5(-2.0) = \mathbf{4.0}\ \\F_1(x_2) &= 5.0 + 0.5(-2.0) = \mathbf{4.0}\ \\F_1(x_3) &= 5.0 + 0.5(4.0) = \mathbf{7.0}\ \\ \end{aligned} \]
  Tree 1:

Iteration 2(m=2):
- 2.1: Calculate residuals ‘$r_{i2}$'
  \[ \begin{aligned} r_{12} &= 2-4.0 = -2.0 \\ r_{22} &= 4-4.0 = 0.0 \\ r_{32} &= 9-7.0 = 2.0 \\ \end{aligned} \]
- 2.2: Fit tree($h_2$); Split at X<1500 (midpoint of 1200 and 1800)
- 2.3: Compute leaf weights $\gamma_{j2}$
  - Y-> Leaf 1: Ids 1 ( $\gamma_{12}$= -2.0)
  - N-> Leaf 2: Id 2, 3 ( $\gamma_{22}$= 1.0)
- 2.4: Update predictions ($F_1 = F_0 + 0.5 \cdot \gamma$)
  \[ \begin{aligned} F_2(x_1) &= 4.0 + 0.5(-2.0) = \mathbf{3.0} \\F_2(x_2) &= 4.0 + 0.5(1.0) = \mathbf{4.5} \\ F_2(x_3) &= 7.0 + 0.5(1.0) = \mathbf{7.5}\ \\ \end{aligned} \]
  Tree 2:

Note: We can keep adding more trees with every iteration;
ideally, learning rate $\nu$ is small, say 0.1, so that we do not overshoot and converge slowly.

Inference

\[ F_{final}(x) = F_0 + \nu \cdot h_1(x) + \nu \cdot h_2(x) \]

Let’s predict the price of a house with area = 2000 sq. ft.

$F_{0}=5.0$
Pass though tree 1 ($h_1$): is 2000 < 2150 ? Yes, $\gamma_{11}$= -2.0
- Contribution ($h_1$) = 0.5 x (-2.0) = -1.0
Pass though tree 2 ($h_2$): is 2000 < 1500 ? No, $\gamma_{22}$ = 1.0
- Contribution($h_2$) = 0.5 x (1.0) = 0.5
Final prediction = 5.0 - 1.0 + 0.5 = 4.5

Therefore, the price of a house with area = 2000 sq. ft is Rs 4.5 crores, which is very close.
In just 2 iterations, although with higher learning rate ($\nu=0.5$), we were able to get a fairly good estimate.

Previous: GBDT Algorithm Next: Advanced GBDT Algorithms

End of Section

2.4.13 - Advanced GBDT Algorithms

Advanced GBDT Algorithms

Advanced GBDT Algorithms | XGBoost | LightGBM | CatBoost | Brief Overview

Advanced GBDT Algorithms

🔴 XGBoost (Extreme Gradient Boosting)
🔵 LightGBM (Light Gradient Boosting Machine)
⚫️ CatBoost (Categorical Boosting)

XGBoost (Extreme Gradient Boosting)

⭐️An optimized and highly efficient implementation of gradient boosting.

👉 Widely used in competitive data science (like Kaggle) due to its speed and performance.

Note: Research project developed by Tianqi Chen during his doctoral studies at the University of Washington.

LightGBM (Light Gradient Boosting Machine)

⭐️Developed by Microsoft, this framework is designed for high speed and efficiency with large datasets.

👉It grows trees leaf-wise rather than level-wise and uses Gradient-based One-Side Sampling (GOSS) to speed 🐇 up the finding of optimal split points.

CatBoost (Categorical Boosting)

⭐️Developed by Yandex, this algorithm is specifically optimized for handling ‘categorical’ features without requiring extensive preprocessing (such as, one-hot encoding).

Previous: GBDT Example Next: XgBoost

End of Section

2.4.14 - XgBoost

XgBoost

XGBoost Optimizations | Second Order Derivative| Sparsity Aware Splitting| Regularization| Explained

XGBoost (Extreme Gradient Boosting)

⭐️An optimized and highly efficient implementation of gradient boosting.

👉 Widely used in competitive data science (like Kaggle) due to its speed and performance.

Note: Research project developed by Tianqi Chen during his doctoral studies at the University of Washington.

Algorithmic Optimizations

🔵 Second order Derivative
🔵 Regularization
🔵 Sparsity-Aware Split Finding

Second order Derivative

⭐️Uses second derivative (Hessian), i.e, curvature, in addition to first derivative (gradient) to optimize the objective function more quickly and accurately than GBDT.

Let’s understand this with the problem to minimize $f(x) = x^4$, using:

Gradient descent (uses only 1st order derivative, $f'(x) = 4x^3$)
Newton’s method (uses both 1st and 2nd order derivatives $f''(x) = 12x^2$)

Regularization

Adds explicit regularization terms (L1/L2 on leaf weights and tree complexity) into the boosting objective, helping reduce over-fitting. \[ \text{Objective} = \underbrace{\sum_{i=1}^{n} L(y_i, \hat{y}_i)}_{\text{Training Loss}} + \underbrace{\gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|}_{\text{Regularization (The Tax)}} \]

Sparsity-Aware Split Finding

💡Real-world data often contains many missing values or zero-entries (sparse data).

👉 XGBoost introduces a ‘default direction’ for each node.

➡️During training, it learns the best direction (left or right) for missing values to go, making it significantly faster and more robust when dealing with sparse or missing data.

Previous: Advanced GBDT Algorithms Next: LightGBM

End of Section

2.4.15 - LightGBM

LightGBM

LightGBM Optimizations | Gradient Based One Side Sampling(GOSS) | Exclusive Feature Bundling( EFB)

LightGBM (Light Gradient Boosting Machine)

⭐️Developed by Microsoft, this framework is designed for high speed and efficiency with large datasets.

👉It grows trees leaf-wise rather than level-wise and uses Gradient-based One-Side Sampling (GOSS) to speed 🐇 up the finding of optimal split points.

Algorithmic Optimizations

🔵 Gradient-based One Side Sampling (GOSS)
🔵 Exclusive Feature Bundling (EFB)
🔵 Leaf-wise Tree Growth Strategy

Gradient-based One Side Sampling (GOSS)

❌ Traditional GBDT uses all data instances for gradient calculation, which is inefficient.
✅ GOSS focuses 🔬on instances with larger gradients (those that are less well-learned or have higher error).
🐛 Keeps all instances with large gradients but randomly samples from those with small gradients.
🦩This way, it prioritizes the most informative examples for training, significantly reducing the data used and speeding up 🐇 the process while maintaining accuracy.

Exclusive Feature Bundling (EFB)

🦀 High-dimensional data often contains many sparse, mutually exclusive features (features that never take a non-zero value simultaneously, such as, One Hot Encoding (OHE)).
💡 EFB bundles the non-overlapping features into a single, dense feature, reducing the number of features, without losing much information, saving computation.

Leaf-wise Tree Growth Strategy

❌ Traditional gradient boosting machines (like XGBoost), the trees are built level-wise (BFS-like), meaning all nodes at the current level are split before moving to the next level.
✅ LightGBM maintains a set of all potential leaves that can be split at any given time and selects the leaf (for splitting) that provides the maximum gain across the entire tree, regardless of its depth.

Note: Need mechanisms to avoid over-fitting.

Previous: XgBoost Next: CatBoost

End of Section

2.4.16 - CatBoost

CatBoost

CatBoost Optimizations | Ordered Target Encoding | Oblivious(Symmetric) Trees| Handle Missing Values

CatBoost (Categorical Boosting)

⭐️Developed by Yandex, this algorithm is specifically optimized for handling ‘categorical’ features without requiring extensive preprocessing (such as, one-hot encoding).

Algorithmic Optimizations

🔵 Ordered Target Encoding
🔵 Symmetric(Oblivious) Trees
🔵 Handling Missing Values

Ordered Target Encoding

❌ Standard target encoding can lead to target leakage, where the model uses information from the target variable during training that would not be available during inference.
👉(model ‘cheats’ by using a row’s own label to predict itself).
✅ CatBoost calculates the target statistics (average target value) for each category based only on the history of previous training examples in a random permutation of the data.

Symmetric (Oblivious) Trees

🦋 Uses symmetric decision trees by default.
👉 In symmetric trees, the same split condition is applied at each level across the entire tree structure.
🦘Does not walk down the tree using ‘if-else’ logic, instead it evaluates decision conditions to create a binary index (e.g 101) and jumps directly to that leaf 🍃 in memory 🧠.

Handling Missing Values

⚙️ CatBoost offers built-in, intelligent handling of missing values and sparse features, which often require manual preprocessing in other GBDT libraries.
💡Treats ‘NaN’ as a distinct category, reducing the need for imputation.

Previous: LightGBM Next: SVM Intro

End of Section

2.5 - Support Vector Machine

Support Vector Machine

End of Section

2.5.1 - SVM Intro

SVM Intro

Support Vector Machine (SVM) | Geometric Intuition | Decision Boundary & Margins | Explained

Geometric Intuition💡

⭐️We have two classes of points (e.g. Cats 😸vs. Dogs 🐶) that can be separated by a straight line.

👉 Many such lines exist !

💡SVM asks: “Which line is the safest?”

images/machine_learning/supervised/support_vector_machines/svm_intro/slide_01_01.tif

Highway 🛣️ Analogy

💡Think of the decision boundary as the center-line of a highway 🛣️.

SVM tries to make this highway 🛣️ as wide as possible without hitting any ‘buildings’ 🏡 (data points) on either side.

images/machine_learning/supervised/support_vector_machines/svm_intro/slide_05_01.tif

Support Vectors

The points that lie exactly on the edges of the highway are the Support Vectors.

Goal

🎯Maximize the width of the ‘street’ (the margin) to ensure the model generalizes well to unseen data.

Previous: CatBoost Next: Hard Margin SVM

End of Section

2.5.2 - Hard Margin SVM

Hard Margin SVM

Assumptions of Hard Margin SVM

Data is perfectly linearly separable, i.e, there must exist a hyperplane that can perfectly separate the data into two distinct classes without any misclassification.
No noise or outliers that fall within the margin or on the wrong side of the decision boundary. Note: Even a single outlier can prevent the algorithm from finding a valid solution or drastically affect the boundary’s position, leading to poor generalization.

Distance Between Margins

\[ \begin{aligned} \text{Decision Boundary: } \pi &= \mathbf{w^Tx} + w_0 = 0\\ \text{Upper Margin: }\pi^+ &= \mathbf{w^Tx} + w_0 = +1\\ \text{Lower Margin: }\pi^- &= \mathbf{w^Tx} + w_0 = -1\\ \end{aligned} \]

🐎 Distance(signed) of a hyperplane from origin = $\frac{-w_0}{\|w\|}$
🦣 Margin width = distance($\pi^+, \pi^-$)
= $\frac{1-w_0 - (-1 -w_0)}{\|w\|}$ = $\frac{1-\cancel{w_0} + 1 + \cancel{w_0})}{\|w\|}$
distance($\pi^+, \pi^-$) = $\frac{2}{\|w\|}$

Figure: Distance of Hyperplane from Origin

images/machine_learning/supervised/support_vector_machines/hard_margin_svm/slide_04_01.png

2.5.3 - Soft Margin SVM

Soft Margin SVM

Soft Margin Support Vector Machine (SVM) | Hinge Loss View | Explained with Example

Intuition💡

💡Imagine the margin is a fence 🌉.

Hard Margin: fence is made of steel.
Nothing can cross it.
Soft Margin: fence is made of rubber(porous).
Some points can ‘push’ into the margin or even cross over to the wrong side, but we charge them a penalty 💵 for doing so.

Issue

Distance from decision boundary:

Distance of positive labelled points must be $\ge 1$
But, distance of noise 📢 points (actually positive points) $x_1, x_2 ~\&~ x_3$ < 1

Solution

⚔️ So, we introduce a slack variable or allowance for error term, $\xi_i$ (pronounced ‘xi’) for every single data point.

\[y_i.(w^Tx_i + w_0) \ge 1 - \xi_i, ~ \forall i = 1,2,\dots, n\]

\[ \implies \xi_i \ge 1 - y_i.(w^Tx_i + w_0) \\ also, ~ \xi_i \ge 0 \]\[So, ~ \xi _{i}=\max (0,1-y_{i}\cdot (w^Tx_i + w_0))\]

Note: The above error term is also called ‘Hinge Loss’.

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_06_01.png

Hinge

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_06_02.png

Slack/Error Term Interpretation

\[\xi _{i}=\max (0,1-y_{i}\cdot f(x_{i}))\]

$\xi_i = 0$ : Correctly classified and outside (or on) the margin.
$0 < \xi_i \le 1 $ : Within the margin but on the correct side of the decision boundary.
$\xi_i > 0$: On the wrong side of the decision boundary (misclassified).

e.g.: Since, the noise 📢 point are +ve ($y_i=1$) labeled:

\[\xi _{i}=\max (0,1-f(x_{i}))\]

$x_1, (d=+0.5)$: $\xi _{i}=\max (0,1-0.5) = 0.5$
$x_2, (d=-0.5)$: $\xi _{i}=\max (0,1-(-0.5))= 1.5$
$x_3, (d=-1.5)$: $\xi _{i}=\max (0,1-(-1.5)) = 2.5$

Goal 🎯

\[\text{Maximize the width of margin: } \min_{w, w_0} \frac{1}{2} {\|w\|^2}\]

\[\text{Minimize violation or sum of slack/error terms: } \sum \xi_i\]

Optimization (Primal Formulation)

\[\min_{w, w_0, \xi} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n \xi_i}_{\text{Error Penalty}}\]

Subject to constraints:

$y_i(w^T x_i + b) \geq 1 - \xi_i$: The ‘softened’ margin constraint.
$\xi_i \geq 0$: Slack/Error cannot be negative.

Note: We use a hyper-parameter ‘C’ to control the trade-off.

Hyper-Parameter ‘C'

Large ‘C’: Over-Fitting;
Misclassifications are expensive 💰.
Model tries to keep the errors as low as possible.
Small ‘C’: Under-Fitting;
Margin width is more important than individual errors.
Model will ignore outliers/noise to get a ‘cleaner’(wider) boundary.

Hinge Loss View

\[ \text{Hinge loss: } \xi _{i}=\max (0,1-y_{i}\cdot (w^Tx_i + w_0))\]\[ \min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \text{HingeLoss}(y_i, f(x_i))\]

Note: SVM is just L2-Regularized Hinge Loss minimization, as Logistic Regression minimizes Log-Loss.

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_14_01.tif

Previous: Hard Margin SVM Next: Primal Dual Equivalence

End of Section

2.5.4 - Primal Dual Equivalence

Primal Dual Equivalence

Primal Dual Equivalence of Support Vector Machine (SVM) | Lagrangian | KKT Conditions | Explained

Intuition💡

The Primal form is intuitive but computationally expensive in high dimensions.

➡️ Number of new features = ${d+p \choose p}$, grows roughly as $O(d^{p})$

d= number of dimensions (features)
p = degree of polynomial

Note: The Dual form is what enables the Kernel Trick 🪄.

Optimization (Primal Formulation)

\[\min_{w, w_0, \xi} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n \xi_i}_{\text{Error Penalty}}\]

Subject to constraints:

$y_i(w^T x_i + b) \geq 1 - \xi_i$: The ‘softened’ margin constraint.
$\xi_i \geq 0$: Slack/Error cannot be negative.

Lagrangian

⭐️ We start with the Soft-Margin Primal objective and incorporate its constraints using Lagrange Multipliers $(\alpha_i, \mu_i \geq 0)$

\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]

Note: Inequality conditions must be $\le 0$.

Lagrangian Objective

👉The Lagrangian function has two competing objectives:

\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]

Minimization: We want to minimize $L(w, w_0, \xi, \alpha, \mu)$ w.r.t primal variables ($w, w_0, \xi_i $ ) to find the hyperplane with the largest possible margin.
Maximization: We want to maximize $L(w, w_0, \xi, \alpha, \mu)$ w.r.t dual variables ($\alpha_i, \mu_i$ ) to ensure all training constraints are satisfied.

Note: A point that is simultaneously a minimum for one set of variables and a maximum for another is, by definition, a saddle point.

Karush–Kuhn–Tucker (KKT) Conditions

👉To find the Dual, we find the saddle point by taking partial derivatives with respect to the primal variables $(w, w_0, \xi)$ and equating them to 0.

\[\frac{\partial L}{\partial w} = 0 \implies \mathbf{w = \sum_{i=1}^n \alpha_i y_i x_i}\]

\[\frac{\partial L}{\partial w_0} = 0 \implies \mathbf{\sum_{i=1}^n \alpha_i y_i = 0}\]

\[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \mu_i = 0 \implies \mathbf{0 \leq \alpha_i \leq C}\]

Primal Expansion

\[\frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(\mathbf{w}^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[ \begin{aligned} = \frac{1}{2} \left(\sum_i \alpha_i y_i x_i \right)^T . \left(\sum_j \alpha_j y_j x_j \right) + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n -\alpha_i y_i \left( \sum_{j=1}^n \alpha_j y_j x_j \right)^T x_i \\ -w_0 \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i(1-\xi_i) + \sum_{i=1}^n \mu_i. (-\xi_i) \\ \end{aligned} \]\[ = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \underbrace{-w_0\sum_{i=1}^n \alpha_i y_i}_{=0, K.K.T} + \underbrace{\sum_{i=1}^n \xi_i (C -\alpha_i -\mu_i)}_{=0, K.K.T} \]\[ = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \]

(Wolfe) ‘Dual' Optimization

\[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[ = \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

subject to: $0 \leq \alpha_i \leq C$ and $\sum \alpha_i y_i = 0$

$\alpha_i$= 0, for non support vectors (correct side)
$0 < \alpha_i < C$, for free support vectors (exactly on the margin)
$\alpha_i = C$, for bounded support vectors (misclassified or inside margin)

Note: Sequential Minimal Optimization (SMO) algorithm is used to find optimal $\alpha_i$ values.

Inference Time ⏰

🎯To classify unseen point $x_q$ : $f(x_q) = \text{sign}(w^T x_q + w_0)$

✅ From the KKT stationarity condition, we know: $\mathbf{w} = \sum_{i=1}^n \alpha_i y_i x_i$

👉 Substituting this into the equation:

\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]

Note: Even if you have 1 million training points, if only 500 are support vectors, the summation only runs for 500 terms.
All other points have $\alpha_i = 0$ and vanish.

Previous: Soft Margin SVM Next: Kernel Trick

End of Section

2.5.5 - Kernel Trick

Kernel Trick

Kernel Trick in Support Vector Machines | Quadratic Kernel Function Example

Intuition 💡

👉If our data is not linearly separable in its original space , we can map it to a higher-dimensional feature space (where D»d) using a transformation function .

Kernel Trick 🪄

Bridge between Dual formulation and the geometry of high dimensional spaces.
It is a way to manipulate inner product spaces without the computational cost 💰 of explicit transformation.

(Wolfe) ‘Dual' Optimization

\[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[= \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

subject to: $0 \leq \alpha_i \leq C$ and $\sum \alpha_i y_i = 0$

Observation

💡Actual values of the input vectors $x_i$ and $x_j$ never appear in isolation; only appear as inner product.

\[ \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]

👉The ‘shape’ of the decision boundary is entirely determined by how similar points are to one another, not by their absolute coordinates.

Non-Linear Separation

If our data is not linearly separable in its original space $\mathbb{R}^d$, we can map it to a higher-dimensional $\mathbb{R}^D$ feature space (where D»d) using a transformation function $\phi(x)$ .

Problem 🤔

If we choose a very high-dimensional mapping (e.g. $D = 10^6$ or $D = \infty$ ), calculating and then performing the dot product $\phi(x_i)^T \phi(x_j)$ becomes computationally impossible or extremely slow.

Kernel Trick 👻

So we define a function called ‘Kernel Function’.

The ‘Kernel Trick’ 🪄 is an optimization that replaces the dot product of a high-dimensional mapping with a function of the dot product in the original space.

\[(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\]

💡How it works ?

Instead of mapping (to higher dimension) $x_i \rightarrow \phi(x_i)$, $x_j \rightarrow \phi(x_j)$, and calculating the dot product.
We simply compute $K(x_i, x_j)$ directly in the original input space.

👉The ‘Kernel Function’ gives the similar mathematical equivalent of mapping it to a higher dimensions and taking the dot product.

Note: For $K(x_i, x_j)$ to be a valid kernel, it must satisfy Mercer’s Condition.

Polynomial (Quadratic) Kernel

Below is an example for a quadratic Kernel function in 2D that is equivalent to mapping the vectors to 3D and taking a dot product in 3D.

\[K(x, z) = (x^T z)^2\]

\[(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2\]

The output of above quadratic kernel function is equivalent to the explicit dot product of two vectors in 3D:

\[\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T\]

\[\phi(z) = [z_1^2, \sqrt{2}z_1z_2, z_2^2]^T\]\[\phi (x)\cdot \phi (z)=x_{1}^{2}z_{1}^{2}+2x_{1}x_{2}z_{1}z_{2}+x_{2}^{2}z_{2}^{2}\]

Advantages ⛳️

Computational Efficiency: Avoids the ‘combinatorial blowup’ 💥 of generating thousands of interaction features manually.
Memory Savings: No need to store 💾 or process high-dimensional coordinates, only the scalar result of the kernel function.

Why Kernel SVMs are Not so Popular ?

Designing special purpose domain specific kernel is very hard.
- Basically, we are trying to replace feature engineering with kernel design.
- Note: Deep learning does feature engineering implicitly for us.
Runtime complexity depends on number of support vectors, whose count is not easy to control.

Note: Runtime Time Complexity ⏰ = $O(n_{SV}\times d)$ , whereas linear SVM,$O(d)$ .

Previous: Primal Dual Equivalence Next: RBF Kernel

End of Section

2.5.6 - RBF Kernel

RBF Kernel

Radial Basis Function (RBF) Kernel | Support Vector Machine | Infinite Dimension Mapping | Explained

Intuition 💡

Unlike the polynomial kernel, which looks at global 🌎 interactions, the RBF kernel acts like a similarity measure.
If ‘x’ and ‘z’ are identical $K(x,z)=1$.
- As they move further apart in Euclidean space, the value decays exponentially towards 0.

Radial Basis Function (RBF) Kernel

\[K(x, z) = \exp\left(-\gamma. \|x - z\|^2\right)\]

\[\text{where, }\gamma = \frac{1}{2\sigma^2}\]

If $x \approx z$ (very close), $K(x,z)=1$
If ‘x’, ‘z’ are far apart, $K(x,z) \approx 0$

Note: Kernel function is the measure of similarity or closeness.

Infinite Dimension Mapping

Say $\sigma = 1$, then Euclidean distance: $\|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2x^Tz$

\[K(x, z) = \exp(-( \|x\|^2 + \|z\|^2 - 2x^T z )) = \exp(-\|x\|^2) \exp(-\|z\|^2) \exp(2x^T z)\]

The Taylor expansion for $e^u= \sum_{n=0}^{\infty} \frac{u^n}{n!}$

\[\exp(2x^T z) = \sum_{n=0}^{\infty} \frac{(2x^T z)^n}{n!} = 1 + \frac{2x^T z}{1!} + \frac{(2x^T z)^2}{2!} + \dots + \frac{(2x^T z)^n}{n!} + \dots\]

\[K(x, z) = e^{-\|x\|^2} e^{-\|z\|^2} \left( \sum_{n=0}^{\infty} \frac{2^n (x^T z)^n}{n!} \right)\]

💡If we expand each $(x^T z)^n$ term, it represents the dot product of all possible n-th order polynomial features.

👉Thus, the implicit feature map is:

\[\phi(x) = e^{-\|x\|^2} \left[ 1, \sqrt{\frac{2}{1!}}x, \sqrt{\frac{2^2}{2!}}(x \otimes x), \dots, \sqrt{\frac{2^n}{n!}}(\underbrace{x \otimes \dots \otimes x}_{n \text{ times}}), \dots \right]^T\]

Important: The tensor product $x\otimes x$ creates a vector (or matrix) containing all combinations of the features. e.g. if $x=[x_{1},x_{2}]$, then $x\otimes x=[x_{1}^{2},x_{1}x_{2},x_{2}x_{1},x_{2}^{2}]$

Note: Because the Taylor series has an infinite number of terms, feature map has an infinite number of dimensions.

Bias-Variance Trade-Off ⚔️

High Gamma(low $\sigma$): Over-Fitting
- Makes the kernel so ‘peaky’ that each support vector only influences its immediate neighborhood.
- Decision boundary becomes highly irregular, ‘wrapping’ tightly around individual data points to ensure they are classified correctly.
Low Gamma(high $\sigma$): Under-Fitting
- The Gaussian bumps are wide and flat.
- Decision boundary becomes very smooth, essentially behaving more like a linear or low-degree polynomial classifier.

Previous: Kernel Trick Next: Support Vector Regression

End of Section

2.5.7 - Support Vector Regression

Support Vector Regression

Support Vector for Regression (SVR) | Support Vector Machines | Explained with Example

Intuition 💡

👉Imagine a ‘tube’ of radius $\epsilon$ surrounding the regression line.

Points inside the tube are considered ‘correct’ and incur zero penalty.
Points outside the tube are penalized based on their distance from the tube’s boundary.

Ignore Errors

👉SVR ignores errors as long as they are within a certain distance ($\epsilon$) from the true value.

🎯This makes SVR inherently robust to noise and outliers, as it does not try to fit every single point perfectly, only those that ‘matter’ to the structure of the data.

Note: Standard regression (like OLS) tries to minimize the squared error between the prediction and every data point.

Optimization (Primal Formulation)

\[\min_{w, w_0, \xi, \xi^*} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n \xi_i, \xi_i^*}_{\text{Error Penalty}}\]

Subject to constraints:

$y_i - (w^T x_i + w_0) \leq \epsilon + \xi_i$: (Upper boundary)
$(w^T x_i + w_0) - y_i \leq \epsilon + \xi_i^*$: (Lower boundary)
$\xi_i, \xi_i^* \geq 0$: (Slack/Error cannot be negative)

Terms:

Epsilon($\epsilon$): The width of the tube. Increasing ‘$\epsilon$’ results in fewer support vectors and a smoother (flatter) model.
Slack Variables ($\xi_i, \xi_i^*$): How far a point lies outside the upper and lower boundaries of the tube.
C: The trade-off between the flatness of the model and the extent to which deviations larger than $\epsilon$ are tolerated.

Loss Function

SVR uses a specific loss function that is 0 when the error<’$\epsilon$’.

\[L_\epsilon(y, f(x)) = \max(0, |y - f(x)| - \epsilon)\]

The solution becomes sparse, because the loss is zero for points inside the tube.
Only the Support Vectors, i.e, points outside or on the boundary of the tube have non-zero Lagrange multipliers ($\alpha_i$).

Note: $\epsilon=0.1$ default value in scikit-learn.

(Wolfe) ‘Dual' Optimization

\[\max_{\alpha, \alpha^*} \sum_{i=1}^n y_i (\alpha_i - \alpha_i^*) - \epsilon \sum_{i=1}^n (\alpha_i + \alpha_i^*) - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i - \alpha_i^*) (\alpha_j - \alpha_j^*) \mathbf{(x_i^T x_j)}\]

Subject to:

$\sum_{i=1}^n (\alpha_i - \alpha_i^*) = 0$
$0 \leq \alpha_i, \alpha_i^* \leq C$

$\alpha_i = \alpha_i^* = 0$: point is inside the tube.
$|\alpha_i - \alpha_i^*| > 0$ : support vectors; points on or outside the tube.

Note: $\alpha_i , \alpha_i^* $ cannot both be non-zero for the same point; a point cannot be simultaneously above and below the tube.

Inference & Kernel Trick

\[f(z) = \sum_{i \in SV} (\alpha_i - \alpha_i^*) \mathbf{K(x_i, z)} + w_0\]

👉 For non-linear SVR we replace dot product $x_i^T x_j$ with kernel function $K(x_i, x_j)$.
✅ Model needs to store only support vectors, i.e, points where $|\alpha_i - \alpha_i^*| > 0$.
⭐️$\xi_i =0 $ for a point that lies exactly on the boundary, so we can use that to calculate the bias ($w_0$):

\[w_0 = y_i - \sum_{j \in SV} (\alpha_j - \alpha_j^*) K(x_j, x_i) - \epsilon\]

\[ \text{Since, } y_i - (w^T x_i + w_0) \leq \epsilon + \xi_i\]

Previous: RBF Kernel Next: Naive Bayes Intro

End of Section

2.6 - Naive Bayes'

Naive Bayes'

End of Section

2.6.1 - Naive Bayes Intro

Naive Bayes Intro

Naive Bayes Introduction | Bayes' Theorem | Naive Assumption | Explained with Example

Naive Bayes

📘Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.

Use Case

📌Let’s understand Naive Bayes through an Email/Text classification example.

Number of words in an email is not fixed.
Remove all stop words, such as, the, is , are, if, etc.
Keep only relevant words, i.e, $w_1, w_2, \dots , w_d$ words.
We want to do a binary classification - Spam/Not Spam.

Bayes' Theorem

Let’s revise Bayes’ theorem first:

\[P(S|W)=\frac{P(W|S)\times P(S)}{P(W)}\]

$P(S|W)$ is the posterior probability: the probability of the email being spam, given the words inside it.
$P(W|S)$ is the likelihood: how likely is this email’s word pattern if it were spam?
$P(S)$ is the prior probability: The ‘base rate’ of spam.
- If our dataset has 10,000 emails and 2,000 are spam, then $P(S)$=0.2.
$P(W)$ is the prior probability of the predictor (evidence): total probability of seeing these words across all emails.
- 👉Since this is the same for both classes, we treat it as a constant and ignore it during comparison.

Challenge 🤺

👉Likelihood = $P(W|S)$ = $P(w_1, w_2, \dots w_d | S)$

➡️ For computing the joint distribution of say d=1000 words, we need to learn from possible $2^{1000}$ combinations.

$2^{1000}$ > the atoms in the observable 🔭 universe 🌌.
We will never have enough training data to see every possible combination of words even once.
- Most combinations would have a count of zero.

🦉So, how do we solve this ?

Naive Assumption

💡The ‘Naive’ assumption is a ‘Conditional Independence’ assumption, i.e, we assume each word appears independently of the others, given the class Spam/Not Spam.

e.g. In a spam email, the likelihood of ‘Free’ and ‘Money’ 💵 appearing are treated as independent events, even though they usually appear together.

Note: The conditional independence assumption makes the probability calculations easier, i.e, the joint probability simply becomes the product of individual probabilities, conditional on the label.

\[P(W|S) = P(w_1|S)\times P(w_2|S)\times \dots P(w_d|S) \]\[\implies P(S|W)=\frac{P(w_1|S)\times P(w_2|S)\times \dots P(w_d|S)\times P(S)}{P(W)}\]

\[\implies P(S|W)=\frac{\prod_{i=1}^d P(w_i|S)\times P(S)}{P(W)}\]

\[\text{Similarly, } P(NS|W)=\frac{\prod_{i=1}^d P(w_i|NS)\times P(NS)}{P(W)}\]

We can generalize it for any number of class labels ‘y’:

\[\implies P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y) \quad \text{ where, y = class label} \]

\[ P(w_i|y) = \frac{count(w_i ~in~ y)}{\text{total words in class y}} \]

Note: We compute the probabilities for both Spam/Not Spam and assign the final label to email, depending upon which probability is higher.

Performance 🏇

👉Space Complexity: O(d*c)

👉Time Complexity:

Training: O(n*d*c)
Inference: O(d*c)

Where,

d = number of features/dimensions
c = number of classes
n = number of training examples

Previous: Support Vector Regression Next: Naive Bayes Issues

End of Section

2.6.2 - Naive Bayes Issues

Naive Bayes Issues

Naive Bayes' Issues | Laplace Smoothing | Underflow | Explained with Examples

Naive Bayes

⭐️Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.

\[P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y)\]

\[P(w_i|y) = \frac{count(w_i ~in~ y)}{\text{total words in class y}}\]

Problem # 1

🦀What if at runtime we encounter a word that was never seen during training ?

e.g. A word ‘crypto’ appears in the test email that was not present in training emails; P(‘crypto’|S) =0.

👉This will force the entire product to zero.

\[P(w_i|S) = \frac{\text{Total count of } w_i \text{ in all Spam emails}}{\text{Total count of all words in all Spam emails}}\]

Laplace Smoothing

💡Add ‘Laplace smoothing’ to all likelihoods, both during training and test time, so that the probability becomes non-zero.

\[P(x_{i}|y)=\frac{count(x_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]

$count(x_{i},y)$ : number of times word appears in documents of class ‘y'.
$count(y)$: The total count of all words in documents of class ‘y'.
$|V|$(or $N_{features}$):Vocabulary size or total number of unique possible words.

Let’s understand this by the examples below:

\[P(w_{i}|S)=\frac{count(w_{i},S)+\alpha }{count(S)+\alpha \cdot |V|}\]

$count(w_{i},S) = 0 $, $count(S) = 100$, $|V|$(or $N_{features}) =2, \alpha = 1$ \[P(w_{i}|S)=\frac{ 0+1 }{100 +1 \cdot 2} = \frac{1}{102}\]
$count(w_{i},S) = 0 $, $count(S) = 100$, $|V|$(or $N_{features}) =2, \alpha = 10,000$ \[P(w_{i}|S)=\frac{ 0+10,000 }{100 +10,000 \cdot 2} = \frac{10,000}{20,100} \approx \frac{1}{2}\]

Note: High alpha value may lead to under-fitting; $\alpha = 1$ recommended.

Problem # 2

🦀What happens if the number of words ‘d’ is very large ?

👉Multiplying 500 times will result in a number so small a computer 💻 cannot store it (underflow).

Note: Computers have a limit to store floating point numbers, e.g., 32 bit: $1.175 x 10^{-38}$

Logarithm

💡Take ‘Logarithm’ that will convert the product to sum.

\[P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y)\]

\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]

Note: In the next section we will solve a problem covering all the concepts discussed in this section.

Previous: Naive Bayes Intro Next: Naive Bayes Example

End of Section

2.6.3 - Naive Bayes Example

Naive Bayes Example

Naive Bayes' Algorithm | Explained with Example

Naive Bayes

⭐️Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.

\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]

\[P(w_{i}|y)=\frac{count(w_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]

Email Classification Problem

Let’s solve an email classification problem, below we have list of emails (tokenized) and labelled as Spam/Not Spam for training.

images/machine_learning/supervised/naive_bayes/naive_bayes_example/email_classification.png

Training Phase

🏛️Class Priors:

P(Spam) = 2/4 =0.5
P(Not Spam) = 2/4 = 0.5

📕 Vocabulary = { Free, Money, Inside, Scan, Win, Cash, Click, Link, Catch, Up Today, Noon, Project, Meeting }

|V| = Total unique word count = 14

🧮 Class count:

count(Spam) = 9
count(Not Spam) = 7

✅ Laplace smoothing: $\alpha = 1$

Likelihood of 'free'

\[P(w_{i}|y)=\frac{count(w_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]

P(‘free’| Spam) = $\frac{2+1}{9+14} = \frac{3}{23} = 0.13$
P(‘free’| Not Spam) = $\frac{0+1}{7+14} = \frac{1}{21} = 0.047$

Inference Time

👉Say a new email 📧 arrives - “Free money today”; lets classify it as Spam/Not Spam.

Spam:

P(‘free’| Spam) = $\frac{2+1}{9+14} = \frac{3}{23} = 0.13$
P(‘money’| Spam) = $\frac{1+1}{9+14} = \frac{2}{23} = 0.087$
P(‘today’| Spam) = $\frac{0+1}{9+14} = \frac{1}{23} = 0.043$

Not Spam:

P(‘free’| Not Spam) = $\frac{0+1}{7+14} = \frac{1}{21} = 0.047$
P(‘money’| Not Spam) = $\frac{0+1}{7+14} = \frac{1}{21} = 0.047$
P(‘today’| Not Spam) = $\frac{1+1}{7+14} = \frac{2}{21} = 0.095$

Final Score 🏏

\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]

Score(Spam) = log(P(Spam)) + log(P(‘free’|S)) + log(P(‘money’|S)) + log(P(‘today’|S))
= log(0.5) + log(0.13) + log(0.087) + log(0.043) = -0.301 -0.886 -1.06 -1.366 = -3.614
Score(Not Spam) = log(P(Not Spam)) + log(P(‘free’|NS)) + log(P(‘money’|NS)) + log(P(‘today’|NS))
= log(0.5) + log(0.047) + log(0.047) + log(0.095) = -0.301 -1.328 -1.328 -1.022 = -3.979

👉Since, Score(Spam) (-3.614 )> Score(Not Spam) (-3.979) , the model chooses ‘Spam’ as the label for the email.

Previous: Naive Bayes Issues Next: K Means

End of Section

3 - Unsupervised Learning

Unsupervised Machine Learning

Unsupervised Machine Learning | All Videos

End of Section

3.1 - K Means

K Means Clustering

End of Section

3.1.1 - K Means

K Means Clustering

K-Means Clustering | Variance Decomposition | Why NP Hard Problem ? | Explained with Example

Unsupervised Learning

🌍In real-world systems, labeled data is scarce and expensive 💰.

💡Unsupervised learning discovers inherent structure without human annotation.

👉Clustering answers: “Given a set of points, what natural groupings exist?”

Real-World 🌍 Motivations for Clustering

Customer Segmentation: Group users by behavior without predefined categories.
Image Compression: Reduce color palette by clustering pixel colors.
Anomaly Detection: Points far from any cluster are outliers.
Data Exploration: Understand structure before building supervised models.
Preprocessing: Create features from cluster assignments.

Key Insight 💡

💡Clustering assumes that ‘similar’ points should be grouped together.

👉But what is ‘similar’? This assumption drives everything.

Problem Statement ✍️

Given:

Dataset X = {x₁, x₂, …, xₙ} where xᵢ ∈ ℝᵈ.
Desired number of clusters ‘k'.

Find:

Cluster assignments C = {C₁, C₂, …, Cₖ}.
Such that points within clusters are ‘similar’.
And points across clusters are ‘dissimilar’.

Optimization Perspective

This is fundamentally an optimization problem, i.e, find parameters such that the value is minimum/maximum. We need:

An objective function
- what makes a clustering ‘good’?
An algorithm to optimize it
- how do we find good clusters?
Evaluation metrics
- how do we measure quality?

Optimization

Objective function:
👉Minimize the within-cluster sum of squares (WCSS).

\[J(C, \mu) = \sum_{j=1}^k \sum_{x_i \in C_j} \underbrace{\|x_i -\mu_j\|^2}_{\text{distance from mean}} \]

Where:
C = {C₁, …, Cₖ} are cluster assignments.
μⱼ is the centroid (mean) of cluster Cₖ.
||·||² is squared Euclidean distance.

Note: Every point belongs to one and only one cluster.

Variance Decomposition

💡Within-Cluster Sum of Squares (WCSS) is nothing but variance.

⭐️ Total Variance = Within-Cluster Variance + Between-Cluster Variance

👉K-Means minimizes within-Cluster variance, which implicitly maximizes between-cluster separation.

Geometric Interpretation:

Each point is ‘pulled’ toward its cluster center.
The objective measures total squared distance of all points to their centers.
Lower J(C, μ) means tighter, more compact clusters.

Note: K-Means works best when clusters are roughly spherical, similarly sized, and well-separated.

images/machine_learning/unsupervised/k_means/k_means_clustering/slide_11_01.png

Combinatorial Explosion 💥

⭐️The problem requires partitioning ’n’ observations into ‘k’ distinct, non-overlapping clusters, which is given by the Stirling number of the second kind, which grows at a rate roughly equal to $k^n/k!$.

\[S(n,k)=\frac{1}{k!}\sum _{j=0}^{k}(-1)^{k-j}{k \choose j}j^{n}\]

\[S(100,2)=2^{100-1}-1=2^{99}-1\]

\[2^{99}\approx 6.338\times 10^{29}\]

👉This large number of possible combinations makes the problem NP-Hard.

🦉The k-means optimization problem is NP-hard because it belongs to a class of problems for which no efficient (polynomial-time ⏰) algorithm is known to exist.

Previous: Naive Bayes Example Next: Lloyds Algorithm

End of Section

3.1.2 - Lloyds Algorithm

Lloyds Algorithm

Lloyd’s Algorithm | K Means | Explained with Example

Idea 💡

Since, we cannot enumerate all partitions (i.e, partitioning ’n’ observations into ‘k’ distinct, non-overlapping clusters), Lloyd’s algorithm provides a local search heuristic (approximate algorithm).

Lloyd's Algorithm ⚙️

Iterative method for partitioning ’n’ data points into ‘k’ groups by repeatedly assigning data points to the nearest centroid (mean) and then recalculating centroids until assignments stabilize, aiming to minimize within-cluster variance.

📥Input: X = {x₁, …, xₙ}, ‘k’ (number of clusters)

📤Output: ‘C’ (clusters), ‘μ’ (centroids)

👉Steps:

Initialize: Randomly choose ‘k’ cluster centroids μ₁, …, μₖ.
Repeat until convergence, i.e, until cluster assignments and centroids no longer change significantly.

a) Assignment: Assign each data point to the cluster whose centroid is closest (usually using Euclidean distance).
- For each point xᵢ: cᵢ = argminⱼ ||xᵢ - μⱼ||²
b) Update: Recalculate the centroid (mean) of each cluster.
- For each cluster j: μⱼ = (1/|Cⱼ|) Σₓᵢ∈Cⱼ xᵢ

Issues🚨

Initialization sensitive, different initialization may lead to different clusters.
Tries to make each cluster of same size that may not be the case in real world.
Tries to make each cluster with same density(variance)
Does not work well with non-globular(spherical) data.

👉See how 2 different runs of K-Means algorithm gives totally different clusters.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_01.png

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_02.png

👉Also, K-Means does not work well with non-spherical clusters, or clusters with different densities and sizes.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_07_01.png

Solutions

✅ Do multiple runs 🏃‍♂️and choose the clustering with minimum error.
✅ Do not select initial points randomly, but some logic, such as, K-means++ algorithm.
✅ Use hierarchical clustering or density based clustering DBSCAN.

Previous: K Means Next: K Means++

End of Section

3.1.3 - K Means++

K Means++ Algorithm

K Means++ Algorithm | Issues with Random Initialization | Explained with Example

Issues with Random Initialization

If two initial centroids belong to the same natural cluster, the algorithm will likely split that natural cluster in half and be forced to merge two other distinct clusters elsewhere to compensate.
Inconsistent; different runs may lead to different clusters.
Slow convergence; Centroids may need to travel much farther across the feature space, requiring more iterations.

👉Example for different K-Means algorithm runs give different clusters

images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_01.png

images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_02.png

K-Means++ Algorithm

💡Addresses the issue due to random initialization by aiming to spread out the initial centroids across the data points.

Steps:

Select the first centroid: Choose one data point randomly from the dataset to be the first centroid.
Calculate distances: For every data point x not yet selected as a centroid, calculate the distance, D(x), between x and the nearest centroid chosen so far.
Select the next centroid: Choose the next centroid from the remaining data points with a probability proportional to D(x)^2.
This makes it more likely that a point far from existing centroids is selected, ensuring the initial centroids are well-dispersed.
Repeat: Repeat steps 2 and 3 until ‘k’ number of centroids are selected.
Run standard K-means: Once the initial centroids are chosen, the standard k-means algorithm proceeds with assigning data points to the nearest centroid and iteratively updating the centroids until convergence.

Problem 🚨

🦀 If our data is extremely noisy (outliers), the probabilistic logic ($\propto D(x)^2$) might accidentally pick an outlier as a cluster center.

Solution ✅

Do robust preprocessing to remove outliers or use K-Medoids algorithm.

Previous: Lloyds Algorithm Next: K Medoid

End of Section

3.1.4 - K Medoid

K Medoid

K-Medoids Algorithm | Partitioning Around Medoids (PAM) | Explained with Example

Issues with K-Means

In K-Means, the centroid is the arithmetic mean of the cluster. The mean is very sensitive to outliers.
Not interpretable; centroid is the mean of cluster data points and may not be an actual data point, hence not representative.

Medoid

⭐️Medoid is a specific data point from a dataset that acts as the ‘center’ or most representative member of its cluster.

👉It is defined as the object within a cluster whose average dissimilarity (distance) to all other members in that same cluster is the smallest.

K-Medoids (PAM) Algorithm

💡Selects actual data points from the dataset as cluster representatives, called medoids (most centrally located).

👉a.k.a Partitioning Around Medoids(PAM).

Steps:

Initialization: Select ‘k’ data points from the dataset as the initial medoids using K-Means++ algorithm.
Assignment: Calculate the distance (e.g., Euclidean or Manhattan) from each non-medoid point to all medoids and assign each point to the cluster of its nearest medoid.
Update (Swap):

For each cluster, swap current medoid with a non-medoid point from the same cluster.
For each swap, calculate the total cost 💰(sum of distances from medoid).
Pick the medoid with minimum cost 💰.

Repeat🔁: Repeat the assignment and update steps until (convergence), i.e, medoids no longer change or a maximum number of iterations is reached.

Note: Kind of brute-force algorithm, computationally expensive for large dataset.

Advantages

✅ Robust to Outliers: Since medoids are actual data points rather than averages, extreme values (outliers) do not skew the center of the cluster as they do in K-Means.
✅ Flexible Distance Metrics: It can work with any dissimilarity measure (Manhattan, Cosine similarity), making it ideal for categorical or non-Euclidean data.
✅ Interpretable Results: Cluster centers are real observations (e.g., a ‘typical’ customer profile), which makes the output easier to explain to stakeholders.

Previous: K Means++ Next: Clustering Quality Metrics

End of Section

3.1.5 - Clustering Quality Metrics

Clustering Quality Metrics

Clustering Quality Metrics | Elbow Method | Dunn Index | Explained with Example

How to Evaluate Quality of Clustering?

👉 Elbow Method: Quickest to compute; good for initial EDA (Exploratory Data Analysis).
👉 Dunn Index: Focuses on the ‘gap’ between the closest clusters.
👉 Silhouette Score: Balances compactness and separation.
👉 Domain specific knowledge and system constraints.

Elbow Method

⭐️Heuristic used to determine the optimal number of clusters (k) for clustering by visualizing how the quality of clustering improves as ‘k’ increases.

🎯The goal is to find a value of ‘k’ where adding more clusters provides a diminishing return in terms of variance reduction.

images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_02_01.png

Dunn Index [0, $\infty$)

⭐️Clustering quality evaluation metric that measures: separation (between clusters) and compactness (within clusters)

Note: A higher Dunn Index value indicates better clustering, meaning clusters are well-separated from each other and compact.

👉Dunn Index Formula:

\[DI = \frac{\text{Minimum Inter-Cluster Distance(between different clusters)}}{\text{Maximum Intra-Cluster Distance(within a cluster)}}\]

\[DI = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}\]

images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_06_01.png

👉Let’s understand the terms in the above formula:

$\delta(C_i, C_j)$ (Inter-Cluster Distance):
- Measures how ‘far apart’ the clusters are.
- Distance between the two closest points of different clusters (Single-Linkage distance). \[\delta(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)\]
$\Delta(C_l)$ (Intra-Cluster Diameter):
- Measures how ‘spread out’ a cluster is.
- Distance between the two furthest points within the same cluster (Complete-Linkage distance). \[\Delta(C_l) = \max_{x, y \in C_l} d(x, y)\]

Measure of Closeness

Single Linkage (MIN): Uses the minimum distance between any two points in different clusters.
Complete Linkage (MAX): Uses the maximum distance between any two points in same cluster.

Previous: K Medoid Next: Silhouette Score

End of Section

3.1.6 - Silhouette Score

Silhouette Score

Silhouette Score | Dunn Index Vs Silhouette Score | Explained with Example

How to Evaluate Quality of Clustering?

✅ Elbow Method: Quickest to compute; good for initial EDA.
✅ Dunn Index: Focuses on the ‘gap’ between the closest clusters.
——- We have seen the above 2 methods in the previous section ———-
👉 Silhouette Score: Balances compactness and separation.
👉 Domain specific knowledge and system constraints.

Silhouette Score [-1, 1]

⭐️Clustering quality evaluation metric that measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

Note: Higher scores (closer to 1) indicate better-defined, distinct clusters, while scores near 0 suggest overlapping clusters, and negative scores mean points might be in the wrong cluster.

Silhouette Score Formula

Silhouette score for point ‘i’ is the difference between separation b(i) and cohesion a(i), normalized by the larger of the two.

\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]

Note: The Global Silhouette Score is simply the mean of s(i) for all points in the dataset.

👉Example for Silhouette Score:

images/machine_learning/unsupervised/k_means/silhouette_score/slide_04_01.png

👉Example for Silhouette Score of 0(Border Point) and negative(Wrong Cluster).

images/machine_learning/unsupervised/k_means/silhouette_score/slide_05_01.png

🦉Now let’s understand the terms in Silhouette Score in detail.

Cohesion a(i)

Average distance between point ‘i’ and all other points in the same cluster.

\[a(i) = \frac{1}{|C_A| - 1} \sum_{j \in C_A, i \neq j} d(i, j)\]

Note: Lower a(i) means the point is well-matched to its own cluster.

Separation b(i)

Average distance between point ‘i’ and all points in the nearest neighboring cluster (the cluster that ‘i’ is not a part of, but is closest to).

\[b(i) = \min_{C_B \neq C_A} \frac{1}{|C_B|} \sum_{j \in C_B} d(i, j)\]

Note: Higher b(i) means the point is very far from the next closest cluster.

Silhouette Plot

⭐️A silhouette plot is a graphical tool used to evaluate the quality of clustering algorithms (like K-Means), showing how well each data point fits within its cluster.

👉Each bar gives the average silhouette score of the points assigned to that cluster.

images/machine_learning/unsupervised/k_means/silhouette_score/slide_09_01.png

Geometric Interpretation

⛳️ Like K-Means, the Silhouette Score (when using Euclidean distance) assumes convex clusters.
🌘 If we use it on ‘Moon’ shaped clusters, it will give a low score even if the clusters are perfectly separated, because the ‘average distance’ to a neighbor might be small due to the curvature of the manifold.

Silhouette Score Vs Dunn Index

Choose Silhouette Score if:
✅ Need a human-interpretable metric to present to stakeholders.
✅ Dealing with real-world noise and overlapping ‘fuzzy’ boundaries.
✅ Want to see which specific clusters are weak (using the plot).

Choose Dunn Index if:
✅ Performing ‘Hard Clustering’ where separation is a safety or business requirement.
✅ Data is pre-cleaned of outliers (e.g., in a curated embedding space).
✅ Need to compare different clustering algorithms (e.g., K-Means vs. DBSCAN) on a high-integrity task.

Previous: Clustering Quality Metrics Next: Hierarchical Clustering

End of Section

3.2 - Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering | Agglomerative Nesting | Explained with Example

End of Section

3.2.1 - Hierarchical Clustering

Hierarchical Clustering

Issues with K-Means

🤷 We might not know in advance the number of distinct clusters ‘k’ in the dataset.
🕸️ Also, sometimes the dataset may contain a nested structure or some inherent hierarchy, such as, file system, organizational chart, biological lineages, etc.

Hierarchical Clustering

⭐️ Method of cluster analysis that seeks to build a hierarchy of clusters, resulting in a tree like structure called dendrogram.

👉Hierarchical clustering allows us to explore different possibilities (of ‘k’) by cutting the dendrogram at various levels.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_03_01.png

2 Philosophies

Agglomerative (Bottom-Up):
Most common, also known as Agglomerative Nesting (AgNes).

Every data point starts as its own cluster.
In each step, merge the two ‘closest’ clusters.
Repeat step 2, until all points are merged in a single cluster.

Divisive (Top-Down):

All data points start in one large cluster.
In each step, divide the cluster into two halves.
Repeat step 2, until every point is its own cluster.

Agglomerative Clustering Example:

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_05_01.png

Closeness of Clusters

Ward’s Method:
- Merges clusters to minimize the increase in the total within-cluster variance (sum of squared errors), resulting in compact, equally sized clusters.
Single Linkage (MIN):
- Uses the minimum distance between any two points in different clusters.
- Prone to creating long, ‘chain-like’ 🔗 clusters, sensitive to outliers.
Complete Linkage (MAX):
- Uses the maximum distance between any two points in different clusters.
- Forms tighter, more spherical clusters, less sensitive to outliers than single linkage.
Average Linkage:
- Combines clusters by the average distance between all points in two clusters, offering a compromise between single and complete linkage.
- A good middle ground, often overcoming limitations of single and complete linkage.
Centroid Method:
- Merges clusters based on the distance between their centroids (mean points).

👉Single Linkage is more sensitive to outlier than Complete Linkage, as Single Linkage can keep linking to the closest point forming a bridge to outlier.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_07_01.png

👉All cluster linkage distances.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_09_01.png

👉We get different clustering using different linkages.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_10_01.png

Hierarchical Clustering | Agglomerative Nesting | Explained with Example

Previous: Silhouette Score Next: DBSCAN

End of Section

3.3 - DBSCAN

Density Based Spatial Clustering of Application with Noise

DBSCAN : Density- Based Spatial Clustering of Applications with Noise | Explained with Example

End of Section

3.3.1 - DBSCAN

Density Based Spatial Clustering of Application with Noise

Issues with K-Means

Non-Convex Shapes: K-Means can not find ‘crescent’ or ‘ring’ shape clusters.
Noise: K-Means forces every point into a cluster, even outliers.

Main Question for Clustering ?

👉K-Means asks:

“Which center is closest to this point?”

👉DBSCAN asks:

“Is this point part of a dense neighborhood?”

Intuition 💡

Cluster is a contiguous region of high density in the data space, separated from other clusters by areas of low density.

DBSCAN

⭐️Groups closely packed data points into clusters based on their density, and marks points that lie alone in low-density regions as outliers or noise.

Note: Unlike K-means, DBSCAN can find arbitrarily shaped clusters and does not require the number of clusters to be specified beforehand.

Hyper-Parameters 🎛️

Epsilon (eps or $\epsilon$):

Radius that defines the neighborhood around a data point.
If it’s too small, many points will be noise, and if too large, distinct clusters may merge.

Minimum Points(minPts or min_samples):

Minimum number of data points required within a point’s -neighborhood for that point to be considered a dense region (a core point).
Defines threshold for ‘density’.
Rule of thumb: minPts dimensions + 1; use larger value for noisy data (minPts 2*dimensions).

Types of Points

Core Point:
- If it has at least minPts (including itself) within its -neighborhood.
- Forms the dense backbone of the clusters and can expand them.
Border Point:
- If it has at fewer than minPts within its -neighborhood but falls within the -neighborhood of a core point.
- Border points belong to a cluster but cannot expand it further.
Noise Point (Outlier):
- If it is neither a core point nor a border point, i.e., it is not density-reachable from any core point.
- Not assigned to any cluster.

DBSCAN Algorithm ⚙️

Random Start:

Mark all points as unvisited; pick an arbitrary unvisited point ‘P’ from the dataset.

Density Check:

Check the point P’s ϵ-neighborhood.
If ‘P’ has at least minPts, it is identified as a core point, and a new cluster is started.
If it has fewer than minPts, the point is temporarily labeled as noise (it might become a border point later).

Cluster Expansion:

Recursively visit all points in P’s ϵ-neighborhood.
If they are also core points, their neighbors are added to the cluster.

Iteration 🔁:

This ‘density-reachable’ logic continues until the cluster is fully expanded.
The algorithm then picks another unvisited point and repeats the process, discovering new clusters or marking more points as noise until all points are processed.

👉DBSCAN can correctly detect non-spherical clusters.

images/machine_learning/unsupervised/dbscan/dbscan/slide_11_01.png

👉DBSCAN Points and Epsilon-Neighborhood.

images/machine_learning/unsupervised/dbscan/dbscan/slide_12_01.png

When DBSCAN Fails ?

Varying Density Clusters:
- Say A cluster is very dense and B is sparse, a single cannot satisfy both clusters.
High Dimensional Data:
- ‘Curse of Dimensionality’ - In high-dimensional space, the distance between any two points converge.

Note: Sensitive to parameter eps and minPts; tricky to get it work.

👉DBSCAN Failure and Epsilon ((\epsilon)

images/machine_learning/unsupervised/dbscan/dbscan/slide_14_01.png

When to use DBSCAN ?

Arbitrary Cluster Shapes:
- When clusters are intertwined, nested, or ‘moon-shaped’; where K-Means would fail by splitting them.
Presence of Noise and Outliers:
- Robust to noise and outliers because it explicitly identifies low-density points as noise (labeled as -1) rather than forcing them into a cluster.

DBSCAN : Density- Based Spatial Clustering of Applications with Noise | Explained with Example

Previous: Hierarchical Clustering Next: Gaussian Mixture Models

End of Section

3.4 - Gaussian Mixture Model

Gaussian Mixture Model (GMM)

End of Section

3.4.1 - Gaussian Mixture Models

Introduction to Gaussian Mixture Models

Gaussian Mixture Model (GMM) | Foundational Concepts | Explained with Example

Issue with K-Means

K-means uses Euclidean distance and assumes that clusters are spherical (isotropic) and have the same variance across all dimensions.
Places a circle or sphere around each centroid.
- What if the clusters are elliptical ? 🤔

👉K-Means Fails with Elliptical Clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_02_01.png

Gaussian Mixture Model (GMM)

💡GMM: ‘Probabilistic evolution’ of K-Means

⭐️ GMM provides soft assignments and can model elliptical clusters by accounting for variance and correlation between features.

Note: GMM assumes that all data points are generated from a mixture of a finite number of Gaussian Distributions with unknown parameters.

👉GMM can Model Elliptical Clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_03_01.png

What is a Mixture Model

💡‘Combination of probability distributions’.

👉Soft Assignment: Instead of a simple ‘yes’ or ’no’ for cluster membership, a data point is assigned a set of probabilities, one for each cluster.

e.g: A data point might have a 60% probability of belonging to cluster ‘A’, 30% probability for cluster ‘B’, and 10% probability for cluster ‘C’.

👉Gaussian Mixture Model Example:

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_07_01.png

Gaussian PDF

\[{\displaystyle {\mathcal {N}}({\boldsymbol {\mu }},\,{\boldsymbol {\sigma }})}: f(x\,|\,\mu ,\sigma ^{2})=\frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left\{-\frac{(x-\mu )^{2}}{2\sigma ^{2}}\right\}\]

\[ \text{ Multivariate Gaussian, } {\displaystyle {\mathcal {N}}({\boldsymbol {\mu }},\,{\boldsymbol {\Sigma }})}: f(\mathbf{x}\,|\,\mathbf{\mu },\mathbf{\Sigma })=\frac{1}{\sqrt{(2\pi )^{n}|\mathbf{\Sigma }|}}\exp \left\{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu })^{T}\mathbf{\Sigma }^{-1}(\mathbf{x}-\mathbf{\mu })\right\}\]

Note: The term $1/(\sqrt{2\pi \sigma ^{2}})$ is a normalization constant to ensure the total area under the curve = 1.

👉Multivariate Gaussian Example:

Gaussian Mixture

Whenever we have multivariate Gaussian, then the variables may be independent or correlated.

👉Feature Covariance:

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_10_01.png

👉Gaussian Mixture with PDF

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_11_01.png

👉Gaussian Mixture (2D)

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/gmm_2d.png

Previous: DBSCAN Next: Latent Variable Model

End of Section

3.4.2 - Latent Variable Model

Latent Variable Model

Gaussian Mixture Model (GMM) as Latent Variable Model | Why MLE Fails ? | Explained with Example

Gaussian Mixture Model (GMM)

⭐️Probabilistic model that assumes data is generated from a mixture of several Gaussian (normal) distributions with unknown parameters.

🎯GMM represents the probability density function of the data as a weighted sum of ‘K’ component Gaussian densities.

👉Below plot shows the probability of a point being generated by 3 different Gaussians.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_01_01.png

Gaussian Mixture PDF

Overall density $p(x_i|\mathbf{\theta })$ for a data point ‘$x_i$’:

\[p(x_i|\mathbf{\mu},\mathbf{\Sigma} )=\sum _{k=1}^{K}\pi _{k}\mathcal{N}(x_i|\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})\]

K: number of component Gaussians.
$\pi_k$: mixing coefficient (weight) of the k-th component, such that, $\pi_k \ge 0$ and $\sum _{k=1}^{K}\pi _{k}=1$.
$\mathcal{N}(x_i|\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})$: probability density function of the k-th Gaussian component with mean $\mu_k$ and covariance matrix $\Sigma_k$.
$\mathbf{\theta }=\{(\pi _{k},\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})\}_{k=1}^{K}$: complete set of parameters to be estimated.

Note: $\pi _{k}\approx \frac{\text{Number\ of\ points\ in\ cluster\ }k}{\text{Total\ number\ of\ points\ }(N)}$

👉 Weight of the cluster is proportional to the number of points in the cluster.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_04_01.png

👉Below image shows the weighted Gaussian PDF, given the weights of clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_05_01.png

GMM Optimization (Why MLE Fails?)

🎯 Goal of a GMM optimization is to find the set of parameters $\Theta =\{(\pi _{k},\mu _{k},\Sigma _{k})\mid k=1,\dots ,K\}$ that maximize the likelihood of observing the given data.

\[L(\Theta |X)=\sum _{i=1}^{N}\log P(x_i|\Theta )=\sum _{i=1}^{N}\log \left(\sum _{k=1}^{K}\pi _{k}\mathcal{N}(x_i|\mu _{k},\Sigma _{k})\right)\]

🦀 $\log (A+B)$ cannot be simplified.
🦉So, is there any other way ?

Latent Variable (Intuition 💡)

⭐️Imagine we are measuring the heights of people in a college.

We see a distribution with two peaks (bimodal).
We suspect there are two underlying groups:
- Group A (Men) and Group B (Women).

Observation:

Observed Variable (X): Actual height measurements.
Latent Variable (Z): The ‘label’ (Man or Woman) for each person.

Note: We did not record gender, so it is ‘hidden’ or ‘latent’.

Latent Variable Model

A Latent Variable Model assumes that the observed data ‘X’ is generated by first picking a latent state ‘z’ and then drawing a sample from the distribution associated with that state.

GMM as Latent Variable Model

⭐️GMM is a latent variable model, meaning each data point $\mathbf{x}_{i}$ is assumed to have an associated unobserved (latent) variable $z_{i}\in \{1,\dots ,K\}$ indicating which component generated it.

Note: We observe the data point, but we do not observe which cluster it belongs to ($z_i$).

Latent Variable Purpose

👉If we knew the value of $z_i$ (component indicator) for every point, estimating the parameters of each Gaussian component would be straightforward.

Note: The challenge lies in estimating both the parameters of the Gaussians and the values of the latent variables simultaneously.

Cluster Indicator (z) & Log Likelihood (sum)

With ‘z’ unknown:
- maximize: \[ \log \sum _{k}\pi _{k}\mathcal{N}(x_{i}|\mu _{k},\Sigma _{k}) = \log \Big(\pi _{1}\mathcal{N}(x_{i}\mid \mu _{1},\Sigma _{1})+\pi _{2}\mathcal{N}(x_{i}\mid \mu _{2},\Sigma _{2})+ \dots + \pi _{k}\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})\Big)\]
  - $\log (A+B)$ cannot be simplified.
With ‘z’ known:
- The log-likelihood of the ‘complete data’ simplifies into a sum of logarithms: \[\sum _{i}\log (\pi _{z_{i}}\mathcal{N}(x_{i}|\mu _{z_{i}},\Sigma _{z_{i}}))\]
  - Every point is assigned to exactly one cluster, so the sum disappears because there is only one cluster responsible for that point.

Note: This allows the logarithm to act directly on the exponential term of the Gaussian, leading to simple linear equations.

Hard Assignment Simplifies Estimation

👉When ‘z’ is known, every data point is ‘labeled’ with its parent component.
To estimate the parameters (mean $\mu_k$ and covariance $\Sigma_k$) for a specific component ‘k’ :

Gather all data points $x_i$, where $z_i$= k.
Calculate the standard Maximum Likelihood Estimate.(MLE) for that single Gaussian using only those points.

Closed-Form Solution

⭐️ Knowing ‘z’ provides exact counts and component assignments, leading to direct formulae for the parameters:

Mean ($\mu_k$): Arithmetic average of all points assigned to component ‘k’.
Covariance ($\Sigma_k$): Sample covariance of all points assigned to component ‘k’.
Mixing Weight ($\pi_k$): Fraction of total points assigned to component ‘k’.

Previous: Gaussian Mixture Models Next: Expectation Maximization

End of Section

3.4.3 - Expectation Maximization

Expectation Maximization

Expectation Maximization Algorithm | Gaussian Mixture Model | Explained with Example

GMM as Latent Variable Model

⭐️GMM is a latent variable model, where the variable $z_i$ is a latent (hidden) variable that indicates which specific Gaussian component or cluster generated a particular data point.

Chicken 🐓 & Egg 🥚 Problem

If we knew the parameters ($\mu, \Sigma, \pi$) we could easily calculate which cluster ‘z’ each point belongs to (using probability).
If we knew the cluster assignments ‘z’ of each point, we could easily calculate the parameters for each cluster (using simple averages).

🦉But we do not know either of them, as the parameters of the Gaussians - we aim to find and cluster indicator latent variable is hidden.

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_03_01.png

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_04_01.png

Break the Loop 🔁

⛓️‍💥Guess one, i.e, cluster assignment ‘z’ to find the other, i.e, parameters $\mu, \Sigma, \pi$.

Goal 🎯

⛳️ Find latent cluster indicator variable $z_{ik}$.

But $z_{ik}$ is a ‘hard’ assignment’ (either ‘0’ or ‘1’).

🦆 Because we do not observe ‘z’, we use another variable ‘Responsibility’ ($\gamma_{ik}$) as a ‘soft’ assignment (value between 0 and 1).
🐣 $\gamma_{ik}$ is the expected value of the latent variable $z_{ik}$, given the observed data $x_{i}$ and parameters $\Theta$. \[\gamma _{ik}=E[z_{ik}\mid x_{i},\theta ]=P(z_{ik}=1\mid x_{i},\theta )\]

Note: $\gamma_{ik}$ is the posterior probability (or ‘responsibility’) that cluster ‘k’ takes for explaining data point $x_{i}$.

Indicator Variable ➡ Responsibility

\[\gamma _{ik}=E[z_{ik}\mid x_{i},\theta ]=P(z_{ik}=1\mid x_{i},\theta )\]

⭐️Using Bayes’ Theorem, we derive responsibility (posterior probability that component ‘k’ generated data point $x_i$) by combining the prior/weights ($\pi_k$) and the likelihood ($\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})$).

\[\gamma _{ik}= P(z_{ik}=1\mid x_{i},\theta ) = \frac{P(z_{ik}=1)P(x_{i}\mid z_{ik}=1)}{P(x_{i})}\]

\[\gamma _{ik}=\frac{\pi _{k}\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})}{\sum _{j=1}^{K}\pi _{j}\mathcal{N}(x_{i}\mid \mu _{j},\Sigma _{j})}\]

👉Bayes’ Theorem: $P(A|B)=\frac{P(B|A)\cdot P(A)}{P(B)}$

👉 GMM Densities at point

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_08_01.png

👉GMM Densities at point (different cluster weights)

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_09_01.png

Expectation Maximization Algorithm ⚙️

Initialization: Assign initial values to parameters ($\mu, \Sigma, \pi$), often using K-Means results.
Expectation Step (E): Calculate responsibilities; provides ‘soft’ assignments of points to clusters.
Maximization Step (M): Update parameters using responsibilities as weights to maximize the expected log-likelihood.
Convergence: Repeat ‘E’ and ‘M’ steps until the change in log-likelihood falls below a threshold.

Expectation Step

👉Given our current guess of the clusters, what is the probability that point $x_i$ came from cluster ‘k’ ?

\[\gamma (z_{ik})=p(z_{i}=k|\mathbf{x}_{i},\mathbf{\theta }^{(\text{old})})=\frac{\pi _{k}^{(\text{old})}\mathcal{N}(\mathbf{x}_{i}|\mathbf{\mu }_{k}^{(\text{old})},\mathbf{\Sigma }_{k}^{(\text{old})})}{\sum _{j=1}^{K}\pi _{j}^{(\text{old})}\mathcal{N}(\mathbf{x}_{i}|\mathbf{\mu }_{j}^{(\text{old})},\mathbf{\Sigma }_{j}^{(\text{old})})}\]

$\pi_k$ : Probability that a randomly selected data point $x_i$ belongs to the k-th component before we even look at the specific value of $x_i$, such that $\pi_k \ge 0$ and $\sum _{k=1}^{K}\pi _{k}=1$.

Maximization Step

👉Update the parameters ($\mu, \Sigma, \pi$) by calculating weighted versions of the standard MLE formulas using responsibilities as weight 🏋️‍♀️.

\[ \begin{align*} &\bullet \mathbf{\mu }_{k}^{(\text{new})}=\frac{1}{N_{k}}\sum _{i=1}^{N}\gamma (z_{ik})\mathbf{x}_{i} \\ &\bullet \mathbf{\Sigma }_{k}^{(\text{new})}=\frac{1}{N_{k}}\sum _{i=1}^{N}\gamma (z_{ik})(\mathbf{x}_{i}-\mathbf{\mu }_{k}^{(\text{new})})(\mathbf{x}_{i}-\mathbf{\mu }_{k}^{(\text{new})})^{\top } \\ &\bullet \pi _{k}^{(\text{new})}=\frac{N_{k}}{N} \end{align*} \]

where, $N_{k}=\sum _{i=1}^{N}\gamma (z_{ik})$ is the effective number of points assigned to component ‘k'.

Previous: Latent Variable Model Next: Anomaly Detection

End of Section

3.5 - Anomaly Detection

Anomaly/Outlier/Novelty Detection

End of Section

3.5.1 - Anomaly Detection

Anomaly Detection Introduction

Anomaly/Outlier/Novelty Detection Methods | Explained with Example

What is Anomaly?

🦄 Anomaly is a rare item, event or observation which deviates significantly from the majority of the data and does not conform to a well-defined notion of normal behavior.

Note: Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

Anomaly Detection

🐙 Anomaly detection (Outlier detection or Novelty detection) is the identification of unusual patterns or anomalies or outliers in a given dataset.

What to do with Outliers ?

❌ Remove Outliers:

Rejection or omission of outliers from the data to aid statistical analysis, for example to compute the mean or standard deviation of the dataset.
Remove outliers for better predictions from models, such as linear regression.

🔦Focus on Outliers:

Fraud detection in banking and financial services.
Cyber-security: intrusion detection, malware, or unusual user access patterns.

Anomaly Detection Methods 🐉

Supervised
Semi-Supervised
Unsupervised (most common) ✅

Note: Labeled anomaly data is often unavailable in real-world scenarios.

Known Methods 🐈

Statistical Methods: Z-Score, large value means outlier, IQR, point beyond fences (Q1 - 1.5*IQR or Q3 + 1.5*IQR) is flagged as an outlier.
Distance Based: KNN, points far from their neighbors as potential anomalies.
Density Based: DBSCAN, points in low density regions are considered outliers.
Clustering Based: K-Means, points far from cluster centroids that do not fit any cluster are anomalies.

Unsupervised Methods 🦅

Elliptic Envelope (MCD - Minimum Covariance Determinant)
One-Class SVM (OC-SVM)
Local Outlier Factor (LOF)
Isolation Forest (iForest)
RANSAC (Random Sample Consensus)

Previous: Expectation Maximization Next: Elliptic Envelope

End of Section

3.5.2 - Elliptic Envelope

Elliptic Envelope

Elliptic Envelope | Anomaly Detection | Minimum Covariance Determinant (MCD)| Explained with Example

Use Case 🐝

Detect anomalies in multivariate Gaussian data, such as, biometric data (height/weight) where features are normally distributed and correlated.

Assumption: The data can be modeled by a Gaussian distribution.

Intuition 💡

In a normal distribution, most data points cluster around the mean, and the probability density decreases as we move farther away from the center.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_03_01.png

Issue with Euclidean Distance 🐲

🌍 Euclidean distance measures the simple straight-line distance from the center of the cloud.

👉If the data is spherical, this works fine.

🦕 However, real-world data is often stretched or skewed (e.g., taller people are generally heavier), due to correlations between variables, forming an elliptical shape.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_05_01.png

Mahalanobis Distance (Solution)

⭐️Mahalanobis distance essentially re-scales the data so that the elliptical distribution appears spherical, and then measures the Euclidean distance in that transformed space.

👉This way, it measures how many standard deviations($\sigma$) away a point is from the mean, considering the data’s spread and correlation (covariance).

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]

Problem 🦀

Standard methods (like Z-score $z = \frac{x-\mu}{\sigma}$) fail because they are easily skewed by the outliers they are trying to find.

Solution 🦉

💡Instead of using all data, we find a ‘clean’ subset of the data that is most tightly packed and use only that subset to define the ‘normal’ ellipse.

Goal 🎯

👉Find the most dense core of the data.

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]

🦣 Determinant of covariance matrix $\Sigma$ represents the volume of the ellipsoid.

⏲️ Therefore, minimize $|\Sigma|$ to find the tight core.

👉 $\text {Small} ~ \Sigma \rightarrow \text {Large} ~\Sigma ^{-1}\rightarrow \text {Large} ~ D_{M} ~\text {for outliers}$

Minimum Covariance Determinant (MCD) Algorithm ⚙️

MCD algorithm is used to find the covariance matrix $\Sigma$ with minimum determinant, so that the volume of the ellipsoid is minimized.

Initialization: Select several random subsets of size h < n (default h = $\frac{n+d+1}{2}$, d = # dimensions), representing ‘robust’ majority of the data.
Calculate preliminary mean ($\mu$) and covariance ($\Sigma$) for each random subset.
Concentration Step: Iterative core of the algorithm designed to ‘tighten’ the ellipsoid.
- Calculate Distances: Compute the Mahalanobis distance of all ’n’ points in the dataset from the current subset’s mean ($\mu$) and covariance ($\Sigma$).
- Select New Subset: Identify the ‘h’ points with the smallest Mahalanobis distances.
  - These are the points most centrally located relative to the current ellipsoid.
- Update Estimates: Calculate a new and based only on these ‘h’ most central points.
- Repeat 🔁: The steps repeat until the determinant stops shrinking.

Note: Select the best subset that achieved the absolute minimum determinant.

Limitations

Assumptions
- Gaussian data.
- Unimodal data (single center).
Cost 💰of covariance matrix $\Sigma^{-1}$ inversion is O(d^3).

Previous: Anomaly Detection Next: One Class SVM

End of Section

3.5.3 - One Class SVM

One Class SVM

One Class Support Vector Machine (OC-SVM) | Anomaly Detection | Explained with Examples

Use Case (Novelty Detection)🐝

⭐️Only one class of data (normal, non-outlier) is available for training, making standard supervised learning models impossible.

e.g. Only normal observations are available for fraud detection, cyber attack, fault detection etc.

Intuition

images/machine_learning/unsupervised/anomaly_detection/one_class_svm/slide_02_01.png

Problem 🦀

🦂 The core problem is to build a model that can distinguish between ‘normal’ and ‘anomalous’ data when we only have examples of the ‘normal’ class during training.

🦖 We need to find a decision boundary that is as compact as possible while still encompassing the bulk of the training data.

Solution 🦉

💡Instead of finding a hyperplane that separates two different classes, we find a hyperplane that best separates the normal data points from the origin (0,0) in the feature space 🚀.

Goal 🎯

🦍 Define a boundary for a single class in high-dimensional space where data might be non-linearly distributed (e.g.‘U’ shape).

🦧 Use the Kernel Trick to project data into a higher-dimensional space and find a hyperplane that separates the data from the origin with the maximum margin.

One Class SVM

⭐️OC-SVM, as introduced by Bernhard Schölkopf et al., uses a hyperplane ‘H’ defined by a weight vector $\mathbf{w}$ and a bias term $\rho$.

👉Solve the following optimization problem:

\[\min _{\mathbf{w},\xi _{i},\rho }\frac{1}{2}||\mathbf{w}||^{2}+\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}-\rho \]

Subject to constraints:

\[\mathbf{w}\cdot \phi (\mathbf{x}_{i})\ge \rho -\xi _{i}\quad \text{and}\quad \xi _{i}\ge 0,\quad \text{for\ }i=1,\dots ,N\]

Explanation of Terms

$\mathbf{x}_{i}$: i-th training data point.
$\phi (\mathbf{x}_{i})$: RBF kernel function $K(x, y) = \exp(-\gamma \|x-y\|^2)$ that maps the data into a higher-dimensional feature space, making it easier to separate from the origin.
$\mathbf{w}$: normal vector to the separating hyperplane.
$\rho$: scalar bias term that determines the offset of the hyperplane from the origin.
$\xi_i$: Slack variables that allow some data points to fall on the ‘wrong’ side of the hyperplane (inside the anomalous region) to prevent overfitting.
N: total number of training points.
$\nu$: hyper-parameter between 0 and 1. It acts as an upper bound on the fraction of outliers (training data points outside the boundary) and a lower bound on the fraction of support vectors.

Working 🦇

$\frac{1}{2}\|\mathbf{w}\|^{2}$: aims to maximize the margin/compactness of the region.
$\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}-\rho$: penalizes points (outliers) that violate the boundary constraints.

After solving the optimization problem using standard quadratic programming techniques, we obtain the optimal $\mathbf{w}^{*}$ and $\rho ^{*}$.

For a new data point $x_{new}$, decision function is:

\[f(\mathbf{x}_{\text{new}})=\text{sign}(\mathbf{w}^{*}\cdot \phi (\mathbf{x}_{\text{new}})-\rho ^{*})\]

$f(\mathbf{x}_{\text{new}})\ge 0$: normal point.
$f(\mathbf{x}_{\text{new}})< 0$: anomalous point (outlier).

Previous: Elliptic Envelope Next: Local Outlier Factor

End of Section

3.5.4 - Local Outlier Factor

Local Outlier Factor

Local Outlier Factor (LOF) | Anomaly Detection | Explained with Example

Use Case 🐝

⭐️Geographic fraud detection:
A $100 transaction might be ‘normal’ in New York but an ‘outlier’ in a small rural village.

Intuition 💡

‘Local context matters.’

Global distance metrics fail when density is non-uniform.

🦄 An outlier is a point that is ‘unusual’ relative to its immediate neighbors, regardless of how far it is from the center of the entire dataset.

Problem 🦀

💡Traditional distance-based outlier detection methods, such as, KNN, often struggle with datasets where data is clustered at varying densities.

A point in a sparse region might be considered an outlier by a global method, even if it is a normal part of that sparse cluster.
Conversely, a point very close to a dense cluster might be an outlier relative to that specific neighborhood.

Solution 🦉

👉Calculate the relative density of a point compared to its immediate neighborhood.

e.g. If the neighbors are in a dense crowd and the point is not, it is an outlier.

Goal 🎯

📌Compare the density of a point to the density of its neighbors.

Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is a density-based algorithm designed to detect anomalies by measuring the local deviation of a data point relative to its neighbors.

👉Size of the red circle represents the LOF score.

images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_08_01.png

LOF Score Calculation 🔢

K-Distance ($k\text{-dist}(p)$):
- The distance from point ‘p’ to its k-th nearest neighbor.
Reachability Distance ($\text{reach-dist}_{k}(p,o)$): \[\text{reach-dist}_{k}(p,o)=\max \{k\text{-dist}(o),\text{dist}(p,o)\}\]
- $\text{dist}(p,o)$: actual Euclidean distance between ‘p’ and ‘o’.
- This acts as ‘smoothing’ factor.
- If point ‘p’ is very close to ‘o’ (inside o’s k-neighborhood), round up distance to $k\text{-dist}(o)$.
Local Reachability Density ($\text{lrd}_{k}(p)$):
- The inverse of the average reachability distance from ‘p’ to its k-neighbors ($N_{k}(p)$). \[\text{lrd}_{k}(p)=\left[\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\text{reach-dist}_{k}(p,o)\right]^{-1}\]
  - High LRD: Neighbors are very close; the point is in a dense region.
  - Low LRD: Neighbors are far away; the point is in a sparse region.
Local Outlier Factor ($\text{LOF}_{k}(p)$):
- The ratio of the average ‘lrd’ of p’s neighbors to p’s own ‘lrd’. \[\text{LOF}_{k}(p)=\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\frac{\text{lrd}_{k}(o)}{\text{lrd}_{k}(p)}\]

images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_10_01.png

LOF Score 🔢 Interpretation

LOF ≈ 1: Point ‘p’ has similar density to its neighbors (inlier).
LOF > 1: Point p’s density is much lower than its neighbors’ density (outlier).

Previous: One Class SVM Next: Isolation Forest

End of Section

3.5.5 - Isolation Forest

Isolation Forest

Isolation Forest (iForest) | Anomaly Detection | Explained with Example

Use Case 🐝

‘Large scale tabular data.’

Credit card fraud detection in datasets with millions of rows and hundreds of features.

Note: Supervised learning requires balanced, labeled datasets (normal vs. anomaly), which are rarely available in real-world scenarios like fraud or cyber-attacks.

Intuition 💡

🦥 ‘Flip the logic.’

🦄 ‘Anomalies’ are few and different, so they are much easier to isolate from the rest of the data than normal points.

Problem 🦀

🐦‍🔥 ‘Curse of dimensionality.’

🐎 Distance based (K-NN), and density based (LOF) algorithms require calculation of distance between all pair of points.

🐙 As the number of dimensions and data points grows, these calculations become exponentially more expensive 💰 and less effective.

Solution 🦉

Use a tree-based 🌲 approach with better time ⏰ complexity O(nlogn), making it highly scalable for massive datasets and robust in high-dimensional spaces without needing expensive distance metrics.

Goal 🎯

‘Randomly partition the data.’

🦄 If a point is an outlier, it will take fewer partitions (splits) to isolate it into a leaf 🍃 node compared to a point that is buried deep within a dense cluster of normal data.

Isolation Forest (iForest) 🌲🌳

🌳Isolation Forest uses an ensemble of ‘Isolation Trees’ (iTrees) 🌲.

👉iTree (Isolation Tree) 🌲 is a proper binary tree structure specifically designed to separate individual data points through random recursive partitioning.

Algorithm ⚙️

Sub-sampling:
- Select a random subset of data (typically 256 instances) to build an iTree.
Tree Construction: Randomly select a feature.
- Randomly select a split value between the minimum and maximum values of that feature.
- Divide the data into two branches based on this split.
- Repeat recursively until the point is isolated or a height limit is reached.
Forest Creation:
- Repeat 🔁 the process to create ‘N’ trees (typically 100).
Inference:
- Pass a new data point through all trees, calculate the average path length, and compute the anomaly score.

Scoring Function 📟

⭐️🦄 Assign an anomaly score based on the path length h(x) required to isolate a point ‘x’.

Path Length (h(x)): The number of edges ‘x’ traverses from the root node to a leaf node.
Average Path Length (c(n)): Since iTrees are structurally similar to Binary Search Trees (BST), the average path length for a dataset of size ’n’ is given by: \[c(n)=2H(n-1)-\frac{2(n-1)}{n}\]

where, H(i) is the harmonic number, estimated as $\ln (i)+0.5772156649$ (Euler’s constant).

🦄 Anomaly Score

To normalize the score between 0 and 1, we define it as:

\[s(x,n)=2^{-\frac{E(h(x))}{c(n)}}\]

👉E(H(x)): is the average path length of across a forest of trees 🌲.

$s\rightarrow 1$: Point is an anomaly; Path length is very short.
$s\approx 0.5$: Point is normal, path length approximately equal to c(n).
$s\rightarrow 0$: Point is normal; deeply buried point, path length is much larger than c(n).

images/machine_learning/unsupervised/anomaly_detection/isolation_forest/slide_12_01.tif

Drawbacks

Axis-Parallel Splits:
- Standard iTrees 🌲 split only on one feature at a time, so:
  - We do not get a smooth decision boundary.
  - Anything off-axis has a higher probability of being marked as an outlier.
  - Note: Extended Isolation Forest fixes this by using random slopes.
Score Sensitivity: The threshold for what constitutes an ‘anomaly’ often requires manual tuning or domain knowledge.

Previous: Local Outlier Factor Next: RANSAC

End of Section

3.5.6 - RANSAC

RANSAC

Random Sampling Consensus (RANSAC) | Anomaly Detection | Explained with Example

Use Case 🐝

⭐️Estimate the parameters of a model from a set of observed data that contains a significant number of outliers.

Intuition 💡

👉Ordinary Least Squares use all data points to find a fit.

However, a single outlier can ‘pull’ the resulting line significantly, leading to a poor representative model.

💡If we pick a very small random subset of points, there is a higher probability that this small subset contains only good data (inliers) compared to a large set.

Problem 🦀

Ordinary Least Squares (OLS) minimizes the Sum of Errors.
- A huge outlier has an exponentially large impact on the final line.

Solution 🦉

💡Instead of using all points, iteratively pick the smallest possible random subset to fit a model, then check (votes) how many other points in the dataset ‘agree’ with that model.

This gives the name to our algorithm:

Random: Random subsets.
Sampling: Small subsets.
Consensus: Agreement with other points.

RANSAC Algorithm ⚙️

Random Sampling:
- Randomly select a Minimal Sample Set (MSS) of ’n’ points from the input data ‘D’.
- e.g. n=2 for a line, or n=3 for a plane in 3D.
Model Fitting:
- Compute the model parameters using only these ’n’ points.
Test:
- For all other points in ‘D’, calculate the error relative to the model.
- 👉 Points with error < $\tau$(threshold) are added to the ‘Consensus Set’.
Evaluate:
- If the consensus set is larger than the previous best, save this model and set.
Repeat 🔁:
- Iterate ‘k’ times.
Refine (Optional):
- Once the best model is found, re-estimate it using all points in the final consensus set (usually via Least Squares) for a more precise fit.

👉Example:

images/machine_learning/unsupervised/anomaly_detection/ransac/slide_09_01.png

How Many Iterations ‘k’ ?

👉To ensure the algorithm finds a ‘clean’ sample set (no outliers) with a desired probability(often 99%), we use the following formula:

\[k=\frac{\log (1-P)}{\log (1-w^{n})}\]

k: Number of iterations required.
P: Probability that at least one random sample contains only inliers.
w: Ratio of inliers in the dataset (number of inliers / total points).
n: Number of points in the Minimal Sample Set.

Previous: Isolation Forest Next: PCA

End of Section

3.6 - Dimensionality Reduction

Dimensionality Reduction Techniques

End of Section

3.6.1 - PCA

PCA

Principal Component Analysis (PCA) | Dimensionality Reduction | Explained with Example

Use Case 🐝

Data Compression
Noise Reduction
Feature Extraction: Create a smaller set of meaningful features from a larger one.
Data Visualization: Project high-dimensional data down to 2 or 3 dimensions.

Assumption: Linear relationship between features.

Intuition 💡

💡‘Information = Variance’

☁️ Imagine a cloud of points in 2D space.
👀 Look for the direction (axis) along which the data is most ‘spread out’.

🚀 By projecting data onto this axis, we retain the maximum amount of information (variance).

👉Example 1: Var(Feature 1) < Var(Feature 1)

images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_03_01.png

👉Example 2: Red line shows the direction of maximum variance

images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_04_01.png

Principal Component Analysis

🧭 Finds the direction of maximum variance in the data.

Note: Some loss of information will always be there in dimensionality reduction, because there will be some variability in data along the direction that is dropped, and that will be lost.

Goal 🎯

Fundamental goal of PCA is to find the new set of orthogonal axes, called the principal components, onto which the data can be projected, such that, the variance of the projected data is maximum.

PCA as Optimization Problem 🦀

PCA seeks a direction 🧭, represented by a unit vector $\hat{u}$ onto which data can be projected to maximize variance.
The projection of a mean centered data point $x_i$ onto $u$ is $u^Tx_i$.
The variance of these projections can be expressed as $u^{T}\Sigma u$, where $\Sigma$ is the data’s covariance matrix.

Why Variance of Projection is $u^{T}\Sigma u$?

Data: Let $X = \{x_{1},x_{2},\dots ,x_{n}\}$ is the mean centered ($\bar{x} = 0$) dataset.
Projection of point $x_i$ on $u$ is $z_i = u^Tx_i$
Variance of projected points (since $\bar{x} = 0$): \[\text{Var}(z)=\frac{1}{n}\sum _{i=1}^{n}z_{i}^{2}=\frac{1}{n}\sum _{i=1}^{n}(x_{i}^{T}u)^{2}\]
💡Since, $(x_{i}^{T}u)^{2} = (u^{T}x_{i})(x_{i}^{T}u)$ $ \implies\text{Var}(z)=u^{T}\left(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^{T}\right)u$
💡Since, Covariance Matrix, $\Sigma = \left(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^{T}\right)$
👉 Therefore, $\text{Var}(z)=u^{T}\Sigma u$

Constrained 🐣 Optimization

👉 To prevent infinite variance, PCA constrains $u$ to be a unit vector ($\|u\|=1$).

\[\text{maximize\ }u^{T}\Sigma u, \quad \text{subject\ to\ }u^{T}u=1\]

Note: This constraint forces the optimization algorithm to focus purely on the direction that maximizes variance, rather than allowing it to artificially inflate the variance by increasing the length of the vector.

Constrained Optimization Solution 🦉

⏲️ Lagrangian function: $L(u,\lambda )=u^{T}\Sigma u-\lambda (u^{T}u-1)$
🔦 To find $u$ that maximizes above constrained optimization:

\[\frac{\partial L}{\partial u} = 0\]

\[\implies 2\Sigma u - 2\lambda u = 0 \implies \Sigma u = \lambda u\]

\[\because \frac{\partial }{\partial x}x^{T}Ax=2Ax \text{ for symmetric } A\]

💎 This is the standard Eigenvalue Equation.
🧭 So, the optimal directions $u$ are the eigenvectors of the covariance matrix $\Sigma$.

👉 To see which eigenvector maximizes variance, substitute the result back into the variance equation:

\[\text{Variance}=u^{T}\Sigma u=u^{T}(\lambda u)=\lambda (u^{T}u)=\lambda \]

🧭 Since the variance equals the eigenvalue $\lambda$, the direction $u$ that maximizes variance is the eigenvector associated with the largest eigenvalue.

PCA Algorithm ⚙️

Center the data: $X = X - \mu$
Compute the Covariance Matrix: $\Sigma = \frac{1}{n-1} X^T X$
Compute Eigenvectors and Eigenvalues of $\Sigma$ .
Sort eigenvalues in descending order and select the top ‘k’ eigenvectors.
Project the original data onto the subspace: $Z = X W_k$ where, $W_{k}=[u_{1},u_{2},\dots ,u_{k}]$ , matrix formed by ‘k’ eigenvectors corresponding to ‘k’ largest eigenvalues.

images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_13_01.png

Drawbacks

Cannot model non-linear relationships.
Sensitive to outliers.

Previous: RANSAC Next: t-SNE

End of Section

3.6.2 - t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding ( t-SNE) | Dimensionality Reduction | Explained

Use Case 🐝

⭐️ Visualizing complex datasets, such as MNIST handwritten digits, text embeddings, or biological data, where clusters are expected to form naturally.

Intuition 💡

👉 PCA preserves variance, not neighborhoods.
🔬 t-SNE focuses on the ‘neighborhood’ (local structure).

💡Tries to keep points that are close together in high-dimensional space close together in low-dimensional space.

t-SNE

⭐️ Non-linear dimensionality reduction technique to visualize high-dimensional data (like images, gene expressions, text embeddings) in a lower-dimensional space (typically 2D or 3D) by preserving local structures, making clusters and patterns visible.

Problem 🦀

👉 Map high-dimensional points to low-dimensional points , such that the pairwise similarities are preserved, while solving the ‘crowding problem’ (where points collapse into a single cluster).

👉 Crowding Problem

images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_05_01.png

Solution 🦉

📌 Convert Euclidean distances into conditional probabilities representing similarities.
⚖️ Minimize the divergence between the probability distributions of the high-dimensional (Gaussian) and low-dimensional (t-distribution) spaces.

Note: Probabilistic approach to defining neighbors is the core ‘stochastic’ element of the algorithm’s name.

High Dimensional Space 🚀(Gaussian)

💡The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor.

Note: If neighbors are picked in proportion to their probability density under a Gaussian centered at $x_i$.

\[p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}\]

$p_{j|i}$ high: nearby points.
$p_{j|i}$ low: widely separated points.

Note: Make probabilities symmetric: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$

Low Dimensional Space 🚀 (t-distribution)

🧠 To solve the crowding problem, we use a heavy-tailed 🦨 distribution (Student’s-t distribution with degree of freedom $\nu=1$).

\[q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}\]

Note: Heavier tail spreads out dissimilar points more effectively, allowing moderately distant points to be mapped further apart, preventing clusters from collapsing and ensuring visual separation and cluster distinctness.

images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_10_01.png

Optimization 🕸️

👉 Measure the difference between the distributions ‘p’ and ‘q’ using the Kullback-Leibler (KL) divergence, which we aim to minimize:

\[C = KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\]

Gradient Descent 🎢

🏔️ Use gradient descent to iteratively adjust the positions of the low-dimensional points $y_i$.

👉 The gradient of the KL divergence is:

\[\frac{\partial C}{\partial y_{i}}=4\sum _{j\ne i}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i}-y_{j}||^{2})^{-1}\]

Meaning of Terms

C : t-SNE cost function (sum of all KL divergences).
$y_i, y_j$: coordinates of data points and in the low-dimensional space (typically 2D or 3D).
$p_{ij}$: high-dimensional similarity (joint probability) between points $x_i$ and $x_j$, calculated using a Gaussian distribution.
$q_{ij}$: low-dimensional similarity (joint probability) between points $y_i$ and $y_j$, calculated using a Student’s t-distribution with one degree of freedom.
$(1+||y_{i}-y_{j}||^{2})^{-1}$: term comes from the heavy-tailed Student’s t-distribution, which helps mitigate the ‘crowding problem’ by allowing points that are moderately far apart to have a small attractive force.

Interpretation 🦒

💡 The gradient can be understood as a force acting on each point $y_i$ in the low-dimensional map:

\[\frac{\partial C}{\partial y_{i}}=4\sum _{j\ne i}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i}-y_{j}||^{2})^{-1}\]

Attractive forces: If $p_{ij}$ is large ⬆️ and $q_{ij}$ is small ⬇️ (meaning two points were close in the high-dimensional space but are far in the low-dimensional space), the term $(p_{ij}-q_{ij})$ is positive, resulting in a strong attractive force pulling $y_i$ and $y_j$ closer.
Repulsive forces: If $p_{ij}$ is small ⬇️ and $q_{ij}$ is large ⬆️ (meaning two points were far in the high-dimensional space but are close in the low-dimensional space), the term $(p_{ij}-q_{ij})$ is negative, resulting in a repulsive force pushing $y_i$ and $y_j$ apart.

Gradient Descent Update Step

👉 Update step of in low dimensions:

\[y_{i}^{(t+1)}=y_{i}^{(t)}-\eta \frac{\partial C}{\partial y_{i}}\]

Attractive forces ($p_{ij}>q_{ij}$):
- The negative gradient moves $y_i$ opposite to the ($y_i - y_j$) vector, pulling it toward $y_j$.
Repulsive forces ($p_{ij} < q_{ij}$):
- The negative gradient moves $y_i$ in the same direction as the ($y_i - y_j$) vector, pushing it away from $y_j$.

Perplexity 😵‍💫

🏘️ User-defined parameter that loosely relates to the effective number of neighbors.

Note: Variance $\sigma_i^2$ (Gaussian) is adjusted for each point to maintain a consistent perplexity.

t-SNE Plot of MNIST Digits

images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_18_01.png

Previous: PCA Next: UMAP

End of Section

3.6.3 - UMAP

Uniform Manifold Approximation and Projection (UMAP)

Uniform Manifold Approximation and Projection (UMAP) | Dimensionality Reduction | Explained

Use Case 🐝

🐢Visualizing massive datasets where t-SNE is too slow.

⭐️ Creating robust low-dimensional inputs for subsequent machine learning models.

Intuition 💡

⭐️ Using a world map 🗺️ (2D)instead of globe 🌍 for spherical(3D) earth 🌏.

👉It preserves the neighborhood relationships of countries (e.g., India is next to China), and to a good degree, the global structure.

UMAP

⭐️ Non-linear dimensionality reduction technique to visualize high-dimensional data (like images, gene expressions) in a lower-dimensional space (typically 2D or 3D), preserving its underlying structure and relationships.
👉 Constructs a high-dimensional graph of data points and then optimizes a lower-dimensional layout to closely match this graph, making complex datasets understandable by revealing patterns, clusters, and anomalies.

Note: Similar to t-SNE but often faster and better at maintaining global structure.

Problem 🦀

👉 Create a low-dimensional representation that preserves the topological connectivity and manifold structure of the high-dimensional data efficiently.

Solution 🦉

💡 Create a weighted graph (fuzzy simplicial set) representing the data’s topology and then find a low-dimensional graph that is as structurally similar as possible.

Note: Fuzzy means instead of using binary 0/1, we use weights in the range [0,1] for each edge.

High Dimensional Graph (Manifold Approximation)

UMAP determines local connectivity based on a user-defined number of neighbors (n_neighbors).
Normalizes distances locally using the distance to the nearest neighbor ($\rho_i$) and a scaling factor ($\sigma_i$) adjusted to enforce local connectivity constraints.
The weight $W_{ij}$ (fuzzy similarity) in high-dimensional space is: \[W_{ij}=\exp \left(-\frac{\max (0,\|x_{i}-x_{j}\|-\rho _{i})}{\sigma _{i}}\right)\]

Note: This ensures that the closest point always gets a weight of 1, preserving local structure.

Low Dimensional Space (Optimization)

👉In the low-dimensional space (e.g., 2D), UMAP uses a simple curve (similar to the t-distribution used in t-SNE) for edge weights:

\[Z_{ij}=(1+a\|y_{i}-y_{j}\|^{2b})^{-1}\]

Note: The parameters ‘a’ and ‘b’ are typically fixed based on the ‘min_dist’ user parameter (e.g. min_dist = 0.1, then a≈1,577, b≈0.895).

Optimization

⭐️ Unlike t-SNE’s KL divergence, UMAP minimizes the cross-entropy between the high-dimensional weights $W_{ij}$ and the low-dimensional weights $Z_{ij}$.

👉 Cost 💰 Function (C):

\[C=\sum _{i,j}\left(W_{ij}\log \frac{W_{ij}}{Z_{ij}}+(1-W_{ij})\log \frac{1-W_{ij}}{1-Z_{ij}}\right)\]

Cross Entropy Loss

Cost 💰 Function (C):

\[C=\sum _{i,j}\left(W_{ij}\log \frac{W_{ij}}{Z_{ij}}+(1-W_{ij})\log \frac{1-W_{ij}}{1-Z_{ij}}\right)\]

🎯 Goal : Reduce overall Cross Entropy Loss.

Attractive Force: $W_{ij}$ high; make $Z_{ij}$ high to make the log $\frac{W_{ij}}{Z_{ij}} $term small.
Repulsive Force: $W_{ij}$ low; make $Z_{ij}$ low to make the $log\frac{1-W_{ij}}{1-Z_{ij}}$ term small.

Note: This push and pull of 2 ‘forces’ will make the data in low dimensions settle into a position that is overall a good representation of the original data in higher dimensions.

Stochastic Gradient Descent

👉 Optimization uses stochastic gradient descent (SGD) to minimize this cross-entropy, balancing attractive forces (edges present in high-dimension, $W_{ij} \approx 1$) and repulsive forces (edges absent in high-dimension, $W_{ij} \approx 0$).

UMAP Plot of MNIST Digits

images/machine_learning/unsupervised/dimensionality_reduction/umap/slide_13_01.png

Drawbacks 🦂

Mathematically complex.
Requires tuning (n_neighbors, min_dist).

Previous: t-SNE Next: Data Pre Processing

End of Section

4 - Feature Engineering

Feature Engineering

End of Section

4.1 - Data Pre Processing

Data Pre Processing

Data Pre Processing in Machine Learning | Missing Values | Outliers | Scaling | Normalization

Real World 🌎 Data

Messy and Incomplete.
We need to pre-process the data to make it:

Clean
Consistent
Mathematically valid
Computationally stable

👉 So that, the machine learning algorithm can safely consume the data.

Missing Values

Missing Completely At Random (MCAR)
- The missingness occurs entirely by chance, such as due to a technical glitch during data collection or a random human error in data entry.
Missing At Random (MAR)
- The probability of missingness depends on the observed data and not on the missing value itself.
- e.g. In some survey, the age of many females are missing, because they may not like to disclose the information.
Missing Not At Random (MNAR)
- The probability of missingness is directly related to the unobserved missing value itself.
- e.g. Individuals with very high incomes 💰may intentionally refuse to report their salary due to privacy concerns, making the missing data directly dependent on the high income 💰value itself.

Handle Missing Values (Imputation)

Simple Imputation:
- Mean: Normally distributed numerical features.
- Median: Skewed numerical features.
- Mode: Categorical features, most frequent.
KNN Imputation:
- Replace the missing value with mean/median/mode of ‘k’ nearest (similar) neighbors of the missing value.
Predictive Imputation:
- Use another ML model to estimate missing values.
Multivariate Imputation by Chained Equations (MICE):
- Iteratively models each variable with missing values as a function of other variables using flexible regression models (linear regression, logistic regression, etc.) in a ‘chained’ or sequential process.
- Creates multiple datasets, using slightly different random starting points.

Handle Outliers 🦄

🦄 Outliers are extreme or unusual data points, can mislead models, causing inaccurate predictions.

Remove invalid or corrupted data.
Replace (Impute): Median or capped value to reduce impact.
Transform: Apply log or square root to reduce skew.

👉 For example: Log and Square Root Transformed Data

images/machine_learning/feature_engineering/data_pre_processing/slide_05_01.png

Scaling and Normalization

💡 If one feature ranges from 0-1 and another from 0-1000, larger feature will dominate the model.

Standardization (Z-score) :
- μ=0, σ=1; less sensitive to outliers.
- $x_{std} = (x − μ) / σ$
Min-Max Normalization:
- Maps data to specific range, typically [0,1]; sensitive to outliers.
- $x_{minmax} = (x − min) / (max − min)$
Robust Scaling:
- Transforms features using median and IQR; resilient to outliers.
- $x_{scaled}=(x-\text{median})/\text{IQR}$

👉 Standardization Example

images/machine_learning/feature_engineering/data_pre_processing/slide_07_01.png

Previous: UMAP Next: Categorical Variables

End of Section

4.2 - Categorical Variables

Categorical Variables

Previous: Data Pre Processing Next: Feature Engineering

Categorical Variables

💡 ML models operate on numerical vectors.

👉 Categorical variables must be transformed (encoded) while preserving information and semantics.

One Hot Encoding (OHE)
Label Encoding
Ordinal Encoding
Frequency/Count Encoding
Target Encoding
Hash Encoding

One Hot 🔥 Encoding (OHE)

⭐️ When the categorical data (nominal) is without any inherent ordering.

Create binary columns per category.
- e.g.: Colors: Red, Blue, Green.
- Colors: [1,0,0], [0,1,0], [0,0,1]

Note: Use when low cardinality, or small number of unique values (<20).

Label 🏷️ Encoding

⭐️ Assigns a unique integer (e.g., 0, 1, 2) to each category.

When to use ?
- Target variable, i.e, unordered (nominal) data, in classification problems.
- e.g. encoding a city [“Paris”, “Tokyo”, “Amsterdam”] -> [1, 2, 0], (Alphabetical: Amsterdam=0, Paris=1, Tokyo=2).
When to avoid ?
- For nominal data in linear models, because it can mislead the model to assume an order/hierarchy, when there is none.

Ordinal Encoding

⭐️ When categorical data has logical ordering.

Best for: Ordered (ordinal) input features.

Frequency/Count 📟 Encoding

⭐️ Replace categories with their frequency or count in the dataset.

Useful for high-cardinality features where many unique values exist.

👉 Example

images/machine_learning/feature_engineering/categorical_variables/slide_06_01.png

👉 Frequency of Country

images/machine_learning/feature_engineering/categorical_variables/slide_06_03.png

👉 Country replaced with Frequency

images/machine_learning/feature_engineering/categorical_variables/slide_06_02.png

Target 🎯 Encoding

⭐️ Replace a category with the mean of the target variable for that specific category.

When to use ?
- For high-cardinality nominal features, where one hot encoding is inefficient, e.g., zip code, product id, etc.
- Strong correlation between the category and the target variable.
When to avoid ?
- With small datasets, because the category averages (encodings) are based on too few samples, making them unrepresentative.
- Also, it can lead to target leakage and overfitting unless proper smoothing or cross-validation techniques (like K-fold or Leave-One-Out) are used.

Hash 🌿 Encoding

⭐️ Maps categories to a fixed number of features using a hash function.

Useful for high-cardinality features where we want to limit the dimensionality.

End of Section

4.3 - Feature Engineering

Feature Engineering

Previous: Categorical Variables Next: Data Leakage

Feature Engineering

Use domain knowledge 📕 to create new or transform existing features to improve model performance.

Polynomial 🐙 Features

Create polynomial features, such as, x^2, x^3, etc., to learn non-linear relationship.

images/machine_learning/feature_engineering/feature_engineering/slide_04_01.png

Feature Crossing 🦓

⭐️ Combine 2 or more features to capture non-linear relationship.

e.g. combine latitude and longitude into one location feature ‘lat-long'.

Hash 🌿 Encoding

⭐️ Memory-efficient 🧠 technique to convert categorical (string) data into a fixed-size numerical feature vector.

Pros:
- Useful for high-cardinality features where we want to limit the dimensionality.
Cons:
- Hash collisions.
- Reduced interpretability.

👉 Hash Encoding (Example)

images/machine_learning/feature_engineering/feature_engineering/slide_08_01.png

Binning (Discretization)

⭐️ Group continuous numerical values into discrete categories or ‘bin’.

e.g. divide age into groups 18-24, 25-35, 35-45, 45-55, >55 years etc.

End of Section

4.4 - Data Leakage

Data Leakage

Data Leakage in ML | Target Leakage | Temporal Leakage | Train Test Contamination | Explained

Data Leakage

⭐️ Occurs when information ℹ️ NOT available at inference time is used during training 🏃‍♂️, leading to good training performance, but poor real‑world 🌎 performance.

Target Leakage

⭐️ Including features that are only available after the event we are trying to predict.

e.g. Including number_of_late_payments in a model to predict whether a person applying for a bank loan 💵 will default ?

Temporal Leakage

⭐️ Using future data to predict the past.

Fix: Use Time-Series ⏰ Cross-Validation (Walk-forward validation) instead of random shuffling.

Train-Test Contamination

⭐️ Applying preprocessing (like global StandardScaler or Mean_Imputation) on the entire dataset before splitting.

Fix: Compute mean, variance, etc. only on the training 🏃‍♂️data and use the same for validation and test data.

Previous: Feature Engineering Next: Model Interpretability

End of Section

4.5 - Model Interpretability

Model Interpretability

Model Interpretability in ML | Why Model Interpretability is Important ?

House Price Prediction

images/machine_learning/feature_engineering/model_interpretability/slide_01_01.png

Can we explain why the model made a certain prediction ?

👉 Because without this capability the machine learning is like a black box to us.

👉 We should be able to answer which features had most influence on output.

⭐️ Let’s understand ‘Feature Importance’ and why the ML model output’s interpretability is important ?

Feature Importance

\[\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d}\]\[w_1 > w_2 : f_1 \text{ is more important feature than } f_2\]\[ \begin{align*} w_j &> 0: f_j \text { is directly proportional to target variable} \\ w_j &= 0: f_j \text { has no relation to target variable} \\ w_j &< 0: f_j \text { is inversely proportional to target variable} \\ \end{align*} \]

Note: Weights 🏋️‍♀️ represent the importance of feature with standardized data.

Why Model Interpretability Matters ?

💡 Overall model behavior + Why this prediction?

Trust: Stakeholders must trust predictions.
Model Debuggability: Detect leakage, spurious correlations.
Feature engineering: Feedback loop.
Regulatory compliance: Data privacy, GDPR.

Trust

⭐️ Stakeholders Must Trust Predictions.

Users, executives, and clients are more likely to trust and adopt an AI system if they understand its reasoning.
This transparency is fundamental, especially in high-stakes applications like healthcare, finance, or law, where decisions can have a significant impact.

Model Debuggability

⭐️ By examining which features influence predictions, developers can identify if the model is using misleading or spurious correlations, or if there is data leakage (where information not available in a real-world scenario is used during training).

Feature Engineering

⭐️ Insights gained from an interpretable model can provide a valuable feedback loop for domain experts and engineers.

Regulatory Compliance

⭐️ In many industries, regulations mandate the ability to explain decisions made by automated systems.

For instance, the General Data Protection Regulation (GDPR) in Europe includes a “right to explanation” for individuals affected by algorithmic decisions.
Interpretability ensures that organizations can meet these legal and ethical requirements.

Previous: Data Leakage Next: Data Distribution Shift

End of Section

5 - ML System

Machine Learning System

End of Section

5.1 - Data Distribution Shift

Data Distribution Shift

Data Distribution Shift | Covariate Shift | Label Shift | Concept Drift | Explained with Example

Distribution Shift or Data Drift 🦣

⭐️ The data a model works with changes over time ⏰, which causes this model’s predictions to become less accurate as time passes⏳.

Bayes' Theorem

\[P(Y|X)=\frac{P(X|Y)\cdot P(Y)}{P(X)}\]

P(X | Y): Likelihood of X given Y (joint distribution)
P(Y | X) : Model (Posterior)
P(Y): Prior probability of the output Y.
P(X): Evidence (marginal probability of the input X).

Covariate Shift (P(X) Changes)

⭐️The input data distribution seen during training is different from the distribution seen during inference.

👉 P(X)(input) changes, but P(Y|X) (model) remains same.

e.g. Self-driving car 🚗 trained on a bright, sunny day is used during foggy winter.

Label Shift or Prior Probability Shift (P(Y) Changes)

⭐️The output distribution changes, but for a given output, the input distribution remains the same.

👉 P(Y) (output) changes, but P(X|Y) remains the same.

😷 e.g. Flu-detection model is trained during summer, when only 1% of patients have flu.
- The same model is used during winter when 40% of patients have flu.
- 🍎 Prior probability of having flu P(Y) has changed from 1% to 40%, but the symptoms for a person to have flu P(X|Y) remains same.

Concept Drift or Posterior Shift (P(Y|X) Changes)

⭐️ The relationship between inputs and outputs changes.
i.e the very definition of what you are trying to predict changes.

👉 Concept drifts are cyclic or seasonal.

e.g. ‘Normal’ spending behavior in 2019 became ‘Abnormal’ during 2020 lockdowns 🔐.

Previous: Model Interpretability Next: Retraining Strategies

End of Section

5.2 - Retraining Strategies

Retraining Strategies

Model Retraining Strategies | Periodic Retraining | Trigger- Based Retraining | Continual Learning

Why Retrain 🦮 a ML Model?

⭐️In a production ML environment, retraining is the ‘maintenance engine’ ⚙️ that keeps our models from becoming obsolete.

❌ Don’t ask: When do we retrain?

✅ Ask: “How do we automate the decision to retrain while balancing compute cost 💰, model risk, and data freshness?”

Periodic Retraining (Fixed Interval) ⏳

👉 The model is retrained on a regular schedule (e.g., daily, weekly, or monthly).

Best for:
- Stable environments where data changes slowly.
  (e.g. long-term demand forecast or a credit scoring model).
Pros:
- Highly predictable; easy to schedule compute resources; simple to implement via a cron job or Airflow DAG.
Cons:
- Inefficient. You might retrain when not needed (wasting money 💵) or fail to retrain during a sudden market shift (losing accuracy).

Trigger-Based Retraining (Reactive) 🔫

👉 Retraining is initiated only when a specific performance or data metric crosses a pre-defined threshold.

Metric Triggers:
- Performance Decay: A drop in Precision, Recall, or RMSE (requires ground-truth labels).
- Drift Detection: A high PSI (Population Stability Index) or K-S test score indicating covariate shift.
Pros:
- Cost-effective; reacts to the ‘reality’ of the data rather than the calendar.
Cons:
- Requires a robust monitoring stack 📺.
  If the ‘trigger’ logic is buggy, the model may never update.

Continual Learning (Online/Incremental) 🛜

👉 Instead of retraining from scratch on a massive batch, the model is updated incrementally as new data ‘streams’ into the system.

Mechanism: Using ‘Warm Starts’ where the model weights from the previous version are used as the starting point for the next few gradient descent steps.
Best for:
- Recommendation engines (Netflix/TikTok) or High-Frequency Trading 💰where patterns change by the minute.
Pros:
- Extreme ‘freshness’; low latency between data arrival and model update.
Cons:
- High risk of ‘Catastrophic Forgetting’ (the model forgets old patterns) and high infrastructure complexity.

Previous: Data Distribution Shift Next: Deployment Patterns

End of Section

5.3 - Deployment Patterns

Deployment Patterns