Notes

AI & ML Course Overview

1 - Machine Learning

Classical Machine Learning



End of Section

1.1 - Introduction

Introduction To Machine Learning




1.1.1 - What is ML?

What is Machine Learning?


Learn
To gain knowledge or understanding of a skill by study, instruction, or experience.
Machine Learning
To teach computers 💻 to learn from data, find patterns 🧮, and make decisions or predictions without being explicitly programmed for every task, as humans🧍‍♀️🧍 learn from experience.

Phases of Machine Learning

The machine learning lifecycle ♼ is generally divided into two main stages:

  • Training Phase
  • Runtime (Inference) Phase

Training Phase:
Where the machine learning model is developed and taught to understand a specific task using a large volume of historical data.

images/machine_learning/introduction/training.png

Runtime (Inference) Phase:
Where the fully trained and deployed model is put to practical use in a real-world 🌎 environment, i.e., to make predictions on new, unseen data.

images/machine_learning/introduction/inference.png




1.1.2 - Types of ML

Types of Machine Learning


images/machine_learning/introduction/types_of_ml.png
Supervised Learning

Supervised Learning uses labelled data (input-output pairs) to predict outcomes, such as spam filtering.

  • Regression
  • Classification
Unsupervised Learning

Unsupervised Learning finds hidden patterns in unlabelled data (like customer segmentation).

  • Clustering (k-means, hierarchical)
  • Dimensionality Reduction and Data Visualization (PCA, t-SNE, UMAP)
Semi-Supervised Learning

Semi-Supervised Learning uses a mix of both, leveraging a small amount of labelled data with a large amount of unlabelled data to improve accuracy.

  • Pseudo-labeling
  • Graph-based methods
Types of Semi-Supervised Learning

1. Pseudo-labelling:

  • A model is initially trained on the available, limited labelled dataset.
  • This trained model is then used to predict labels for the unlabelled data. These predictions are called ‘pseudo-labels’.
  • The model is then retrained using both the original labelled data and the newly pseudo-labelled data.

Benefit:
It effectively expands the training data by assigning labels to previously unlabelled examples, allowing the model to learn from a larger dataset.
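A sketch of the three steps above, using a toy 1-D nearest-centroid classifier (the data and the `fit_centroids`/`predict` helpers are illustrative assumptions, not part of the notes):

```python
# Minimal pseudo-labelling loop on toy 1-D data (all values assumed for illustration).
def fit_centroids(xs, ys):
    """Train step: one centroid (mean) per class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) /
               sum(1 for y in ys if y == c)
            for c in set(ys)}

def predict(centroids, x):
    """Assign the class of the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

# Step 1: train on the small labelled set.
X_lab, y_lab = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
model = fit_centroids(X_lab, y_lab)

# Step 2: predict pseudo-labels for the unlabelled pool.
X_unlab = [0.5, 8.5, 9.5]
pseudo = [predict(model, x) for x in X_unlab]   # -> [0, 1, 1]

# Step 3: retrain on labelled + pseudo-labelled data together.
model = fit_centroids(X_lab + X_unlab, y_lab + pseudo)
```

In practice the base model is a real classifier and only high-confidence pseudo-labels are kept, but the loop structure stays the same.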

2. Graph-based methods:

  • Data points (both labelled and unlabelled) are represented as nodes in a graph.
  • Edges are established between nodes based on their similarity or proximity in the feature space. The weight of an edge often reflects the degree of similarity.
  • The core principle is that similar data points should have similar labels. This assumption is propagated through the graph, effectively ‘spreading’ the labels from the labelled nodes to the unlabelled nodes based on the graph structure.
  • Various algorithms, such as label propagation or graph neural networks (GNNs), can be employed to infer the labels of unlabelled nodes.

Benefit:
These methods are particularly useful when the data naturally exhibits a graph-like structure or when local neighborhood information is crucial for classification.

Reinforcement Learning

An agent learns to make optimal decisions by interacting with an environment, receiving rewards (positive feedback) or penalties (negative feedback) for its actions.

  • Mimic human trial-and-error learning to achieve a goal 🎯.
Key Components of Reinforcement Learning
  • Agent: The learning entity that makes decisions and takes actions within the environment.
  • Environment: The external system with which the agent interacts. It defines the rules, states, and the consequences of the agent’s actions.
  • State: A specific configuration or situation of the environment at a given point in time.
  • Action: A move or decision made by the agent in a particular state.
  • Reward: A numerical signal received by the agent from the environment, indicating the desirability of an action taken in a specific state.
    Positive rewards encourage certain behaviors, while negative rewards (penalties) discourage them.
  • Policy: The strategy or mapping that defines which action the agent should take in each state to maximize long-term rewards 💰.
How Does Reinforcement Learning Work?
  • Exploration: The agent tries out new actions to discover their effects and potentially find better strategies.
  • Exploitation: The agent utilizes its learned knowledge to choose actions that have yielded high rewards in the past.

Note: The agent continuously balances exploration and exploitation to refine its policy and achieve the optimal behavior.
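This balance can be sketched with an ε-greedy agent on a hypothetical two-armed bandit (the payout probabilities and hyper-parameters below are assumed for illustration, not taken from the notes):

```python
import random

# epsilon-greedy agent on a hypothetical 2-armed bandit (payout probabilities
# and hyper-parameters are assumed for illustration).
random.seed(0)
true_means = [0.2, 0.8]                 # arm 1 pays more on average

def pull(arm):
    """Environment: reward 1 with the arm's payout probability, else 0."""
    return 1.0 if random.random() < true_means[arm] else 0.0

eps = 0.1                               # exploration rate
counts, values = [0, 0], [0.0, 0.0]     # pulls and running mean reward per arm

for _ in range(2000):
    if random.random() < eps:           # Exploration: try a random arm
        arm = random.randrange(2)
    else:                               # Exploitation: pick the best-known arm
        arm = max(range(2), key=lambda a: values[a])
    r = pull(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update

best = max(range(2), key=lambda a: values[a])        # the agent settles on arm 1
```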

Large Language Models

Large Language Models (LLMs) are deep learning models that often employ unsupervised learning techniques during their pre-training phase.

LLMs are trained on massive amounts of raw, unlabelled text data (e.g., books, articles, web pages) to predict the next word in a sequence or fill in masked words.
This process, often called self-supervised learning, allows the model to learn grammar, syntax, semantics, and general world knowledge by identifying statistical relationships within the text.

LLMs generally also undergo supervised fine-tuning (SFT) for specific tasks, where they are trained on labeled datasets to improve performance on those tasks.

Reinforcement Learning from Human Feedback (RLHF) allows LLMs to learn from human judgment, enabling them to generate more nuanced, context-aware, and ethically aligned outputs that better meet human expectations.




1.2 - Supervised Learning

Supervised Machine Learning




1.2.1 - Linear Regression

Linear Regression




1.2.1.1 - Meaning of 'Linear'

Meaning of ‘Linear’ in Linear Regression


What is the meaning of “linear” in Linear Regression?

Equation of a line is of the form \(y = mx + c\).
To represent a line in 2D space, we need 2 things:

  1. m = slope or direction of the line
  2. c = y-intercept or distance from the origin
images/machine_learning/supervised/linear_regression/line.png

Similarly,
A hyperplane is a lower (d-1) dimensional sub-space that divides a d-dimensional space into 2 distinct parts. Equation of a hyperplane:

\[y = w_1x_1 + w_2x_2+ \dots + w_nx_n + w_0 \\[5pt] \implies y = w^Tx + w_0\]

Here, ‘y’ is expressed as a linear combination of parameters - \( w_0, w_1, w_2, \dots, w_n \)
Hence, ‘linear’ means the model is linear with respect to its parameters, NOT the variables.
Read more about Hyperplane

images/machine_learning/supervised/linear_regression/hyperplane.png
Polynomial Features ✅
\[ y = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^3 + w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_0\]

where, \(x_3 = x_1^2\) and \(x_4 = x_2^3\)
\(x_3\) and \(x_4\) are just 2 new (polynomial) variables.
And, ‘y’ is still a linear combination of parameters: \(w_0, w_1, \dots w_4\)

images/machine_learning/supervised/linear_regression/hypersurface.png
Non-Linear Features ✅
\[ y = w_1\log(x) + w_2\sqrt{x}+ w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_0\]

where, \(x_1 = \log(x)\) and \(x_2 = \sqrt{x}\)
\(x_1\) and \(x_2\) are transformations of the variable \(x\).
And, ‘y’ is still a linear combination of parameters: \(w_0, w_1,\) and \(w_2\)

images/machine_learning/supervised/linear_regression/non_linear_features.png
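As a quick numerical check of this idea, the model \(y = w_1\log(x) + w_2\sqrt{x} + w_0\) can be fitted with plain linear least squares by treating \(\log(x)\) and \(\sqrt{x}\) as two new features (a sketch on noise-free synthetic data with assumed true weights):

```python
import numpy as np

# Fit y = w1*log(x) + w2*sqrt(x) + w0 with *linear* least squares,
# by treating log(x) and sqrt(x) as features (synthetic data, assumed weights).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = 3.0 * np.log(x) + 2.0 * np.sqrt(x) + 5.0     # true weights: w1=3, w2=2, w0=5

# Design matrix: columns [log(x), sqrt(x), 1] -- linear in the parameters.
X = np.column_stack([np.log(x), np.sqrt(x), np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w, 3))   # recovers [w1, w2, w0] = [3, 2, 5]
```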

Non-Linear Parameters ❌
\[ y = x_1^{w_1} + x_2^{w_2} + w_0 \]

Taking the log does not help, because the sum inside the logarithm cannot be separated:

\[log(y) = log(x_1^{w_1} + x_2^{w_2} + w_0)\]

Here, the parameters \(w_1\) and \(w_2\) remain inside a non-linear expression, so ‘y’ cannot be written as a linear combination of the parameters.

images/machine_learning/supervised/linear_regression/exponential.png

Importance of Linearity
Linearity in parameters allows us to use Ordinary Least Squares (OLS) to find the best-fit coefficients by solving a set of linear equations, making estimation straightforward.




1.2.1.2 - Meaning of 'Regression'

Meaning of ‘Regression’ in Linear Regression


What is the meaning of “Regression” in Linear Regression?

Regression = Going Back

Regression has a very specific historical origin that is different from its current statistical meaning.

Sir Francis Galton (19th century), cousin of Charles Darwin, coined 🪙 this term.

Observation:
Galton observed that -
the children 👶 of unusually tall ⬆️ parents 🧑‍🧑‍🧒‍🧒, tended to be shorter ⬇️ than their parents 🧑‍🧑‍🧒‍🧒,
and children 👶 of unusually short ⬇️ parents 🧑‍🧑‍🧒‍🧒, tended to be taller ⬆️ than their parents 🧑‍🧑‍🧒‍🧒.

Galton named this biological tendency - ‘regression towards mediocrity/mean’.

Galton used the method of least squares to model this relationship, by fitting a line to the data 📊.

Regression = Fitting a Line
Over time ⏳, the name ‘regression’ got permanently attached to the method of fitting line to the data 📊.

Today in statistics and machine learning, ‘regression’ universally refers to the method of finding the
‘line of best fit’ for a set of data points, NOT the concept of ‘regressing towards the mean’.

images/machine_learning/supervised/linear_regression/line_of_best_fit.png




1.2.1.3 - Linear Regression

Linear Regression


Predict Salary

Let’s understand linear regression using an example to predict salary.

Predict the salary 💰 of an IT employee, based on various factors, such as years of experience, domain, role, etc.

images/machine_learning/supervised/linear_regression/salary_prediction.png

Let’s start with a simple problem and predict the salary using only one input feature.

Goal 🎯 : Find the line of best fit.

Plot: Salary vs Years of Experience

\[y = mx + c = w_1x + w_0\]

Slope = \(m = w_1 \)
Intercept = \(c = w_0\)

images/machine_learning/supervised/linear_regression/salary_yoe.png

Similarly, if we include other factors/features impacting the salary 💰, such as, domain, role, etc, we get an equation of a fitting hyperplane:

\[y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0\]

where,

\[ \begin{align*} x_1 &= \text{Years of Experience} \\ x_2 &= \text{Domain (Tech, BFSI, Telecom, etc.)} \\ x_3 &= \text{Role (Dev, Tester, DevOps, ML, etc.)} \\ x_d &= d^{th} ~ feature \\ w_0 &= \text{Salary of 0 years experience} \\ \end{align*} \]
What are the dimensions of the fitting hyperplane?

Space = ‘d’ features + 1 target variable = ‘d+1’ dimensions
In a ‘d+1’ dimensional space, we try to fit a ‘d’ dimensional hyperplane.

images/machine_learning/supervised/linear_regression/fitting_hyperplane.png
Parameters/Weights of the Model

Let, data = \( \{(x_i, y_i)\}_{i=1}^N ; ~ x_i \in R^d , y_i \in R\)
where, N = number of training samples.

images/machine_learning/supervised/linear_regression/linear_regression_data.png

Note: Fitting hyperplane (\(y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0\)) is the model.
Objective 🎯: find the parameters/weights (\(w_0, w_1, w_2, \dots w_d \)) of the model.

\(\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d x 1}}\) \( \mathbf{x_i} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_d} \end{bmatrix}_{\text{d x 1}} \) \( \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}_{\text{n x 1}} \) \( X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{id} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} _{\text{n x d}} \)
Prediction:

\[ \hat{y_i} = w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0 \]

\[ i^{th} \text{ Prediction } = \hat{y_i} = x_i^Tw + w_0\]

Error = Actual - Predicted

\[ \epsilon_i = y_i - \hat{y_i}\]

Goal 🎯: Minimize error between actual and predicted.

Loss 💸 Function

We can quantify the error for a single data point in following ways:

  • Absolute error = \(|y_i - \hat{y_i}|\)
  • Squared error = \((y_i - \hat{y_i})^2\)
Issues with Absolute Value function
  • Not differentiable at x = 0, which is required for gradient descent.

  • Constant gradient, i.e., \(\pm 1\): the model learns at the same rate, whether the error is large or small.

    images/machine_learning/supervised/linear_regression/absolute_value_function.png
Cost 💰 Function

Average loss across all data points.

Mean Squared Error (MSE) =

\[ J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \]
images/machine_learning/supervised/linear_regression/mean_squared_error.png
Optimization
\[ \begin{align*} \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \\ &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - (x_i^Tw + w_0))^2 \\ &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - (w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0))^2 \\ \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n \left( y_i^2 + w_0^2 + w_1^2x_{i_1}^2 + w_2^2x_{i_2}^2 + \dots + w_d^2x_{i_d}^2 + \text{cross terms} \right) \\ \end{align*} \]

The above equation is quadratic in \(w_0, w_1, w_2, \dots w_d \).

Below is an image of a Paraboloid in 3D, similarly we will have a Paraboloid in ’d’ dimensions.

images/machine_learning/supervised/linear_regression/paraboloid.png
Find the Minima

In order to find the minima of the cost function we need to take its derivative w.r.t weights and equate to 0.

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w_0}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_1}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_2}} = 0 \\ \vdots \\ \frac{\partial{J(w)}}{\partial{w_d}} = 0 \\ \end{align*} \]

We have ‘d+1’ linear equations to solve for ‘d+1’ weights \(w_0, w_1, w_2, \dots , w_d\).

But solving this system of ‘d+1’ linear equations (called the ‘normal equations’) one by one is tedious and NOT used for practical purposes.

Matrix Form of Cost Function
\[ J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \]

\[ J(w) = \frac{1}{n} (y - Xw)^2 \]

where, \(\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d x 1}}\) \( \mathbf{x_i} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_d} \end{bmatrix}_{\text{d x 1}} \) \( \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}_{\text{n x 1}} \) \( X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{id} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} _{\text{n x d}} \)
Prediction:

\[ \hat{y_i} = w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0 \]


Let’s expand the cost function J(w):

\[ \begin{align*} J(w) &= \frac{1}{n} (y - Xw)^2 \\ &= \frac{1}{n} (y - Xw)^T(y - Xw) \\ &= \frac{1}{n} (y^T - w^TX^T)(y - Xw) \\ J(w) &= \frac{1}{n} (y^Ty - w^TX^Ty - y^TXw + w^TX^TXw) \end{align*} \]

Since \(w^TX^Ty\) is a scalar, it is equal to its own transpose.

\[ w^TX^Ty = (w^TX^Ty)^T = y^TXw\]

\[ J(w) = \frac{1}{n} (y^Ty - y^TXw - y^TXw + w^TX^TXw)\]

\[ J(w) = \frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw) \]

Note: here \((y - Xw)^2\) denotes \((y - Xw)^T(y - Xw)\), and \((AB)^T = B^TA^T\)

Normal Equation

To find the minimum, take the derivative of cost function J(w) w.r.t ‘w’, and equate to 0 vector.

\[\frac{\partial{J(w)}}{\partial{w}} = \vec{0}\]\[ \begin{align*} &\frac{\partial{[\frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw)]}}{\partial{w}} = 0\\ & \implies 0 - 2X^Ty + (X^TX + X^TX)w = 0 \\ & \implies \cancel{2}X^TXw = \cancel{2} X^Ty \\ & \therefore \mathbf{w} = (X^TX)^{-1}X^T\mathbf{y} \end{align*} \]

Note: \(\frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a\) and \(\frac{\partial{(\mathbf{x}^TA\mathbf{x})}}{\partial{\mathbf{x}}} = (A + A^T)\mathbf{x}\)

This is the closed-form solution of normal equations.
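A minimal NumPy sketch of the closed-form solution on synthetic data (the true weights are assumed for illustration; the intercept \(w_0\) is absorbed as a column of ones in X):

```python
import numpy as np

# Closed-form solution w = (X^T X)^{-1} X^T y (synthetic data, assumed weights).
rng = np.random.default_rng(42)
n, d = 100, 2
X = np.column_stack([rng.normal(size=(n, d)), np.ones(n)])  # last column -> w0
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                                              # noise-free targets

w = np.linalg.inv(X.T @ X) @ X.T @ y                        # normal equation

print(np.round(w, 3))   # recovers [2, -1, 0.5]
```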

Issues with Normal Equation
  • Inverse may NOT exist (\(X^TX\) non-invertible).
  • Time complexity of calculating the inverse of the d x d matrix \(X^TX\) is O(d^3).
Pseudo Inverse

If the inverse does NOT exist then we can use the approximation of the inverse, also called Pseudo Inverse or Moore Penrose Inverse (\(A^+\)).

Moore Penrose Inverse ( \(A^+\)) is calculated using Singular Value Decomposition (SVD).

SVD of \(A = U \Sigma V^T\)

Pseudo Inverse \(A^+ = V \Sigma^+ U^T\)

Where, \(\Sigma^+\) is the transpose of \(\Sigma\) with the reciprocals of its non-zero singular values on the diagonal.
e.g:

\[ \Sigma = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix} \]

\[ \Sigma^{+} = \begin{bmatrix} 1/5 & 0 \\ 0 & 1/2 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0.2 & 0 \\ 0 & 0.5 \\ 0 & 0 \end{bmatrix} \]

Note: Time complexity of computing the SVD of an m x n matrix is O(mn^2).
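The \(\Sigma\) example above can be reproduced with NumPy’s SVD (a sketch; in practice `np.linalg.pinv` performs the same computation):

```python
import numpy as np

# Pseudo-inverse via SVD: A+ = V Sigma+ U^T (example Sigma from the text).
A = np.array([[5.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T    # reciprocals of non-zero singular values

print(np.allclose(A_plus, np.linalg.pinv(A)))   # True
```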




1.2.1.4 - Probabilistic Interpretation

Probabilistic Interpretation of Linear Regression


Probabilistic Interpretation
Explains why we use the ordinary least squares (squared error) objective to find the model weights/parameters.
Model Assumptions

Error = Random Noise, Un-modeled effects

\[ \begin{align*} \epsilon_i = y_i - \hat{y_i} \\ \implies y_i = \hat{y_i} + \epsilon_i \\ \because \hat{y_i} = x_i^Tw \\ \therefore y_i = x_i^Tw + \epsilon_i \\ \end{align*} \]

Actual value(\(y_i\)) = Deterministic linear predictor(\(x_i^Tw\)) + Error term(\(\epsilon_i\))

Error Assumptions
  • Independent and Identically Distributed (I.I.D):
    Each error term is independent of others.
  • Normal (Gaussian) Distributed:
    Error follows a normal distribution with mean = 0 and a constant variance \(\sigma^2\).

This implies that the target variable itself is a random variable, normally distributed around the linear relationship.

\[(y_{i}|x_{i};w) \sim \mathcal{N}(x_{i}^{T}w,\sigma^{2})\]
images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_04_01.png
images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_05_01.png
Why is the error term’s distribution considered to be Gaussian?

Central Limit Theorem (CLT) states that for a sequence of I.I.D random variables, the distribution of the sample mean (or sum) approaches a normal distribution as the sample size grows, regardless of the original population distribution.

images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_07_01.png
Probability Vs Likelihood
  • Probability (Forward View):
    Quantifies the chance of observing a specific outcome given a known, fixed model.
  • Likelihood (Backward/Inverse View):
    Inverse concept used for inference (working backward from results to causes). It is a function of the parameters and measures how ‘likely’ a specific set of parameters makes the observed data appear.
Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.’

The goal of the probabilistic interpretation is to find the parameters ‘w’ that maximize the probability (likelihood) of observing the given dataset.

Assumption: Training data is I.I.D.

\[ \begin{align*} Likelihood &= \mathcal{L}(w) \\ \mathcal{L}(w) &= p(y|x;w) \\ &= \prod_{i=1}^N p(y_i| x_i; w) \\ &= \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \end{align*} \]
Issue with Likelihood

Maximizing the likelihood function is mathematically complex due to the product term and the exponential function.

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function.

Solution: Log Likelihood
\[ \begin{align*} log \mathcal{L}(w) &= log \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \\ &= \sum_{i=1}^N log(\frac{1}{\sigma\sqrt{2\pi}}) + log (e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}}) \\ log \mathcal{L}(w) &= Nlog(\frac{1}{\sigma\sqrt{2\pi}}) - \sum_{i=1}^N \frac{(y_i-x_i^Tw)^2}{2\sigma^2} \\ \end{align*} \]

Note: The first term is constant w.r.t. ‘w’.

So, we need to find parameters ‘w’ that maximize the log likelihood.

\[ \begin{align*} log \mathcal{L}(w) & \propto -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i-x_i^Tw)^2 \\ & \because \frac{1}{2\sigma^2} \text{ is constant} \\ log \mathcal{L}(w) & \propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \end{align*} \]
Ordinary Least Squares
\[ \begin{align*} log \mathcal{L}(w) &\propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \underset{w}{\mathrm{max}}\ -\sum_{i=1}^N (y_i-x_i^Tw)^2 &= \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i-x_i^Tw)^2 \end{align*} \]

Maximizing the log-likelihood is equivalent to minimizing the sum of squared errors, which is the exact objective of the ordinary least squares (OLS) method.

\[ \underset{w}{\mathrm{min}}\ J(w) = \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i - x_i^Tw)^2 \]




1.2.1.5 - Convex Function

Convex Function


Convexity

Refers to a property of a function where a line segment connecting any two points on its graph lies above or on the graph itself.

  • A convex function is curved upwards.

    images/machine_learning/supervised/linear_regression/convex_function/slide_01_01.png
Is the MSE Cost Function Convex? YES ✅

MSE cost function J(w) is convex because its Hessian (H) is always positive semi definite.

\[\nabla J(\mathbf{w})=\frac{1}{n}\mathbf{X}^{T}(\mathbf{Xw}-\mathbf{y})\]

\[\mathbf{H}=\frac{\partial (\nabla J(\mathbf{w}))}{\partial \mathbf{w}^{T}}=\frac{1}{n}\mathbf{X}^{T}\mathbf{X}\]

\[\therefore \mathbf{H} = \nabla^2J(w) = \frac{1}{n} \mathbf{X}^{T}\mathbf{X}\]
images/machine_learning/supervised/linear_regression/convex_function/slide_05_01.png
MSE: Positive Semi Definite (Proof)

A matrix H is positive semi-definite if and only if, for every vector ‘z’, the quadratic form \(z^THz \ge 0\).

For the Hessian of J(w),

\[ z^THz = z^T(\frac{1}{n}X^TX)z = \frac{1}{n}(Xz)^T(Xz) \]

\((Xz)^T(Xz) = \lVert Xz \rVert^2\) = squared L2 norm (magnitude) of the vector

Note: The squared norm of any real-valued vector is always \(\ge 0\).
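A quick numerical sanity check of this proof, on random data of assumed shape:

```python
import numpy as np

# Verify z^T H z = ||Xz||^2 / n >= 0 for H = (1/n) X^T X (random data, assumed shape).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
H = X.T @ X / n                       # Hessian of the MSE cost

for _ in range(100):
    z = rng.normal(size=d)
    q = z @ H @ z                                    # quadratic form z^T H z
    assert np.isclose(q, np.sum((X @ z) ** 2) / n)   # equals squared norm / n
    assert q >= 0                                    # hence positive semi-definite
```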




1.2.1.6 - Gradient Descent

Gradient Descent


Goal 🎯

Minimize the cost 💰function.

\[J(w)=\frac{1}{2n}(y-Xw)^{2}\]

Note: The 1/2 term is included simply to make the derivative cleaner (it cancels out the 2 from the square).

Issues with Normal Equation

Normal Equation (Closed-form solution) jumps straight to the optimal point in one step.

\[w=(X^{T}X)^{-1}X^{T}y\]

But it is not always feasible and computationally expensive 💰(due to inverse calculation 🧮)

Gradient Descent 🎢

An iterative optimization algorithm that slowly nudges the parameters ‘w’ towards values that minimize the cost💰 function.

images/machine_learning/supervised/linear_regression/gradient_descent/slide_05_01.png
Algorithm ⚙️
  1. Initialize the weights/parameters with random values.
  2. Calculate the gradient of the cost function at current parameter values.
  3. Update the parameters using the gradient. \[ w_{new} = w_{old} - \eta \frac{\partial{J(w)}}{\partial{w_{old}}} \] \( \eta \) = learning rate or step size to take for each parameter update.
  4. Repeat 🔁 steps 2 and 3 iteratively until convergence (to minima).
images/machine_learning/supervised/linear_regression/gradient_descent/slide_07_01.png
Gradient Calculation
\[ \begin{align*} &J(w) = \frac{1}{2n} (y - Xw)^2 \\ &\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{(\frac{1}{2n} (y - Xw)^2)}}{\partial{w}} \end{align*} \]

Applying chain rule:

\[ \begin{align*} &\text{Let } u = (y - Xw), \text{ so } \frac{\partial{u}}{\partial{w}} = -X \\ &\frac{\partial{(u^Tu)}}{\partial{w}} = 2\left(\frac{\partial{u}}{\partial{w}}\right)^{T}u = -2X^Tu \\ \frac{\partial}{\partial{w}}\left(\frac{1}{2n} (y - Xw)^T(y - Xw)\right) &= \frac{1}{\cancel{2}n}\cdot(-\cancel{2}X^T)(y - Xw) \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= \frac{1}{n}X^T(Xw - y) \end{align*} \]

Note: \(\frac{\partial(a^{T}x)}{\partial x}=a\)

Update parameter using gradient:

\[ w_{new} = w_{old} - \eta'\, X^T(Xw - y) \]

where, \(\eta' = \frac{\eta}{n}\) absorbs the constant factor.
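The full loop, using the gradient derived above, can be sketched in NumPy (synthetic data; the learning rate and iteration count are assumed for illustration):

```python
import numpy as np

# Batch gradient descent for linear regression (synthetic data, assumed
# learning rate and iteration count).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(0, 5, n), np.ones(n)])  # feature + bias column
y = X @ np.array([3.0, 1.0])                             # true w1=3, w0=1

w = np.zeros(2)          # step 1: initialize the weights
eta = 0.05               # learning rate
for _ in range(2000):    # step 4: repeat until (approximate) convergence
    grad = X.T @ (X @ w - y) / n   # step 2: gradient (1/n) X^T (Xw - y)
    w -= eta * grad                # step 3: parameter update

print(np.round(w, 2))   # -> close to [3, 1]
```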
Learning Rate
  • Most important hyper parameter of gradient descent.
  • Dictates the size of the steps taken down the cost function surface.

Small \(\eta\) ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_01.png

Large \(\eta\) ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_02.png
Learning Rate Techniques
  • Learning Rate Schedule:
    The learning rate is decayed (reduced) over time.
    Large steps initially and fine-tuning near the minimum, e.g., step decay or exponential decay.
  • Adaptive Learning Rate Methods:
    Automatically adjust the learning rate for each parameter ‘w’ based on the history of gradients.
    Preferred in modern deep learning as they require less manual tuning, e.g., Adagrad, RMSprop, and Adam.
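A tiny sketch of an exponential decay schedule (the initial rate and decay factor are assumed for illustration):

```python
# Exponential learning-rate decay: eta_t = eta0 * (1 - decay)^t
eta0, decay = 0.1, 0.05          # assumed hyper-parameters

def eta(t):
    """Learning rate at step/epoch t: large steps early, fine-tuning later."""
    return eta0 * (1 - decay) ** t

print(round(eta(0), 4), round(eta(10), 4))   # 0.1 0.0599
```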
Types of Gradient Descent 🎢

Batch, Stochastic, and Mini-Batch gradient descent are distinguished by how much data is used for the gradient calculation in each iteration.

  • Batch Gradient Descent (BGD): Entire Dataset
  • Stochastic Gradient Descent (SGD): Random Point
  • Mini-Batch Gradient Descent (MBGD): Subset
Batch Gradient Descent 🎢 (BGD)

Computes the gradient using all the data points in the dataset for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(average of all ’n’ gradients)}\]

🔑Key Points:

  • Slow 🐢 steps towards convergence, i.e., time complexity per step = O(n).
  • Smooth, direct path towards minima.
  • Minimum number of steps/iterations.
  • Not suitable for large datasets; impractical for Deep Learning, as n = millions/billions.
Stochastic Gradient Descent 🎢 (SGD)

Uses only 1 data point selected randomly from dataset to compute gradient for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(gradient at any random data point)}\]

🔑Key Points:

  • Computationally fastest 🐇 per step; time complexity per step = O(1).
  • Highly noisy, zig-zag path to minima.
  • High variance in gradient estimation makes path to minima volatile, requiring a careful decay of learning rate to ensure convergence to minima.
Mini Batch Gradient Descent 🎢 (MBGD)
  • Uses small randomly selected subsets of dataset, called mini-batch, (1<k<n), to compute gradient for parameter update in each iteration. \[w_{new} = w_{old} - \eta.\text{(average gradient of ‘k' data points)}\]

🔑Key Points:

  • Moderate time ⏰ consumption per step; time complexity per step = O(k), where 1 < k < n.
  • Less noisy, and more reliable convergence than stochastic gradient descent.
  • More efficient and faster than batch gradient descent.
  • Standard optimization algorithm for Deep Learning.

Note: Vectorization on GPUs allows for parallel processing of mini-batches; mini-batch sizes are usually powers of 2 to align well with GPU memory architecture.
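A mini-batch sketch of the update loop (the batch size, learning rate, and data are assumed for illustration; setting k = n recovers BGD, and k = 1 recovers SGD):

```python
import numpy as np

# Mini-batch gradient descent (batch size k=32, assumed hyper-parameters).
rng = np.random.default_rng(1)
n, k, eta = 512, 32, 0.05
X = np.column_stack([rng.uniform(0, 5, n), np.ones(n)])  # feature + bias column
y = X @ np.array([3.0, 1.0])                             # true w1=3, w0=1

w = np.zeros(2)
for epoch in range(200):
    idx = rng.permutation(n)                    # reshuffle each epoch
    for start in range(0, n, k):
        b = idx[start:start + k]                # one mini-batch of k indices
        grad = X[b].T @ (X[b] @ w - y[b]) / k   # average gradient over the batch
        w -= eta * grad

print(np.round(w, 2))   # -> close to [3, 1]
```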
BGD vs SGD vs Mini-BGD
images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_08_01.png
images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_09_01.png




1.2.1.7 - Polynomial Regression

Polynomial Regression


What if our data is more complex than a straight line?

We can use a linear model to fit non-linear data.

Add powers of each feature as new features, then train a linear model on this extended set of features.

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_02_01.png
Polynomial Regression

Linear: \(\hat{y_i} = w_0 + w_1x_{i_1} \)

Polynomial (quadratic): \(\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2\)

Polynomial (n-degree): \(\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2 +w_3x_{i_1}^3 + \dots + w_nx_{i_1}^n \)

Above polynomial can be re-written as linear equation:

\[\hat{y_i} = w_0 + w_1X_1 + w_2X_2 +w_3X_3 + \dots + w_nX_n \]

where, \(X_1 = x_{i_1}, X_2 = x_{i_1}^2, X_3 = x_{i_1}^3, \dots, X_n = x_{i_1}^n\)

=> the model is still linear w.r.t. its parameters/weights \(w_0, w_1, w_2, \dots , w_n \).

e.g:

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_04_01.png
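The quadratic case can be sketched by adding \(x^2\) as a new feature and solving with linear least squares (noise-free synthetic data with assumed coefficients):

```python
import numpy as np

# Quadratic regression as a linear model: features [x, x^2, 1]
# (synthetic data with assumed true coefficients).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = 0.5 * x**2 + 1.0 * x + 2.0                    # true w2=0.5, w1=1, w0=2

X = np.column_stack([x, x**2, np.ones_like(x)])   # powers of x as new features
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w, 3))   # recovers [w1, w2, w0] = [1, 0.5, 2]
```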
Strategy to find Polynomial Features
  • Fit a linear model to the data points.
  • Plot the errors.
  • If the variance of errors is too high, then try polynomial features.

Note: Detect and remove outliers from error distribution.

High Degree Polynomial Regression
images/machine_learning/supervised/linear_regression/polynomial_regression/slide_07_01.png
Conclusion
  • Polynomial model : Over-fitting ❌
  • Linear model : Under-fitting ❌
  • Quadratic model: Generalizes best ✅




1.2.1.8 - Data Splitting

Data Splitting


Why is Data-Splitting Required?
To avoid over-fitting (memorize), so that, the model generalizes well, improving its performance on unseen data.
Train/Validation/Test Split
  • Training Data: Learn model parameters (Textbook + Practice problems)
  • Validation Data: Tune hyper-parameters (Mock tests)
  • Test Data: Evaluate model performance (Real (final) exam)
Data Leakage

Data leakage occurs when information from the validation or test set is inadvertently used to train 🏃‍♂️ the model.

The model ‘cheats’ by learning to exploit information it should not have access to, resulting in artificially inflated performance metrics during testing 🧪.

Typical Split Ratios
  • Small datasets (1K-100K): 60/20/20, 70/15/15 or 80/10/10
  • Large datasets (>1M): 98/1/1 would suffice, as 1% of 1M is still 10K.

Note: There is no fixed rule; it’s trial and error.

Imbalanced Data
Imbalanced data refers to a dataset where the target classes are represented by an unequal or highly skewed distribution of samples, such that the majority class significantly outnumbers the minority class.
Stratified Sampling

If there is class imbalance in the dataset (e.g., 95% class A, 5% class B), a random split might result in the validation set having 99% class A.

Solution: Use stratified sampling to ensure class proportions are maintained across all splits (train️/validation/test).

Note: Non-negotiable for imbalanced data.
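A minimal stratified 80/20 split in NumPy, splitting each class separately so the class proportions survive (the 95/5 class counts below are illustrative):

```python
import numpy as np

# Stratified train/validation split (80/20): shuffle and split *per class*.
rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)          # imbalanced: 95% class A, 5% class B

train_idx, val_idx = [], []
for c in np.unique(y):
    idx = rng.permutation(np.where(y == c)[0])  # shuffle this class's indices
    cut = int(0.8 * len(idx))
    train_idx.extend(idx[:cut])
    val_idx.extend(idx[cut:])

# Both splits keep the original 95/5 class ratio.
print(np.mean(y[train_idx]), np.mean(y[val_idx]))   # 0.05 0.05
```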

Time-Series ⏳ Data
  • In time-series ⏰ data, divide the data chronologically, not randomly, i.e., training data time ⏰ should precede validation data time ⏰.
  • We always train 🏃‍♂️ on past data to predict future data.

Golden rule: Never look 👀 into the future.




1.2.1.9 - Cross Validation

Cross Validation


Core Idea 💡

Do not trust one split of the data; validate across many splits, and average the result to reduce randomness and bias.

Note: Two different splits of the same dataset can give very different validation scores.

Cross-validation

Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.

It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.

  • K-Fold Cross-Validation
  • Leave-One-Out Cross-Validation (LOOCV)
K-Fold Cross-Validation
  1. Shuffle the dataset randomly (except time-series ⏳).
  2. Split data into k equal subsets(folds).
  3. Iterate through each unique fold, using it as the validation set.
  4. Use the remaining k-1 folds for training 🏃‍♂️.
  5. Take an average of the results.

Note: Common choices are k = 5 or k = 10.
  • Iteration 1: [V][T][T][T][T]
  • Iteration 2: [T][V][T][T][T]
  • Iteration 3: [T][T][V][T][T]
  • Iteration 4: [T][T][T][V][T]
  • Iteration 5: [T][T][T][T][V]
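The five iterations above can be sketched in plain Python (shuffling is omitted here for determinism, and the fold size assumes k divides n):

```python
# K-fold split sketch (k=5) over indices 0..n-1, pure Python.
def k_fold(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold]             # fold i -> validation
        train = idx[:i * fold] + idx[(i + 1) * fold:]  # remaining k-1 folds
        yield train, val

scores = []
for train, val in k_fold(n=20, k=5):
    scores.append(len(val))          # placeholder for a real model's score

print(len(scores), scores[0])   # 5 folds, 4 validation points each
```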
Leave-One-Out Cross-Validation (LOOCV)

Model is trained 🏃‍♂️ on all data points except one, and then tested 🧪 on that remaining single observation.

LOOCV is an extreme case of k-fold cross-validation, where, k=n (number of data points).

  • Pros:
    Useful for small (<1000) datasets.

  • Cons:
    Computationally 💻 expensive 💰.




1.2.1.10 - Bias Variance Tradeoff

Bias Variance Tradeoff


Bias-Variance Decomposition

Mean Squared Error (MSE) = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2\)

Total Error = Bias^2 + Variance + Irreducible Error

  • Bias = Systematic Error
    • Bias measures how far the average prediction of a model is from the true value.
  • Variance = Sensitivity to Data
    • Variance measures how much the predictions of a model vary for different training datasets.
  • Irreducible Error = Sensor noise, Human randomness
    • Inherent uncertainty in the data generation process itself and cannot be reduced by any model.
Bias

Systematic error from overly simplistic assumptions or a strong opinion in the model.

e.g. House 🏠 prices 💰 = Rs. 10,000 * Area (sq. ft).

Note: This is an oversimplified view, because it ignores amenities, location, age, etc.

Variance

Error from sensitivity to small fluctuations 📈 in the data.

e.g. Deep neural 🧠 network trained on a small dataset.

Note: Memorizes everything, including noise.

Say a house 🏠 in XYZ street was sold for very low price 💰.

Reason: Distress selling (outlier), or incorrect entry (noise).

Note: The model will make wrong (lower) price 💰 predictions for all houses in XYZ street.

Linear (High Bias), Polynomial (High Variance)
images/machine_learning/supervised/linear_regression/bias_variance_tradeoff/slide_06_01.png
Bias Variance Table
images/machine_learning/supervised/linear_regression/bias_variance_tradeoff/bias_variance_table.png
Bias-Variance Trade-Off

Goal 🎯 is to minimize total error.

Find a sweet-spot balance ⚖️ between Bias and Variance.

A good model ‘generalizes’ well, i.e.,

  • Is not too simple or has a strong opinion.
  • Does not memorize 🧠 everything in the data, including noise.
Fix 🩹 High Bias (Under-Fitting)
  • Make model more complex.
    • Add more features, add polynomial features.
  • Decrease Regularization.
  • Train 🏃‍♂️longer, the model has not yet converged.
Fix 🩹 High Variance (Over-Fitting)
  • Add more data (most effective).
    • Harder to memorize 🧠 1 million examples than 100.
    • Use data augmentation, if getting more data is difficult.
  • Increase Regularization.
  • Early stopping 🛑, prevents memorization 🧠.
  • Dropout (DL), randomly kill neurons, prevents co-adaptation.
  • Use Ensembles.
    • Averaging reduces variance.

Note: Co-adaptation refers to a phenomenon where neurons in a neural network become highly dependent on each other to detect features, rather than learning independently.
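The under-fitting vs over-fitting contrast can be seen numerically — a hedged NumPy sketch fitting a degree-1 polynomial (high bias) and a degree-15 polynomial (high variance) to the same noisy toy data; the underlying function `sin(2πx)` and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth nonlinear function (assumed ground truth)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def poly_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

train_lo, test_lo = poly_mse(1)    # high bias: underfits both sets
train_hi, test_hi = poly_mse(15)   # high variance: tiny train error, larger test error
```

The degree-15 model fits the training set far better, yet its test error exceeds its train error — the memorization gap the tradeoff describes.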



End of Section

1.2.1.11 - Regularization

Regularization


Over-Fitting

Over-Fitting happens when we have a complex model that creates highly erratic curve to fit every single data point, including the random noise.

  • Excellent performance on training 🏃‍♂️ data, but poor performance on unseen data.
How to Avoid Over-Fitting ?
  • Penalty for overly complex models, i.e., models with excessively large or numerous parameters.
  • Focus on learning general patterns rather than memorizing 🧠 everything, including noise in training 🏃‍♂️data.
Regularization

Set of techniques that prevents over-fitting by adding a penalty term to the model’s loss function.

\[ J_{reg}(w) = J(w) + \lambda.\text{Penalty(w)} \]

\(\lambda\) = Regularization strength hyperparameter; the bias-variance tradeoff knob.

  • High \(\lambda\): High ⬆️ penalty, forces weights towards 0, simpler model.
  • Low \(\lambda\): Weak ⬇️ penalty, closer to un-regularized model.
Regularization introduces Bias
  • By intentionally simplifying a model (shrinking weights) to reduce its complexity, which prevents it from overfitting.
  • Penalty pulls feature weights closer to zero, making the model a less faithful representation of the training data’s true complexity.
Common Regularization Techniques
  • L2 Regularization (Ridge Regression)
  • L1 Regularization (Lasso Regression)
  • Elastic Net Regularization
  • Early Stopping
  • Dropout (Neural Networks)
L2 Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^n w_j^2 \]
  • Penalty term: \(\ell_2\) norm - penalizes large weights quadratically.
  • Pushes the weights close to 0 (not exactly 0), making models more stable by distributing importance across weights.
  • Splits feature importance across correlated features.
  • Use case: Best when most features are relevant and correlated.
  • Also known as Ridge regression or Tikhonov regularization.
L1 Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^n |w_j| \]
  • Penalty term: \(\ell_1\) norm.
  • Shrinks some weights exactly to 0, effectively performing feature selection, giving sparse solutions.
  • For a group of highly correlated features, arbitrarily selects one feature and shrinks the others to 0.
  • Use case: Best for high-dimensional datasets (d ≫ n) where we suspect many features are irrelevant or redundant, or when model interpretability matters.
  • Also known as Lasso (Least Absolute Shrinkage and Selection Operator) regression.
  • Computational hack: define \(\frac{\partial{|w_j|}}{\partial{w_j}} = 0\) at \(w_j = 0\) (a subgradient choice), since the absolute value function is not differentiable at 0.
Elastic Net Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.((1-\alpha).\sum_{j=1}^n w_j^2 + \alpha.\sum_{j=1}^n |w_j|) \]
  • Penalty term: a linear combination of the \(\ell_1\) and \(\ell_2\) norms.
  • Sparsity(feature-selection) of L1 and stability/grouping effect of L2 regularization.
  • Use case: Best when we have high dimensional data with correlated features and we want sparse and stable solution.
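One way to see L2 shrinkage concretely is the closed-form ridge solution \(w = (X^TX + \lambda I)^{-1}X^Ty\). A minimal NumPy sketch on synthetic data (the dataset, true weights, and \(\lambda\) values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_unreg = ridge(X, y, 0.0)       # ordinary least squares
w_mild  = ridge(X, y, 1.0)       # mild shrinkage
w_heavy = ridge(X, y, 1000.0)    # strong shrinkage: weights pushed toward 0 (never exactly 0)
```

Higher \(\lambda\) yields a strictly smaller weight norm. L1 (Lasso) has no such closed form, which is why libraries solve it with coordinate descent or subgradient methods.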
L1/L2/Elastic Net Regularization Comparison
images/machine_learning/supervised/linear_regression/regularization/slide_12_01.png
Why weights shrink exactly to 0 in L1 regularization but NOT in L2 regularization ?
  • Because the gradient of L1 penalty (absolute function) is a constant value, i.e, \(\pm 1\), this means a constant reduction in weight at each step, making it gradually reach to 0 in finite steps.
  • Whereas, the derivative of L2 penalty is proportional to the weight (\(2w_j\)) and as the weight reaches close to 0, the gradient also becomes very small, this means that the weight will become very close to 0, but not exactly equal to 0.
L1 vs L2 Regularization Comparison
images/machine_learning/supervised/linear_regression/regularization/slide_14_01.png



End of Section

1.2.1.12 - Regression Metrics

Regression Metrics


Regression Metrics
Quantify the difference between the actual values and the predicted values.
Mean Absolute Error(MAE)

MAE = \(\frac{1}{n} \sum_{i=1}^n |y_i - \hat{y_i}|\)

  • Treats each error equally.
    • Robust to outliers.
  • Not differentiable at x=0.
    • Using gradient descent requires computational hack.
  • Easy to interpret, as same units as target variable.
Mean Squared Error(MSE)

MSE = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2\)

  • Heavily penalizes large errors.
    • Sensitive to outliers.
  • Differentiable everywhere.
    • Used by gradient descent and most other optimization algorithms.
  • Difficult to interpret, as it has squared units.
Root Mean Squared Error(RMSE)

RMSE = \(\sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2}\)

  • Easy to interpret, as it has same units as target variable.
  • Useful when we need outlier-sensitivity of MSE but the interpretability of MAE.
R^2 Metric

Measures improvement over mean model.

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]

A good R^2 value depends upon the use case, e.g.:

  • Car 🚗 sale price prediction: R^2 = 0.8 is good enough.
  • Cancer 🧪 prediction: R^2 ≥ 0.95, as life depends on it.

Range of values:

  • Best value = 1
  • Baseline value = 0
  • Worst value = \(- \infty\)

Note: An example of a bad model (negative R^2): the data points lie along the x-axis, but the model predicts values along the y-axis.
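All four metrics can be computed directly from their definitions — a small NumPy sketch with made-up values (note the baseline check: a model that always predicts the mean gets R^2 = 0):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 directly from their definitions."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)   # 0.25, 0.125, ~0.354, 0.975

# The mean model (always predicts y_true.mean()) is the R^2 = 0 baseline.
_, _, _, r2_baseline = regression_metrics(y_true, np.full(4, y_true.mean()))
```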

Huber Loss

Quadratic for small errors; Linear for large errors.

\[ L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \le \delta \\ \\ \delta (|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise} \end{cases} \]
  • Robust to outliers.
  • Differentiable at 0; smooth convergence to minima.
  • Delta (\(\delta\)) knob(hyper parameter) to control.
  • \(\delta\) high: MSE
  • \(\delta\) low: MAE

Note: Tune \(\delta\): behaves like MAE for errors > \(\delta\) (outliers); like MSE for small errors < \(\delta\).
e.g., \(\delta\) = 95th percentile of errors, or 1.35\(\sigma\) for standard Gaussian data.

Huber loss (Green) and Squared loss (blue)

images/machine_learning/supervised/linear_regression/regression_metrics/slide_09_01.png
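The piecewise definition above can be sketched directly in NumPy (the sample errors and \(\delta\) values are illustrative):

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.35):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 5.0])
losses = huber_loss(errors, np.zeros(3), delta=1.0)
# Small errors get the squared penalty (0.125, 0.5);
# the outlier error of 5.0 costs 4.5 instead of the squared loss's 12.5.
```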



End of Section

1.2.1.13 - Assumptions

Assumptions Of Linear Regression


Assumptions

Linear Regression works reliably only when certain key 🔑 assumptions about the data are met.

  • Linearity
  • Independence of Errors (No Auto-Correlation)
  • Homoscedasticity (Equal Variance)
  • Normality of Errors
  • No Multicollinearity
  • No Endogeneity (Target not correlated with the error term)
Linearity

Relationship between features and target 🎯 is linear in parameters.

Note: Polynomial regression is linear regression.
\(y=w_0 +w_1x_1+w_2x_2^2 + w_3x_3^3\)

Independence of Errors (No Auto-Correlation)

Residuals (errors) should not have a visible pattern or correlation with one another (most common in time-series ⏰ data).

Risk:
If errors are correlated, standard errors will be underestimated, making variables look ‘statistically significant’ when they are not.

Test:

  • Durbin–Watson test

  • Autocorrelation plots (ACF)

  • Residuals vs time

    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_05_01.png
Homoscedasticity

Constant variance of errors; Var(ϵ|X) = σ²

Risk:
Standard errors become biased, leading to unreliable hypothesis tests (t-tests, F-tests).

Test:

  • Breusch–Pagan test
  • White test

Fix:

  • Log transform
  • Weighted Least Squares(WLS)
Normality of Errors

Error terms should follow a normal distribution; (Required for small datasets.)

Note: Because of Central Limit Theorem, with a large enough sample size, this becomes less critical for estimation.

Risk: Hypothesis testing (calculating p-values and confidence intervals), we assume the error terms follow a normal distribution.

Test:

  • Q-Q plot

  • Shapiro-Wilk Test

    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_08_01.png
    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_09_01.png
No Multicollinearity

Features should not be highly correlated with each other.

Risk:

  • High correlation makes it difficult to determine the unique, individual impact of each feature. This leads to high variance in model parameter estimates, small changes in data cause large swings in parameters.
  • Model interpretability issues.

Test:

  • Variance Inflation Factor (VIF): VIF > 5 → concern; VIF > 10 → serious issue

Fix:

  • PCA
  • Remove redundant features
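VIF can be computed by regressing each feature on all the others and applying \(VIF_j = 1/(1 - R_j^2)\). A NumPy-only sketch on synthetic data (in practice you would typically use `statsmodels`' `variance_inflation_factor`; the toy features here are assumptions for illustration):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on the remaining features (plus an intercept)."""
    n, d = X.shape
    vifs = []
    for j in range(d):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)      # least-squares fit
        ss_res = np.sum((y - A @ w) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                 # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 200)         # nearly collinear with x1 -> large VIF
X = np.column_stack([x1, x2, x3])
```

On this data, `vif(X)` flags the first and third features (VIF well above 10) while the independent second feature stays near 1.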
No Endogeneity (Exogeneity)

Error term must be uncorrelated with the features; E[ϵ|X] = 0

Risk:

  • Parameters will be biased and inconsistent.

Test:

  • Hausman Test
  • Durbin-Wu-Hausman (DWH) Test

Fix:

  • Controlled experiments.



End of Section

1.2.2 - Logistic Regression

Logistic Regression



End of Section

1.2.2.1 - Binary Classification

Binary Classification

Binary Classification
images/machine_learning/supervised/logistic_regression/binary_classification_intro/slide_02_01.tif
Why can’t we use Linear Regression for binary classification ?

Linear regression tries to find the best fit line, but we want to find the line or decision boundary that clearly separates the two classes.

images/machine_learning/supervised/logistic_regression/binary_classification_intro/slide_03_01.tif
Goal 🎯

Find the decision boundary, i.e., the equation of the separating hyperplane.

\[z=w^{T}x+w_{0}\]
Decision Boundary

Value of \(z = \mathbf{w^Tx} + w_0\) tells us how far is the point from the decision boundary and on which side.

Note: Weight 🏋️‍♀️ vector ‘w’ is normal/perpendicular to the hyperplane, pointing towards the positive class (y=1).

Distance of Points from Separating Hyperplane
  • For points exactly on the decision boundary \[z = \mathbf{w^Tx} + w_0 = 0 \]
  • Positive (+ve) labeled points \[ z = \mathbf{w^Tx} + w_0 > 0 \]
  • Negative (-ve) labeled points \[ z = \mathbf{w^Tx} + w_0 < 0 \]
Missing Link 🔗
The distance of a point from the hyperplane can range from \(-\infty\) to \(+ \infty\).
So we need a link 🔗 to transform the geometric distance to probability.
Sigmoid Function (a.k.a Logistic Function)

Maps the output of a linear equation to a value between 0 and 1, allowing the result to be interpreted as a probability.

\[\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\]
  • If the distance ‘z’ is large and positive, \(\hat{y} \approx 1\) (High confidence).

  • If the distance ‘z’ is 0, \(\hat{y} = 0.5\) (Maximum uncertainty).

    images/machine_learning/supervised/logistic_regression/binary_classification_intro/slide_10_01.png
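A quick numerical illustration of this mapping, with toy z values chosen to span the confident-negative to confident-positive range:

```python
import numpy as np

def sigmoid(z):
    """Map a signed distance z in (-inf, inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z)
# z = 0 (on the boundary) -> 0.5; large |z| -> probabilities near 0 or 1.
```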
Why is it called Logistic Regression ?
Because, we use the logistic (sigmoid) function as the ‘link function’🔗 to map 🗺️ the continuous output of the regression into a probability space.



End of Section

1.2.2.2 - Log Loss

Log Loss

Log Loss

Log Loss = \(\begin{cases} -log(\hat{y_i}) & \text{if } y_i = 1 \\ \\ -log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases} \)

Combining the above 2 conditions into 1 equation gives:

Log Loss = \(-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\)

images/machine_learning/supervised/logistic_regression/log_loss/slide_04_01.png
Cost Function
\[J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\]

We need to find the weights 🏋️‍♀️ ‘w’ that minimize the cost 💵 function.

Gradient Descent
  • Weight update: \[w_{new}=w_{old}-η.\frac{∂J(w)}{∂w_{old}}\]

We need to find the gradient of log loss w.r.t weight ‘w’.

Gradient Calculation

Chain Rule:

\[\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}}\]
  • Cost Function: \(J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\)
  • Prediction: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\)
  • Distance of Point: \(z = \mathbf{w^Tx} + w_0\)
Cost 💰Function Derivative
\[ J(w) = -\sum [ylog(\hat{y}) + (1-y)log(1-\hat{y})]\]

How loss changes w.r.t prediction ?

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{\hat{y}}} &= - [\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}] \\ &= -[\frac{y- \cancel{y\hat{y}} -\hat{y} + \cancel{y\hat{y}}}{\hat{y}(1-\hat{y})}] \\ \therefore \frac{\partial{J(w)}}{\partial{\hat{y}}} &= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \end{align*} \]
Prediction Derivative
\[ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \]

How prediction changes w.r.t distance ?

\[ \begin{align*} \frac{\partial{\hat{y}}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{z}} = \sigma'(z) \\ \sigma'(z) &= \sigma(z)(1-\sigma(z)) \\ \therefore \frac{\partial{\hat{y}}}{\partial{z}} &= \hat{y}(1-\hat{y}) \end{align*} \]
Sigmoid Derivative
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]\[ \begin{align} &\text {Let } u = 1 + e^{-z} \nonumber \\ &\implies \sigma(z) = \frac{1}{u}, \quad \text{so, } \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{z}} = \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{u}} = -\frac{1}{u^2} = - \frac{1}{(1 + e^{-z})^2} \\ &\text{and } \frac{\partial{u}}{\partial{z}} = \frac{\partial{(1 + e^{-z})}}{\partial{z}} = -e^{-z} \end{align} \]

from equations (1) & (2):

\[ \begin{align*} \because \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \\ \implies \frac{\partial{\sigma(z)}}{\partial{z}} &= - \frac{1}{(1 + e^{-z})^2}. -e^{-z} = \frac{e^{-z}}{(1 + e^{-z})^2} \\ 1 - \sigma(z) & = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} \\ \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{1}{1 + e^{-z}}.\frac{e^{-z}}{1 + e^{-z}} \\ \therefore \frac{\partial{\sigma(z)}}{\partial{z}} &= \sigma(z).(1-\sigma(z)) \end{align*} \]
Distance Derivative
\[z=w^{T}x+w_{0}\]

How distance changes w.r.t weight 🏋️‍♀️ ?

\[ \frac{\partial{z}}{\partial{w}} = \mathbf{x} \]

\[\because \frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a\]
Gradient Calculation (combined)

Chain Rule:

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w}} &= \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}} \\ &= \frac{\hat{y} - y}{\cancel{\hat{y}(1-\hat{y})}}.\cancel{\hat{y}(1-\hat{y})}.x \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= (\hat{y} - y).x \end{align*} \]
Cost 💰Function Derivative
\[\frac{\partial{J(w)}}{\partial{w}} = \sum (\hat{y_i} - y_i).x_i\]

Gradient = Error x Input

  • Error = \((\hat{y_i}-y_i)\): how far is prediction from the truth?
  • Input = \(x_i\): contribution of specific feature to the error.
Gradient Descent (update)

Weight update:

\[w_{new} = w_{old} - \eta. \sum_{i=1}^n (\hat{y_i} - y_i).x_i\]
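The update rule above can be sketched end-to-end in NumPy — a toy 1-D separable dataset; the learning rate and epoch count are arbitrary choices, not prescriptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Batch gradient descent on log loss; gradient = X^T (y_hat - y)."""
    n, d = X.shape
    Xb = np.column_stack([np.ones(n), X])   # prepend a bias column for w_0
    w = np.zeros(d + 1)
    for _ in range(epochs):
        y_hat = sigmoid(Xb @ w)
        grad = Xb.T @ (y_hat - y) / n       # average gradient: error x input
        w -= lr * grad
    return w

# Toy 1-D data: class 1 roughly when x > 2.5.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
preds = (sigmoid(np.column_stack([np.ones(len(X)), X]) @ w) >= 0.5).astype(int)
```

After training, the learned boundary sits near x = 2.5 and all six points are classified correctly.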
Why MSE can NOT be used as Loss Function?

Mean Squared Error (MSE) can not be used to quantify error/loss in binary classification because:

  • Convexity : MSE combined with Sigmoid is non-convex, so, Gradient Descent can get trapped in local minima.
  • Penalty: MSE does not appropriately penalize mis-classifications in binary classification.
    • e.g.: If the actual value is class 1 but the model predicts class 0 (predicted probability \(\hat{y} = 0\)), then MSE = \((1-0)^2 = 1\), which is very low, whereas log loss = \(-log(0) = \infty\)



End of Section

1.2.2.3 - Regularization

Regularization in Logistic Regression

What happens to the weights of Logistic Regression if the data is perfectly linearly separable?

The weights 🏋️‍♀️ will tend towards infinity, preventing a stable solution.

The model tries to make probabilities exactly 0 or 1, but the sigmoid function never reaches these limits, leading to extreme weights 🏋️‍♀️ to push probabilities near the extremes.

  • Distance of Point: \(z = \mathbf{w^Tx} + w_0\)

  • Prediction: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\)

  • Log loss: \(-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})] \)

    images/machine_learning/supervised/logistic_regression/logistic_regularization/slide_06_01.tif
Why is it a problem ?
Overfitting:
Model becomes perfectly accurate on training 🏃‍♂️data but fails to generalize, performing poorly on unseen data.
Solution 🦉
Regularization:
Adds a penalty term to the loss function, discouraging weights 🏋️‍♀️ from becoming too large.
L1 Regularization
\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_1 \sum_{j=1}^n |w_j|}_{\text{L1 Regularization}} \\ \end{align*} \]
L2 Regularization
\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_2 \sum_{j=1}^n w_j^2}_{\text{L2 Regularization}} \\ \end{align*} \]
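The weight blow-up and its fix can be observed numerically — a NumPy sketch on perfectly separable toy data, where the unregularized weight norm keeps growing with more epochs while an L2 penalty pins it down (the \(\lambda\), learning rate, and epoch counts are illustrative; for simplicity the penalty here also covers the bias term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lam, epochs, lr=0.5):
    """Gradient descent on average log loss + (lam/2)*||w||^2 (bias included)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
        w -= lr * grad
    return w

# Perfectly linearly separable 1-D data, with a bias column.
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones(4), x])
y = np.array([0, 0, 1, 1])

w_short = train(X, y, lam=0.0, epochs=1_000)
w_long  = train(X, y, lam=0.0, epochs=10_000)   # norm keeps growing without bound
w_reg   = train(X, y, lam=0.1, epochs=10_000)   # L2 penalty yields a stable finite solution
```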

Read more about Regularization



End of Section

1.2.2.4 - Log Odds

Log Odds

What is the meaning of Odds ?

Odds compare the likelihood of an event happening vs. not happening.

Odds = \(\frac{p}{1-p}\)

  • p = probability of success
Log Odds (Logit) Assumption

In logistic regression we assume that Log-Odds (the log of the ratio of positive class to negative class) is a linear function of inputs.

Log-Odds (Logit) = \(log_e \frac{p}{1-p}\)

Log Odds (Logit)

Log Odds = \(log_e \frac{p}{1-p} = z\)

\[z=w^{T}x+w_{0}\]

\[ \begin{align*} &log_{e}(\frac{p}{1-p}) = z \\ &\implies \frac{p}{1-p} = e^{z} \\ &\implies p = e^z - p.e^z \\ &\implies p(1 + e^z) = e^z \\ &\implies p = \frac{e^z}{1+e^z} \\ &\text { divide numerator and denominator by } e^z \\ &\implies p = \frac{1}{1+e^{-z}} \quad \text { i.e., the Sigmoid function} \end{align*} \]
Sigmoid Function
Sigmoid function is the inverse of log-odds (logit) function, it converts the log-odds back to probability, and vice versa.
Range of Values
  • Probability: 0 to 1
  • Odds: 0 to + \(\infty\)
  • Log Odds: -\(\infty\) to +\(\infty\)
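A quick numerical check that logit and sigmoid are inverses, using a few toy probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

p = np.array([0.1, 0.25, 0.5, 0.9])
z = logit(p)            # log-odds range over all real numbers; logit(0.5) = 0
p_back = sigmoid(z)     # sigmoid maps the log-odds back to probabilities
```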



End of Section

1.2.2.5 - Probabilistic Interpretation

Probabilistic Interpretation of Logistic Regression

Why do we use Log Loss in Binary Classification?
To understand that let’s have a look 👀at the statistical assumptions.
Bernoulli Assumption

We assume that our target variable ‘y’ follows a Bernoulli distribution, i.e., it has exactly 2 outcomes: success/failure.

  • P(Y=1|X) = p
  • P(Y=0|X) = 1- p

Combining above 2 into 1 equation gives:

  • P(Y=y|X) = \(p^y(1-p)^{1-y}\)
Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.’

We want to find the weights 🏋️‍♀️‘w’ that maximize the likelihood of seeing the data.

  • Data, D = \(\{ (x_i, y_i)_{i=1}^n , \quad y_i \in \{0,1\} \}\)

We do this by maximizing likelihood function.

Likelihood Function
\[\mathcal{L}(w) = \prod_{i=1}^n [p_i^{y_i}(1-p_i)^{1-y_i}]\]

Assumption: Training data is I.I.D.

Problem 🦀
Multiplying many small probabilities is computationally difficult and prone to numerical errors.
Solution🦉

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function.

Log Likelihood Function
\[ \begin{align*} log\mathcal{L}(w) &= \sum_{i=1}^n log [p_i^{y_i}(1-p_i)^{1-y_i}] \\ \therefore log\mathcal{L}(w) &= \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \end{align*} \]

Maximizing the log-likelihood is same as minimizing the negative of log-likelihood.

\[ \begin{align*} \underset{w}{\mathrm{max}}\ log\mathcal{L}(w) &= \underset{w}{\mathrm{min}} - log\mathcal{L}(w) \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= - \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= \text {Log Loss} \end{align*} \]
Inference
Log Loss is not chosen arbitrarily, but it follows directly from Bernoulli assumption and MLE.



End of Section

1.2.3 - K Nearest Neighbors

K Nearest Neighbors (KNN)



End of Section

1.2.3.1 - KNN Introduction

K Nearest Neighbors Introduction

Issues with Linear/Logistic Regression
  • Parametric models:
    • Rely on assumption that relationships between data points are linear.
    • For polynomial regression we need to find the degree of polynomial.
  • Training:
    • We need to train 🏃‍♂️the model for prediction.
K Nearest Neighbors
  • Simple: Intuitive way to classify data or predict values by finding similar existing data points (neighbors).

  • Non-Parametric: Makes no assumptions about the underlying data distribution.

  • No Training Required: KNN is a ‘lazy learner’, it does not require a formal training 🏃‍♂️ phase.

    images/machine_learning/supervised/k_nearest_neighbors/knn_intro/slide_03_01.tif
    images/machine_learning/supervised/k_nearest_neighbors/knn_intro/slide_01_01.png
KNN Algorithm

Given a query point \(x_q\) and a dataset, D = {\((x_i,y_i)_{i=1}^n, \quad x_i,y_i \in \mathbb{R}^d\)}, the algorithm finds a set of ‘k’ nearest neighbors \(\mathcal{N}_k(x_q) \subseteq D\).

Inference:

  1. Choose a value of ‘k’ (hyper-parameter); typically an odd number, to avoid ties.
  2. Calculate the distance (Euclidean, Cosine, etc.) between \(x_q\) and every point in the dataset and store it in a distance list.
  3. Sort the distance list in ascending order; choose top ‘k’ data points.
  4. Make prediction:
    • Classification: Take majority vote of ‘k’ nearest neighbors and assign label.
    • Regression: Take the mean/median of ‘k’ nearest neighbors.

Note: Store entire dataset.
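The brute-force algorithm above, sketched in NumPy for classification on toy 2-D data:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x_q, k=3):
    """Brute-force KNN classification: distances -> sort -> majority vote."""
    dists = np.linalg.norm(X - x_q, axis=1)    # Euclidean distance to every point: O(nd)
    nearest = np.argsort(dists)[:k]            # top-k after sorting: O(n log n)
    return Counter(y[nearest]).most_common(1)[0][0]

# Two well-separated clusters of toy points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X, y, np.array([1.1, 1.0]), k=3)   # query near the first cluster
```

For regression, the majority vote would be replaced by the mean or median of the k neighbors’ targets.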

Time & Space Complexity
  • Storing Data: Space Complexity: O(nd)
  • Inference: Time Complexity ⏰: O(nd + n log n)

Explanation:

  • Distance to all ’n’ points in ‘d’ dimensions: O(nd)
  • Sorting all ’n’ data points: O(n log n)

Note: Brute force 🔨 KNN is unacceptable when ’n’ is very large, say billions.



End of Section

1.2.3.2 - KNN Optimizations

KNN Optimizations

Optimizations

Naive KNN needs some improvements to fix some of its drawbacks.

  • Standardization
  • Distance-Weighted KNN
  • Mahalanobis Distance
Standardization

⭐️Say one feature is ‘Annual Income’ (0-1M), and another feature is ‘Years of Experience’ (0-40).

👉The Euclidean distance will be almost entirely dominated by income 💵.

💡So, we standardize each feature, such that it has mean \(\mu = 0\) and standard deviation \(\sigma = 1\).

\[z=\frac{x-\mu}{\sigma}\]
Distance-Weighted KNN

⭐️Vanilla KNN treats the 1st nearest neighbor and the k-th nearest neighbor as equal.

💡A neighbor that is 0.1 units away should have more influence than a neighbor that is 10 units away.

👉We assign weight 🏋️‍♀️ to each neighbor; most common strategy is inverse of squared distance.

\[w_i = \frac{1}{d(x_q, x_i)^2 + \epsilon}\]

Improvements:

  • Noise/Outlier: Reduces the impact of ‘noise’ or ‘outlier’ (distant neighbors).
  • Imbalanced Data: Closer points dominate, mitigating impact of imbalanced data.
    • e.g: If you have a query point surrounded by 2 very close ‘Class A’ points and 3 distant ‘Class B’ points, weighted 🏋️‍♀️ KNN will correctly pick ‘Class A'.
Mahalanobis Distance

⭐️Euclidean distance makes assumption that all the features are independent and provide unique information.

💡‘Height’ and ‘Weight’ are highly correlated.

👉If we use Euclidean distance, we are effectively ‘double-counting’ the size of the person.

🏇Mahalanobis distance measures distance in terms of standard deviations from the mean, accounting for the covariance between features.

\[d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}\]

\(\Sigma\): Covariance matrix of the data

  • If \(\Sigma\) is identity matrix, Mahalanobis distance reduces to Euclidean distance.
  • If \(\Sigma\) is a diagonal matrix, Mahalanobis distance reduces to Normalized Euclidean distance.
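A minimal NumPy sketch of the formula, on toy points and a toy covariance, including the identity-covariance special case where it reduces to Euclidean distance:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

d_identity = mahalanobis(x, y, np.eye(2))        # identity covariance -> Euclidean: 5.0
cov = np.array([[4.0, 0.0], [0.0, 1.0]])         # diagonal covariance ->
d_diag = mahalanobis(x, y, cov)                  # normalized (per-feature scaled) Euclidean
```

In practice \(\Sigma\) is estimated from the data (e.g. `np.cov(X.T)`) rather than hand-written as here.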
Runtime Issue

🦀Naive KNN shifts all computation 💻 to inference time ⏰, and it is very slow.

  • To find the neighbors for one query, we must touch every single bit of the ‘n×d’ matrix.
  • If n = \(10^9\), a single query would take seconds, but we need milliseconds.
Advanced Optimizations
  • Space-Partitioning Trees
    • K-D Trees (d<20): Recursively partitions space into axis-aligned hyper-rectangles. O(log n) search.
    • Ball Trees: Better suited for higher-dimensional data; supports Haversine distance for geospatial 🌎 data.
  • Approximate Nearest Neighbors (ANN)
    • Locality Sensitive Hashing (LSH): Uses ‘bucketizing’ 🗑️ hashes. Points that are close have a high probability of having the same hash.
    • Hierarchical Navigable Small World (HNSW); Graph of vectors; Search is a ‘greedy walk’ across levels.
  • Product Quantization (Reduce memory 🧠 footprint 👣 of high dimensional vectors)
    • ScaNN (Google)
    • FAISS (Meta)
  • Dimensionality Reduction (Mitigate ‘Curse of Dimensionality’)
    • PCA



End of Section

1.2.3.3 - Curse Of Dimensionality

Curse Of Dimensionality

Euclidean Distance

While Euclidean distance (\(L_2\) norm) is the most frequently discussed, the ‘Curse of Dimensionality’ impacts all Minkowski norms (\(L_p\)).

\[L_p = (\sum |x_i|^p)^{\frac{1}{p}} \]

Note: ‘Curse of Dimensionality’ is largely a function of the exponent (p) in the distance calculation.

Issues with High Dimensional Data

Coined 🪙 by mathematician Richard Bellman in the 1950s while studying dynamic programming.

High-dimensional data creates the following challenges:

  • Distance Concentration
  • Data Sparsity
  • Exponential Sample Requirement
Distance Concentration

💡Consider a hypercube in d-dimensions of side length = 1; Volume = \(1^d\) = 1
🧊 A smaller inner cube with side length = 1 - \(\epsilon\) ; Volume = \((1 -\epsilon)^d\)

\[\lim_{d \rightarrow \infty} (1 - \epsilon)^d = 0\]

🧐 This implies that almost all the volume of the high-dimensional cube lies near the ‘crust’.
👉e.g: if \(\epsilon\)= 0.01, d = 500; Volume of inner cube = \((1 -0.01)^{500}\) = \(0.99^{500}\) = 0.006 = 0.6%
🤔Consequently, all points become nearly equidistant, and the concept of ‘nearest’ or ‘neighborhood’ loses its meaning.
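Distance concentration can be observed numerically: the relative contrast (max − min)/min of distances to random points shrinks as the dimension grows. A NumPy sketch (the sample size and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=2000):
    """(max - min) / min distance from the origin to n random points in d dims."""
    points = rng.uniform(0, 1, size=(n, d))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)       # low dims: nearest vs farthest differ a lot
contrast_high = relative_contrast(1000)   # high dims: everything is nearly equidistant
```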

Data Sparsity

⭐️The volume of the feature space increases exponentially with each added dimension.

👉To maintain the same data density found in a 1D space with 10 points, we would need \(10^{10}\) (10 billion) points in a 10D space.

💡Because real-world datasets are rarely this large, the data becomes “sparse,” making it difficult to find truly similar neighbors.

Exponential Sample Requirement

⭐️To maintain a reliable result, the amount of training data needed must grow exponentially with the number of dimensions.

👉Without this growth, the model is highly prone to overfitting, where it learns from noise in the ‘sparse’ data rather than actual underlying patterns.

Note: For modern embeddings (often 768 or 1536 dimensions), it is mathematically impossible to collect enough data to ‘fill’ the space.

Solution
  • Cosine Similarity
  • Normalization
Cosine Similarity

Cosine similarity measures the cosine of the angle between 2 vectors.

\[\text{cos}(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{\sum_{i=1}^{D} A_i B_i}{\sqrt{\sum_{i=1}^{D} A_i^2} \sqrt{\sum_{i=1}^{D} B_i^2}}\]

Note: Cosine similarity helps mitigate the ‘curse of dimensionality’ problem.

Normalization

⭐️Normalize the vector, i.e., make its length = 1, a unit vector.

💡By normalizing, we project all points onto the surface of a unit hypersphere.

  • We are no longer searching in the ‘empty’ high-dimensional volume of a hypercube.
  • Now, we are searching on a constrained manifold (the shell).

Note: By normalizing, we move the data from the volume of the D-dimensional space onto the surface of a (D-1)-dimensional hypersphere.

Euclidean Distance Squared of Normalized Vectors:

\[ \begin{align*} \|A - B\|^2 &= (A - B) \cdot (A - B) \\ &= \|A\|^2 + \|B\|^2 - 2(A \cdot B)\\ \because \|A\| &= \|B\| = 1 \\ \|A - B\|^2 &= 1 + 1 - 2\cos(\theta) \\ \therefore \|A - B\|^2 &= 2(1 - \cos(\theta))\\ \end{align*} \]

Note: This formula proves that maximizing ‘Cosine similarity’ is identical to minimizing ‘Euclidean distance’ on the hypersphere.
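A quick numerical check of this identity for two random unit vectors (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)          # project onto the unit hypersphere
b /= np.linalg.norm(b)

cos_sim = a @ b                 # for unit vectors, cosine similarity is the dot product
sq_dist = np.sum((a - b) ** 2)  # should equal 2 * (1 - cos(theta))
```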



End of Section

1.2.3.4 - Bias Variance Tradeoff

Bias Variance Tradeoff

KNN Dataset

Let’s use this dataset to understand the impact of number of neighbours ‘k’.

images/machine_learning/supervised/k_nearest_neighbors/bias_variance_tradeoff/slide_01_01.tif
High Bias, Low Variance

👉If ‘k’ is very large, say, k=n,

  • model simply predicts the majority class of the entire dataset for every query point, i.e., under-fitting.
High Variance, Low Bias

👉If ‘k’ is very small, say, k=1,

  • model is highly sensitive to noise or outliers, as it looks at only 1 nearest neighbor, i.e., over-fitting.
‘K' Hyper-Parameter Tuning

Let’s plot Error vs ‘K’ neighbors:

images/machine_learning/supervised/k_nearest_neighbors/bias_variance_tradeoff/slide_04_01.tif
Over-Fitting Vs Under-Fitting
  • Figure 1: k=1, Over-fitting

  • Figure 2: k=n, Under-fitting

  • Figure 3: k=11, Lowest Error (Optimum)

    images/machine_learning/supervised/k_nearest_neighbors/bias_variance_tradeoff/slide_05_01.tif



End of Section

1.2.4 - Decision Tree

Decision Tree



End of Section

1.2.4.1 - Decision Trees Introduction

Decision Trees Introduction

How do we classify the below dataset ?
images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_01_01.tif

💡It can be written as nested 🕸️ if else statements.

e.g: To classify the left bottom corner red points we can write:
👉if (FeatureX1 <1 & FeatureX2 <1)

⭐️Extending the logic for all, we have an if else ladder like below:

images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_03_01.tif

👉Final decision boundaries will be something like below:

images/machine_learning/supervised/decision_trees/decision_trees_introduction/slide_04_01.tif
What is a Decision Tree 🌲?
  • Non-parametric model.
  • Recursively partitions the feature space.
  • Top-down, greedy approach to iteratively select feature splits.
  • Maximizes the purity of a node, based on metrics such as Information Gain 💵 or Gini 🧞‍♂️ Impurity.

Note: We can extract the if/else logic of the decision tree and write in C++/Java for better performance.

Computation 💻
images/machine_learning/supervised/decision_trees/decision_trees_introduction/computation_complexity.png
Decision Tree 🌲 Analysis

⭐️Building an optimal decision tree 🌲 is a NP-Hard problem.
👉(Time Complexity: Exponential; combinatorial search space)

  • Pros
    • No standardization of data needed.
    • Highly interpretable.
    • Good runtime performance.
    • Works for both classification & regression.
  • Cons
    • Number of dimensions should not be too large. (Curse of dimensionality)
    • Overfitting.
When to use Decision Tree 🌲
  • As base learners in ensembles, such as, bagging(RF), boosting(GBDT), stacking, cascading, etc.
  • As a baseline, interpretable, model or for quick feature selection.
  • Runtime performance is important.



End of Section

1.2.4.2 - Purity Metrics

Purity Metrics

Pure Leaf 🍃 Node ?

Decision trees recursively partition the data based on feature values.

images/machine_learning/supervised/decision_trees/purity_metrics/slide_02_01.tif

Pure Leaf 🍃 Node: Terminal node where every single data point belongs to the same class.

💡Zero Uncertainty.

So, what should be the logic to partition the data at each step or each node ?

The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

So, what is information gain ?
How do we reduce uncertainty ?

Let’s understand a few terms first, before we get to information gain.

Entropy

Measure ⏱ of uncertainty, randomness, or impurity in the data.

\[H(S)=-\sum _{i=1}^{n}p_{i}\log(p_{i})\]

Binary Entropy:

images/machine_learning/supervised/decision_trees/purity_metrics/slide_04_01.png
Surprise 😮 Factor

💡Entropy can also be viewed as the ‘average surprise'.

  • A highly certain event provides little information when it occurs (low surprise).

  • An unlikely event provides a lot of information (high surprise).

    images/machine_learning/supervised/decision_trees/purity_metrics/slide_06_01.png
Information Gain 💰

⭐️ Measures the reduction in entropy (uncertainty) achieved by splitting a dataset based on a specific attribute.

\[IG=Entropy(Parent)-\left[\frac{N_{left}}{N_{parent}}Entropy(Child_{left})+\frac{N_{right}}{N_{parent}}Entropy(Child_{right})\right] \]

Note: The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

Gini 🧞‍♂️Impurity

⭐️ Measures the probability of an element being incorrectly classified if it were randomly labeled according to the distribution of labels in a node.

\[Gini(S)=1-\sum_{i=1}^{n}(p_{i})^{2}\]
  • Range: 0 (Pure) - 0.5 (Maximum impurity, for binary classification)

Note: Gini is used in libraries like Scikit-Learn (as the default), because it avoids the computationally expensive 💰 log function.
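These metrics can be computed directly from the formulas above; a minimal sketch in plain Python (the class counts in the information-gain example are made up):

```python
import math

def entropy(probs):
    """H(S) = -sum p_i * log2(p_i); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini(S) = 1 - sum p_i^2."""
    return 1 - sum(p * p for p in probs)

# A 50/50 binary node is maximally impure:
h = entropy([0.5, 0.5])   # 1.0 bit
g = gini([0.5, 0.5])      # 0.5

# Information gain of a split: parent [6+, 6-] -> left [4+, 0-], right [2+, 6-]
ig = entropy([0.5, 0.5]) - (4/12) * entropy([1.0, 0.0]) - (8/12) * entropy([0.25, 0.75])
```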

Gini Impurity Vs Entropy
  • Gini Impurity is a first-order approximation of Entropy.

  • For most of the real-world cases, choosing one over the other results in the exact same tree structure or negligible differences in accuracy.

  • When we plot the two functions, they follow nearly identical shapes.

    images/machine_learning/supervised/decision_trees/purity_metrics/slide_10_01.tif



End of Section

1.2.4.3 - Decision Trees For Regression

Decision Trees For Regression

Decision Trees for Regression

Decision Trees can also be used for Regression tasks, but using different metrics.

⭐️Metric:

  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)

👉Say we have a following dataset, that we need to fit using decision trees:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_02_01.tif

👉Decision trees try to find the decision splits, building step functions that approximate the actual curve, as shown below:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_03_01.tif

👉Internally the decision tree (if else ladder) looks like below:

images/machine_learning/supervised/decision_trees/decision_trees_for_regression/slide_04_01.tif
Can we use decision trees for all kinds of regression? Or is there any limitation ?

Decision trees cannot predict values outside the range of the training data, i.e., they cannot extrapolate.

Let’s understand the interpolation and extrapolation cases one by one.

Interpolation ✅

⭐️Predicting values within the range of features and targets observed during training 🏃‍♂️.

  • Trees capture discontinuities perfectly, because they are piece-wise constant.
  • They do not try to force a smooth line where a ‘jump’ exists in reality.

e.g: Predicting a house 🏡 price 💰 for a 3-BHK home when you have seen 2-BHK and 4-BHK homes in that same neighborhood.

Extrapolation ❌

⭐️Predicting values outside the range of training 🏃‍♂️data.

Problem:
Because a tree outputs the mean of training 🏃‍♂️ samples in a leaf, it cannot predict a value higher than the highest ‘y’ it saw during training 🏃‍♂️.

  • Flat-Line: Once a feature ‘X’ goes beyond the training boundaries, the tree falls into the same ‘last’ leaf forever.

e.g: Predicting the price 💰 of a house 🏡 in 2026 based on data from 2010 to 2025.
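Both behaviors can be demonstrated with a small scikit-learn sketch on a simple rising trend (synthetic data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# y = 2x on x in [0, 10]; the tree only ever sees targets up to 20.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel()

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

inside  = tree.predict([[9.9]])[0]   # interpolation: close to the true 19.8
outside = tree.predict([[50.0]])[0]  # extrapolation: flat-lines at the last leaf
```

No matter how far `x` goes beyond 10, the prediction stays pinned at the last leaf's mean.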



End of Section

1.2.4.4 - Regularization

Regularization

Over-Fitting
⭐️Because trees🌲 are non-parametric and ‘greedy’💰, they will naturally try to grow 📈 until every leaf 🍃 is pure, effectively memorizing noise and outliers rather than learning generalizable patterns.
Tree 🌲 Size
👉As the tree 🌲 grows, the amount of data in each subtree decreases 📉, leading to splits based on statistically insignificant samples.
Regularization Techniques
  • Pre-Pruning ✂️ &
  • Post-Pruning ✂️
Pre-Pruning ✂️

⭐️ ‘Early stopping’ heuristics (hyper-parameters).

  • max_depth: Limits how many levels of ‘if else’ the tree can have; most common.
  • min_samples_split: A node will only split, if it has at least ‘N’ samples; smooths the model (especially in regression), by ensuring predictions are based on an average of multiple points.
  • max_leaf_nodes: Limiting the number of leaves reduces the overall complexity of the tree, making it simpler and less likely to memorize the training data’s noise.
  • min_impurity_decrease: A split is only made if it reduces the impurity (Gini/MSE) by at least a certain threshold.
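A hedged scikit-learn sketch wiring these knobs together (the threshold values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=6,                 # cap on the number of if/else levels
    min_samples_split=20,        # a node needs >= 20 samples to split
    max_leaf_nodes=25,           # global cap on tree complexity
    min_impurity_decrease=1e-3,  # split only if impurity drops by this much
    random_state=0,
).fit(X, y)
```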
Max Depth Hyper Parameter Tuning

Below is an example for one of the hyper-parameter’s max_depth tuning.

As we can see below, the cross-validation error decreases till depth=6, and after that the reduction in error is not significant.

images/machine_learning/supervised/decision_trees/regularization/slide_05_01.tif
Post-Pruning ✂️

Let the tree🌲 grow to its full depth (overfit) and then ‘collapse’ nodes that provide little predictive value.

Most common algorithm:

  • Minimal Cost Complexity Pruning
Minimal Cost Complexity Pruning ✂️

💡Define a cost-complexity 💰 measure that penalizes the tree 🌲 for having too many leaves 🍃.

\[R_\alpha(T) = R(T) + \alpha |T|\]
  • R(T): total misclassification rate (or MSE) of the tree
  • |T|: number of terminal nodes (leaves)
  • \(\alpha\): complexity parameter (the ‘tax’ 💰 on complexity)

Logic:

  • If \(\alpha\)=0, the tree is the original overfit tree.
  • As \(\alpha\) increases 📈, the penalty for having many leaves grows 📈.
  • To minimize the total cost 💰, the model is forced to prune branches🪾 that do not significantly reduce R(T).
  • Use cross-validation to find the ‘sweet spot’ \(\alpha\) that minimizes validation error.
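The steps above can be sketched with scikit-learn, where `cost_complexity_pruning_path` enumerates the candidate \(\alpha\) values (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Each candidate alpha collapses one more weak branch of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = [max(a, 0.0) for a in path.ccp_alphas]  # guard vs tiny float negatives

cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[cv_scores.index(max(cv_scores))]  # the 'sweet spot'
```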



End of Section

1.2.4.5 - Bagging

Bagging

Issues with Decision Tree ?
  • A single decision tree is highly sensitive to the specific training dataset.
    Small changes, such as, a few different rows or the presence of an outlier, can lead to a completely different tree structure.

  • Unpruned decision trees often grow until they perfectly classify the training set, essentially ‘memorizing’ noise and outliers, i.e., high variance, rather than finding general patterns.

What does Bagging mean 🤔?

Bagging = ‘Bootstrap Aggregating'

Bagging 🎒is a parallel ensemble technique that reduces variance (without significantly increasing the bias) by training multiple versions of the same model on different random subsets of data and then combining their results.

Note: Bagging uses deep trees (overfit) and combines them to reduce variance.

Bootstrapping

Bootstrapping = ‘Without external help’

Given a training 🏃‍♂️set D of size ’n’, we create B new training sets, each by sampling ’n’ observations from D ‘with replacement'.

Bootstrapped Samples

💡Since, we are sampling ‘with replacement’, so, some data points may be picked multiple times, while others may not be picked at all.

  • The probability that a specific observation is not selected in a bootstrap sample of size ’n’ is: \[\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e} \approx 0.368\]

🧐This means each tree is trained on roughly 63.2% of the unique data, while the remaining 36.8% (the Out-of-Bag or OOB set) can be used for validation (the OOB error estimate).
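A quick numeric check of the 1/e limit and the ~63.2% unique-row figure (the sample size n is arbitrary):

```python
import math
import random

n = 100_000  # training set size

# Closed form: P(a given row never appears in one bootstrap sample of size n)
p_absent = (1 - 1 / n) ** n          # -> 1/e ~ 0.368 as n grows

# Empirical check: draw one bootstrap sample and count its unique rows
random.seed(0)
unique_frac = len({random.randrange(n) for _ in range(n)}) / n   # ~ 0.632
```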

Aggregation

⭐️Say we train ‘B’ models (base-learners), each with variance \(\sigma^2\) .

👉Average variance of ‘B’ models (trees) if all are independent:

\[Var(X)=\frac{\sigma^{2}}{B}\]

👉Since, bootstrap samples are derived from the same dataset, the trees are correlated with some correlation coefficient ‘\(\rho\)'.

So, the true variance of bagged ensemble is:

\[Var(f_{bag}) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]
  • \(\rho\)= 0; independent models, most reduction in variance.
  • \(\rho\)= 1; fully correlated models, no improvement in variance.
  • 0<\(\rho\)<1; As correlation decreases, variance reduces.
Why Bagging is Better than a Single Model?
images/machine_learning/supervised/decision_trees/bagging/slide_09_01.png



End of Section

1.2.4.6 - Random Forest

Random Forest

Problem with Bagging

💡If one feature is extremely predictive (e.g., ‘Area’ for house prices), almost every bootstrap tree will split on that feature at the root.

👉This makes the trees(models) very similar, leading to a high correlation ‘\(\rho\)’.

\[Var(f_{bagging})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]
Feature Sub Sampling

💡Choose a random subset of ‘m’ features from the total ‘d’ features, reducing the correlation ‘\(\rho\)’ between trees.

👉By forcing trees to split on ‘sub-optimal’ features, we intentionally increase the variance of individual trees; also the bias is slightly increased (simpler trees).

Standard heuristics:

  • Classification: \(m = \sqrt{d}\)
  • Regression: \(m = \frac{d}{3}\)
Math of De-Correlation

💡Because ‘\(\rho\)’ is the dominant factor in the variance of the ensemble when B is large, the overall ensemble variance Var(\(f_{rf}\)) drops significantly lower than standard Bagging.

\[Var(f_{rf})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]
Over-Fitting

💡A Random Forest does not overfit as more trees (B) are added.

It only converges to the limit: ‘\(\rho\sigma^2\)’.

Overfitting is controlled by:

  • depth of the individual trees.
  • size of the feature subset ‘m'.
When to use Random Forest ?
  • High Dimensionality: 100s or 1000s of features; RF’s feature sampling prevents a few features from masking others.
  • Tabular Data (with Complex Interactions): Captures non-linear relationships without needing manual feature engineering.
  • Noisy Datasets: The averaging process makes RF robust to outliers (especially if using min_samples_leaf).
  • Automatic Validation: Need a quick estimate of generalization error without doing 10-fold CV (via OOB Error).
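A hedged scikit-learn sketch of the OOB idea (dataset and hyper-parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # m = sqrt(d), the classification heuristic
    oob_score=True,        # score each row only with trees that never saw it
    random_state=0,
).fit(X, y)

oob = rf.oob_score_        # generalization estimate without a held-out set
```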



End of Section

1.2.4.7 - Extra Trees

Extra Trees

Issue with Decision Trees/Random Forest

💡In a standard Decision Tree or Random Forest, the algorithm searches for the optimal split point (the threshold ’s’) that maximizes Information Gain or minimizes MSE.

👉This search is:

  • computationally expensive (sort + mid-point) and
  • tends to follow the noise in the training 🏃‍♂️data.
Adding Randomness

Adding randomness (right kind) in ensemble averaging reduces correlation/variance.

\[Var(f_{bag})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]
Extremely Randomized (ExtRa) Trees
  • Random Thresholds: Instead of searching for the best split point (computationally expensive 💰) for a feature, it picks a threshold at random from a uniform distribution between the feature’s local minimum and maximum.
  • Entire Dataset: Uses entire training dataset (default) for every tree; no bootstrapping.
  • Random Feature Subsets: A random subset of m<d features is used in each decision tree.
Mathematical Intuition
\[Var(f_{et})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Picking thresholds randomly has two effects:

  • Structural correlation between trees becomes extremely low.
  • Individual trees are ‘weaker’ and have higher bias than a standard optimized tree.

👉The massive drop in ‘\(\rho\)’ often outweighs the slight increase in bias, leading to an overall ensemble that is smoother and more robust to noise than a standard Random Forest.

Note: Extra Trees are almost always grown to full depth, as they may need extra splits to find the same decision boundary.

When to use Extra Trees ?
  • Performance: Significantly faster to train, as it does not sort data to find optimal split.
    Note: If we are working with billions of rows or thousands of features, ET can be 3x to 5x faster than a Random Forest(RF).
  • Robustness to Noise: By picking thresholds randomly, tends to ‘handle’ the noise more effectively than RF.
  • Feature Importance: Because ET is so randomized, it often provides more ‘stable’ feature importance scores.

Note: It is less likely to favor a high-cardinality feature (e.g. zip-code) just because it has more potential split points.
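A hedged comparison sketch using scikit-learn; the dataset and sizes are illustrative assumptions, so the exact scores will vary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

rf_score = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()
et_score = cross_val_score(
    ExtraTreesClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()
# Accuracies are typically comparable; ET trains faster (no optimal-split search).
```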



End of Section

1.2.4.8 - Boosting

Boosting

Intuition 💡

⭐️In Bagging 🎒we trained multiple strong (over-fit, high variance) models (in parallel) and then averaged them out to reduce variance.

💡Similarly, we can train many weak (under-fit, high bias) models sequentially, such that, each new model corrects the errors of the previous ones to reduce bias.

Boosting

⚔️ An ensemble learning approach where multiple ‘weak learners’ (typically simple models like shallow decision trees 🌲 or ‘stumps’) are sequentially combined to create a single strong predictive model.

⭐️The core principle is that each subsequent model focuses 🎧 on correcting the errors made by its predecessors.

Why is Boosting Better ?
👉Boosting generally achieves better predictive performance because it actively reduces bias by learning 📖from ‘past mistakes’, making it ideal for achieving state-of-the-art 🖼️ results.
Popular Boosting Algorithms
  • AdaBoost(Adaptive Boosting)
  • Gradient Boosting Machine (GBM)
    • XGBoost
    • LightGBM (Microsoft)
    • CatBoost (Yandex)



End of Section

1.2.4.9 - AdaBoost

AdaBoost

Adaptive Boosting (AdaBoost)

💡Works by increasing 📈 the weight 🏋️‍♀️ of misclassified data points after each iteration, forcing the next weak learner to ‘pay more attention’🚨 to the difficult cases.

⭐️ Commonly used for classification.

Decision Stumps

👉Weak learners are typically ‘Decision Stumps’, i.e., decision trees🌲with a depth of only one (1 split, 2 leaves 🍃).

images/machine_learning/supervised/decision_trees/adaboost/slide_04_01.png
Algorithm
  1. Assign an equal weight 🏋️‍♀️to every data point; \(w_i = 1/n\), where ’n’=number of samples.
  2. Build a decision stump that minimizes the weighted classification error.
  3. Calculate total error as the sum of weights of the misclassified points; \(E_m = \sum_{i:\,h_m(x_i)\neq y_i} w_i\).
  4. Determine ‘amount of say’, i.e, the weight 🏋️‍♀️ of each stump in final decision. \[\alpha_m = \frac{1}{2}ln\left( \frac{1-E_m}{E_m} \right)\]
    • Low error results in a high positive \(\alpha\) (high influence).
    • 50% error (random guessing) results in an \(\alpha = 0\) (no influence).
  5. Update sample weights 🏋️‍♀️.
    • Misclassified samples: Weight 🏋️‍♀️ increases by \(e^{\alpha_m}\).
    • Correctly classified samples: Weight 🏋️‍♀️ decreases by \(e^{-\alpha_m}\).
    • Normalization: All new weights 🏋️‍♀️ are divided by their total sum so they add up back to 1.
  6. Iterate for a specified number of estimators (n_estimators).
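The ‘amount of say’ and weight-update steps can be sketched directly from the formulas above (the sample weight 0.1 is an arbitrary illustration):

```python
import math

def amount_of_say(error):
    """alpha_m = 0.5 * ln((1 - E_m) / E_m): a stump's weight in the final vote."""
    return 0.5 * math.log((1 - error) / error)

strong = amount_of_say(0.10)   # low error -> large positive influence
coin   = amount_of_say(0.50)   # random guessing -> zero influence

# Step 5 weight update for one sample with current weight 0.1 (pre-normalization):
w_wrong = 0.1 * math.exp(strong)    # misclassified -> weight grows
w_right = 0.1 * math.exp(-strong)   # correct       -> weight shrinks
```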
Final Prediction 🎯

👉 To classify a new data point, every stump makes a prediction (+1 or -1).

These are multiplied by their respective ‘amount of say’ \(\alpha_m\) and summed.

\[H(x)=\operatorname{sign}\left(\sum_{m=1}^{M}\alpha_{m}⋅h_{m}(x)\right)\]

👉 If the total weighted 🏋️‍♀️ sum is positive, the final class is +1; otherwise -1.

Note: Sensitive to outliers; Because AdaBoost aggressively increases weights 🏋️‍♀️ on misclassified points, it may ‘over-focus’ on noisy outliers, hurting performance.



End of Section

1.2.4.10 - Gradient Boosting Machine

Gradient Boosting Machine Introduction

Idea 💡
👉GBM fits new models to the ‘residual errors’ (the difference between actual and predicted values) of the previous models.
Gradient Boosting Machine

GBM treats the final model \(F_M(x)\) as a weighted 🏋️‍♀️ sum of ‘M’ weak learners:

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Weak\ Learner\ }h_{m}(x)}\]
  • \(F_0(x)\): The initial base model (usually a constant).
  • M: The total number of boosting iterations (number of trees).
  • \(\gamma_{jm}\)(Leaf Weight): The optimized value for leaf \(j\) in tree \(m\).
  • \(\nu\)(Nu): The Learning Rate or Shrinkage; prevents overfitting.
  • \(\mathbb{I}(x\in R_{jm})\): ‘Indicator Function’; It is 1 if the data point falls into leaf \(j\) of tree \(m\), and 0 otherwise.
  • \(R_{jm}\)(Regions): Region of the \(j^{th}\) leaf in the \(m^{th}\) tree.
Gradient Descent in Function Space

📍In Gradient Descent, we update parameters ‘\(\Theta\)';

📍In GBM, we update the predictions F(x) themselves.

🦕We move the predictions in the direction of the negative gradient of the loss function L(y, F(x)).

🎯We want to minimize loss:

\[\mathcal{L}(F) = \sum_{i=1}^n L(y_i, F(x_i))\]

✅ In parameter optimization we update weights 🏋️‍♀️:

\[w_{t+1} = w_t - \eta \cdot \nabla_{w}\mathcal{L}(w_t)\]

✅ In gradient boosting, we update the prediction function:

\[F_m(x) = F_{m-1}(x) -\eta \cdot \nabla_F \mathcal{L}(F_{m-1}(x))\]

➡️ The gradient is calculated w.r.t. predictions, not weights.

Pseudo Residuals

In GBM we can use any loss function as long as it is differentiable, such as, MSE, log loss, etc.

Loss(MSE) = \((y_i - F_m(x_i))^2\)

\[\frac{\partial L}{\partial F_m(x_i)} = -2 (y_i-F_m(x_i))\]

\[\implies \frac{\partial L}{\partial F_m(x_i)} \propto - (y_i-F_m(x_i))\]

👉Pseudo Residual (Error) = - Gradient

Why is the initial model the Mean model for MSE ?

💡To minimize loss, take derivative of loss function w.r.t ‘\(\gamma\)’ and equate to 0:

\[F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\]

MSE Loss = \(\mathcal{L}(y_i, \gamma) = \sum_{i=1}^n(y_i -\gamma)^2\)

\[ \begin{aligned} &\frac{\partial \mathcal{L}(y_i, \gamma)}{\partial \gamma} = -2 \cdot \sum_{i=1}^n(y_i -\gamma) = 0 \\ &\implies \sum_{i=1}^n (y_i -\gamma) = 0 \\ &\implies \sum_{i=1}^n y_i = n.\gamma \\ &\therefore \gamma = \frac{1}{n} \sum_{i=1}^n y_i \end{aligned} \]
Why is the optimal leaf 🍃value the ‘Mean' of the residuals for MSE ?

💡To minimize cost, take derivative of cost function w.r.t ‘\(\gamma\)’ and equate to 0:

Cost Function = \(J(\gamma )\)

\[J(\gamma )=\sum _{x_{i}\in R_{jm}}\frac{1}{2}(y_{i}-(F_{m-1}(x_{i})+\gamma ))^{2}\]

We know that:

\[ r_{i}=y_{i}-F_{m-1}(x_{i})\]

\[\implies J(\gamma )=\sum _{x_{i}\in R_{jm}}\frac{1}{2}(r_{i}-\gamma )^{2}\]

\[\frac{d}{d\gamma }\sum _{x_{i}\in R_{jm}}\frac{1}{2}(r_{i}-\gamma )^{2}=\sum _{x_{i}\in R_{jm}}-(r_{i}-\gamma )=0\]

\[\implies \sum _{x_{i}\in R_{jm}}\gamma -\sum _{x_{i}\in R_{jm}}r_{i}=0\]

👉Since, \(\gamma\) is constant for all \(n_j\) samples in the leaf, \(\sum _{x_{i}\in R_{jm}}\gamma =n_{j}\gamma \)

\[n_{j}\gamma =\sum _{x_{i}\in R_{jm}}r_{i}\]

\[\implies \gamma =\frac{\sum _{x_{i}\in R_{jm}}r_{i}}{n_{j}}\]

Therefore, \(\gamma\) = average of all residuals in the leaf.

Note: \(R_{jm}\)(Regions): Region of the \(j^{th}\) leaf in the \(m^{th}\) tree.



End of Section

1.2.4.11 - GBDT Algorithm

GBDT Algorithm

Gradient Boosted Decision Tree (GBDT)

Gradient Boosted Decision Tree (GBDT) is a decision tree based implementation of Gradient Boosting Machine (GBM).

GBM treats the final model \(F_M(x)\) as a weighted 🏋️‍♀️ sum of ‘M’ weak learners (decision trees):

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Decision\ Tree\ }h_{m}(x)}\]
  • \(F_0(x)\): The initial base model (usually a constant).
  • M: The total number of boosting iterations (number of trees).
  • \(\gamma_{jm}\)(Leaf Weight): The optimized value for leaf \(j\) in tree \(m\).
  • \(\nu\)(Nu): The Learning Rate or Shrinkage; prevents overfitting.
  • \(\mathbb{I}(x\in R_{jm})\): ‘Indicator Function’; It is 1 if the data point falls into leaf \(j\) of tree \(m\), and 0 otherwise.
  • \(R_{jm}\)(Regions): Region of the \(j^{th}\) leaf in the \(m^{th}\) tree.
Algorithm
  • Step 1: Initialization.
  • Step 2: Iterative loop 🔁 : for m=1 to M.
  • 2.1: Calculate pseudo residuals ‘\(r_{im}\)'.
  • 2.2: Fit a regression tree 🌲‘\(h_m(x)\)'.
  • 2.3: Compute leaf 🍃weights 🏋️‍♀️ ‘\(\gamma_{jm}\)'.
  • 2.4: Update the model.
Step 1: Initialization

Start with a function that minimizes our loss function;
for MSE, it is the mean.

\[F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)\]

MSE Loss = \(\mathcal{L}(y_i, \gamma) = \sum_{i=1}^n(y_i -\gamma)^2\)

Step 2.1: Calculate pseudo residuals ‘\(r_{im}\)'

Find the ‘gradient’ of error;
Tells us the direction and magnitude needed to reduce the loss.

\[r_{im}=-\left[\frac{\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}\right]_{F(x)=F_{m-1}(x)}\]
Step 2.2: Fit regression tree ‘\(h_m(x)\)'

Train a tree to predict the residuals ‘\(h_m(x)\)';

  • Tree 🌲 partitions the data into leaves 🍃 (\(R_{jm}\)regions )
Step 2.3: Compute leaf weights ‘\(\gamma_{jm}\)'

Within each leaf 🍃, we calculate the optimal value ‘\(\gamma_{jm}\)’ that minimizes the loss for the samples in that leaf 🍃.

\[\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)\]

➡️ The optimal leaf 🍃value is the ‘Mean’(MSE) of the residuals; \(\gamma = \frac{\sum r_i}{n_j}\)

Step 2.4: Update the model.

Add the new ‘correction’ to the existing model, scaled by the learning rate.

\[F_{m}(x)=F_{m-1}(x)+\nu \cdot \underbrace{\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})}_{h_{m}(x)}\]
  • \(\mathbb{I}(x\in R_{jm})\): ‘Indicator Function’; It is 1 if data point falls into leaf of the tree, and 0 otherwise.



End of Section

1.2.4.12 - GBDT Example

GBDT Example

Gradient Boosted Decision Tree (GBDT)

Gradient Boosted Decision Tree (GBDT) is a decision tree based implementation of Gradient Boosting Machine (GBM).

GBM treats the final model \(F_M(x)\) as a weighted 🏋️‍♀️ sum of ‘M’ weak learners (decision trees):

\[ F_{M}(x)=\underbrace{F_{0}(x)}_{\text{Initial\ Guess}}+\nu \sum _{m=1}^{M}\underbrace{\left(\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbb{I}(x\in R_{jm})\right)}_{\text{Decision\ Tree\ }h_{m}(x)}\]
  • \(F_0(x)\): The initial base model (usually a constant).
  • M: The total number of boosting iterations (number of trees).
  • \(\gamma_{jm}\)(Leaf Weight): The optimized value for leaf \(j\) in tree \(m\).
  • \(\nu\)(Nu): The Learning Rate or Shrinkage; prevents overfitting.
  • \(\mathbb{I}(x\in R_{jm})\): ‘Indicator Function’; It is 1 if the data point falls into leaf \(j\) of tree \(m\), and 0 otherwise.
  • \(R_{jm}\)(Regions): Region of the \(j^{th}\) leaf in the \(m^{th}\) tree.
Algorithm
  • Step 1: Initialization.
  • Step 2: Iterative loop 🔁 : for m=1 to M.
  • 2.1: Calculate pseudo residuals ‘\(r_{im}\)'.
  • 2.2: Fit a regression tree 🌲‘\(h_m(x)\)'.
  • 2.3: Compute leaf 🍃weights 🏋️‍♀️ ‘\(\gamma_{jm}\)'.
  • 2.4: Update the model.
Predict House Prices
images/machine_learning/supervised/decision_trees/gbdt_example/house_price_table.png

👉Loss = MSE, Learning rate (\(\nu\)) = 0.5

Solution
  1. Initialization : \(F_0(x) = mean(2,4,9) = 5.0\)
  2. Iteration 1(m=1):
    • 2.1: Calculate residuals ‘\(r_{i1}\)'

      \[\begin{aligned} r_{11} &= 2-5 = -3.0 \\ r_{21} &= 4-5 = -1.0 \\ r_{31} &= 9-5 = 4.0 \\ \end{aligned} \]
    • 2.2: Fit tree(\(h_1\)); Split at X<2150 (midpoint of 1800 and 2500)

    • 2.3: Compute leaf weights \(\gamma_{j1}\)

      • Y-> Leaf 1: Ids 1, 2 ( \(\gamma_{11}\)= -2.0)
      • N-> Leaf 2: Id 3 ( \(\gamma_{21}\)= 4.0)
    • 2.4: Update predictions (\(F_1 = F_0 + 0.5 \cdot \gamma\))

      \[ \begin{aligned} F_1(x_1) &= 5.0 + 0.5(-2.0) = \mathbf{4.0}\ \\F_1(x_2) &= 5.0 + 0.5(-2.0) = \mathbf{4.0}\ \\F_1(x_3) &= 5.0 + 0.5(4.0) = \mathbf{7.0}\ \\ \end{aligned} \]

      Tree 1:

      images/machine_learning/supervised/decision_trees/gbdt_example/slide_05_01.png
  • Iteration 2(m=2):
    • 2.1: Calculate residuals ‘\(r_{i2}\)'

      \[ \begin{aligned} r_{12} &= 2-4.0 = -2.0 \\ r_{22} &= 4-4.0 = 0.0 \\ r_{32} &= 9-7.0 = 2.0 \\ \end{aligned} \]
    • 2.2: Fit tree(\(h_2\)); Split at X<1500 (midpoint of 1200 and 1800)

    • 2.3: Compute leaf weights \(\gamma_{j2}\)

      • Y-> Leaf 1: Ids 1 ( \(\gamma_{12}\)= -2.0)
      • N-> Leaf 2: Id 2, 3 ( \(\gamma_{22}\)= 1.0)
    • 2.4: Update predictions (\(F_2 = F_1 + 0.5 \cdot \gamma\))

      \[ \begin{aligned} F_2(x_1) &= 4.0 + 0.5(-2.0) = \mathbf{3.0} \\F_2(x_2) &= 4.0 + 0.5(1.0) = \mathbf{4.5} \\ F_2(x_3) &= 7.0 + 0.5(1.0) = \mathbf{7.5}\ \\ \end{aligned} \]

      Tree 2:

      images/machine_learning/supervised/decision_trees/gbdt_example/slide_07_01.png

Note: We can keep adding more trees with every iteration;
ideally, learning rate \(\nu\) is small, say 0.1, so that we do not overshoot and converge slowly.

Inference
\[ F_{final}(x) = F_0 + \nu \cdot h_1(x) + \nu \cdot h_2(x) \]

Let’s predict the price of a house with area = 2000 sq. ft.

  • \(F_{0}=5.0\)
  • Pass through tree 1 (\(h_1\)): is 2000 < 2150 ? Yes, \(\gamma_{11}\)= -2.0
    • Contribution (\(h_1\)) = 0.5 x (-2.0) = -1.0
  • Pass through tree 2 (\(h_2\)): is 2000 < 1500 ? No, \(\gamma_{22}\) = 1.0
    • Contribution(\(h_2\)) = 0.5 x (1.0) = 0.5
  • Final prediction = 5.0 - 1.0 + 0.5 = 4.5

Therefore, the predicted price of a house with area = 2000 sq. ft is Rs 4.5 crores.
In just 2 iterations, although with a higher learning rate (\(\nu=0.5\)), we were able to get a fairly good estimate.
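The two iterations above can be reproduced end-to-end with a short script. This is an illustrative sketch (the stump search and leaf values follow the MSE derivations from the previous sections; `best_stump` and `predict` are helper names, not library functions):

```python
import numpy as np

X = np.array([1200.0, 1800.0, 2500.0])   # area (sq. ft)
y = np.array([2.0, 4.0, 9.0])            # price (Rs crores)
nu = 0.5                                 # learning rate

def best_stump(X, r):
    """Depth-1 split: midpoint threshold minimizing SSE of the residuals."""
    xs = np.sort(np.unique(X))
    best = None
    for t in (xs[:-1] + xs[1:]) / 2:
        left, right = r[X < t], r[X >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]                      # (threshold, left leaf, right leaf)

F0 = y.mean()                            # F_0 = mean = 5.0
F, stumps = np.full_like(y, F0), []
for m in range(2):                       # two boosting iterations
    r = y - F                            # pseudo-residuals (MSE loss)
    t, gl, gr = best_stump(X, r)
    F = F + nu * np.where(X < t, gl, gr)
    stumps.append((t, gl, gr))

def predict(x):
    """F_final(x) = F_0 + nu*h_1(x) + nu*h_2(x)."""
    return F0 + sum(nu * (gl if x < t else gr) for t, gl, gr in stumps)
```

Running it recovers the same splits (2150, then 1500), the same updated predictions [3.0, 4.5, 7.5], and `predict(2000.0)` = 4.5.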



End of Section

1.2.4.13 - Advanced GBDT Algorithms

Advanced GBDT Algorithms

Advanced GBDT Algorithms
🔴 XGBoost (Extreme Gradient Boosting)
🔵 LightGBM (Light Gradient Boosting Machine)
⚫️ CatBoost (Categorical Boosting)
XGBoost (Extreme Gradient Boosting)

⭐️An optimized and highly efficient implementation of gradient boosting.

👉 Widely used in competitive data science (like Kaggle) due to its speed and performance.

Note: Research project developed by Tianqi Chen during his doctoral studies at the University of Washington.

LightGBM (Light Gradient Boosting Machine)

⭐️Developed by Microsoft, this framework is designed for high speed and efficiency with large datasets.

👉It grows trees leaf-wise rather than level-wise and uses Gradient-based One-Side Sampling (GOSS) to speed 🐇 up the finding of optimal split points.

CatBoost (Categorical Boosting)
⭐️Developed by Yandex, this algorithm is specifically optimized for handling ‘categorical’ features without requiring extensive preprocessing (such as, one-hot encoding).



End of Section

1.2.4.14 - XgBoost

XgBoost

XGBoost (Extreme Gradient Boosting)

⭐️An optimized and highly efficient implementation of gradient boosting.

👉 Widely used in competitive data science (like Kaggle) due to its speed and performance.

Note: Research project developed by Tianqi Chen during his doctoral studies at the University of Washington.

Algorithmic Optimizations
🔵 Second order Derivative
🔵 Regularization
🔵 Sparsity-Aware Split Finding
Second order Derivative

⭐️Uses the second derivative (Hessian), i.e., curvature, in addition to the first derivative (gradient) to optimize the objective function more quickly and accurately than GBDT.

Let’s understand this with the problem to minimize \(f(x) = x^4\), using:

  • Gradient descent (uses only 1st order derivative, \(f'(x) = 4x^3\))

  • Newton’s method (uses both 1st and 2nd order derivatives \(f''(x) = 12x^2\))

    images/machine_learning/supervised/decision_trees/xgboost/slide_04_01.png
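A small numeric sketch of why curvature helps, using the \(f(x)=x^4\) example above (the gradient-descent step size 0.05 is an arbitrary choice):

```python
# Minimizing f(x) = x^4 from x = 1.0: Newton's curvature-aware step vs plain GD.
def grad(x):
    return 4 * x ** 3      # f'(x)

def hess(x):
    return 12 * x ** 2     # f''(x)

x_gd, x_newton = 1.0, 1.0
for _ in range(20):
    x_gd -= 0.05 * grad(x_gd)                    # fixed learning rate
    x_newton -= grad(x_newton) / hess(x_newton)  # x <- x - x/3 = (2/3) x

# Newton shrinks x by a constant factor 2/3 per step; GD crawls as the
# gradient vanishes near the minimum.
```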
Regularization
  • Adds explicit regularization terms (L1/L2 on leaf weights and tree complexity) into the boosting objective, helping reduce over-fitting. \[ \text{Objective} = \underbrace{\sum_{i=1}^{n} L(y_i, \hat{y}_i)}_{\text{Training Loss}} + \underbrace{\gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|}_{\text{Regularization (The Tax)}} \]
Sparsity-Aware Split Finding

💡Real-world data often contains many missing values or zero-entries (sparse data).

👉 XGBoost introduces a ‘default direction’ for each node.

➡️During training, it learns the best direction (left or right) for missing values to go, making it significantly faster and more robust when dealing with sparse or missing data.



End of Section

1.2.4.15 - LightGBM

LightGBM

LightGBM (Light Gradient Boosting Machine)

⭐️Developed by Microsoft, this framework is designed for high speed and efficiency with large datasets.

👉It grows trees leaf-wise rather than level-wise and uses Gradient-based One-Side Sampling (GOSS) to speed 🐇 up the finding of optimal split points.

Algorithmic Optimizations
🔵 Gradient-based One Side Sampling (GOSS)
🔵 Exclusive Feature Bundling (EFB)
🔵 Leaf-wise Tree Growth Strategy
Gradient-based One Side Sampling (GOSS)
  • ❌ Traditional GBDT uses all data instances for gradient calculation, which is inefficient.
  • ✅ GOSS focuses 🔬on instances with larger gradients (those that are less well-learned or have higher error).
  • 🐛 Keeps all instances with large gradients but randomly samples from those with small gradients.
  • 🦩This way, it prioritizes the most informative examples for training, significantly reducing the data used and speeding up 🐇 the process while maintaining accuracy.
Exclusive Feature Bundling (EFB)
  • 🦀 High-dimensional data often contains many sparse, mutually exclusive features (features that never take a non-zero value simultaneously, such as, One Hot Encoding (OHE)).
  • 💡 EFB bundles the non-overlapping features into a single, dense feature, reducing the number of features, without losing much information, saving computation.
Leaf-wise Tree Growth Strategy
  • ❌ In traditional gradient boosting machines (like XGBoost), the trees are built level-wise (BFS-like), meaning all nodes at the current level are split before moving to the next level.
  • ✅ LightGBM maintains a set of all potential leaves that can be split at any given time and selects the leaf (for splitting) that provides the maximum gain across the entire tree, regardless of its depth.

Note: Need mechanisms to avoid over-fitting.



End of Section

1.2.4.16 - CatBoost

CatBoost

CatBoost (Categorical Boosting)
⭐️Developed by Yandex, this algorithm is specifically optimized for handling ‘categorical’ features without requiring extensive preprocessing (such as, one-hot encoding).
Algorithmic Optimizations
🔵 Ordered Target Encoding
🔵 Symmetric(Oblivious) Trees
🔵 Handling Missing Values
Ordered Target Encoding
  • ❌ Standard target encoding can lead to target leakage, where the model uses information from the target variable during training that would not be available during inference.
    👉(model ‘cheats’ by using a row’s own label to predict itself).
  • ✅ CatBoost calculates the target statistics (average target value) for each category based only on the history of previous training examples in a random permutation of the data.
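A toy sketch of that idea, assuming a single permutation and a simple prior-smoothed running mean (the production CatBoost formula averages over multiple permutations and weights the prior differently):

```python
cats = ["a", "b", "a", "a", "b"]      # categorical feature, in permutation order
y    = [1,   0,   1,   0,   1]        # binary target
prior = sum(y) / len(y)               # global mean, smooths the earliest rows

encoded, seen = [], {}                # seen[c] = (sum of targets, count) so far
for c, t in zip(cats, y):
    s, n = seen.get(c, (0.0, 0))
    encoded.append((s + prior) / (n + 1))   # uses only *earlier* rows + prior
    seen[c] = (s + t, n + 1)                # reveal this row's label afterwards
```

No row's encoding ever depends on its own label, which is exactly what prevents the leakage described above.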
Symmetric (Oblivious) Trees
  • 🦋 Uses symmetric decision trees by default.
    👉 In symmetric trees, the same split condition is applied at each level across the entire tree structure.

  • 🦘Does not walk down the tree using ‘if-else’ logic; instead it evaluates the decision conditions to create a binary index (e.g., 101) and jumps directly to that leaf 🍃 in memory 🧠.

    images/machine_learning/supervised/decision_trees/catboost/slide_06_01.png
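A minimal sketch of that leaf lookup, with made-up conditions and leaf values:

```python
# Toy oblivious tree of depth 3: ONE (feature, threshold) condition per level,
# shared by every node at that level. Feature indices and values are invented.
levels = [(0, 5.0), (1, 2.5), (0, 9.0)]
leaves = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]   # 2^3 leaf values

def predict(x):
    idx = 0
    for feat, thr in levels:
        idx = (idx << 1) | int(x[feat] > thr)   # build the binary index, e.g. 0b101
    return leaves[idx]                          # direct jump, no if/else walk
```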
Handling Missing Values
  • ⚙️ CatBoost offers built-in, intelligent handling of missing values and sparse features, which often require manual preprocessing in other GBDT libraries.

  • 💡Treats ‘NaN’ as a distinct category, reducing the need for imputation.

    images/machine_learning/supervised/decision_trees/catboost/slide_08_01.png



End of Section

1.2.5 - Support Vector Machine

Support Vector Machine



End of Section

1.2.5.1 - SVM Intro

SVM Intro

Geometric Intuition💡

⭐️We have two classes of points (e.g. Cats 😸vs. Dogs 🐶) that can be separated by a straight line.

👉 Many such lines exist !

💡SVM asks: “Which line is the safest?”

images/machine_learning/supervised/support_vector_machines/svm_intro/slide_01_01.tif
Highway 🛣️ Analogy

💡Think of the decision boundary as the center-line of a highway 🛣️.

SVM tries to make this highway 🛣️ as wide as possible without hitting any ‘buildings’ 🏡 (data points) on either side.

images/machine_learning/supervised/support_vector_machines/svm_intro/slide_05_01.tif
Support Vectors
The points that lie exactly on the edges of the highway are the Support Vectors.
Goal
🎯Maximize the width of the ‘street’ (the margin) to ensure the model generalizes well to unseen data.



End of Section

1.2.5.2 - Hard Margin SVM

Hard Margin SVM

Assumptions of Hard Margin SVM
  • Data is perfectly linearly separable, i.e, there must exist a hyperplane that can perfectly separate the data into two distinct classes without any misclassification.

  • No noise or outliers that fall within the margin or on the wrong side of the decision boundary. Note: Even a single outlier can prevent the algorithm from finding a valid solution or drastically affect the boundary’s position, leading to poor generalization.

    images/machine_learning/supervised/support_vector_machines/hard_margin_svm/slide_01_01.tif
Distance Between Margins
\[ \begin{aligned} \text{Decision Boundary: } \pi &= \mathbf{w^Tx} + w_0 = 0\\ \text{Upper Margin: }\pi^+ &= \mathbf{w^Tx} + w_0 = +1\\ \text{Lower Margin: }\pi^- &= \mathbf{w^Tx} + w_0 = -1\\ \end{aligned} \]
  • 🐎 Distance(signed) of a hyperplane from origin = \(\frac{-w_0}{\|w\|}\)
  • 🦣 Margin width = distance(\(\pi^+, \pi^-\))
  • = \(\frac{1-w_0 - (-1 -w_0)}{\|w\|}\) = \(\frac{1-\cancel{w_0} + 1 + \cancel{w_0}}{\|w\|}\)
  • distance(\(\pi^+, \pi^-\)) = \(\frac{2}{\|w\|}\)

Figure: Distance of Hyperplane from Origin

images/machine_learning/supervised/support_vector_machines/hard_margin_svm/slide_04_01.png

Read more about Hyperplane

Goal 🎯
  • Separating hyperplane \(\pi\) is exactly equidistant from \(\pi^+\) and \(\pi^-\).
  • We want to maximize the margin between +ve(🐶) and -ve (😸) points.
Constraint
\[w^Tx_i + w_0 \ge +1 \quad \text{for } y_i = +1 \]

\[w^Tx_i + w_0 \le -1 \quad \text{for } y_i = -1 \]

👉Combining above two constraints:

\[y_{i}\cdot(w^{T}x_{i}+w_{0}) \ge 1\]
Optimization ⚖️
\[\max_{w, w_0} \frac{2}{\|w\|}\]

such that, \(y_i.(w^Tx_i + w_0) \ge 1\)

Primal Problem

👉To maximize the margin, we must minimize \(\|w\|\).
Since, distance(\(\pi^+, \pi^-\)) = \(\frac{2}{\|w\|}\)

\[\min_{w, w_0} \frac{1}{2} {\|w\|^2}\]

such that, \(y_i.(w^Tx_i + w_0) \ge 1 ~ \forall i = 1,2,\dots, n\)

Note: Hard margin SVM will not work if the data has a single outlier or slight overlap.
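The constraint and margin-width formulas above can be checked numerically. The hyperplane \(w=(0.5, 0.5)\), \(w_0=0\) and the two toy points are assumed values chosen so both points sit exactly on the margins:

```python
import numpy as np

# Two toy points, one per class, lying on pi+ and pi- respectively.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])

w, w0 = np.array([0.5, 0.5]), 0.0    # assumed separating hyperplane

margins = y * (X @ w + w0)           # hard-margin constraint: all values >= 1
width = 2 / np.linalg.norm(w)        # distance(pi+, pi-) = 2 / ||w||
```

Here `margins` equals `[1, 1]` (both points on the margin) and `width` is \(2\sqrt{2}\).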



End of Section

1.2.5.3 - Soft Margin SVM

Soft Margin SVM

Intuition💡

💡Imagine the margin is a fence 🌉.

  • Hard Margin: fence is made of steel.
    Nothing can cross it.

  • Soft Margin: fence is made of rubber(porous).
    Some points can ‘push’ into the margin or even cross over to the wrong side, but we charge them a penalty 💵 for doing so.

    images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_03_01.tif
Issue

Distance from decision boundary:

  • Distance of positive labelled points must be \(\ge 1\)
  • But, distance of noise 📢 points (actually positive points) \(x_1, x_2 ~\&~ x_3\) < 1
Solution

⚔️ So, we introduce a slack variable or allowance for error term, \(\xi_i\) (pronounced ‘xi’) for every single data point.

\[y_i.(w^Tx_i + w_0) \ge 1 - \xi_i, ~ \forall i = 1,2,\dots, n\]

\[ \implies \xi_i \ge 1 - y_i\cdot(w^Tx_i + w_0), \quad \text{also } \xi_i \ge 0 \]\[\text{So, } \xi_{i}=\max (0,\,1-y_{i}\cdot (w^Tx_i + w_0))\]

Note: The above error term is also called ‘Hinge Loss’.

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_06_01.png

Hinge

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_06_02.png
Slack/Error Term Interpretation
\[\xi _{i}=\max (0,1-y_{i}\cdot f(x_{i}))\]
  • \(\xi_i = 0\) : Correctly classified and outside (or on) the margin.
  • \(0 < \xi_i \le 1 \) : Within the margin but on the correct side of the decision boundary.
  • \(\xi_i > 1\): On the wrong side of the decision boundary (misclassified).

e.g.: Since the noise 📢 points are +ve (\(y_i=1\)) labeled:

\[\xi _{i}=\max (0,1-f(x_{i}))\]
  • \(x_1, (d=+0.5)\): \(\xi _{i}=\max (0,1-0.5) = 0.5\)

  • \(x_2, (d=-0.5)\): \(\xi _{i}=\max (0,1-(-0.5))= 1.5\)

  • \(x_3, (d=-1.5)\): \(\xi _{i}=\max (0,1-(-1.5)) = 2.5\)

    images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_09_01.tif
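The three slack values above can be reproduced directly from the hinge formula:

```python
def hinge_slack(y, f):
    # xi_i = max(0, 1 - y_i * f(x_i))
    return max(0.0, 1.0 - y * f)

# Noisy positive points (y = +1) at signed distances d = +0.5, -0.5, -1.5
slacks = [hinge_slack(+1, d) for d in (0.5, -0.5, -1.5)]
# -> [0.5, 1.5, 2.5], matching x1, x2 and x3 above
```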
Goal 🎯
\[\text{Maximize the width of margin: } \min_{w, w_0} \frac{1}{2} {\|w\|^2}\]

\[\text{Minimize violation or sum of slack/error terms: } \sum \xi_i\]
Optimization (Primal Formulation)
\[\min_{w, w_0, \xi} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n \xi_i}_{\text{Error Penalty}}\]

Subject to constraints:

  1. \(y_i(w^T x_i + b) \geq 1 - \xi_i\): The ‘softened’ margin constraint.
  2. \(\xi_i \geq 0\): Slack/Error cannot be negative.

Note: We use a hyper-parameter ‘C’ to control the trade-off.

Hyper-Parameter ‘C'
  • Large ‘C’: Over-Fitting;
    Misclassifications are expensive 💰.
    Model tries to keep the errors as low as possible.
  • Small ‘C’: Under-Fitting;
    Margin width is more important than individual errors.
    Model will ignore outliers/noise to get a ‘cleaner’(wider) boundary.
Hinge Loss View
\[ \text{Hinge loss: } \xi _{i}=\max (0,1-y_{i}\cdot (w^Tx_i + w_0))\]\[ \min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \text{HingeLoss}(y_i, f(x_i))\]

Note: SVM is just L2-Regularized Hinge Loss minimization, as Logistic Regression minimizes Log-Loss.

images/machine_learning/supervised/support_vector_machines/soft_margin_svm/slide_14_01.tif



End of Section

1.2.5.4 - Primal Dual Equivalence

Primal Dual Equivalence

Intuition💡

The Primal form is intuitive but computationally expensive in high dimensions.

➡️ Number of new features = \({d+p \choose p}\), grows roughly as \(O(d^{p})\)

  • d= number of dimensions (features)
  • p = degree of polynomial

Note: The Dual form is what enables the Kernel Trick 🪄.

Optimization (Primal Formulation)
\[\min_{w, w_0, \xi} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n \xi_i}_{\text{Error Penalty}}\]

Subject to constraints:

  1. \(y_i(w^T x_i + b) \geq 1 - \xi_i\): The ‘softened’ margin constraint.
  2. \(\xi_i \geq 0\): Slack/Error cannot be negative.
Lagrangian

⭐️ We start with the Soft-Margin Primal objective and incorporate its constraints using Lagrange Multipliers \((\alpha_i, \mu_i \geq 0)\)

\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]

Note: Inequality conditions must be \(\le 0\).

Lagrangian Objective

👉The Lagrangian function has two competing objectives:

\[L(w, w_0, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i\]
  • Minimization: We want to minimize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t primal variables (\(w, w_0, \xi_i \) ) to find the hyperplane with the largest possible margin.
  • Maximization: We want to maximize \(L(w, w_0, \xi, \alpha, \mu)\) w.r.t dual variables (\(\alpha_i, \mu_i\) ) to ensure all training constraints are satisfied.

Note: A point that is simultaneously a minimum for one set of variables and a maximum for another is, by definition, a saddle point.

Karush–Kuhn–Tucker (KKT) Conditions

👉To find the Dual, we find the saddle point by taking partial derivatives with respect to the primal variables \((w, w_0, \xi)\) and equating them to 0.

\[\frac{\partial L}{\partial w} = 0 \implies \mathbf{w = \sum_{i=1}^n \alpha_i y_i x_i}\]

\[\frac{\partial L}{\partial w_0} = 0 \implies \mathbf{\sum_{i=1}^n \alpha_i y_i = 0}\]

\[\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \mu_i = 0 \implies \mathbf{0 \leq \alpha_i \leq C}\]
Primal Expansion
\[\frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(\mathbf{w}^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[ \begin{aligned} = \frac{1}{2} \left(\sum_i \alpha_i y_i x_i \right)^T . \left(\sum_j \alpha_j y_j x_j \right) + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n -\alpha_i y_i \left( \sum_{j=1}^n \alpha_j y_j x_j \right)^T x_i \\ -w_0 \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i(1-\xi_i) + \sum_{i=1}^n \mu_i. (-\xi_i) \\ \end{aligned} \]\[ = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \underbrace{-w_0\sum_{i=1}^n \alpha_i y_i}_{=0, K.K.T} + \underbrace{\sum_{i=1}^n \xi_i (C -\alpha_i -\mu_i)}_{=0, K.K.T} \]\[ = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \]
(Wolfe) ‘Dual' Optimization
\[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[ = \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

subject to: \(0 \leq \alpha_i \leq C\) and \(\sum \alpha_i y_i = 0\)

  • \(\alpha_i\)= 0, for non support vectors (correct side)
  • \(0 < \alpha_i < C\), for free support vectors (exactly on the margin)
  • \(\alpha_i = C\), for bounded support vectors (misclassified or inside margin)

Note: Sequential Minimal Optimization (SMO) algorithm is used to find optimal \(\alpha_i\) values.

Inference Time ⏰

🎯To classify unseen point \(x_q\) : \(f(x_q) = \text{sign}(w^T x_q + w_0)\)

✅ From the KKT stationarity condition, we know: \(\mathbf{w} = \sum_{i=1}^n \alpha_i y_i x_i\)

👉 Substituting this into the equation:

\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]

Note: Even if you have 1 million training points, if only 500 are support vectors, the summation only runs for 500 terms.
All other points have \(\alpha_i = 0\) and vanish.
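A sketch of dual inference with hand-picked toy values: two support vectors with \(\alpha_i = 0.25\), chosen so that \(\sum \alpha_i y_i = 0\) and both points land exactly on the margins (a real solver like SMO would find these values):

```python
import numpy as np

X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])   # support vectors
y_sv = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])                # toy dual variables (not fitted)
w0 = 0.0

w = (alpha * y_sv) @ X_sv                     # KKT: w = sum_i alpha_i y_i x_i

def predict(x_q):
    # f(x_q) = sign( sum over support vectors of alpha_i y_i (x_i . x_q) + w0 )
    return np.sign((alpha * y_sv) @ (X_sv @ np.asarray(x_q)) + w0)
```

Note how `predict` never touches `w` directly; the summation runs only over the stored support vectors, exactly as in the equation above.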



End of Section

1.2.5.5 - Kernel Trick

Kernel Trick

Intuition 💡

👉If our data is not linearly separable in its original space \(\mathbb{R}^d\), we can map it to a higher-dimensional feature space \(\mathbb{R}^D\) (where \(D \gg d\)) using a transformation function \(\phi(x)\).

images/machine_learning/supervised/support_vector_machines/kernel_trick/slide_02_01.png
Kernel Trick 🪄
  • Bridge between Dual formulation and the geometry of high dimensional spaces.
  • It is a way to manipulate inner product spaces without the computational cost 💰 of explicit transformation.
(Wolfe) ‘Dual' Optimization
\[ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[y_i(w^T x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i\]

\[= \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

subject to: \(0 \leq \alpha_i \leq C\) and \(\sum \alpha_i y_i = 0\)

Observation

💡Actual values of the input vectors \(x_i\) and \(x_j\) never appear in isolation; only appear as inner product.

\[ \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{(x_i \cdot x_j)}\]

\[f(x_q) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i (x_i^T x_q) + w_0 \right)\]

👉The ‘shape’ of the decision boundary is entirely determined by how similar points are to one another, not by their absolute coordinates.

Non-Linear Separation
If our data is not linearly separable in its original space \(\mathbb{R}^d\), we can map it to a higher-dimensional \(\mathbb{R}^D\) feature space (where \(D \gg d\)) using a transformation function \(\phi(x)\).
Problem 🤔
If we choose a very high-dimensional mapping (e.g. \(D = 10^6\) or \(D = \infty\) ), calculating and then performing the dot product \(\phi(x_i)^T \phi(x_j)\) becomes computationally impossible or extremely slow.
Kernel Trick 👻

So we define a function called ‘Kernel Function’.

The ‘Kernel Trick’ 🪄 is an optimization that replaces the dot product of a high-dimensional mapping with a function of the dot product in the original space.

\[K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle\]

💡How it works ?

Instead of mapping to the higher dimension, \(x_i \rightarrow \phi(x_i)\), \(x_j \rightarrow \phi(x_j)\), and calculating the dot product there,
we simply compute \(K(x_i, x_j)\) directly in the original input space.

👉The ‘Kernel Function’ gives the mathematical equivalent of mapping the vectors to a higher dimension and taking the dot product there.

Note: For \(K(x_i, x_j)\) to be a valid kernel, it must satisfy Mercer’s Condition.

Polynomial (Quadratic) Kernel

Below is an example for a quadratic Kernel function in 2D that is equivalent to mapping the vectors to 3D and taking a dot product in 3D.

\[K(x, z) = (x^T z)^2\]

\[(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2\]

The output of above quadratic kernel function is equivalent to the explicit dot product of two vectors in 3D:

\[\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T\]

\[\phi(z) = [z_1^2, \sqrt{2}z_1z_2, z_2^2]^T\]\[\phi (x)\cdot \phi (z)=x_{1}^{2}z_{1}^{2}+2x_{1}x_{2}z_{1}z_{2}+x_{2}^{2}z_{2}^{2}\]
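The equivalence can be verified numerically for a concrete pair of 2D vectors:

```python
import numpy as np

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2, computed entirely in the original 2D space
    return float(np.dot(x, z)) ** 2

def phi(v):
    # Explicit 3D feature map: [x1^2, sqrt(2) x1 x2, x2^2]
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
# (1*3 + 2*4)^2 = 11^2 = 121, either via the kernel or via phi(x).phi(z)
```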
Advantages ⛳️
  • Computational Efficiency: Avoids the ‘combinatorial blowup’ 💥 of generating thousands of interaction features manually.
  • Memory Savings: No need to store 💾 or process high-dimensional coordinates, only the scalar result of the kernel function.
Why Kernel SVMs are Not so Popular ?
  • Designing special purpose domain specific kernel is very hard.
    • Basically, we are trying to replace feature engineering with kernel design.
    • Note: Deep learning does feature engineering implicitly for us.
  • Runtime complexity depends on number of support vectors, whose count is not easy to control.

Note: Runtime Complexity ⏰ = \(O(n_{SV}\times d)\), whereas for linear SVM it is \(O(d)\).



End of Section

1.2.5.6 - RBF Kernel

RBF Kernel

Intuition 💡
  • Unlike the polynomial kernel, which looks at global 🌎 interactions, the RBF kernel acts like a similarity measure.
  • If ‘x’ and ‘z’ are identical \(K(x,z)=1\).
    • As they move further apart in Euclidean space, the value decays exponentially towards 0.
Radial Basis Function (RBF) Kernel
\[K(x, z) = \exp\left(-\gamma. \|x - z\|^2\right)\]

\[\text{where, }\gamma = \frac{1}{2\sigma^2}\]
  • If \(x \approx z\) (very close), \(K(x,z)=1\)
  • If ‘x’, ‘z’ are far apart, \(K(x,z) \approx 0\)

Note: Kernel function is the measure of similarity or closeness.
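A minimal check of the similarity behaviour, with \(\gamma = 1\):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])    # identical points -> exactly 1.0
far  = rbf_kernel([0.0, 0.0], [10.0, 0.0])   # distant points -> decays toward 0
```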

Infinite Dimension Mapping

Say \(\gamma = 1\), then the squared Euclidean distance: \(\|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2x^Tz\)

\[K(x, z) = \exp(-( \|x\|^2 + \|z\|^2 - 2x^T z )) = \exp(-\|x\|^2) \exp(-\|z\|^2) \exp(2x^T z)\]

The Taylor expansion for \(e^u= \sum_{n=0}^{\infty} \frac{u^n}{n!}\)

\[\exp(2x^T z) = \sum_{n=0}^{\infty} \frac{(2x^T z)^n}{n!} = 1 + \frac{2x^T z}{1!} + \frac{(2x^T z)^2}{2!} + \dots + \frac{(2x^T z)^n}{n!} + \dots\]

\[K(x, z) = e^{-\|x\|^2} e^{-\|z\|^2} \left( \sum_{n=0}^{\infty} \frac{2^n (x^T z)^n}{n!} \right)\]

💡If we expand each \((x^T z)^n\) term, it represents the dot product of all possible n-th order polynomial features.

👉Thus, the implicit feature map is:

\[\phi(x) = e^{-\|x\|^2} \left[ 1, \sqrt{\frac{2}{1!}}x, \sqrt{\frac{2^2}{2!}}(x \otimes x), \dots, \sqrt{\frac{2^n}{n!}}(\underbrace{x \otimes \dots \otimes x}_{n \text{ times}}), \dots \right]^T\]
  • Important: The tensor product \(x\otimes x\) creates a vector (or matrix) containing all combinations of the features. e.g. if \(x=[x_{1},x_{2}]\), then \(x\otimes x=[x_{1}^{2},x_{1}x_{2},x_{2}x_{1},x_{2}^{2}]\) 

Note: Because the Taylor series has an infinite number of terms, feature map has an infinite number of dimensions.

Bias-Variance Trade-Off ⚔️
  • High Gamma(low \(\sigma\)): Over-Fitting
    • Makes the kernel so ‘peaky’ that each support vector only influences its immediate neighborhood.
    • Decision boundary becomes highly irregular, ‘wrapping’ tightly around individual data points to ensure they are classified correctly.
  • Low Gamma(high \(\sigma\)): Under-Fitting
    • The Gaussian bumps are wide and flat.
    • Decision boundary becomes very smooth, essentially behaving more like a linear or low-degree polynomial classifier.



End of Section

1.2.5.7 - Support Vector Regression

Support Vector Regression

Intuition 💡

👉Imagine a ‘tube’ of radius \(\epsilon\) surrounding the regression line.

  • Points inside the tube are considered ‘correct’ and incur zero penalty.
  • Points outside the tube are penalized based on their distance from the tube’s boundary.
Ignore Errors

👉SVR ignores errors as long as they are within a certain distance (\(\epsilon\)) from the true value.

🎯This makes SVR inherently robust to noise and outliers, as it does not try to fit every single point perfectly, only those that ‘matter’ to the structure of the data.

Note: Standard regression (like OLS) tries to minimize the squared error between the prediction and every data point.

Optimization (Primal Formulation)
\[\min_{w, w_0, \xi, \xi^*} \underbrace{\frac{1}{2} \|w\|^2}_{\text{Regularization}} + \underbrace{C \sum_{i=1}^n (\xi_i + \xi_i^*)}_{\text{Error Penalty}}\]

Subject to constraints:

  • \(y_i - (w^T x_i + w_0) \leq \epsilon + \xi_i\): (Upper boundary)
  • \((w^T x_i + w_0) - y_i \leq \epsilon + \xi_i^*\): (Lower boundary)
  • \(\xi_i, \xi_i^* \geq 0\): (Slack/Error cannot be negative)

Terms:

  • Epsilon(\(\epsilon\)): The width of the tube. Increasing ‘\(\epsilon\)’ results in fewer support vectors and a smoother (flatter) model.
  • Slack Variables (\(\xi_i, \xi_i^*\)): How far a point lies outside the upper and lower boundaries of the tube.
  • C: The trade-off between the flatness of the model and the extent to which deviations larger than \(\epsilon\) are tolerated.
Loss Function

SVR uses a specific loss function that is 0 when the error is less than ‘\(\epsilon\)’.

\[L_\epsilon(y, f(x)) = \max(0, |y - f(x)| - \epsilon)\]
  • The solution becomes sparse, because the loss is zero for points inside the tube.
  • Only the Support Vectors, i.e, points outside or on the boundary of the tube have non-zero Lagrange multipliers (\(\alpha_i\)).

Note: \(\epsilon=0.1\) default value in scikit-learn.
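The \(\epsilon\)-insensitive loss is easy to state in code (using the scikit-learn default \(\epsilon = 0.1\)):

```python
def epsilon_insensitive_loss(y, f, eps=0.1):
    # Zero loss inside the epsilon-tube, linear penalty beyond it.
    return max(0.0, abs(y - f) - eps)

inside  = epsilon_insensitive_loss(1.0, 1.05)   # |error| = 0.05 < eps -> 0
outside = epsilon_insensitive_loss(1.0, 1.30)   # |error| = 0.30 -> 0.30 - 0.1 = 0.2
```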

(Wolfe) ‘Dual' Optimization
\[\max_{\alpha, \alpha^*} \sum_{i=1}^n y_i (\alpha_i - \alpha_i^*) - \epsilon \sum_{i=1}^n (\alpha_i + \alpha_i^*) - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i - \alpha_i^*) (\alpha_j - \alpha_j^*) \mathbf{(x_i^T x_j)}\]

Subject to:

  1. \(\sum_{i=1}^n (\alpha_i - \alpha_i^*) = 0\)
  2. \(0 \leq \alpha_i, \alpha_i^* \leq C\)
  • \(\alpha_i = \alpha_i^* = 0\): point is inside the tube.
  • \(|\alpha_i - \alpha_i^*| > 0\) : support vectors; points on or outside the tube.

Note: \(\alpha_i , \alpha_i^* \) cannot both be non-zero for the same point; a point cannot be simultaneously above and below the tube.

Inference & Kernel Trick
\[f(z) = \sum_{i \in SV} (\alpha_i - \alpha_i^*) \mathbf{K(x_i, z)} + w_0\]
  • 👉 For non-linear SVR we replace dot product \(x_i^T x_j\) with kernel function \(K(x_i, x_j)\).
  • ✅ Model needs to store only support vectors, i.e, points where \(|\alpha_i - \alpha_i^*| > 0\).
  • ⭐️\(\xi_i =0 \) for a point that lies exactly on the boundary, so we can use that to calculate the bias (\(w_0\)):
\[w_0 = y_i - \sum_{j \in SV} (\alpha_j - \alpha_j^*) K(x_j, x_i) - \epsilon\]

\[ \text{Since, } y_i - (w^T x_i + w_0) \leq \epsilon + \xi_i\]



End of Section

1.2.6 - Naive Bayes'

Naive Bayes'



End of Section

1.2.6.1 - Naive Bayes Intro

Naive Bayes Intro

Naive Bayes
📘Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.
Use Case

📌Let’s understand Naive Bayes through an Email/Text classification example.

  • Number of words in an email is not fixed.
  • Remove all stop words, such as, the, is , are, if, etc.
  • Keep only relevant words, i.e, \(w_1, w_2, \dots , w_d\) words.
  • We want to do a binary classification - Spam/Not Spam.
Bayes' Theorem

Let’s revise Bayes’ theorem first:

\[P(S|W)=\frac{P(W|S)\times P(S)}{P(W)}\]
  • \(P(S|W)\) is the posterior probability: the probability of the email being spam, given the words inside it.
  • \(P(W|S)\) is the likelihood: how likely is this email’s word pattern if it were spam?
  • \(P(S)\) is the prior probability: The ‘base rate’ of spam.
    • If our dataset has 10,000 emails and 2,000 are spam, then \(P(S)\)=0.2.
  • \(P(W)\) is the prior probability of the predictor (evidence): total probability of seeing these words across all emails.
    • 👉Since this is the same for both classes, we treat it as a constant and ignore it during comparison.
Challenge 🤺

👉Likelihood = \(P(W|S)\) = \(P(w_1, w_2, \dots w_d | S)\)

➡️ For computing the joint distribution of say d=1000 words, we need to learn from possible \(2^{1000}\) combinations.

  • \(2^{1000}\) > the number of atoms in the observable 🔭 universe 🌌.
  • We will never have enough training data to see every possible combination of words even once.
    • Most combinations would have a count of zero.

🦉So, how do we solve this ?

Naive Assumption

💡The ‘Naive’ assumption is a ‘Conditional Independence’ assumption, i.e, we assume each word appears independently of the others, given the class Spam/Not Spam.

  • e.g. In a spam email, the likelihood of ‘Free’ and ‘Money’ 💵 appearing are treated as independent events, even though they usually appear together.

Note: The conditional independence assumption makes the probability calculations easier, i.e, the joint probability simply becomes the product of individual probabilities, conditional on the label.

\[P(W|S) = P(w_1|S)\times P(w_2|S)\times \dots P(w_d|S) \]\[\implies P(S|W)=\frac{P(w_1|S)\times P(w_2|S)\times \dots P(w_d|S)\times P(S)}{P(W)}\]

\[\implies P(S|W)=\frac{\prod_{i=1}^d P(w_i|S)\times P(S)}{P(W)}\]

\[\text{Similarly, } P(NS|W)=\frac{\prod_{i=1}^d P(w_i|NS)\times P(NS)}{P(W)}\]

We can generalize it for any number of class labels ‘y’:

\[\implies P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y) \quad \text{ where, y = class label} \]

\[ P(w_i|y) = \frac{count(w_i ~in~ y)}{\text{total words in class y}} \]

Note: We compute the probabilities for both Spam/Not Spam and assign the final label to email, depending upon which probability is higher.

Performance 🏇

👉Space Complexity: O(d*c)

👉Time Complexity:

  • Training: O(n*d*c)
  • Inference: O(d*c)

Where,

  • d = number of features/dimensions
  • c = number of classes
  • n = number of training examples



End of Section

1.2.6.2 - Naive Bayes Issues

Naive Bayes Issues

Naive Bayes

⭐️Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.

\[P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y)\]

\[P(w_i|y) = \frac{count(w_i ~in~ y)}{\text{total words in class y}}\]
Problem # 1

🦀What if at runtime we encounter a word that was never seen during training ?

e.g. A word ‘crypto’ appears in the test email that was not present in training emails; P(‘crypto’|S) =0.

👉This will force the entire product to zero.

\[P(w_i|S) = \frac{\text{Total count of } w_i \text{ in all Spam emails}}{\text{Total count of all words in all Spam emails}}\]
Laplace Smoothing

💡Add ‘Laplace smoothing’ to all likelihoods, both during training and test time, so that the probability becomes non-zero.

\[P(x_{i}|y)=\frac{count(x_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]
  • \(count(x_{i},y)\) : number of times word appears in documents of class ‘y'.
  • \(count(y)\): The total count of all words in documents of class ‘y'.
  • \(|V|\)(or \(N_{features}\)):Vocabulary size or total number of unique possible words.

Let’s understand this by the examples below:

\[P(w_{i}|S)=\frac{count(w_{i},S)+\alpha }{count(S)+\alpha \cdot |V|}\]
  1. \(count(w_{i},S) = 0\), \(count(S) = 100\), \(|V|\) (or \(N_{features}\)) \(= 2\), \(\alpha = 1\) \[P(w_{i}|S)=\frac{ 0+1 }{100 +1 \cdot 2} = \frac{1}{102}\]
  2. \(count(w_{i},S) = 0\), \(count(S) = 100\), \(|V|\) (or \(N_{features}\)) \(= 2\), \(\alpha = 10{,}000\) \[P(w_{i}|S)=\frac{ 0+10{,}000 }{100 +10{,}000 \cdot 2} = \frac{10{,}000}{20{,}100} \approx \frac{1}{2}\]

Note: High alpha value may lead to under-fitting; \(\alpha = 1\) recommended.
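The two worked examples above can be checked with a one-line helper:

```python
def laplace_likelihood(word_count, class_count, vocab_size, alpha=1):
    # P(w|y) with additive (Laplace) smoothing: never exactly zero.
    return (word_count + alpha) / (class_count + alpha * vocab_size)

p_small = laplace_likelihood(0, 100, 2, alpha=1)       # 1/102
p_large = laplace_likelihood(0, 100, 2, alpha=10_000)  # 10000/20100, roughly 1/2
```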

Problem # 2

🦀What happens if the number of words ‘d’ is very large ?

👉Multiplying 500 times will result in a number so small a computer 💻 cannot store it (underflow).

Note: Computers have a limit to store floating point numbers, e.g., 32-bit float: \(1.175 \times 10^{-38}\)

Logarithm

💡Take ‘Logarithm’ that will convert the product to sum.

\[P(y|W) \propto \prod_{i=1}^d P(w_i|y)\times P(y)\]

\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]

Note: In the next section we will solve a problem covering all the concepts discussed in this section.



End of Section

1.2.6.3 - Naive Bayes Example

Naive Bayes Example

Naive Bayes

⭐️Simple, fast, and highly effective probabilistic machine learning classifier based on Bayes’ theorem.

\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]

\[P(w_{i}|y)=\frac{count(w_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]
Email Classification Problem

Let’s solve an email classification problem, below we have list of emails (tokenized) and labelled as Spam/Not Spam for training.

images/machine_learning/supervised/naive_bayes/naive_bayes_example/email_classification.png
Training Phase

🏛️Class Priors:

  • P(Spam) = 2/4 =0.5
  • P(Not Spam) = 2/4 = 0.5

📕 Vocabulary = { Free, Money, Inside, Scan, Win, Cash, Click, Link, Catch, Up, Today, Noon, Project, Meeting }

  • |V| = Total unique word count = 14

🧮 Class count:

  • count(Spam) = 9
  • count(Not Spam) = 7

Laplace smoothing: \(\alpha = 1\)

Likelihood of 'free'
\[P(w_{i}|y)=\frac{count(w_{i},y)+\alpha }{count(y)+\alpha \cdot |V|}\]
  • P(‘free’| Spam) = \(\frac{2+1}{9+14} = \frac{3}{23} = 0.13\)
  • P(‘free’| Not Spam) = \(\frac{0+1}{7+14} = \frac{1}{21} = 0.047\)
Inference Time

👉Say a new email 📧 arrives - “Free money today”; lets classify it as Spam/Not Spam.

Spam:

  • P(‘free’| Spam) = \(\frac{2+1}{9+14} = \frac{3}{23} = 0.13\)
  • P(‘money’| Spam) = \(\frac{1+1}{9+14} = \frac{2}{23} = 0.087\)
  • P(‘today’| Spam) = \(\frac{0+1}{9+14} = \frac{1}{23} = 0.043\)

Not Spam:

  • P(‘free’| Not Spam) = \(\frac{0+1}{7+14} = \frac{1}{21} = 0.047\)
  • P(‘money’| Not Spam) = \(\frac{0+1}{7+14} = \frac{1}{21} = 0.047\)
  • P(‘today’| Not Spam) = \(\frac{1+1}{7+14} = \frac{2}{21} = 0.095\)
Final Score 🏏
\[\log(P(y| W)) \propto \sum_{i=1}^d \log(P(w_i|y)) + \log(P(y))\]
  • Score(Spam) = log(P(Spam)) + log(P(‘free’|S)) + log(P(‘money’|S)) + log(P(‘today’|S))
    = log(0.5) + log(0.13) + log(0.087) + log(0.043) = -0.301 -0.886 -1.06 -1.366 = -3.614
  • Score(Not Spam) = log(P(Not Spam)) + log(P(‘free’|NS)) + log(P(‘money’|NS)) + log(P(‘today’|NS))
    = log(0.5) + log(0.047) + log(0.047) + log(0.095) = -0.301 -1.328 -1.328 -1.022 = -3.979

👉Since Score(Spam) = -3.614 > Score(Not Spam) = -3.979, the model chooses ‘Spam’ as the label for the email.
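The whole calculation can be reproduced in a few lines (log base 10, as in the worked scores above; the results differ from -3.614/-3.979 only because the hand calculation rounds the intermediate logs):

```python
import math

# Counts from the training phase above: |V| = 14, alpha = 1.
priors       = {'Spam': 0.5, 'Not Spam': 0.5}
word_counts  = {'Spam':     {'free': 2, 'money': 1, 'today': 0},
                'Not Spam': {'free': 0, 'money': 0, 'today': 1}}
class_totals = {'Spam': 9, 'Not Spam': 7}
V, alpha = 14, 1

def score(label, words):
    s = math.log10(priors[label])               # log prior
    for w in words:
        p = (word_counts[label].get(w, 0) + alpha) / (class_totals[label] + alpha * V)
        s += math.log10(p)                      # summing logs avoids underflow
    return s

email = ['free', 'money', 'today']
spam, not_spam = score('Spam', email), score('Not Spam', email)
label = 'Spam' if spam > not_spam else 'Not Spam'
```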



End of Section

1.3 - Unsupervised Learning

Unsupervised Machine Learning



End of Section

1.3.1 - K Means

K Means Clustering



End of Section

1.3.1.1 - K Means

K Means Clustering

Unsupervised Learning

🌍In real-world systems, labeled data is scarce and expensive 💰.

💡Unsupervised learning discovers inherent structure without human annotation.

👉Clustering answers: “Given a set of points, what natural groupings exist?”

Real-World 🌍 Motivations for Clustering
  • Customer Segmentation: Group users by behavior without predefined categories.
  • Image Compression: Reduce color palette by clustering pixel colors.
  • Anomaly Detection: Points far from any cluster are outliers.
  • Data Exploration: Understand structure before building supervised models.
  • Preprocessing: Create features from cluster assignments.
Key Insight 💡

💡Clustering assumes that ‘similar’ points should be grouped together.

👉But what is ‘similar’? This assumption drives everything.

Problem Statement ✍️

Given:

  • Dataset X = {x₁, x₂, …, xₙ} where xᵢ ∈ ℝᵈ.
  • Desired number of clusters ‘k'.

Find:

  • Cluster assignments C = {C₁, C₂, …, Cₖ}.

  • Such that points within clusters are ‘similar’.

  • And points across clusters are ‘dissimilar’.

    images/machine_learning/unsupervised/k_means/k_means_clustering/slide_07_01.png
Optimization Perspective

This is fundamentally an optimization problem, i.e., find parameters that minimize (or maximize) an objective value. We need:

  • An objective function
    • what makes a clustering ‘good’?
  • An algorithm to optimize it
    • how do we find good clusters?
  • Evaluation metrics
    • how do we measure quality?
Optimization

Objective function:
👉Minimize the within-cluster sum of squares (WCSS).

\[J(C, \mu) = \sum_{j=1}^k \sum_{x_i \in C_j} \underbrace{\|x_i -\mu_j\|^2}_{\text{distance from mean}} \]
  • Where:
  • C = {C₁, …, Cₖ} are cluster assignments.
  • μⱼ is the centroid (mean) of cluster Cⱼ.
  • ||·||² is squared Euclidean distance.

Note: Every point belongs to one and only one cluster.

Variance Decomposition

💡Within-Cluster Sum of Squares (WCSS) is nothing but variance.

⭐️ Total Variance = Within-Cluster Variance + Between-Cluster Variance

👉K-Means minimizes within-Cluster variance, which implicitly maximizes between-cluster separation.

Geometric Interpretation:

  • Each point is ‘pulled’ toward its cluster center.
  • The objective measures total squared distance of all points to their centers.
  • Lower J(C, μ) means tighter, more compact clusters.

Note: K-Means works best when clusters are roughly spherical, similarly sized, and well-separated.

images/machine_learning/unsupervised/k_means/k_means_clustering/slide_11_01.png
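The objective \(J(C, \mu)\) is a few lines of NumPy (toy points, labels and centroids chosen by hand for illustration):

```python
import numpy as np

def wcss(X, labels, centroids):
    # J(C, mu) = sum over clusters of squared distances to the cluster mean
    return sum(float(np.sum((X[labels == j] - centroids[j]) ** 2))
               for j in range(len(centroids)))

X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [5.0, 5.0]])
J = wcss(X, labels, centroids)   # cluster 0 contributes 1 + 1, cluster 1 contributes 0
```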
Combinatorial Explosion 💥

⭐️The problem requires partitioning ’n’ observations into ‘k’ distinct, non-overlapping clusters; the number of such partitions is given by the Stirling number of the second kind, which grows at a rate roughly equal to \(k^n/k!\).

\[S(n,k)=\frac{1}{k!}\sum _{j=0}^{k}(-1)^{k-j}{k \choose j}j^{n}\]

\[S(100,2)=2^{100-1}-1=2^{99}-1\]

\[2^{99}\approx 6.338\times 10^{29}\]

👉This large number of possible combinations makes the problem NP-Hard.

🦉The k-means optimization problem is NP-hard because it belongs to a class of problems for which no efficient (polynomial-time ⏰) algorithm is known to exist.



End of Section

1.3.1.2 - Lloyds Algorithm

Lloyds Algorithm

Idea 💡
Since, we cannot enumerate all partitions (i.e, partitioning ’n’ observations into ‘k’ distinct, non-overlapping clusters), Lloyd’s algorithm provides a local search heuristic (approximate algorithm).
Lloyd's Algorithm ⚙️

Iterative method for partitioning ’n’ data points into ‘k’ groups by repeatedly assigning data points to the nearest centroid (mean) and then recalculating centroids until assignments stabilize, aiming to minimize within-cluster variance.

📥Input: X = {x₁, …, xₙ}, ‘k’ (number of clusters)

📤Output: ‘C’ (clusters), ‘μ’ (centroids)

👉Steps:

  1. Initialize: Randomly choose ‘k’ cluster centroids μ₁, …, μₖ.
  2. Repeat until convergence, i.e, until cluster assignments and centroids no longer change significantly.
  • a) Assignment: Assign each data point to the cluster whose centroid is closest (usually using Euclidean distance).
    • For each point xᵢ: cᵢ = argminⱼ ||xᵢ - μⱼ||²
  • b) Update: Recalculate the centroid (mean) of each cluster.
    • For each cluster j: μⱼ = (1/|Cⱼ|) Σₓᵢ∈Cⱼ xᵢ
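The two steps translate almost line-for-line into NumPy. A fixed initialization is used here so the toy run is reproducible; random initialization is exactly what makes different runs give different clusters:

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iter=100):
    """Plain Lloyd's algorithm with user-supplied initial centroids."""
    mu = np.asarray(init_centroids, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # a) Assignment: each point goes to its nearest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # b) Update: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                           else mu[j] for j in range(len(mu))])
        if np.allclose(new_mu, mu):   # converged: centroids stopped moving
            break
        mu = new_mu
    return labels, mu

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, mu = lloyd_kmeans(X, init_centroids=[[0.0, 0.0], [10.0, 10.0]])
```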
Issues🚨
  • Initialization-sensitive: different initializations may lead to different clusters.
  • Tries to make each cluster the same size, which may not be the case in the real world.
  • Tries to make each cluster have the same density (variance).
  • Does not work well with non-globular (non-spherical) data.

👉See how 2 different runs of K-Means algorithm gives totally different clusters.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_01.png
images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_02.png

👉Also, K-Means does not work well with non-spherical clusters, or clusters with different densities and sizes.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_07_01.png
Solutions
✅ Do multiple runs 🏃‍♂️ and choose the clustering with minimum error.
✅ Do not select initial points purely at random; use a smarter scheme, such as the K-Means++ algorithm.
✅ Use hierarchical clustering or density-based clustering (DBSCAN).



End of Section

1.3.1.3 - K Means++

K Means++ Algorithm

Issues with Random Initialization
  • If two initial centroids belong to the same natural cluster, the algorithm will likely split that natural cluster in half and be forced to merge two other distinct clusters elsewhere to compensate.
  • Inconsistent; different runs may lead to different clusters.
  • Slow convergence; Centroids may need to travel much farther across the feature space, requiring more iterations.

👉Example for different K-Means algorithm runs give different clusters

images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_01.png
images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_02.png
K-Means++ Algorithm

💡Addresses the issue due to random initialization by aiming to spread out the initial centroids across the data points.

Steps:

  1. Select the first centroid: Choose one data point randomly from the dataset to be the first centroid.
  2. Calculate distances: For every data point x not yet selected as a centroid, calculate the distance, D(x), between x and the nearest centroid chosen so far.
  3. Select the next centroid: Choose the next centroid from the remaining data points with a probability proportional to D(x)^2.
    This makes it more likely that a point far from existing centroids is selected, ensuring the initial centroids are well-dispersed.
  4. Repeat: Repeat steps 2 and 3 until ‘k’ number of centroids are selected.
  5. Run standard K-means: Once the initial centroids are chosen, the standard k-means algorithm proceeds with assigning data points to the nearest centroid and iteratively updating the centroids until convergence.
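👉A minimal sketch of the \(D(x)^2\) seeding logic in steps 1-4 (illustrative, not the scikit-learn implementation):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids spread out via D(x)^2-weighted sampling."""
    rng = np.random.default_rng(seed)
    # Step 1: first centroid is a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x)^2 = squared distance to the nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Points coinciding with an existing centroid have D(x)² = 0 and can never be picked again, while far-away points dominate the sampling, which is exactly the dispersal behavior described above.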
Problem 🚨
🦀 If our data is extremely noisy (outliers), the probabilistic logic (\(\propto D(x)^2\)) might accidentally pick an outlier as a cluster center.
Solution ✅
Do robust preprocessing to remove outliers or use K-Medoids algorithm.



End of Section

1.3.1.4 - K Medoid

K Medoid

Issues with K-Means
  • In K-Means, the centroid is the arithmetic mean of the cluster. The mean is very sensitive to outliers.
  • Not interpretable; centroid is the mean of cluster data points and may not be an actual data point, hence not representative.
Medoid

️Medoid is a specific data point from a dataset that acts as the ‘center’ or most representative member of its cluster.

👉It is defined as the object within a cluster whose average dissimilarity (distance) to all other members in that same cluster is the smallest.

K-Medoids (PAM) Algorithm

💡Selects actual data points from the dataset as cluster representatives, called medoids (most centrally located).

👉a.k.a. Partitioning Around Medoids (PAM).

Steps:

  1. Initialization: Select ‘k’ data points from the dataset as the initial medoids using K-Means++ algorithm.
  2. Assignment: Calculate the distance (e.g., Euclidean or Manhattan) from each non-medoid point to all medoids and assign each point to the cluster of its nearest medoid.
  3. Update (Swap):
  • For each cluster, swap current medoid with a non-medoid point from the same cluster.
  • For each swap, calculate the total cost 💰(sum of distances from medoid).
  • Pick the medoid with minimum cost 💰.
  4. Repeat🔁: Repeat the assignment and update steps until convergence, i.e., medoids no longer change or a maximum number of iterations is reached.

Note: This is a somewhat brute-force algorithm and is computationally expensive for large datasets.
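👉A simplified PAM-style sketch of the steps above (random initialization for brevity instead of K-Means++, and swaps restricted to members of the same cluster):

```python
import numpy as np

def k_medoids(X, k, n_iters=20, seed=0):
    """PAM-style sketch: cluster centers (medoids) are actual data points."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)  # random init for brevity
    for _ in range(n_iters):
        labels = D[:, medoids].argmin(axis=1)            # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # Swap step: the member with minimum total cost 💰 becomes the medoid.
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids no longer change
        medoids = new_medoids
    return labels, medoids
```

Because `D` can be any dissimilarity matrix, swapping in Manhattan or cosine distances requires changing only one line, which is the flexibility advantage noted below.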

Advantages
Robust to Outliers: Since medoids are actual data points rather than averages, extreme values (outliers) do not skew the center of the cluster as they do in K-Means.
Flexible Distance Metrics: It can work with any dissimilarity measure (Manhattan, Cosine similarity), making it ideal for categorical or non-Euclidean data.
Interpretable Results: Cluster centers are real observations (e.g., a ‘typical’ customer profile), which makes the output easier to explain to stakeholders.


End of Section

1.3.1.5 - Clustering Quality Metrics

Clustering Quality Metrics

How to Evaluate Quality of Clustering?
  • 👉 Elbow Method: Quickest to compute; good for initial EDA (Exploratory Data Analysis).
  • 👉 Dunn Index: Focuses on the ‘gap’ between the closest clusters.
  • 👉 Silhouette Score: Balances compactness and separation.
  • 👉 Domain specific knowledge and system constraints.
Elbow Method

️Heuristic used to determine the optimal number of clusters (k) for clustering by visualizing how the quality of clustering improves as ‘k’ increases.

🎯The goal is to find a value of ‘k’ where adding more clusters provides a diminishing return in terms of variance reduction.

images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_02_01.png
Dunn Index [0, \(\infty\))

⭐️Clustering quality evaluation metric that measures: separation (between clusters) and compactness (within clusters)

Note: A higher Dunn Index value indicates better clustering, meaning clusters are well-separated from each other and compact.

👉Dunn Index Formula:

\[DI = \frac{\text{Minimum Inter-Cluster Distance(between different clusters)}}{\text{Maximum Intra-Cluster Distance(within a cluster)}}\]

\[DI = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}\]
images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_06_01.png

👉Let’s understand the terms in the above formula:

  • \(\delta(C_i, C_j)\) (Inter-Cluster Distance):

    • Measures how ‘far apart’ the clusters are.
    • Distance between the two closest points of different clusters (Single-Linkage distance). \[\delta(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)\]
  • \(\Delta(C_l)\) (Intra-Cluster Diameter):

    • Measures how ‘spread out’ a cluster is.
    • Distance between the two furthest points within the same cluster (Complete-Linkage distance). \[\Delta(C_l) = \max_{x, y \in C_l} d(x, y)\]
Measure of Closeness
  • Single Linkage (MIN): Uses the minimum distance between any two points in different clusters.
  • Complete Linkage (MAX): Uses the maximum distance between any two points in same cluster.



End of Section

1.3.1.6 - Silhouette Score

Silhouette Score

How to Evaluate Quality of Clustering?
  • Elbow Method: Quickest to compute; good for initial EDA.
  • Dunn Index: Focuses on the ‘gap’ between the closest clusters.
    (We have seen the above 2 methods in the previous section.)
  • 👉 Silhouette Score: Balances compactness and separation.
  • 👉 Domain specific knowledge and system constraints.
Silhouette Score [-1, 1]

⭐️Clustering quality evaluation metric that measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

Note: Higher scores (closer to 1) indicate better-defined, distinct clusters, while scores near 0 suggest overlapping clusters, and negative scores mean points might be in the wrong cluster.

Silhouette Score Formula

Silhouette score for point ‘i’ is the difference between separation b(i) and cohesion a(i), normalized by the larger of the two.

\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]

Note: The Global Silhouette Score is simply the mean of s(i) for all points in the dataset.

👉Example for Silhouette Score:

images/machine_learning/unsupervised/k_means/silhouette_score/slide_04_01.png

👉Example for Silhouette Score of 0 (Border Point) and negative (Wrong Cluster).

images/machine_learning/unsupervised/k_means/silhouette_score/slide_05_01.png

🦉Now let’s understand the terms in Silhouette Score in detail.

Cohesion a(i)

Average distance between point ‘i’ and all other points in the same cluster.

\[a(i) = \frac{1}{|C_A| - 1} \sum_{j \in C_A, i \neq j} d(i, j)\]

Note: Lower a(i) means the point is well-matched to its own cluster.

Separation b(i)

Average distance between point ‘i’ and all points in the nearest neighboring cluster (the cluster that ‘i’ is not a part of, but is closest to).

\[b(i) = \min_{C_B \neq C_A} \frac{1}{|C_B|} \sum_{j \in C_B} d(i, j)\]

Note: Higher b(i) means the point is very far from the next closest cluster.
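👉A direct NumPy translation of a(i), b(i), and s(i) (an illustrative sketch; in practice `sklearn.metrics.silhouette_score` is the usual tool, and it additionally defines s(i) = 0 for singleton clusters, a case this sketch does not handle):

```python
import numpy as np

def silhouette(X, labels):
    """Global silhouette score: mean of s(i) = (b - a) / max(a, b)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    uniq = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = (labels == labels[i])
        # a(i): mean distance to the *other* points of i's own cluster.
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        # b(i): smallest mean distance to any *other* cluster.
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

On the same data, a correct labelling scores close to 1 and a shuffled labelling goes negative, matching the interpretation of the score range above.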

Silhouette Plot

⭐️A silhouette plot is a graphical tool used to evaluate the quality of clustering algorithms (like K-Means), showing how well each data point fits within its cluster.

👉Each bar gives the average silhouette score of the points assigned to that cluster.

images/machine_learning/unsupervised/k_means/silhouette_score/slide_09_01.png
Geometric Interpretation
  • ⛳️ Like K-Means, the Silhouette Score (when using Euclidean distance) assumes convex clusters.

  • 🌘 If we use it on ‘Moon’ shaped clusters, it will give a low score even if the clusters are perfectly separated, because the ‘average distance’ to a neighbor might be small due to the curvature of the manifold.

    images/machine_learning/unsupervised/k_means/silhouette_score/slide_11_01.png
Silhouette Score Vs Dunn Index
images/machine_learning/unsupervised/k_means/silhouette_score/silhouette_score_vs_dunn_index.png

Choose Silhouette Score if:
✅ Need a human-interpretable metric to present to stakeholders.
✅ Dealing with real-world noise and overlapping ‘fuzzy’ boundaries.
✅ Want to see which specific clusters are weak (using the plot).

Choose Dunn Index if:
✅ Performing ‘Hard Clustering’ where separation is a safety or business requirement.
✅ Data is pre-cleaned of outliers (e.g., in a curated embedding space).
✅ Need to compare different clustering algorithms (e.g., K-Means vs. DBSCAN) on a high-integrity task.



End of Section

1.3.2 - Hierarchical Clustering

Hierarchical Clustering



End of Section

1.3.2.1 - Hierarchical Clustering

Hierarchical Clustering
Issues with K-Means
  • 🤷 We might not know in advance the number of distinct clusters ‘k’ in the dataset.
  • 🕸️ Also, sometimes the dataset may contain a nested structure or some inherent hierarchy, such as, file system, organizational chart, biological lineages, etc.
Hierarchical Clustering

⭐️ Method of cluster analysis that seeks to build a hierarchy of clusters, resulting in a tree like structure called dendrogram.

👉Hierarchical clustering allows us to explore different possibilities (of ‘k’) by cutting the dendrogram at various levels.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_03_01.png
2 Philosophies

Agglomerative (Bottom-Up):
Most common; also known as Agglomerative Nesting (AGNES).

  1. Every data point starts as its own cluster.
  2. In each step, merge the two ‘closest’ clusters.
  3. Repeat step 2, until all points are merged in a single cluster.

Divisive (Top-Down):

  1. All data points start in one large cluster.
  2. In each step, divide the cluster into two halves.
  3. Repeat step 2, until every point is its own cluster.

Agglomerative Clustering Example:

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_05_01.png
Closeness of Clusters
  • Ward’s Method:
    • Merges clusters to minimize the increase in the total within-cluster variance (sum of squared errors), resulting in compact, equally sized clusters.
  • Single Linkage (MIN):
    • Uses the minimum distance between any two points in different clusters.
    • Prone to creating long, ‘chain-like’ 🔗 clusters, sensitive to outliers.
  • Complete Linkage (MAX):
    • Uses the maximum distance between any two points in different clusters.
    • Forms tighter, more spherical clusters, less sensitive to outliers than single linkage.
  • Average Linkage:
    • Combines clusters by the average distance between all points in two clusters, offering a compromise between single and complete linkage.
    • A good middle ground, often overcoming limitations of single and complete linkage.
  • Centroid Method:
    • Merges clusters based on the distance between their centroids (mean points).
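👉A tiny bottom-up (agglomerative) sketch showing how the linkage choice plugs into the merge loop (illustrative only; `scipy.cluster.hierarchy.linkage` is the standard tool and also builds the dendrogram):

```python
import numpy as np

def agglomerative(X, k, linkage="single"):
    """Bottom-up sketch: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(X))]           # every point starts alone
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    link = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: link(D[np.ix_(clusters[p[0]], clusters[p[1]])]),
        )
        clusters[i] = clusters[i] + clusters[j]       # merge the pair
        del clusters[j]
    return clusters
```

Swapping `"single"` for `"complete"` or `"average"` changes only how the distance between two clusters is aggregated, which is exactly the difference between the linkage methods listed above.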

👉Single Linkage is more sensitive to outliers than Complete Linkage, as Single Linkage can keep linking to the closest point, forming a bridge to an outlier.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_07_01.png

👉All cluster linkage distances.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_09_01.png

👉We get different clustering using different linkages.

images/machine_learning/unsupervised/hierarchical_clustering/hierarchical_clustering/slide_10_01.png



End of Section

1.3.3 - DBSCAN

Density Based Spatial Clustering of Application with Noise



End of Section

1.3.3.1 - DBSCAN

Density Based Spatial Clustering of Application with Noise
Issues with K-Means
  • Non-Convex Shapes: K-Means can not find ‘crescent’ or ‘ring’ shape clusters.
  • Noise: K-Means forces every point into a cluster, even outliers.
Main Question for Clustering ?

👉K-Means asks:

  • “Which center is closest to this point?”

👉DBSCAN asks:

  • “Is this point part of a dense neighborhood?”
Intuition 💡
Cluster is a contiguous region of high density in the data space, separated from other clusters by areas of low density.
DBSCAN

⭐️Groups closely packed data points into clusters based on their density, and marks points that lie alone in low-density regions as outliers or noise.

Note: Unlike K-means, DBSCAN can find arbitrarily shaped clusters and does not require the number of clusters to be specified beforehand.

Hyper-Parameters 🎛️
  1. Epsilon (eps or \(\epsilon\)):
  • Radius that defines the neighborhood around a data point.
  • If it is too small, many points will be noise; if too large, distinct clusters may merge.
  2. Minimum Points (minPts or min_samples):
  • Minimum number of data points required within a point’s ϵ-neighborhood for that point to be considered part of a dense region (a core point).
  • Defines the threshold for ‘density’.
  • Rule of thumb: minPts ≥ dimensions + 1; use a larger value for noisy data (minPts ≥ 2 × dimensions).
Types of Points
  • Core Point:
    • If it has at least minPts points (including itself) within its ϵ-neighborhood.
    • Forms the dense backbone of the clusters and can expand them.
  • Border Point:
    • If it has fewer than minPts points within its ϵ-neighborhood but falls within the ϵ-neighborhood of a core point.
    • Border points belong to a cluster but cannot expand it further.
  • Noise Point (Outlier):
    • If it is neither a core point nor a border point, i.e., it is not density-reachable from any core point.
    • Not assigned to any cluster.
DBSCAN Algorithm ⚙️
  1. Random Start:
  • Mark all points as unvisited; pick an arbitrary unvisited point ‘P’ from the dataset.
  2. Density Check:
  • Check the point P’s ϵ-neighborhood.
  • If ‘P’ has at least minPts points in its neighborhood, it is identified as a core point, and a new cluster is started.
  • If it has fewer than minPts, the point is temporarily labeled as noise (it might become a border point later).
  3. Cluster Expansion:
  • Recursively visit all points in P’s ϵ-neighborhood.
  • If they are also core points, their neighbors are added to the cluster.
  4. Iteration 🔁:
  • This ‘density-reachable’ logic continues until the cluster is fully expanded.
  • The algorithm then picks another unvisited point and repeats the process, discovering new clusters or marking more points as noise until all points are processed.
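👉The four steps above, sketched with a breadth-first expansion (illustrative only; in practice use `sklearn.cluster.DBSCAN`, which labels noise as -1 in the same way):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 is noise."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(len(X))]
    labels = np.full(len(X), -1)
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for p in range(len(X)):
        if visited[p] or len(neighbors[p]) < min_pts:
            continue  # skip visited points and non-core points
        # Start a new cluster at core point p and expand it (BFS).
        queue = deque([p])
        visited[p] = True
        labels[p] = cluster
        while queue:
            q = queue.popleft()
            if len(neighbors[q]) < min_pts:
                continue  # border point: joins the cluster but cannot expand it
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = cluster  # claim an unlabeled/noise point
                if not visited[r]:
                    visited[r] = True
                    queue.append(r)
        cluster += 1
    return labels
```

Note that points first marked noise can later be claimed as border points, exactly as step 2 allows.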

👉DBSCAN can correctly detect non-spherical clusters.

images/machine_learning/unsupervised/dbscan/dbscan/slide_11_01.png

👉DBSCAN Points and Epsilon-Neighborhood.

images/machine_learning/unsupervised/dbscan/dbscan/slide_12_01.png
When DBSCAN Fails ?
  • Varying Density Clusters:
    • Say cluster A is very dense and cluster B is sparse; a single ϵ cannot satisfy both clusters.
  • High Dimensional Data:
    • ‘Curse of Dimensionality’ - In high-dimensional space, the distances between any two points converge (become nearly equal).

Note: Sensitive to the parameters eps and minPts; tricky to get it to work.

👉DBSCAN Failure and Epsilon (\(\epsilon\))

images/machine_learning/unsupervised/dbscan/dbscan/slide_14_01.png
When to use DBSCAN ?
  • Arbitrary Cluster Shapes:

    • When clusters are intertwined, nested, or ‘moon-shaped’; where K-Means would fail by splitting them.
  • Presence of Noise and Outliers:

    • Robust to noise and outliers because it explicitly identifies low-density points as noise (labeled as -1) rather than forcing them into a cluster.
    images/machine_learning/unsupervised/dbscan/dbscan/snake_shape.png



End of Section

1.3.4 - Gaussian Mixture Model

Gaussian Mixture Model (GMM)



End of Section

1.3.4.1 - Gaussian Mixture Models

Introduction to Gaussian Mixture Models

Issue with K-Means
  • K-means uses Euclidean distance and assumes that clusters are spherical (isotropic) and have the same variance across all dimensions.
  • Places a circle or sphere around each centroid.
    • What if the clusters are elliptical ? 🤔

👉K-Means Fails with Elliptical Clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_02_01.png
Gaussian Mixture Model (GMM)

💡GMM: ‘Probabilistic evolution’ of K-Means

⭐️ GMM provides soft assignments and can model elliptical clusters by accounting for variance and correlation between features.

Note: GMM assumes that all data points are generated from a mixture of a finite number of Gaussian Distributions with unknown parameters.

👉GMM can Model Elliptical Clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_03_01.png
What is a Mixture Model

💡‘Combination of probability distributions’.

👉Soft Assignment: Instead of a simple ‘yes’ or ’no’ for cluster membership, a data point is assigned a set of probabilities, one for each cluster.

e.g: A data point might have a 60% probability of belonging to cluster ‘A’, 30% probability for cluster ‘B’, and 10% probability for cluster ‘C’.

👉Gaussian Mixture Model Example:

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_07_01.png
Gaussian PDF
\[{\displaystyle {\mathcal {N}}({\boldsymbol {\mu }},\,{\boldsymbol {\sigma }})}: f(x\,|\,\mu ,\sigma ^{2})=\frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left\{-\frac{(x-\mu )^{2}}{2\sigma ^{2}}\right\}\]

\[ \text{ Multivariate Gaussian, } {\displaystyle {\mathcal {N}}({\boldsymbol {\mu }},\,{\boldsymbol {\Sigma }})}: f(\mathbf{x}\,|\,\mathbf{\mu },\mathbf{\Sigma })=\frac{1}{\sqrt{(2\pi )^{n}|\mathbf{\Sigma }|}}\exp \left\{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu })^{T}\mathbf{\Sigma }^{-1}(\mathbf{x}-\mathbf{\mu })\right\}\]

Note: The term \(1/(\sqrt{2\pi \sigma ^{2}})\) is a normalization constant to ensure the total area under the curve = 1.
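👉The multivariate density above, term by term, as a small illustrative helper (the function name is ours; `scipy.stats.multivariate_normal` provides the same in practice):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, written directly from the formula above."""
    n = len(mu)
    diff = x - mu
    # Normalization constant: 1 / sqrt((2*pi)^n * |Sigma|)
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    # Quadratic form in the exponent: -1/2 (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    return norm_const * np.exp(exponent)
```

For n = 1 this reduces to the univariate PDF, with \(|\Sigma| = \sigma^2\).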

👉Multivariate Gaussian Example:

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_08_01.png
Gaussian Mixture

Whenever we have multivariate Gaussian, then the variables may be independent or correlated.

👉Feature Covariance:

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_10_01.png

👉Gaussian Mixture with PDF

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/slide_11_01.png

👉Gaussian Mixture (2D)

images/machine_learning/unsupervised/gaussian_mixture_model/introduction_gaussian_mixture_models/gmm_2d.png



End of Section

1.3.4.2 - Latent Variable Model

Latent Variable Model

Gaussian Mixture Model (GMM)

⭐️Probabilistic model that assumes data is generated from a mixture of several Gaussian (normal) distributions with unknown parameters.

🎯GMM represents the probability density function of the data as a weighted sum of ‘K’ component Gaussian densities.

👉Below plot shows the probability of a point being generated by 3 different Gaussians.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_01_01.png
Gaussian Mixture PDF

Overall density \(p(x_i|\mathbf{\theta })\) for a data point ‘\(x_i\)’:

\[p(x_i|\mathbf{\mu},\mathbf{\Sigma} )=\sum _{k=1}^{K}\pi _{k}\mathcal{N}(x_i|\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})\]
  • K: number of component Gaussians.
  • \(\pi_k\): mixing coefficient (weight) of the k-th component, such that, \(\pi_k \ge 0\) and \(\sum _{k=1}^{K}\pi _{k}=1\).
  • \(\mathcal{N}(x_i|\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})\): probability density function of the k-th Gaussian component with mean \(\mu_k\) and covariance matrix \(\Sigma_k\).
  • \(\mathbf{\theta }=\{(\pi _{k},\mathbf{\mu }_{k},\mathbf{\Sigma }_{k})\}_{k=1}^{K}\): complete set of parameters to be estimated.

Note: \(\pi _{k}\approx \frac{\text{Number\ of\ points\ in\ cluster\ }k}{\text{Total\ number\ of\ points\ }(N)}\)

👉 Weight of the cluster is proportional to the number of points in the cluster.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_04_01.png

👉Below image shows the weighted Gaussian PDF, given the weights of clusters.

images/machine_learning/unsupervised/gaussian_mixture_model/latent_variable_model/slide_05_01.png
GMM Optimization (Why MLE Fails?)

🎯 Goal of a GMM optimization is to find the set of parameters \(\Theta =\{(\pi _{k},\mu _{k},\Sigma _{k})\mid k=1,\dots ,K\}\) that maximize the likelihood of observing the given data.

\[L(\Theta |X)=\sum _{i=1}^{N}\log P(x_i|\Theta )=\sum _{i=1}^{N}\log \left(\sum _{k=1}^{K}\pi _{k}\mathcal{N}(x_i|\mu _{k},\Sigma _{k})\right)\]
  • 🦀 \(\log (A+B)\) cannot be simplified.
  • 🦉So, is there any other way ?
Latent Variable (Intuition 💡)

⭐️Imagine we are measuring the heights of people in a college.

  • We see a distribution with two peaks (bimodal).
  • We suspect there are two underlying groups:
    • Group A (Men) and Group B (Women).

Observation:

  • Observed Variable (X): Actual height measurements.
  • Latent Variable (Z): The ‘label’ (Man or Woman) for each person.

Note: We did not record gender, so it is ‘hidden’ or ‘latent’.

Latent Variable Model
A Latent Variable Model assumes that the observed data ‘X’ is generated by first picking a latent state ‘z’ and then drawing a sample from the distribution associated with that state.
GMM as Latent Variable Model

⭐️GMM is a latent variable model, meaning each data point \(\mathbf{x}_{i}\) is assumed to have an associated unobserved (latent) variable \(z_{i}\in \{1,\dots ,K\}\) indicating which component generated it.

Note: We observe the data point, but we do not observe which cluster it belongs to (\(z_i\)).

Latent Variable Purpose

👉If we knew the value of \(z_i\) (component indicator) for every point, estimating the parameters of each Gaussian component would be straightforward.

Note: The challenge lies in estimating both the parameters of the Gaussians and the values of the latent variables simultaneously.

Cluster Indicator (z) & Log Likelihood (sum)
  • With ‘z’ unknown:
    • maximize: \[ \log \sum _{k}\pi _{k}\mathcal{N}(x_{i}|\mu _{k},\Sigma _{k}) = \log \Big(\pi _{1}\mathcal{N}(x_{i}\mid \mu _{1},\Sigma _{1})+\pi _{2}\mathcal{N}(x_{i}\mid \mu _{2},\Sigma _{2})+ \dots + \pi _{k}\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})\Big)\]
      • \(\log (A+B)\) cannot be simplified.
  • With ‘z’ known:
    • The log-likelihood of the ‘complete data’ simplifies into a sum of logarithms: \[\sum _{i}\log (\pi _{z_{i}}\mathcal{N}(x_{i}|\mu _{z_{i}},\Sigma _{z_{i}}))\]
      • Every point is assigned to exactly one cluster, so the sum disappears because there is only one cluster responsible for that point.

Note: This allows the logarithm to act directly on the exponential term of the Gaussian, leading to simple linear equations.

Hard Assignment Simplifies Estimation

👉When ‘z’ is known, every data point is ‘labeled’ with its parent component.
To estimate the parameters (mean \(\mu_k\) and covariance \(\Sigma_k\)) for a specific component ‘k’ :

  • Gather all data points \(x_i\), where \(z_i\)= k.
  • Calculate the standard Maximum Likelihood Estimate (MLE) for that single Gaussian using only those points.
Closed-Form Solution

⭐️ Knowing ‘z’ provides exact counts and component assignments, leading to direct formulae for the parameters:

  • Mean (\(\mu_k\)): Arithmetic average of all points assigned to component ‘k’.
  • Covariance (\(\Sigma_k\)): Sample covariance of all points assigned to component ‘k’.
  • Mixing Weight (\(\pi_k\)): Fraction of total points assigned to component ‘k’.



End of Section

1.3.4.3 - Expectation Maximization

Expectation Maximization

GMM as Latent Variable Model
⭐️GMM is a latent variable model, where the variable \(z_i\) is a latent (hidden) variable that indicates which specific Gaussian component or cluster generated a particular data point.
Chicken 🐓 & Egg 🥚 Problem
  • If we knew the parameters (\(\mu, \Sigma, \pi\)) we could easily calculate which cluster ‘z’ each point belongs to (using probability).
  • If we knew the cluster assignments ‘z’ of each point, we could easily calculate the parameters for each cluster (using simple averages).

🦉But we know neither: the parameters of the Gaussians are what we aim to find, and the cluster-indicator latent variable is hidden.

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_03_01.png
images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_04_01.png
Break the Loop 🔁
⛓️‍💥Guess one, i.e, cluster assignment ‘z’ to find the other, i.e, parameters \(\mu, \Sigma, \pi\).
Goal 🎯

⛳️ Find latent cluster indicator variable \(z_{ik}\).

But \(z_{ik}\) is a ‘hard’ assignment (either ‘0’ or ‘1’).

  • 🦆 Because we do not observe ‘z’, we use another variable ‘Responsibility’ (\(\gamma_{ik}\)) as a ‘soft’ assignment (value between 0 and 1).
  • 🐣 \(\gamma_{ik}\) is the expected value of the latent variable \(z_{ik}\), given the observed data \(x_{i}\) and parameters \(\Theta\). \[\gamma _{ik}=E[z_{ik}\mid x_{i},\theta ]=P(z_{ik}=1\mid x_{i},\theta )\]

Note: \(\gamma_{ik}\) is the posterior probability (or ‘responsibility’) that cluster ‘k’ takes for explaining data point \(x_{i}\).

Indicator Variable ➡ Responsibility
\[\gamma _{ik}=E[z_{ik}\mid x_{i},\theta ]=P(z_{ik}=1\mid x_{i},\theta )\]

⭐️Using Bayes’ Theorem, we derive responsibility (posterior probability that component ‘k’ generated data point \(x_i\)) by combining the prior/weights (\(\pi_k\)) and the likelihood (\(\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})\)).

\[\gamma _{ik}= P(z_{ik}=1\mid x_{i},\theta ) = \frac{P(z_{ik}=1)P(x_{i}\mid z_{ik}=1)}{P(x_{i})}\]

\[\gamma _{ik}=\frac{\pi _{k}\mathcal{N}(x_{i}\mid \mu _{k},\Sigma _{k})}{\sum _{j=1}^{K}\pi _{j}\mathcal{N}(x_{i}\mid \mu _{j},\Sigma _{j})}\]

👉Bayes’ Theorem: \(P(A|B)=\frac{P(B|A)\cdot P(A)}{P(B)}\)

👉 GMM Densities at point

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_08_01.png

👉GMM Densities at point (different cluster weights)

images/machine_learning/unsupervised/gaussian_mixture_model/expectation_maximization/slide_09_01.png
Expectation Maximization Algorithm ⚙️
  1. Initialization: Assign initial values to parameters (\(\mu, \Sigma, \pi\)), often using K-Means results.
  2. Expectation Step (E): Calculate responsibilities; provides ‘soft’ assignments of points to clusters.
  3. Maximization Step (M): Update parameters using responsibilities as weights to maximize the expected log-likelihood.
  4. Convergence: Repeat ‘E’ and ‘M’ steps until the change in log-likelihood falls below a threshold.
Expectation Step

👉Given our current guess of the clusters, what is the probability that point \(x_i\) came from cluster ‘k’ ?

\[\gamma (z_{ik})=p(z_{i}=k|\mathbf{x}_{i},\mathbf{\theta }^{(\text{old})})=\frac{\pi _{k}^{(\text{old})}\mathcal{N}(\mathbf{x}_{i}|\mathbf{\mu }_{k}^{(\text{old})},\mathbf{\Sigma }_{k}^{(\text{old})})}{\sum _{j=1}^{K}\pi _{j}^{(\text{old})}\mathcal{N}(\mathbf{x}_{i}|\mathbf{\mu }_{j}^{(\text{old})},\mathbf{\Sigma }_{j}^{(\text{old})})}\]

\(\pi_k\) : Probability that a randomly selected data point \(x_i\) belongs to the k-th component before we even look at the specific value of \(x_i\), such that \(\pi_k \ge 0\) and \(\sum _{k=1}^{K}\pi _{k}=1\).

Maximization Step

👉Update the parameters (\(\mu, \Sigma, \pi\)) by calculating weighted versions of the standard MLE formulas using responsibilities as weight 🏋️‍♀️.

\[ \begin{align*} &\bullet \mathbf{\mu }_{k}^{(\text{new})}=\frac{1}{N_{k}}\sum _{i=1}^{N}\gamma (z_{ik})\mathbf{x}_{i} \\ &\bullet \mathbf{\Sigma }_{k}^{(\text{new})}=\frac{1}{N_{k}}\sum _{i=1}^{N}\gamma (z_{ik})(\mathbf{x}_{i}-\mathbf{\mu }_{k}^{(\text{new})})(\mathbf{x}_{i}-\mathbf{\mu }_{k}^{(\text{new})})^{\top } \\ &\bullet \pi _{k}^{(\text{new})}=\frac{N_{k}}{N} \end{align*} \]
  • where, \(N_{k}=\sum _{i=1}^{N}\gamma (z_{ik})\) is the effective number of points assigned to component ‘k'.
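👉Both steps in one loop, as a bare-bones NumPy sketch of the formulas above (the spread-out initialization and the small variance floor are our simplifications; K-Means initialization is common in practice):

```python
import numpy as np

def em_gmm(X, k, n_iters=100):
    """Bare-bones EM for a GMM: alternate responsibilities and weighted MLE."""
    n, d = X.shape
    # Init: means at spread-out data points; shared covariance; uniform weights.
    mu = X[np.linspace(0, n - 1, k).astype(int)].copy()
    Sigma = np.array([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: gamma_ik ∝ pi_k * N(x_i | mu_k, Sigma_k)
        dens = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv, det = np.linalg.inv(Sigma[j]), np.linalg.det(Sigma[j])
            quad = np.einsum("ij,jk,ik->i", diff, inv, diff)
            dens[:, j] = pi[j] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * det)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates, with responsibilities as weights
        Nk = gamma.sum(axis=0)                       # effective cluster sizes
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
        pi = Nk / n
    return pi, mu, Sigma, gamma
```

Each row of `gamma` sums to 1: every point distributes one unit of ‘responsibility’ across the k components, which is the soft assignment discussed above.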



End of Section

1.3.5 - Anomaly Detection

Anomaly/Outlier/Novelty Detection



End of Section

1.3.5.1 - Anomaly Detection

Anomaly Detection Introduction

What is Anomaly?

🦄 Anomaly is a rare item, event or observation which deviates significantly from the majority of the data and does not conform to a well-defined notion of normal behavior.

Note: Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

Anomaly Detection
🐙 Anomaly detection (Outlier detection or Novelty detection) is the identification of unusual patterns or anomalies or outliers in a given dataset.
What to do with Outliers ?

Remove Outliers:

  • Rejection or omission of outliers from the data to aid statistical analysis, for example to compute the mean or standard deviation of the dataset.
  • Remove outliers for better predictions from models, such as linear regression.

🔦Focus on Outliers:

  • Fraud detection in banking and financial services.
  • Cyber-security: intrusion detection, malware, or unusual user access patterns.
Anomaly Detection Methods 🐉
  • Supervised
  • Semi-Supervised
  • Unsupervised (most common) ✅

Note: Labeled anomaly data is often unavailable in real-world scenarios.

Known Methods 🐈
  • Statistical Methods: Z-Score (a large absolute value indicates an outlier); IQR (a point beyond the fences Q1 - 1.5*IQR or Q3 + 1.5*IQR is flagged as an outlier).
  • Distance Based: KNN, points far from their neighbors as potential anomalies.
  • Density Based: DBSCAN, points in low density regions are considered outliers.
  • Clustering Based: K-Means, points far from cluster centroids that do not fit any cluster are anomalies.
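👉The two statistical methods above, as small illustrative helpers (the function names are ours):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x):
    """Flag points beyond the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

On a sample like [1, 2, 3, 4, 5, 100], the outlier inflates the standard deviation so much that the default 3σ Z-Score rule misses it, while the IQR fences catch it; this skewing is exactly the weakness that motivates the robust methods in the following sections.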
Unsupervised Methods 🦅
  • Elliptic Envelope (MCD - Minimum Covariance Determinant)
  • One-Class SVM (OC-SVM)
  • Local Outlier Factor (LOF)
  • Isolation Forest (iForest)
  • RANSAC (Random Sample Consensus)



End of Section

1.3.5.2 - Elliptic Envelope

Elliptic Envelope

Use Case 🐝

Detect anomalies in multivariate Gaussian data, such as, biometric data (height/weight) where features are normally distributed and correlated.

Assumption: The data can be modeled by a Gaussian distribution.

Intuition 💡

In a normal distribution, most data points cluster around the mean, and the probability density decreases as we move farther away from the center.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_03_01.png
Issue with Euclidean Distance 🐲

🌍 Euclidean distance measures the simple straight-line distance from the center of the cloud.

👉If the data is spherical, this works fine.

🦕 However, real-world data is often stretched or skewed (e.g., taller people are generally heavier), due to correlations between variables, forming an elliptical shape.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_05_01.png
Mahalanobis Distance (Solution)

⭐️Mahalanobis distance essentially re-scales the data so that the elliptical distribution appears spherical, and then measures the Euclidean distance in that transformed space.

👉This way, it measures how many standard deviations(\(\sigma\)) away a point is from the mean, considering the data’s spread and correlation (covariance).

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]
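A minimal NumPy sketch of \(D_M\), assuming hypothetical correlated 2-D data: two points with the same Euclidean norm get very different Mahalanobis distances depending on whether they lie along the correlation axis.

```python
import numpy as np

# Hypothetical correlated 2-D data (e.g., height vs. weight), roughly mean-centred.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=500)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x):
    # D_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# Same Euclidean norm, but one point lies along the correlation axis,
# the other against it.
on_axis = np.array([2.0, 2.0])
off_axis = np.array([2.0, -2.0])
```

The off-axis point gets a much larger \(D_M\), exactly the behaviour Euclidean distance misses.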
Problem 🦀
Standard methods (like Z-score \(z = \frac{x-\mu}{\sigma}\)) fail because they are easily skewed by the outliers they are trying to find.
Solution 🦉
💡Instead of using all data, we find a ‘clean’ subset of the data that is most tightly packed and use only that subset to define the ‘normal’ ellipse.
Goal 🎯

👉Find the most dense core of the data.

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]

🦣 Determinant of covariance matrix \(\Sigma\) represents the volume of the ellipsoid.

⏲️ Therefore, minimize \(|\Sigma|\) to find the tight core.

👉 \(\text {Small} ~ \Sigma \rightarrow \text {Large} ~\Sigma ^{-1}\rightarrow \text {Large} ~ D_{M} ~\text {for outliers}\)

Minimum Covariance Determinant (MCD) Algorithm ⚙️

MCD algorithm is used to find the covariance matrix \(\Sigma\) with minimum determinant, so that the volume of the ellipsoid is minimized.

  1. Initialization: Select several random subsets of size h < n (default h = \(\frac{n+d+1}{2}\), d = # dimensions), each representing a ‘robust’ majority of the data.
  2. Calculate preliminary mean (\(\mu\)) and covariance (\(\Sigma\)) for each random subset.
  3. Concentration Step: Iterative core of the algorithm designed to ‘tighten’ the ellipsoid.
    • Calculate Distances: Compute the Mahalanobis distance of all ’n’ points in the dataset from the current subset’s mean (\(\mu\)) and covariance (\(\Sigma\)).
    • Select New Subset: Identify the ‘h’ points with the smallest Mahalanobis distances.
      • These are the points most centrally located relative to the current ellipsoid.
    • Update Estimates: Calculate a new \(\mu\) and \(\Sigma\) based only on these ‘h’ most central points.
    • Repeat 🔁: The steps repeat until the determinant stops shrinking.

Note: Select the best subset that achieved the absolute minimum determinant.

Limitations
  • Assumptions
    • Gaussian data.
    • Unimodal data (single center).
  • Cost 💰 of inverting the covariance matrix (\(\Sigma^{-1}\)) is \(O(d^3)\).



End of Section

1.3.5.3 - One Class SVM

One Class SVM

Use Case (Novelty Detection)🐝

⭐️Only one class of data (normal, non-outlier) is available for training, making standard supervised learning models impossible.

e.g. Only normal observations are available in fraud detection, cyber-attack detection, fault detection, etc.

Intuition
images/machine_learning/unsupervised/anomaly_detection/one_class_svm/slide_02_01.png
Problem 🦀

🦂 The core problem is to build a model that can distinguish between ‘normal’ and ‘anomalous’ data when we only have examples of the ‘normal’ class during training.

🦖 We need to find a decision boundary that is as compact as possible while still encompassing the bulk of the training data.

Solution 🦉
💡Instead of finding a hyperplane that separates two different classes, we find a hyperplane that best separates the normal data points from the origin (0,0) in the feature space 🚀.
Goal 🎯

🦍 Define a boundary for a single class in high-dimensional space where data might be non-linearly distributed (e.g. a ‘U’ shape).

🦧 Use the Kernel Trick to project data into a higher-dimensional space and find a hyperplane that separates the data from the origin with the maximum margin.

One Class SVM

⭐️OC-SVM, as introduced by Bernhard Schölkopf et al., uses a hyperplane ‘H’ defined by a weight vector \(\mathbf{w}\) and a bias term \(\rho\).

👉Solve the following optimization problem:

\[\min _{\mathbf{w},\xi _{i},\rho }\frac{1}{2}||\mathbf{w}||^{2}+\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}-\rho \]

Subject to constraints:

\[\mathbf{w}\cdot \phi (\mathbf{x}_{i})\ge \rho -\xi _{i}\quad \text{and}\quad \xi _{i}\ge 0,\quad \text{for\ }i=1,\dots ,N\]
Explanation of Terms
  • \(\mathbf{x}_{i}\): i-th training data point.
  • \(\phi (\mathbf{x}_{i})\): feature map induced by the RBF kernel \(K(x, y) = \exp(-\gamma \|x-y\|^2)\); it maps the data into a higher-dimensional feature space, making it easier to separate from the origin.
  • \(\mathbf{w}\): normal vector to the separating hyperplane.
  • \(\rho\): scalar bias term that determines the offset of the hyperplane from the origin.
  • \(\xi_i\): Slack variables that allow some data points to fall on the ‘wrong’ side of the hyperplane (inside the anomalous region) to prevent overfitting.
  • N: total number of training points.
  • \(\nu\): hyper-parameter between 0 and 1. It acts as an upper bound on the fraction of outliers (training data points outside the boundary) and a lower bound on the fraction of support vectors.
Working 🦇
  • \(\frac{1}{2}\|\mathbf{w}\|^{2}\): aims to maximize the margin/compactness of the region.
  • \(\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}-\rho\): penalizes points (outliers) that violate the boundary constraints.

After solving the optimization problem using standard quadratic programming techniques, we obtain the optimal \(\mathbf{w}^{*}\) and \(\rho ^{*}\).

For a new data point \(x_{new}\), decision function is:

\[f(\mathbf{x}_{\text{new}})=\text{sign}(\mathbf{w}^{*}\cdot \phi (\mathbf{x}_{\text{new}})-\rho ^{*})\]
  • \(f(\mathbf{x}_{\text{new}})\ge 0\): normal point.

  • \(f(\mathbf{x}_{\text{new}})< 0\): anomalous point (outlier).

    images/machine_learning/unsupervised/anomaly_detection/one_class_svm/oc_svm_plot.png
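The decision rule above can be tried with scikit-learn's OneClassSVM; this is a sketch, not the course's code, and the training blob and the two test points are made-up illustrative data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on 'normal' points only (a tight Gaussian blob); no labels needed.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# nu bounds the fraction of training points allowed outside the boundary.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# Score a clearly normal point and a clearly anomalous one:
# predict() returns +1 for normal, -1 for outlier (the sign of f above).
preds = clf.predict(np.array([[0.0, 0.0], [4.0, 4.0]]))
```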



End of Section

1.3.5.4 - Local Outlier Factor

Local Outlier Factor

Use Case 🐝
⭐️Geographic fraud detection:
A $100 transaction might be ‘normal’ in New York but an ‘outlier’ in a small rural village.
Intuition 💡

‘Local context matters.’

Global distance metrics fail when density is non-uniform.

🦄 An outlier is a point that is ‘unusual’ relative to its immediate neighbors, regardless of how far it is from the center of the entire dataset.

Problem 🦀

💡Traditional distance-based outlier detection methods, such as, KNN, often struggle with datasets where data is clustered at varying densities.

  • A point in a sparse region might be considered an outlier by a global method, even if it is a normal part of that sparse cluster.
  • Conversely, a point just outside a dense cluster might be missed by a global method, even though it is an outlier relative to that specific neighborhood.
Solution 🦉

👉Calculate the relative density of a point compared to its immediate neighborhood.

e.g. If the neighbors are in a dense crowd and the point is not, it is an outlier.

Goal 🎯
📌Compare the density of a point to the density of its neighbors.
Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is a density-based algorithm designed to detect anomalies by measuring the local deviation of a data point relative to its neighbors.

👉Size of the red circle represents the LOF score.

images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_08_01.png
LOF Score Calculation 🔢
  1. K-Distance (\(k\text{-dist}(p)\)):
    • The distance from point ‘p’ to its k-th nearest neighbor.
  2. Reachability Distance (\(\text{reach-dist}_{k}(p,o)\)): \[\text{reach-dist}_{k}(p,o)=\max \{k\text{-dist}(o),\text{dist}(p,o)\}\]
    • \(\text{dist}(p,o)\): actual Euclidean distance between ‘p’ and ‘o’.
    • This acts as ‘smoothing’ factor.
    • If point ‘p’ is very close to ‘o’ (inside o’s k-neighborhood), round up distance to \(k\text{-dist}(o)\).
  3. Local Reachability Density (\(\text{lrd}_{k}(p)\)):
    • The inverse of the average reachability distance from ‘p’ to its k-neighbors (\(N_{k}(p)\)). \[\text{lrd}_{k}(p)=\left[\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\text{reach-dist}_{k}(p,o)\right]^{-1}\]
      • High LRD: Neighbors are very close; the point is in a dense region.
      • Low LRD: Neighbors are far away; the point is in a sparse region.
  4. Local Outlier Factor (\(\text{LOF}_{k}(p)\)):
    • The ratio of the average ‘lrd’ of p’s neighbors to p’s own ‘lrd’. \[\text{LOF}_{k}(p)=\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\frac{\text{lrd}_{k}(o)}{\text{lrd}_{k}(p)}\]
images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_10_01.png
LOF Score 🔢 Interpretation
  • LOF ≈ 1: Point ‘p’ has similar density to its neighbors (inlier).
  • LOF > 1: Point p’s density is much lower than its neighbors’ density (outlier).
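The four steps above can be sketched directly in NumPy on toy data (a real application would use a library implementation such as scikit-learn's LocalOutlierFactor):

```python
import numpy as np

def lof_scores(X, k=3):
    """Plain-numpy LOF, following the four steps above."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    # 1. k-distance and k-neighborhood of each point (column 0 is the point itself).
    order = np.argsort(D, axis=1)
    neigh = order[:, 1:k + 1]
    k_dist = D[np.arange(n), neigh[:, -1]]
    # 2.-3. Reachability distances and local reachability density.
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[neigh[i]], D[i, neigh[i]])
        lrd[i] = 1.0 / reach.mean()
    # 4. LOF: average neighbour lrd over own lrd.
    return np.array([lrd[neigh[i]].mean() / lrd[i] for i in range(n)])

# A dense cluster plus one isolated point (hypothetical data).
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1], [0.05, 0.05], [3, 3]], float)
scores = lof_scores(X, k=3)
```

The cluster points score close to 1 (inliers), while the isolated point gets a LOF far above 1.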



End of Section

1.3.5.5 - Isolation Forest

Isolation Forest

Use Case 🐝

‘Large scale tabular data.’

Credit card fraud detection in datasets with millions of rows and hundreds of features.

Note: Supervised learning requires balanced, labeled datasets (normal vs. anomaly), which are rarely available in real-world scenarios like fraud or cyber-attacks.

Intuition 💡

🦥 ‘Flip the logic.’

🦄 ‘Anomalies’ are few and different, so they are much easier to isolate from the rest of the data than normal points.

Problem 🦀

🐦‍🔥 ‘Curse of dimensionality.’

🐎 Distance-based (K-NN) and density-based (LOF) algorithms require calculating the distance between all pairs of points.

🐙 As the number of dimensions and data points grows, these calculations become increasingly expensive 💰 and less effective.

Solution 🦉
Use a tree-based 🌲 approach with better time ⏰ complexity \(O(n \log n)\), making it highly scalable for massive datasets and robust in high-dimensional spaces without needing expensive distance metrics.
Goal 🎯

‘Randomly partition the data.’

🦄 If a point is an outlier, it will take fewer partitions (splits) to isolate it into a leaf 🍃 node compared to a point that is buried deep within a dense cluster of normal data.

Isolation Forest (iForest) 🌲🌳

🌳Isolation Forest uses an ensemble of ‘Isolation Trees’ (iTrees) 🌲.

👉iTree (Isolation Tree) 🌲 is a proper binary tree structure specifically designed to separate individual data points through random recursive partitioning.

Algorithm ⚙️
  1. Sub-sampling:
    • Select a random subset of data (typically 256 instances) to build an iTree.
  2. Tree Construction: Randomly select a feature.
    • Randomly select a split value between the minimum and maximum values of that feature.
    • Divide the data into two branches based on this split.
    • Repeat recursively until the point is isolated or a height limit is reached.
  3. Forest Creation:
    • Repeat 🔁 the process to create ‘N’ trees (typically 100).
  4. Inference:
    • Pass a new data point through all trees, calculate the average path length, and compute the anomaly score.
Scoring Function 📟

⭐️🦄 Assign an anomaly score based on the path length h(x) required to isolate a point ‘x’.

  • Path Length (h(x)): The number of edges ‘x’ traverses from the root node to a leaf node.
  • Average Path Length (c(n)): Since iTrees are structurally similar to Binary Search Trees (BST), the average path length for a dataset of size ’n’ is given by: \[c(n)=2H(n-1)-\frac{2(n-1)}{n}\]

where, H(i) is the harmonic number, estimated as \(\ln (i)+0.5772156649\) (Euler’s constant).

🦄 Anomaly Score

To normalize the score between 0 and 1, we define it as:

\[s(x,n)=2^{-\frac{E(h(x))}{c(n)}}\]

👉E(h(x)): the average path length of ‘x’ across a forest of trees 🌲.

  • \(s\rightarrow 1\): Point is an anomaly; Path length is very short.
  • \(s\approx 0.5\): Point is normal, path length approximately equal to c(n).
  • \(s\rightarrow 0\): Point is normal; deeply buried point, path length is much larger than c(n).
images/machine_learning/unsupervised/anomaly_detection/isolation_forest/slide_12_01.tif
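The scoring function above can be checked numerically; the path lengths fed in below are illustrative values, not measurements from a real tree:

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average BST path length for n points; H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_len, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)), normalized to (0, 1)."""
    return 2.0 ** (-avg_path_len / c(n))

n = 256                                # typical sub-sample size
s_shallow = anomaly_score(4.0, n)      # isolated quickly -> score pushed toward 1
s_average = anomaly_score(c(n), n)     # average depth -> score exactly 0.5
```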
Drawbacks
  • Axis-Parallel Splits:
    • Standard iTrees 🌲 split only on one feature at a time, so:
      • We do not get a smooth decision boundary.
      • Anything off-axis has a higher probability of being marked as an outlier.
      • Note: Extended Isolation Forest fixes this by using random slopes.
  • Score Sensitivity: The threshold for what constitutes an ‘anomaly’ often requires manual tuning or domain knowledge.



End of Section

1.3.5.6 - RANSAC

RANSAC

Use Case 🐝
⭐️Estimate the parameters of a model from a set of observed data that contains a significant number of outliers.
Intuition 💡

👉Ordinary Least Squares use all data points to find a fit.

  • However, a single outlier can ‘pull’ the resulting line significantly, leading to a poor representative model.

💡If we pick a very small random subset of points, there is a higher probability that this small subset contains only good data (inliers) compared to a large set.

Problem 🦀
  • Ordinary Least Squares (OLS) minimizes the Sum of Squared Errors.

    • A huge outlier therefore has a quadratically large impact on the final line.
    images/machine_learning/unsupervised/anomaly_detection/ransac/slide_03_01.png
Solution 🦉

💡Instead of using all points, iteratively pick the smallest possible random subset to fit a model, then check (votes) how many other points in the dataset ‘agree’ with that model.

This gives the name to our algorithm:

  • Random: Random subsets.
  • Sample: Small (minimal) subsets.
  • Consensus: Agreement with other points.
RANSAC Algorithm ⚙️
  1. Random Sampling:
    • Randomly select a Minimal Sample Set (MSS) of ’n’ points from the input data ‘D’.
    • e.g. n=2 for a line, or n=3 for a plane in 3D.
  2. Model Fitting:
    • Compute the model parameters using only these ’n’ points.
  3. Test:
    • For all other points in ‘D’, calculate the error relative to the model.
    • 👉 Points with error < \(\tau\)(threshold) are added to the ‘Consensus Set’.
  4. Evaluate:
    • If the consensus set is larger than the previous best, save this model and set.
  5. Repeat 🔁:
    • Iterate ‘k’ times.
  6. Refine (Optional):
    • Once the best model is found, re-estimate it using all points in the final consensus set (usually via Least Squares) for a more precise fit.
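The steps above can be sketched for fitting a 2-D line in NumPy; the data, threshold \(\tau\), and iteration count ‘k’ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Inliers on the line y = 2x + 1 plus gross outliers (hypothetical data).
x_in = rng.uniform(0, 10, 80)
y_in = 2 * x_in + 1 + rng.normal(0, 0.1, 80)
x_out = rng.uniform(0, 10, 20)
y_out = rng.uniform(-30, 30, 20)
X = np.concatenate([x_in, x_out]); Y = np.concatenate([y_in, y_out])

best_model, best_inliers = None, np.array([], dtype=int)
tau, k = 0.5, 100
for _ in range(k):
    # 1. Minimal Sample Set: n = 2 points for a line.
    i, j = rng.choice(len(X), size=2, replace=False)
    if X[i] == X[j]:
        continue
    # 2. Fit model from the MSS only.
    slope = (Y[j] - Y[i]) / (X[j] - X[i])
    intercept = Y[i] - slope * X[i]
    # 3. Consensus set: points with error below the threshold tau.
    err = np.abs(Y - (slope * X + intercept))
    consensus = np.where(err < tau)[0]
    # 4. Keep the model with the largest consensus so far.
    if len(consensus) > len(best_inliers):
        best_model, best_inliers = (slope, intercept), consensus

# 6. Refinement: least squares on the final consensus set.
slope, intercept = np.polyfit(X[best_inliers], Y[best_inliers], 1)
```

Despite 20% gross outliers, the refined fit recovers the true line closely.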

👉Example:

images/machine_learning/unsupervised/anomaly_detection/ransac/slide_09_01.png
How Many Iterations ‘k’ ?

👉To ensure the algorithm finds a ‘clean’ sample set (no outliers) with a desired probability(often 99%), we use the following formula:

\[k=\frac{\log (1-P)}{\log (1-w^{n})}\]
  • k: Number of iterations required.
  • P: Probability that at least one random sample contains only inliers.
  • w: Ratio of inliers in the dataset (number of inliers / total points).
  • n: Number of points in the Minimal Sample Set.
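Plugging in typical numbers (assuming P = 0.99 and an inlier ratio w = 0.5) reproduces the commonly quoted iteration counts:

```python
import math

def ransac_iterations(P=0.99, w=0.5, n=2):
    """k = log(1 - P) / log(1 - w^n), rounded up to a whole iteration."""
    return math.ceil(math.log(1 - P) / math.log(1 - w ** n))

k_line = ransac_iterations(P=0.99, w=0.5, n=2)   # line: MSS of 2 points
k_plane = ransac_iterations(P=0.99, w=0.5, n=3)  # plane: MSS of 3 points
```

Note how ‘k’ grows quickly with the MSS size ‘n’: a plane needs roughly twice as many iterations as a line at the same inlier ratio.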



End of Section

1.3.6 - Dimensionality Reduction

Dimensionality Reduction Techniques



End of Section

1.3.6.1 - PCA

PCA

Use Case 🐝
  • Data Compression
  • Noise Reduction
  • Feature Extraction: Create a smaller set of meaningful features from a larger one.
  • Data Visualization: Project high-dimensional data down to 2 or 3 dimensions.

Assumption: Linear relationship between features.

Intuition 💡

💡‘Information = Variance’

☁️ Imagine a cloud of points in 2D space.
👀 Look for the direction (axis) along which the data is most ‘spread out’.

🚀 By projecting data onto this axis, we retain the maximum amount of information (variance).

👉Example 1: Var(Feature 2) < Var(Feature 1)

images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_03_01.png

👉Example 2: Red line shows the direction of maximum variance

images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_04_01.png
Principal Component Analysis

🧭 Finds the direction of maximum variance in the data.

Note: Some loss of information will always be there in dimensionality reduction, because there will be some variability in data along the direction that is dropped, and that will be lost.

Goal 🎯
Fundamental goal of PCA is to find the new set of orthogonal axes, called the principal components, onto which the data can be projected, such that, the variance of the projected data is maximum.
PCA as Optimization Problem 🦀
  • PCA seeks a direction 🧭, represented by a unit vector \(\hat{u}\) onto which data can be projected to maximize variance.
  • The projection of a mean centered data point \(x_i\) onto \(u\) is \(u^Tx_i\).
  • The variance of these projections can be expressed as \(u^{T}\Sigma u\), where \(\Sigma\) is the data’s covariance matrix.
Why Variance of Projection is \(u^{T}\Sigma u\)?
  • Data: Let \(X = \{x_{1},x_{2},\dots ,x_{n}\}\) is the mean centered (\(\bar{x} = 0\)) dataset.
  • Projection of point \(x_i\) on \(u\) is \(z_i = u^Tx_i\)
  • Variance of projected points (since \(\bar{x} = 0\)): \[\text{Var}(z)=\frac{1}{n}\sum _{i=1}^{n}z_{i}^{2}=\frac{1}{n}\sum _{i=1}^{n}(x_{i}^{T}u)^{2}\]
  • 💡Since, \((x_{i}^{T}u)^{2} = (u^{T}x_{i})(x_{i}^{T}u)\) \( \implies\text{Var}(z)=u^{T}\left(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^{T}\right)u\)
  • 💡Since, Covariance Matrix, \(\Sigma = \left(\frac{1}{n}\sum _{i=1}^{n}x_{i}x_{i}^{T}\right)\)
  • 👉 Therefore, \(\text{Var}(z)=u^{T}\Sigma u\)
Constrained 🐣 Optimization

👉 To prevent infinite variance, PCA constrains \(u\) to be a unit vector (\(\|u\|=1\)).

\[\text{maximize\ }u^{T}\Sigma u, \quad \text{subject\ to\ }u^{T}u=1\]

Note: This constraint forces the optimization algorithm to focus purely on the direction that maximizes variance, rather than allowing it to artificially inflate the variance by increasing the length of the vector.

Constrained Optimization Solution 🦉

⏲️ Lagrangian function: \(L(u,\lambda )=u^{T}\Sigma u-\lambda (u^{T}u-1)\)
🔦 To find \(u\) that maximizes above constrained optimization:

\[\frac{\partial L}{\partial u} = 0\]

\[\implies 2\Sigma u - 2\lambda u = 0 \implies \Sigma u = \lambda u\]

\[\because \frac{\partial }{\partial x}x^{T}Ax=2Ax \text{ for symmetric } A\]

💎 This is the standard Eigenvalue Equation.
🧭 So, the optimal directions \(u\) are the eigenvectors of the covariance matrix \(\Sigma\).

👉 To see which eigenvector maximizes variance, substitute the result back into the variance equation:

\[\text{Variance}=u^{T}\Sigma u=u^{T}(\lambda u)=\lambda (u^{T}u)=\lambda \]

🧭 Since the variance equals the eigenvalue \(\lambda\), the direction \(u\) that maximizes variance is the eigenvector associated with the largest eigenvalue.

PCA Algorithm ⚙️
  1. Center the data: \(X = X - \mu\)
  2. Compute the Covariance Matrix: \(\Sigma = \frac{1}{n-1} X^T X\)
  3. Compute Eigenvectors and Eigenvalues of \(\Sigma\) .
  4. Sort eigenvalues in descending order and select the top ‘k’ eigenvectors.
  5. Project the original data onto the subspace: \(Z = X W_k\) where, \(W_{k}=[u_{1},u_{2},\dots ,u_{k}]\) , matrix formed by ‘k’ eigenvectors corresponding to ‘k’ largest eigenvalues.
images/machine_learning/unsupervised/dimensionality_reduction/pca/slide_13_01.png
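The five steps above, sketched in NumPy on hypothetical correlated 2-D data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Correlated 2-D data stretched along one direction (hypothetical).
X = rng.multivariate_normal([5, -3], [[3.0, 2.0], [2.0, 2.0]], size=400)

# 1. Center the data.
Xc = X - X.mean(axis=0)
# 2. Covariance matrix.
Sigma = (Xc.T @ Xc) / (len(Xc) - 1)
# 3. Eigen-decomposition (eigh: Sigma is symmetric).
eigvals, eigvecs = np.linalg.eigh(Sigma)
# 4. Sort eigenvalues descending, keep the top k = 1 eigenvector.
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:1]]
# 5. Project onto the principal component.
Z = Xc @ W

# Variance along a component equals its eigenvalue (see derivation above).
explained = eigvals[order[0]] / eigvals.sum()
```

For this stretched cloud the first component alone retains the bulk of the variance.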
Drawbacks
  • Cannot model non-linear relationships.
  • Sensitive to outliers.



End of Section

1.3.6.2 - t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Use Case 🐝
⭐️ Visualizing complex datasets, such as MNIST handwritten digits, text embeddings, or biological data, where clusters are expected to form naturally.
Intuition 💡

👉 PCA preserves variance, not neighborhoods.
🔬 t-SNE focuses on the ‘neighborhood’ (local structure).

💡Tries to keep points that are close together in high-dimensional space close together in low-dimensional space.

t-SNE
⭐️ Non-linear dimensionality reduction technique to visualize high-dimensional data (like images, gene expressions, text embeddings) in a lower-dimensional space (typically 2D or 3D) by preserving local structures, making clusters and patterns visible.
Problem 🦀

👉 Map high-dimensional points to low-dimensional points, such that the pairwise similarities are preserved, while solving the ‘crowding problem’ (where points collapse into a single cluster).

👉 Crowding Problem

images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_05_01.png
Solution 🦉

📌 Convert Euclidean distances into conditional probabilities representing similarities.
⚖️ Minimize the divergence between the probability distributions of the high-dimensional (Gaussian) and low-dimensional (t-distribution) spaces.

Note: Probabilistic approach to defining neighbors is the core ‘stochastic’ element of the algorithm’s name.

High Dimensional Space 🚀(Gaussian)

💡The similarity of datapoint \(x_j\) to datapoint \(x_i\) is the conditional probability \(p_{j|i}\), that \(x_i\) would pick \(x_j\) as its neighbor.

Note: If neighbors are picked in proportion to their probability density under a Gaussian centered at \(x_i\).

\[p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}\]
  • \(p_{j|i}\) high: nearby points.
  • \(p_{j|i}\) low: widely separated points.

Note: Make probabilities symmetric: \(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\)
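The similarity computation above can be sketched in NumPy; for simplicity a single fixed \(\sigma\) is used here, whereas real t-SNE tunes a per-point \(\sigma_i\) to match the target perplexity:

```python
import numpy as np

def p_conditional(X, sigma):
    """p_{j|i}: Gaussian similarities, one row per point, rows sum to 1."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # squared distances
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                                # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)

# Two nearby points and one far-away point (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 0.5], [5.0, 5.0]])
P = p_conditional(X, sigma=1.0)
P_sym = (P + P.T) / (2 * len(X))   # symmetrized joint p_ij, sums to 1 overall
```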

Low Dimensional Space 🚀 (t-distribution)

🧠 To solve the crowding problem, we use a heavy-tailed 🦨 distribution (Student’s-t distribution with degree of freedom \(\nu=1\)).

\[q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}\]

Note: Heavier tail spreads out dissimilar points more effectively, allowing moderately distant points to be mapped further apart, preventing clusters from collapsing and ensuring visual separation and cluster distinctness.

images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_10_01.png
Optimization 🕸️

👉 Measure the difference between the distributions ‘p’ and ‘q’ using the Kullback-Leibler (KL) divergence, which we aim to minimize:

\[C = KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\]
Gradient Descent 🎢

🏔️ Use gradient descent to iteratively adjust the positions of the low-dimensional points \(y_i\).

👉 The gradient of the KL divergence is:

\[\frac{\partial C}{\partial y_{i}}=4\sum _{j\ne i}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i}-y_{j}||^{2})^{-1}\]
Meaning of Terms
  • C : t-SNE cost function (sum of all KL divergences).
  • \(y_i, y_j\): coordinates of data points ‘i’ and ‘j’ in the low-dimensional space (typically 2D or 3D).
  • \(p_{ij}\): high-dimensional similarity (joint probability) between points \(x_i\) and \(x_j\), calculated using a Gaussian distribution.
  • \(q_{ij}\): low-dimensional similarity (joint probability) between points \(y_i\) and \(y_j\), calculated using a Student’s t-distribution with one degree of freedom.
  • \((1+||y_{i}-y_{j}||^{2})^{-1}\): term comes from the heavy-tailed Student’s t-distribution, which helps mitigate the ‘crowding problem’ by allowing points that are moderately far apart to have a small attractive force.
Interpretation 🦒

💡 The gradient can be understood as a force acting on each point \(y_i\) in the low-dimensional map:

\[\frac{\partial C}{\partial y_{i}}=4\sum _{j\ne i}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i}-y_{j}||^{2})^{-1}\]
  • Attractive forces: If \(p_{ij}\) is large ⬆️ and \(q_{ij}\) is small ⬇️ (meaning two points were close in the high-dimensional space but are far in the low-dimensional space), the term \((p_{ij}-q_{ij})\) is positive, resulting in a strong attractive force pulling \(y_i\) and \(y_j\) closer.
  • Repulsive forces: If \(p_{ij}\) is small ⬇️ and \(q_{ij}\) is large ⬆️ (meaning two points were far in the high-dimensional space but are close in the low-dimensional space), the term \((p_{ij}-q_{ij})\) is negative, resulting in a repulsive force pushing \(y_i\) and \(y_j\) apart.
Gradient Descent Update Step

👉 Update step for \(y_i\) in low dimensions:

\[y_{i}^{(t+1)}=y_{i}^{(t)}-\eta \frac{\partial C}{\partial y_{i}}\]
  • Attractive forces (\(p_{ij}>q_{ij}\)):

    • The negative gradient moves \(y_i\) opposite to the (\(y_i - y_j\)) vector, pulling it toward \(y_j\).
  • Repulsive forces (\(p_{ij} < q_{ij}\)):

    • The negative gradient moves \(y_i\) in the same direction as the (\(y_i - y_j\)) vector, pushing it away from \(y_j\).
    images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_15_01.png
Perplexity 😵‍💫

🏘️ User-defined parameter that loosely relates to the effective number of neighbors.

Note: Variance \(\sigma_i^2\) (Gaussian) is adjusted for each point to maintain a consistent perplexity.

t-SNE Plot of MNIST Digits
images/machine_learning/unsupervised/dimensionality_reduction/tsne/slide_18_01.png



End of Section

1.3.6.3 - UMAP

Uniform Manifold Approximation and Projection (UMAP)

Use Case 🐝

🐢Visualizing massive datasets where t-SNE is too slow.

⭐️ Creating robust low-dimensional inputs for subsequent machine learning models.

Intuition 💡

⭐️ Using a world map 🗺️ (2D) instead of a globe 🌍 for the spherical (3D) Earth 🌏.

👉It preserves the neighborhood relationships of countries (e.g., India is next to China), and to a good degree, the global structure.

UMAP

⭐️ Non-linear dimensionality reduction technique to visualize high-dimensional data (like images, gene expressions) in a lower-dimensional space (typically 2D or 3D), preserving its underlying structure and relationships.
👉 Constructs a high-dimensional graph of data points and then optimizes a lower-dimensional layout to closely match this graph, making complex datasets understandable by revealing patterns, clusters, and anomalies.

Note: Similar to t-SNE but often faster and better at maintaining global structure.

Problem 🦀
👉 Create a low-dimensional representation that preserves the topological connectivity and manifold structure of the high-dimensional data efficiently.
Solution 🦉

💡 Create a weighted graph (fuzzy simplicial set) representing the data’s topology and then find a low-dimensional graph that is as structurally similar as possible.

Note: Fuzzy means instead of using binary 0/1, we use weights in the range [0,1] for each edge.

High Dimensional Graph (Manifold Approximation)
  • UMAP determines local connectivity based on a user-defined number of neighbors (n_neighbors).
  • Normalizes distances locally using the distance to the nearest neighbor (\(\rho_i\)) and a scaling factor (\(\sigma_i\)) adjusted to enforce local connectivity constraints.
  • The weight \(W_{ij}\) (fuzzy similarity) in high-dimensional space is: \[W_{ij}=\exp \left(-\frac{\max (0,\|x_{i}-x_{j}\|-\rho _{i})}{\sigma _{i}}\right)\]

Note: This ensures that the closest point always gets a weight of 1, preserving local structure.

Low Dimensional Space (Optimization)

👉In the low-dimensional space (e.g., 2D), UMAP uses a simple curve (similar to the t-distribution used in t-SNE) for edge weights:

\[Z_{ij}=(1+a\|y_{i}-y_{j}\|^{2b})^{-1}\]

Note: The parameters ‘a’ and ‘b’ are typically fitted from the ‘min_dist’ user parameter (e.g. min_dist = 0.1 gives a ≈ 1.577, b ≈ 0.895).
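A quick numerical check of the low-dimensional curve \(Z_{ij}\), using the approximate ‘a’ and ‘b’ values quoted above (assumed here, not fitted):

```python
import numpy as np

a, b = 1.577, 0.895   # approximate values for min_dist = 0.1

def low_dim_weight(yi, yj):
    """Z_ij = (1 + a * ||y_i - y_j||^(2b))^(-1)."""
    d2 = np.sum((yi - yj) ** 2)
    return 1.0 / (1.0 + a * d2 ** b)

w_near = low_dim_weight(np.array([0.0, 0.0]), np.array([0.1, 0.0]))  # close pair
w_far = low_dim_weight(np.array([0.0, 0.0]), np.array([3.0, 0.0]))   # distant pair
```

Nearby points keep a weight near 1 while distant points decay toward 0, mirroring the fuzzy graph weights \(W_{ij}\).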

Optimization

⭐️ Unlike t-SNE’s KL divergence, UMAP minimizes the cross-entropy between the high-dimensional weights \(W_{ij}\) and the low-dimensional weights \(Z_{ij}\).

👉 Cost 💰 Function (C):

\[C=\sum _{i,j}\left(W_{ij}\log \frac{W_{ij}}{Z_{ij}}+(1-W_{ij})\log \frac{1-W_{ij}}{1-Z_{ij}}\right)\]
Cross Entropy Loss

Cost 💰 Function (C):

\[C=\sum _{i,j}\left(W_{ij}\log \frac{W_{ij}}{Z_{ij}}+(1-W_{ij})\log \frac{1-W_{ij}}{1-Z_{ij}}\right)\]

🎯 Goal : Reduce overall Cross Entropy Loss.

  • Attractive Force: \(W_{ij}\) high; make \(Z_{ij}\) high to make the \(\log \frac{W_{ij}}{Z_{ij}}\) term small.
  • Repulsive Force: \(W_{ij}\) low; make \(Z_{ij}\) low to make the \(\log \frac{1-W_{ij}}{1-Z_{ij}}\) term small.

Note: This push and pull of 2 ‘forces’ will make the data in low dimensions settle into a position that is overall a good representation of the original data in higher dimensions.

Stochastic Gradient Descent
👉 Optimization uses stochastic gradient descent (SGD) to minimize this cross-entropy, balancing attractive forces (edges present in high-dimension, \(W_{ij} \approx 1\)) and repulsive forces (edges absent in high-dimension, \(W_{ij} \approx 0\)).
UMAP Plot of MNIST Digits
images/machine_learning/unsupervised/dimensionality_reduction/umap/slide_13_01.png
Drawbacks 🦂
  • Mathematically complex.
  • Requires tuning (n_neighbors, min_dist).



End of Section

1.4 - Feature Engineering

Feature Engineering



End of Section

1.4.1 - Data Pre Processing

Data Pre Processing

Real World 🌎 Data

Messy and Incomplete.
We need to pre-process the data to make it:

  • Clean
  • Consistent
  • Mathematically valid
  • Computationally stable

👉 So that, the machine learning algorithm can safely consume the data.

Missing Values
  • Missing Completely At Random (MCAR)
    • The missingness occurs entirely by chance, such as due to a technical glitch during data collection or a random human error in data entry.
  • Missing At Random (MAR)
    • The probability of missingness depends on the observed data and not on the missing value itself.
    • e.g. In a survey, the ages of many female respondents are missing, because they may prefer not to disclose that information.
  • Missing Not At Random (MNAR)
    • The probability of missingness is directly related to the unobserved missing value itself.
    • e.g. Individuals with very high incomes 💰may intentionally refuse to report their salary due to privacy concerns, making the missing data directly dependent on the high income 💰value itself.
Handle Missing Values (Imputation)
  • Simple Imputation:
    • Mean: Normally distributed numerical features.
    • Median: Skewed numerical features.
    • Mode: Categorical features, most frequent.
  • KNN Imputation:
    • Replace the missing value with mean/median/mode of ‘k’ nearest (similar) neighbors of the missing value.
  • Predictive Imputation:
    • Use another ML model to estimate missing values.
  • Multivariate Imputation by Chained Equations (MICE):
    • Iteratively models each variable with missing values as a function of other variables using flexible regression models (linear regression, logistic regression, etc.) in a ‘chained’ or sequential process.
    • Creates multiple datasets, using slightly different random starting points.
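Simple mean and median imputation can be sketched with NumPy (the ‘age’ column is hypothetical; note how the median is far less affected by the extreme value 95):

```python
import numpy as np

# Column with missing values encoded as NaN (hypothetical data).
age = np.array([25.0, 30.0, np.nan, 22.0, np.nan, 95.0])

# Mean imputation: fill gaps with the mean of the observed values.
mean_imputed = np.where(np.isnan(age), np.nanmean(age), age)
# Median imputation: better for skewed data, robust to the 95 outlier.
median_imputed = np.where(np.isnan(age), np.nanmedian(age), age)
```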
Handle Outliers 🦄

🦄 Outliers are extreme or unusual data points, can mislead models, causing inaccurate predictions.

  • Remove invalid or corrupted data.
  • Replace (Impute): Median or capped value to reduce impact.
  • Transform: Apply log or square root to reduce skew.

👉 For example: Log and Square Root Transformed Data

images/machine_learning/feature_engineering/data_pre_processing/slide_05_01.png
Scaling and Normalization

💡 If one feature ranges from 0-1 and another from 0-1000, the larger feature will dominate the model.

  • Standardization (Z-score) :
    • μ=0, σ=1; less sensitive to outliers.
    • \(x_{std} = (x − μ) / σ\)
  • Min-Max Normalization:
    • Maps data to specific range, typically [0,1]; sensitive to outliers.
    • \(x_{minmax} = (x − min) / (max − min)\)
  • Robust Scaling:
    • Transforms features using median and IQR; resilient to outliers.
    • \(x_{scaled}=(x-\text{median})/\text{IQR}\)

👉 Standardization Example

images/machine_learning/feature_engineering/data_pre_processing/slide_07_01.png
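The three scalers above can be compared on a small array with one outlier (illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier

# Standardization: mean 0, std 1 (outlier still stretches the scale).
x_std = (x - x.mean()) / x.std()
# Min-max: maps to [0, 1]; the outlier squashes the normal points together.
x_minmax = (x - x.min()) / (x.max() - x.min())
# Robust scaling: median and IQR are barely affected by the outlier.
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)
```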



End of Section

1.4.2 - Categorical Variables

Categorical Variables

Categorical Variables

💡 ML models operate on numerical vectors.

👉 Categorical variables must be transformed (encoded) while preserving information and semantics.

  • One Hot Encoding (OHE)
  • Label Encoding
  • Ordinal Encoding
  • Frequency/Count Encoding
  • Target Encoding
  • Hash Encoding
One Hot 🔥 Encoding (OHE)

⭐️ When the categorical data (nominal) is without any inherent ordering.

  • Create binary columns per category.
    • e.g.: Colors: Red, Blue, Green.
    • Colors: [1,0,0], [0,1,0], [0,0,1]

Note: Use when cardinality is low, i.e., a small number of unique values (<20).
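A minimal one-hot encoding sketch (in practice one would use pandas' get_dummies or scikit-learn's OneHotEncoder):

```python
import numpy as np

colors = ["Red", "Blue", "Green", "Blue"]
categories = sorted(set(colors))        # ['Blue', 'Green', 'Red']

# One binary column per category; exactly one 1 per row.
ohe = np.array([[1 if c == cat else 0 for cat in categories] for c in colors])
```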

Label 🏷️ Encoding

⭐️ Assigns a unique integer (e.g., 0, 1, 2) to each category.

  • When to use ?
    • Target variable, i.e, unordered (nominal) data, in classification problems.
    • e.g. encoding a city [“Paris”, “Tokyo”, “Amsterdam”] -> [1, 2, 0], (Alphabetical: Amsterdam=0, Paris=1, Tokyo=2).
  • When to avoid ?
    • For nominal data in linear models, because it can mislead the model to assume an order/hierarchy, when there is none.
Ordinal Encoding

⭐️ When categorical data has logical ordering.

  • Best for: Ordered (ordinal) input features.

    images/machine_learning/feature_engineering/categorical_variables/slide_04_01.png
Frequency/Count 📟 Encoding

⭐️ Replace categories with their frequency or count in the dataset.

  • Useful for high-cardinality features where many unique values exist.

👉 Example

images/machine_learning/feature_engineering/categorical_variables/slide_06_01.png

👉 Frequency of Country

images/machine_learning/feature_engineering/categorical_variables/slide_06_03.png

👉 Country replaced with Frequency

images/machine_learning/feature_engineering/categorical_variables/slide_06_02.png
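A minimal sketch of frequency encoding, assuming a hypothetical country column (relative frequencies are used; raw counts work the same way):

```python
from collections import Counter

# Replace each category with its relative frequency in the data.
def frequency_encode(values):
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

countries = ["US", "US", "US", "IN", "IN", "UK"]
encoded = frequency_encode(countries)
print(encoded)  # US -> 0.5, IN -> 1/3, UK -> 1/6
```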
Target 🎯 Encoding

⭐️ Replace a category with the mean of the target variable for that specific category.

  • When to use ?
    • For high-cardinality nominal features, where one hot encoding is inefficient, e.g., zip code, product id, etc.
    • Strong correlation between the category and the target variable.
  • When to avoid ?
    • With small datasets, because the category averages (encodings) are based on too few samples, making them unrepresentative.
    • Also, it can lead to target leakage and overfitting unless proper smoothing or cross-validation techniques (like K-fold or Leave-One-Out) are used.
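A minimal target-encoding sketch with optional smoothing; the zip-code data and the `smoothing` parameter are illustrative assumptions. As noted above, in practice the means must be fitted on the training fold only to avoid target leakage:

```python
# Replace each category with the (optionally smoothed) mean of the target.
def target_encode(categories, targets, smoothing=0.0):
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0) + t
        counts[c] = counts.get(c, 0) + 1
    # Smoothed category mean: blends toward the global mean when a
    # category has few samples (smoothing=0 gives the raw mean).
    enc = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
           for c in sums}
    return [enc[c] for c in categories]

zips = ["A", "A", "B", "B", "B"]
defaulted = [1, 0, 0, 0, 1]          # binary target
print(target_encode(zips, defaulted))  # A -> 0.5, B -> 1/3
```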
Hash 🌿 Encoding

⭐️ Maps categories to a fixed number of features using a hash function.

  • Useful for high-cardinality features where we want to limit the dimensionality.

    images/machine_learning/feature_engineering/categorical_variables/slide_09_01.png
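A minimal hash-encoding sketch, assuming 8 buckets (the bucket count is the dimensionality knob). A stable hash such as hashlib's MD5 is used because Python's built-in hash() is randomized across interpreter runs:

```python
import hashlib

# Map a category into one of n_buckets columns via a stable hash.
# Different categories may collide in the same bucket by design.
def hash_encode(value, n_buckets=8):
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vec = [0] * n_buckets
    vec[bucket] = 1
    return vec

print(hash_encode("product_12345"))  # an 8-dim vector with a single 1
```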



End of Section

1.4.3 - Feature Engineering

Feature Engineering

Feature Engineering
Use domain knowledge 📕 to create new or transform existing features to improve model performance.
Polynomial 🐙 Features

Create polynomial features, such as \(x^2\), \(x^3\), etc., to learn non-linear relationships.

images/machine_learning/feature_engineering/feature_engineering/slide_04_01.png
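A minimal sketch of expanding a single feature into polynomial terms (sklearn's PolynomialFeatures generalizes this to multiple features and interaction terms):

```python
# Expand one feature x into [x, x^2, ..., x^degree] so that a
# linear model can fit a non-linear relationship.
def polynomial_features(x, degree):
    return [x ** d for d in range(1, degree + 1)]

print(polynomial_features(3, 3))  # [3, 9, 27]
```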
Feature Crossing 🦓

⭐️ Combine 2 or more features to capture non-linear relationships.

  • e.g. combine latitude and longitude into one location feature ‘lat-long'.
Hash 🌿 Encoding

⭐️ Memory-efficient 🧠 technique to convert categorical (string) data into a fixed-size numerical feature vector.

  • Pros:
    • Useful for high-cardinality features where we want to limit the dimensionality.
  • Cons:
    • Hash collisions.
    • Reduced interpretability.

👉 Hash Encoding (Example)

images/machine_learning/feature_engineering/feature_engineering/slide_08_01.png
Binning (Discretization)

⭐️ Group continuous numerical values into discrete categories or ‘bins’.

  • e.g. divide age into groups 18-24, 25-34, 35-44, 45-54, 55+ years, etc.
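A minimal binning sketch using right-open intervals; the bin edges and labels below are illustrative assumptions (pandas.cut is the usual tool in practice):

```python
import bisect

# Bins: [18,25), [25,35), [35,45), [45,55), 55+
edges = [25, 35, 45, 55]
labels = ["18-24", "25-34", "35-44", "45-54", "55+"]

def bin_age(age):
    # bisect_right finds which interval the age falls into.
    return labels[bisect.bisect_right(edges, age)]

print([bin_age(a) for a in [19, 25, 44, 55, 70]])
# ['18-24', '25-34', '35-44', '55+', '55+']
```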



End of Section

1.4.4 - Data Leakage

Data Leakage

Data Leakage

⭐️ Occurs when a model is trained using data that would not be available during real-world predictions, leading to good training performance, but poor real‑world 🌎 performance.
It is essentially the model ‘cheating’ by inadvertently accessing information about the target variable.

👉Any information from the validation/test set must NOT influence training, directly or indirectly.
❓So, how do we prevent this leakage of information or data leakage from training to validation or test set ?

Train-Test Contamination
  • Wrong: Applying preprocessing (like global StandardScaler, Mean_Imputation, Target_Encoding etc.) on the entire dataset before splitting.
  • Right: Compute mean, variance, etc. only on the training data and use the same for validation and test data.

Preventing Leakage in Cross-Validation:

  • Wrong: Perform preprocessing (e.g., scaling, normalization, missing value imputation) on the entire dataset before passing it to cross_val_score.
  • Right: Use sklearn.pipeline.Pipeline; Pipeline ensures that the ‘validation fold’ remains unseen until the transformation is applied using the training fold’s parameters.
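The ‘Right’ approach can be sketched in plain Python, standing in for what an sklearn Pipeline does per fold: fit the scaler statistics on the training split only, then reuse them on the test split. The toy numbers are illustrative:

```python
import statistics

train = [10.0, 20.0, 30.0, 40.0]
test = [50.0, 60.0]

# Fit ONLY on the training split.
mu = statistics.mean(train)       # 25.0, from train only
sigma = statistics.pstdev(train)  # ~11.18, from train only

train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]  # reuse train-fitted stats

# The 'Wrong' (leaky) version would compute statistics.mean(train + test):
# the test set would then influence the transformation seen in training.
print(round(test_scaled[0], 3))  # 2.236
```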
Temporal Leakage

This happens in Time Series ⏰ data.

  • Wrong: Use standard random CV; it allows the model to ‘peek into the future’.
  • Right: Use Time-Series Nested Cross-Validation (Forward Chaining) instead of random shuffling.
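A minimal forward-chaining (expanding window) splitter, similar in spirit to sklearn's TimeSeriesSplit; the equal-sized fold logic below is a simplifying assumption:

```python
# Each fold trains on the past and validates on the next block,
# so the model never 'peeks into the future'.
def forward_chaining_splits(n, n_folds):
    fold = n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold))
        val_idx = list(range(k * fold, (k + 1) * fold))
        yield train_idx, val_idx

for tr, va in forward_chaining_splits(n=8, n_folds=3):
    print(tr, va)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```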
Target Leakage
  • Wrong: Include features that are only available after the event we are trying to predict and are proxy for the target.
    • e.g. Including number_of_late_payments in a model to predict whether a person applying for a bank loan will default.
  • Right: Do not include such features during training.

Group Leakage:

  • Wrong: If you have multiple rows that are correlated (same user).
    • For the same patient or user, you put some rows in Train and others in Test.
  • Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.



End of Section

1.4.5 - Model Interpretability

Model Interpretability

House Price Prediction
images/machine_learning/feature_engineering/model_interpretability/slide_01_01.png
Can we explain why the model made a certain prediction ?

👉 Because without this capability the machine learning is like a black box to us.

👉 We should be able to answer which features had most influence on output.

⭐️ Let’s understand ‘Feature Importance’ and why interpretability of an ML model’s output is important.

Feature Importance
\[\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d}\]\[w_1 > w_2 : f_1 \text{ is more important feature than } f_2\]\[ \begin{align*} w_j &> 0: f_j \text { is directly proportional to target variable} \\ w_j &= 0: f_j \text { has no relation to target variable} \\ w_j &< 0: f_j \text { is inversely proportional to target variable} \\ \end{align*} \]

Note: Weights 🏋️‍♀️ represent feature importance only when the data is standardized.

Why Model Interpretability Matters ?

💡 Overall model behavior + Why this prediction?

  • Trust: Stakeholders must trust predictions.
  • Model Debuggability: Detect leakage, spurious correlations.
  • Feature engineering: Feedback loop.
  • Regulatory compliance: Data privacy, GDPR.
Trust

⭐️ Stakeholders Must Trust Predictions.

  • Users, executives, and clients are more likely to trust and adopt an AI system if they understand its reasoning.
  • This transparency is fundamental, especially in high-stakes applications like healthcare, finance, or law, where decisions can have a significant impact.
Model Debuggability
⭐️ By examining which features influence predictions, developers can identify if the model is using misleading or spurious correlations, or if there is data leakage (where information not available in a real-world scenario is used during training).
Feature Engineering
⭐️ Insights gained from an interpretable model can provide a valuable feedback loop for domain experts and engineers.
Regulatory Compliance

⭐️ In many industries, regulations mandate the ability to explain decisions made by automated systems.

  • For instance, the General Data Protection Regulation (GDPR) in Europe includes a “right to explanation” for individuals affected by algorithmic decisions.
  • Interpretability ensures that organizations can meet these legal and ethical requirements.



End of Section

1.5 - ML System

Machine Learning System



End of Section

1.5.1 - Data Distribution Shift

Data Distribution Shift

Distribution Shift or Data Drift 🦣
⭐️ The data a model works with changes over time ⏰, which causes the model’s predictions to become less accurate as time passes⏳.
Bayes' Theorem
\[P(Y|X)=\frac{P(X|Y)\cdot P(Y)}{P(X)}\]
  • P(X | Y): Likelihood of X given Y (conditional distribution of the input given the output)
  • P(Y | X) : Model (Posterior)
  • P(Y): Prior probability of the output Y.
  • P(X): Evidence (marginal probability of the input X).
Covariate Shift (P(X) Changes)

⭐️The input data distribution seen during training is different from the distribution seen during inference.

👉 P(X) (input) changes, but P(Y|X) (model) remains the same.

  • e.g. Self-driving car 🚗 trained on a bright, sunny day is used during foggy winter.
Label Shift or Prior Probability Shift (P(Y) Changes)

⭐️The output distribution changes, but for a given output, the input distribution remains the same.

👉 P(Y) (output) changes, but P(X|Y) remains the same.

  • 😷 e.g. Flu-detection model is trained during summer, when only 1% of patients have flu.
    • The same model is used during winter when 40% of patients have flu.
    • 🍎 Prior probability of having flu P(Y) has changed from 1% to 40%, but the symptoms for a person to have flu P(X|Y) remains same.
Concept Drift or Posterior Shift (P(Y|X) Changes)

⭐️ The relationship between inputs and outputs changes.
i.e., the very definition of what you are trying to predict changes.

👉 Concept drift is often cyclic or seasonal.

  • e.g. ‘Normal’ spending behavior in 2019 became ‘Abnormal’ during 2020 lockdowns 🔐.



End of Section

1.5.2 - Retraining Strategies

Retraining Strategies

Why Retrain 🦮 a ML Model?

⭐️In a production ML environment, retraining is the ‘maintenance engine’ ⚙️ that keeps our models from becoming obsolete.

❌ Don’t ask: When do we retrain?

✅ Ask: “How do we automate the decision to retrain while balancing compute cost 💰, model risk, and data freshness?”

Periodic Retraining (Fixed Interval) ⏳

👉 The model is retrained on a regular schedule (e.g., daily, weekly, or monthly).

  • Best for:
    • Stable environments where data changes slowly.
      (e.g. long-term demand forecast or a credit scoring model).
  • Pros:
    • Highly predictable; easy to schedule compute resources; simple to implement via a cron job or Airflow DAG.
  • Cons:
    • Inefficient. You might retrain when not needed (wasting money 💵) or fail to retrain during a sudden market shift (losing accuracy).
Trigger-Based Retraining (Reactive) 🔫

👉 Retraining is initiated only when a specific performance or data metric crosses a pre-defined threshold.

  • Metric Triggers:
    • Performance Decay: A drop in Precision, Recall, or RMSE (requires ground-truth labels).
    • Drift Detection: A high PSI (Population Stability Index) or K-S test score indicating covariate shift.
  • Pros:
    • Cost-effective; reacts to the ‘reality’ of the data rather than the calendar.
  • Cons:
    • Requires a robust monitoring stack 📺.
      If the ‘trigger’ logic is buggy, the model may never update.
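The PSI trigger mentioned above can be sketched directly from its definition, PSI = Σ (actual% − expected%) · ln(actual%/expected%) over bins; the bin proportions and the thresholds in the comment are illustrative assumptions (teams tune them):

```python
import math

# Population Stability Index: compares the binned distribution of a
# feature (or score) in production against the training distribution.
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant shift.
def psi(expected_pct, actual_pct):
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_pct, actual_pct))

train_dist = [0.25, 0.25, 0.25, 0.25]  # bin proportions at training time
live_dist = [0.10, 0.20, 0.30, 0.40]   # bin proportions in production

print(round(psi(train_dist, live_dist), 4))  # 0.2282 -> moderate shift
```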
Continual Learning (Online/Incremental) 🛜

👉 Instead of retraining from scratch on a massive batch, the model is updated incrementally as new data ‘streams’ into the system.

  • Mechanism: Using ‘Warm Starts’ where the model weights from the previous version are used as the starting point for the next few gradient descent steps.
  • Best for:
    • Recommendation engines (Netflix/TikTok) or High-Frequency Trading 💰where patterns change by the minute.
  • Pros:
    • Extreme ‘freshness’; low latency between data arrival and model update.
  • Cons:
    • High risk of ‘Catastrophic Forgetting’ (the model forgets old patterns) and high infrastructure complexity.



End of Section

1.5.3 - Deployment Patterns

Deployment Patterns

Deploy 🖥️

⭐️In a production ML environment, retraining is only half the battle; we must also safely deploy the new version.

Types of deployment (most common):

  • Shadow ❏ Deployment
  • A/B Testing 🧪
  • Canary 🦜 Deployment
Shadow ❏ Deployment

👉 Safest way to deploy our model or any software update.

  • Deploy the candidate model in parallel with the existing model.
  • For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
  • Log the predictions from the new model for analysis purposes.

Note: When the new model’s predictions are satisfactory, we replace the existing model with the new model.

images/machine_learning/ml_system/deployment_patterns/slide_03_01.png
A/B Testing 🧪

👉A/B testing is a way to compare two variants of a model.

  • Deploy the candidate model in parallel with the existing model.
  • A percentage of traffic🚦is routed to the candidate for predictions; the rest is routed to the existing model for predictions.
  • Monitor 📺 and analyze the predictions, from both models to determine whether the difference in the two models’ performance is statistically significant.

Note: Say we run a two-sample statistical test and find that model A is better than model B with a p-value of 0.05 (5%); since p ≤ 0.05, the difference is typically considered statistically significant.

images/machine_learning/ml_system/deployment_patterns/slide_05_01.png
Canary 🦜 Deployment

👉 Mitigates deployment risk by incrementally shifting traffic 🚦from a model version to a new version, allowing for real-world validation on a subset of users before a full-scale rollout.

  • Deploy the candidate model in parallel with the existing model.
  • A percentage of traffic🚦is routed to the candidate for predictions.
  • If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic🚦 back to the existing model.
  • Stop when either the canary serves all the traffic🚦 (the candidate model has replaced the existing model) or when the canary is aborted.

Note: Canary releases can be used to implement A/B testing due to the similarities in their setups. However, we can do canary analysis without A/B testing.

images/machine_learning/ml_system/deployment_patterns/canary_deployment.png



End of Section

1.6 - Interview Questions

Machine Learning Interview Questions
How would you evaluate a model for an imbalanced classification problem?
Which metrics would you report and why?

We should evaluate an imbalanced classification model using metrics that focus on performance for each class, especially the minority class.

Why ?
Say we have a dataset with high imbalance, i.e., 99% of the data belongs to the positive class and only 1% belongs to the negative class.
In such a case, standard metrics such as accuracy are misleading, because a model can achieve 99% accuracy
by simply predicting the positive class all the time.

So, what to do ?
First of all, start with the confusion matrix (focus on the minority class).
It provides the raw counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). This is the foundation for all other metrics.

Confusion Matrix:

|                 | Predicted Positive  | Predicted Negative  |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
  • Precision: Of all instances the model predicted as positive, how many were actually positive?
    • \( Precision = \frac{TP}{TP + FP} \)
    • Use Case: The cost of a False Positive is high (e.g., marking a legitimate email as spam).
  • Recall (Sensitivity): Of all actual positive instances, how many did the model find?
    • \( Recall = \frac{TP}{TP + FN} \)
    • Use Case: The cost of a False Negative is high (e.g., missing a cancer diagnosis or fraud transaction).
  • F1-Score: Harmonic mean of precision and recall.
    • \( F1 ~ Score = 2 * \frac{Precision \times Recall}{Precision + Recall}\)
    • Why report F1-Score?: To balance precision and recall. A model with 1.0 precision and 0.0 recall will have an F1-score of 0.
  • Precision-Recall (PR) AUC: Plots Precision against Recall for different classification thresholds.
    Better than ROC curve because it uses Precision instead of False Positive Rate (FPR), which can be misleading for imbalanced data.
    • \(FPR = \frac{FP}{FP + TN}\)

Read more about Performance Metrics

Why might ROC-AUC be misleading for imbalanced classes?

ROC curve plots TPR vs FPR, \(FPR = \frac{FP}{FP + TN}\); for imbalanced data, FPR can be misleading because the large TN count keeps it artificially low.
So for imbalanced data, it is better to use the Precision-Recall curve, which uses Precision instead of FPR and is hence more reliable.
Let’s look at the fraud detection example below, N = 10,000 transactions, Fraud = 100, NOT fraud = 9900:

Confusion Matrix:

|                  | Predicted Fraud | Predicted NOT Fraud |
| Actual Fraud     | 80 (TP)         | 20 (FN)             |
| Actual NOT Fraud | 220 (FP)        | 9680 (TN)           |
\[FPR = \frac{FP}{FP + TN} = \frac{220}{220 + 9680} \approx 0.022\]

\[Precision = \frac{TP}{TP + FP} = \frac{80}{80 + 220} = \frac{80}{300}\approx 0.267\]

The FPR is very low due to the class imbalance, and hence Precision gives us a better view of the model’s performance.
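The numbers above can be re-derived in a few lines; this is just the worked example from the confusion matrix, not additional data:

```python
# Fraud-detection example: FPR looks tiny because TN dominates,
# while Precision exposes the model's actual weakness.
TP, FN, FP, TN = 80, 20, 220, 9680

fpr = FP / (FP + TN)          # 220 / 9900
precision = TP / (TP + FP)    # 80 / 300
recall = TP / (TP + FN)       # 80 / 100

print(round(fpr, 3), round(precision, 3), round(recall, 3))
# 0.022 0.267 0.8
```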

Describe how to avoid data leakage when performing feature engineering and cross-validation.

👉Any information from the validation/test set must NOT influence training, directly or indirectly.
So, how do we prevent this leakage of information or data leakage from training to validation or test set ?

  1. Train-Test Contamination:
  • Wrong: Applying preprocessing (like global StandardScaler, Mean_Imputation, Target_Encoding etc.) on the entire dataset before splitting.
  • Right: Compute mean, variance, etc. only on the training data and use the same for validation and test data.
  2. Preventing Leakage in Cross-Validation:
  • Wrong: Perform preprocessing (e.g., scaling, normalization, missing value imputation) on the entire dataset before passing it to cross_val_score.
  • Right: Use sklearn.pipeline.Pipeline; Pipeline ensures that the ‘validation fold’ remains unseen until the transformation is applied using the training fold’s parameters.
  3. Time Series Data:
  • Wrong: Use standard random CV; it allows the model to ‘peek into the future’.
  • Right: Use Time-Series Nested Cross-Validation (Forward Chaining) instead of random shuffling.
  4. Target Leakage:
  • Wrong: Include features that are only available after the event we are trying to predict and are proxy for the target.
    • e.g. Including number_of_late_payments in a model to predict whether a person applying for a bank loan will default.
  • Right: Do not include such features during training.
  5. Group Leakage:
  • Wrong: If you have multiple rows that are correlated (same user).
    • For the same patient or user, you put some rows in Train and others in Test.
  • Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.
Explain bias-variance tradeoff and write the bias and variance decomposition for squared error.

Bias-Variance Decomposition:
For Mean Squared Error (MSE) = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2\)

Total Error = Bias^2 + Variance + Irreducible Error

  • Bias = Systematic Error
    • Bias measures how far the average prediction of a model is from the true value.
  • Variance = Sensitivity to Data
    • Variance measures how much the predictions of a model vary for different training datasets.
  • Irreducible Error = Sensor noise, Human randomness
    • Inherent uncertainty in the data generation process itself and cannot be reduced by any model.

Bias-Variance Trade-Off:

  • High Bias (Underfitting): A model with high bias is too simple to capture the underlying patterns in the data
    • e.g., fitting a straight line to curved data.
  • High Variance (Overfitting): A model with high variance is too complex and learns the noise in the training data rather than the true relationship
    • e.g., a high-degree polynomial curve that perfectly fits training data points but performs poorly on new data.

🎯 The goal is to find a balance; a ‘sweet spot’ that minimizes the total error.

🦉 A good model ‘generalizes’ well, i.e., it is neither too simple (high bias) nor too complex (high variance).

How does L1 (Lasso) differ from L2 (Ridge) mathematically, and when does Lasso produce sparse solutions?
  • L1 Regularization:
    • \( \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ J(w) + \lambda_1.\sum_{j=1}^n |w_j| \)
  • L2 Regularization:
    • \( \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ J(w) + \lambda_2.\sum_{j=1}^n w_j^2 \)

L1 regularization produces sparse solutions when regularization coefficient \(\lambda_1\) is high.

  • Because the gradient of the L1 penalty (absolute function) is a constant, i.e., \(\pm 1\); this means a constant reduction in the weight at each step, making it reach exactly 0 in a finite number of steps.
  • Whereas, the derivative of L2 penalty is proportional to the weight (\(2w_j\)) and as the weight reaches close to 0, the gradient also becomes very small, this means that the weight will become very close to 0, but not exactly equal to 0.
What is heteroscedasticity and how does it affect OLS? How would you test for it?

💡 Heteroscedasticity = Variance NOT Constant

Note: Linear regression assumes that the data has homoscedasticity (constant variance).

Ordinary Least Squares (OLS) is an unweighted estimator. It treats every data point as equally ‘informative’.

  • Under Homoscedasticity: Every point has the same amount of noise, so giving them equal weight is logical.
  • Under Heteroscedasticity: Some points have very low variance (high certainty) and some have very high variance (lots of noise).

👉 By treating all the points equally, OLS is ‘wasting’ the precision of the low-variance points and being ‘skewed’ by the high-variance points.
This is why OLS is no longer efficient: it does not produce the smallest possible standard errors.
Which means:

  • t-tests become unreliable.
  • p-values become misleading.
  • Confidence intervals are wrong.

👉 OLS is NO longer B.L.U.E. (Best Linear Unbiased Estimator).

  • While the coefficients remain unbiased,
    they are no longer the ‘best’ because there is another estimator (like Weighted Least Squares) that could provide a lower variance.

👉 How to Test for Heteroscedasticity ?

  • Visual (Residual Plot):
    • Heteroscedasticity: The points form a ‘fan’ or ‘funnel’ shape, widening or narrowing as values increase.
    • Homoscedasticity: The points look like a random ‘cloud’ with consistent thickness.
  • Breusch–Pagan Test
  • White Test
  • Goldfeld–Quandt Test
Explain the difference between likelihood and probability; explain how MAP differs from MLE with example.

Probability vs. Likelihood:
Difference lies in which variable is fixed and which is varying.

  • Probability(Forward View):
    • Quantifies the chance of observing a specific outcome given known, fixed parameters \(\theta\).
  • Likelihood(Backward/Inverse View):
    • Inverse concept used for inference (working backward from results to causes).
    • It is a function of the parameters \(\theta\) and measures how ‘likely’ a specific set of parameters makes the observed (fixed) data appear.

MLE vs. MAP:
Both help us answer the question:
Which parameter \(\theta\) best explains the data we just saw?

  • Maximum Likelihood Estimation (MLE):
    • MLE believes the data should speak for itself.
    • It asks: ‘Which value of \(\theta\) makes the observed data most probable?’
    • It ignores any outside context or common sense.
  • Maximum A Posteriori (MAP):
    • MAP believes the data is important, but so is prior knowledge.
    • It asks: ‘Given the data AND what we already know about the problem at hand, which value of \(\theta\) is most likely?’

The relationship between them is rooted in Bayes’ Theorem:

\[P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) P(\theta)}{P(\text{data})}\]
  • MLE maximizes only the Likelihood: \(P(\text{data} \mid \theta)\)
  • MAP maximizes the Posterior: \(P(\text{data} \mid \theta) P(\theta)\)

Note:

  1. P(data) is a constant (a ’normalizing factor’), so we ignore it during maximization.
  2. For a prior with uniform distribution where every value is equally likely, MAP becomes MLE.

👉 Coin Toss Example:

  • Data: Toss the coin 3 times, 2H + 1T.
    • MLE: Probability of heads (\(\theta\))
      • \(\theta_{MLE}\) = 2/3 = 0.67
    • MAP (with prior belief that coin is fair)
      • Assume prior: \(\theta \sim \beta(10,10)\)
      • Posterior = \(\beta(12,11)\)
      • \(\theta_{MAP}\) = 0.52 (Prior pulls estimate towards 0.5)
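The coin-toss example can be checked in code. This sketch relies on the standard conjugacy fact that a Bernoulli likelihood with a Beta(a, b) prior yields a Beta(a + heads, b + tails) posterior, whose mode (for a, b > 1) is (a − 1)/(a + b − 2):

```python
heads, tails = 2, 1                # observed data: 2H + 1T

# MLE: the data speaks for itself.
theta_mle = heads / (heads + tails)

# MAP: Beta(10, 10) prior encodes a fair-coin belief.
a, b = 10, 10
a_post, b_post = a + heads, b + tails             # posterior Beta(12, 11)
theta_map = (a_post - 1) / (a_post + b_post - 2)  # mode of the posterior

print(round(theta_mle, 2), round(theta_map, 2))  # 0.67 0.52
```

The prior pulls the MAP estimate toward 0.5, exactly as stated above.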

✅ Use MLE when:

  • large dataset.
  • no reliable prior knowledge.

✅ Use MAP when:

  • small dataset.
  • reliable prior/domain knowledge.

Read more about MLE & MAP

How is entropy defined for a binary split? Derive information gain and show how it is used to choose a decision tree split.

Entropy (H) is a measure of impurity or randomness in a dataset.

\[H(S)=-\sum _{i=1}^{n}p_{i}\log(p_{i})\]

For binary classification, where the outcome is Yes/No, 0/1 etc., entropy will be:

\[H(S) = -p \log_2(p) - (1-p) \log_2(1-p)\]
  • Max Entropy: H(S) = 1, when the classes are split 50/50 (maximum uncertainty).
  • Min Entropy: H(S) = 0 when the set is pure (all examples belong to one class).

Information Gain:
️Measures the reduction in entropy (uncertainty) achieved by splitting a dataset based on a specific attribute.

\[ IG=Entropy(Parent)-\left[\frac{N_{left}}{N_{parent}}Entropy(Child_{left})+\frac{N_{right}}{N_{parent}}Entropy(Child_{right})\right] \]

Note: The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

👉 To understand how a Decision Tree selects the ‘best’ root node, let’s use the example below:

The Dataset: “Will they buy the product?”

| ID | Age    | Income | Credit Score | Buy? (Target) |
| 1  | Youth  | High   | Good         | No            |
| 2  | Youth  | High   | Excellent    | No            |
| 3  | Middle | High   | Good         | Yes           |
| 4  | Senior | Medium | Good         | Yes           |
| 5  | Senior | Low    | Good         | Yes           |
| 6  | Senior | Low    | Excellent    | No            |
  1. Calculate parent node’s entropy:
  • P(yes) = P(no) = 3/6 = 0.5
  • \(H(Parent) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = \mathbf{1.0}\)
  2. Evaluate feature ‘Age’:
  • Youth: 2 samples (0 Yes, 2 No). \(H(Youth) = \mathbf{0}\)
  • Middle: 1 sample (1 Yes, 0 No). \(H(Middle) = \mathbf{0}\)
  • Senior: 3 samples (2 Yes, 1 No). \(H(Senior) = -(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}) \approx \mathbf{0.918}\)
  • Weighted Entropy for Age:
    • \((\frac{2}{6} \times 0) + (\frac{1}{6} \times 0) + (\frac{3}{6} \times 0.918) = \mathbf{0.459}\)
  • Information Gain (Age): \(1.0 - 0.459 = \mathbf{0.541}\)
  3. Evaluate feature ‘Income’:
  • High: 3 samples (1 Yes, 2 No). \(H(High) \approx \mathbf{0.918}\)
  • Medium: 1 sample (1 Yes, 0 No). \(H(Medium) = \mathbf{0}\)
  • Low: 2 samples (1 Yes, 1 No). \(H(Low) = \mathbf{1.0}\)
  • Weighted Entropy for Income:
    • \((\frac{3}{6} \times 0.918) + (\frac{1}{6} \times 0) + (\frac{2}{6} \times 1.0) = \mathbf{0.792}\)
  • Information Gain (Income): \(1.0 - 0.792 = \mathbf{0.208}\)
  4. Evaluate feature ‘Credit Score’:
  • Good: 4 samples (3 Yes, 1 No). \(H(Good) \approx \mathbf{0.811}\)
  • Excellent: 2 samples (0 Yes, 2 No). \(H(Excellent) = \mathbf{0}\)
  • Weighted Entropy for Credit Score:
    • \((\frac{4}{6} \times 0.811) + (\frac{2}{6} \times 0) = \mathbf{0.541}\)
  • Information Gain (Credit Score): \(1.0 - 0.541 = \mathbf{0.459}\)
  5. The Decision Tree algorithm compares the information gain for all the features, and splits on the feature with maximum information gain.
  • Here in our case it is “Age”, IG = 0.541.
  • The algorithm chooses ‘Age’ as the root node.
  • Splits the data into three branches (Youth, Middle, Senior).
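The entropy and information-gain calculation for the ‘Age’ split can be verified in code; the class counts below are taken directly from the worked example:

```python
import math

# Binary entropy from positive/negative class counts.
def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

# "Will they buy?" dataset grouped by Age:
# Youth (0 Yes, 2 No), Middle (1, 0), Senior (2, 1).
groups = [(0, 2), (1, 0), (2, 1)]
n = sum(p + q for p, q in groups)   # 6 samples

h_parent = entropy(3, 3)            # 1.0 (50/50 split)
weighted = sum((p + q) / n * entropy(p, q) for p, q in groups)
ig_age = h_parent - weighted

print(round(ig_age, 3))  # 0.541
```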



End of Section

2 - Maths

Mathematics for AI & ML

Maths for AI & ML

This sheet contains all the topics that will be covered for Maths for AI & ML.

2.1 - Probability

Probability for AI & ML

2.1.1 - Introduction to Probability

Introduction to Probability


Why do we need to understand what is Probability?

Because the world around us is very uncertain, and Probability acts as -
the fundamental language to understand, express and deal with this uncertainty.

  1. Toss a fair coin, \(P(H) = P(T) = 1/2\)
  2. Roll a die, \(P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6\)
  3. Email classifier, \(P(spam) = 0.95 ,~ P(not ~ spam) = 0.05\)

Probability
Numerical measure of chance or likelihood that an event will occur.
Range: \([0,1]\)
\(P=0\): Highly unlikely
\(P=1\): Almost certain
Sample Space
Set of all possible outcomes of an experiment.
Symbol: \(\Omega\)
\(P(\Omega) = 1\)
Example
  1. Toss a fair coin, sample space: \(\Omega = \{H,T\}\)
  2. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
  3. Choose a real number \(x\) from the interval \([2,3]\), sample space: \(\Omega = [2,3]\); sample size = \(\infty\)
    Note: There can be infinitely many points between 2 and 3, e.g: 2.21, 2.211, 2.2111, 2.21111, …
  4. Randomly put a point in a rectangular region; sample size = \(\infty\)
    Note: There can be infinitely many points in any rectangular region.
Event
An outcome of an experiment. A subset of all possible outcomes.
A,B,…⊆Ω
Example
  1. Toss a fair coin, set of possible outcomes: \(\{H,T\}\)
  2. Roll a die, set of possible outcomes: \(\{1,2,3,4,5,6\}\)
  3. Roll a die, event \(A = \{1,2\} => P(A) = 2/6 = 1/3\)
  4. Email classifier, set of possible outcomes: \(\{spam,not ~spam\}\).

Discrete
Number of potential outcomes from an experiment is countable, distinct, or can be listed in a sequence, even if infinite i.e countably infinite.
Example
  1. Toss a fair coin, possible outcomes: \(\Omega = \{H,T\}\)
  2. Roll a die, possible outcomes: \(\Omega = \{1,2,3,4,5,6\}\)
  3. Choose a real number \(x\) from the interval \([2,3]\) with 2-decimal precision; sample space: \(\Omega = \{2.01, 2.02, \dots, 2.99\}\).
    Note: There are 99 real numbers strictly between 2 and 3 with 2-decimal precision, i.e., from 2.01 to 2.99.
  4. Number of cars passing a specific traffic signal in 1 hour.
Continuous
Potential outcomes from an experiment can take any value within a given range or interval, representing an uncountably infinite set of possibilities.
Example
  1. A line segment between 2 and 3 - forms a continuum.
  2. Randomly put a point in a rectangular region.

graph TD
    A[Sample Space] --> |Discrete| B(Finite)
    A --> C(Infinite)
    C --> |Discrete| D(Countable)
    C --> |Continuous| E(Uncountable)
Mutually Exclusive (Disjoint) Events
Two or more events that cannot happen at the same time.
No overlapping or common outcomes.
If one event occurs, then the other event does NOT occur.
Example
  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Odd outcome = \(A = \{1,3,5\}\)
    Even outcome = \(B = \{2,4,6\}\) are mutually exclusive.

    \(P(A \cap B) = 0\)
    Since, \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
    Therefore, \(P(A \cup B) = P(A) + P(B)\)

Note: If we know that event \(A\) has occurred, then we can say for sure that the event \(B\) did NOT occur.

Independent Events
Two events are independent if the occurrence of one event does NOT impact the outcome of the other event.
Example
  1. Roll a die twice , sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Odd number in 1st throw = \(A = \{1,3,5\}\)
    Odd number in 2nd throw = \(B = \{1,3,5\}\)
    Note: A and B are independent because whether we get an odd number in the 1st roll has NO impact on getting an odd number in the 2nd roll.

    \(P(A \cap B) = P(A)*P(B)\)

Note: If we know that event \(A\) has occurred, then that gives us NO new information about the event \(B\).

Does \(P=0\) mean that the event is impossible or improbable ?
No, it means that the event is highly unlikely to occur.

Let’s understand this answer with an example.

What is the probability of choosing a real number, say 2.5, from the interval \([2,3]\) ?

Probability of choosing exactly one point on the number line or a real number, say 2.5,
from the interval \([2,3]\) is almost = 0, because there are infinitely many points between 2 and 3.

Also, we can NOT say that choosing exactly 2.5 is impossible, because it exists there on the number line.
But, for all practical purposes, \(P(2.5) = 0\).

Therefore, we say that \(P=0\) means “Highly Unlikely” and NOT “Impossible”.

Extending this line of reasoning, we can say that the probability of NOT choosing 2.5 is \(P(\neg 2.5) = 1\).
Theoretically this holds, because there are infinitely many points between 2 and 3.
But we cannot say for sure that 2.5 will never be chosen.
There is some probability of choosing 2.5, but it is vanishingly small.

Therefore, we say that \(P=1\) means “Almost Sure” and NOT “Certain”.

Note
Now, lets also see another example where \(P=0\) means Impossible and \(P=1\) means Certain.
What is the probability of getting a 7 when we roll a 6 faced die ?

Here, in this case we can say that \(P(7)=0\) and that means Impossible.

Similarly, we can say that \(P(\text{get any number between 1 and 6}) = 1\), and \(P=1 \implies\) Certain.


End of Introduction

2.1.2 - Conditional Probability

Conditional Probability & Bayes Theorem


Conditional Probability

It is the probability of an event occurring, given that another event has already occurred.
Allows us to update probability when additional information is revealed.

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]
Chain Rule
\(P(A \cap B) = P(B)*P(A \mid B)\) or
\(P(A \cap B) = P(A)*P(B \mid A)\)
Example
  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
    Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
\[ \begin{aligned} \because (A \cap B) = \{5\} & \implies P(A \cap B) = 1/6 \\ P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{1/6}{1/2} \\ \implies P(A \mid B)&= 1/3 \end{aligned} \]
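The same result can be checked by brute-force enumeration; a minimal sketch, not part of the notes:

```python
from fractions import Fraction

# Sample space of a single die roll and the two events from the example.
omega = {1, 2, 3, 4, 5, 6}
A = {5}            # get a 5
B = {1, 3, 5}      # get an odd number

def prob(event):
    # Classical probability: favourable outcomes / total outcomes
    return Fraction(len(event), len(omega))

p_a_given_b = prob(A & B) / prob(B)   # P(A|B) = P(A ∩ B) / P(B)
print(p_a_given_b)  # 1/3
```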
Bayes' Theorem

It is a formula that uses conditional probability.
It allows us to update our belief about an event’s probability based on new evidence.
We know from conditional probability and chain rule that:

$$ \begin{aligned} P(A \cap B) = P(B)*P(A \mid B) \\ P(A \cap B) = P(A)*P(B \mid A) \\ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \end{aligned} $$

Combining all the above equations gives us the Bayes’ Theorem:

$$ \begin{aligned} P(A \mid B) = \frac{P(A)*P(B \mid A)}{P(B)} \end{aligned} $$
images/maths/probability/bayes_theorem.png
images/maths/probability/bayes_likelihood.png
Example
  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
    Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
    Task: Find the probability of getting a 5 given that you rolled an odd number.

\(P(B \mid A) = 1\) = Probability of getting an odd number given that we have rolled a 5.

\[ \begin{aligned} P(A \mid B) &= \frac{P(A) * P(B \mid A)}{P(B)} \\ &= \frac{1/6 * 1}{1/2} \\ &= 1/3 \end{aligned} \]

Now, let’s understand another concept called Law of Total Probability.
Here, we can say that the sample space \(\Omega\) is divided into 2 parts - \(A\) and \(A ^ \complement \)

So, the probability of an event \(B\) is given by:

\[ B = (B \cap A) \cup (B \cap A ^ \complement) \\ P(B) = P(B \cap A) + P(B \cap A ^ \complement ) \\ \text{By Chain Rule: } P(B) = P(A)*P(B \mid A) + P(A ^ \complement )*P(B \mid A ^ \complement ) \]
What if the sample space is divided into ’n’ such partitions ?
Law of Total Probability

Overall probability of an event B, considering all the different, mutually exclusive ways it can occur.
If A₁, A₂, …, Aₙ are a set of events that partition the sample space, such that they are -

  • Mutually exclusive: \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)
  • Exhaustive: \(A_1 \cup A_2 \cup ... \cup A_n = \Omega\) \[ P(B) = \sum_{i=1}^{n} P(A_i)*P(B \mid A_i) \] where \(n\) is the number of mutually exclusive partitions of the sample space \(\Omega\).

Now, we can also generalize the Bayes’ Theorem using the Law of Total Probability.

Generalised Bayes' Theorem
\[ P(A_i \mid B) = \frac{P(A_i)*P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)*P(B \mid A_j)} \]
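The Law of Total Probability and the generalised Bayes' theorem fit in a few lines of code. The numbers below are made up purely for illustration: two factories \(A_1, A_2\) produce 60% and 40% of all items, with defect rates of 1% and 5% respectively.

```python
from fractions import Fraction

# Hypothetical partition: P(A1) = 0.6, P(A2) = 0.4 (the factories),
# with likelihoods P(B | Ai) for B = "item is defective".
priors = [Fraction(3, 5), Fraction(2, 5)]
likelihoods = [Fraction(1, 100), Fraction(5, 100)]

# Law of Total Probability: P(B) = sum_i P(Ai) * P(B | Ai)
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# Generalised Bayes: P(A2 | B) = P(A2) * P(B | A2) / P(B)
posterior_a2 = priors[1] * likelihoods[1] / p_b
print(p_b, posterior_a2)  # 13/500 10/13
```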


End of Section

2.1.3 - Independence of Events

Independence of Events


Independence of Events

Two events are independent if the occurrence of one event does not affect the probability of the other event.
There are 3 types of independence of events:

  • Mutual Independence
  • Pair-Wise Independence
  • Conditional Independence
Mutual Independence

Joint probability of two events is equal to the product of the individual probabilities of the two events.
\(P(A \cap B) = P(A)*P(B)\)

Joint probability: The probability of two or more events occurring simultaneously.
\(P(A \cap B)\) or \(P(A, B)\)

Example
  1. Toss a coin and roll a die -
    \(A\) = Get a heads; \(P(A)=1/2\)
    \(B\) = Get an odd number; \(P(B)=1/2\)
\[ \begin{aligned} P(A \cap B) &= P(\text{Heads and Odd}) \\ &= \frac{1}{2} * \frac{1}{2} \\ &= \frac{1}{4} \\ \\ \text{also } P(A) * P(B) &= \frac{1}{2} * \frac{1}{2} \\ &= \frac{1}{4} \end{aligned} \]

=> A and B are mutually independent.

Pair-Wise Independence
Every pair of events in the set is independent.
Pair-wise independence != Mutual independence.
Example
  1. Toss 3 coins;
    For any pair of tosses, the sample space is: \(\Omega = \{HH,HT, TH, TT\}\)
    \(A\) = First and Second toss outcomes are same i.e \(\{HH, TT\}\); \(P(A)= 2/4 = 1/2\)
    \(B\) = Second and Third toss outcomes are same i.e \(\{HH, TT\}\); \(P(B)= 2/4 = 1/2\)
    \(C\) = Third and First toss outcomes are same i.e \(\{HH, TT\}\); \(P(C)= 2/4 = 1/2\)

Now, to check pair-wise independence of the above events A & B, consider \(P(A \cap B)\):
\(A \cap B\) => Outcomes of first and second toss are same &
outcomes of second and third toss are same.
=> Outcomes of all the three tosses are same.

Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
=> \(P(A \cap B) = 2/8 = 1/4 = P(A) * P(B) = 1/2 * 1/2 = 1/4\)

Therefore, \(A\) and \(B\) are pair-wise independent.
Similarly, we can also prove that \(A\) and \(C\) and \(B\) and \(C\) are also pair-wise independent.

Now, let’s check for mutual independence of the above events A, B & C.
\(P(A \cap B \cap C) = P(A)*P(B)*P(C)\)
\(P(A \cap B \cap C)\) = Outcomes of all the three tosses are same i.e \(\{HHH, TTT\}\)
Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
So, \(P(A \cap B \cap C)\) = 2/8 = 1/4
But, \(P(A)*P(B)*P(C) = 1/2*1/2*1/2 = 1/8\)
Therefore \(P(A \cap B \cap C)\) ≠ \(P(A)*P(B)*P(C)\)
=> \(A, B, C\) are NOT mutually independent but only pair wise independent.
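The whole argument above can be verified by enumerating all 8 outcomes; a small sketch, not part of the notes:

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))    # all 8 outcomes of 3 tosses

A = {w for w in omega if w[0] == w[1]}   # 1st and 2nd toss same
B = {w for w in omega if w[1] == w[2]}   # 2nd and 3rd toss same
C = {w for w in omega if w[2] == w[0]}   # 3rd and 1st toss same

def prob(event):
    return Fraction(len(event), len(omega))

pairwise = prob(A & B) == prob(A) * prob(B)                 # 1/4 == 1/2 * 1/2
mutual = prob(A & B & C) == prob(A) * prob(B) * prob(C)     # 1/4 != 1/8
print(pairwise, mutual)  # True False
```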

Conditional Independence
Two events A & B are conditionally independent given a third event C,
if they are independent given that C has occurred.
Occurrence of C changes the context, causing the events A & B to become independent of each other.
Example
images/maths/probability/conditional_independence.png
\[ \begin{aligned} A = 10 ,~ B = 10 ,~ C = 20 ~and~ \Omega = 50 \\ P(A) = 10/50 = 1/5 \\ P(B) = 10/50 = 1/5 \\ P(A) * P(B) = 1/5*1/5 =1/25 \\ P(A \cap B) = 3/50 \\ \text{clearly, } P(A \cap B) \neq P(A) * P(B) \\ \end{aligned} \]

=> A & B are NOT independent.
Now, let’s check for conditional independence of A & B given C.

\[ \begin{aligned} P(A \mid C) &= \frac{P(A \cap C)}{P(C)} = 4/20 = 1/5 \\ P(B \mid C) &= \frac{P(B \cap C)}{P(C)} = 5/20 = 1/4 \\ P(A \mid C) * P(B \mid C) &= 1/5 * 1/4 = 1/20 \\ P(A \cap B \mid C) &= \frac{P(A \cap B \cap C)}{P(C)} = 1/20 \\ \text{clearly, } P(A \cap B \mid C) &= P(A \mid C)*P(B \mid C) \\ \end{aligned} \]

Therefore, A & B are conditionally independent given C.


End of Section

2.1.4 - Cumulative Distribution Function

Cumulative Distribution Function of a Random Variable


Random Variable(RV)
A random variable is a function that maps the outcomes of a sample space to a real number.
Random Variable X is represented as, \(X: \Omega \to \mathbb{R} \)
👉 It maps abstract outcomes of a random experiment to concrete numerical values required for mathematical analysis.
Example
  1. Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
    The above random experiment of coin tosses can be mapped to a random variable \(X: \Omega \to \mathbb{R} \)
    \(X: \{HH,HT, TH, TT\} \to \mathbb{R} \)
    Say, if we count the number of heads, then
    \[ \begin{aligned} X(TT) &= 0 \\ X(HT) = X(TH) &= 1 \\ X(HH) &= 2 \\ \end{aligned} \] A similar mapping is obtained if we count the number of tails.

Depending upon the nature of output, random variables are of 2 types - Discrete and Continuous.

Discrete Random Variable
A random variable whose possible outcomes are finite or countably infinite.
Typically obtained by counting.
Discrete random variable cannot take any value between 2 consecutive values.
Example
👉 The number of heads in 2 coin tosses can be 0, 1 or 2 but NOT 1.5.
Continuous Random Variable
A random variable that can take any value between a given range/interval.
Possible outcomes are infinite.
Example
  1. A person’s height in a given range of say 150cm-200cm.
    Height can take any value, not just round values, e.g: 150.1cm, 167.95cm, 180.123cm etc.

Now that we have seen how a random variable maps the outcomes of an abstract random experiment to real values for mathematical analysis, let us look at its applications.

How to calculate the probability of a random variable?
Probability of a random variable is given by something called - Cumulative Distribution Function (CDF).
Cumulative Distribution Function(CDF)
It gives the cumulative probability of a random variable \(X\).
CDF = \(F(X) = P(X \leq x)\) i.e. the probability that the random variable \(X\) takes a value \(\le x\).
Example
  1. Discrete random variable - Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
    Count the number of heads.
    \[ \begin{aligned} P(X = 0) = P(\{TT\}) = 1/4 \\ P(X = 1) = P(\{HT, TH\}) = 1/2 \\ P(X = 2) = P(\{HH\}) = 1/4 \\ \\ CDF = F(X) = P(X \leq x) \\ F(0) = P(X \leq 0) = P(X < 0) + P(X = 0) = 1/4 \\ F(1) = P(X \leq 1) = P(X < 1) + P(X = 1) = 1/4 + 1/2 = 3/4 \\ F(2) = P(X \leq 2) = P(X < 2) + P(X = 2) = 3/4 + 1/4 = 1 \\ \end{aligned} \]
images/maths/probability/cdf_example_1.png
  1. Continuous random variable - Consider a line segment/interval from \(\Omega = [0,2] \)
    Random variable \(X(\omega) = \omega\)
    i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)
    \[ \begin{aligned} P[(0,1/2)] = (1/2)/2 = 1/4 \\ P[(3/4, 2)] = (2-3/4)/2 = 5/8 \\ P(X \le 1.1) = P[(0, 1.1)] = 1.1/2 \end{aligned} \] \[ F_X(x) = P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } x \in [0,2] \\ 1 & \text{if } x > 2 \\ 0 & \text{if } x < 0 \end{cases} \]
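The discrete coin-toss CDF above can also be computed programmatically from the PMF; a minimal sketch, not part of the notes:

```python
from fractions import Fraction

# PMF of "number of heads in 2 tosses" from the example above.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    # F(x) = P(X <= x): sum the PMF over all support points <= x
    return sum(p for xi, p in pmf.items() if xi <= x)

print(cdf(0), cdf(1), cdf(2))  # 1/4 3/4 1
```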
Key properties of CDF
  1. Non-Decreasing:
    For any 2 values \(x_1\) and \(x_2\) such that \(x_1 \leq x_2\), corresponding CDF must satisfy -
    \(F(x_1) \leq F(x_2)\)
    Note: Cumulative function can never decrease as x increases.

  2. Bounded:
    Range of CDF is always between 0 and 1, because CDF represents total probability, which cannot be negative or greater than 1.

    \(\lim_{x \to -\infty} F(X) = 0\); as \(x \to -\infty\), event \((X \le x)\) becomes an impossible event
    i.e \(P(X \le x) =0\)
    \(\lim_{x \to \infty} F(X) = 1\); as \(x \to \infty\), event \((X \le x)\) includes all possible outcomes of the event, making sure that \(P(X \le x) =1\)

  3. Right Continuous:
    Function’s value at a point is same as the limit of the function, as we approach that point from right hand side (RHS).
    Note: The CDF is only right continuous because of how it is defined with \(\le\); there may be a jump immediately to the left of a point.

    \(\lim_{h \to 0^+} F(x+h) = F(x)\)

    For example:
    In the above coin toss CDF example, if we approach X=1 from right, say 1.001, 1.01 etc, the value of \(F(X) = P(X \leq x) = F(1) = 3/4\).
    But, if we approach \(X=1\) from left, say 0.99, 0.999 etc, the value of \(F(X) = 1/4\), as these values do NOT yet include the value of the probability at \(X=1\).
Discrete Case
For a discrete random variable, the CDF is a step function (i.e with jumps).
Value of the probability of a random variable X, at any given value x, is calculated by summing up all the probabilities for values \(\le x\).

\( CDF = F_X(x) = \sum_{x_i \le x} P(X=x_i) \), where \(P(X=x_i)\) is the Probability Mass Function (PMF) at \(x_i\)

In the above coin toss example -
Height of the jump at ‘x’ = Probability at that value ‘x’.
e.g: Jump at (x=1) = 1/2 = Probability at (x=1).
Continuous Case
For a continuous random variable, the CDF is a continuous function.
CDF for continuous random variable is calculated by integrating the probability density function (PDF) from \(-\infty\) to the given value \(x\).

\( CDF = F_X(x) = \int_{-\infty}^{x} f(x) \,dx \), where \(f(x)\) is the Probability Density Function (PDF) of random variable.

Note: We can also say that PDF is the derivative of CDF for continuous random variable.

\(PDF = f(x) = F'(X) = \frac{dF_X(x)}{dx} \)

Read more about Integration



End of Section

2.1.5 - Probability Mass Function

Probability Mass Function of a Discrete Random Variable


Probability Mass Function(PMF)
It gives the exact value of a probability for a discrete random variable at a specific value \(x\).
It assigns a “non-zero” mass or probability to a specific countable outcome.
Note: Called ‘mass’ because probability is concentrated at a single discrete point.
\(PMF = P(X=x)\)
e.g: Bernoulli, Binomial, Multinomial, Poisson etc.

Commonly visualised as a bar chart.
Note: PMF = Jump at a given point in CDF.

\(PMF = p_X(x_i) = F_X(x_i) - F_X(x_{i-1})\)
images/maths/probability/cdf_example_1.png
Key properties of PMF
  1. Non-Negative: Probability of any value ‘x’ must be non-negative i.e \(P(X=x) \ge 0 ~\forall x~\).
  2. Sum = 1: Sum of probabilities of all possible outcomes must be 1.
    \( \sum_{x} P(X=x) = 1 \)
  3. For any value that the discrete random variable can NOT take, the probability must be zero.
Bernoulli Distribution
It models a single event with two possible outcomes, success (1) or failure (0), with a fixed probability of success, ‘p’.
p = Probability of success
1-p = Probability of failure
Mean = p
Variance = p(1-p)
Note: A single trial that adheres to these conditions is called a Bernoulli trial.
\(PMF, P(x) = p^x(1-p)^{1-x}\), where \(x \in \{0,1\}\)
Example
  1. Toss a coin, we get heads or tails.
  2. Result of a test, pass or fail.
  3. Machine learning, binary classification model.

Binomial Distribution

It extends the Bernoulli distribution by modeling the number of successes that occur over a fixed number of independent trials.
n = Number of trials
k = Number of successes
p = Probability of success
Mean = np
Variance = np(1-p)

\(PMF, P(x=k) = \binom{n}{k}p^k(1-p)^{n-k}\), where \(k \in \{0,1,2,3,...,n\}\)
\(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) i.e number of ways to achieve ‘k’ successes in ‘n’ independent trials.

Note: Bernoulli is a special case of Binomial distribution where n = 1.
Also read about Multinomial distribution i.e where number of possible outcomes is > 2.

Example
  • Counting number of heads(success) in ’n’ coin tosses.

Assumptions
  1. Trials are independent.
  2. Probability of success remains constant for every trial.

images/maths/probability/binomial.png

What is the probability of getting exactly 2 heads in 3 coin tosses?

Total number of outcomes in 3 coin tosses = 2^3 = 8
Desired outcomes i.e 2 heads in 3 coin tosses = \(\{HHT, HTH, THH\}\) = 3
Probability of getting exactly 2 heads in 3 coin tosses = \(\frac{3}{8}\) = 0.375
Now lets solve the question using the binomial distribution formula.

\[P(k=2) = \binom{3}{2}p^2(1-p)^{3-2} \\ = \frac{3!}{2!(3-2)!}(0.5)^2(0.5) \\ = 3*0.25*0.5 = 3*0.125 = 0.375\]



What is the probability of winning a lottery 1 out of 10 times, given that the probability of winning a single lottery = 1/3?

Number of successes, k = 1
Number of trials, n = 10
Probability of success, p = 1/3
Probability of winning lottery, P(k=1) =

\[\binom{10}{1}p^1(1-p)^{10-1} \\ = \frac{10!}{1!(10-1)!}(1/3)^1(2/3)^9 \\ = 10*0.333*0.026 = 0.0866 \approx 8.66\% \]
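Both worked examples can be verified with a few lines using `math.comb`; an illustrative check, not part of the notes:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(2, 3, 0.5))      # 0.375  (2 heads in 3 tosses)
print(binomial_pmf(1, 10, 1 / 3))   # ~0.0867 (1 lottery win in 10 tries)
```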


Poisson Distribution

It expresses the probability of an event happening a certain number of times ‘k’ within a fixed interval of time.
Given that:

  1. Events occur with a known constant average rate.
  2. Occurrence of an event is independent of the time since the last event.

    Parameters:
    \(\lambda\): Expected number of events per interval
    \(k\) = Number of events in the same interval
    Mean = \(\lambda\)
    Variance = \(\lambda\)

PMF = Probability of occurrence of ‘k’ events in the same interval

\[PMF = \lambda^ke^{-\lambda}/k!\]

Note: Useful for count data where the total population size is large but the probability of an individual event is small.

Example
  1. Model the number of customer arrivals at a service center per hour.
  2. Number of website clicks in a given time period.

images/maths/probability/poisson_pmf.png
PMF of Poisson Distribution

A company receives, on an average, 5 customer emails per hour. What is the probability of receiving exactly 3 emails in the next hour?

Expected (average) number of emails per hour, \(\lambda\) = 5
Probability of receiving exactly k=3 emails in the next hour =

\[P(k=3) = \lambda^3e^{-\lambda} / 3! \\ = 5^3e^{-5} / 3! = 125*e^{-5} / 6 \\ = 125*0.00674 / 6 \approx 0.14 ~or~ 14\% \]
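The email example can be checked numerically; a small sketch using only the standard library, not part of the notes:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = lambda^k * e^(-lambda) / k!
    return lam**k * exp(-lam) / factorial(k)

print(poisson_pmf(3, 5))  # ~0.1404, i.e. about 14%
```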




End of Section

2.1.6 - Probability Density Function

Probability Density Function of a Continuous Random Variable


Probability Density Function(PDF)

This is a function used for continuous random variables to describe the likelihood of the variable taking on a value within a specific range or interval.
Since, at any given point the probability of a continuous random variable is zero, we find the probability within a given range.
Note: Called ‘density’ because probability is spread continuously over a range of values rather than being concentrated at a single point as in PMF.
e.g: Uniform, Gaussian, Exponential, etc.

Note: PDF is a continuous function \(f(x)\).
It is also the derivative of Cumulative Distribution Function (CDF) \(F_X(x)\)


\(PDF = f(x) = F'(X) = \frac{dF_X(x)}{dx} \)

Key properties of PDF
  1. Non-Negative: Function must be non-negative everywhere i.e \(f(x) \ge 0 \forall x\).
  2. Sum = 1: Total area under curve must be equal to 1.
    \( \int_{-\infty}^{\infty} f(x) \,dx = 1\)
  3. Probability of a continuous random variable in the range [a,b] is given by -
    \( P(a \le x \le b) = \int_{a}^{b} f(x) \,dx\)

Read more about Integration

Note
We use a general term Probability Distribution Function for both PMF(discrete) and PDF(continuous) because both describe how the probability is distributed across a random variable’s entire domain.
Example

Consider a line segment/interval from \(\Omega = [0,2] \)
Random variable \(X(\omega) = \omega\)
i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)

\[ F_X(x) = P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } x \in [0,2] \\ 1 & \text{if } x > 2 \\ 0 & \text{if } x < 0 \end{cases} \]


\[ \begin{aligned} PDF = f_X(x) = \frac{dF_X(x)}{dx} \\ \end{aligned} \]

\[ \text{PDF } = f_X(x) = \begin{cases} \dfrac{1}{2}, & x \in [0,2] \\ 0, & \text{otherwise.} \end{cases} \]
images/maths/probability/pdf_uniform.png

Note: If we know the PDF of a continuous random variable, then we can find the probability of any given region/interval by calculating the area under the curve.

Uniform Distribution

All the outcomes within the given range are equally likely to occur.
Also known as ‘fair’ distribution.
Note: This is a natural starting point to understand randomness in general.

\[ X \sim U(a,b) \]

$$ \begin{aligned} PDF = f(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$


Mean = Median = \( \frac{a+b}{2} ~if~ x \in [a,b] \)
Variance = \( \frac{(b-a)^2}{12} \)

Standard uniform distribution: \( X \sim U(0,1) \)

$$ \begin{aligned} PDF = f(x) = \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$

PDF in terms of mean(\(\mu\)) and standard deviation(\(\sigma\)) -

$$ \begin{aligned} PDF = f(x) = \begin{cases} \frac{1}{2\sigma\sqrt{3}} & \text{if } \mu -\sigma\sqrt{3} \le x \le \mu + \sigma\sqrt{3}\\ \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$
Example
  • Random number generator that generates a random number between 0 and 1.

    images/maths/probability/uniform_pdf.png
PDF of Uniform Distribution
images/maths/probability/uniform_cdf.png
CDF of Uniform Distribution

Gaussian(Normal) Distribution

It is a continuous probability distribution characterized by a symmetric, bell-shaped curve: most data cluster around the central average, and the frequency of values decreases as they move away from the center.

  • Most outcomes are average; extremely low or extremely high values are rare.
  • Characterised by mean and standard deviation/variance.
  • Peak at mean = median, symmetric around mean.
    Note: Most important and widely used distribution.

    \[ X \sim N(\mu, \sigma^2) \] $$ \begin{aligned} PDF = f(x) = \dfrac{1}{\sqrt{2\pi}\sigma}e^{-\dfrac{(x-\mu)^2}{2\sigma^2}} \\ \end{aligned} $$
    Mean = \(\mu\)
    Variance = \(\sigma^2\)

Standard normal distribution:

\[ Z \sim N(0,1) ~i.e~ \mu = 0, \sigma^2 = 1 \]


Any normal distribution can be standardized using Z-score transformation:

$$ \begin{aligned} Z = \dfrac{X-\mu}{\sigma} \end{aligned} $$
Example
  • Human height, IQ scores, blood-pressure etc.
  • Measurement of errors in scientific experiments.
PDF of Gaussian Distribution
images/maths/probability/gaussian_pdf.png
CDF of Gaussian Distribution
images/maths/probability/gaussian_cdf.png
68-95-99.7 Rule
  • 68.27% of the data lie within 1 standard deviation of the mean i.e \(\mu \pm \sigma\)
  • 95.45% of the data lie within 2 standard deviations of the mean i.e \(\mu \pm 2\sigma\)
  • 99.73% of the data lie within 3 standard deviations of the mean i.e \(\mu \pm 3\sigma\)
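The rule can be verified with the standard normal CDF, \(\Phi(z) = \tfrac{1}{2}(1 + \operatorname{erf}(z/\sqrt{2}))\); a stdlib-only sketch (no SciPy), not part of the notes:

```python
from math import erf, sqrt

def phi(z):
    # CDF of the standard normal distribution N(0, 1)
    return (1 + erf(z / sqrt(2))) / 2

for k in (1, 2, 3):
    print(k, phi(k) - phi(-k))   # ~0.6827, ~0.9545, ~0.9973
```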
Exponential Distribution

It is used to model the amount of time until a specific event occurs.
Given that:

  1. Events occur with a known constant average rate.
  2. Occurrence of an event is independent of the time since the last event.

    Parameters:
    Rate parameter: \(\lambda\): Average number of events per unit time
    Scale parameter: \(\mu ~or~ \beta \): Mean time between events
    Mean = \(\frac{1}{\lambda}\)
    Variance = \(\frac{1}{\lambda^2}\)
    \[ \lambda = \dfrac{1}{\beta} = \dfrac{1}{\mu}\]
\[ \begin{aligned} PDF = f(x) = \lambda e^{-\lambda x} ~~~ \forall ~~~ x \ge 0 ~~~\&~~~ \lambda > 0 \\ CDF = F(x) = 1 - e^{-\lambda x} ~~~ \forall ~~~ x \ge 0 ~~~\&~~~ \lambda > 0 \end{aligned} \]
PDF of Exponential Distribution
images/maths/probability/exponential_pdf.png
CDF of Exponential Distribution
images/maths/probability/exponential_cdf.png
At a bank, a teller spends 4 minutes, on an average, with every customer. What is the probability that a randomly selected customer will be served in less than 3 minutes?

Mean time to serve 1 customer = \(\mu\) = 4 minutes
So, \(\lambda\) = average number of customers served per minute = \(1/\mu\) = 1/4 = 0.25 per minute
Probability to serve a customer in less than 3 minutes can be found using CDF -

\[ F(x) = P(X \le x) = 1 - e^{-\lambda x}\]


$$ \begin{aligned} P(X \leq 3) &= 1 - e^{-0.25*3} \\ &= 1 - e^{-0.75} \\ &= 1 - 0.47 \\ &\approx 0.53 \\ &\approx 53\% \end{aligned} $$

So, probability of a customer being served in less than 3 minutes is 53%(approx).


At a bank, a teller spends 4 minutes, on an average, with every customer. What is the probability that a randomly selected customer will be served in greater than 2 minutes?

\[ CDF = F(x) = P(X \le x) = 1 - e^{-\lambda x} \\ \text{Total probability} = P(X \le x) + P(X > x) = 1\\ => 1 - e^{-\lambda x} + P(X > x) = 1 \\ => P(X > x) = e^{-\lambda x} \]


In this case x = 2 minutes, and \(\lambda\) = 0.25 so,

\[ P(X > 2) = e^{-\lambda x} \\ = e^{-0.25*2} \\ = e^{-0.5} \\ = 0.6065 \\ \approx 60.65\% \]


So, probability of a customer being served in greater than 2 minutes is 60%(approx).
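Both teller questions reduce to one-line evaluations of the exponential CDF; an illustrative check, not part of the notes:

```python
from math import exp

lam = 1 / 4                      # 0.25 customers served per minute

def cdf(x):
    # F(x) = P(X <= x) = 1 - e^(-lambda * x)
    return 1 - exp(-lam * x)

p_under_3 = cdf(3)               # served in less than 3 minutes
p_over_2 = 1 - cdf(2)            # served in more than 2 minutes
print(p_under_3, p_over_2)       # ~0.528, ~0.607
```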


Memoryless Property of Exponential Distribution

Probability of waiting for an additional period of time for an event to occur is independent of how long you have already waited.
e.g: Lifetime of electronic items follow exponential distribution.

  • Probability of a computer part failing in the next 1 hour is the same regardless of whether it has been working for 1 day or 1 year or 5 years.

Note: Memoryless property makes exponential distribution particularly useful for -

  • Modeling systems that do not experience ‘wear and tear’; where failure is due to a constant random rate rather than degradation over time.
  • Also, useful for ‘reliability analysis’ of electronic systems where a ‘random failure’ model is more appropriate than a ‘wear out’ model.
Suppose, we know that an electronic item has lasted for time \(x>t\) days, then what is the probability that it will last for an additional time of ’s’ days ?

Task: We want to find the probability of the electronic item lasting for \(x > t+s \) days,
given that it has already lasted for \(x>t\) days.
This is a ‘conditional probability’.
Since,

\[ P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} \]


We want to find: \( P(X > t+s \mid X > t) = ~? \)

\[ P(X > t+s \mid X > t) = \dfrac{P(X > t+s ~and~ X > t)}{P(X > t)} \\ \text{Since, t+s > t, we can only consider t+s} \\ = \dfrac{P(X > t+s)}{P(X > t)} \\ = \dfrac{e^{-\lambda(t+s)}}{e^{-\lambda(t)}} \\ = e^{-\lambda(t) -\lambda(s) + \lambda(t)} \\ = e^{-\lambda(s)} \\ => \text{Independent of time 't'} \]

Hence, probability that the electronic item will survive for an additional time ’s’ days
is independent of the time ’t’ days it has already survived.
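The derivation can be confirmed numerically for any concrete \(t\) and \(s\); a quick sketch with an arbitrary rate, not part of the notes:

```python
from math import exp, isclose

lam = 0.25                        # arbitrary rate, chosen for illustration

def survival(x):
    # P(X > x) = e^(-lambda * x)
    return exp(-lam * x)

t, s = 8.0, 3.0
conditional = survival(t + s) / survival(t)   # P(X > t+s | X > t)
print(conditional, survival(s))               # equal: memoryless
```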


Relation of Exponential Distribution and Poisson Distribution

Poisson distribution models the number of events occurring in a fixed interval of time, given a constant average rate \(\lambda \).
Exponential distribution models the time interval between those successive events.

  • 2 faces of the same coin.
  • \(\lambda_{poisson}\) is identical to rate parameter \(\lambda_{exponential}\).

Note: If the number of events in a given interval follow a Poisson distribution, then the waiting time between those events will necessarily follow an Exponential distribution.

Lets see the proof for the above statement.

Poisson Case:
The probability of observing exactly ‘k’ events in a time interval of length ’t’
with an average rate of \( \lambda \) events per unit time is given by -

\[ PMF = P(X=k) = \dfrac{(\lambda t)^k e^{-\lambda t}}{k!} \]


The event that waiting time until next event > t, is same as observing ‘0’ events in that interval.
=> We can use the PMF of Poisson distribution with k=0.

\[ PMF = P(X=k=0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \]


Exponential Case:
Now, lets consider exponential distribution that models the waiting time ‘T’ until next event.
The event that waiting time ‘T’ > t, is same as observing ‘0’ events in that interval.

\[ P(T>t) = 1 - CDF = e^{-\lambda t} \]

Observation:
The exponential survival function \(P(T>t)\) follows directly from the Poisson probability of observing ‘0’ events in that interval.


Consider a machine that fails, on an average, every 20 hours. What is the probability of having NO failures in the next 10 hours?

Using Poisson:
Average failure rate, \(\lambda\) = 1/20 = 0.05
Time interval, t = 10 hours
Number of events, k = 0
Average number of events in interval, (\(\lambda t\)) = (1/20) * 10 = 0.5
Probability of having NO failures in the next 10 hours = ?

\[ P(X=0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \\ = e^{-0.5} \approx 0.6065 \\ \approx 60.65\%\]


Using Exponential:
What is the probability that wait time until next failure > 10 hours?
Waiting time, T > t = 10 hours.
Average number of failures per hour, \(\lambda\) = 1/20 per hour

\[ P(T>t) = 1 - CDF = e^{-\lambda t} \\ P(T>10) = e^{-(1/20) * 10} \\ = e^{-0.5} \approx 0.6065 \\ \approx 60.65\%\]


Therefore, we have seen that this problem can be solved using both Poisson and Exponential distribution.
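The equivalence can also be shown in code; both routes produce the same number (an illustrative sketch, not part of the notes):

```python
from math import exp

lam, t = 1 / 20, 10               # failure rate per hour, window of 10 hours

# Poisson view: probability of 0 events in [0, t]; k = 0, so k! = 1
poisson_zero = (lam * t) ** 0 * exp(-lam * t)

# Exponential view: probability the waiting time exceeds t
exponential_wait = exp(-lam * t)

print(poisson_zero, exponential_wait)  # both ~0.6065
```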



End of Section

2.1.7 - Expectation

Expectation of a Random Variable


Expectation

Long run average of the outcomes of a random experiment.
When we talk about ‘Expectation’ we mean - on an average.

  • Value we expect to get when we repeat an experiment multiple times and take an average of the results.
  • Measures the central tendency of a data distribution.
  • Similar to ‘Center of Mass’ in Physics.

Note: If the probability of each outcome is different, then we take a weighted average.

Discrete Case:
\(E[X] = \sum_{i=1}^{n} x_i.P(X = x_i) \)

Continuous Case:
\(E[X] = \int_{-\infty}^{\infty} x.f(x) dx \)

Let’s play a game where we flip a fair coin. If we get a head, then you win Rs. 100, and
if it’s a tail, then you lose Rs. 50. What is the expected value of the amount that you will win per toss?

Possible outcomes are: \(x_1 = 100 ,~ x_2 = -50 \)
Probability of each outcome is \(P(X = 100) = 0.5,~ P(X = -50) = 0.5 \)

\[ E[X] = \sum_{i=1}^{n} x_i.P(X = x_i) \\ = x_1.P(X = x_1) + x_2.P(X = x_2) \\ = 100*0.5 + (-50)*0.5 \\ = 50 - 25 \\ = 25 \]

Therefore, the expected value of the amount that you will win per toss is Rs. 25 in the long run.
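A quick Monte Carlo simulation illustrates the "long run average" interpretation; an illustrative sketch, not part of the notes:

```python
import random

# Simulate many tosses of the game: +100 for heads, -50 for tails.
random.seed(42)
n = 100_000
winnings = [100 if random.random() < 0.5 else -50 for _ in range(n)]

avg = sum(winnings) / n
print(avg)  # close to the expected value of 25
```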

What is the expected value of a continuous uniform random variable distributed over the interval [a,b]?

$$ \text{PDF of continuous uniform random variable } = \\ f_X(x) = \begin{cases} \dfrac{1}{b-a}, & x \in [a,b] \\ 0, & \text{otherwise.} \end{cases} $$\[ \text{Expected value of continuous uniform random variable } = \\ E[X] = \int_{-\infty}^{\infty} x.f(x) dx \\ = \int_{-\infty}^{a} x.f(x) dx + \int_{a}^{b} x.f(x) dx + \int_{b}^{\infty} x.f(x) dx\\[0.5em] \text{Since, the PDF is defined only in the range [a,b] } \\[0.5em] = 0 + \int_{a}^{b} x.f(x) dx + 0 \\ = \int_{a}^{b} x.f(x) dx \\ = \dfrac{1}{b-a} \int_{a}^{b} x dx \\[0.5em] = \dfrac{1}{b-a} (\frac{x^2}{2})_{a}^{b} \\[0.5em] = \dfrac{1}{b-a} * (\frac{b^2 - a^2}{2}) \\[0.5em] = \dfrac{1}{b-a} * \{\frac{(b+a)(b-a)}{2}\}\\ = \dfrac{b+a}{2} \]



Variance

It is the average of the squared differences from the mean.

  • Measures the spread or variability in the data distribution.
  • If the variance is low, then the data is clustered around the mean i.e less variability;
    and if the variance is high, then the data is widely spread out i.e high variability.

Variance in terms of expected value, where \(E[X] = \mu\), is given by:

\[ \begin{aligned} Var[X] &= E[(X - E[X])^2] \\ &= E[X^2 + E[X]^2 - 2XE[X]] \\ &= E[X^2] + E[X]^2 - 2E[X]E[X] \\ &= E[X^2] + E[X]^2 - 2E[X]^2 \\ Var[X] &= E[X^2] - E[X]^2 \\ \end{aligned} \]

Note: This is the computational formula for variance, as it is easier to calculate than the average of square distances from mean.

What is the variance of a continuous uniform random variable distributed over the interval [a,b]?

PDF of continuous uniform random variable =

$$ \begin{aligned} f_X(x) &= \begin{cases} \dfrac{1}{b-a}, & x \in [a,b] \\ 0, & \text{otherwise.} \end{cases} \\ \end{aligned} $$\[ \begin{aligned} \text{Expected value = mean } = \\[0.5em] E[X] = \dfrac{b+a}{2} \\[0.5em] Var[X] = E[X^2] - E[X]^2 \end{aligned} \]

We know \(E[X]\) already, now we will calculate \(E[X^2]\):

\[ \begin{aligned} E[X] &= \int_{-\infty}^{\infty} x.f(x) dx \\ E[X^2] &= \int_{-\infty}^{\infty} x^2.f(x) dx \\ &= \int_{-\infty}^{a} x^2f(x) dx + \int_{a}^{b} x^2f(x) dx + \int_{b}^{\infty} x^2f(x) dx\\ &= 0 + \int_{a}^{b} x^2f(x) dx + 0 \\ &= \int_{a}^{b} x^2f(x) dx \\ &= \dfrac{1}{b-a} \int_{a}^{b} x^2 dx \\ &= \dfrac{1}{b-a} * \{\frac{x^3}{3}\}_{a}^{b} \\ &= \dfrac{1}{b-a} * \{\frac{b^3 - a^3}{3}\} \\ &= \dfrac{1}{b-a} * \{\frac{(b-a)(b^2+ab+a^2)}{3}\} \\ E[X^2] &= \dfrac{b^2+ab+a^2}{3} \end{aligned} \]

Now, we know both \(E[X]\) and \(E[X^2]\), so we can calculate \(Var[X]\):

\[ \begin{aligned} Var[X] &= E[X^2] - E[X]^2 \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{(b+a)^2}{4} \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{b^2+2ab+a^2}{4} \\ &= \dfrac{4b^2+4ab+4a^2-3b^2-6ab-3a^2}{12} \\ &= \dfrac{b^2-2ab+a^2}{12} \\ Var[X]&= \dfrac{(b-a)^2}{12} \end{aligned} \]
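The closed forms \(E[X] = \frac{a+b}{2}\) and \(Var[X] = \frac{(b-a)^2}{12}\) can be sanity-checked with a simple midpoint Riemann sum; an illustrative sketch with arbitrary endpoints, not part of the notes:

```python
# Numerically integrate x*f(x) and x^2*f(x) for f(x) = 1/(b-a) on [a, b].
a, b = 2.0, 5.0
n = 100_000
dx = (b - a) / n
f = 1 / (b - a)

xs = [a + (i + 0.5) * dx for i in range(n)]   # midpoints of subintervals
e_x = sum(x * f * dx for x in xs)
e_x2 = sum(x * x * f * dx for x in xs)

var = e_x2 - e_x**2
print(e_x, var)  # ~3.5 (=(a+b)/2) and ~0.75 (=(b-a)^2/12)
```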
Co-Variance

It is the measure of how 2 variables X & Y vary together.
It gives the direction of the relationship between the variables.

\[ \begin{aligned} \text{Cov}(X, Y) &> 0 &&\Rightarrow \text{ } X \text{ and } Y \text{ increase or decrease together} \\[0.5em] \text{Cov}(X, Y) &= 0 &&\Rightarrow \text{ } \text{No linear relationship} \\[0.5em] \text{Cov}(X, Y) &< 0 &&\Rightarrow \text{ } \text{If } X \text{ increases, } Y \text{ decreases (and vice versa)} \end{aligned} \]

Note: For both direction as well as magnitude, we use Correlation.
Let’s use expectation to compute the co-variance of two random variables X & Y:

\[ \begin{aligned} Cov(X,Y) &= E[(X-E[X])(Y-E[Y])] \\ &\text{where, E[X] = mean of X and E[Y] = mean of Y} \\ & = E[XY - YE[X] -XE[Y] + E[X]E[Y]] \\ &= E[XY] - E[Y]E[X] - \cancel{E[X]E[Y]} + \cancel{E[X]E[Y]} \\ Cov(X,Y) &= E[XY] - E[X]E[Y] \\ \end{aligned} \]

Note:

  • In a multivariate setting, relationships between all pairs of random variables are summarized in a square symmetric matrix called the ‘Co-Variance Matrix’ \(\Sigma\).
  • Covariance of a random variable with self gives the variance, hence the diagonals of covariance matrix are variances.
\[ \begin{aligned} Cov(X,Y) &= E[XY] - E[X]E[Y] \\ Cov(X,X) &= E[X^2] - E[X]^2 = Var[X] \\ \end{aligned} \]
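As a quick numerical illustration of \(Cov(X,Y) = E[XY] - E[X]E[Y]\), here is a sketch on a small made-up paired dataset (the numbers are arbitrary), verifying that the shortcut formula matches the definition form:

```python
# Hypothetical paired observations (y tends to increase with x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.0]
n = len(xs)

e_x = sum(xs) / n
e_y = sum(ys) / n
e_xy = sum(x * y for x, y in zip(xs, ys)) / n

# Shortcut form: Cov(X, Y) = E[XY] - E[X]E[Y]
cov_shortcut = e_xy - e_x * e_y

# Definition form: E[(X - E[X])(Y - E[Y])] -- must agree with the shortcut.
cov_definition = sum((x - e_x) * (y - e_y) for x, y in zip(xs, ys)) / n

print(cov_shortcut, cov_definition)
```

A positive value here indicates X and Y increase together, matching the sign convention above.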



End of Section

2.1.8 - Moment Generating Function

Moment Generating Function


Moments

They are statistical measures that describe the characteristics of a probability distribution, such as its central tendency, spread/variability, and asymmetry.
Note: We will discuss ‘raw’ and ‘central’ moments.

Important moments:

  1. Mean : Central tendency
  2. Variance : Spread/variability
  3. Skewness : Asymmetry
  4. Kurtosis : Tailedness
Moment Generating Function

It is a function that simplifies computation of moments, such as, the mean, variance, skewness, and kurtosis, by providing a compact way to derive any moment of a random variable through differentiation.

  • Provides an alternative to PDFs and CDFs of random variables.
  • A powerful property of the MGF is that its \(n^{th}\) derivative, evaluated at \(t=0\), equals the \(n^{th}\) moment of the random variable.
\[ \text{MGF of random variable X is the expected value of } e^{tX} \\ MGF = M_X(t) = E[e^{tX}] \]

Not all probability distributions have MGFs; the defining integral or sum may not converge for any \(t \neq 0\), in which case the MGF does not exist.

\[ \begin{aligned} e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!} + \cdots\\ e^{tX} = 1 + \frac{tx}{1!} + \frac{(tx)^2}{2!} + \frac{(tx)^3}{3!} + \cdots + \frac{(tx)^n}{n!} + \cdots\\ \end{aligned} \]

Since, \(MGF = M_X(t) = E[e^{tX}] \), we can write the MGF as:

\[ \begin{aligned} M_X(t) &= E[e^{tX}] \\ &= 1 + \frac{t}{1!}E[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \cdots + \frac{t^n}{n!}E[X^n] + \cdots \\ \end{aligned} \]

Say, \( E[X^n] = m_n \), where \(m_n\) is the \(n^{th}\) moment of the random variable, then:

\[ M_X(t) = 1 + \frac{t}{1!}m_1 + \frac{t^2}{2!}m_2 + \frac{t^3}{3!}m_3 + \cdots + \frac{t^n}{n!}m_n + \cdots \\ \]

Important: Differentiating \(M_X(t)\) with respect to \(t\), \(i\) times, and setting \((t =0)\) gives us the \(i^{th}\) moment of the random variable.

\[ \frac{dM_X(t) }{dt} = M'_X(t) = 0 + m_1 + \frac{2t}{2!}m_2 + \frac{3t^2}{3!}m_3 + \cdots + \frac{nt^{n-1}}{n!}m_n + \cdots \\ \]

Set \(t=0\), to get the first moment:

\[ \frac{dM_X(t)}{dt} \bigg|_{t=0} = M'_X(0) = m_1 \]

Similarly, second moment = \(M''_X(0) = m_2 = E[X^2]\)
And, \(n^{th}\) derivative at \(t=0\) = \(M^{(n)}_X(0) = m_n = E[X^n]\)

\[ E[X^n] = \frac{d^nM_X(t)}{dt^n} \bigg|_{t=0} \]

e.g: \( Var[X] = M''_X(0) - (M'_X(0))^2 \)

Note: If we know the MGF of a random variable, then we do NOT need to do integration or summation to find the moments.
Continuous:

\[ M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x) dx \]

Discrete:

\[ M_X(t) = E[e^{tX}] = \sum_{i} e^{tx_i} P(X=x_i) \]
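The derivative property can be checked numerically. A sketch for a fair six-sided die, using finite central differences to approximate \(M'_X(0)\) and \(M''_X(0)\):

```python
import math

# MGF of a fair die: M(t) = (1/6) * sum of e^{t k} for k = 1..6
def mgf(t):
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

h = 1e-4
# First moment: M'(0) ~ (M(h) - M(-h)) / 2h, should approach E[X] = 3.5
m1 = (mgf(h) - mgf(-h)) / (2 * h)
# Second moment: M''(0) ~ (M(h) - 2M(0) + M(-h)) / h^2, should approach E[X^2] = 91/6
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2

# Var[X] = M''(0) - (M'(0))^2 = 35/12 for a fair die
variance = m2 - m1 ** 2

print(m1, m2, variance)
```

No integration or summation over the PMF is needed once the MGF is known, which is exactly the point of the note above.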


Read more about Integration & Differentiation



End of Section

2.1.9 - Joint & Marginal

Joint, Marginal & Conditional Probability


Joint Probability Distribution

It describes the probability of 2 or more random variables occurring simultaneously.

  • The random variables can be of different types, e.g., one discrete and one continuous.

Joint CDF:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty \]

Discrete Case:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b) = \sum_{x_i \le a} \sum_{y_j \le b} P(X = x_i, Y = y_j) \]

Continuous Case:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x,y) dy dx \]

Joint PMF:

\[ P_{X,Y}(x,y) = P(X = x, Y = y) \]

Key Properties:

  1. \(P(X = x, Y = y) \ge 0 ~ \forall (x,y) \)
  2. \( \sum_{i} \sum_{j} P(X = x_i, Y = y_j) = 1 \)

Joint PDF:

\[ f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} \\ P((X,Y) \in A) = \iint_{A} f_{X,Y}(x,y) dy dx, ~ A \subseteq \mathbb{R}^2 \]

Key Properties:

  1. \(f_{X,Y}(x,y) \ge 0 ~ \forall (x,y) \)
  2. \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy dx = 1 \)
Example
  • If we consider 2 random variables, say, height(X) and weight(Y), then the joint distribution will tell us the probability of finding a person having a particular height and weight.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand joint probability.
images/maths/probability/joint_marginal_example_1.png

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.

|          | A = Red | A = Blue |
|----------|---------|----------|
| B = Red  | 2/5\*3/5 = 6/25 | 3/5\*3/5 = 9/25 |
| B = Blue | 2/5\*2/5 = 4/25 | 3/5\*2/5 = 6/25 |

Since, the draws are independent, joint probability = P(A) * P(B)
Each of the 4 cells in above table shows the probability of combination of results from 2 draws or joint probability.

Marginal Probability Distribution

It describes the probability distribution of an individual random variable in a joint distribution, without considering the outcomes of other random variables.

  • If we have the joint distribution, then we can get the marginal distribution of each random variable from it.
  • Marginal probability equals summing the joint probability across other random variables.

Marginal CDF:
We know that Joint CDF =

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty \]

Marginal CDF =

\[ F_X(a) = F_{X,Y}(a, \infty) = P(X \le a, Y < \infty) = P(X \le a) \]

Discrete Case:

\[ F_X(a) = P(X \le a, Y \le \infty) = \sum_{x_i \le a} \sum_{y_j \in \mathbb{R}} P(X = x_i, Y = y_j) \]

Continuous Case:

\[ F_X(a) = P(X \le a, Y \le \infty) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f_{X,Y}(x,y)dydx = \int_{-\infty}^{a} f_X(x)dx \]

Law of Total Probability
We know that Joint Probability Distribution =

\[ P_{X,Y}(x,y) = P(X = x, Y = y) \]

The events \((Y=y)\) partition the sample space, such that:

  1. \( (Y=y_i) \cap (Y=y_j) = \emptyset, ~ i \neq j \)
  2. \( (Y=y_1) \cup (Y=y_2) \cup ... \cup (Y=y_n) = \Omega \)

From Law of Total Probability, we get:

Marginal PMF:

\[ P_X(x) = P(X=x) = \sum_{y} P_{X,Y}(x,y) = \sum_{y} P(X = x, Y = y) \]

Marginal PDF:

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy \]


Setup: Roll a die + Toss a coin.
X: Roll a die ; \( \Omega = \{1,2,3,4,5,6\} \)
Y: Toss a coin ; \( \Omega = \{H,T\} \)

Joint PMF = \( P_{X,Y}(x,y) = P(X=x, Y=y) = 1/6*1/2 = 1/12\)
Marginal PMF of X = \( P_X(x) =\sum_{y \in \{H,T\}} P_{X,Y}(x,y) = 1/12+1/12 = 1/6\)
=> Marginally, X is uniform over 1-6 i.e a fair die.

Marginal PMF of Y = \( P_Y(y) = \sum_{1}^6 P_{X,Y}(x,y) = 6*(1/12) = 1/2 \)
=> Marginally, Y is uniform over H,T i.e a fair coin.

Setup: X and Y are two independent continuous uniform random variables.
\( X \sim U(0,1) \)
\( Y \sim U(0,1) \)

Marginal PDF = \(f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy \)
Joint PDF =

$$ f_{X,Y}(x,y) = \begin{cases} 1 & \text{if } x \in [0,1], y \in [0,1] \\ 0 & \text{otherwise } \end{cases} $$

Marginal PDF =

\[ \begin{aligned} f_X(x) &= \int_{0}^{1} f_{X,Y}(x,y) dy \\ &= \int_{0}^{1} 1 dy \\ &= 1 \\ f_X(x) &= \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{otherwise } \end{cases} \end{aligned} \]

Let’s re-visit the ball drawing example.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand marginal probability.

images/maths/probability/joint_marginal_example_1.png

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.

|                 | A = Red | A = Blue | P(B) (Marginal) |
|-----------------|---------|----------|-----------------|
| B = Red         | 2/5\*3/5 = 6/25 | 3/5\*3/5 = 9/25 | 6/25 + 9/25 = 15/25 = 3/5 |
| B = Blue        | 2/5\*2/5 = 4/25 | 3/5\*2/5 = 6/25 | 4/25 + 6/25 = 10/25 = 2/5 |
| P(A) (Marginal) | 6/25 + 4/25 = 10/25 = 2/5 | 9/25 + 6/25 = 15/25 = 3/5 | |

We can see from the table above - P(A=Red) is the sum of joint distribution over all possible values of B i.e Red & Blue.
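The two-bag example can be reproduced in code: build the joint table from the two independent draws, then marginalise by summing over the other variable. A sketch using exact fractions:

```python
from fractions import Fraction as F

# Per-bag probabilities for the two independent draws.
p_a = {"red": F(2, 5), "blue": F(3, 5)}   # bag 1
p_b = {"red": F(3, 5), "blue": F(2, 5)}   # bag 2

# Joint PMF P(A = a, B = b) = P(A = a) * P(B = b), since draws are independent.
joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

# Marginal of A: sum the joint over all values of B (law of total probability),
# and symmetrically for B.
marginal_a = {a: sum(joint[(a, b)] for b in p_b) for a in p_a}
marginal_b = {b: sum(joint[(a, b)] for a in p_a) for b in p_b}

print(marginal_a, marginal_b)
```

The recovered marginals match the per-bag probabilities, and the joint table sums to 1.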

Conditional Probability

It measures the probability of an event occurring given that another event has already happened.

  • It provides a way to update our belief about the likelihood based on new information.
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

P(A, B) = Joint Probability of A and B
P(B) = Marginal Probability of B

=> Conditional Probability = Joint Probability / Marginal Probability

Conditional CDF:

\[ F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) \\ \]

Discrete Case:

\[ F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) = \sum_{x_i \le x} P(X = x_i \mid Y = y) \]

Continuous Case:

\[ F_{X \mid Y}(x \mid y) = \int_{-\infty}^{x} f_{X \mid Y}(x \mid y) dx = \int_{-\infty}^{x} \frac {f_{X,Y}(x, y)}{f_Y(y)} dx \\ f_Y(y) > 0 \]

Conditional PMF:

\[ P(X = x \mid Y = y) = \frac{P(X = x, Y = y)} {P(Y = y)} \\ P(Y = y) > 0 \]

Conditional PDF:

\[ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \\ f_Y(y) > 0 \]
Application
  • Generative machine learning models, such as, GANs, learn the conditional distribution of pixels, given the style of input image.

Let’s re-visit the ball drawing example.
Note: We only have information about the joint and marginal probabilities.
What is the conditional probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw?
images/maths/probability/joint_marginal_example_1.png

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.
A = Red ball in first draw
B = Blue ball in second draw.

|                 | A = Red | A = Blue | P(B) (Marginal) |
|-----------------|---------|----------|-----------------|
| B = Red         | 6/25    | 9/25     | 3/5             |
| B = Blue        | 4/25    | 6/25     | 2/5             |
| P(A) (Marginal) | 2/5     | 3/5      |                 |
\[ \begin{aligned} P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{4/25}{2/5} \\ &= 2/5 \end{aligned} \]

Therefore, probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw = 2/5.
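The same answer follows mechanically from conditional = joint / marginal. A sketch with exact fractions, using the joint table above:

```python
from fractions import Fraction as F

# Joint probabilities P(A, B) from the two-bag table.
joint = {
    ("red", "red"): F(6, 25), ("blue", "red"): F(9, 25),
    ("red", "blue"): F(4, 25), ("blue", "blue"): F(6, 25),
}

# Marginal P(B = blue): sum the joint over all values of A.
p_b_blue = joint[("red", "blue")] + joint[("blue", "blue")]

# P(A = red | B = blue) = P(A = red, B = blue) / P(B = blue)
p_red_given_blue = joint[("red", "blue")] / p_b_blue

print(p_red_given_blue)
```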

Conditional Expectation

This gives us the conditional expectation of a random variable X, given another random variable Y=y.

Discrete Case:

\[ E[X \mid Y = y] = \sum_{x} x.P(X = x \mid Y = y) = \sum_{x} x.P_{X \mid Y}(x \mid y) \]

Continuous Case:

\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x.f_{X \mid Y}(x \mid y) dx \]
Example
  • Conditional expectation of a person’s weight, given his/her height = 165 cm, will give us the average weight of all people with height = 165 cm.

Applications:

  • Linear regression algorithm is conditional expectation of target variable ‘Y’, given input feature variable ‘X’.
  • Expectation Maximisation(EM) algorithm is built on conditional expectation.
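As a toy illustration of \(E[X \mid Y=y]\), consider a fair die with the conditioning event "the roll is even"; a sketch with exact fractions:

```python
from fractions import Fraction as F

# Fair die: P(X = x) = 1/6 for x in 1..6.
pmf = {x: F(1, 6) for x in range(1, 7)}

# Condition on the event "roll is even".
even = [x for x in pmf if x % 2 == 0]
p_even = sum(pmf[x] for x in even)             # marginal probability of the event
cond_pmf = {x: pmf[x] / p_even for x in even}  # P(X = x | even) = joint / marginal

# E[X | even] = sum of x * P(X = x | even)
e_x_given_even = sum(x * p for x, p in cond_pmf.items())

print(e_x_given_even)
```

Conditioning shifts the expectation from the unconditional 3.5 to the average of {2, 4, 6}.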
Conditional Variance

This gives us the variance of a random variable calculated after taking into account the value(s) of another related variable.

\[ \begin{aligned} Var[X \mid Y = y] &= E[(X - E[X \mid Y = y])^2 \mid Y=y] \\ \implies Var[X \mid Y = y] &= E[X^2 \mid Y=y] - (E[X \mid Y=y])^2 \\ \end{aligned} \]

For example:

  • Variance of a car’s mileage for city driving might be small, but the variance will be large for a mix of city and highway driving.

Note: Models that take into account the change in variance or heteroscedasticity tend to be more accurate.



End of Section

2.1.10 - Independent & Identically Distributed

Independent & Identically Distributed (I.I.D) Random Variables


I.I.D
There are 2 parts in I.I.D, “Independent” and “Identically Distributed”.
Let’s revisit and understand the independence of random variables first.
Independence of Random Variables

It means that knowing the outcome of one random variable does not impact the probability of the other random variable.
Two random variables X & Y are independent if:

\[ CDF = F_{X,Y}(x,y) = F_{X}(x)F_{Y}(y) \\ \text{ }\\ \text{Generalised for 'n' random variables:} \\ CDF = F_{X_1,X_2,...,X_n}(x_1,x_2,...,x_n) = \prod_{i=1}^{n}F_{X_i}(x_i) \\ \text{Discrete case: } \\ PMF = P_{X,Y}(x,y) = P_{X}(x)P_{Y}(y) \\ \text{ }\\ \text{Continuous case: } \\ PDF = f_{X,Y}(x,y) = f_{X}(x)f_{Y}(y) \\ \]


  • We know that if 2 random variables X,Y are independent, then their covariance is zero, since there is no linear relationship between them.
  • But the converse may NOT be true, i.e, if the covariance is zero, then we canNOT say for sure that the random variables are independent.

Read more about Covariance

\[ \text{For independent events: }\\ Cov(X,Y) = E[XY] - E[X]E[Y] = 0 \\ => E[XY] = E[X]E[Y] \\ \text{ }\\ \text{We know that: }\\ Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) \\ \text{ }\\ \text{But, for independent events Cov(X,Y)=0}, so: \\ Var(X+Y) = Var(X) + Var(Y) \\ \]
Example

Let’s go through few examples to understand the independence of random variables.
For example:

  1. Toss a coin + Roll a die.
    \[ X = \begin{cases} 1 & \text{if Heads} \\ \\ 0 & \text{if Tails} \end{cases} \\ \text{} \\ Y = \{1,2,3,4,5,6\} \\ \text{} \\ \text{ Joint probability is all possible combinations of X and Y i.e 2x6 = 12 } \\[10pt] \text{ Sample space: } \Omega = \{ (H,1), (H,2), (H,3), (H,4), (H,5), (H,6), \\ (T,1), (T,2), (T,3), (T,4), (T,5), (T,6) \} \]

Here, clearly, X and Y are independent.

  2. Toss a coin twice.
    \(X\) = Number of heads = \(\{0,1,2\}\)
    \(Y\) = Number of tails = \(\{0,1,2\}\)

X, Y are NOT independent, because \(Y = 2 - X\); if we know one, then we automatically know the other.

  3. \(X\) is a continuous uniform random variable \( X \sim U(-1,1) \).
    \(Y = 2X\).
\[ f_X(x) = \begin{cases} \frac{1}{b-a} = \frac{1}{1-(-1)} = \frac{1}{2} & \text{if } x \in [a,b] \\ \\ 0 & \text{otherwise} \end{cases} \\ \text{}\\ E[X] = \text{ mean } = \frac{a+b}{2} = \frac{-1+1}{2} = 0 \\ \]


Let’s check for independence of \(X\) and \(Y\) i.e \(E[XY] = E[X]E[Y]\) or not?

\[ \text{We know that: } E[X] = 0\\ E[Y] = E[2X] = 2E[X] = 0 \\ \tag{1} E[X]E[Y] = 0 \]

Now, lets calculate the value of \(E[XY]\):

\[ \begin{aligned} E[XY] &= E[X.2X] = 2.E[X^2] \\ &= 2*\int_{-1}^{1} x^2 f_X(x) dx = 2*\int_{-1}^{1} \frac{x^2}{2} dx \\ &= \int_{-1}^{1} x^2 dx = \frac{x^3}{3} \bigg|_{-1}^1 = \frac{1^3-(-1)^3}{3} \\ \tag{2} E[XY] &= \frac{2}{3} \end{aligned} \]


From (1) and (2), we can say that:

\[ E[XY] \neq E[X]E[Y] \\ \]

Hence, \(X\) and \(Y\) are NOT independent.

Read more about Integration
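The conclusion \(E[XY] \neq E[X]E[Y]\) for \(X \sim U(-1,1)\), \(Y = 2X\) can also be sanity-checked by simulation; a sketch using only the standard library:

```python
import random

random.seed(1)
n = 200_000

# Sample X ~ U(-1, 1); Y = 2X is fully determined by X.
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [2 * x for x in xs]

e_x = sum(xs) / n
e_y = sum(ys) / n
e_xy = sum(x * y for x, y in zip(xs, ys)) / n  # should be near 2/3

print(e_xy, e_x * e_y)
```

Empirically \(E[XY]\) lands near 2/3 while \(E[X]E[Y]\) lands near 0, confirming the dependence.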

Identically Distributed
Random variable X is said to be identically distributed if each sample comes from the same probability distribution, such as, Gaussian, Bernoulli, Uniform, etc with the same properties i.e mean, variance, etc are same.
Similarly, random variables X & Y are said to be identically distributed if they belong to the same probability distribution.

Independent & Identically Distributed(I.I.D)

I.I.D assumption for samples(data points) in a dataset means that the samples are:

  • Independent, i.e, each sample is independent of the other.
  • Identically distributed, i.e, each sample is drawn from the same probability distribution.
Example
  • We take the heights of a random sample of people to estimate the average height of the population of a city.
    • Here ‘independent’ assumption means that the height of each person in the sample is independent of the other person.
      Usually, heights of members of the same family may be highly correlated.
      However, for practical purposes, we can assume that all the heights are independent of one another.
    • And, for ‘identically distributed’ - we can assume that all the heights are from the same Gaussian distribution with some mean and variance.



End of Section

2.1.11 - Convergence

Convergence of Random Variables


Convergence in Probability

A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to converge in probability to a known random variable \(X\),
if for every number \(\epsilon >0 \), the following is true:

\[ \lim_{n\rightarrow\infty} P(|X_n - X| > \epsilon) = 0, \forall \epsilon >0 \]

where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.
\(\epsilon\): is the tolerance level or margin of error.

Read more about Limits

Example
  • Toss a fair coin:
    Estimator:
    \[ X_n = \begin{cases} \frac{n}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]
    Known random variable (Bernoulli):
    \[ X = \begin{cases} 1 & \text{, if Head } \\ \\ 0 & \text{, if Tail} \\ \end{cases} \]
\[ X_n - X = \begin{cases} \frac{n}{n+1} - 1 = \frac{-1}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} - 0 = \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]\[ |X_n - X| = \begin{cases} \frac{1}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]

Say, tolerance level \(\epsilon = 0.1\).
Then,

\[ \lim_{n\rightarrow\infty} P(|X_n - X| > \epsilon) = ? \]

If n=5;

\[ |X_n - X| = \begin{cases} \frac{1}{n+1} = \frac{1}{6} \approx 0.16 & \text{, if Head } \\ \\ \frac{1}{n} = \frac{1}{5} = 0.2 & \text{, if Tail} \\ \end{cases} \]

So, if n=5, then \(|X_n - X| > (\epsilon = 0.1)\).
\( \implies P(|X_n - X| \ge (\epsilon=0.1)) = 1\).

if n=20;

\[ |X_n - X| = \begin{cases} \frac{1}{n+1} = \frac{1}{21} \approx 0.04 & \text{, if Head } \\ \\ \frac{1}{n} = \frac{1}{20} = 0.05 & \text{, if Tail} \\ \end{cases} \]

So, if n=20, then \(|X_n - X| < (\epsilon=0.1)\).
\(\implies P(|X_n - X| \ge (\epsilon=0.1)) = 0\).
\( \implies P(|X_n - X| \ge (\epsilon=0.1)) = 0 ~\forall ~n > 10\)

Therefore,

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge (\epsilon=0.1)) = 0 \]

Similarly, we can prove that if \(\epsilon = 0.01\), then the probability will be equal to 0 for \( n > 100 \).

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge (\epsilon=0.01)) = 0 \]

Note: The task is to check whether the sequence of random variables \(X_1, X_2, \dots, X_n\) converges in probability to a known random variable \(X\) as \(n\rightarrow\infty\).

So, we can conclude that, if \(n > \frac{1}{\epsilon}\), then:

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge \epsilon) = 0, \forall ~ \epsilon >0 \\[10pt] \text{Converges in Probability } \\[10pt] \implies X_n \xrightarrow{Probability} X \]
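The worked coin example above can be sketched in code: for any tolerance \(\epsilon\), both branches of \(|X_n - X|\) fall below \(\epsilon\) once \(n > 1/\epsilon\):

```python
# From the example: |X_n - X| is 1/(n+1) on Heads and 1/n on Tails.
def max_deviation(n):
    # Worst case over both coin outcomes.
    return max(1 / (n + 1), 1 / n)

for eps in (0.1, 0.01, 0.001):
    n = int(1 / eps) + 1  # smallest integer n with n > 1/eps
    # Past this n the event {|X_n - X| >= eps} is impossible,
    # so its probability is 0 for all larger n as well.
    assert max_deviation(n) < eps

print("deviation below eps for all n > 1/eps")
```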
Almost Sure Convergence

A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to almost surely converge to a known random variable \(X\),
for \(n \ge 1\), if the following is true:

\[ P(\lim_{n\rightarrow\infty} X_n = X) = 1 \\[10pt] \text{Almost Sure or With Probability = 1 } \\[10pt] \implies X_n \xrightarrow{Almost ~ Sure} X \]

where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.

If, \(X_n \xrightarrow{Almost ~ Sure} X \), \( \implies X_n \xrightarrow{Probability} X \)
But, converse is NOT true.

Note: Almost Sure convergence is the hardest to satisfy amongst all modes of convergence, such as, convergence in probability, convergence in distribution, etc.

Read more about Limits

Example
  • \(X\) is random variable such that \(X = \frac{1}{2} \), a constant, i.e \(X_1 = X_2 = \dots = X_n = \frac{1}{2}\).
    \(Y_1, Y_2,\dots ,Y_n \) are another sequence of random variables, such that :
    \[ Y_1 = X_1 \\[10pt] Y_2 = \frac{X_1 + X_2}{2} \\[10pt] Y_3 = \frac{X_1 + X_2 + X_3}{3} \\[10pt] \dots \\ Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} \frac{1}{2} \]

End of Section

2.1.12 - Law of Large Numbers

Law of Large Numbers


Weak Law of Large Numbers (WLLN)

This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges in probability to the expected value \(E[X]\) or population mean (\( \mu \)).

\[ \lim_{n\rightarrow\infty} P(|\bar{X_n} - E[X]| \ge \epsilon) = 0, \forall ~ \epsilon >0 \\[10pt] \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Probability} E[X], \text{ as } n \rightarrow \infty \]


Note: Does NOT guarantee that sample mean will be close to population mean,
but instead says that - the probability of sample mean being far away from the population mean is low.


Read more about Limits

Example
  • Toss a coin large number of times \('n'\), as \(n \rightarrow \infty\), the proportion of heads will probably be very close to \(0.5\).
    However, it does NOT rule out the possibility of a rare sequence, e.g., getting 10 consecutive heads.
    But, the probability of such a rare event is extremely low.
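A quick simulation of the coin example; a sketch with a fixed seed for reproducibility:

```python
import random

random.seed(42)

# Proportion of heads after n fair coin tosses.
def sample_mean(n):
    return sum(random.random() < 0.5 for _ in range(n)) / n

# As n grows, the sample proportion concentrates around 0.5.
means = {n: sample_mean(n) for n in (100, 10_000, 1_000_000)}
print(means)
```

Any single run can still wander (the law is about probabilities, not guarantees), but for large n the proportion sits very close to 0.5.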
Strong Law of Large Numbers (SLLN)

This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges almost surely to the expected value \(E[X]\) or population mean (\( \mu \)).

\[ P(\lim_{n\rightarrow\infty} \bar{X_n} = E[X]) = 1, \text{ as } n \rightarrow \infty \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} E[X], \text{ as } n \rightarrow \infty \]


Note:

  • It guarantees that the sequence of sample averages itself converges to population mean, with exception of set of outcomes that has probability = 0.
  • Almost certain guarantee; Much stronger statement than Weak Law of Large Numbers.

Read more about Limits

Example
  • Toss a coin large number of times \('n'\), as \(n \rightarrow \infty\), the proportion of heads will converge to \(0.5\), with probability = 1.
    This means that a sequence where the proportion of heads never settles down to 0.5, is a probability = 0 event.
Application
  • Almost sure convergence ensures ML model’s reliability by guaranteeing that the average error on a large dataset will converge to the true error.
    Thus, providing confidence that model will perform consistently and accurately on unseen data.



End of Section

2.1.13 - Markov's Inequality

Markov’s, Chebyshev’s Inequality & Chernoff Bound


Markov's Inequality

It gives an upper bound on the probability that a non-negative random variable exceeds a given value, based on its expected value.

\[ P(X \ge a) \le \frac{E[X]}{a} \]

Note: It gives a very loose upper bound.

Read more about Expectation

A restaurant, on an average, expects to serve 50 customers per hour.
What is the probability that the restaurant will serve more than 200 customers in the next hour?
\[ P(X \ge 200) \le \frac{E[X]}{200} = \frac{50}{200} = 0.25 \]

Hence, there is at most a 25% chance of serving 200 or more customers.

Consider a test where the average score is 70/100 marks.
What is the probability that a randomly selected student gets a score of 90 marks or more?

\[ P(X \ge a) \le \frac{E[X]}{a} \\[10pt] P(X \ge 90) \le \frac{70}{90} \approx 0.78 \]

Hence, there is at most a 78% chance that a randomly selected student gets a score of 90 marks or more.


Chebyshev's Inequality

It states that the probability of a random variable deviating from its mean is small if its variance is small.

  • It is a more powerful version of Markov’s Inequality.
  • It uses variance of the distribution in addition to the expected value or mean.
  • Also, it does NOT assume the random variable to be non-negative.
  • It uses more information about the data i.e mean and variance.
\[ P(|X - E[X]| \ge k) \le \frac{E[(X - E[X])^2]}{k^2} \text{ ; k > 0 } \\[10pt] \text{ We know that: } Var[X] = E[(X - E[X])^2] \\[10pt] => P(\big|X - E[X]\big| \ge k) \le \frac{Var[X]}{k^2} \]

Note: It gives a tighter upper bound than Markov’s Inequality.

Consider a test where the average score is 70/100 marks.
What is the probability that a randomly selected student gets a score of 90 marks or more?
Given the standard deviation of the test score is 10 marks.

Given, the standard deviation of the test score \(\sigma\) = 10 marks.
=> Variance = \(\sigma^2\) = 100

\[ P(\big|X - E[X]\big| \ge k) \le \frac{Var[X]}{k^2} \\[10pt] E[X] = 70, Var[X] = 100 \\[10pt] P(X \ge 90) \le P(\big|X - 70\big| \ge 20) \\[10pt] P(\big|X - 70\big| \ge 20) \le \frac{100}{20^2} = \frac{1}{4} = 0.25 \]

Hence, Chebyshev’s Inequality gives a far tighter upper bound of 25% than Markov’s Inequality of 78%(approx).
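The two bounds from the exam example, computed side by side; a sketch:

```python
# Exam example: mean score 70, standard deviation 10; bound P(X >= 90).
mean, std, a = 70.0, 10.0, 90.0

# Markov: P(X >= a) <= E[X] / a  (valid since scores are non-negative)
markov_bound = mean / a

# Chebyshev: P(|X - mean| >= k) <= Var[X] / k^2, with k = a - mean
k = a - mean
chebyshev_bound = std ** 2 / k ** 2

print(markov_bound, chebyshev_bound)
```

Using the variance shrinks the bound from about 0.78 to 0.25.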

Chernoff Bound

It is an upper bound on the probability that a random variable deviates from its expected value.
It’s an exponentially decreasing bound on the “tail” of a random variable’s distribution, which can be calculated using its moment generating function.

  • It is used for sum or average of independent random variables (not necessarily identically distributed).
  • It provides exponentially tighter bounds, better than Chebyshev’s Inequality’s quadratic decay.
  • It uses all moments to capture the full shape of the distribution, using the moment generating function(MGF).
\[ P(X \ge c) \le e^{-tc}E[e^{tX}] , \forall t>0\\[10pt] \text{ where } E[e^{tX}] \text{ is the Moment Generating Function of } X \]


Proof:

\[ P(X \ge c) = P(e^{tX} \ge e^{tc}), \text{ provided } t>0 \\[10pt] \text{ using Markov's Inequality: } \\[10pt] P(e^{tX} \ge e^{tc}) \le \frac{E[e^{tX}]}{e^{tc}} =e^{-tc}E[e^{tX}] \\[10pt] \]


For the sum of ’n’ independent and identically distributed random variables,

\[ P(S_n \ge c) \le e^{-tc}(M_x(t))^n \\[10pt] \text{ where } M_x(t) \text{ is the Moment Generating Function of } X \\[10pt] \]

Note: Used to compute how far the sum of independent random variables deviate from their expected value.

Read more about Moment Generating Function



End of Section

2.1.14 - Cross Entropy & KL Divergence

Cross Entropy & KL Divergence


Surprise Factor

It is a measure of the amount of information gained when a specific, individual event occurs, and is defined based on the probability of that event.
It is defined as the negative logarithm of the probability of the event.

  • It is also called ‘Surprisal’.
\[ S(x) = -log(P(x)) \]

Note: Logarithm(for base > 1) is a monotonically increasing function, so as x increases, log(x) also increases.

  • So, if probability P(x) increases, then surprise factor S(x) decreases.
  • Common events have a high probability of occurrence, hence a low surprise factor.
  • Rare events have a low probability of occurrence, hence a high surprise factor.

Units:

  • The unit of surprise factor with log base 2 is bits.
  • with base ’e’ or natural log its nats (natural units of information).
Entropy

It conveys how much ‘information’ we expect to gain from a random event.

  • Entropy is the average or expected value of surprise factor of a random variable.
  • More uniform the distribution ⇒ greater average surprise factor, since all possible outcomes are equally likely.
\[ H(X) = E[-log(P(x))] = -\sum_{x \in X} P(x)log(P(x)) \]

Read more about Expectation

  • Case 1:
    Toss a fair coin; \(P(H) = P(T) = 0.5 = 2^{-1}\)
    Surprise factor (Heads or Tails) = \(-log(P(x)) = -log_2(0.5) = -log_2(2^{-1}) = 1~bit\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.5*1 + 0.5*1 = 1 ~bit \)

  • Case 2:
    Toss a biased coin; \(P(H) = 0.9, ~P(T) = 0.1 \)
    Surprise factor (Heads) = \(-log(P(x)) = -log_2(0.9) \approx 0.15 ~bits \)
    Surprise factor (Tails) = \(-log(P(x)) = -log_2(0.1) \approx 3.32 ~bits\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.9*0.15 + 0.1*3.32 \approx 0.47 ~bits\)

Therefore, a biased coin is less surprising on an average than a fair coin, hence lower entropy.
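Both cases can be computed directly from the definition; a sketch:

```python
import math

def entropy(probs):
    # H = -sum of p * log2(p); zero-probability outcomes contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])    # fair coin: 1 bit
h_biased = entropy([0.9, 0.1])  # biased coin: about 0.47 bits

print(h_fair, h_biased)
```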

Cross Entropy

It is a measure of the average ‘information gain’ or ‘surprise’ when using an imperfect model \(Q\) to encode events from a true model \(P\).

  • It measures how surprised we are on an average, if the true distribution is \(P\), but we predict using another distribution \(Q\).
\[ H(P,Q) = E[-log(Q(x))] = -\sum_{i=1}^n P(x_i)log(Q(x_i)) \\[10pt] \text{ where true distribution of X} \sim P \text{ but the predictions are made using another distribution } Q \]

A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the cross-entropy for the outputs of 2 models A & B.

  • Model A:
    \(Q_A = [0.8 ~(cat), 0.2 ~(dog)]\)
    Cross Entropy = \(H(P, Q_A) = -\sum_{i=1}^n P(x_i)log(Q_A(x_i)) \)
    \(= -[1*log_2(0.8) + 0*log_2(0.2)] \approx 0.32 ~bits\)
    Note: This is very low cross-entropy, since the predicted value is very close to actual, i.e low surprise.

  • Model B:
    \(Q_B = [0.2 ~(cat), 0.8 ~(dog)]\)
    Cross Entropy = \(H(P, Q_B) = -\sum_{i=1}^n P(x_i)log(Q_B(x_i)) \)
    \(= -[1*log_2(0.2) + 0*log_2(0.8)] \approx 2.32 ~bits\)
    Note: Here the cross-entropy is very high, since the predicted value is quite far from the actual truth, i.e high surprise.

Kullback Leibler (KL) Divergence

It measures the information lost when one probability distribution \(Q\) is used to approximate another distribution \(P\).
It quantifies the ‘extra cost’ in bits needed to encode data using the approximate distribution \(Q\) instead of the true distribution \(P\).

KL Divergence = Cross Entropy(P,Q) - Entropy(P)

\[ \begin{aligned} D_{KL}(P \parallel Q) &= H(P, Q) - H(P) \\ &= -\sum_{i=1}^n P(x_i)log(Q(x_i)) - [-\sum_{i=1}^n P(x_i)log(P(x_i))] \\[10pt] &= \sum_{i=1}^n P(x_i)[log(P(x_i)) - log(Q(x_i)) ] \\[10pt] D_{KL}(P \parallel Q) &= \sum_{i=1}^n P(x_i)log(\frac{P(x_i)}{Q(x_i)}) \\[10pt] \text{ For continuous case: } \\ D_{KL}(P \parallel Q) &= \int_{-\infty}^{\infty} p(x)log(\frac{p(x)}{q(x)})dx \end{aligned} \]

Note:

  • If \(P = Q\) ,i.e, P and Q are the same distributions, then KL Divergence = 0.
  • KL divergence is NOT symmetrical, i.e, \(D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)\).

Using the same cat, dog classification problem example as mentioned above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the KL divergence for the outputs of the 2 models A & B.
Let’s calculate entropy first, and we can reuse the cross-entropy values calculated above already.
Entropy: \(H(P) = -\sum_{x \in X} P(x)log(P(x)) = -[1*log_2(1) + 0*log_2(0)] = 0 ~bits\)
Note: \(0*log_2(0) = 0\) by convention, since \(x~log(x) \rightarrow 0\) as \(x \rightarrow 0^+\), even though \(log(0)\) itself is undefined.

  • Model A:
    \( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 0.32 - 0 = 0.32 ~bits \)
    Model A incurs an additional 0.32 bits of surprise due to its imperfect prediction.

  • Model B:
    \( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 2.32 - 0 = 2.32 ~bits \)
    Model B has much more ‘information loss’ or incurs a higher ‘penalty’ of 2.32 bits, as its prediction was far from the truth.
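The cross-entropy and KL numbers for both models follow directly from the formulas above; a sketch (terms with \(P(x)=0\) are skipped, using the \(0 \cdot log(0)=0\) convention):

```python
import math

def cross_entropy(p, q):
    # H(P, Q) = -sum P(x) * log2(Q(x)); terms with P(x) = 0 contribute 0.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = H(P, Q) - H(P)
    entropy_p = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return cross_entropy(p, q) - entropy_p

p = [1.0, 0.0]    # true label: cat
q_a = [0.8, 0.2]  # Model A's prediction
q_b = [0.2, 0.8]  # Model B's prediction

print(cross_entropy(p, q_a), kl_divergence(p, q_a))
print(cross_entropy(p, q_b), kl_divergence(p, q_b))
```

Since \(H(P) = 0\) for a one-hot true distribution, cross-entropy and KL divergence coincide here.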

Jensen-Shannon Divergence

It is a smoothed and symmetric version of the Kullback-Leibler (KL) divergence and is calculated by averaging the
KL divergences between each of the two distributions and their combined average distribution.

  • Symmetrical and smoothed version of KL divergence.
  • Always finite; KL divergence can be infinite if \( P(x) \neq 0 ~and~ Q(x) = 0 \).
    • Makes JS divergence more stable for ML models where some predicted probabilities may be exactly 0.
\[ D_{JS}(P \parallel Q) = \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\ \text{ where: } M = \frac{P + Q}{2} \]

Let’s continue the cat and dog image classification example discussed above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\), and Model A’s prediction \(Q = [0.8 ~(cat), 0.2 ~(dog)]\).

Step 1: Calculate the average distribution M.

\[ M = \frac{P + Q}{2} = \frac{1}{2} [[1.0, 0.0] + [0.8, 0.2]] \\[10pt] => M = [0.9, 0.1] \]

Step 2: Calculate \(D_{KL}(P \parallel M)\).

\[ \begin{aligned} D_{KL}(P \parallel M) &= \sum_{i=1}^n P(x_i)log_2(\frac{P(x_i)}{M(x_i)}) \\[10pt] &= 1*log_2(\frac{1}{0.9}) + 0*log_2(\frac{0}{0.1}) \\[10pt] & = log_2(1.111) + 0 \\[10pt] => D_{KL}(P \parallel M) &\approx 0.152 ~bits \\[10pt] \end{aligned} \]

Step 3: Calculate \(D_{KL}(Q \parallel M)\).

\[ \begin{aligned} D_{KL}(Q \parallel M) &= \sum_{i=1}^n Q(x_i)log_2(\frac{Q(x_i)}{M(x_i)}) \\[10pt] &= 0.8*log_2(\frac{0.8}{0.9}) + 0.2*log_2(\frac{0.2}{0.1}) \\[10pt] &= 0.8*log_2(0.888) + 0.2*log_2(2) \\[10pt] &= 0.8*(-0.17) + 0.2*1 \\[10pt] &= -0.136 + 0.2 \\[10pt] => D_{KL}(Q \parallel M) &\approx 0.064 ~bits \\[10pt] \end{aligned} \]

Step 4: Finally, lets put all together to calculate \(D_{JS}(P \parallel Q)\).

\[ \begin{aligned} D_{JS}(P \parallel Q) &= \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\[10pt] &= \frac{1}{2}[0.152 + 0.064] \\[10pt] &= \frac{1}{2}*0.216 \\[10pt] => D_{JS}(P \parallel Q) &= 0.108 ~ bits \\[10pt] \end{aligned} \]

Therefore, a lower JS divergence value => P and Q are more similar.
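The four steps above can be sketched in Python, reusing the same \(P\) and \(Q\):

```python
import math

def kl(p, q):
    # D_KL(P || Q) in bits; p_i = 0 terms contribute 0 by convention
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # average each distribution's KL divergence to the mixture M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]  # true label: cat
q = [0.8, 0.2]  # model's prediction

print(round(js_divergence(p, q), 3))  # 0.108 bits
```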



End of Section

2.1.15 - Parametric Model Estimation

Parametric Model Estimation


Parametric Model Estimation

It is the process of finding the best-fitting finite set of parameters \(\Theta\) for a model that assumes a specific probability distribution for the data.
It involves using the dataset to estimate the parameters (like the mean and standard deviation of a normal distribution) that define the model.

\(P(X \mid \Theta) \) : Probability of seeing data \(X: (X_1, X_2, \dots, X_n) \), given the parameters \(\Theta\) of the underlying probability distribution from which the data is assumed to be generated.

Goal of estimation:
We observed data, \( D = \{X_1, X_2, \dots, X_n\} \), and we want to infer the unknown parameters \(\Theta\) of the underlying probability distribution, assuming that the data is generated I.I.D.
Read more about I.I.D

Note: Most of the time, from experience, we know the underlying probability distribution of data, such as, Bernoulli, Gaussian, etc.

2 Approaches

There are 2 philosophical approaches to estimate the parameters of a parametric model:

  1. Frequentist:
  • Parameters \(\Theta\) are fixed but unknown; only the data is random.
  • It views probability as the long-run frequency of events in repeated trials, e.g., tossing a coin \(n\) times.
  • It is favoured when the sample size is large.
  • For example, Maximum Likelihood Estimation (MLE), Method of Moments, etc.
  2. Bayesian:
  • The parameter \(\Theta\) itself is unknown, so we model it as a random variable with a probability distribution.
  • It views probability as a degree of belief that can be updated with new evidence, i.e., data.
    Thus, it integrates prior knowledge with data to express the uncertainty about the parameters.
  • It is favoured when the sample size is small, as it also uses prior belief about the data distribution.
  • For example, Maximum A Posteriori Estimation (MAP), Minimum Mean Square Error Estimation (MMSE), etc.
Maximum Likelihood Estimation

It is the most popular frequentist approach to estimate the parameters of a model.
This method helps us find the parameters \(\Theta\) that make the data most probable.

Likelihood Function:
Say we have data \(D = \{X_1, X_2, \dots, X_n\}\), which are I.I.D. discrete random variables with PMF \(P_{\Theta}(\cdot)\).
Then, the likelihood function is the probability of observing the data, \(D = \{X_1, X_2, \dots, X_n\}\), given the parameters \(\Theta\).

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= P_{\Theta}(X_1, X_2, \dots, X_n) \\ &= \prod_{i=1}^{n} P_{\Theta}(X_i) \text{ ~ since I.I.D } \\ \text{ for the continuous case: } &= \prod_{i=1}^{n} f_{\Theta}(X_i) \end{aligned} \]

Task:
Find the value of parameter \(\Theta_{ML}\) that maximizes the likelihood function.

\[ \begin{aligned} \Theta_{ML}(X_1, X_2, \dots, X_n) &= \underset{\Theta}{\mathrm{argmax}}\ \mathcal{L_{X_1, X_2, \dots ,X_n}}(\Theta) \\ &= \underset{\Theta}{\mathrm{argmax}}\ \prod_{i=1}^{n} P_{\Theta}(X_i) \end{aligned} \]

In order to find the parameter \(\Theta_{ML}\) that maximises the likelihood function,
we need to take the first derivative of the likelihood function with respect to \(\Theta\) and equate it to zero.
But taking the derivative of a product is cumbersome, so we take the logarithm on both sides.
Note: Log is a monotonically increasing function, i.e., as x increases, log(x) increases too, so maximizing the log-likelihood is equivalent to maximizing the likelihood.

Let us denote the log-likelihood function as \(\bar{L}\).

\[ \begin{aligned} \mathcal{\bar{L}_{X_1, X_2, \dots ,X_n}}(\Theta) &= log [\prod_{i=1}^{n} P_{\Theta}(X_i)] \\ &= \sum_{i=1}^{n} log P_{\Theta}(X_i) \\ \end{aligned} \]

Therefore, Maximum Likelihood Estimation is the parameter \(\Theta_{ML}\) that maximises the log-likelihood function.

\[ \Theta_{ML}(X_1, X_2, \dots, X_n) = \underset{\Theta}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots ,X_n}(\Theta) \]

Given \(X_1, X_2, \dots, X_n\) are I.I.D. Bernoulli random variable with PMF as below:

\[ P(X_i = 1) = \theta \\ P(X_i = 0) = 1 - \theta \]

Estimate the parameter \(\theta\) using Maximum Likelihood Estimation.

Let, \(n_1 = \) number of 1’s in the dataset.

Likelihood Function:

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= \prod_{i=1}^{n} P_{\Theta}(X_i) \\ &= \theta^{n_1} (1 - \theta)^{n - n_1} \\ \end{aligned} \]

Log-Likelihood Function:

\[ \begin{aligned} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= log [\theta^{n_1} (1 - \theta)^{n - n_1}] \\ &= n_1 log \theta + (n - n_1) log (1 - \theta) \\ \end{aligned} \]

Maximum Likelihood Estimation:

\[ \begin{aligned} \Theta_{ML} &= \underset{\Theta}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ &= \underset{\Theta}{\mathrm{argmax}}\ n_1 log \theta + (n - n_1) log (1 - \theta) \\ \end{aligned} \]

In order to find the parameter \(\theta_{ML}\), we need to take the first derivative of the log-likelihood function with respect to \(\theta\) and equate it to zero.
Read more about Derivative

\[ \begin{aligned} \Theta_{ML} &= \underset{\Theta}{\mathrm{argmax}}\ n_1 log \theta + (n - n_1) log (1 - \theta) \\ =>& \frac{d}{d\theta} (n_1 log \theta + (n - n_1) log (1 - \theta)) = 0\\ =>& \frac{n_1}{\theta} + \frac{(n - n_1)}{1 - \theta}*(-1) = 0 \\ =>& \frac{n_1}{\theta} = \frac{(n - n_1)}{1 - \theta} \\ =>& n_1 - n_1\theta = n\theta - n_1\theta\\ =>& n_1 = n\theta \\ =>& \theta = \frac{n_1}{n} \text{ ~ i.e proportion of 1's} \\ \end{aligned} \]

Say, e.g., we have 10 observations for the Bernoulli random variable as: 1,0,1,1,0,1,1,0,1,0.
Then, the parameter \(\Theta_{ML} = \frac{6}{10} = 0.6\) i.e proportion of 1’s.
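As a sanity check, a brute-force grid search over \(\theta\) recovers the same closed-form answer \(n_1/n\):

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # the 10 observations above

def log_likelihood(theta, xs):
    # n1*log(theta) + (n - n1)*log(1 - theta), as derived above
    n1, n = sum(xs), len(xs)
    return n1 * math.log(theta) + (n - n1) * math.log(1 - theta)

# evaluate the log-likelihood on a fine grid and pick the maximizer
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_ml)  # 0.6, i.e. the proportion of 1's
```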

Given \(X_1, X_2, \dots, X_n\) are I.I.D. Gaussian \( \sim N(\mu, \sigma^2) \)
\(x_1, x_2, \dots, x_n\) are the realisations/observations of the random variable.

Estimate the parameters \(\mu\) and \(\sigma\) of the Gaussian distribution using Maximum Likelihood Estimation.

Likelihood Function:

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= \prod_{i=1}^{n} f_{\mu, \sigma}(X_i) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \\ &= (\frac{1}{\sqrt{2\pi\sigma^2}})^n \prod_{i=1}^{n} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \\ \end{aligned} \]

Log-Likelihood Function:

\[ \begin{aligned} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= log [(\frac{1}{\sqrt{2\pi\sigma^2}})^n \prod_{i=1}^{n} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}] \\ &= log (2\pi\ \sigma^2)^{\frac{-n}{2}} - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} \\ => \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= -\frac{n}{2} log (2\pi) -nlog(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\ \end{aligned} \]

Note: Here, the first term \( -\frac{n}{2} log (2\pi) \) is a constant wrt both \(\mu\) and \(\sigma\), so we can ignore the term.

Maximum Likelihood Estimation:

\[ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ \]

Instead of finding \(\mu\) and \(\sigma\) that maximises the log-likelihood function,
we can find \(\mu\) and \(\sigma\) that minimises the negative of the log-likelihood function.

\[ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmin}}\ -\bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmin}}\ nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]

Now, lets differentiate the log likelihood function wrt \(\mu\) and \(\sigma\) separately to get \(\mu_{ML}, \sigma_{ML}\).

Lets, calculate \(\mu_{ML}\) first by taking the derivative of the log-likelihood function wrt \(\mu\) and equating it to 0.
Read more about Derivative

\[ \begin{aligned} &\frac{d}{d\mu} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) = \frac{d}{d\mu} [nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2] = 0 \\ &=> 0 + \frac{2}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)*(-1) = 0\\ &=> \sum_{i=1}^{n} x_i - n\mu = 0 \\ &=> n\mu = \sum_{i=1}^{n} x_i \\ &=> \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i \\ \end{aligned} \]

Similarly, we can calculate \(\sigma_{ML}\) by taking the derivative of the log-likelihood function wrt \(\sigma\) and equating it to 0.

\[ \begin{aligned} \frac{d}{d\sigma} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= \frac{d}{d\sigma} [nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2] = 0 \\ => \sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})^2 \end{aligned} \]

Note: In general, MLE can be biased; here the variance estimate divides by \(n\) instead of \((n-1)\), so it does NOT give an unbiased estimate.
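A quick simulation (hypothetical setup, with \(\sigma^2 = 1\) and samples of size 5) illustrates the bias: the divide-by-\(n\) variance estimate averages to about \(\frac{n-1}{n}\sigma^2 = 0.8\) rather than 1.

```python
import random

random.seed(0)
mu, sigma, n, trials = 0.0, 1.0, 5, 20000

mle_vars = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(xs) / n                                     # mu_ML: the sample mean
    mle_vars.append(sum((x - m) ** 2 for x in xs) / n)  # sigma^2_ML: divide by n

print(round(sum(mle_vars) / trials, 1))  # ~0.8, i.e. (n-1)/n * sigma^2
```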

Bayesian Statistics

Bayesian statistics model parameters by updating initial beliefs (prior probabilities) with observed data to form a final belief (posterior probability) using Bayes’ Theorem.
Instead of a single point estimate, it provides a probability distribution over possible parameter values, which allows to quantify uncertainty and yields more robust models, especially with limited data.

Bayes’ Theorem:

\[ P(\Theta \mid X) = \frac{P(\Theta)P(X \mid \Theta)}{P(X)} \]

Read more about Bayes’ Theorem

\(P(\Theta)\): Prior: Initial distribution of \(\Theta\) before seeing the data.
\(P(X \mid \Theta)\): Likelihood: Conditional distribution of data \(X\), given the parameter \(\Theta\).
\(P(\Theta \mid X)\): Posterior: Conditional distribution of parameter \(\Theta\), given the data \(X\).
\(P(X)\): Evidence: Probability of seeing the data \(X\).

\(X_1, X_2, \dots, X_n\) are I.I.D. Bernoulli random variable.
\(x_1, x_2, \dots, x_n\) are the realisations, \(\Theta \in [0, 1] \).
Estimate the parameter \(\Theta\) using Bayesian statistics.

Let, \(n_1 = \) number of 1’s in the dataset.
Prior: \(P(\Theta)\) : \(\Theta \sim U(0, 1) \), i.e., the parameter \(\Theta\) comes from a continuous uniform distribution in the range [0,1].
\(f_{\Theta}(\theta) = 1, \theta \in [0, 1] \)

Likelihood: \(P_{X \mid \Theta}(x \mid \theta) = \theta^{n_1} (1 - \theta)^{n - n_1} \).

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\ \]

Note: The most difficult part is to calculate the denominator \(f_{X}(x)\).
So, either we avoid computing it altogether, or we map it to some known function to make the calculation easier.

We know that we can get the marginal probability by integrating the joint probability over another variable.

\[ \tag{1} f_{X}(x) = \int_{Y}f_{X,Y}(x,y)dy \\ \]

Also from conditional probability, we know:

\[ \tag{2} f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_{Y}(y)} \\ => f_{X,Y}(x,y) = f_{Y}(y) f_{X \mid Y}(x \mid y) \]

From equations 1 and 2, we have:

\[ \tag{3} f_{X}(x) = \int_{Y}f_{Y}(y) f_{X \mid Y}(x \mid y)dy \]

Now let’s replace the value of \(f_{X}(x) \) in the posterior from equation 3:
Posterior:

\[ \begin{aligned} f_{\Theta \mid X} (\theta \mid x) &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{\Theta}f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] \text{ We know that: } f_{\Theta}(\theta) = 1, \theta \in [0, 1] \\ &= \frac{1* P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1 1* P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] => f_{\Theta \mid X} (\theta \mid x) & = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{\int_{0}^1 \theta^{n_1} (1 - \theta)^{n - n_1}d\theta} \\ \end{aligned} \]

Beta Function:
It is a special mathematical function denoted by B(a, b) or β(a, b) that is defined by the integral formula:

\[ β(a, b) = \int_{0}^1 t^{a-1}(1-t)^{b-1}dt \]

Note: We can see that the denominator of the posterior is of the form of Beta function.
Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{β(n_1+1, n-n_1+1)} \]
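We can sanity-check this posterior numerically; `math.gamma` gives the Beta function via the identity \( \beta(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b) \). The counts \(n = 10, n_1 = 6\) below are just illustrative.

```python
import math

def beta_fn(a, b):
    # Beta function via the Gamma function identity
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

n, n1 = 10, 6  # illustrative: 10 Bernoulli observations with six 1's

def posterior(theta):
    """Posterior density under the uniform prior, i.e. Beta(n1+1, n-n1+1)."""
    return theta**n1 * (1 - theta)**(n - n1) / beta_fn(n1 + 1, n - n1 + 1)

# a valid density must integrate to 1 (trapezoidal rule on a fine grid)
h = 0.001
grid = [i * h for i in range(1001)]
area = sum((posterior(grid[i]) + posterior(grid[i + 1])) / 2 * h for i in range(1000))
print(round(area, 3))  # 1.0
```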

Suppose, in the above example, we are told that the parameter \(\Theta\) is closer to 1 than 0.
How will we incorporate this useful information (apriori knowledge) into our parameter estimation?


Since, we know apriori that \(\Theta\) is closer to 1 than 0, we should take this initial belief into account to do our parameter estimation.

Prior:

\[ f_{\Theta}(\theta) = \begin{cases} 2\Theta, & \forall ~ \theta \in [0,1] \\ \\ 0, & \text{otherwise} \end{cases} \\ \]

Posterior:

\[ \begin{aligned} f_{\Theta \mid X} (\theta \mid x) &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{\Theta}f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] \text{ We know that: } f_{\Theta}(\theta) = 2\Theta, \theta \in [0, 1] \\ &= \frac{2\Theta * P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1 2\Theta* P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] & = \frac{2\Theta * \theta^{n_1} (1 - \theta)^{n - n_1}}{\int_{0}^1 2\Theta * \theta^{n_1} (1 - \theta)^{n - n_1}d\theta} \\[10pt] => f_{\Theta \mid X} (\theta \mid x) &= \frac{\theta^{n_1+1} (1 - \theta)^{n - n_1}}{β(n_1+2, n-n_1+1)} \end{aligned} \]

Note: If we do NOT have enough data, then we should NOT ignore our initial belief.
However, if we have enough data, then the data will override our initial belief and the posterior will be dominated by data.

Plot: Prior, Posterior & MLE
images/maths/probability/mle.png

Note: Bayesian approach gives us a probability distribution of parameter \(\Theta\).

What if we want to have a single point estimate of the parameter \(\Theta\) instead of a probability distribution?

We can use Bayesian Point Estimators, after getting the posterior distribution, to summarize it with a single value for practical use, such as,

  • Maximum A Posteriori (MAP) Estimator
  • Minimum Mean Square Error (MMSE) Estimator
Maximum A Posteriori (MAP) Estimator

It finds the mode(peak) of the posterior distribution.

  • MAP has the minimum probability of error, since it picks single most probable value.
\[ \Theta_{MAP} = \underset{\Theta}{\mathrm{argmax}}\ f_{\Theta \mid X} (\theta \mid x) \text{, \(\theta\) is continuous} \\ \Theta_{MAP} = \underset{\Theta}{\mathrm{argmax}}\ P_{\Theta \mid X} (\theta \mid x) \text{, \(\theta\) is discrete} \]
Given a Gaussian distribution, \( X \sim N(\Theta, 1) \) with a prior belief that \(\mu\) is equally likely to be 0 or 1.
Estimate the unknown parameter \(\mu\) using MAP.

Given that :
\(\Theta\) is discrete, with probability 0.5 for both 0 and 1.
=> The Gaussian distribution is equally likely to be centered at 0 or 1.
Variance: \(\sigma^2 = 1\)

Prior:

\[ P_{\Theta}(\theta=0) = P_{\Theta}(\theta=1) = 1/2 \]

Likelihood:

\[ f_{X \mid \Theta}(x \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i-\theta)^2}{2 \sigma^2}} \\ f_{X \mid \Theta}(x \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i-\theta)^2}{2}}, \text{ since \(\sigma = 1\)} \]

Posterior:

\[ P_{\Theta \mid X} (\theta \mid x) = \frac{P_{\Theta}(\theta) f_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\ \]

We need to find:

\[ \underset{\Theta}{\mathrm{argmax}}\ P_{\Theta \mid X} (\theta \mid x) \]

Taking log on both sides:

\[ \tag{1}\log P_{\Theta \mid X} (\theta \mid x) = \log P_{\Theta}(\theta) + \log f_{X \mid \Theta}(x \mid \theta) - \log f_{X}(x) \]

Let’s calculate the log-likelihood function first -

\[ \begin{aligned} \log f_{X \mid \Theta}(x \mid \theta) &= \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i-\theta)^2}{2}} \\ &= log(\frac{1}{\sqrt{2\pi}})^n + \sum_{i=1}^n \log (e^{-\frac{(x_i-\theta)^2}{2}}), \text{ since \(\sigma = 1\)} \\ &= n\log(\frac{1}{\sqrt{2\pi}}) + \sum_{i=1}^n -\frac{(x_i-\theta)^2}{2} \\ => \tag{2} \log f_{X \mid \Theta}(x \mid \theta) &= n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i-\theta)^2}{2} \end{aligned} \]

Here, computing \(f_{X}(x)\) is difficult.
Also, since \(\Theta\) takes only the two discrete values 0 and 1, we cannot differentiate with respect to \(\Theta\) and equate to 0.
Instead, we evaluate the log-posterior for \(\theta=0\) and \(\theta=1\) and compare the two values.
This way, the common term \(\log f_{X}(x)\), which does not depend on \(\Theta\), cancels out of the comparison.

When \(\theta=1\):

\[ \begin{aligned} \log P_{\Theta \mid X} (1 \mid x) &= \log P_{\Theta}(1) + \log f_{X \mid \Theta}(x \mid 1) \\ \tag{3} \log P_{\Theta \mid X} (1 \mid x) &= \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2}\\ \end{aligned} \]

Similarly, when \(\theta=0\):

\[ \begin{aligned} \log P_{\Theta \mid X} (0 \mid x) &= \log P_{\Theta}(0) + \log f_{X \mid \Theta}(x \mid 0) \\ \tag{4} \log P_{\Theta \mid X} (0 \mid x) &= \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2}\\ \end{aligned} \]

So, the MAP estimate is \(\theta = 1\) only if:
the log-posterior for \(\theta = 1\) > the log-posterior for \(\theta = 0\). From equations 3 and 4:

\[ \begin{aligned} &\log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\cancel{\log(1/2)} + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \cancel{\log(1/2)} + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\sum_{i=1}^n \frac{(x_i - 0)^2}{2} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > 0 \\ => & \sum_{i=1}^n \frac{\cancel{x_i^2} - \cancel{x_i^2} + 2x_i -1 }{2} > 0 \\ => & \sum_{i=1}^n [x_i - \frac{1}{2}] > 0 \\ => & \sum_{i=1}^n x_i - \frac{n}{2} > 0 \\ => & \sum_{i=1}^n x_i > \frac{n}{2} \\ => & \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} \\ => & \Theta_{MAP}(X) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} \\ \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \]

Note: If the prior is uniform then \(\Theta_{MAP} = \Theta_{MLE}\), because uniform prior does NOT give any information about the initial bias, and all possibilities are equally likely.
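The decision rule just derived (uniform prior over \(\{0, 1\}\), unit-variance Gaussian likelihood) reduces to comparing the sample mean with \(\frac{1}{2}\); the observations below are made up for illustration:

```python
def theta_map(xs):
    # MAP rule derived above: pick theta = 1 iff the sample mean exceeds 1/2
    return 1 if sum(xs) / len(xs) > 0.5 else 0

print(theta_map([0.9, 1.2, 0.7, 1.1]))   # sample mean 0.975 > 0.5  -> 1
print(theta_map([0.1, -0.3, 0.4, 0.2]))  # sample mean 0.1  <= 0.5  -> 0
```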

What if, in the above example, we know that the initial belief is not uniform but biased towards 0?
Prior:

\[ P_{\Theta}(\theta=0) = 3/4 , P_{\Theta}(\theta=1) = 1/4 \]

Now, let’s compare the log-posterior for both the cases i.e \(\theta=0\) and \(\theta=1\) as we did earlier.
But, note that this time the probabilities for \(\theta=0\) and \(\theta=1\) are different.

So, the MAP estimate is \(\theta = 1\) only if:
the log-posterior for \(\theta = 1\) > the log-posterior for \(\theta = 0\). From equations 3 and 4 above:

\[ \begin{aligned} &\log(1/4) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(3/4) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\log(1/4) + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(3/4) + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\sum_{i=1}^n \frac{(x_i - 0)^2}{2} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log3 -\cancel{\log4} -\log1 +\cancel{\log4} \\ => & \sum_{i=1}^n \frac{\cancel{x_i^2} - \cancel{x_i^2} + 2x_i -1 }{2} > \log3 - 0 \text{ , since log(1) = 0}\\ => & \sum_{i=1}^n [x_i - \frac{1}{2}] > \log3 \\ => & \sum_{i=1}^n x_i - \frac{n}{2} > \log3 \\ => & \sum_{i=1}^n x_i > \frac{n}{2} + \log3 \\ \text{ dividing both sides by n: } \\ => & \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} + \frac{\log3}{n}\\ => & \Theta_{MAP}(X) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} + \frac{\log3}{n}\\ \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \]

Therefore, we can see that \(\Theta_{MAP}\) is now biased towards 0: a larger sample mean is needed before we decide \(\theta = 1\).

Note: For a non-uniform prior \(\Theta_{MAP}\) estimate will be pulled towards the prior’s mode.
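With the biased prior, the threshold shifts to \(\frac{1}{2} + \frac{\log 3}{n}\) (natural log, as in the derivation). A small sketch with made-up observations shows the prior overriding data that a uniform prior would have called \(\theta = 1\):

```python
import math

def theta_map_biased(xs):
    # MAP rule with prior P(theta=0)=3/4, P(theta=1)=1/4:
    # pick theta = 1 iff sample mean > 1/2 + ln(3)/n
    n = len(xs)
    return 1 if sum(xs) / n > 0.5 + math.log(3) / n else 0

# mean 0.6 clears the uniform-prior threshold 0.5, but not 0.5 + ln(3)/4 ~ 0.775
print(theta_map_biased([0.5, 0.7, 0.6, 0.6]))  # 0
# a larger sample mean is needed to overcome the prior
print(theta_map_biased([0.9, 1.2, 0.7, 1.1]))  # mean 0.975 -> 1
```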

MAP estimator is good for classification like problems, such as Yes/No, True/False, etc.
e.g: Patient has a certain disease or not.
But what if we want to minimize the average magnitude of errors over time, say when predicting a stock’s price?
Just knowing whether the prediction was right or wrong is not sufficient here.
We also want to know by how much the prediction was wrong, so that we can minimize the loss over time.
We can use Minimum Mean Square Error (MMSE) Estimator to do this.
Minimum Mean Square Error (MMSE) Estimation

Minimizes the expected value of squared error.
Mean of the posterior distribution is the conditional expectation of parameter \(\Theta\), given the data.

  • Posterior mean minimizes the mean squared error.
\[ \hat\Theta_{MMSE}(X) = \underset{\Theta}{\mathrm{argmin}}\ \mathbb{E}[(\hat\Theta(X) - \Theta)^2] \]

\(\hat\Theta(X)\): Predicted value.
\(\Theta\): Actual value.

\[ \hat\Theta_{MMSE}(X) = \sum_{\Theta} \theta P_{\Theta \mid X} (\theta \mid x), \text{ if \(\Theta\) is discrete} \\ \hat\Theta_{MMSE}(X) = \int_{\Theta} \theta f_{\Theta \mid X} (\theta \mid x)d\theta, \text{ if \(\Theta\) is continuous} \]

Let’s revisit the above examples that we used for MLE.

Case 1: Uniform Continuous Prior for parameter \(\Theta\) of Bernoulli distribution
Prior:

\[ f_{\Theta}(\theta) = 1, ~or~ X \sim U(0,1), ~or~ \beta(1,1) \]

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{\beta(n_1+1, n-n_1+1)} \]

Let’s calculate the \(\Theta_{MMSE}\):

\[ \begin{aligned} \Theta_{MMSE}(X) &= \int_{0}^1 \theta f_{\Theta \mid X} (\theta \mid x)d\theta \\ & = \frac{\int_{0}^1 \theta * \theta^{n_1} (1 - \theta)^{n - n_1}d\theta}{\beta(n_1+1, n-n_1+1)} \\[10pt] & = \frac{ \int_{0}^1 \theta^{n_1+1} (1 - \theta)^{n - n_1}d\theta}{\beta(n_1+1, n-n_1+1)} \\[10pt] \text{ Since: } \beta(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}dt \\[10pt] => \Theta_{MMSE}(X) &= \frac{\beta(n_1+2, n-n_1+1)}{\beta(n_1+1, n-n_1+1)} \end{aligned} \]

Similarly, for the second case where the prior is biased towards 1.
Case 2: Prior is biased towards 1 for parameter \(\Theta\) of Bernoulli distribution
Prior:

\[ f_{\Theta}(\theta) = 2\Theta \]

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1+1} (1 - \theta)^{n - n_1}}{\beta(n_1+2, n-n_1+1)} \]

So, here in this case our \(\Theta_{MMSE}\) is:

\[ \Theta_{MMSE}(X) = \frac{\beta(n_1+3, n-n_1+1)}{\beta(n_1+2, n-n_1+1)} \]
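Both Beta-function ratios simplify via \( \frac{\beta(a+1,b)}{\beta(a,b)} = \frac{a}{a+b} \). A short numerical check, with illustrative counts \(n = 10, n_1 = 6\):

```python
import math

def beta_fn(a, b):
    # Beta function via the Gamma function identity
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

n, n1 = 10, 6  # illustrative: 10 Bernoulli observations, six 1's

# Case 1 (uniform prior): beta(n1+2, n-n1+1) / beta(n1+1, n-n1+1) = (n1+1)/(n+2)
mmse_uniform = beta_fn(n1 + 2, n - n1 + 1) / beta_fn(n1 + 1, n - n1 + 1)
print(round(mmse_uniform, 4))  # 0.5833, i.e. 7/12

# Case 2 (prior 2*theta): beta(n1+3, n-n1+1) / beta(n1+2, n-n1+1) = (n1+2)/(n+3)
mmse_biased = beta_fn(n1 + 3, n - n1 + 1) / beta_fn(n1 + 2, n - n1 + 1)
print(round(mmse_biased, 4))  # 0.6154, i.e. 8/13 -- nudged towards 1 by the prior
```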
MAP vs MMSE
  • MMSE is the average of the posterior distribution, whereas MAP is the mode/peak.
  • If posterior distribution is symmetric and unimodal (only 1 peak), then MAP and MMSE are very close.
  • If posterior distribution is skewed and multimodal (many peaks), then MAP and MMSE can differ a lot.
  • MMSE considers all the values of the posterior distribution; hence, it is more accurate than MAP, especially for skewed or multimodal distributions.



End of Section

2.2 - Statistics

Statistics for AI & ML

2.2.1 - Data Distribution

Understanding Data Distribution


Measures of Central Tendency

A single number that describes the central, typical, or representative value of a dataset, e.g., mean, median, and mode.
The mean is the average, the median is the middle value in a sorted list, and the mode is the most frequently occurring value.

  • A single representative value can be used to compare different groups or distributions.
Mean

The arithmetic average of a set of numbers, i.e., sum all values and divide by the number of values.
\(mean = \frac{1}{n}\sum_{i=1}^{n}x_i\)

  • Most common measure of central tendency.
  • Represents the ‘balancing point’ of data.
  • Sample mean is denoted by \(\bar{x}\), and population mean by \(\mu\).

Pros:

  • Uses all datapoints in its calculation, providing a comprehensive measure.

Cons:

  • Highly sensitive to outliers, i.e., extreme values.
Example
  1. mean\((1,2,3,4,5) = \frac{1+2+3+4+5}{5} = 3 \)
  2. With outlier: mean\((1,2,3,4,100) = \frac{1+2+3+4+100}{5} = \frac{110}{5} = 22\)
    Note: Just a single extreme value of 100 has pushed the mean from 3 to 22.
Median

The middle value of a sorted list of numbers. It divides the dataset into 2 equal halves.
Calculation:

  • Arrange the data points in ascending order.
  • If the number of data points is even, the median is the average of the two middle values.
  • If the number of data points is odd, the median is the middle value i.e \((\frac{n+1}{2})^{th}\) element.

Pros:

  • Not impacted by outliers, making it a more robust/reliable measure, especially for skewed distributions.

Cons:

  • Does NOT use all the datapoints in its calculation.
Example
  1. median\((1,2,3,4,5) = 3\)
  2. median\((1,2,3,4,5,6) = \frac{3+4}{2} = 3.5\)
  3. With outlier: median\((1,2,3,4,100) = 3\)
    Note: No impact of outlier.
Mode

The most frequently occurring value in a dataset.

  • Dataset can have 1 mode i.e unimodal, 2 modes i.e bimodal, and more than 2 modes i.e multimodal.
    • If NO value repeats, then NO mode.

Pros:

  • Only measure of central tendency that can be used for categorical/nominal data, such as, gender, blood group, level of education, etc.
  • It can reveal important peaks in data distribution.

Cons:

  • A dataset can have multiple modes, or no mode at all, which can make mode less informative.
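All three measures can be computed with Python's standard `statistics` module; the outlier example above is reused to contrast mean and median:

```python
import statistics

data = [1, 2, 3, 4, 100]  # small dataset with one outlier

print(statistics.mean(data))    # 22 -- pulled up by the single outlier
print(statistics.median(data))  # 3  -- unaffected by the outlier
print(statistics.mode([1, 2, 2, 3, 4]))  # 2 -- the most frequent value
```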
Measures of Dispersion(Spread)
It measures the spread or variability of a dataset.
Quantifies how spread out or scattered the data points are.
E.g: Range, Variance, Standard Deviation, Mean Absolute Deviation (MAD), Skewness, Kurtosis, etc.

Range

The difference between the largest and smallest values in a dataset. It is the simplest measure of dispersion.
\(range = max - min\)

Pros:

  • Easy to calculate and understand.

Cons:

  • Only considers the 2 extreme values of the dataset and ignores the distribution of data in between.
  • Highly sensitive to outliers.
Example
  1. range\((1,2,3,4,5) = 5 - 1 = 4\)

Variance

The average of the squared distance of each value from the mean.
Measures the spread of data points.

\(sample ~ variance = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\), where the \((n-1)\) denominator is Bessel’s correction, making it an unbiased estimate.

\(population ~ variance = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\)

Cons:

  • Highly sensitive to outliers, as squaring amplifies the weight of extreme data points.
  • Less intuitive to understand, as the units are square of original units.
Standard Deviation

The square root of the variance, measures average distance of data points from the mean.

  • Low standard deviation indicates that the data points are clustered around the mean, whereas
    high standard deviation means that the data points are spread out over a wide range.

\(s = sample ~ standard ~ deviation \)
\(\sigma = population ~ standard ~ deviation \)

Example
  1. Population standard deviation\((1,2,3,4,5) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} \) \[ = \sqrt{\frac{1}{5}((1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2)} \\ = \sqrt{\frac{1}{5}(4+1+0+1+4)} \\ = \sqrt{\frac{10}{5}} = \sqrt{2} = 1.414 \]
Mean Absolute Deviation

It is the average of absolute deviation or distance of all data points from mean.

\( mad = \frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}| \)

Pros:

  • Less sensitive to outliers as compared to standard deviation.
  • More intuitive and simpler to understand.
Example
  1. Mean Absolute Deviation\((1,2,3,4,5) = \\ \frac{1}{5}\left(\left|1-3\right| + \left|2-3\right| + \left|3-3\right| + \left|4-3\right| + \left|5-3\right|\right) = \frac{1}{5}\left(2+1+0+1+2\right) = \frac{6}{5} = 1.2\)
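The dispersion measures above, applied to the same dataset \((1,2,3,4,5)\) using population formulas to match the worked examples:

```python
import statistics

data = [1, 2, 3, 4, 5]
mean = statistics.mean(data)

data_range = max(data) - min(data)                  # range: 5 - 1 = 4
pop_var = statistics.pvariance(data)                # population variance: 2
pop_std = statistics.pstdev(data)                   # population std dev: sqrt(2)
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation: 1.2

print(data_range, pop_var, round(pop_std, 3), mad)
```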

Skewness

It measures the asymmetry of a data distribution.
It tells us whether the data is concentrated on one side of the mean, with a long tail stretching out on the other side.

Positive Skew:

  • Tail is longer on the right side of the mean.
  • Bulk of data is on the left side of the mean, but there are a few very high values pulling the mean towards the right.
  • Mean > Median > Mode.

Negative Skew:

  • Tail is longer on the left side of the mean.
  • Bulk of data is on the right side of the mean, but there are a few very low values pulling the mean towards the left.
  • Mean < Median < Mode.

Zero Skew:

  • Perfectly symmetrical like a normal distribution.
  • Mean = Median = Mode.
images/maths/statistics/skewness.png

Example
  1. Consider the salary of employees in a company. Most employees earn a very modest salary, but a few executives earn extremely high salaries. This dataset will be positively skewed with the mean salary > median salary.
    Median salary would be a better representation of the typical salary of employees.
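A minimal sketch of the population (Fisher-Pearson) skewness coefficient, applied to hypothetical salary data like the example above:

```python
def skewness(xs):
    """Population skewness: E[(x - mu)^3] / sigma^3."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / var ** 1.5

salaries = [30, 32, 35, 35, 38, 40, 42, 45, 200]  # hypothetical salaries (in thousands)
print(skewness(salaries) > 0)  # True: positively skewed by the one executive salary
```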
Kurtosis

It measures the “tailedness” of a data distribution.
It describes how much the data is concentrated in tails (fat or thin) versus the center.

  • It can tell us about the frequency of outliers in the data.
    • Thick tails => More outliers.

Excess Kurtosis:
Excess kurtosis is calculated by subtracting 3 from standard kurtosis in order to compare with normal distribution.
Normal distribution has kurtosis = 3.

Mesokurtic:

  • Excess kurtosis = 0 i.e normal kurtosis.
  • Tails are neither too thick nor too thin.

Leptokurtic:

  • High kurtosis, i.e, excess kurtosis > 0 (+ve).
  • Heavy or thick tails => High probability of outliers.
  • Sharp peak => High concentration of data around mean.
  • E.g: Student’s t-distribution, Laplace distribution, etc.
  • High risk stock portfolios.

Platykurtic:

  • Low kurtosis, i.e, excess kurtosis < 0 (-ve).
  • Thin tails => Low probability of outliers.
  • Low peak => more uniform distribution of values.
  • E.g: Uniform distribution, Bernoulli(P=0.5) distribution, etc.
  • Investment in fixed deposits.

images/maths/statistics/kurtosis.png

images/maths/statistics/excess_kurtosis.png
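Excess kurtosis can be sketched the same way; a discrete uniform-like dataset comes out platykurtic (negative excess kurtosis), as noted above:

```python
def excess_kurtosis(xs):
    """Population excess kurtosis: E[(x - mu)^4] / sigma^4 - 3."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

# uniform-like data: thin tails, low peak -> negative excess kurtosis
print(round(excess_kurtosis(list(range(1, 11))), 2))  # -1.22
```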

Measures of Position
It helps us understand the relative position of a data point i.e where a specific value lies within a dataset.
E.g: Percentile, Quartile, Inter Quartile Range(IQR), etc.

Percentile

It indicates the percentage of scores in a dataset that are equal to or below a specific value.
Here, the complete dataset is divided into 100 equal parts.

  • \(k^{th}\) percentile => at least \(k\) percent of the data points are equal to or below the value.
  • It is a relative comparison, i.e, compares a score with the entire group’s performance.
  • Quartiles are basis for box plots.
Example
  1. 90th percentile => score is higher than 90% of all other test takers.

Quartile

They are special percentiles that divide the complete dataset into 4 equal parts.

Q1 => 25th percentile, value below which 25% of the data falls.
Q2 => 50th percentile, value below which 50% of the data falls; median.
Q3 => 75th percentile, value below which 75% of the data falls.

\[ Q1 = (n+1) * 1/4 \\ Q2 = (n+1) * 1/2 \\ Q3 = (n+1) * 3/4 \]
Example
  1. Data = \(\{1,2,3,4,5,6,7,8,9,10,100\}\) \[ Q1 = (11+1) * 1/4 = 12*1/4 = 3 \\ Q2 = (11+1) * 1/2 = 12*1/2 = 6 \\ Q3 = (11+1) * 3/4 = 12*3/4 = 9 \]
images/maths/statistics/quartiles.png
Inter Quartile Range(IQR)

It is the single number that measures the spread of the middle 50% of the data, i.e., the data between Q1 and Q3.

  • A more robust measure of spread than the range, as it is NOT impacted by outliers.

IQR = Q3 - Q1

Example
  1. Data = \(\{1,2,3,4,5,6,7,8,9,10,100\}\) \[ Q1 = (11+1) * 1/4 = 12*1/4 = 3 \\ Q2 = (11+1) * 1/2 = 12*1/2 = 6 \\ Q3 = (11+1) * 3/4 = 12*3/4 = 9 \]

Therefore, IQR = Q3-Q1 = 9-3 = 6

Outlier Detection

IQR is a standard tool for detecting outliers.
Values that fall outside the ‘fences’ can be considered as potential outliers.

Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR

Example
  1. Data = \(\{1,2,3,4,5,6,7,8,9,10,100\}\) \[ Q1 = (11+1) * 1/4 = 12*1/4 = 3 \\ Q2 = (11+1) * 1/2 = 12*1/2 = 6 \\ Q3 = (11+1) * 3/4 = 12*3/4 = 9 \]

IQR = Q3-Q1 = 9-3 = 6
Lower fence = Q1 - 1.5 * IQR = 3 - 9 = -6
Upper fence = Q3 + 1.5 * IQR = 9 + 9 = 18
So, any data point that is less than -6 or greater than 18 is considered as a potential outlier.
As in this example, 100 can be considered as an outlier.
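The fence calculation above can be sketched with NumPy. The `method='weibull'` option (available in NumPy >= 1.22) matches the \((n+1)k/4\) quartile convention used in this section; NumPy's default interpolation would give slightly different quartiles.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# method='weibull' reproduces the (n+1)*k/4 positions: Q1=3, Q3=9
q1, q3 = np.percentile(data, [25, 75], method='weibull')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Potential outliers fall outside the fences
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr)               # 3.0 9.0 6.0
print(lower_fence, upper_fence)  # -6.0 18.0
print(outliers)                  # [100]
```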

Anscombe's Quartet

Even though the above metrics give us a good idea of the data distribution, we should still always plot the data and visually inspect it, as these metrics may not provide the complete picture.

The statistician Francis John Anscombe illustrated this point beautifully with his Anscombe’s Quartet.

Anscombe’s quartet:
It comprises four datasets that have nearly identical simple descriptive statistics,
yet have very different distributions and appear very different when plotted.

images/maths/statistics/anscombe_quartet_data.png

images/maths/statistics/anscombe_quartet.png

Figure: Anscombe's Quartet



End of Section

2.2.2 - Correlation

Covariance & Correlation


Covariance

It measures the direction of linear relationship between two variables \(X\) and \(Y\).

\[Population ~ Covariance(X,Y) = \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{x})(y_i - \mu_{y})\]


\(N\) = size of population
\(\mu_{x}\) = population mean of \(X\)
\(\mu_{y}\) = population mean of \(Y\)

\[Sample ~ Covariance(X,Y) = s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]


\(n\) = size of sample
\(\bar{x}\) = sample mean of \(X\)
\(\bar{y}\) = sample mean of \(Y\)

Note: We have a term (n-1) instead of n in the denominator to make it an unbiased estimate, called Bessel’s Correction.

If both \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have the same sign, then the product is positive(+ve).
If both \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have opposite signs, then the product is negative(-ve).
The final value of covariance depends on the sum of the above individual products.

\( \begin{aligned} \text{Cov}(X, Y) &> 0 &&\Rightarrow \text{ } X \text{ and } Y \text{ increase or decrease together} \\ \text{Cov}(X, Y) &= 0 &&\Rightarrow \text{ } \text{No linear relationship} \\ \text{Cov}(X, Y) &< 0 &&\Rightarrow \text{ } \text{If } X \text{ increases, } Y \text{ decreases (and vice versa)} \end{aligned} \)

Limitation:
Covariance is scale-dependent, i.e, units of X and Y impact its magnitude.
This makes it hard to make comparisons of covariance across different datasets.
E.g: Covariance between age and height will NOT be same as the covariance between years of experience and salary.

Note: It only measures the direction of the relationship, but does NOT give any information about the strength of the relationship.

Example
  1. \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
    Let’s calculate the covariance:
    \(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
    \(\bar{x} = 2\) and \(\bar{y} = 4\)
    \(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
    \( = \frac{1}{2}[2+0+2]= 2\)
    => Cov(X,Y) > 0 i.e if X increases, Y increases and vice versa.
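The same value comes straight out of `np.cov`, which applies Bessel's correction (the n-1 denominator) by default:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 4, 6])

# np.cov returns the matrix [[Var(x), Cov(x,y)], [Cov(x,y), Var(y)]],
# using the unbiased (n-1) denominator by default
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)  # 2.0
```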

Correlation

It measures both the strength and direction of the linear relationship between two variables \(X\) and \(Y\).
It is a standardized version of covariance that gives a dimensionless measure of linear relationship.

There are 2 popular ways to calculate correlation coefficient:

  1. Pearson Correlation Coefficient (r)
  2. Spearman Rank Correlation Coefficient (\(\rho\))

Pearson Correlation Coefficient (r)

It is a standardized version of covariance and most widely used measure of correlation.
Assumption: Data is normally distributed.

\[r_{xy} = \frac{Cov(X, Y)}{\sigma_{x} \sigma_{y}}\]


\(\sigma_{x}\) and \(\sigma_{y}\) are the standard deviations of \(X\) and \(Y\).

Range of \(r\) is between -1 and 1.
\(r = 1\) => perfect +ve linear relationship between X and Y
\(r = -1\) => perfect -ve linear relationship between X and Y
\(r = 0\) => NO linear relationship between X and Y.

Note: A correlation coefficient of 0.9 means that there is a strong linear relationship between X and Y, irrespective of their units.

Example
  1. \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
    Let’s calculate the covariance:
    \(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
    \(\bar{x} = 2\) and \(\bar{y} = 4\)
    \(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
    \( => \text{Cov}(X, Y) = \frac{1}{2}[2+0+2]= 2\)

Let’s calculate the standard deviation of \(X\) and \(Y\):
\(\sigma_{x} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \)
\(= \sqrt{\frac{1}{3-1}[(1-2)^2 + (2-2)^2 + (3-2)^2]}\)
\(= \sqrt{\frac{1+0+1}{2}} =\sqrt{\frac{2}{2}} = 1 \)

Similarly, we can calculate the standard deviation of \(Y\):
\(\sigma_{y} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2} \)
\(= \sqrt{\frac{1}{3-1}[(2-4)^2 + (4-4)^2 + (6-4)^2]}\)
\(= \sqrt{\frac{4+0+4}{2}} =\sqrt{\frac{8}{2}} = 2 \)

Now, we can calculate the Pearson correlation coefficient (r):
\(r_{xy} = \frac{Cov(X, Y)}{\sigma_{x} \sigma_{y}}\)
=> \(r_{xy} = \frac{2}{1* 2}\)
=> \(r_{xy} = 1\)
Therefore, we can say that there is a perfect +ve linear relationship between X and Y.

Spearman Rank Correlation Coefficient (\(\rho\))

It is a measure of the strength and direction of the monotonic relationship between two ranked variables \(X\) and \(Y\).
It captures monotonic relationship, meaning the variables move in the same or opposite direction,
but not necessarily a linear relationship.

  • It is used when Pearson’s correlation is not suitable, such as, ordinal data, or when the continuous data does not meet the assumptions of linear methods, such as, Pearson’s correlation.
  • Non-parametric measure of correlation that uses ranks instead of raw data.
  • Quantifies how well the ranks of one variable predict the ranks of the other variable.
  • Range of \(\rho\) is between -1 and 1.
\[\rho_{xy} = 1 - \frac{6\sum_{i}d_i^2}{n(n^2-1)}\]

\(d_i\): difference between the two ranks of the \(i^{th}\) observation
\(n\): number of observations


Example
  1. Compute the correlation of ranks awarded to a group of 5 students by 2 different teachers.

    | Student | Teacher A Rank | Teacher B Rank | \(d_i\) | \(d_i^2\) |
    | ------- | -------------- | -------------- | ------- | --------- |
    | S1      | 1              | 2              | -1      | 1         |
    | S2      | 2              | 1              | 1       | 1         |
    | S3      | 3              | 3              | 0       | 0         |
    | S4      | 4              | 5              | -1      | 1         |
    | S5      | 5              | 4              | 1       | 1         |

\(\sum_{i}d_i^2 = 4 \)
\( n = 5 \)
\(\rho_{xy} = 1 - \frac{6\sum_{i}d_i^2}{n(n^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{6*4}{5(5^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{24}{5*24}\)
=> \(\rho_{xy} = 1 - \frac{1}{5}\)
=> \(\rho_{xy} = 0.8\)
Therefore, we can say that there is a strong +ve correlation between the ranks given by teacher A and teacher B.

  2. \(X = [1, 2, 3] \) and \(Y = [1, 8, 27] \)
    Here, Spearman’s rank correlation coefficient \(\rho\) will be a perfect 1, as there is a monotonic relationship, i.e., as X increases, Y increases and vice versa.
    But, the Pearson’s correlation coefficient (r) will be slightly less than 1, i.e., r ≈ 0.9663.
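A minimal check of this example with SciPy: Pearson comes out slightly below 1 for the cubic (monotonic but non-linear) relationship, while Spearman is exactly 1.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3])
y = x ** 3  # [1, 8, 27] -- monotonic, but not linear

r, _ = stats.pearsonr(x, y)      # Pearson: sensitive to non-linearity
rho, _ = stats.spearmanr(x, y)   # Spearman: works on ranks only

print(round(r, 3))  # 0.966
print(rho)          # 1.0
```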

Correlation Application

Correlation is very useful in feature selection for training machine learning models.

  1. If 2 features are highly correlated => they provide redundant information.
  • One of the features can be removed without significant loss of information.
  • Keeping both can cause issues, such as, multicollinearity.
  2. If a feature is highly correlated with the target variable => this feature is a strong predictor, so keep it.
  • A feature with very low or near-zero correlation with the target variable may be considered for removal, as it has little predictive power.
Correlation Vs Causation

Causation means that one variable directly causes the change in another variable, i.e, direct
cause->effect relationship.
Whereas, correlation means that two variables move together.

  • Correlation does NOT imply Causation.
    • Correlation simply shows an association between two variables that could be coincidental or due to some third, unobserved, factor.

E.g: Election results and stock market - there may be some correlation between the two,
but establishing clear causal links is difficult.



End of Section

2.2.3 - Central Limit Theorem

Central Limit Theorem


Before we understand the Central Limit Theorem, let’s understand a few related concepts.

Population Mean

It is the true average of the entire group.
It describes the central tendency of the entire population.

\( \mu = \frac{1}{N}\sum_{i=1}^{N}x_i \)
N: Number of data points

Sample Mean

It is the average of a smaller representative subset (a sample) of the entire population.
It provides an estimate of the population mean.

\( \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \)
n: size of sample

Law of Large Numbers
This law states that as the number of I.I.D samples from a population increases,
the sample mean converges to the true population mean.
In other words, a long-run average of a repeated random variable approaches the expected value.

Central Limit Theorem

This law states that for a sequence of I.I.D random variables \( X_1, X_2, \dots, X_n \),
with finite mean and variance, the distribution of the sample mean \( \bar{X} \) approaches a normal distribution as \( n \rightarrow \infty \), regardless of its original population distribution.
The distribution of the sample mean is : \( \bar{X} \sim N(\mu, \sigma^2/n)\)

Let, \( X_1, X_2, \dots, X_n \) are I.I.D random variables.

  • Population mean = \(E[X_i] = \mu < \infty\)
  • Population Variance = \(Var[X_i] = \sigma^2 < \infty \)
  • Sample mean = \( \bar{X_n} = \frac{1}{n}\sum_{i=1}^{n}X_i = \frac{1}{n}(X_1 + X_2+ \dots +X_n) \)
  • Variance of sample means = \( Var[\bar{X_n}] = Var[\frac{1}{n}(X_1+ X_2+ \dots+ X_n)]\)

Now, let’s calculate the variance of sample means.
We know that:

  1. \(Var[X+Y] = Var[X] + Var[Y] \), for independent random variables X and Y.
  2. \(Var[cX] = c^2Var[X] \), for constant ‘c’.

Let’s apply above 2 rules on the variance of sample means equation above:

\[ \begin{aligned} Var[\bar{X_n}] &= Var[\frac{1}{n}(X_1+ X_2+ \dots+ X_n)] \\ &= \frac{1}{n^2}[Var[X_1+ X_2+ \dots+ X_n]] \\ &= \frac{1}{n^2}[Var[X_1] + Var[X_2] + \dots + Var[X_n]] \\ \text{We know that: } Var[X_i] = \sigma^2 \\ &= \frac{1}{n^2}[\sigma^2 + \sigma^2 + \dots + \sigma^2] \\ &= \frac{n\sigma^2}{n^2} \\ => Var[\bar{X_n}] &= \frac{\sigma^2}{n} \end{aligned} \]

Since, standard deviation = \(\sigma = \sqrt{Variance}\)
Therefore, Standard Deviation\([\bar{X_n}] = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\)
The standard deviation of the sample means is also known as “Standard Error”.

Note: We can also standardize the sample mean, i.e, mean centering and variance scaling.
Standardisation helps us to use the Z-tables of normal distribution.

We know that, a standardized random variable \(Y_i = \frac{X_i - \mu}{\sigma}\)
Similarly, standardized sample mean:

\[ Z_n = \frac{\bar{X_n} - \mu}{\sqrt{Var[\bar{X_n}]}} = \frac{ \frac{1}{n}\sum_{i=1}^{n}X_i - \mu}{\frac{\sigma}{\sqrt{n}}} \\ = \frac{\sum_{i=1}^{n}X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{Distribution} N(0,1) , \text{ as } n \rightarrow \infty \\ Z_n \text{ converges in distribution to } N(0,1), \text{ as } n \rightarrow \infty \]

Note: For practical purposes, \(n \ge 30\) is considered as a sufficient sample size for the CLT to hold.

Example
  1. Let’s collect the data for height of people in a city to find the average height of people in the city.
  • Sample size (n) = 100
  • And then repeat this data collection process 1000 times.
  • For each of these 1000 (k) samples, calculate the sample means \(\bar{X}_1, \bar{X}_2, \dots, \bar{X}_{1000(k)} \)
  • Now, when we plot these 1000(k) sample means, the resulting distribution will be very close to a normal/Gaussian distribution.
    \(\bar{X_n} \sim N(\mu, \sigma^2/n)\), for large n, typically \(n \ge 30\).

Note:

  • ‘k’ = a large number of repetitions allows us to observe the distribution of sample means after plotting.
  • ’n’ = number of samples in each repetition is fixed for any given calculation of sample mean \(\bar{X_n}\).
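The repeated-sampling thought experiment above can be simulated directly. Even for a clearly non-normal population (an exponential distribution, chosen arbitrarily here), the sample means cluster around \(\mu\) with spread close to the standard error \(\sigma/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(42)

n, k = 100, 1000  # sample size per repetition, number of repetitions
# Non-normal population: exponential with mean 5 (its std. dev. is also 5)
samples = rng.exponential(scale=5.0, size=(k, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())       # close to the population mean, 5
print(sample_means.std(ddof=1))  # close to the standard error 5/sqrt(100) = 0.5
```

Plotting a histogram of `sample_means` would show an approximately normal shape, as the CLT predicts.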

Why variance must be finite?

The variance must be finite, else, the sample mean will NOT converge to a normal distribution.
If a distribution has a sufficiently heavy tail, the mean or variance calculation diverges.
e.g:

  1. Cauchy distribution has an undefined mean and infinite variance.
  2. Pareto distribution (with low alpha) has infinite variance, such as distribution of wealth.



End of Section

2.2.4 - Confidence Interval

Confidence Interval


Confidence Interval

It is a range of values that is likely to contain the true population mean, based on a sample.
Instead of giving a point estimate, it gives a range of values with confidence level.

For normal distribution, confidence interval :

\[ CI = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \]

\(\bar{X}\): Sample mean
\(Z\): Z-score corresponding to confidence level
\(n\): Sample size
\( \sigma \): Population Standard Deviation

Applications:

  • A/B testing, i.e., compare 2 or more versions of a product.
  • ML model performance evaluation, i.e, instead of giving a single performance score of say 85%,
    it is better to provide a 95% confidence interval, such as, [82.5%, 87.8%].

Meaning of Confidence Interval

95% confidence interval does NOT mean there is a 95% chance that the true mean lies in the specific calculated interval.

  • It just means that if we repeat the sampling process many times, then 95% of those calculated intervals will capture or contain the true population mean \(\mu\).
  • Also, we cannot say there is 95% probability that the true mean is within that specific range because true population mean is a fixed constant, NOT a random variable.
Example

Let’s suppose we want to measure the average weight of a certain species of dog.
We want to estimate the true population mean \(\mu\) using confidence interval.
Note: True average weight = 30 kg, but this is NOT known to us.

| Sample Number | Sample Mean | 95% Confidence Interval | Did it capture \(\mu\) ? |
| ------------- | ----------- | ----------------------- | ------------------------ |
| 1             | 29.8 kg     | (28.5, 31.1)            | Yes                      |
| 2             | 30.4 kg     | (29.1, 31.7)            | Yes                      |
| 3             | 31.5 kg     | (30.2, 32.8)            | No                       |
| 4             | 28.1 kg     | (26.7, 29.3)            | No                       |
| ...           | ...         | ...                     | ...                      |
| 100           | 29.9 kg     | (28.6, 31.2)            | Yes                      |
  • We generated 100 confidence intervals(CI) each based on different samples.
  • 95% CI guarantees that, in long run, 95 out of 100 CIs will include the true average weight, i.e, \(\mu=30kg\), and may be will miss 5 out of 100 times.

Suggest which company is offering a better salary?
Below is the details of the salaries based on a survey of 50 employees.

| Company | Average Salary (INR) | Standard Deviation |
| ------- | -------------------- | ------------------ |
| A       | 36 lpa               | 7 lpa              |
| B       | 40 lpa               | 14 lpa             |

For comparison, let’s calculate the 95% confidence interval for the average salaries of both companies A and B.
We know that:
\( CI = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \)
Margin of Error(MoE) \( = Z\frac{\sigma}{\sqrt{n}} \)
Z-Score for 95% CI = 1.96

\(MoE_A = 1.96*\frac{7}{\sqrt{50}} \approx 1.94 \)
=> 95% CI for A = \(36 \pm 1.94 \) = [34.06, 37.94]

\(MoE_B = 1.96*\frac{14}{\sqrt{50}} \approx 3.88\)
=> 95% CI for B = \(40 \pm 3.88 \) = [36.12, 43.88]

Initially, company B’s salary looked obviously better,
but after calculating the 95% CI, we can see that there is a significant overlap between the salary ranges of the two companies,
i.e., [36.12, 37.94].
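The two intervals above can be reproduced with a small helper (Z = 1.96 for a 95% confidence level):

```python
import math

def confidence_interval(x_bar, sigma, n, z=1.96):
    # CI = x_bar +/- z * sigma / sqrt(n)
    moe = z * sigma / math.sqrt(n)
    return (x_bar - moe, x_bar + moe)

print(confidence_interval(36, 7, 50))   # approx (34.06, 37.94) for company A
print(confidence_interval(40, 14, 50))  # approx (36.12, 43.88) for company B
```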



End of Section

2.2.5 - Hypothesis Testing

Hypothesis Testing
Hypothesis
An idea that is suggested as a possible explanation for a phenomenon, but has not yet been tested or proven to be true.
Why do we need Hypothesis Testing?

Hypothesis Testing is used to determine whether a claim or theory about a population is supported by a sample data,
by assessing whether observed difference or patterns are likely due to chance or represent a true effect.

  • It allows companies to test marketing campaigns or new strategies on a small scale before committing to larger investments.
  • Based on the results of hypothesis testing, we can make reliable inferences about the whole group based on a representative sample.
  • It helps us determine whether an observed result is statistically significant finding, or if it could have just happened by random chance.
Hypothesis Testing

It is a statistical inference framework used to make decisions about a population parameter, such as, the mean, variance, distribution, correlation, etc., based on a sample of data. It provides a formal method to evaluate competing claims.

Null Hypothesis (\(H_0\)):
Status quo or no-effect or no difference statement; almost always contains a statement of equality.

Alternative Hypothesis (\(H_1 ~or~ H_a\)):
The statement representing an effect, a difference, or a relationship.
It must be true if the null hypothesis is rejected.

Types of Hypothesis Testing
  1. Test of Means:
  • 1-Sample Mean Test: Compare sample mean to a known population mean.
  • 2-Sample Mean Test: Compare means of 2 populations.
  • Paired Mean Test: Compare means when data is paired, e.g., before vs. after test.
  2. Test of Median:
  • Mood’s Median Test
  • Sign Test
  • Wilcoxon Signed Rank Test (non-parametric)
  3. Test of Variance:
  • Chi-Square Test for a single variance
  • F-Test to compare variances of 2 populations
  4. Test of Distribution (Goodness of Fit):
  • Kolmogorov-Smirnov Test
  • Shapiro-Wilk Test
  • Anderson-Darling Test
  • Chi-Square Goodness of Fit Test
  5. Test of Correlation:
  • Pearson’s Correlation Coefficient Test
  • Spearman’s Rank Correlation Test
  • Kendall’s Tau Correlation Test
  • Chi-Square Test of Independence
  6. Regression Test:
  • T-Test: For regression coefficients
  • F-Test: For overall regression significance
Framework for Hypothesis Testing
We can structure any hypothesis test in 6 steps as follows:

Step 1: Define the null and alternative hypotheses.
Step 2: Select a relevant statistical test for the task with associated test statistic.
Step 3: Calculate the test statistic under null hypothesis.
Step 4: Select a significance level (\(\alpha\)), i.e, the maximum acceptable false positive rate;
typically - 5% or 1%.
Step 5: Compute the p-value from the observed value of test-statistic.
Step 6: Make a decision to either accept or reject the null hypothesis, based on the significance level (\(\alpha\)).
Perform a hypothesis test to compare the mean recovery time of 2 medicines.

Say, the data, D: <patient_id, med_1/med_2, recovery_time(in days)>
We need some metric to compare the recovery times of 2 medicines.
We can use the mean recovery time as the metric, because we know that we can use following techniques for comparison:

  1. 2-Sample T-Test; if \(n < 30\) and population standard deviation \(\sigma\) is unknown.
  2. 2-Sample Z-Test; if \(n \ge 30\) and population standard deviation \(\sigma\) is known.

Note: Let’s assume the sample size \(n < 30\), because medical tests usually have small sample sizes.
=> We will use the 2-Sample T-Test; we will continue using T-Test throughout the discussion.

Step 1: Define the null and alternative hypotheses.
Null Hypothesis \(H_0\): The mean recovery time of 2 medicines is the same i.e \(Mean_{m1} = Mean_{m2}\) or \(m_{m1} = m_{m2}\).
Alternate Hypothesis \(H_a\): \(m_{m1} < m_{m2}\) (1-Sided T-Test) or \(m_{m1} \neq m_{m2}\) (2-Sided T-Test).

Step 2: Select a relevant statistical test for the task with associated test statistic.
Let’s do a 2 sample T-Test, i.e, \(m_{m1} < m_{m2}\)

Step 3: Calculate the test statistic under null hypothesis.
Test Statistic:
For 2 sample T-Test:

\[t_{obs} = \frac{m_{m_1} - m_{m_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]


s: Standard Deviation
n: Sample Size

Note: If the 2 means are very close then \(t_{obs} \approx 0\).

Step 4: Suppose significance level (\(\alpha\)) = 5% or 0.05.

Step 5: Compute the p-value from the observed value of test-statistic.
P-Value:

\[p_{value} = \mathbb{P}(t \geq t_{obs} | H_0)\]


p-value = area under curve = probability of observing test statistic \( \ge t_{obs} \) if the null hypothesis is true.

images/maths/statistics/p_value_1.png

Step 6: Accept or reject the null hypothesis, based on the significance level (\(\alpha\)).
If \(p_{value} < \alpha\), we reject the null hypothesis and accept the alternative hypothesis and vice versa.
Note: In the above example \(p_{value} < \alpha\), so we reject the null hypothesis.

Left or Right Sided (Tailed) Test

Whether we do a left-sided, right-sided, or 2-sided test depends upon our alternate hypothesis and test statistic.

Let’s continue our 2 sample mean T-test to understand the concept:

\[t_{obs} = \frac{m_{m_1} - m_{m_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

Left Sided/Tailed Test:
\(H_a\): Mean recovery time of medicine 1 < medicine 2, i.e, \(m_{m_1} < m_{m_2}\)
=> \(m_{m_1} - m_{m_2} < 0\)

\[t_{obs} = \frac{m_{m_1} - m_{m_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

Since, the denominator in above equation is always positive.
=> \(t_{obs} < 0\)
Therefore, we need to do a left sided/tailed test.

images/maths/statistics/left_tailed.png

So, we want \(t_{obs}\) to be very negative to confidently conclude that alternate hypothesis is true.

Right Sided/Tailed Test:
\(H_a\): Mean recovery time of medicine 1 > medicine 2, i.e, \(m_{m_1} > m_{m_2}\)
=> \(m_{m_1} - m_{m_2} > 0\)
Similarly, here we need to do a right sided/tailed test.

2 Sided/Tailed Test:
\(H_a\): Mean recovery time of medicine 1 \(\neq\) medicine 2, i.e, \(m_{m_1} \neq m_{m_2}\)
=> \(m_{m_1} - m_{m_2} < 0\) or \(m_{m_1} - m_{m_2} > 0\)
If \(H_a\) is true then \(t_{obs}\) is a large -ve value or a large +ve value.

\[t_{obs} = \frac{m_{m_1} - m_{m_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

Since, t-distribution is symmetric, we can divide the significance level \(\alpha\) into 2 equal parts.
i.e \(\alpha = 2.5\%\) on each side.

images/maths/statistics/two_sided.png

So, we want \(t_{obs}\) to be very negative or very positive to confidently conclude that the alternate hypothesis is true. We accept \(H_a\) if \(t_{obs} < t^1_{\alpha/2}\) or \(t_{obs} > t^2_{\alpha/2}\).

Note: For critical applications ‘\(\alpha\)’ can be very small i.e. 0.1% or 0.01%, e.g medicine.

Significance Level (\(\alpha\))

It is the probability of wrongly rejecting a true null hypothesis, known as a Type I error or false +ve rate.

  • Tolerance level of wrongly accepting alternate hypothesis.
  • If the p-value < \(\alpha\), we reject the null hypothesis and conclude that the finding is statistically significant.
Critical Value

It is a specific point on the test-statistic distribution that defines the boundaries of the null hypothesis acceptance/rejection region.

  • It tells us at what value (\(t_{\alpha}\)) of the test statistic the area under the curve equals the significance level \(\alpha\).

  • For a right tailed/sided test:

    • if \(t_{obs} > t_{\alpha} => p_{value} < \alpha\); therefore, reject null hypothesis.
    • if \(t_{obs} < t_{\alpha} => p_{value} \ge \alpha\); therefore, failed to reject null hypothesis.
    images/maths/statistics/critical_value.png
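For a right-tailed test, both the critical value and the p-value can be read off SciPy's t-distribution. The numbers below are illustrative assumptions: \(t_{obs} = 2.68\) with 19 degrees of freedom and \(\alpha = 0.05\).

```python
from scipy import stats

alpha, dof = 0.05, 19  # illustrative values
t_obs = 2.68

# Critical value t_alpha for a right-tailed test: P(T > t_alpha) = alpha
t_alpha = stats.t.ppf(1 - alpha, dof)
# p-value: area under the t-distribution to the right of t_obs
p_value = stats.t.sf(t_obs, dof)

print(round(t_alpha, 3))  # 1.729
print(p_value < alpha)    # True => t_obs > t_alpha, so reject the null hypothesis
```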

Power of Test

It is the probability that a hypothesis test will correctly reject a false null hypothesis (\(H_{0}\)) when the alternative hypothesis (\(H_{a}\)) is true.

  • Power of test = \(1 - \beta\)

  • Probability of correctly accepting alternate hypothesis (\(H_{a}\))

  • \(\alpha\): Probability of wrongly accepting alternate hypothesis \(H_{a}\)

  • \(\beta\): Probability of wrongly rejecting alternate hypothesis \(H_{a}\)

    images/maths/statistics/power_of_test.png

Does having a large sample size make a hypothesis test more powerful?

Yes, having a large sample size makes a hypothesis test more powerful.

  • As n increases, sample mean \(\bar{x}\) approaches the population mean \(\mu\).
  • Also, as n increases, t-distribution approaches normal distribution.
P-Value
P-value only measures whether the observed change is statistically significant.
Effect Size

It is a standardized objective measure that complements p-value by clarifying whether a statistically significant finding has any real world relevance.
It quantifies the magnitude of relationship between two variables.

  • Larger effect size => more impactful effect.
  • Standardized (mean centered + variance scaled) measure allows us to compare the importance of an effect across various studies or groups, even with different sample sizes.

Effect size is measured using Cohen’s d formula:

\[ d = \frac{\bar{X_1} - \bar{X_2}}{s_p} \\[10pt] s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

\(\bar{X}\): Sample mean
\(s_p\): Pooled Standard deviation
\(n\): Sample size
\(s\): Standard deviation

Note: Theoretically, Cohen’s d value can range from negative infinity to positive infinity,
but for practical purposes, we use the following reference values:
small effect (\(d=0.2\)), medium effect (\(d=0.5\)), and large effect (\(d\ge 0.8\)).

  • More overlap => less effect i.e low Cohen’s d value.

    images/maths/statistics/effect_size.png
Example
  • A study on drug trials finds that patients taking a new drug had statistically significant
    improvement (p-value<0.05), compared to a placebo group.
  1. Small effect size: Cohen’s d = 0.1 => drug had minimal effect.
  2. Large effect size: Cohen’s d = 0.8 => drug produced substantial improvement.
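Cohen's d is easy to compute from summary statistics; the trial numbers below are made up purely for illustration.

```python
import math

def cohens_d(x1_bar, x2_bar, s1, s2, n1, n2):
    # Pooled standard deviation s_p
    s_p = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (x1_bar - x2_bar) / s_p

# Hypothetical drug trial: treated vs placebo group means and std. devs.
d = cohens_d(52.0, 48.0, 10.0, 10.0, 30, 30)
print(round(d, 2))  # 0.4 => a small-to-medium effect
```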

End of Section

2.2.6 - T-Test

Student’s T-Test


T-Test

It is a statistical test used to determine whether a sample mean is equal to a hypothesized value, or
whether there is a significant difference between the sample means of 2 groups.

  • It is a parametric test, since it assumes data to be approximately normally distributed.
  • Appropriate when:
    • sample size n < 30.
    • population standard deviation \(\sigma\) is unknown.
  • It is based on Student’s t-distribution.

Student's t-distribution

It is a continuous probability distribution that is a symmetrical, bell-shaped curve similar to the normal distribution but with heavier tails.

  • Shape of the curve or mass in tail is controlled by degrees of freedom.

    images/maths/statistics/t_distribution.png

There are 3 types of T-Test:

  1. 1-Sample T-Test: Test if sample mean differs from hypothesized value.
  2. 2-Sample T-Test: Test whether there is a significant difference between the means of two independent groups.
  3. Paired T-Test: Test whether 2 related samples differ, e.g., before and after.
Degrees of Freedom (\(\nu\))
It represents the number of independent pieces of information available in the sample to estimate the variability in the data.
Generally speaking, it represents the number of independent values that are free to vary in a dataset when estimating a parameter.
e.g.: If we have k observations and their sum = 50.
The sum of (k-1) terms can be anything, but the kth term is fixed at 50 - (sum of other (k-1) terms).
So, we have only (k-1) terms that can change independently, therefore, the DOF(\(\nu\)) = k-1.

1-Sample T-Test

It is used to test whether the sample mean is equal to a known/hypothesized value.
Test statistic (t):

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]

where,
\(\bar{x}\): sample mean
\(\mu\): hypothesized value
\(s\): sample standard deviation
\(n\): sample size
\(\nu = n-1 \): degrees of freedom

A developer claims that with the new algorithm, the average API response time is 100 ms.
The tester ran the test 20 times and found the average API response time to be 115 ms, with a standard deviation of 25 ms.
Is the developer’s claim valid?

Let’s verify the developer’s claim against the tester’s results using a 1-sample t-test.
Null hypothesis: \(H_0\) = The average API response time is 100 ms, i.e, \(\bar{x} = \mu\).
Alternative hypothesis: \(H_a\) = The average API response time > 100 ms, i.e, \(\bar{x} > \mu\) => right tailed test.
Hypothesized mean \(\mu\) = 100 ms
Sample mean \(\bar{x}\) = 115 ms
Sample standard deviation \(s\) = 25 ms
Sample size \(n\) = 20
Degrees of freedom \(\nu\) = 19

\( t_{obs} = \frac{\bar{x} - \mu}{s/\sqrt{n}}\) = \(\frac{115 - 100}{25/\sqrt{20}}\)
= \(\frac{15\sqrt{20}}{25} = \frac{3\sqrt{20}}{5} \approx 2.68\)

Let significance level \(\alpha\) = 5% =0.05.
Critical value \(t_{0.05}\) = 1.729
Important: Find the value of \(t_{\alpha}\) in T-table

images/maths/statistics/one_sample_t_test.png

Since \(t_{obs}\) > \(t_{0.05}\), we reject the null hypothesis.
And, accept the alternative hypothesis that the API response time is significantly > 100 ms.
Hence, the developer’s claim is NOT valid.
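With only summary statistics available, the test statistic and p-value can be computed directly from the t-distribution (SciPy's `ttest_1samp` expects raw data, so we work from the formula above):

```python
import math
from scipy import stats

mu0, x_bar, s, n = 100, 115, 25, 20  # values from the example above
dof = n - 1

t_obs = (x_bar - mu0) / (s / math.sqrt(n))
p_value = stats.t.sf(t_obs, dof)  # right-tailed test

print(round(t_obs, 2))  # 2.68
print(p_value < 0.05)   # True => reject H0
```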

2-Sample T-Test

It is used to determine whether there is a significant difference between the means of two independent groups.
There are 2 types of 2-sample t-test:

  1. Unequal Variance
  2. Equal Variance

Unequal Variance:
In this case, the variance of 2 independent groups is not equal.
Also called, Welch’s t-test.
Test statistic (t):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \\[10pt] \text{ Degrees of freedom (Welch-Satterthwaite): } \\[10pt] \nu = \frac{[\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}]^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}} \]

Equal Variance:
In this case, both samples come from equal or approximately equal variance.
Test statistic (t):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \\[10pt] \text{ Pooled variance } s_p: \\[10pt] s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}} \]

Here, degrees of freedom (for equal variance) \(\nu\) = \(n_1 + n_2 - 2\).

\(\bar{x}\): sample mean
\(s\): sample standard deviation
\(n\): sample size
\(\nu\): degrees of freedom

The AI team wants to validate whether the new ML model accuracy is better than the existing model’s accuracy.
Below is the data for the existing model and the new model.

|                           | New Model (A) | Existing Model (B) |
| ------------------------- | ------------- | ------------------ |
| Sample size (n)           | 24            | 18                 |
| Sample mean (\(\bar{x}\)) | 91%           | 88%                |
| Sample std. dev. (s)      | 4%            | 3%                 |

Given that the variances of the accuracy scores of the new and existing models are almost the same.

Now, let’s follow our hypothesis testing framework.
Null hypothesis: \(H_0\): The accuracy of new model is same as the accuracy of existing model.
Alternative hypothesis: \(H_a\): The new model’s accuracy is better/greater than the existing model’s accuracy => right tailed test

Let’s solve this using a 2-sample T-Test, since the sample sizes are < 30.
Since the variances of the 2 samples are almost equal, we can use the pooled variance method.

Next let’s compute the test statistic, under null hypothesis.

\[ s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}} \\[10pt] = \sqrt{\frac{(23)4^2 + (17)3^2}{24+18-2}} \\[10pt] = \sqrt{\frac{23*16 + 17*9}{40}} = \sqrt{\frac{521}{40}} \\[10pt] => s_p \approx 3.6 \\[10pt] t_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \\[10pt] = \frac{91-88}{3.6\sqrt{\frac{1}{24} + \frac{1}{18}}} \\[10pt] = \frac{3}{3.6*0.31} \\[10pt] => t_{obs} \approx 2.68 \\[10pt] \]

DOF \(\nu\) = \(24+18-2\) = 42 - 2 = 40
Let significance level \(\alpha\) = 5% =0.05.
Critical value \(t_{0.05}\) = 1.684
Important: Find the value of \(t_{\alpha}\) in T-table

images/maths/statistics/two_sample_t_test.png

Since \(t_{obs}\) > \(t_{0.05}\), we reject the null hypothesis.
And, accept the alternative hypothesis that the new model has better accuracy than the existing model.
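The same test runs directly from the summary statistics via SciPy (the `alternative` keyword assumes SciPy >= 1.6; without the intermediate rounding of \(s_p\), the statistic comes out ≈ 2.67 rather than 2.68):

```python
from scipy import stats

# Summary statistics from the table above; equal_var=True => pooled variance
res = stats.ttest_ind_from_stats(
    mean1=91, std1=4, nobs1=24,
    mean2=88, std2=3, nobs2=18,
    equal_var=True,
    alternative='greater',  # right-tailed: H_a is mean1 > mean2
)
print(round(res.statistic, 2))  # 2.67 (hand calculation above rounds s_p to 3.6)
print(res.pvalue < 0.05)        # True => reject H0
```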



End of Section

2.2.7 - Z-Test

Z-Test


Z-Test

It is a statistical test used to determine whether there is a significant difference between the means of 2 groups, or between a sample mean and the population mean.

  • It is a parametric test, since it assumes data to be normally distributed.
  • Appropriate when:
    • sample size \(n \ge 30\).
    • population standard deviation \(\sigma \) is known.
  • It is based on Gaussian/normal distribution.
  • It compares the difference between means relative to standard error, i.e, standard deviation of sampling distribution of sample mean.

There are 2 types of Z-Test:

  • 1-Sample Z-Test: Used to compare a sample mean with a population mean.
  • 2-Sample Z-Test: Used to compare the sample means of 2 independent samples.

Z-Score

It is a standardized score that measures how many standard deviations a particular data point is away from the population mean \(\mu\).

  • Transform a normal distribution \(\mathcal{N}(\mu, \sigma^2)\) to a standard normal distribution \(Z \sim \mathcal{N}(0, 1)\).
  • Standardized score helps us compare values from different normal distributions.

Z-score is calculated as:

\[Z = \frac{x - \mu}{\sigma}\]

x: data point
\(\mu\): population mean
\(\sigma\): population standard deviation

e.g:

  1. Z-score of 1.5 => data point is 1.5 standard deviations above the mean.
  2. Z-score of -2.0 => data point is 2.0 standard deviations below the mean.

Z-score helps to define probability areas:

  • 68% of the data points fall within \(\pm 1 \sigma\).
  • 95% of the data points fall within \(\pm 2 \sigma\).
  • 99.7% of the data points fall within \(\pm 3 \sigma\).

Note:

  • Z-Test applies the concept of Z-score to sample mean rather than a single data point.
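As a tiny illustration of standardization (the exam numbers below are made up), a z-score lets us compare values drawn from differently scaled normal distributions:

```python
def z_score(x, mu, sigma):
    """How many population standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical scores from two tests with different scales
z_math = z_score(85, mu=70, sigma=10)    # 1.5 SDs above the mean
z_english = z_score(60, mu=70, sigma=5)  # 2.0 SDs below the mean
```

Even though 85 > 60 in raw terms, the z-scores make the two results directly comparable.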
1-Sample Z-Test

It is used to test whether the sample mean \(\bar{x}\) is significantly different from a known population mean \(\mu\).

Test Statistic:

\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]

\(\bar{x}\): sample mean
\(\mu\): hypothesized population mean
\(\sigma\): population standard deviation
\(n\): sample size
\(\sigma / \sqrt{n}\): standard error of mean

Read more about Standard Error

Note: Test statistic Z follows a standard normal distribution \(Z \sim \mathcal{N}(0, 1)\).

2-Sample Z-Test

It is used to test whether the sample means \(\bar{x_1}\) and \(\bar{x_2}\) of 2 independent samples are significantly different from each other.
Test Statistic:

\[ Z = \frac{\bar{x_1} - \bar{x_2}}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

\(\bar{x_1}\): sample mean of first sample
\(\bar{x_2}\): sample mean of second sample
\(\sigma_1\): population standard deviation of first sample
\(\sigma_2\): population standard deviation of second sample
\(n_1\): sample size of first sample
\(n_2\): sample size of second sample

Note: Test statistic Z follows a standard normal distribution \(Z \sim \mathcal{N}(0, 1)\).


Average time to run a ML model is 120 seconds with a known standard deviation of 15 seconds.
After applying a new optimisation, and n=100 runs, yields a sample mean of 117 seconds.
Does the optimisation significantly reduce the runtime of the model?
Consider the significance level of 5%.

Null Hypothesis: \(\mu = 120\) seconds, i.e, no change.
Alternative Hypothesis: \(\mu < 120\) seconds => left tailed test.

We will use 1-sample Z-Test to test the hypothesis.
Test Statistic:

\[ t_{obs} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \\[10pt] = \frac{117 - 120}{\frac{15}{\sqrt{100}}} \\[10pt] = \frac{-3}{\frac{15}{10}} = \frac{-30}{15}\\[10pt] => t_{obs} = -2 \]

Since significance level \(\alpha\) = 5% = 0.05.
Critical value \(Z_{0.05}\) = -1.645
Important: Find the value of \(Z_{\alpha}\) in Z-Score Table

images/maths/statistics/one_sample_z_test.png

Our \(t_{obs}\) is much more extreme than the critical value \(Z_{0.05}\), => p-value < 5%.
Hence, we reject the null hypothesis.
Therefore, there is a statistically significant evidence that the new optimisation reduces the runtime of the model.
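The one-sample Z-test above can be sketched in plain Python (function names are mine; the normal CDF is computed via the error function so no external library is needed):

```python
import math

def one_sample_z(x_bar, mu0, sigma, n):
    """Z statistic: sample mean vs hypothesized mean, with known population sigma."""
    se = sigma / math.sqrt(n)   # standard error of the mean
    return (x_bar - mu0) / se

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Worked example: mu0 = 120 s, sigma = 15 s, n = 100 runs, sample mean 117 s
z_obs = one_sample_z(117, 120, 15, 100)   # -2.0
p_value = normal_cdf(z_obs)               # left-tailed p-value, ~0.023
reject = p_value < 0.05                   # True => reject H0
```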

Z Test of Proportion

It is a statistical hypothesis test used to determine if there is a significant difference between the proportion of a characteristic in two independent samples, or to compare a sample proportion to a known population value.

  • It is used when dealing with categorical data, such as, success/failure, male/female, yes/no etc.

It is of 2 types:

  1. 1-Sample Z-Test of Proportion: Used to test whether the observed proportion in a sample differs from hypothesized proportion.
  2. 2-Sample Z-Test of Proportion: Used to compare whether the 2 independent samples differ in their proportions.

The categorical data (e.g., success/failure) is discrete and can be modeled as a Bernoulli distribution.
Let’s understand how this Bernoulli random variable can be approximated by a Gaussian distribution for a very large sample size, using the Central Limit Theorem.
Read more about Central Limit Theorem

Note: We will not give a complete proof, but we will cover the concept in enough depth for clarity.

Sampling Distribution of a Proportion

\(Y \sim Bernoulli(p)\)
\(X \sim Binomial(n,p)\)
E[X] = mean = np
Var[X] = variance = np(1-p)
X = total number of successes
p = true probability of success
n = number of trials
Proportion of Success in sample = Sample Proportion = \(\hat{p} = \frac{X}{n}\)
e.g.: If n=100 people were surveyed, and 40 said yes, then \(\hat{p} = \frac{40}{100} = 0.4\)

\[ E[\hat{p}] = \frac{1}{n} E[X] = \frac{np}{n} = p \\[10pt] Var[\hat{p}] = Var[\frac{X}{n}] = \frac{Var[X]}{n^2} = \frac{np(1-p)}{n^2} =\frac{p(1-p)}{n} \\[10pt] \]

By the Central Limit Theorem, for very large n the Binomial distribution can be approximated by a Gaussian/Normal distribution with the same mean and variance:

\[ X \sim Binomial(n,p) \xrightarrow{n \rightarrow \infty} X \approx N(np, np(1-p)) \\[10pt] \]

Since, \(\hat{p} = \frac{X}{n}\)
We can say that:

\[ \hat{p} \xrightarrow{n \rightarrow \infty} \approx N[p, \frac{p(1-p)}{n}] \\[10pt] \]

Mean = \(\mu_{\hat{p}} = p\) = True proportion of success in the entire population
Standard Error = \(SE_{\hat{p}} = \sqrt{Var[\frac{X}{n}]} = \sqrt{\frac{p(1-p)}{n}}\) = Standard Deviation of the sample proportion

Note: Large Sample Condition - Approximation is only valid when the expected number of successes and failures are both > 10 (sometimes 5).
\(np \ge 10 ~and~ n(1-p) \ge 10\)

1-Sample Z-Test of Proportion

It is used to test whether the observed proportion in a sample differs from hypothesized proportion.
\(\hat{p} = \frac{X}{n}\): Proportion of success observed in a sample
\(p_0\): Specific proportion value under the null hypothesis
\(SE_0\): Standard error of sample proportion under the null hypothesis
Z-Statistic: Measures how many standard errors is the observed sample proportion \(\hat{p}\) away from \(p_0\)
Test Statistic:

\[ Z = \frac{\hat{p} - p_0}{SE_0} = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]

2-Sample Z-Test of Proportion

It is used to compare whether the 2 independent samples differ in their proportions.

  • Standard test used in A/B testing.
\[ \bar{p} = \frac{total ~ successes}{total ~ sample ~ size} = \frac{x_1+x_2}{n_1+n_2} = \frac{n_1\hat{p_1}+n_2\hat{p_2}}{n_1+n_2} \\[10pt] Standard ~ Error_{\hat{p_1}-\hat{p_2}} = \sqrt{\bar{p}(1-\bar{p})(\frac{1}{n_1} +\frac{1}{n_2})} \\[10pt] Z = \frac{\hat{p_1}-\hat{p_2}}{SE_{\hat{p_1}-\hat{p_2}}} \\[10pt] => Z = \frac{\hat{p_1}-\hat{p_2}}{\sqrt{\bar{p}(1-\bar{p})(\frac{1}{n_1} +\frac{1}{n_2})}} \]

A company wants to compare its 2 different website designs A & B.
Below is the table that shows the data:

| Design | # of visitors (n) | # of signups (x) | conversion rate (\(\hat{p} = \frac{x}{n}\)) |
|---|---|---|---|
| A | 1000 | 80 | 0.08 |
| B | 1200 | 114 | 0.095 |

Is the design B better, i.e, design B increases conversion rate or proportion of visitors who sign up?
Consider the significance level of 5%.

Null Hypothesis: \(p_A = p_B\), i.e., no difference in the conversion rates of the 2 designs A & B.
Alternative Hypothesis: \(p_B > p_A\), i.e., conversion rate of B > A => right-tailed test.

Check large sample condition for both samples A & B.
\(n_A\hat{p_A} = 80 > 10 ~and~ n_A(1-\hat{p_A}) = 920 > 10\)
Similarly, we can show for B too.

Pooled proportion:

\[ \bar{p} = \frac{x_A+x_B}{n_A+n_B} = \frac{80+114}{1000+1200} = \frac{194}{2200} \\[10pt] => \bar{p}\approx 0.0882 \]

Standard Error(Pooled):

\[ SE=\sqrt{\bar{p}(1-\bar{p})(\frac{1}{n_1} +\frac{1}{n_2})} \\[10pt] = \sqrt{0.0882(1-0.0882)(\frac{1}{1000} +\frac{1}{1200})} \\[10pt] => SE \approx 0.0123 \]

Test Statistic(Z):

\[ t_{obs} = \frac{\hat{p_B}-\hat{p_A}}{SE_{\hat{p_B}-\hat{p_A}}} \\[10pt] = \frac{0.095-0.08}{0.0123} \\[10pt] => t_{obs} \approx 1.22 \]

Significance level \(\alpha\) = 5% = 0.05.
Critical value \(Z_{0.05}\) = 1.645

images/maths/statistics/two_sample_z_test_proportion.png

Since, \(t_{obs} < Z_{0.05}\) => p-value > 5%.
Hence, we fail to reject the null hypothesis.
Therefore, the observed higher conversion rate of design B could be due to random chance; there is no statistically significant evidence that B is a better design.
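The pooled two-proportion A/B-test calculation above can be sketched in plain Python (the function name is mine):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test of proportions (standard A/B-test statistic)."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Worked example: design A (80/1000) vs design B (114/1200), right-tailed test
z_obs = two_proportion_z(80, 1000, 114, 1200)
reject = z_obs > 1.645   # critical value Z_{0.05} for a right-tailed test
```

With full precision \(z_{obs} \approx 1.24\) (the hand calculation rounds the SE to 0.0123, giving 1.22); either way it is below 1.645, so we fail to reject \(H_0\).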



End of Section

2.2.8 - Chi-Square Test

Chi-Square Test
Chi-Square Distribution (\(\chi^2\))

A random variable Q is said to follow a chi-square distribution with ’n’ degrees of freedom,i.e \(\chi^2(n)\),
if it is the sum of squares of ’n’ independent random variables that follow a standard normal distribution, i.e, \(N(0,1)\).

\[ Q = \chi^2(n) = \sum_{i=1}^n Z_i^2 \\ \text{ where: } Z_i \sim N(0,1) \\ \text{ n: degrees of freedom } \]
images/maths/statistics/chi_square_distribution.png

Key Properties:

  1. Non-negative, since sum of squares.
  2. Asymmetric, right skewed.
  3. Shape depends on the degrees of freedom; as \(\nu\) increases, the distribution becomes more symmetric and approaches a normal distribution.
Degrees of Freedom (\(\nu\))
It represents the number of independent pieces of information available in the sample to estimate the variability in the data.
Generally speaking, it represents the number of independent values that are free to vary in a dataset when estimating a parameter.
e.g.: If we have k observations and their sum = 50.
The sum of (k-1) terms can be anything, but the kth term is fixed at 50 - (sum of other (k-1) terms).
So, we have only (k-1) terms that can change independently, therefore, the DOF(\(\nu\)) = k-1.
Central Limit Theorem
Central Limit Theorem states that the sampling distribution of sample means approaches a normal distribution as the sample size increases, regardless of the distribution of the population.
More broadly, we can also say that sum/count of independent random variables approaches a normal distribution as the sample size increases.
Since, sample mean \(\bar{x} = \frac{sum}{n} \).

Read more about Central Limit Theorem
Sampling Distribution of Counts

Note: We are dealing with categorical data, where there is a count associated with each category.
In the context of categorical data, the counts \(O_i\) are governed by multinomial distribution
(a generalisation of binomial distribution).
Multinomial distribution is defined for multiple classes or categories, ‘k’, and multiple trials ’n’.
For \(i^{th}\) category:
Probability of \(i^{th}\) category = \(p_i\)
Mean = Expected count/frequency = \(E_i = np_i \)
Variance = \(Var_i = np_i(1-p_i) \)

By Central Limit Theorem, for very large n, i.e, as \(n \rightarrow \infty\), the multinomial distribution can be approximated as a normal distribution.
The multinomial distribution of count/frequency can be approximated as :
\(O_i \approx N(np_i, np_i(1-p_i))\)

Standardized count (mean centered and variance scaled):

\[ Z_i = \frac{O_i - E_i}{\sqrt{Var_i}} \\[10pt] => Z_i = \frac{O_i - np_i}{\sqrt{np_i(1-p_i)}} \xrightarrow{distribution} N(0,1), \text{ as } n \rightarrow \infty \\[10pt] \xrightarrow{distribution} \text { : means converges in distribution } \]

Under Null Hypothesis:
In Pearson’s proof of the chi-square test, the statistic is divided by the expected value (\(E_{i}\)) instead of the variance (\(Var_{i}\)), because for count data that can be modeled using a Poisson distribution (or a multinomial distribution where cell counts are approximately Poisson for large samples), the variance is equal to the expected value (mean).

Therefore, \(Z_i \approx (O_{i}-E_{i})/\sqrt{E_{i}}\)
Note that the denominator is \(\sqrt{E_{i}}\) NOT \(\sqrt{Var_{i}}\).

\(O_{i}\): Observed count for \(i^{th}\) category
\(E_{i}\): Expected count for \(i^{th}\) category

Important: \(E_{i}\): Expected count should be large i.e >= 5 (typically) for a good enough approximation.

Chi-Square (\(\chi^2\)) Test Statistic

It is formed by squaring the approximately standard normal counts above, and summing them up.
For \(k\) categories, the test statistic is:

\[ \chi_{calc}^2 = \sum_i Z_i^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \]

Note: For very large ’n’, the Pearson’s chi-square (\(\chi^2\)) test statistic follows a chi-square (\(\chi^2\)) distribution.


Name of Test
Note: All the hypothesis tests get their name from the underlying distribution of the test statistic.
Chi-Square (\(\chi^2\)) Test
It is used to analyze categorical data to determine whether there is a significant difference between observed and expected counts.
It is a non-parametric test for categorical data, i.e, does NOT make any assumption about the underlying distribution of the data, such as, normally distributed with known mean and variance; only uses observed and expected count/frequencies.
Note: Requires a large sample size.
Test of Goodness of Fit

It is used to compare the observed frequency distribution of a single categorical variable to a hypothesized or expected probability distribution.
It can be used to determine whether a sample taken from a population follows a particular distribution, e.g., uniform, normal, etc.

Test Statistic:

\[ \chi_{calc}^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \]

\(O_{i}\): Observed count for \(i^{th}\) category
\(E_{i}\): Expected count for \(i^{th}\) category, under null hypothesis \(H_0\)
\(k\): Number of categories
\(\nu\): Degrees of freedom = k - 1 - m
\(m\): Number of parameters estimated from sample data to determine the expected probabilities
Note: Typically m = 0, since NO parameters are estimated.

Other Goodness of Fit Tests
  1. Kolmogorov-Smirnov (KS) Test: Compares empirical CDF with theoretical CDF of distribution.
  2. Anderson-Darling (AD) Test: Refinement of KS Test.
  3. Shapiro-Wilk (SW) Test: Specialised for normal distribution; good for small samples.
In a coin toss experiment, we tossed a coin 100 times, and got 62 heads and 38 tails.
Find whether it is a fair coin (discrete uniform distribution test)?
Significance level = 5%

We need to find whether the coin is fair i.e we need to do a goodness of fit test for discrete uniform distribution.

Null Hypothesis \(H_0\): Coin is fair.
Alternative Hypothesis \(H_a\): Coin is NOT fair, i.e., biased.

\(O_{H}\): Observed count of heads = 62
\(O_{T}\): Observed count of tails = 38
\(E_{i}\): Expected count for \(i^{th}\) category under null hypothesis \(H_0\) = 50, i.e., fair coin
\(k\): Number of categories = 2
\(\nu\): Degrees of freedom = k - 1- m = 2 - 1 - 0 = 1
Test Statistic:

\[ t_{obs} = \chi_{calc}^2 = \sum_{i=1}^2 \frac{(O_i - E_i)^2}{E_i} \\[10pt] = \frac{(62 - 50)^2}{50} + \frac{(38 - 50)^2}{50} \\[10pt] = \frac{144}{50} + \frac{144}{50} \\[10pt] => t_{obs} = 5.76 \]

Since, significance level = 5% = 0.05
Critical value = \(\chi^2(0.05,1)\) = 3.84

images/maths/statistics/chi_square_gof.png

Since, \(t_{obs}\) = 5.76 > 3.84 (critical value), we reject the null hypothesis \(H_0\).
Therefore, the coin is NOT fair; given the observed counts, it is biased towards heads.
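The goodness-of-fit statistic for the coin-toss example can be sketched in plain Python (the function name is mine; the critical value is quoted from the chi-square table):

```python
def chi_square_gof(observed, expected):
    """Pearson's chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Coin-toss example: 62 heads and 38 tails vs the fair-coin expectation of 50/50
chi2_obs = chi_square_gof([62, 38], [50, 50])   # 5.76
dof = 2 - 1 - 0                                  # k - 1 - m
reject = chi2_obs > 3.84                         # critical value chi^2(0.05, 1)
```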

Test of Independence

It is used to determine whether an association exists between two categorical variables, using a contingency(dependency) table.
It is a non-parametric test, i.e, does NOT make any assumption about the underlying distribution of the data.

Test Statistic:

\[ \chi_{calc}^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

\(O_{ij}\): Observed count for \(cell_{i,j}\)
\(E_{ij}\): Expected count for \(cell_{i,j}\), under null hypothesis \(H_0\)
\(R\): Number of rows
\(C\): Number of columns
\(\nu\): Degrees of freedom = (R-1)*(C-1)

Let’s understand the above test statistic in more detail.
We know that, if 2 random variables A & B are independent, then,
\(P(A \cap B) = P(A, B) = P(A)*P(B)\)
i.e Joint Probability = Product of marginal probabilities.

Null Hypothesis \(H_0\): \(A\) and \(B\) are independent.
Alternative Hypothesis \(H_a\): \(A\) and \(B\) are dependent or associated.
N = Sample size
\(P(A_i) \approx \frac{Row ~~ Total_i}{N}\)

\(P(B_j) \approx \frac{Col ~~ Total_j}{N}\)

\(E_{ij}\) : Expected count for \(cell_{i,j}\) = \( N*P(A_i)*P(B_j)\)

=> \(E_{ij}\) = \(N*\frac{Row ~~ Total_i}{N}*\frac{Col ~~ Total_j}{N}\)

=> \(E_{ij}\) = \(\frac{Row ~~ Total_i * Col ~~ Total_j}{N}\)

\(O_{ij}\): Observed count for \(cell_{i,j}\)

A survey of 100 students was conducted to understand whether there is any relation between gender and beverage preference.
Below is the table that shows the number of students who prefer each beverage.

| Gender | Tea | Coffee | Total |
|---|---|---|---|
| Male | 20 | 30 | 50 |
| Female | 10 | 40 | 50 |
| Total | 30 | 70 | 100 |

Significance level = 5%

Null Hypothesis \(H_0\): Gender and beverage preference are independent.
Alternative Hypothesis \(H_a\): Gender and beverage preference are dependent.

We know that Expected count for cell(i,j) = \(E_{ij}\) = \(\frac{Row ~~ Total_i * Col ~~ Total_j}{N}\)

\(E_{11} = \frac{50*30}{100} = 15\)

\(E_{12} = \frac{50*70}{100} = 35\)

\(E_{21} = \frac{50*30}{100} = 15\)

\(E_{22} = \frac{50*70}{100} = 35\)

Test Statistic:

\[ t_{obs} = \chi_{calc}^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\[10pt] = \frac{(20 - 15)^2}{15} + \frac{(30 - 35)^2}{35} + \frac{(10 - 15)^2}{15} + \frac{(40 - 35)^2}{35} \\[10pt] = \frac{25}{15} + \frac{25}{35} + \frac{25}{15} + \frac{25}{35} \\[10pt] => t_{obs} = \frac{50}{15} + \frac{50}{35} \approx 4.76 \]

Degrees of freedom = (R-1)(C-1) = (2-1)(2-1) = 1
Since, significance level = 5% = 0.05
Critical value = \(\chi^2(0.05,1)\) = 3.84

images/maths/statistics/chi_square_independence.png

Since, \(t_{obs}\) = 4.76 > 3.84 (critical value), we reject the null hypothesis \(H_0\).
Therefore, the gender and beverage preference are dependent.
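The test-of-independence computation above can be sketched in plain Python (the function name is mine), building the expected counts from row and column totals exactly as derived earlier:

```python
def chi_square_independence(table):
    """Chi-square statistic and DOF for a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            chi2 += (o - e) ** 2 / e
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Gender vs beverage example: rows = [Male, Female], columns = [Tea, Coffee]
chi2_obs, dof = chi_square_independence([[20, 30], [10, 40]])
reject = chi2_obs > 3.84   # critical value chi^2(0.05, 1)
```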



End of Section

2.2.9 - Performance Metrics

Performance Metrics


Performance Metrics
They are quantitative measures used to evaluate how well a machine learning model performs on unseen data.
E.g.: For regression models, we have - MSE, RMSE, MAE, \(R^2\), etc.
Here, we will discuss various performance metrics for classification models.

Confusion Matrix

It is a table that summarizes a model’s predictions against the actual class labels, detailing where the model succeeded and where it failed.
It is used for binary or multi-class classification problems.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Type-1 Error:
It is the number of false positives.
e.g.: Model predicted that a patient has diabetes, but the patient actually does NOT have diabetes; “false alarm”.

Type-2 Error:
It is the number of false negatives. e.g.: Model predicted that a patient does NOT have diabetes, but the patient actually has diabetes; “a miss”.

Important Metrics

Many metrics are derived from the confusion matrix.

Precision:
It answers the question: “Of all the instances that the model predicted as positive, how many were actually positive?”
It measures exactness or quality of the positive predictions.

\[ Precision = \frac{TP}{TP + FP} \]


Recall:
It answers the question: “Of all the actual positive instances, how many did the model correctly identify?”
It measures completeness or coverage of the positive predictions.

\[ Recall = \frac{TP}{TP + FN} \]


F1 Score:
It is the harmonic mean of precision and recall.
It is used when we need a balance between precision and recall; also helpful when we have imbalanced data.
Harmonic mean penalizes extreme values more heavily, encouraging both metrics to be high.

\[ F1 ~ Score = 2 * \frac{Precision \times Recall}{Precision + Recall} \]


| Precision | Recall | F1 Score | Mean |
|---|---|---|---|
| 0.5 | 0.5 | 0.50 | 0.5 |
| 0.7 | 0.3 | 0.42 | 0.5 |
| 0.9 | 0.1 | 0.18 | 0.5 |

Trade-Off:
Precision Focus: Critical when the cost of false positives is high.
e.g: Identify a potential terrorist.
A false positive, i.e, wrongly flagging an innocent person as a potential terrorist is very harmful.

Recall Focus: Critical when the cost of false negatives is high.
e.g.: Medical diagnosis of a serious disease.
A false negative, i.e, falsely missing a serious disease can cost someone’s life.

Analyze the performance of an access control system. Below is the data for 1000 access attempts.

| | Predicted Authorised Access | Predicted Unauthorised Access |
|---|---|---|
| Actual Authorised Access | 90 (TP) | 10 (FN) |
| Actual Unauthorised Access | 1 (FP) | 899 (TN) |

\[ Precision = \frac{TP}{TP + FP} = \frac{90}{90 + 1} \approx 0.989 \]

When the system allows access, it is correct 98.9% of the time.

\[ Recall = \frac{TP}{TP + FN} = \frac{90}{90 + 10} = 0.9 \]

The system caught 90% of all authorized accesses.

\[ F1 ~ Score = 2 * \frac{Precision \times Recall}{Precision + Recall} \\[10pt] = 2 * \frac{0.989 \times 0.9}{0.989 + 0.9} \\[10pt] => F1 ~ Score \approx 0.942 \]
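The three metrics for the access-control example can be sketched in plain Python (the function name is mine):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Access-control example: TP = 90, FP = 1, FN = 10
p, r, f1 = precision_recall_f1(90, 1, 10)   # ~0.989, 0.9, ~0.942
```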
Receiver Operating Characteristic (ROC) Curve

It is a graphical plot that shows the discriminating ability of a binary classifier system, as its discrimination threshold is varied.
Y-axis: True Positive Rate (TPR), Recall, Sensitivity
\(TPR = \frac{TP}{TP + FN}\)

X-axis: False Positive Rate (FPR); (1 - Specificity)
\(FPR = \frac{FP}{FP + TN}\)

Note: A binary classifier model outputs a probability score between 0 and 1,
and a threshold (default = 0.5) is applied to the probability score to get the final class label.

\(p \ge 0.5\) => Positive Class
\(p < 0.5\) => Negative Class

Algorithm:

  1. Sort the data by the probability score in descending order.
  2. Set each probability score as the threshold for classification and calculate the TPR and FPR for each threshold.
  3. Plot each pair of (TPR, FPR) for all ’n’ data points to get the final ROC curve.

e.g.:

| Patient_Id | True Label \(y_i\) | Predicted Probability Score \(\hat{y_i}\) |
|---|---|---|
| 1 | 1 | 0.95 |
| 2 | 0 | 0.85 |
| 3 | 1 | 0.72 |
| 4 | 1 | 0.63 |
| 5 | 0 | 0.59 |
| 6 | 1 | 0.45 |
| 7 | 1 | 0.37 |
| 8 | 0 | 0.20 |
| 9 | 0 | 0.12 |
| 10 | 0 | 0.05 |

Set the threshold \(\tau_1\) = 0.95, calculate \({TPR}_1, {FPR}_1\)
Set the threshold \(\tau_2\) = 0.85, calculate \({TPR}_2, {FPR}_2\)
Set the threshold \(\tau_3\) = 0.72, calculate \({TPR}_3, {FPR}_3\)


Set the threshold \(\tau_n\) = 0.05, calculate \({TPR}_n, {FPR}_n\)

Now, we have ’n’ pairs of (TPR, FPR) for all ’n’ data points.
Plot the points on a graph to get the final ROC curve.

images/maths/statistics/roc.png

AU ROC = AUC = Area under the ROC curve

Note:

  1. If AUC < 0.5, then invert the labels of the classes.
  2. ROC does NOT perform well on imbalanced data.
    • Either balance the data or
    • Use Precision-Recall curve.
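The threshold-sweep algorithm above can be sketched in plain Python (function names are mine), using the patient table from the example:

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs, using each distinct score as the classification threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for tau in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= tau and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= tau and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve, starting from (0, 0)."""
    pts = [(0.0, 0.0)] + sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Patient table from the example above
y_true  = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]
y_score = [0.95, 0.85, 0.72, 0.63, 0.59, 0.45, 0.37, 0.20, 0.12, 0.05]
points = roc_points(y_true, y_score)
area = auc(points)   # 0.76 for this data
```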
What is the AUC of a random binary classifier model?
The AUC of a random binary classifier model is 0.5.
A random classifier assigns scores independently of the true labels, so at every threshold TPR = FPR; the ROC curve traces the diagonal line, whose area is 0.5.

Why ROC can be misleading for imbalanced data ?

Let’s understand this with the below fraud detection example.
Below is a dataset from a fraud detection system for N = 10,000 transactions.
Fraud = 100, NOT fraud = 9900

| | Predicted Fraud | Predicted NOT Fraud |
|---|---|---|
| Actual Fraud | 80 (TP) | 20 (FN) |
| Actual NOT Fraud | 220 (FP) | 9680 (TN) |

\[TPR = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = 0.8\]

\[FPR = \frac{FP}{FP + TN} = \frac{220}{220 + 9680} \approx 0.022\]

If we check the location of above (TPR, FPR) pair on the ROC curve, then we can see that it is very close to the top-left corner.
This means that the model is very good at detecting fraudulent transactions, but that is NOT the case.
This is happening because of the imbalanced data, i.e., the count of NOT fraud transactions is 99 times the count of fraudulent ones.

Let’s look at the Precision value:

\[Precision = \frac{TP}{TP + FP} = \frac{80}{80 + 220} = \frac{80}{300}\approx 0.267\]


We can see that the model has poor precision, i.e., only 26.7% of flagged transactions are actual frauds.
Unacceptable precision for a good fraud detection system.

Precision-Recall Curve

It is used to evaluate the performance of a binary classifier model across various thresholds.
It is similar to the ROC curve, but it uses Precision instead of TPR on the Y-axis.
Plots Precision (Y-axis) against Recall (X-axis) for different classification thresholds.
Note: It is useful when the data is imbalanced.

\[ Precision = \frac{TP}{TP + FP} \\[10pt] Recall = \frac{TP}{TP + FN} \]
images/maths/statistics/prc.png

AU PRC = PR AUC = Area under Precision-Recall curve

Let’s revisit the fraud detection example discussed above to understand the utility of PR curve.

| | Predicted Fraud | Predicted NOT Fraud |
|---|---|---|
| Actual Fraud | 80 (TP) | 20 (FN) |
| Actual NOT Fraud | 220 (FP) | 9680 (TN) |

\[Precision = \frac{TP}{TP + FP} = \frac{80}{80 + 220} = \frac{80}{300}\approx 0.267\]


\[Recall = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100}\approx 0.8\]


If we check the location of above (Precision, Recall) point on PRC curve, we will find that it is located near the bottom right corner, i.e, the model performance is poor.
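The PR curve uses the same threshold sweep as the ROC curve, swapping FPR for precision; a plain-Python sketch (function name is mine), plus the single fraud-detection operating point discussed above:

```python
def pr_points(y_true, y_score):
    """(Recall, Precision) pairs, using each distinct score as the threshold."""
    pos = sum(y_true)
    points = []
    for tau in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= tau and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= tau and y == 0)
        points.append((tp / pos, tp / (tp + fp)))
    return points

# Single operating point of the fraud-detection example: TP=80, FP=220, FN=20
precision = 80 / (80 + 220)   # ~0.267: poor, despite the good-looking ROC point
recall = 80 / (80 + 20)       # 0.8
```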



End of Section

2.3 - Linear Algebra

Linear Algebra for AI & ML

2.3.1 - Vector Fundamentals

Vector Fundamentals


Why study Vectors?

A vector is a fundamental concept used to describe real-world quantities that have both magnitude and direction, e.g., force, velocity, electromagnetic fields.
It is also used to describe the surrounding space, e.g., lines, planes, 3D space.
And, in machine learning, vectors are used to represent data, both the input features and the output of a model.
Vector

It is a collection of scalars(numbers) that has both magnitude and direction.
Geometrically, it is a line segment in space characterized by its length(magnitude) and direction.
By convention, we represent vectors as column vectors.
e.g.: \(\vec{x} = \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n×1}}\) i.e ’n’ rows and 1 column.

Important: In machine learning, we will use bold notation \(\mathbf{v}\) to represent vectors, instead of arrow notation \(\vec{v}\).

Transpose

Swap the rows and columns, i.e, a column vector becomes a row vector after transpose.
e.g: \(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{d×1}}\)

\(\mathbf{v}^\mathrm{T} = \begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix}_{\text{1×d}}\)

Length of Vector

The length (or magnitude or norm) of a vector \(\mathbf{v}\) is the distance from the origin to the point represented
by \(\mathbf{v}\) in n-dimensional space.

\(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{d×1}}\)

Length of vector:

\[\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{\mathbf{v}^\mathrm{T}\mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_d^2}\]

Note: The length of a zero vector is 0.

Direction of Vector

The direction of a vector tells us where the vector points in space, independent of its length.

Direction of vector:

\[\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|} \]
Vector Space

It is a collection of vectors that can be added together and scaled by numbers (scalars), such that, the results are still in the same space.
Vector space or linear space is a non-empty set of vectors equipped with 2 operations:

  1. Vector addition: for any 2 vectors \(a, b\), \(a + b\) is also in the same vector space.
  2. Scalar multiplication: for a vector \(\mathbf{v}\), \(\alpha\mathbf{v}\) is also in the same vector space; where \(\alpha\) is a scalar.
    Note: These operations must satisfy certain rules (called axioms), such as, associativity, commutativity, distributivity, existence of a zero vector, and additive inverses.

e.g.: Set of all points in 2D is a vector space.

Vector Operations

Addition:
We can only add vectors of the same dimension.

  • Commutative: \(a + b = b + a\)
  • Associative: \(a + (b + c) = (a + b) + c\)

e.g., let’s add 2 real d-dimensional vectors, \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^{d \times 1}\):
\(\mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_d \end{bmatrix}_{\text{d×1}}\), \(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{d×1}}\)

\(\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1+v_1 \\ u_2+v_2 \\ \vdots \\ u_d+v_d \end{bmatrix}_{\text{d×1}}\)

Multiplication:
1. Multiplication with Scalar:
All elements of the vector are multiplied with the scalar.

  • \(c(\mathbf{u} + \mathbf{v}) = c(\mathbf{u}) + c(\mathbf{v})\)
  • \((c+d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}\)
  • \((cd)\mathbf{v} = c(d\mathbf{v})\)

e.g: \(\alpha\mathbf{v} = \begin{bmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_d \end{bmatrix}_{\text{d×1}}\)

2. Inner (Dot) Product:
Inner(dot) product \(\mathbf{u} \cdot \mathbf{v}\) of 2 vectors gives a scalar output.
The two vectors must be of the same dimensions.

  • \(\mathbf{u} \cdot \mathbf{v} = u_1v_1 + u_2v_2 + \cdots + u_dv_d\)

Dot product:
\(\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^\mathrm{T} \mathbf{v}\) = \(\begin{bmatrix} u_1 & u_2 & \cdots & u_d \end{bmatrix}_{\text{1×d}} \cdot \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{d×1}} = u_1v_1 + u_2v_2 + \cdots + u_dv_d\)

Geometrically, \(\mathbf{u} \cdot \mathbf{v}\) = \(|u||v|cos\theta\)
where \(\theta\) is the angle between \(\mathbf{u}\) and \(\mathbf{v}\).

images/maths/linear_algebra/vector_dot_product.png

\(\mathbf{u} = \begin{bmatrix} 1 \\ \\ 2 \\ \end{bmatrix}_{\text{2×1}}\), \(\mathbf{v} = \begin{bmatrix} 3 \\ \\ 4 \\ \end{bmatrix}_{\text{2×1}}\)

\(\mathbf{u} \cdot \mathbf{v} = u_1v_1 + u_2v_2 = 1 \times 3 + 2 \times 4 = 11\)

3. Outer (Tensor) Product:
Outer (tensor) product \(\mathbf{u} \otimes \mathbf{v}\) of 2 vectors gives a matrix output.
The two vectors need NOT be of the same dimension; here both are taken as d-dimensional, giving a \(d \times d\) matrix.

Tensor product:
\(\mathbf{u} \otimes \mathbf{v} = \mathbf{u} \mathbf{v}^\mathrm{T} \) = \(\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_d \end{bmatrix}_{\text{d×1}} \otimes \begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix}_{\text{1×d}} = \begin{bmatrix} u_1v_1 & u_1v_2 & \cdots & u_1v_d \\ u_2v_1 & u_2v_2 & \cdots & u_2v_d \\ \vdots & \vdots & \ddots & \vdots \\ u_dv_1 & u_dv_2 & \cdots & u_dv_d \end{bmatrix} \in \mathbb{R}^{d \times d}\)

e.g:
\(\mathbf{u} = \begin{bmatrix} 1 \\ \\ 2 \\ \end{bmatrix}_{\text{2×1}}\), \(\mathbf{v} = \begin{bmatrix} 3 \\ \\ 4 \\ \end{bmatrix}_{\text{2×1}}\)



\(\mathbf{u} \otimes \mathbf{v} = \mathbf{u} \mathbf{v}^\mathrm{T} \) = \(\begin{bmatrix} 1 \\ \\ 2 \\ \end{bmatrix}_{\text{2×1}} \otimes \begin{bmatrix} 3 & 4 \end{bmatrix}_{\text{1×2}} = \begin{bmatrix} 1 \times 3 & 1 \times 4 \\ \\ 2 \times 3 & 2 \times 4 \\ \end{bmatrix} _{\text{2×2}} = \begin{bmatrix} 3 & 4 \\ \\ 6 & 8 \\ \end{bmatrix} _{\text{2×2}}\)



Note: We will NOT discuss about cross product \(\mathbf{u} \times \mathbf{v}\); product perpendicular to both vectors.
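The dot and outer products above can be sketched in plain Python (numpy's `dot` and `outer` would normally be used; these helper names are mine), reproducing the 2D examples:

```python
def dot(u, v):
    """Inner product: element-wise multiply and sum (scalar output)."""
    assert len(u) == len(v), "dot product requires same-dimension vectors"
    return sum(ui * vi for ui, vi in zip(u, v))

def outer(u, v):
    """Outer product u v^T: entry (i, j) is u[i] * v[j] (matrix output)."""
    return [[ui * vj for vj in v] for ui in u]

u, v = [1, 2], [3, 4]
dot_uv = dot(u, v)      # 1*3 + 2*4 = 11
outer_uv = outer(u, v)  # [[3, 4], [6, 8]]
```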

Linear Combination
A vector \(\mathbf{v}\) is a linear combination of vectors \(\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_k\) if:

\(\mathbf{v} = \alpha_1 \mathbf{u}_1 + \alpha_2 \mathbf{u}_2 + \cdots + \alpha_k \mathbf{u}_k\)
where \(\alpha_1, \alpha_2, \cdots, \alpha_k\) are scalars.
Linear Independence

A set of vectors are linearly independent if NO vector in the set can be expressed as a linear combination of the other vectors in the set.

The only solution of
\(\alpha_1 \mathbf{u}_1 + \alpha_2 \mathbf{u}_2 + \cdots + \alpha_k \mathbf{u}_k = 0\)
is \(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\).

e.g.:

  1. The below 3 vectors are linearly independent.
    \(\mathbf{u} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}_{\text{3×1}}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}_{\text{3×1}}\), \(\mathbf{w} = \begin{bmatrix} 1 \\ 3 \\ 6 \\ \end{bmatrix}_{\text{3×1}}\)

  2. The below 3 vectors are linearly dependent.
    \(\mathbf{u} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}_{\text{3×1}}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}_{\text{3×1}}\), \(\mathbf{w} = \begin{bmatrix} 2 \\ 4 \\ 6 \\ \end{bmatrix}_{\text{3×1}}\)

    because, \(\mathbf{w} = 2\mathbf{v}\), and we have a non-zero solution for the below equation:
    \(\alpha_1 \mathbf{u} + \alpha_2 \mathbf{v} + \alpha_3 \mathbf{w} = 0\);
    \(\alpha_1 = 0, \alpha_2 = -2, \alpha_3 = 1\) is a valid non-zero solution.

Note:

  1. A common method to check linear independence is to arrange the column vectors in a matrix form and calculate its determinant, if determinant ≠ 0, then the vectors are linearly independent.
  2. If the number of vectors > number of dimensions, then the vectors are linearly dependent,
    because in an n-dimensional space, the \((n+1)^{th}\) vector can always be expressed as a linear combination of the other \(n\) vectors.
  3. In machine learning, if a feature can be expressed in terms of other features, then it is linearly dependent,
    and the feature brings NO new information.
    e.g.: In 2 dimensions, 3 vectors \(\mathbf{x_1}, \mathbf{x_2}, \mathbf{x_3} \) are linearly dependent.
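The rank/determinant check for linear independence described in the note can be sketched with NumPy, using the two example vector sets from above:

```python
import numpy as np

u = np.array([1, 1, 1])
v = np.array([1, 2, 3])
w_indep = np.array([1, 3, 6])  # example 1: independent set
w_dep = np.array([2, 4, 6])    # example 2: w = 2v, dependent set

# Stack the vectors as columns; full rank (equivalently det != 0) means independence
M1 = np.column_stack([u, v, w_indep])
M2 = np.column_stack([u, v, w_dep])

rank1 = np.linalg.matrix_rank(M1)  # 3 -> linearly independent
rank2 = np.linalg.matrix_rank(M2)  # 2 -> linearly dependent
```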
Span

Span of a set of vectors is the geometric object formed by all possible linear combinations of those vectors, such as a line, plane, or higher dimensional volume.
e.g.:

  1. Span of a single vector (1,0) is the entire X-axis.
  2. Span of 2 vectors (1,0) and (0,1) is the entire X-Y (2D) plane, as any vector in 2D plane can be expressed as a linear combination of the 2 vectors - (1,0) and (0,1).
Basis

It is the minimal set of linearly independent vectors that spans or defines the entire vector space, providing a unique co-ordinate system for every vector within the space.
Every vector in the vector space can be represented as a unique linear combination of the basis vectors.
e.g.:

  1. X-axis(1,0) and Y-axis(0,1) are the basis vectors of 2D space or form the co-ordinate system.
  2. \(\mathbf{u} = (3,1)\) and \(\mathbf{v} = (-1, 2) \) are the basis of skewed or parallelogram co-ordinate system.

Note: Number of basis vectors = Number of dimensions of the space.

Orthogonal Vectors

Two vectors are orthogonal if their dot product is 0.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) are said to be orthogonal if:
\(\mathbf{x_i} \cdot \mathbf{x_j} = 0 ~ \forall ~ i \neq j\), for every pair, i.e, every pair must be orthogonal.
e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) are orthogonal vectors.
  2. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (1,1) \) are NOT orthogonal vectors.

Note:

  1. Orthogonal vectors are linearly independent, but the inverse may NOT be true,
    i.e, linear independence does NOT imply that vectors are orthogonal. e.g.:
    Vectors \(\mathbf{u} = (2,0)\) and \(\mathbf{v} = (1,3) \) are linearly independent but NOT orthogonal.
    Since, \(\mathbf{u} \cdot \mathbf{v} = 2*1 + 0*3 = 2 \neq 0\).

  2. Orthogonality is the most extreme case of linear independence, i.e, vectors are \(90^{\circ}\) apart or perpendicular.

Orthonormal Vectors

Orthonormal vectors are vectors that are orthogonal and have unit length.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) are said to be orthonormal if:
\(\mathbf{x_i} \cdot \mathbf{x_j} = 0 ~ \forall ~ i \neq j\) and \(\|\mathbf{x_i}\| = 1\), i.e, each is a unit vector.

e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) are orthonormal vectors.
  2. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,2) \) are NOT orthonormal vectors.
Orthonormal Basis

It is a set of vectors that functions as a basis for a space while also being orthonormal,
meaning each vector is a unit vector (has a length of 1) and all vectors are mutually perpendicular (orthogonal) to each other.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) are said to be orthonormal basis of a vector space \(\mathbb{R}^{n \times 1}\), if every vector:

\[ \mathbf{y} = \sum_{k=1}^n \alpha_k \mathbf{x_k}, \quad \forall ~ \mathbf{y} \in \mathbb{R}^{n \times 1} \\ \text {and } \quad \mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \text { are orthonormal vectors} \]

e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) form an orthonormal basis for 2-D space.
  2. \(\mathbf{u} = (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) and \(\mathbf{v} = (-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}) \) also form an orthonormal basis for 2-D space.

Note: An orthonormal basis of an n-dimensional space contains exactly \(n\) vectors; infinitely many such bases exist, since rotating one orthonormal basis gives another.
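With an orthonormal basis, the coefficient \(\alpha_k\) of a vector \(\mathbf{y}\) along each basis vector is simply the dot product \(\mathbf{y} \cdot \mathbf{x_k}\). A small NumPy sketch, using the rotated basis from the example above (the vector `y` is an arbitrary illustration):

```python
import numpy as np

# Rotated orthonormal basis of 2-D space (second example above)
x1 = np.array([1, 1]) / np.sqrt(2)
x2 = np.array([-1, 1]) / np.sqrt(2)

y = np.array([3.0, 5.0])  # arbitrary vector to represent in the new basis

# For an orthonormal basis, the coordinates are just dot products
a1, a2 = y @ x1, y @ x2

# y is recovered exactly as a linear combination of the basis vectors
y_rebuilt = a1 * x1 + a2 * x2
```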



End of Section

2.3.2 - Matrix Operations

Matrix Operations


Why study Matrix?

Matrices let us store, manipulate, and transform data efficiently.
e.g:

  1. Represent a system of linear equations AX=B.
  2. Data representation, such as images, that are stored as a matrix of pixels.
  3. When a vector is multiplied by a matrix, the matrix linearly transforms the vector, i.e, changes its direction and/or magnitude, making matrices useful in image rotation, scaling, etc.
Matrix
It is a two-dimensional array of numbers with a fixed number of rows(m) and columns(n).
e.g:
\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \)
Transpose
Swapping rows and columns of a matrix.
\( \mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix} _{\text{n x m}} \)

Important: \( (AB)^T = B^TA^T \)
Rank
Rank of a matrix is the number of linearly independent rows or columns of the matrix.
Matrix Operations

Addition:
We add two matrices by adding the corresponding elements.
They must have the same dimensions.

\( \mathbf{A} + \mathbf{B} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix} _{\text{m x n}} \)

images/maths/linear_algebra/matrix_addition.png

Multiplication:
We can multiply two matrices only if their inner dimensions are equal.
\( \mathbf{C}_{m x n} = \mathbf{A}_{m x d} ~ \mathbf{B}_{d x n} \)

\( c_{ij} \) = Dot product of \(i^{th}\) row of A and \(j^{th}\) column of B.
\( c_{ij} = \sum_{k=1}^{d} A_{ik} * B_{kj} \)
=> \( c_{11} = a_{11} * b_{11} + a_{12} * b_{21} + \cdots + a_{1d} * b_{d1} \)

e.g:
\( \mathbf{A} = \begin{bmatrix} 1 & 2 \\ \\ 3 & 4 \end{bmatrix} _{\text{2 x 2}}, \mathbf{B} = \begin{bmatrix} 5 & 6 \\ \\ 7 & 8 \end{bmatrix} _{\text{2 x 2}} \)

\( \mathbf{C} = \mathbf{A} \times \mathbf{B} = \begin{bmatrix} 1 * 5 + 2 * 7 & 1 * 6 + 2 * 8 \\ \\ 3 * 5 + 4 * 7 & 3 * 6 + 4 * 8 \end{bmatrix} _{\text{2 x 2}} = \begin{bmatrix} 19 & 22 \\ \\ 43 & 50 \end{bmatrix} _{\text{2 x 2}} \)

Key Properties:

  1. AB ≠ BA in general ; NOT commutative
  2. (AB)C = A(BC) ; Associative
  3. A(B+C) = AB+AC ; Distributive
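The worked multiplication and the key properties (including \((AB)^T = B^TA^T\) from the Transpose section) can be checked with NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B  # matrix product from the worked example: [[19, 22], [43, 50]]

# Key properties
not_commutative = not np.array_equal(A @ B, B @ A)      # AB != BA here
transpose_rule = np.array_equal((A @ B).T, B.T @ A.T)   # (AB)^T = B^T A^T
```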
Square Matrix

Square Matrix:
It is a matrix with same number of rows and columns (m=n).

\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} _{\text{n x n}} \)

Diagonal Matrix:
It is a square matrix with all non-diagonal elements equal to zero.
e.g.:
\( \mathbf{D} = \begin{bmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \end{bmatrix} _{\text{n x n}} \)

Note: Product of 2 diagonal matrices is a diagonal matrix.

Lower Triangular Matrix:
It is a square matrix with all elements above the diagonal equal to zero.
e.g.:
\( \mathbf{L} = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix} _{\text{n x n}} \)

Note: Product of 2 lower triangular matrices is a lower triangular matrix.

Upper Triangular Matrix:
It is a square matrix with all elements below the diagonal equal to zero.
e.g.:
\( \mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{bmatrix} _{\text{n x n}} \)

Note: Product of 2 upper triangular matrices is an upper triangular matrix.

Symmetric Matrix:
It is a square matrix that is equal to its own transpose, i.e, flip the matrix along its diagonal, it remains unchanged.
\( \mathbf{A} = \mathbf{A}^T \)
e.g.:
\( \mathbf{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{12} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1n} & s_{2n} & \cdots & s_{nn} \end{bmatrix} _{\text{n x n}} \)

Note: Diagonal matrix is a symmetric matrix.

Skew Symmetric Matrix:
It is a square matrix where the elements across the diagonal have opposite signs.
Also called an Anti-Symmetric Matrix.

\( \mathbf{A} = -\mathbf{A}^T \)
e.g.:
\( \mathbf{S} = \begin{bmatrix} 0 & s_{12} & \cdots & s_{1n} \\ -s_{12} & 0 & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{1n} & -s_{2n} & \cdots & 0 \end{bmatrix} _{\text{n x n}} \)

Note: Diagonal elements of a skew symmetric matrix are zero, since the number should be equal to its negative.

Identity Matrix:
It is a square matrix with all the diagonal values equal to 1, rest of the elements are equal to zero.
e.g.:
\( \mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} _{\text{n x n}} \)

Important:
\( \mathbf{I} \times \mathbf{A} = \mathbf{A} \times \mathbf{I} = \mathbf{A} \)

Operations of Square Matrix

Trace:
It is the sum of the elements on the main diagonal of a square matrix.
Note: Main diagonal is NOT defined for a rectangular matrix.
\( \text{trace}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}\)

If \( \mathbf{A} \in \mathbb{R}^{m \times n} \), and \( \mathbf{B} \in \mathbb{R}^{n \times m} \), then
\( \text{trace}(AB)_{m \times m} = \text{trace}(BA)_{n \times n} \)

Determinant:
It is a scalar value that reveals crucial properties about the matrix and its linear transformation.

  1. If determinant = 0 => singular matrix, i.e linearly dependent rows or columns.
  2. Determinant is also equal to the scaling factor of the linear transformation.
\[ |A| = \sum_{j=1}^{n} (-1)^{1+j} \, a_{1j} \, M_{1j} \]

\(a_{1j} \) = element in the first row and j-th column
\(M_{1j} \) = submatrix of the matrix excluding the first row and j-th column

If n = 2, then, \( |A| = a_{11} \, a_{22} - a_{12} \, a_{21} \)

Singular Matrix:
A square matrix with linearly dependent rows or columns, i.e, determinant = 0.
A singular matrix collapses space into a lower dimensional one, e.g., a 3D space into a 2D plane, making the transformation impossible to reverse.
Hence, a singular matrix is NOT invertible; equivalently, the inverse formula divides by the determinant, which is zero.
Also called a rank-deficient matrix, because rank < number of dimensions of the matrix, due to the presence of linearly dependent rows or columns.

e.g: Below is a linearly dependent 2x2 matrix.
\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ \\ \beta a_{11} & \beta a_{12} \end{bmatrix} _{\text{2 x 2}}, det(\mathbf{A}) = a_{11}\cdot \beta a_{12} - a_{12} \cdot \beta a_{11} = 0 \)

Note:

  1. \( det(A) = det(A^T) \)
  2. \( det(\beta A) = \beta^n det(A) \), where n is the number of dimensions of A and \(\beta\) is a scalar.

Inverse:
It is a square matrix that when multiplied by the original matrix, gives the identity matrix.
\( \mathbf{A} \mathbf{A}^{-1} = \mathbf{A}^{-1} \mathbf{A} = \mathbf{I} \)

\[ A^{-1} = \frac{1}{|A|} \, \text{adj}(A) \]

Steps to compute inverse of a matrix:

  1. Calculate the determinant of the matrix. \( |A| = \det(A) \)
  2. For each element \(a_{ij}\), compute its minor \(M_{ij}\).
    \(M_{ij}\) is the determinant of the submatrix of the matrix excluding the i-th row and j-th column.
  3. Form the co-factor matrix \(C_{ij}\), using the minor.
    \( C_{ij} = (-1)^{\,i+j}\,M_{ij} \)
  4. Take the transpose of the cofactor matrix to get the adjugate matrix.
    \( \mathrm{adj}(A) = \mathrm{C}^\mathrm{T} \)
  5. Compute the inverse:
    \( A^{-1} = \frac{1}{|A|}\,\mathrm{adj}(A) = \frac{1}{|A|}\,\mathrm{C}^\mathrm{T}\)

e.g:
\( \mathbf{A} = \begin{bmatrix} 1 & 0 & 3\\ 2 & 1 & 1\\ 1 & 1 & 1 \\ \end{bmatrix} _{\text{3 x 3}} \), Co-factor matrix: \( \mathbf{C} = \begin{bmatrix} 0 & -1 & 1\\ 3 & -2 & -1\\ -3 & 5 & 1 \\ \end{bmatrix} _{\text{3 x 3}} \)

Adjugate matrix, adj(A) = \( \mathbf{C}^\mathrm{T} = \begin{bmatrix} 0 & 3 & -3\\ -1 & -2 & 5\\ 1 & -1 & 1 \\ \end{bmatrix} _{\text{3 x 3}} \)



determinant of \( \mathbf{A} \) = \( \begin{vmatrix} 1 & 0 & 3\\ 2 & 1 & 1\\ 1 & 1 & 1 \\ \end{vmatrix} = 1(1 \cdot 1 - 1 \cdot 1) - 0(2 \cdot 1 - 1 \cdot 1) + 3(2 \cdot 1 - 1 \cdot 1) = 0 - 0 + 3 = 3 \)

\( \mathbf{A}^{-1} = \frac{1}{|A|}\,\mathrm{adj}(A) = \frac{1}{3}\, \begin{bmatrix} 0 & 3 & -3\\ -1 & -2 & 5\\ 1 & -1 & 1 \\ \end{bmatrix} _{\text{3 x 3}} \)

=> \( \mathbf{A}^{-1} = \begin{bmatrix} 0 & 1 & -1\\ -1/3 & -2/3 & 5/3\\ 1/3 & -1/3 & 1/3 \\ \end{bmatrix} _{\text{3 x 3}} \)

Note:

  1. Inverse of an Identity matrix is the Identity matrix itself.
  2. \( (\beta \mathbf{A})^{-1} = \frac{1}{\beta}\mathbf{A}^{-1} \), for a non-zero scalar \(\beta\).
  3. \( (\mathbf{A}^\mathrm{T})^{-1} = (\mathbf{A}^{-1})^\mathrm{T} \).
  4. Inverse of a diagonal matrix is a diagonal matrix with reciprocal of the diagonal elements.
  5. Inverse of an upper triangular matrix is an upper triangular matrix.
  6. Inverse of a lower triangular matrix is a lower triangular matrix.
  7. Symmetric matrix, if invertible, then \((\mathbf{A}^{-1})^\mathrm{T} = \mathbf{A}^{-1}\), since, \(\mathbf{A} = \mathbf{A}^\mathrm{T}\)
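The worked inverse above can be verified numerically; `np.linalg.inv` returns the same matrix as the adjugate method:

```python
import numpy as np

A = np.array([[1, 0, 3],
              [2, 1, 1],
              [1, 1, 1]], dtype=float)

det_A = np.linalg.det(A)   # 3, non-zero -> A is invertible
A_inv = np.linalg.inv(A)

# Verify A A^{-1} = I and the first row matches the hand computation [0, 1, -1]
identity_ok = np.allclose(A @ A_inv, np.eye(3))
```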
Orthogonal Matrix

It is a square matrix whose rows and columns are orthonormal vectors, i.e, they are perpendicular to each other and have unit length.

\( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{A}^\mathrm{T} \mathbf{A} = \mathbf{A} \mathbf{A}^{-1} = \mathbf{I} \)
=> \( \mathbf{A}^\mathrm{T} = \mathbf{A}^{-1} \)

Note:

  1. Orthogonal matrix preserves the length and angles of vectors, acting as rotation or reflection in geometry.
  2. The determinant of an orthogonal matrix is always +1 or -1, because \( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{I} \)

Let’s check out why the rows and columns of an orthogonal matrix are orthonormal.
\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} _{\text{2 x 2}} \), => \( \mathbf{A}^\mathrm{T} = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} _{\text{2 x 2}} \)

Since, \( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{I} \)
\( \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11}^2 + a_{12}^2 & a_{11}a_{21} + a_{12}a_{22} \\ \\ a_{21}a_{11} + a_{22}a_{12} & a_{21}^2 + a_{22}^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \\ 0 & 1 \end{bmatrix} \)

Equating the terms, we get:
\( a_{11}^2 + a_{12}^2 = 1 \) => row 1 is a unit vector
\( a_{21}^2 + a_{22}^2 = 1 \) => row 2 is a unit vector
\( a_{11}a_{21} + a_{12}a_{22} = 0 \) => row 1 and row 2 are orthogonal to each other, since dot product = 0
\( a_{21}a_{11} + a_{22}a_{12} = 0 \) => same condition, row 2 and row 1 are orthogonal to each other

Therefore, the rows and columns of an orthogonal matrix are orthonormal.

Why multiplication of a vector with an orthogonal matrix does NOT change its size?

Let, \(\mathbf{Q}\) is an orthogonal matrix, and \(\mathbf{v}\) is a vector.
Let’s calculate the squared length of \( \mathbf{Q} \mathbf{v} \)

\[ \|\mathbf{Qv}\|^2 = (\mathbf{Qv})^T\mathbf{Qv} = \mathbf{v}^T\mathbf{Q}^T\mathbf{Qv} \\ = \mathbf{v}^T\mathbf{v} = \|\mathbf{v}\|^2 \quad \text{ since } \mathbf{Q}^T\mathbf{Q} = \mathbf{I}, \text{ as Q is orthogonal} \]

Therefore, linear transformation of a vector by an orthogonal matrix does NOT change its length.
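A quick numerical check of length preservation, using a rotation matrix, which is a standard example of an orthogonal matrix (the angle here is arbitrary):

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle, chosen for illustration
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrix is orthogonal

v = np.array([3.0, 4.0])

# Q^T Q = I, and lengths are preserved under the transformation
orthogonal = np.allclose(Q.T @ Q, np.eye(2))
length_preserved = np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))
```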

Solve the following system of equations:
\( 2x + y = 5 \)
\( x + 2y = 4 \)

Lets solve the system of equations using matrix:
\(\begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \) \( \begin{bmatrix} x \\ \\ y \end{bmatrix} = \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix} \)

The above equation can be written as:
\(\mathbf{AX} = \mathbf{B} \)
=> \( \mathbf{X} = \mathbf{A}^{-1} \mathbf{B} \)

\( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix}, \mathbf{B} = \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix}, \mathbf{X} = \begin{bmatrix} x \\ \\ y \end{bmatrix} \)

Lets compute the inverse of A matrix:
\( \mathbf{A}^{-1} = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ \\ -1 & 2 \end{bmatrix} \)

Since, \( \mathbf{X} = \mathbf{A}^{-1} \mathbf{B} \)
=> \( \mathbf{X} = \begin{bmatrix} x \\ \\ y \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 6 \\ \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ \\ 1 \end{bmatrix} \)

Therefore, \( x = 2 \) and \( y = 1 \).
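The same system can be solved numerically; `np.linalg.solve` is preferred over forming \(\mathbf{A}^{-1}\) explicitly:

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
B = np.array([5, 4], dtype=float)

# Solves AX = B directly (more stable and cheaper than inv(A) @ B)
X = np.linalg.solve(A, B)
print(X)  # [2. 1.]
```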



End of Section

2.3.3 - Eigen Value Decomposition

Eigen Values, Eigen Vectors, & Eigen Value Decomposition


What is the meaning of the word “Eigen” ?

Eigen is a German word that means “Characteristic” or “Proper”.
It tells us about the characteristic properties of a matrix.

Linear Transformation
A linear transformation defined by a matrix, denoted as \(T(x)=A\mathbf{x}\), is a function that maps a vector \(\mathbf{x}\) to a new vector by multiplying it by a matrix \(A\).
Multiplying a vector by a matrix can change the direction or magnitude or both of the vector.
Example

\( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\mathbf{u} = \begin{bmatrix} 0 \\ \\ 1 \\ \end{bmatrix}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ \\ 1 \\ \end{bmatrix}\)

\(\mathbf{Au} = \begin{bmatrix} 1 \\ \\ 2 \\ \end{bmatrix}\) , \(\quad\) \(\mathbf{Av} = \begin{bmatrix} 3 \\ \\ 3 \\ \end{bmatrix}\)

images/maths/linear_algebra/linear_transformation.png

Eigen Vector
A special non-zero vector whose direction remains unchanged when the linear transformation by the matrix is applied.
It might get scaled up or down but does not change its direction.
Result of linear transformation, i.e, multiplying the vector by a matrix, is just a scalar multiple of the original vector.
Eigen Value (\(\lambda\))
It is the scaling factor of the eigen vector, i.e, a scalar multiple \(\lambda\) of the original vector, when the vector is multiplied by a matrix.
\(|\lambda| > 1 \): Vector stretched
\(0 < |\lambda| < 1 \): Vector shrunk
\(|\lambda| = 1 \): Same size
\(\lambda < 0 \): Vector’s direction is reversed
Characteristic Equation

Since, for an eigen vector, result of linear transformation, i.e, multiplying the vector by a matrix, is just a scalar multiple of the original vector, =>

\[ \mathbf{A} \mathbf{v} = \lambda \mathbf{v} \\ \mathbf{A} \mathbf{v} - \lambda \mathbf{v} = 0 \\ (\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = 0 \\ \]

For a non-zero eigen vector, \((\mathbf{A} - \lambda \mathbf{I})\) must be singular, i.e, \(det(\mathbf{A} - \lambda \mathbf{I}) = 0 \)

If, \( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} _{\text{n x n}} \), then, \((\mathbf{A} - \lambda \mathbf{I}) = \begin{bmatrix} a_{11}-\lambda & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22}-\lambda & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn}-\lambda \end{bmatrix} _{\text{n x n}} \)

\(det(\mathbf{A} - \lambda \mathbf{I}) = 0 \), will give us a polynomial equation in \(\lambda\) of degree \(n\),
we need to solve this polynomial equation to get the \(n\) eigen values.
e.g.:

\( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \( \quad det(\mathbf{A} - \lambda \mathbf{I}) = 0 \quad \) => \( det \begin{bmatrix} 2-\lambda & 1 \\ \\ 1 & 2-\lambda \end{bmatrix} = 0 \)

\( (2-\lambda)^2 - 1 = 0\)
\( => \lambda - 2 = \pm 1 \)
\( => \lambda_1 = 3\) and \( \lambda_2 = 1\)

Therefore, eigen vectors corresponding to eigen values \(\lambda_1 = 3\) and \( \lambda_2 = 1\) are:
\((\mathbf{A} - \lambda \mathbf{I}) \mathbf{v} = 0 \)
\(\lambda_1 = 3\)
=> \( \begin{bmatrix} 2-3 & 1 \\ \\ 1 & 2-3 \end{bmatrix} \begin{bmatrix} v_1 \\ \\ v_2 \\ \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ \\ 1 & -1 \end{bmatrix} \begin{bmatrix} v_1 \\ \\ v_2 \\ \end{bmatrix} = 0 \)

=> Both the equations will be \(v_1 - v_2 = 0 \), i.e, \(v_1 = v_2\)
So, we can choose any vector, where x-axis and y-axis components are same, i.e, \(v_1 = v_2\)
=> Eigen vector: \(v_1 = \begin{bmatrix} 1 \\ \\ 1 \\ \end{bmatrix}\)
Similarly, for \(\lambda_2 = 1\)
we will get, eigen vector: \(v_2 = \begin{bmatrix} 1 \\ \\ -1 \\ \end{bmatrix}\)
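The eigen values and eigen vectors computed by hand above can be checked with NumPy (NumPy returns unit-length eigen vectors, so they may differ from the hand-picked ones by a scale factor):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)

# Columns of eigvecs are the eigen vectors
eigvals, eigvecs = np.linalg.eig(A)

# Verify the defining property A v = lambda v for each pair
checks = [np.allclose(A @ eigvecs[:, i], eigvals[i] * eigvecs[:, i])
          for i in range(2)]
```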

What are the eigen values and vectors of an identity matrix?
Characteristic equation for identity matrix:
\(\mathbf{Iv} = \lambda \mathbf{v}\)
Therefore, the identity matrix has only one distinct eigen value \(\lambda = 1\) (with multiplicity \(n\)), and every non-zero vector is an eigen vector.
Are the eigen values of a real matrix always real?
No, eigen values can be complex; if complex, then always occur in conjugate pairs. e.g:
\(\mathbf{A} = \begin{bmatrix} 0 & 1 \\ \\ -1 & 0 \end{bmatrix} \), \( \quad det(\mathbf{A} - \lambda \mathbf{I}) = 0 \quad \) => det \(\begin{bmatrix} 0-\lambda & 1 \\ \\ -1 & 0-\lambda \end{bmatrix} = 0 \)

=> \(\lambda^2 + 1 = 0\) => \(\lambda = \pm i\)
So, eigen values are complex.
What are the eigen values of a diagonal matrix?
The eigen values of a diagonal matrix are the diagonal elements themselves.
e.g.:
\( \mathbf{D} = \begin{bmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \end{bmatrix} _{\text{n x n}} \), \( \quad det(\mathbf{D} - \lambda \mathbf{I}) = 0 \quad \) => det \( \begin{bmatrix} d_{11}-\lambda & 0 & \cdots & 0 \\ 0 & d_{22}-\lambda & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn}-\lambda \end{bmatrix} _{\text{n x n}} \)

=> \((d_{11}-\lambda)(d_{22}-\lambda) \cdots (d_{nn}-\lambda) = 0\)
=> \(\lambda = d_{11}, d_{22}, \cdots, d_{nn}\)
So, eigen values are the diagonal elements of the matrix.
Key Properties of Eigen Values and Eigen Vectors
  1. For a \(n \times n\) matrix, there are \(n\) eigen values.
  2. Eigen values need NOT be unique, e.g, identity matrix has only one eigen value \(\lambda = 1\).
  3. Sum of eigen values = trace of matrix = sum of diagonal elements, i.e,
    \(tr(\mathbf{A}) = \lambda_1 + \lambda_2 + \cdots + \lambda_n\).
  4. Product of eigen values = determinant of matrix, i.e,
    \(det(\mathbf{A}) = |\mathbf{A}|= \lambda_1 \lambda_2 \cdots \lambda_n\). e.g: For the example matrix given above,

    \( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\quad \lambda_1 = 3\) and \( \lambda_2 = 1\)

\(tr(\mathbf{A}) = 2 + 2 = 3 + 1 = 4\)
\(det(\mathbf{A}) = 2 \times 2 - 1 \times 1 = 3 \times 1 = 3 \)

  5. Eigen vectors corresponding to distinct eigen values of a real symmetric matrix are orthogonal to each other.
    Proof:
    Let, \(\mathbf{v_1}, \mathbf{v_2}\) be eigen vectors corresponding to distinct eigen value \(\lambda_1,\lambda_2\) of a real symmetric matrix \(\mathbf{A} = \mathbf{A}^T\).
    We know that for eigen vectors -
    \[ \mathbf{Av_1} = \lambda_1 \mathbf{v_1} ~and~ \mathbf{Av_2} = \lambda_2 \mathbf{v_2} \\[10pt] \text{ let's calculate the dot product: } \\[10pt] \mathbf{Av_1} \cdot \mathbf{v_2} = (\mathbf{Av_1})^T\mathbf{v_2} = \mathbf{v_1^TA^Tv_2} = \mathbf{v_1^T} ~ \mathbf{Av_2}, \quad \text {since } \mathbf{A} = \mathbf{A}^T\\[10pt] => (\mathbf{Av_1}) \cdot \mathbf{v_2} = \mathbf{v_1} \cdot (\mathbf{Av_2}) \\[10pt] => (\lambda_1 \mathbf{v_1}) \cdot \mathbf{v_2} = \mathbf{v_1} \cdot (\lambda_2\mathbf{v_2}) \\[10pt] => \lambda_1 (\mathbf{v_1 \cdot v_2}) = \lambda_2 (\mathbf{v_1 \cdot v_2}) \\[10pt] => (\lambda_1 - \lambda_2) (\mathbf{v_1} \cdot \mathbf{v_2}) = 0 \\[10pt] \text{ since, eigen values are distinct,} => \lambda_1 ≠ \lambda_2 \\[10pt] \therefore \mathbf{v_1} \cdot \mathbf{v_2} = 0 \\[10pt] => \text{ eigen vectors are orthogonal to each other,} => \mathbf{v_1} \perp \mathbf{v_2} \\[10pt] \]
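The trace, determinant, and orthogonality properties can be verified numerically on the example matrix:

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
eigvals, eigvecs = np.linalg.eig(A)

# Sum of eigen values = trace(A) = 4; product = det(A) = 3
trace_matches = np.isclose(eigvals.sum(), np.trace(A))
det_matches = np.isclose(eigvals.prod(), np.linalg.det(A))

# A is symmetric with distinct eigen values -> eigen vectors are orthogonal
orthogonal = np.isclose(eigvecs[:, 0] @ eigvecs[:, 1], 0)
```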
How will we calculate the 2nd power of a matrix i.e \(\mathbf{A}^2\)?

Let’ calculate the 2nd power of a square matrix.

e.g.:
\(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\quad \mathbf{A}^2 = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ \\ 4 & 5 \end{bmatrix} \)

Now, how will we calculate higher powers of a matrix i.e \(\mathbf{A}^k\)?
If we follow the above method, then we will have to multiply the matrix \(\mathbf{A}\), \(k\) times, which will be very time consuming and cumbersome.
So, we need to find an easier way to calculate the power of a matrix.
How will we calculate the power of diagonal matrix?

Let’s calculate the 2nd power of a diagonal matrix.

e.g.:
\(\mathbf{A} = \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} \), \(\quad \mathbf{A}^2 = \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 9 & 0 \\ \\ 0 & 4 \end{bmatrix} \)

Note that when we square the diagonal matrix, then all the diagonal elements got squared.
Similarly, if we want to calculate the kth power of a diagonal matrix, then all we need to do is to just compute the kth powers of all diagonal elements, instead of complex matrix multiplications.
\(\quad \mathbf{A}^k = \begin{bmatrix} 3^k & 0 \\ \\ 0 & 2^k \end{bmatrix} \)

Therefore, if we diagonalize a square matrix then the computation of power of the matrix will become very easy.
Next, let’s see how to diagonalize a matrix.

Eigen Value Decomposition

\(\mathbf{V}\): Matrix of all eigen vectors(as columns) of matrix \(\mathbf{A}\)
\( \mathbf{V} = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_n \\ \vdots & \vdots & \ddots & \vdots \\ \vdots & \vdots & \vdots & \vdots\\ \end{bmatrix} \),
where, each column is an eigen vector corresponding to an eigen value \(\lambda_i\).
If \( \mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_n \) are linearly independent, then, \(det(\mathbf{V}) \neq 0\).
=> \(\mathbf{V}\) is NOT singular and \(\mathbf{V}^{-1}\) exists.

\(\Lambda\): Diagonal matrix of all eigen values of matrix \(\mathbf{A}\)
\( \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \),
where, each diagonal element is an eigen value corresponding to an eigen vector \(\mathbf{v}_i\).

Let’s recall the characteristic equation of a matrix for calculating eigen values:
\(\mathbf{A} \mathbf{v} = \lambda \mathbf{v}\)
Let’s use the consolidated matrix for eigen values and eigen vectors described above:
i.e, \(\mathbf{A} \mathbf{V} = \mathbf{V} \mathbf{\Lambda}\)
For diagonalisation:
=> \(\mathbf{\Lambda} = \mathbf{V}^{-1} \mathbf{A} \mathbf{V}\)

We can see that, using the above equation, we can represent the matrix \(\mathbf{A}\) as a diagonal matrix \(\mathbf{\Lambda}\) using the matrix of eigen vectors \(\mathbf{V}\).

Just reshuffling the above equation will give us the Eigen Value Decomposition of the matrix \(\mathbf{A}\):
\(\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\)

Important: Given that \(\mathbf{V}^{-1}\) exists, i.e, all the eigen vectors are linearly independent.

Example

Let’s revisit the example given above:
\(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\quad \lambda_1 = 3\) and \( \lambda_2 = 1\), \(\mathbf{v_1} = \begin{bmatrix} 1 \\ \\ 1 \\ \end{bmatrix}\), \(\mathbf{v_2} = \begin{bmatrix} 1 \\ \\ -1 \\ \end{bmatrix}\)

=> \( \mathbf{V} = \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \), \(\quad \mathbf{\Lambda} = \begin{bmatrix} 3 & 0 \\ \\ 0 & 1 \end{bmatrix} \)

\( \because \mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\)
We know, \( \mathbf{V} ~and~ \mathbf{\Lambda} \), we need to calculate \(\mathbf{V}^{-1}\).

\(\mathbf{V}^{-1} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \)

\( \therefore \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} = \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \begin{bmatrix} 3 & 0 \\ \\ 0 & 1 \end{bmatrix} \frac{1}{2}\begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \)

\( = \frac{1}{2} \begin{bmatrix} 3 & 1 \\ \\ 3 & -1 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 4 & 2 \\ \\ 2 & 4 \\ \end{bmatrix} \)

\( = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \mathbf{A} \)
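A sketch of using the decomposition to compute matrix powers cheaply, as motivated earlier: only the diagonal entries of \(\mathbf{\Lambda}\) need to be raised to the power \(k\).

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)

eigvals, V = np.linalg.eig(A)
Lam = np.diag(eigvals)

# A^k = V Lambda^k V^{-1}; powering a diagonal matrix is element-wise
k = 5
A_k = V @ np.diag(eigvals ** k) @ np.linalg.inv(V)

# Agrees with repeated multiplication
same = np.allclose(A_k, np.linalg.matrix_power(A, k))
```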

Spectral Decomposition

Spectral decomposition is a specific type of eigendecomposition that applies to a symmetric matrix, requiring its eigenvectors to be orthogonal.
In contrast, a general eigendecomposition applies to any diagonalizable matrix and does not require the eigenvectors to be orthogonal.

The eigen vectors corresponding to distinct eigen values are orthogonal.
However, the matrix \(\mathbf{V}\) formed by these eigen vectors as columns is NOT necessarily orthogonal, because its columns may NOT be unit length.
So, in order to make the matrix \(\mathbf{V}\) orthogonal, we normalize each eigen vector (column) to unit length, by dividing the vector by its magnitude.

After normalisation we get orthogonal matrix \(\mathbf{Q}\) that is composed of unit length and orthogonal eigen vectors or orthonormal eigen vectors.
Since, matrix \(\mathbf{Q}\) is orthogonal, => \(\mathbf{Q}^T = \mathbf{Q}^{-1}\)

The eigen value decomposition of a square matrix is:
\(\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\)

And, the spectral decomposition of a real symmetric matrix is:
\(\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T\)

Important: Note that, we are discussing only the special case of real symmetric matrix here,
because a real symmetric matrix is guaranteed to have all real eigenvalues.
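For a real symmetric matrix, `np.linalg.eigh` is the specialised routine: it returns real eigen values and an orthogonal eigen vector matrix directly, giving the spectral decomposition:

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)

# eigh is for symmetric (Hermitian) matrices; eigen values come back ascending
eigvals, Q = np.linalg.eigh(A)

Q_orthogonal = np.allclose(Q.T @ Q, np.eye(2))          # Q^T Q = I
reconstructed = np.allclose(Q @ np.diag(eigvals) @ Q.T, A)  # A = Q Lambda Q^T
```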

Applications of Eigen Value Decomposition
  1. Principal Component Analysis (PCA): For dimensionality reduction.
  2. Page Rank Algorithm: For finding the importance of a web page.
  3. Structural Engineering: By calculating the eigen values of a bridge’s structural model, we can identify its natural frequencies to ensure that the bridge won’t resonate and be damaged by external forces, such as wind, and seismic waves.



End of Section

2.3.4 - Principal Component Analysis

Principal Component Analysis


In the diagram below, if we need to reduce the dimensionality of the data to 1, which feature should be dropped?

images/maths/linear_algebra/pca_example_1.png

Whenever we want to reduce the dimensionality of the data, we should aim to minimize information loss.
Since, information = variance, we should drop the feature that brings least information, i.e, has least variance.
Therefore, drop the feature 1.

What if the variance in both directions is same ?
What should be done in this case? Check the diagram below.

images/maths/linear_algebra/pca_example_2.png
Here, since the variance in both directions is approximately same, in order to capture maximum variance in data,
we will rotate the f1-axis in the direction of maximum spread/variance of data, i.e, f1’-axis and then we can drop f2’-axis, which is perpendicular to f1’-axis.
Principal Component Analysis (PCA)

It is a dimensionality reduction technique that finds the direction of maximum variance in the data.
Note: Some loss of information will always be there in dimensionality reduction, because there will be some variability in data along the direction that is dropped, and that will be lost.

Goal:
Fundamental goal of PCA is to find the new set of orthogonal axes, called the principal components, onto which the data can be projected, such that, the variance of the projected data is maximum.

Say, we have data, \(D:X \in \mathbb{R}^{n \times d}\),
n is the number of samples
d is the number of features or dimensions of each data point.
In order to find the directions of maximum variance in data, we will use the covariance matrix of data.
Covariance matrix (C) summarizes the spread and relationship of the data in the original d-dimensional space.
\(C_{d \times d} = \frac{1}{n-1}X^TX \), where \(X\) is the data matrix.
Note: (n-1) in the denominator is for unbiased estimation(Bessel’s correction) of covariance matrix.
\(C_{ii}\) is the variance of the \(i^{th}\) feature.
\(C_{ij}\) is the co-variance between feature \(i\) and feature \(j\).
Trace(C) = Sum of diagonal elements of C = Total variance of data.

Algorithm:

  1. Mean-center the data, i.e, subtract the mean of each feature from every data point.
    \(X = X - \mu\)

  2. Compute the covariance matrix with mean centered data.
    \(C = \frac{1}{n-1}X^TX \), \( \quad \Sigma = \begin{bmatrix} var(f_1) & cov(f_1f_2) & \cdots & cov(f_1f_d) \\ cov(f_2f_1) & var(f_2) & \cdots & cov(f_2f_d) \\ \vdots & \vdots & \ddots & \vdots \\ cov(f_df_1) & cov(f_df_2) & \cdots & var(f_d) \end{bmatrix} _{\text{d x d}} \)

  3. Perform the eigen value decomposition of covariance matrix.
    \( C = Q \Lambda Q^T \)
    \(Q\): Orthogonal matrix of eigen vectors of the covariance matrix.
    Its columns are the new rotated axes or principal components of the data.
    \(\Lambda\): Diagonal matrix of eigen values of covariance matrix.
    Scaling of variance along new eigen basis.
    Note: Eigen values are sorted in descending order, i.e \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \).

  4. Project the data onto the top k principal components/directions.
    \(X_{new} = Z = XQ_k\), where \(Q_k\) contains the top k eigen vectors and k < d is the reduced dimensionality.
    \(X_{new}\): Projected data or principal component scores
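The four steps above can be sketched in NumPy. This is a minimal illustration on hypothetical toy data (the 2-D points and the random seed are assumptions, not from the notes); `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

# Hypothetical toy data: 200 points spread mainly along the diagonal y = x,
# so the first principal component should point along that diagonal.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

# Step 1: mean-center the data.
X = X - X.mean(axis=0)

# Step 2: covariance matrix with Bessel's correction (n - 1).
n = X.shape[0]
C = (X.T @ X) / (n - 1)

# Step 3: eigen decomposition (eigh suits symmetric matrices),
# then sort eigen values/vectors in descending order.
eigvals, Q = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

# Step 4: project onto the top k = 1 principal component.
k = 1
Z = X @ Q[:, :k]

# Variance of the projected data equals the top eigen value.
print(np.allclose(Z.var(ddof=1), eigvals[0]))  # → True
```

Here `Z` is the principal component score matrix; keeping `k` of the `d` columns of `Q` is exactly the \(Q_k\) truncation from step 4.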

What is the variance of the projected data?

Variance of projected data is given by the eigen value of the co-variance matrix.
Covariance of projected data = \((XQ)^TXQ \)

\[ \begin{aligned} (XQ)^TXQ &= Q^TX^TXQ, \quad \text{ since, } (AB)^T = B^TA^T \\ &= Q^TQ \Lambda Q^TQ, \quad \text{ since, } C = X^TX = Q \Lambda Q^T \\ &= \Lambda, \quad \text{ since Q is orthogonal, } Q^TQ = I \\ \end{aligned} \]

Therefore, the diagonal matrix \( \Lambda \) captures the variance along every principal component direction.



End of Section

2.3.5 - Singular Value Decomposition

Singular Value Decomposition


Singular Value Decomposition (SVD)

It decomposes any matrix into a rotation, a scaling (based on singular values), and another rotation.
It is a generalization of the eigen value decomposition (which applies only to square matrices) to rectangular matrices.
Any rectangular matrix can be decomposed into a product of three matrices using SVD, as follows:

\[\mathbf{A}_{m \times n} = \mathbf{U}_{m \times m} \mathbf{\Sigma}_{m \times n} \mathbf{V}^T_{n \times n} \]


\(\mathbf{U}\): Set of orthonormal eigen vectors of \(\mathbf{AA^T}_{m \times m} \)
\(\mathbf{V}\): Set of orthonormal eigen vectors of \(\mathbf{A^TA}_{n \times n} \)
\(\mathbf{\Sigma}\): Rectangular diagonal matrix, whose diagonal values are called singular values; square root of non-zero eigen values of \(\mathbf{AA^T}\).

Note: The number of non-zero diagonal entries in \(\mathbf{\Sigma}\) = rank of matrix \(\mathbf{A}\).
Rank (r): Number of linearly independent rows or columns of a matrix.

\( \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{rr} \\ 0 & 0 & \cdots & 0 \end{bmatrix} \), such that, \(\sigma_{11} \geq \sigma_{22} \geq \cdots \geq \sigma_{rr}\), r = rank of \(\mathbf{A}\).

Singular value decomposition thus refers to the set of scale factors \(\mathbf{\Sigma}\) that are fundamentally linked to the matrix’s singularity and rank.

images/maths/linear_algebra/singular_value_decomposition.png

Properties of Singular Values
  1. All singular values are non-negative.
  2. Square roots of the eigen values of the matrix \(\mathbf{AA^T}\) or \(\mathbf{A^TA}\).
  3. Arranged in non-increasing order: \(\sigma_{11} \geq \sigma_{22} \geq \cdots \geq \sigma_{rr} \ge 0\)

Note: If rank of matrix < dimensions, then 1 or more of the singular values are zero, i.e, dimension collapse.

images/maths/linear_algebra/svd_example_1.png
Suppose a satellite takes pictures of objects in space and sends them to earth. Size of each picture = 1000x1000 pixels.
How can we compress the image size to save satellite bandwidth?

We can perform singular value decomposition on the image matrix and find the minimum number of top-rank terms required
to successfully reconstruct the original image.
Say we performed the SVD on the image matrix and found that the top 20 singular values, out of 1000, are sufficient to tell what the picture is about.

\(\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T \) \( = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + \cdots + u_r \sigma_r v_r^T \), where \(r = 20\)
A = sum of ‘r=20’ matrices of rank=1.
Now, we need to send only the \(u_i, \sigma_i , v_i\) values for i = 1 to 20, i.e, the top 20 ranks, to earth
and then do the calculation to reconstruct an approximation of the original image.
\(\mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{1000} \end{bmatrix}\) \(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{1000} \end{bmatrix}\)

So, we need to send data corresponding to 20 (rank=1) matrices, i.e, \(u_i, v_i ~and~ \sigma_i\): (1000 + 1000 + 1)*20 = 40,020 ≈ 2000*20 = 40,000 values (approx)

Therefore, compression rate = (2000*20)/(10^6) = 40,000/1,000,000 = 1/25

Applications of SVD
  1. Image compression.
  2. Low Rank Approximation: Compress data by keeping top rank singular values.
  3. Noise Reduction: Capture main structure, ignore small singular values.
  4. Recommendation Systems: Decompose user-item rating matrix to discover underlying user preferences and make recommendations.
Low Rank Approximation

The process of approximating any matrix by a matrix of a lower rank, using singular value decomposition.
It is used for data compression.

Any matrix A of rank ‘r’ can be written as sum of rank=1 outer products:

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T \]

\( \mathbf{u}_i : i^{th}\) column vector of \(\mathbf{U}\)
\( \mathbf{v}_i : i^{th}\) column vector of \(\mathbf{V}\), so \(\mathbf{v}_i^T\) is the \(i^{th}\) row of \(\mathbf{V}^T\)
\( \sigma_i : i^{th}\) singular value, i.e, diagonal entry of \(\mathbf{\Sigma}\)

Since, the singular values are arranged from largest to smallest, i.e, \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r\),
The top ranks capture the vast majority of the information or variance in the matrix.

So, in order to get the best rank-k approximation of the matrix, we simply truncate the summation after the k’th term.

\[ \mathbf{A_k} = \mathbf{U} \mathbf{\Sigma_k} \mathbf{V}^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T \]

The approximation \(\mathbf{A_k}\) is called the low rank approximation of \(\mathbf{A}\), which is achieved by keeping only the largest singular values and corresponding vectors.

Applications:

  1. Image compression
  2. Data compression, such as, LoRA (Low Rank Adaptation)



End of Section

2.3.6 - Vector & Matrix Calculus

Vector & Matrix Calculus
Vector Derivative

Let \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n×1}}\) be a vector, i.e, \(\mathbf{x} \in \mathbb{R}^n\).

Let \(f(x)\) be a function that maps a vector to a scalar, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
The derivative of function \(f(x)\) with respect to \(\mathbf{x}\) is defined as:

\[Gradient = \frac{\partial f(x)}{\partial x} = {f'(x)} = \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}_{\text{n×1}} \]


Assumption: All the first order partial derivatives exist.

e.g.:

  1. \(f(x) = a^Tx\), where, \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\), \(\quad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}\), \(\quad a^T = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}\)

    \(=> f(x) = a^Tx = a_1x_1 + a_2x_2 + \cdots + a_nx_n\)

    \(=> \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \) \(=\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = a \)
    \( => f'(x) = a \)



  2. Now, let’s find the derivative of a slightly more complex, but very widely used, function.
    \(f(x) = \mathbf{x^TAx} \quad\), where, \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\), so, \(\quad x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\),

    A is a square matrix, \( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} & \cdots a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2k} & \cdots a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} & \cdots a_{kn} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)

    So, \( \mathbf{A^T} = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{k1} & \cdots a_{n1} \\ a_{12} & a_{22} & \cdots & a_{k2} & \cdots a_{n2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1k} & a_{2k} & \cdots & a_{kk} & \cdots a_{nk} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{kn} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)

\[f(x) = y = \mathbf{x^TAx} = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j\]\[\tag{1}\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_k} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \]

Let’s calculate \(\frac{\partial y}{\partial x_k}\), i.e, the \(k^{th}\) component of the gradient, and then stack the components for k = 1 to n.
The \(k^{th}\) element \(x_k\) appears in the summation below in 2 ways, i.e, when \(i=k\) and when \(j=k\).

\[ y = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j \\ \text { terms where } i=k: \sum_{j=1}^n x_k a_{kj} x_j, \qquad \text { terms where } j=k: \sum_{i=1}^n x_i a_{ik} x_k \\[10pt] \frac{\partial y}{\partial x_k} = \frac{\partial }{\partial x_k} \Big(\sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j\Big) \\[10pt] => \frac{\partial y}{\partial x_k} = \sum_{j=1}^n \frac{\partial }{\partial x_k} (x_k a_{kj} x_j) + \sum_{i=1}^n \frac{\partial }{\partial x_k}(x_i a_{ik} x_k) \\ => \frac{\partial y}{\partial x_k} = \sum_{j=1}^n a_{kj} x_j + \sum_{i=1}^n x_i a_{ik} \\ \text{ renaming the dummy index } j \text{ to } i \text{, we can combine both terms: } \\ => \frac{\partial y}{\partial x_k} = \sum_{i=1}^n (a_{ki} + a_{ik}) x_i \\ \text{ Note that } \sum_{i=1}^n a_{ki} x_i \text{ is the k-th entry of } Ax \text{, and } \sum_{i=1}^n a_{ik} x_i \text{ is the k-th entry of } A^Tx \\[10pt] \text{ from (1) above, stacking the components } k = 1, \dots, n \text{ into a vector: } \\[10pt] \therefore \frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(\mathbf{x^TAx}) = \mathbf{(A + A^T) x} \\[10pt] \text{ if } \mathbf{A = A^T} \text{ then, } \\[10pt] \frac{\partial y}{\partial x} = \frac{\partial }{\partial x}(\mathbf{x^TAx}) = \mathbf{2Ax} \\ \]
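This identity is easy to sanity-check numerically. The sketch below compares a central finite-difference gradient of \(f(x) = x^TAx\) with \((A + A^T)x\) on a hypothetical random matrix and point (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))   # hypothetical non-symmetric matrix
x = rng.normal(size=4)        # hypothetical evaluation point

f = lambda v: v @ A @ v  # scalar-valued quadratic form x^T A x

# Central finite differences, one coordinate at a time.
eps = 1e-6
num_grad = np.zeros(4)
for k in range(4):
    e = np.zeros(4)
    e[k] = eps
    num_grad[k] = (f(x + e) - f(x - e)) / (2 * eps)

analytic = (A + A.T) @ x
print(np.allclose(num_grad, analytic, atol=1e-4))  # → True
```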


Jacobian Matrix

Above, we saw the gradient of a scalar valued function, i.e, a function that maps a vector to a scalar, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
There is another kind of function called vector valued function, i.e, a function that maps a vector to another vector, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\).

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function, while the gradient is a vector representing the partial derivatives of a scalar-valued function.

  • Jacobian matrix provides the best linear approximation of a vector valued function near a given point, similar to how a derivative/gradient is the best linear approximation for a scalar valued function

Note: Gradient is a special case of the Jacobian; it is the transpose of the Jacobian for a scalar valued function.

Let, \(f(x)\) be a function that maps a vector to another vector, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\)

\(f(x)_{m \times 1} = \mathbf{A_{m \times n}x_{n \times 1}}\), where,

\(\quad \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \), \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n x 1}}\), \(f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}_{\text{m x 1}}\)

\[\frac{\partial f(x)}{\partial x} = f'(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} _{\text{m x n}} \]



The above matrix is called the Jacobian matrix of \(f(x)\). For the linear map \(f(x) = \mathbf{Ax}\) above, each \(f_i(x) = \sum_{j} a_{ij}x_j\), so \(\frac{\partial f_i}{\partial x_j} = a_{ij}\), and the Jacobian is simply \(\mathbf{A}\).

Assumption: All the first order partial derivatives exist.
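Under the assumption \(f(x) = \mathbf{Ax}\) with a hypothetical 3×2 matrix \(\mathbf{A}\), a finite-difference sketch confirms that the Jacobian of a linear map is the matrix itself:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 2))  # hypothetical A: f maps R^2 -> R^3
f = lambda v: A @ v

x = rng.normal(size=2)
eps = 1e-6
J = np.zeros((3, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    # Column j of the Jacobian holds the partials of all f_i w.r.t. x_j.
    J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(J, A, atol=1e-6))  # → True
```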

Hessian Matrix

Hessian Matrix:
It is a square matrix of second-order partial derivatives of a scalar-valued function.
This is used to characterize the curvature of a function at a given point.

Let, \(f(x)\) be a function that maps a vector to a scalar value, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\)
The Hessian matrix is defined as:

\[Hessian = \frac{\partial^2 f(x)}{\partial x \partial x^T } = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix} _{\text{n x n}} \]


Note: For most functions in Machine Learning, where the second-order partial derivatives are continuous, the Hessian is symmetric.

\( H_{i,j} = H_{j,i} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j } = \frac{\partial^2 f(x)}{\partial x_j \partial x_i } \)

e.g:

  1. \(f(x) = \mathbf{x^TAx}\), where, A is a symmetric matrix, and f(x) is a scalar valued function.

    \(Gradient = \frac{\partial }{\partial x}(\mathbf{x^TAx}) = 2\mathbf{Ax}\), since A is symmetric.

    \(Hessian = \frac{\partial^2 }{\partial x^2}(\mathbf{x^TAx}) = \frac{\partial }{\partial x}2\mathbf{Ax} = 2\mathbf{A}\)

  2. \(f(x,y) = x^2 + y^2\)

    \(Gradient = \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x}\\ \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x+0 \\ \\ 0+2y \end{bmatrix} = \begin{bmatrix} 2x \\ \\ 2y \end{bmatrix}\)

    \(Hessian = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y}\\ \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ \\ 0 & 2 \end{bmatrix}\)
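The second worked example can be checked numerically. A sketch using nested central differences at a hypothetical point \((3, -1)\) (the point is an assumption): the gradient should be \([2x, 2y] = [6, -2]\) and the Hessian \(2I\).

```python
import numpy as np

f = lambda p: p[0] ** 2 + p[1] ** 2  # f(x, y) = x^2 + y^2
p = np.array([3.0, -1.0])            # hypothetical evaluation point
eps = 1e-5

def num_grad(f, p):
    """First-order central differences of a scalar-valued f."""
    g = np.zeros_like(p)
    for k in range(len(p)):
        e = np.zeros_like(p)
        e[k] = eps
        g[k] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

# Hessian = Jacobian of the gradient (central differences of num_grad).
H = np.zeros((2, 2))
for k in range(2):
    e = np.zeros(2)
    e[k] = eps
    H[:, k] = (num_grad(f, p + e) - num_grad(f, p - e)) / (2 * eps)

print(np.round(num_grad(f, p), 4))  # gradient ≈ [6, -2]
print(np.round(H, 4))               # Hessian ≈ 2 * identity
```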

Matrix Derivative

Let, A is a mxn matrix, i.e \(A \in \mathbb{R}^{m \times n}\)

\(\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \)

Let f(A) be a function that maps a matrix to a scalar value, i.e, \(f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}\)
, then derivative of function f(A) w.r.t A is defined as:

\[\frac{\partial f}{\partial A} = f'(A) = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix} _{\text{m x n}} \\[20pt] => (\frac{\partial f}{\partial A})_{(i,j)} = \frac{\partial f}{\partial a_{ij}} \]

e.g.:

  1. Let, A is a square matrix, i.e \(A \in \mathbb{R}^{n \times n}\)
    and f(A) = trace(A) = \(a_{11} + a_{22} + \cdots + a_{nn}\)

    We know that:
    \((\frac{\partial f}{\partial A})_{(i,j)} = \frac{\partial f}{\partial a_{ij}}\)

    Since, the trace only contains diagonal elements,

    => \((\frac{\partial f}{\partial A})_{(i,j)}\) = 0 for all \(i \neq j\)

    similarly, \((\frac{\partial f}{\partial A})_{(i,i)}\) = 1 for all \(i=j\)

    => \( \frac{\partial Tr(A)}{\partial A} = \begin{cases} 1, & \text{if } i=j \\ \\ 0, & \text{if } i \neq j \end{cases} \)

    \(\frac{\partial f}{\partial A} = \frac{\partial Tr(A)}{\partial A} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1& \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I} \)

    Therefore, derivative of trace(A) w.r.t A is an identity matrix.
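Since the trace is linear in the entries of A, the element-wise derivative is easy to verify numerically; a small sketch on a hypothetical 3×3 matrix (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))  # hypothetical square matrix
eps = 1e-6

# Perturb one entry a_ij at a time and measure the change in trace(A).
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        G[i, j] = (np.trace(A + E) - np.trace(A - E)) / (2 * eps)

print(np.allclose(G, np.eye(3)))  # → True: d trace(A)/dA = I
```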



End of Section

2.3.7 - Vector Norms

Vector & Matrix Norms


Vector Norm

It is a measure of the size of a vector or distance from the origin.
Vector norm is a function that maps a vector to a real number, i.e, \({\| \cdot \|} : \mathbb{R}^n \rightarrow \mathbb{R}\).
Vector Norm should satisfy following 3 properties:

  1. Non-Negativity:
    Norm is always greater than or equal to zero,
    \( {\| x \|} \ge 0\), and \( {\| x \|} = 0\), if and only if \(\vec{x} = \vec{0}\).
  2. Homogeneity (or Scaling):
    \( {\| \alpha x \|} = |\alpha| {\| x \|} \).
  3. Triangle Inequality:
    \( {\| x + y \|} \le {\| x \|} + {\| y \|} \).

P-Norm:
It is a generalised form of the most common family of vector norms, also called the Minkowski norm.
It is defined as:

\[ {\| x \|}_p = (\sum_{i=1}^n |x_i|^p)^{1/p} \]


We can change the value of \(p\) to get different norms.

L1 Norm:
It is the sum of absolute values of all the elements of a vector; also known as Manhattan distance.
p=1:

\[ {\| x \|_1} = \sum_{i=1}^n |x_i| = |x_1| + |x_2| + ... + |x_n| \]

L2 Norm:
It is the square root of the sum of squares of all the elements of a vector; also known as Euclidean distance.
p=2:

\[ {\| x \|_2} = (\sum_{i=1}^n x_i^2)^{1/2} = \sqrt{x_1^2 + x_2^2 + ... + x_n^2} \]

L-\(\infty\) Norm:
It is the maximum of absolute values of all the elements of a vector; also known as Chebyshev distance.
p=\(\infty\):

\[ {\| x \|_\infty} = \max |x_i| = \lim_{p \to \infty} (\sum_{i=1}^n |x_i|^p)^{1/p}\]
Example
  1. Let, vector \(\mathbf{x} = \begin{bmatrix} 3 \\ \\ -4 \end{bmatrix}\), then
    \({\| x \|_1} = |3| + |-4| = 7\)
    \({\| x \|_2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5\)
    \({\| x \|_\infty} = max(|3|, |-4|) = max(3, 4) = 4\)
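The same example can be checked with `np.linalg.norm` (a minimal sketch of the worked example above):

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.linalg.norm(x, 1)         # |3| + |-4| = 7
l2 = np.linalg.norm(x, 2)         # sqrt(3^2 + (-4)^2) = 5
linf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4

print(l1, l2, linf)  # → 7.0 5.0 4.0
```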

Matrix Norm

It is a function that assigns non-negative size or magnitude to a matrix.
Matrix Norm is a function that maps a matrix to a non-negative real number, i.e, \({\| \cdot \|} : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}\)
It should satisfy following 3 properties:

  1. Non-Negativity:
    Norm is always greater than or equal to zero,
    \( {\| A \|} \ge 0\), and \( {\| A \|} = 0\), if and only if \(A = 0\) (the zero matrix).
  2. Homogeneity (or Scaling):
    \( {\| \alpha A \|} = |\alpha| {\| A \|} \).
  3. Triangle Inequality:
    \( {\| A + B \|} \le {\| A \|} + {\| B \|} \).

There are 2 types of matrix norms:

  1. Element-wise norms, e.g, Frobenius norm
  2. Vector induced norms


Frobenius Norm:
It is equivalent to the Euclidean norm of the matrix if it were flattened into a single vector.
If A is a matrix of size \(m \times n\), then, Frobenius norm is defined as:

\[ {\| A \|_F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2} = \sqrt{Trace(A^TA)} = \sqrt{\sum_i \sigma_i^2}\]


\(\sigma_i\) is the \(i\)th singular value of matrix A.

Vector Induced Norm:
It measures the maximum stretching a matrix can apply when multiplied with a vector,
where the vector has a unit length under the chosen vector norm.

Matrix Induced by Vector P-Norm:
P-Norm:

\[ {\| A \|_p} = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{{\| x \|_p} =1} {\| Ax \|_p} \]


P=1 Norm:

\[ {\| A \|_1} = \max_{1 \le j \le n } \sum_{i=1}^m |a_{ij}| = \text{ max absolute column sum } \]

P=\(\infty\) Norm:

\[ {\| A \|_\infty} = \max_{1 \le i \le m } \sum_{j=1}^n |a_{ij}| = \text{ max absolute row sum } \]

P=2 Norm:
Also called Spectral norm, i.e, maximum factor by which the matrix can stretch a unit vector in Euclidean norm.

\[ {\| A \|_2} = \sigma_{max}(A) = \text{ max singular value of matrix } \]

Example
  1. Let, matrix \(\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix}\), then find Frobenius norm.

    \({\| A \|_F} = \sqrt{a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2}\)

    \(\mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix}\)

    => \(\mathbf{A^TA} = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11}^2 + a_{21}^2 & a_{11}.a_{12} + a_{21}.a_{22} \\ \\ a_{12}.a_{11} + a_{22}.a_{21} & a_{12}^2 + a_{22}^2 \end{bmatrix} \)

    Therefore, \(Trace(\mathbf{A^TA}) = a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2\)

    => \({\| A \|_F} = \sqrt{Trace(A^TA)} = \sqrt{a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2}\)

  2. Let, matrix \(\mathbf{A} = \begin{bmatrix} 1 & -2 & 3 \\ \\ 4 & 5 & -6 \end{bmatrix}\), then

Column 1 absolute value sum = |1|+|4| = 5
Column 2 absolute value sum = |-2|+|5|= 7
Column 3 absolute value sum = |3|+|-6|= 9

Row 1 absolute value sum = |1|+|-2|+|3| = 6
Row 2 absolute value sum = |4|+|5|+|-6| = 15

\({\| A \|_1} = max(5,7,9) = 9\) = max column absolute value sum.
\({\| A \|_\infty} = max(6,15) = 15\) = max row absolute value sum.

  3. Let, matrix \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix}\), then find spectral norm.
    Spectral norm can be found using the singular value decomposition, in order to get the largest singular value.
    \({\| A \|_2} = \sigma_{max}(A) \)

\(\mathbf{A} = \mathbf{U \Sigma V^T} \), where the columns of \(\mathbf{U}\) are the eigen vectors of \(\mathbf{AA^T} \) and the columns of \(\mathbf{V}\) are the eigen vectors of \(\mathbf{A^TA} \)

Let’s find the largest eigen value of \(\mathbf{A^TA} \), square root of which will give the largest singular value of \(\mathbf{A}\).

\( \mathbf{A^TA} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ \\ 4 & 5 \end{bmatrix} \)

Now, let’s find the eigen values of the above matrix \(\mathbf{A^TA}\):
\(det(\mathbf{A^TA}-\lambda I) = 0 \)

=> \( det\begin{bmatrix} 5 - \lambda & 4 \\ \\ 4 & 5- \lambda \end{bmatrix} = 0 \)

=> \( (5 - \lambda)^2 - 16 = 0 \)
=> \( (5 - \lambda) = \pm 4 \)
=> \( (5 - \lambda) = 4 \) or \( (5 - \lambda) = -4 \)
=> \( \lambda = 1 \) or \( \lambda = 9 \)
=> Largest Singular Value = Square root of largest eigen value = \(\sqrt{9} = 3\)
Therefore, \({\| A \|_2} = \sigma_{max}(A) = 3\)
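The worked examples above (induced 1-norm and ∞-norm, Frobenius norm, spectral norm) can all be reproduced with `np.linalg.norm`; a minimal sketch:

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [4.0,  5.0, -6.0]])
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])

one_norm = np.linalg.norm(A, 1)       # max absolute column sum = 9
inf_norm = np.linalg.norm(A, np.inf)  # max absolute row sum = 15
fro_norm = np.linalg.norm(A, 'fro')   # sqrt(sum of squared entries) = sqrt(91)
spec_norm = np.linalg.norm(B, 2)      # largest singular value = 3

print(one_norm, inf_norm, np.isclose(spec_norm, 3.0))  # → 9.0 15.0 True
```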



End of Section

2.3.8 - Hyperplane

Equation of a Hyperplane


What is the equation of a line ?

Equation of a line is of the form \(y = mx + c\).
To represent a line in 2D space, we need 2 things:

  1. m = slope or direction of the line
  2. c = y-intercept or distance from the origin
images/maths/linear_algebra/line.png
Hyperplane
A hyperplane is a lower (d-1) dimensional sub-space that divides a d-dimensional space into 2 distinct parts.
Equation of a Hyperplane

Similarly, to represent a hyperplane in d-dimensions, we need 2 things:

  1. \(\vec{w}\) = direction of the hyperplane = vector perpendicular to the hyperplane
  2. \(w_0\) = distance from the origin
\[ \pi_d = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0 = 0\\[10pt] \text{ representing 'w' and 'x' as vectors: } \\[10pt] \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d×1}}, \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}_{\text{d×1}} \\[10pt] \pi_d = \mathbf{w}^\top \mathbf{x} + w_0 = 0 \\[10pt] => \mathbf{w} \cdot \mathbf{x} + w_0 = 0 \\[10pt] \text{ dividing both sides by the Euclidean norm of w: } \Vert \mathbf{w} \Vert_2 \\[10pt] \frac{\mathbf{w} \cdot \mathbf{x}}{\Vert \mathbf{w} \Vert} + \frac{w_0}{\Vert \mathbf{w} \Vert} = 0 \\[10pt] \text { since the vector divided by its magnitude is a unit vector, so: } \frac{\mathbf{w}}{\Vert \mathbf{w} \Vert} = \mathbf{\widehat{w}} \\[10pt] \mathbf{\widehat{w}} \cdot \mathbf{x} = \frac{w_1x_1 + w_2x_2 + \dots + w_dx_d}{\sqrt{w_1^2 + w_2^2 + \dots + w_d^2}} \\[10pt] \mathbf{\widehat{w}} \cdot \mathbf{x} + \frac{w_0}{\Vert \mathbf{w} \Vert} = 0 \\[10pt] \mathbf{\widehat{w}} \text{ : is the direction of the hyperplane } \\[10pt] \frac{w_0}{\Vert \mathbf{w} \Vert} \text{ : is the distance from the origin } \]

Note: There can be only 2 directions of the hyperplane, i.e, direction of a unit vector perpendicular to the hyperplane:

  1. Towards the origin
  2. Away from the origin
images/maths/linear_algebra/hyperplane.png
Distance from Origin

If a point ‘x’ is on the hyperplane, then it satisfies the below equation:

\[ \pi_d = \mathbf{w}^\top \mathbf{x} + w_0 = 0 \\ => {\Vert \mathbf{w} \Vert}{\Vert \mathbf{x} \Vert} cos{\theta} + w_0 = 0 \\[10pt] \because d = \text{ distance from the origin } = {\Vert \mathbf{x} \Vert} cos{\theta} \\ => {\Vert \mathbf{w} \Vert}.d = -w_0 \\[10pt] => d = \frac{-w_0}{{\Vert \mathbf{w} \Vert}} \\[10pt] \therefore distance(0, \pi_d) = \frac{-w_0}{\Vert \mathbf{w} \Vert} \]
images/maths/linear_algebra/hyperplane_distance.png

Key Points:

  1. By convention, the direction of the hyperplane is given by a unit vector perpendicular to the hyperplane, i.e, \({\Vert \mathbf{w} \Vert} = 1\), since only the direction matters.
  2. \(w_0\) gives the signed perpendicular distance from the origin.
    \(w_0 = 0\) => Hyperplane passes through the origin.
    \(w_0 < 0\) => Hyperplane is in the same direction of unit vector \(\mathbf{\widehat{w}}\) w.r.t the origin.
    \(w_0 > 0\) => Hyperplane is in the opposite direction of unit vector \(\mathbf{\widehat{w}}\) w.r.t the origin.
Consider a line as a hyperplane in 2D space. Let the unit vector point towards the positive x-axis direction.
What is the direction of the hyperplane w.r.t the origin and the direction of the unit vector ?

Equation of a hyperplane is: \(\pi_d = \mathbf{w}^\top \mathbf{x} + w_0 = 0\)
Let, the equation of the line/hyperplane in 2D be:
\(\pi_d = 1.x + 0.y + w_0 = x + w_0 = 0\)

Case 1: \(w_0 < 0\), say \(w_0 = -5\)
Therefore, equation of hyperplane: \( x - 5 = 0 => x = 5\)
Here, the hyperplane(line) is located in the same direction as the unit vector w.r.t the origin,
i.e, towards the +ve x-axis direction.

Case 2: \(w_0 > 0\), say \(w_0 = 5\)
Therefore, equation of hyperplane: \( x + 5 = 0 => x = -5\)
Here, the hyperplane(line) is located in the opposite direction as the unit vector w.r.t the origin,
i.e, towards the -ve x-axis direction.

images/maths/linear_algebra/hyperplane_example.png

Half Spaces

A hyperplane divides a space into 2 distinct parts called half-spaces.
e.g.: A 2D hyperplane divides a 3D space into 2 distinct parts.
Similar example in real world: A wall divides a room into 2 distinct spaces.

Positive Half-Space:
A half space that is in the same direction as the unit vector w.r.t the origin.

Negative Half-Space:
A half space that is in the opposite direction as the unit vector w.r.t the origin.

\[ \text { If point 'x' is on the hyperplane, then:} \\[10pt] \pi_d = \mathbf{w}^\top \mathbf{x} + w_0 = 0 \\[10pt] \mathbf{w}^\top \mathbf{x_1} + w_0 > 0 ~or~ \mathbf{w}^\top \mathbf{x_1} + w_0 < 0 \quad ? \\[10pt] \text { Distance of } x_1 < x_1\prime \text{ as } x_1 \text{ is between the origin and the hyperplane and } x_1\prime \text { lies on the hyperplane } \\[10pt] => \tag{1}\Vert \mathbf{x_1} \Vert < \Vert \mathbf{x_1\prime} \Vert \]

\[ \mathbf{w}^\top \mathbf{x_1\prime} + w_0 = 0 \\[10pt] => \tag{2} \Vert \mathbf{w} \Vert \Vert \mathbf{x_1\prime} \Vert cos{\theta} + w_0 = 0 \]

from equations (1) & (2), we can say that:

\[ \Vert \mathbf{w} \Vert \Vert \mathbf{x_1} \Vert cos{\theta} + w_0 < \Vert \mathbf{w} \Vert \Vert \mathbf{x_1\prime} \Vert cos{\theta} + w_0 \]

Everything is the same on both sides except \(\Vert \mathbf{x_1}\Vert\) and \(\Vert \mathbf{x_1\prime}\Vert\), so:

\[ \mathbf{w}^\top \mathbf{x_1} + w_0 < 0 \]

i.e, negative half-space, opposite to the direction of unit vector or towards the origin.
Similarly,

\[ \mathbf{w}^\top \mathbf{x_2} + w_0 > 0 \]

i.e, positive half-space, same as the direction of unit vector or away from the origin.

The signed distance of any point \(x\prime\) from the hyperplane is:

\[ d_{\pi_d} = \frac{\mathbf{w}^\top \mathbf{x\prime} + w_0}{\Vert \mathbf{w}\Vert} \]
images/maths/linear_algebra/half_spaces.png
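The signed-distance and half-space rules above can be sketched in NumPy, reusing the 2-D example \(x - 5 = 0\) from this section; the test points are hypothetical:

```python
import numpy as np

# Hyperplane w.x + w0 = 0 with w = (1, 0), w0 = -5, i.e. the line x = 5.
w = np.array([1.0, 0.0])
w0 = -5.0

def signed_distance(x, w, w0):
    """Signed perpendicular distance of point x from the hyperplane."""
    return (w @ x + w0) / np.linalg.norm(w)

p1 = np.array([2.0, 3.0])   # between origin and plane -> negative half-space
p2 = np.array([8.0, -1.0])  # beyond the plane -> positive half-space

print(signed_distance(p1, w, w0))  # → -3.0
print(signed_distance(p2, w, w0))  # → 3.0
```

A negative sign places the point in the negative half-space (towards the origin here), a positive sign in the positive half-space, matching the discussion above.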
Applications of Equation of Hyperplane

The above concept of equation of hyperplane will be very helpful when we discuss the following topics later:

  1. Logistic Regression
  2. Support Vector Machines



End of Section

2.4 - Calculus

Calculus for AI & ML

2.4.1 - Calculus Fundamentals

Calculus Fundamentals
Integration

Integration is a mathematical tool that is used to find the area under a curve.

\[ \text{Area under the curve = } \int_{a}^{b} f(x) dx \\[10pt] and ~ \int x^n dx = \frac{x^{n+1}}{n+1} + C, \text{ where C is constant.} \\ \]

Let’s understand integration with the help of few simple examples for finding area under a curve:

  1. Area of a triangle:
images/maths/calculus/fundamentals/integration_triangle.png
\[ \text{Area of } \triangle ~ABC = \frac{1}{2} \times base \times height = \frac{1}{2} \times 3 \times 3 = 9/2 = 4.5 \\[10pt] \text{Area of } \triangle ~ABC = \int_0^3 f(x) dx = \int_0^3 x dx = \frac{x^2}{2} \big|_0^3 = \frac{3^2 - 0^2}{2} = 9/2 = 4.5 \]
  2. Area of a rectangle:
images/maths/calculus/fundamentals/integration_rectangle.png
\[ \text{Area of rectangle ABCD} = length \times breadth = 4 \times 3 = 12 \\[10pt] \text{Area of rectangle ABCD} = \int_1^5 f(x) dx = \int_1^5 3 dx = 3x \big|_1^5 = 3(5-1) = 12 \]


Note: The above examples were straightforward, since we know a direct formula for the area.
But what if we have a shape for which we do NOT know a ready-made formula? How do we then calculate the area under the curve?
Let’s see an example:

3. Area of a part of parabola:

images/maths/calculus/fundamentals/integration_parabola.png
\[ \text{Area under curve} = \int_{-2}^2 f(x) dx = \int_{-2}^2 x^2 dx = \frac{x^3}{3} \big|_{-2}^2 = \frac{(2)^3 - (-2)^3}{3} = \frac{8 - (-8)}{3} = \frac{16}{3} \]
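When no closed-form antiderivative is handy, the same area can be approximated numerically with a Riemann sum; a minimal sketch for \(\int_{-2}^{2} x^2\,dx = 16/3\):

```python
import numpy as np

def riemann(f, a, b, n=1_000_000):
    """Left-endpoint Riemann-sum approximation of the integral of f on [a, b]."""
    x = np.linspace(a, b, n, endpoint=False)  # left endpoints of n sub-intervals
    dx = (b - a) / n
    return np.sum(f(x)) * dx

area = riemann(lambda x: x ** 2, -2.0, 2.0)
print(round(area, 4))  # → 5.3333, i.e. 16/3
```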
Differentiation

Differentiation is a mathematical tool that is used to find the derivative or rate of change of a function at a specific point.

  • \(\frac{dy}{dx} = f\prime(x) = tan\theta\) = Derivative = Slope = Gradient
  • Derivative tells us how fast the function is changing at a specific point in relation to another variable.

Note: For a line, the slope is constant, but for a parabola, the slope is changing at every point.

How do we calculate the slope at a given point?

images/maths/calculus/fundamentals/tangent_secant.png



Let AB be a secant on the parabola, i.e, a line connecting any 2 points on the curve.
Slope of secant = \(tan\theta = \frac{\Delta y}{\Delta x} = \frac{y_2-y_1}{x_2-x_1}\)
As \(\Delta x \rightarrow 0\), secant will become a tangent to the curve, i.e, the line will touch the curve only at 1 point.
\(dx = \Delta x \rightarrow 0\)
\(x_2 = x_1 + \Delta x \)
\(y_2 = f(x_2) = f(x_1 + \Delta x) \)
Slope at \(x_1\) =

\[ tan \theta = \frac{y_2-y_1}{x_2-x_1} = \frac{f(x_1 + \Delta x) - f(x_1)}{(x_1 + \Delta x)-x_1} \\[10pt] \therefore tan \theta = \lim_{\Delta x \rightarrow 0} \frac{f(x_1 + \Delta x) - f(x_1)}{\Delta x} \\[10pt] Generally, ~ slope = ~ tan \theta = \frac{dy}{dx} = \lim_{\Delta x \rightarrow 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]


e.g.:
\( y = f(x) = x^2\), find the derivative of f(x) w.r.t x.

\[ \begin{aligned} \frac{dy}{dx} &= \lim_{\Delta x \rightarrow 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\[10pt] &= \lim_{\Delta x \rightarrow 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x} \\[10pt] &= \lim_{\Delta x \rightarrow 0} \frac{\cancel {x^2} + (\Delta x)^2 + 2x\Delta x - \cancel {x^2}}{\Delta x} \\[10pt] &= \lim_{\Delta x \rightarrow 0} \frac{\cancel {\Delta x}(\Delta x + 2x)}{\cancel {\Delta x}} \\[10pt] \text {applying limit: } \\ \therefore \frac{dy}{dx} &= 2x \\[10pt] \end{aligned} \]
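The limit definition above can be mimicked numerically with a small but finite \(\Delta x\). A quick Python sketch (the `derivative` helper is our own name):

```python
# Finite-difference approximation of dy/dx, mirroring the limit definition
# with a small but finite Δx (central difference for better accuracy).

def derivative(f, x, dx=1e-6):
    """Central difference: (f(x + dx) - f(x - dx)) / (2*dx)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

# For f(x) = x^2 the analytic derivative is 2x, so at x = 3 we expect ~6.
print(derivative(lambda x: x * x, 3.0))  # ≈ 6.0
```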
Rules of Differentiation

We will understand a few important rules of differentiation that are most frequently used in Machine Learning.

  1. Sum Rule:

    \[ \frac{d}{dx} (f(x) + g(x)) = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) = f\prime(x) + g\prime(x) \]
  2. Product Rule:

    \[ \frac{d}{dx} (f(x).g(x)) = \frac{d}{dx} f(x).g(x) + f(x).\frac{d}{dx} g(x) = f\prime(x).g(x) + f(x).g\prime(x) \]

    e.g.:
    \( h(x) = x^2 sin(x) \), find the derivative of h(x) w.r.t x.
    Let, \(f(x) = x^2 , g(x) = sin(x)\).

    \[ h(x) = f(x).g(x) \\[10pt] => h\prime(x) = f\prime(x).g(x) + f(x).g\prime(x) \\[10pt] => h\prime(x) = 2x.sin(x) + x^2.cos(x) \\[10pt] \]
  3. Quotient Rule:

    \[ \frac{d}{dx} \frac{f(x)}{g(x)} = \frac{f\prime(x).g(x) - f(x).g\prime(x)}{(g(x))^2} \]

    e.g.:
    \( h(x) = sin(x)/x \), find the derivative of h(x) w.r.t x.
    Let, \(f(x) = sin(x) , g(x) = x\).

    \[ h(x) = \frac{f(x)}{g(x)} \\[10pt] => h\prime(x) = \frac{f\prime(x).g(x) - f(x).g\prime(x)}{(g(x))^2} \\[10pt] => h\prime(x) = \frac{cos(x).x - sin(x)}{x^2} \\[10pt] \]
  4. Chain Rule:

    \[ \frac{d}{dx} (f(g(x))) = f\prime(g(x)).g\prime(x) \]

    e.g.:
    \( h(x) = log(x^2) \), find the derivative of h(x) w.r.t x.
    Let, \( u = x^2 \)

    \[ h(x) = log(u) \\[10pt] => h\prime(x) = \frac{d h(x)}{du} \cdot \frac{du}{dx} \\[10pt] => h\prime(x) = \frac{1}{u} \cdot 2x = \frac{2x}{x^2} \\[10pt] => h\prime(x) = \frac{2}{x} \]
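These rules can be sanity-checked numerically. For instance, a quick Python check (helper names are our own) that the chain-rule result \(h'(x) = 2/x\) agrees with a finite-difference estimate:

```python
import math

# Numerical sanity check of the chain-rule result h'(x) = 2/x
# for h(x) = log(x^2), using a central finite difference.

def derivative(f, x, dx=1e-6):
    return (f(x + dx) - f(x - dx)) / (2 * dx)

x = 5.0
numeric = derivative(lambda t: math.log(t * t), x)
analytic = 2.0 / x
print(numeric, analytic)  # both ≈ 0.4
```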

Now, let’s dive deeper and understand the concepts required for differentiation, such as limits, continuity, and differentiability.

Limits

Limit of a function f(x) at any point ‘c’ is the value that f(x) approaches, as x gets very close to ‘c’,
but NOT necessarily equal to ‘c’.

One-sided limit: value of the function, as it approaches a point ‘c’ from only one direction, either left or right.
Two-sided limit: value of the function, as it approaches a point ‘c’ from both directions, left and right, simultaneously.

e.g.:

  1. \(f(x) = \frac{1}{x}\), find the limit of f(x) at x = 0.
    Let’s check for one-sided limit at x=0:
    \[ \lim_{x \rightarrow 0^+} \frac{1}{x} = + \infty \\[10pt] \lim_{x \rightarrow 0^-} \frac{1}{x} = - \infty \\[10pt] so, \lim_{x \rightarrow 0^+} \frac{1}{x} \ne \lim_{x \rightarrow 0^-} \frac{1}{x} \\[10pt] => \text{ limit does NOT exist at } x = 0. \]
images/maths/calculus/fundamentals/limit_1_by_x.png



2. \(f(x) = x^2\), find the limit of f(x) at x = 0.
Let’s check for one-sided limit at x=0:

\[ \lim_{x \rightarrow 0^+} x^2 = 0 \\[10pt] \lim_{x \rightarrow 0^-} x^2 = 0 \\[10pt] so, \lim_{x \rightarrow 0^+} x^2 = \lim_{x \rightarrow 0^-} x^2 \\[10pt] => \text{ limit exists at } x = 0. \]
images/maths/calculus/fundamentals/parabola_convex.png



Note: Two-Sided Limit

\[ \lim_{x \rightarrow a^+} f(x) = \lim_{x \rightarrow a^-} f(x) = \lim_{x \rightarrow a} f(x) \]


3. f(x) = |x|, find the limit of f(x) at x = 0.
Let’s check for one-sided limit at x=0:

\[ \lim_{x \rightarrow 0^+} |x| = x = 0 \\[10pt] \lim_{x \rightarrow 0^-} |x| = -x = 0 \\[10pt] so, \lim_{x \rightarrow 0^+} |x| = \lim_{x \rightarrow 0^-} |x| \\[10pt] => \text{ limit exists at } x = 0. \]
images/maths/calculus/fundamentals/abs_x.png
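One-sided limits can also be probed numerically, by evaluating the function at points approaching ‘c’ from each side. A small Python sketch for f(x) = |x| at x = 0:

```python
# Probing the one-sided limits of f(x) = |x| at x = 0:
# approach 0 from the right (+h) and from the left (-h) with shrinking h.

f = abs
right = [f(10.0 ** -k) for k in range(1, 6)]      # x -> 0+
left  = [f(-(10.0 ** -k)) for k in range(1, 6)]   # x -> 0-
print(right)  # values shrink towards 0
print(left)   # same values, so the two-sided limit exists and equals 0
```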
Continuity

A function f(x) is said to be continuous at a point ‘c’, if its graph can be drawn through that point, without lifting the pen.
Continuity bridges the gap between the function’s value at the given point and the limit.


Conditions for Continuity:
A function f(x) is continuous at a point ‘c’, if and only if, all the below 3 conditions are met:

  1. f(x) must be defined at point ‘c’.
  2. Limit of f(x) must exist at point ‘c’, i.e, left and right limits must be equal. \[ \lim_{x \rightarrow c^+} f(x) = \lim_{x \rightarrow c^-} f(x) \]
  3. Value of f(x) at ‘c’ must be equal to its limit at ‘c’. \[ \lim_{x \rightarrow c} f(x) = f(c) \]

e.g.:

  1. \(f(x) = \frac{1}{x}\) is NOT continuous at x = 0, since f(x) is not defined at x = 0.
images/maths/calculus/fundamentals/limit_1_by_x.png
  2. \(f(x) = |x|\) is continuous everywhere.
images/maths/calculus/fundamentals/abs_x.png
  3. \(f(x) = tan x \) is discontinuous at infinitely many points.
images/maths/calculus/fundamentals/tan_x.png
Differentiability

A function is differentiable at a point ‘c’, if derivative of the function exists at that point.
A function must be continuous at the given point ‘c’ to be differentiable at that point.
Note: A function can be continuous at a given point, but NOT differentiable at that point.

\[ f\prime(x) = \lim_{\Delta x \rightarrow 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]

e.g.:
We know that \( f(x) = |x| \) is continuous at x=0, but it’s NOT differentiable at x=0.

\[ f\prime(0) = \lim_{\Delta x \rightarrow 0} \frac{f(0 + \Delta x) - f(0)}{\Delta x} = \lim_{\Delta x \rightarrow 0} \frac{|\Delta x|}{\Delta x} \\[10pt] \text{ let's calculate the one-sided limits from the left and right and check if they are equal: } \\[10pt] \lim_{\Delta x \rightarrow 0^+} \frac{|\Delta x|}{\Delta x} = \lim_{\Delta x \rightarrow 0^+} \frac{\Delta x}{\Delta x} = 1 \\[10pt] \lim_{\Delta x \rightarrow 0^-} \frac{|\Delta x|}{\Delta x} = \lim_{\Delta x \rightarrow 0^-} \frac{-\Delta x}{\Delta x} = -1 \\[10pt] => \text{ left hand limit } (-1) \ne \text{ right hand limit } (1) \\[10pt] => f\prime(0) \text{ does NOT exist.} \]
Maxima & Minima

Critical Point:
A point of the function where the derivative is either zero or undefined.

These critical points are candidates for local maxima or minima, which are the highest and lowest points in a function’s immediate neighborhood, respectively.

Maxima:
Highest point w.r.t immediate neighbourhood.
f’(x)/gradient/slope changes from +ve to 0 to -ve, therefore, change in f’(x) is -ve.
=> f’’(x) < 0

Let, \(f(x) = -x^2; \quad f'(x) = -2x; \quad f''(x) = -2 < 0 => maxima\)

| x  | f’(x) |
|----|-------|
| -1 | 2     |
| 0  | 0     |
| 1  | -2    |
images/maths/calculus/fundamentals/parabola_concave.png

Minima:
Lowest point w.r.t immediate neighbourhood.
f’(x)/gradient/slope changes from -ve to 0 to +ve, therefore, change in f’(x) is +ve.
=> f’’(x) > 0

Let, \(f(x) = x^2; \quad f'(x) = 2x; \quad f''(x) = 2 > 0 => minima\)

| x  | f’(x) |
|----|-------|
| -1 | -2    |
| 0  | 0     |
| 1  | 2     |
images/maths/calculus/fundamentals/parabola_convex.png

e.g.:

  1. Let \(f(x) = 2x^3 + 5x^2 + 3 \), find the maxima and minima of f(x).
    To find the maxima and minima, let's take the derivative of the function and equate it to zero.
    \[ f'(x) = 6x^2 + 10x = 0\\[10pt] => x(6x+10) = 0 \\[10pt] => x = 0 \quad or \quad x = -10/6 = -5/3 \\[10pt] \text{ let's check the second order derivative to find which point is a maxima and which a minima: } \\[10pt] f''(x) = 12x + 10 \\[10pt] => at ~ x = 0, \quad f''(x) = 12*0 + 10 = 10 >0 \quad => minima \\[10pt] => at ~ x = -5/3, \quad f''(x) = 12*(-5/3) + 10 = -20 + 10 = -10<0 \quad => maxima \\[10pt] \]
images/maths/calculus/fundamentals/maxima_minima.png
  2. \(f(x,y) = z = x^2 + y^2\), find the minima of f(x,y).
    Since this is a multi-variable function, we will use vectors and matrices for the calculation.
    \[ Gradient = \nabla f_z = \begin{bmatrix} \frac{\partial f_z}{\partial x} \\ \\ \frac{\partial f_z}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x \\ \\ 2y \end{bmatrix} = \begin{bmatrix} 0 \\ \\ 0 \end{bmatrix} \\[10pt] => x=0, y=0 \text{ is a point of optima for } f(x,y) \]

Partial Derivative:
Partial derivative \( \frac{\partial f(x,y)}{\partial x} ~or~ \frac{\partial f(x,y)}{\partial y} \) is the rate of change or derivative of a multi-variable function w.r.t. one of its variables, while all the other variables are held constant.

Let’s continue solving the above problem and calculate the Hessian, i.e., the 2nd order derivative of f(x,y):

\[ Hessian = H_z = \begin{bmatrix} \frac{\partial^2 f_z}{\partial x^2} & \frac{\partial^2 f_z}{\partial x \partial y} \\ \\ \frac{\partial^2 f_z}{\partial y \partial x} & \frac{\partial^2 f_z}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ \\ 0 & 2 \end{bmatrix} \]


Since, determinant of Hessian = 4 > 0 and \( \frac{\partial^2 f_z}{\partial x^2} > 0\) => (x=0, y=0) is a point of minima.

images/maths/calculus/fundamentals/parabolloid.png

Hessian Interpretation:

  • Minima: If det(Hessian) > 0 and \( \frac{\partial^2 f(x,y)}{\partial x^2} > 0\)
  • Maxima: If det(Hessian) > 0 and \( \frac{\partial^2 f(x,y)}{\partial x^2} < 0\)
  • Saddle Point: If det(Hessian) < 0
  • Inconclusive: If det(Hessian) = 0, need to perform other tests.
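The interpretation above can be turned into a tiny classifier for 2x2 Hessians. A Python sketch (illustrative only), applied to the Hessian of \(x^2 + y^2\):

```python
# Classifying the critical point (0, 0) of f(x, y) = x^2 + y^2
# using the 2x2 Hessian rules listed above.
H = [[2.0, 0.0],
     [0.0, 2.0]]           # Hessian of x^2 + y^2

det_H = H[0][0] * H[1][1] - H[0][1] * H[1][0]   # determinant
fxx = H[0][0]                                    # d^2f/dx^2

if det_H > 0 and fxx > 0:
    label = "minima"
elif det_H > 0 and fxx < 0:
    label = "maxima"
elif det_H < 0:
    label = "saddle point"
else:
    label = "inconclusive"
print(det_H, label)  # 4.0 minima
```

Swapping in the Hessian of \(x^2 - y^2\) (i.e., `[[2, 0], [0, -2]]`) makes `det_H = -4` and the same rules report a saddle point.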
Saddle Point

Saddle Point is a critical point where the function is a maximum w.r.t. one variable,
and a minimum w.r.t. another.

e.g.:
Let, \(f(x,y) = z = x^2 - y^2\), find the point of optima for f(x,y).

\[ Gradient = \nabla f_z = \begin{bmatrix} \frac{\partial f_z}{\partial x} \\ \\ \frac{\partial f_z}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x \\ \\ -2y \end{bmatrix} = \begin{bmatrix} 0 \\ \\ 0 \end{bmatrix} \\[10pt] => x=0, y=0 \text{ is a point of optima for } f(x,y) \]\[ Hessian = H_z = \begin{bmatrix} \frac{\partial^2 f_z}{\partial x^2} & \frac{\partial^2 f_z}{\partial x \partial y} \\ \\ \frac{\partial^2 f_z}{\partial y \partial x} & \frac{\partial^2 f_z}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ \\ 0 & -2 \end{bmatrix} \]


Since, determinant of Hessian = -4 < 0 => (x=0, y=0) is a saddle point.

images/maths/calculus/fundamentals/saddle_point_1.png
images/maths/calculus/fundamentals/saddle_point_2.png



End of Section

2.4.2 - Optimization

Loss Function, Convexity & Optimization


Whenever we build a Machine Learning model, we try to ensure that the model makes the fewest mistakes in its predictions.
How do we measure and minimize these mistakes in the predictions made by the model?

To measure how wrong the predictions made by a Machine Learning model are, every model is formulated as
minimizing a loss function.
Loss Function

Loss Function:
It quantifies error of a single data point in a dataset.
e.g.: Squared Loss, Hinge Loss, Absolute Loss, etc, for a single data point.

Cost Function:
It is the average of all losses over the entire dataset.
e.g.: Mean Squared Error(MSE), Mean Absolute Error(MAE), etc.

Objective Function:
It is the over-arching objective of an optimization problem, representing the function that is minimized or maximized.
e.g.: Minimize MSE.

Let’s understand this through an example:
Task:
Predict the price of a company’s stock based on its historical data.

Objective Function:
Minimize the difference between actual and predicted price.
Let, \(y\): original or actual price
\(\hat y\): predicted price

Say, the dataset has ’n’ such data points.
Loss Function:
loss = \( y - \hat y \) for a single data point.
We want to minimize the loss for all ’n’ data points.

Cost Function:
We want to minimize the average/total loss over all ’n’ data points.
So, what are the ways to do this?

  1. We take a simple sum of all the losses, but this can be misleading, as loss for a
    single data point can be +ve or -ve, we can get a net-zero loss even for very large losses, when we sum them all.
  2. We take the sum of the absolute value of each loss, i.e., \( |y - \hat y| \); this way the losses will not cancel each other out.
    But the absolute value function is NOT differentiable at zero error (i.e., where \(y = \hat y\)), and this can cause issues in optimization algorithms, such as gradient descent.
    Read more about Differentiability

  3. So, we choose squared loss, i.e, \( (y - \hat y)^2 \), this solves the above issues.

    Note: In general, we refer to the cost function as the loss function also, the terms are used interchangeably.

    Cost = Loss = Mean Squared Error(MSE) \[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2 \] The task is to minimize the above loss.
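As a concrete illustration, the MSE cost over a tiny made-up dataset (values are arbitrary) in Python:

```python
# Computing the MSE cost over a small toy dataset, as defined above:
# average of squared per-point losses (y_i - y_hat_i)^2.

def mse(y_true, y_pred):
    """Mean Squared Error over the whole dataset."""
    n = len(y_true)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n

y     = [3.0, 5.0, 2.0]   # actual values (arbitrary)
y_hat = [2.5, 5.0, 3.0]   # predicted values (arbitrary)
print(mse(y, y_hat))       # (0.25 + 0 + 1) / 3 ≈ 0.4167
```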

Key Points:

  1. Loss is the bridge between ‘data’ and ‘optimization’.
  2. Good loss functions are differentiable and convex.
Convexity

Convexity:
It refers to a property of a function where the line segment connecting any two points on its graph lies above or on the graph itself.
A convex function curves upwards.
Its domain is always a convex set.

images/maths/calculus/optimization/convex.png
images/maths/calculus/optimization/non_convex.png

Convex Set:
A convex set is a set of points in which the straight line segment connecting any two points in the set lies entirely within that set.
A set \(C\) is convex if for any two points \(x\) and \(y\) in \(C\), the convex combination
\(\theta x+(1-\theta )y\) is also in \(C\) for all values of \(\theta \) where \(0\le \theta \le 1\).

A function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is convex if for all values of \(x,y\) and \(0\le \theta \le 1\),

\[ f(\theta x + (1-\theta )y) \le \theta f(x) + (1-\theta )f(y) \]
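The defining inequality can be spot-checked numerically. A Python sketch testing it for the convex function \(f(x) = x^2\) at random points and mixing weights:

```python
import random

# Spot-checking the convexity inequality
#   f(θx + (1-θ)y) <= θ·f(x) + (1-θ)·f(y)
# for f(x) = x^2 at random points x, y and weights θ in [0, 1].

f = lambda x: x * x
random.seed(0)
holds = all(
    f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12  # tolerance for float error
    for x, y, t in ((random.uniform(-10, 10), random.uniform(-10, 10), random.random())
                    for _ in range(1000))
)
print(holds)  # True: x^2 is convex
```

Replacing `f` with a non-convex function such as `lambda x: -x * x` makes the check fail, as expected.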

Second-Order Test:
If a function is twice differentiable, i.e, 2nd derivative exists, then the function is convex, if and only if, the Hessian is positive semi-definite for all points in its domain.

Read more about Hessian

Positive Definite:
A symmetric matrix is positive definite if and only if:

  1. Eigenvalues are all strictly positive, or
  2. For any non-zero vector \(z\), the quadratic form \(z^THz > 0\)

Note: If the Hessian is positive definite, then the function is strictly convex; it has upward curvature in all directions.

Positive Semi-Definite:
A symmetric matrix is positive semi-definite if and only if:

  1. Eigenvalues are all non-negative (i.e, greater than or equal to zero), or
  2. For any non-zero vector \(z\), the quadratic form \(z^THz \ge 0\)

Note: If the Hessian is positive semi-definite but not positive definite, then the function is convex but not strictly convex; it is flat in some directions.

Read more about Eigen Values
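For a symmetric 2x2 matrix \(\begin{bmatrix} a & b \\ b & c \end{bmatrix}\), the eigenvalues have a simple closed form, which makes the definiteness tests easy to apply. A Python sketch, applied to the two Hessians seen earlier:

```python
import math

# Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] via the closed form
# λ = (a+c)/2 ± sqrt(((a-c)/2)^2 + b^2); definiteness follows from their signs.

def eigvals_2x2(a, b, c):
    mean = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - spread, mean + spread

lo, hi = eigvals_2x2(2.0, 0.0, 2.0)      # Hessian of x^2 + y^2
print(lo, hi)                             # both positive -> positive definite
lo2, hi2 = eigvals_2x2(2.0, 0.0, -2.0)   # Hessian of x^2 - y^2
print(lo2, hi2)                           # mixed signs -> indefinite (saddle)
```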

Optimization

All machine learning algorithms minimize loss (mostly), so we need to find the optimum parameters for the model that minimizes the loss.
This is an optimization problem, i.e., finding the best solution from a set of alternatives.
Note: Convexity ensures that the loss function has only one minimum, i.e., any local minimum is the global minimum.

Optimization:
It is the iterative procedure of finding the optimum parameter \(x^*\) that minimizes the loss function f(x).

\[ x^* = \underset{x}{\mathrm{argmin}}\ f(x) \]


Let’s formulate an optimization problem for a model to minimize the MSE loss function discussed above:

\[ Loss = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2 \\[10pt] \text { We need to find the weights } w, w_0 \text{ of the model that minimize our MSE loss: } \\[10pt] \underset{w, w_0}{\mathrm{argmin}}\ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2 \]

Note: To minimize the loss, we want \(y_i, \hat y_i\) to be as close as possible, for that we want to find the optimum weights \(w, w_0\) of the model.

images/maths/calculus/optimization/optimization_minima.png

Important:
Deep Learning models have non-convex loss functions, so it is challenging to reach the global minimum; in practice, a good local minimum is also an acceptable solution.

Read more about Maxima-Minima

Constrained Optimization

Constrained Optimization:
It is an optimization process to find the best possible solution (min or max), but within a set of limitations or restrictions called constraints.
Constraints limit the range of acceptable values; they can be equality constraints or inequality constraints.
e.g.:
Minimize f(x) subject to following constraints:
Equality Constraints: \( g_i(x) = c_i \forall ~i \in \{1,2,3, \ldots, n\} \)
Inequality Constraints: \( h_j(x) \le d_j \forall ~j \in \{1,2,3, \ldots, m\}\)


Lagrangian Method:
Lagrangian method converts a constrained optimization problem to an unconstrained optimization problem, by introducing a new variable called Lagrange multiplier (\(\lambda\)).

Note: Addition of Lagrangian function that incorporates the constraints, makes the problem solvable using standard calculus.

e.g.:
Let f(x) be the objective function with single equality constraint \(g(x) = c\),
then the Lagrangian function \( \mathcal{L}\) is defined as:

\[ \mathcal {L}(\lambda, x) = f(x) - \lambda(g(x) - c) \]

Now, the above constrained optimization problem becomes an unconstrained optimization problem:

\[ \underset{x^*}{\mathrm{argmin}}\ \mathcal{L}(\lambda, x) = \underset{x^*}{\mathrm{argmin}}\ f(x) - \lambda(g(x) - c) \]

By solving the above unconstrained optimization problem, we get the optimum solution for the original constrained problem.

Find the point on the line 2x + 3y = 13 that is closest to the origin.

Objective: To minimize the distance between point (x,y) on the line 2x + 3y = 13 and the origin (0,0).
distance, d = \(\sqrt{(x-0)^2 + (y-0)^2}\)
=> Objective function = minimize distance = \( \underset{x^*, y^*}{\mathrm{argmin}}\ f(x,y) = \underset{x^*, y^*}{\mathrm{argmin}}\ x^2+y^2\)
Constraint: Point (x,y) must be on the line 2x + 3y = 13.
=> Constraint (equality) function = \(g(x,y) = 2x + 3y - 13 = 0\)
Lagrangian function =

\[ \mathcal{L}(\lambda, x, y) = f(x,y) - \lambda(g(x,y)) \\[10pt] => \mathcal{L}(\lambda, x, y) = x^2+y^2 - \lambda(2x + 3y - 13) \]

To find the optimum solution, we solve the below unconstrained optimization problem.

\[ \underset{x^*, y^*, \lambda}{\mathrm{argmin}}\ \mathcal{L}(\lambda, x, y) = \underset{x^*, y^*, \lambda}{\mathrm{argmin}}\ x^2+y^2 - \lambda(2x + 3y - 13) \]

Take the derivative and equate it to zero.
Since, it is multi-variable function, we take the partial derivatives, w.r.t, x, y and \(\lambda\).

\[ \tag{1} \frac{\partial}{\partial x} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial x} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] => 2x - 2\lambda = 0 \\[10pt] => x = \lambda \]


\[ \frac{\partial}{\partial y} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial y} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] \tag{2} => 2y - 3\lambda = 0 \\[10pt] => y = \frac{3}{2} \lambda \]


\[ \frac{\partial}{\partial \lambda} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial \lambda} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] \tag{3} => -2x -3y + 13 = 0 \]


Now, we have 3 variables and 3 equations (1), (2) and (3), lets solve them.

\[ -2x -3y + 13 = 0 \\[10pt] => 2x + 3 y = 13 \\[10pt] => 2*\lambda + 3*\frac{3}{2} \lambda = 13 \\[10pt] => \lambda(2+9/2) = 13 \\[10pt] => \lambda = 13 * \frac{2}{13} \\[10pt] => \lambda = 2 => x = \lambda = 2 \\[10pt] => y = \frac{3}{2} \lambda = \frac{3}{2} * 2 = 3\\[10pt] => x = 2, y = 3 \]

Hence, (x=2, y=3) is the point on the line 2x + 3y = 13 that is closest to the origin.
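The result can be sanity-checked numerically by scanning points along the constraint line. A rough Python sketch (the grid resolution is an arbitrary choice):

```python
# Sanity check of the Lagrangian solution: among points on 2x + 3y = 13,
# (x, y) = (2, 3) should give the smallest squared distance x^2 + y^2.

def sq_dist_on_line(x):
    y = (13 - 2 * x) / 3.0   # constraint 2x + 3y = 13 solved for y
    return x * x + y * y

# Coarse scan over x values on the line, step 0.01.
best_x = min((x / 100.0 for x in range(-500, 1000)), key=sq_dist_on_line)
print(best_x, sq_dist_on_line(best_x))  # ≈ 2.0 and 13.0 (= 2^2 + 3^2)
```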

Note

To solve the optimization problem, there are many methods, such as the analytical method, which gives the normal equation for linear regression; we will discuss that method in detail later, once we have understood what linear regression is.

Normal Equation for linear regression:

\[ w^* = (X^TX)^{-1}X^Ty \]

X: Feature variables
y: Vector of all observed target values
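As a preview, the normal equation can be sketched in a few lines of NumPy on toy data generated from y = 2x + 1 (the data and variable names are illustrative):

```python
import numpy as np

# A minimal sketch of the normal equation w* = (XᵀX)⁻¹ Xᵀ y on toy data
# generated from y = 2x + 1; a bias column of ones is appended to X.

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
X = np.column_stack([x, np.ones_like(x)])   # [feature, bias]

w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # ≈ [2. 1.]  (slope, intercept)
```

In practice one would use `np.linalg.lstsq` or solve the linear system directly rather than forming the explicit inverse; the sketch just mirrors the formula.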



End of Section

2.4.3 - Gradient Descent

Gradient Descent for Optimization
Gradient Based Optimization

Till now, we have understood how to formulate a minimization problem as a mathematical optimization problem.
Now, let’s take a step forward and understand how to solve these optimization problems.

We will focus on two important iterative gradient based methods:

  1. Gradient Descent: First order method, uses only the gradient.
  2. Newton’s Method: Second order method, uses both the gradient and the Hessian.

Gradient Descent:
It is a first order iterative optimization algorithm that is used to find the local minimum of a differentiable function.
It iteratively adjusts the parameters of the model in the direction opposite to the gradient of cost function, since moving opposite to the direction of gradient leads towards the minima.

Algorithm:

  1. Initialize the weights/parameters with random values.

  2. Calculate the gradient of the cost function at current parameter values.

  3. Update the parameters using the gradient.

    \[ w_{new} = w_{old} - \eta \cdot \frac{\partial f}{\partial w_{old}} \\[10pt] \eta: \text{ learning rate or step size to take for each parameter update} \]
  4. Repeat steps 2 and 3 iteratively until convergence (to minima).

images/maths/calculus/optimization/gradient_descent.png
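The four steps above can be sketched in a few lines of Python, minimizing the toy function \(f(w) = (w-3)^2\) (learning rate and iteration count are arbitrary choices):

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3,
# following the update rule w_new = w_old - η · f'(w_old).

def grad(w):
    return 2 * (w - 3.0)   # f'(w)

w, eta = 0.0, 0.1          # initial parameter and learning rate (arbitrary)
for _ in range(200):       # repeat the update until (practical) convergence
    w = w - eta * grad(w)
print(w)  # ≈ 3.0
```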
Types of Gradient Descent

There are 3 types of Gradient Descent:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

Batch Gradient Descent (BGD):
Computes the gradient using all the data points in the dataset for parameter update in each iteration.
Say, number of data points in the dataset is \(n\).
Let, the loss function for individual data point be \(l_i(w)\)

\[ l_i(w) = (y_i -\hat{y}_i)^2 \\[10pt] L(w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w) \\[10pt] \frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial l_i}{\partial w} \\[10pt] w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w_{old}} \\[10pt] w_{new} = w_{old} - \eta \cdot (\text{average of all 'n' gradients}) \]

Key Points:

  1. Slow, expensive steps towards convergence, i.e., TC = O(n) per update.
  2. Smooth, direct path towards minima.
  3. Number of steps/iterations is minimum.
  4. Not suitable for large datasets; impractical for Deep Learning, as n = millions/billions.



Stochastic Gradient Descent (SGD):
It uses only 1 data point selected randomly from dataset to compute gradient for parameter update in each iteration.

\[ \frac{\partial L}{\partial w} \approx \frac{\partial l_i}{\partial w}, \text { say i = 5} \\[10pt] w_{new} = w_{old} - \eta \cdot (\text{gradient of i-th data point}) \]

Key Points:

  1. Computationally fastest per step; TC = O(1).
  2. Highly noisy, zig-zag path to minima.
  3. High variance in gradient estimation makes path to minima volatile, requiring a careful decay of learning rate \(\eta\) to ensure convergence to minima.



Mini Batch Gradient Descent (MBGD):
It uses a small, randomly selected subset of the dataset, called a mini-batch (1 < k < n), to compute the gradient for the parameter update in each iteration.

\[ \frac{\partial L}{\partial w} \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\partial l_i}{\partial w} , \text { say k = 32} \\[10pt] w_{new} = w_{old} - \eta \cdot (\text{average gradient of k data points}) \]

Key Points:

  1. Moderate time consumption per step; TC = O(k), where k < n.
  2. Less noisy, and more reliable convergence than stochastic gradient descent.
  3. More efficient and faster than batch gradient descent.
  4. Standard optimization algorithm for Deep Learning.
    Note: Vectorization on GPUs allows parallel processing of mini-batches; GPU hardware characteristics are also why the mini-batch size is typically a power of 2.
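A minimal mini-batch gradient descent sketch in Python, fitting y = w·x on noise-free toy data with k = 4 (all names and hyperparameters are illustrative):

```python
import random

# Mini-batch gradient descent on a toy 1-parameter model y_hat = w·x,
# with data generated from w = 2 and squared loss per point.

random.seed(42)
data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]  # n = 100

w, eta, k = 0.0, 0.1, 4
for _ in range(500):
    batch = random.sample(data, k)                           # random mini-batch
    g = sum(2 * (w * x - y) * x for x, y in batch) / k       # average gradient of k points
    w -= eta * g                                             # parameter update
print(w)  # ≈ 2.0, the true slope
```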



End of Section

2.4.4 - Newton's Method

Newton’s Method for Optimization


Newton’s Method:
It is a second-order iterative gradient based optimization technique known for its extremely fast convergence.
When close to optimum, it achieves quadratic convergence, better than gradient descent’s linear convergence.

Algorithm:

  1. Start at a random point \(x_k\).
  2. Compute the slope at \(x_k, ~i.e,~ f'(x_k)\).
  3. Compute the curvature at \(x_k, ~i.e,~ f''(x_k)\).
  4. Draw a parabola at \(x_k\) that locally approximates the function.
  5. Jump directly to the minimum of that parabola; that’s the next step. Note: So, instead of walking down the slope step by step (gradient descent), we are jumping straight to the point where the curve bends downwards towards the bottom.
\[ x_{k+1} = x_k - \frac{f\prime(x_k)}{f\prime\prime(x_k)} \\[10pt] \text{ step size = } \frac{1}{f\prime\prime(x_k)} \\[10pt] f\prime\prime(x_k) : \text{ tells the curvature of the function at } x_k \\[10pt] x_{new} = x_{old} - (\nabla^2 f(x_{old}))^{-1} \nabla f(x_{old}) \\[10pt] \nabla^2 f(x_{old}): Hessian \]
images/maths/calculus/optimization/newton_method.png
Example
  1. Find the minima of \(f(x) = x^2 - 4x + 5\).
    To find the minima, let's calculate the first derivative and equate it to zero.
    \(f'(x) = 2x - 4 = 0 \)
    \( => x^* = 2 \)

    \(f''(x) = 2 >0 \) => minima is at \(x^* = 2\)

Now, we will solve this using Newton’s Method.
Let’s start at x = 0.

\[ x_{new} = x_{old} - \frac{f\prime(x_{old})}{f\prime\prime(x_{old})} \\[10pt] => x_{new} = 0 - \frac{2*0 -4}{2} = 0-(-2) \\[10pt] => x_{new} = 2 \]

Hence, we can see that using Newton’s Method we can get to the minima \(x^* = 2\) in just 1 step.
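The single-step convergence can be reproduced in a few lines of Python (the helper name is our own):

```python
# Newton's method on f(x) = x^2 - 4x + 5, reproducing the one-step jump to x* = 2.

def newton_step(x, f1, f2):
    """x_new = x_old - f'(x_old) / f''(x_old)."""
    return x - f1(x) / f2(x)

f1 = lambda x: 2 * x - 4   # f'(x)
f2 = lambda x: 2.0         # f''(x), constant for a quadratic
x = 0.0                    # starting point
x = newton_step(x, f1, f2)
print(x)  # 2.0 — the exact minimum, reached in a single step
```

For a quadratic, the local parabola IS the function, so Newton's method lands on the minimum exactly; for general functions it takes several such steps.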

Limitations

Full Newton’s Method is rarely used in Machine Learning/Deep Learning optimization, because of the following limitations:

  1. \(TC = O(n^2)\) for the Hessian calculation, since a network with \(n\) parameters
    requires \(n^2\) derivative calculations.
  2. \(TC = O(n^3)\) for the Hessian inverse calculation.
  3. If it encounters a saddle point, it can converge to a maximum rather than a minimum.

Because of the above limitations, we use Quasi-Newton methods like BFGS and L-BFGS.
The quasi-Newton methods make use of approximations for Hessian calculation, in order to gain the benefits of curvature without incurring the cost of Hessian calculation.

BFGS: Broyden-Fletcher-Goldfarb-Shanno
L-BFGS: Limited-memory BFGS



End of Section