1 - Meaning of 'Linear'

Meaning of ‘Linear’ in Linear Regression


What is the meaning of “linear” in Linear Regression?

The equation of a line is of the form \(y = mx + c\).
To represent a line in 2D space, we need 2 things:

  1. m = slope or direction of the line
  2. c = y-intercept or distance from the origin
images/machine_learning/supervised/linear_regression/line.png

Similarly,
A hyperplane is a lower, (d-1)-dimensional subspace that divides a d-dimensional space into 2 distinct parts. The equation of a hyperplane is:

\[y = w_1x_1 + w_2x_2+ \dots + w_nx_n + w_0 \\[5pt] \implies y = w^Tx + w_0\]

Here, ‘y’ is expressed as a linear combination of the parameters \( w_0, w_1, w_2, \dots, w_n \).
Hence, ‘linear’ means the model is linear with respect to its parameters, NOT the variables.
Read more about Hyperplane

images/machine_learning/supervised/linear_regression/hyperplane.png
Polynomial Features ✅
\[ y = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^3 + w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_0\]

where, \(x_3 = x_1^2 \text{ and } x_4 = x_2^3 \)
\(x_3\) and \(x_4\) are just 2 new (polynomial) variables.
And, ‘y’ is still a linear combination of parameters: \(w_0, w_1, \dots, w_4\)

images/machine_learning/supervised/linear_regression/hypersurface.png
Non-Linear Features ✅
\[ y = w_1log(x) + w_2\sqrt{x}+ w_0 \]

can be rewritten as:

\[y = w_1x_1 + w_2x_2 + w_0\]

where, \(x_1 = log(x) \text{ and } x_2 = \sqrt{x} \)
\(x_1\) and \(x_2\) are transformations of the variable \(x\).
And, ‘y’ is still a linear combination of parameters: \(w_0, w_1, \text{ and } w_2\)

images/machine_learning/supervised/linear_regression/non_linear_features.png
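
A minimal sketch of this idea, assuming NumPy and scikit-learn are available: the raw feature is transformed (log and square root, as above), but the model being fit is still plain linear regression, i.e., linear in its weights. The data here is synthetic and only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(200, 1))                 # single raw feature
y = 3 * np.log(x[:, 0]) + 2 * np.sqrt(x[:, 0]) + 5 + rng.normal(0, 0.1, 200)

# Build new (non-linear) features; the model itself stays linear in its parameters.
X = np.hstack([np.log(x), np.sqrt(x)])                # x_1 = log(x), x_2 = sqrt(x)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                  # ≈ [3, 2] and ≈ 5
```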

Non-Linear Parameters ❌
\[ y = x_1^{w_1} + x_2^{w_2} + w_0 \]

Here the parameters appear as exponents. Taking the log does not help, because the log of a sum does not split into a sum of logs:

\[log(y) = log(x_1^{w_1} + x_2^{w_2} + w_0) \ne w_1log(x_1) + w_2log(x_2) + log(w_0)\]

So ‘y’ cannot be rewritten as a linear combination of the parameters; this model is NOT linear in its parameters.

images/machine_learning/supervised/linear_regression/exponential.png

Importance of Linearity
Linearity in parameters allows us to use Ordinary Least Squares (OLS) to find the best-fit coefficients by solving a set of linear equations, making estimation straightforward.



End of Section

2 - Meaning of 'Regression'

Meaning of ‘Regression’ in Linear Regression


What is the meaning of “Regression” in Linear Regression?

Regression = Going Back

Regression has a very specific historical origin that is different from its current statistical meaning.

Sir Francis Galton (19th century), cousin of Charles Darwin, coined 🪙 this term.

Observation:
Galton observed that -
the children 👶 of unusually tall ⬆️ parents 🧑‍🧑‍🧒‍🧒, tended to be shorter ⬇️ than their parents 🧑‍🧑‍🧒‍🧒,
and children 👶 of unusually short ⬇️ parents 🧑‍🧑‍🧒‍🧒, tended to be taller ⬆️ than their parents 🧑‍🧑‍🧒‍🧒.

Galton named this biological tendency - ‘regression towards mediocrity/mean’.

Galton used the method of least squares to model this relationship by fitting a line to the data 📊.

Regression = Fitting a Line
Over time ⏳, the name ‘regression’ got permanently attached to the method of fitting a line to the data 📊.

Today in statistics and machine learning, ‘regression’ universally refers to the method of finding the
‘line of best fit’ for a set of data points, NOT the concept of ‘regressing towards the mean’.

images/machine_learning/supervised/linear_regression/line_of_best_fit.png



End of Section

3 - Linear Regression

Linear Regression


Predict Salary

Let’s understand linear regression using an example to predict salary.

Predict the salary 💰 of an IT employee, based on various factors, such as, years of experience, domain, role, etc.

images/machine_learning/supervised/linear_regression/salary_prediction.png

Let’s start with a simple problem and predict the salary using only one input feature.

Goal 🎯 : Find the line of best fit.

Plot: Salary vs Years of Experience

\[y = mx + c = w_1x + w_0\]

Slope = \(m = w_1 \)
Intercept = \(c = w_0\)

images/machine_learning/supervised/linear_regression/salary_yoe.png

Similarly, if we include other factors/features impacting the salary 💰, such as domain, role, etc., we get the equation of a fitting hyperplane:

\[y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0\]

where,

\[ \begin{align*} x_1 &= \text{Years of Experience} \\ x_2 &= \text{Domain (Tech, BFSI, Telecom, etc.)} \\ x_3 &= \text{Role (Dev, Tester, DevOps, ML, etc.)} \\ x_d &= d^{th} \text{ feature} \\ w_0 &= \text{Salary at 0 years of experience} \\ \end{align*} \]
What is the dimension of the fitting hyperplane?

Space = ’d’ features + 1 target variable = ’d+1’ dimensions
In a ’d+1’ dimensional space, we try to fit a ’d’ dimensional hyperplane.

images/machine_learning/supervised/linear_regression/fitting_hyperplane.png
Parameters/Weights of the Model

Let the data be \( \{(x_i, y_i)\}_{i=1}^N ; ~ x_i \in R^d , ~ y_i \in R\)
where, N = number of training samples.

images/machine_learning/supervised/linear_regression/linear_regression_data.png

Note: Fitting hyperplane (\(y = w_1x_1 + w_2x_2 + \dots + w_dx_d + w_0\)) is the model.
Objective 🎯: find the parameters/weights (\(w_0, w_1, w_2, \dots w_d \)) of the model.

\(\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{\text{d x 1}}\) \( \mathbf{x_i} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_d} \end{bmatrix}_{\text{d x 1}} \) \( \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}_{\text{n x 1}} \) \( X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{id} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} _{\text{n x d}} \)
Prediction:

\[ \hat{y_i} = w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0 \]

\[ i^{th} \text{ Prediction } = \hat{y_i} = x_i^Tw + w_0\]

Error = Actual - Predicted

\[ \epsilon_i = y_i - \hat{y_i}\]

Goal 🎯: Minimize error between actual and predicted.

Loss 💸 Function

We can quantify the error for a single data point in the following ways:

  • Absolute error = \(|y_i - \hat{y_i}|\)
  • Squared error = \((y_i - \hat{y_i})^2\)
Issues with the Absolute Value function
  • Not differentiable at x = 0; differentiability is required for gradient descent.

  • Constant gradient, i.e., \(\pm 1\): the model learns at the same rate whether the error is large or small.

    images/machine_learning/supervised/linear_regression/absolute_value_function.png
Cost 💰 Function

Average loss across all data points.

Mean Squared Error (MSE) =

\[ J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \]
images/machine_learning/supervised/linear_regression/mean_squared_error.png
Optimization
\[ \begin{align*} \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \\ &= \underset{w_0, w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - (x_i^Tw + w_0))^2 \\ &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - (w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d} + w_0))^2 \\ \underset{w_0, w}{\mathrm{min}}\ J(w) &= \underset{w_0, w_1, w_2, \dots w_d}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n \left( y_i^2 + w_0^2 + w_1^2x_{i_1}^2 + w_2^2x_{i_2}^2 + \dots + w_d^2x_{i_d}^2 + \text{cross terms} \right) \\ \end{align*} \]

The above equation is quadratic in \(w_0, w_1, w_2, \dots w_d \).

Below is an image of a paraboloid in 3D; similarly, we will have a paraboloid in ’d’ dimensions.

images/machine_learning/supervised/linear_regression/paraboloid.png
Find the Minima

In order to find the minima of the cost function, we take its partial derivatives w.r.t. the weights and equate them to 0.

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w_0}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_1}} = 0 \\ \frac{\partial{J(w)}}{\partial{w_2}} = 0 \\ \vdots \\ \frac{\partial{J(w)}}{\partial{w_d}} = 0 \\ \end{align*} \]

We have ‘d+1’ linear equations to solve for ‘d+1’ weights \(w_0, w_1, w_2, \dots , w_d\).

But solving a system of ‘d+1’ linear equations (called the ’normal equations’) by hand is tedious and NOT used for practical purposes.

Matrix Form of Cost Function
\[ J(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2 \]

\[ J(w) = \frac{1}{n} (y - Xw)^2 \]

Note: For compactness, the bias \(w_0\) is dropped here; it can be absorbed into \(w\) by adding a constant feature equal to 1 to every \(x_i\).



Let’s expand the cost function J(w):

\[ \begin{align*} J(w) &= \frac{1}{n} (y - Xw)^2 \\ &= \frac{1}{n} (y - Xw)^T(y - Xw) \\ &= \frac{1}{n} (y^T - w^TX^T)(y - Xw) \\ J(w) &= \frac{1}{n} (y^Ty - w^TX^Ty - y^TXw + w^TX^TXw) \end{align*} \]

Since \(w^TX^Ty\) is a scalar, it is equal to its transpose.

\[ w^TX^Ty = (w^TX^Ty)^T = y^TXw\]

\[ J(w) = \frac{1}{n} (y^Ty - y^TXw - y^TXw + w^TX^TXw)\]

\[ J(w) = \frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw) \]

Note: here \((y - Xw)^2\) is shorthand for \((y - Xw)^T(y - Xw)\), and \((AB)^T = B^TA^T\)

Normal Equation

To find the minimum, take the derivative of the cost function J(w) w.r.t. ‘w’ and equate it to the zero vector.

\[\frac{\partial{J(w)}}{\partial{w}} = \vec{0}\]\[ \begin{align*} &\frac{\partial{[\frac{1}{n} (y^Ty - 2y^TXw + w^TX^TXw)]}}{\partial{w}} = 0\\ & \implies 0 - 2X^Ty + (X^TX + X^TX)w = 0 \\ & \implies \cancel{2}X^TXw = \cancel{2} X^Ty \\ & \therefore \mathbf{w} = (X^TX)^{-1}X^T\mathbf{y} \end{align*} \]

Note: \(\frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a\) and \(\frac{\partial{(\mathbf{x}^TA\mathbf{x})}}{\partial{\mathbf{x}}} = (A + A^T)\mathbf{x}\)

This is the closed-form solution of the normal equations.
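
A minimal NumPy sketch of this closed-form solution on a made-up dataset; a column of ones is prepended to X so that the intercept \(w_0\) is estimated together with the other weights.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(0, 0.1, n)      # intercept = 4.0

Xb = np.hstack([np.ones((n, 1)), X])              # prepend the bias column
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y           # normal equation: (X^T X)^-1 X^T y
print(w)                                          # ≈ [4.0, 2.0, -1.0, 0.5]
```

In practice, np.linalg.solve(Xb.T @ Xb, Xb.T @ y) or np.linalg.lstsq is preferred over an explicit inverse for numerical stability.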

Issues with Normal Equation
  • The inverse of \(X^TX\) may NOT exist (the matrix can be singular/non-invertible).
  • Time complexity of computing the inverse of the d x d matrix \(X^TX\) is cubic in the number of features, i.e., O(d^3).
Pseudo Inverse

If the inverse does NOT exist then we can use the approximation of the inverse, also called Pseudo Inverse or Moore Penrose Inverse (\(A^+\)).

Moore Penrose Inverse ( \(A^+\)) is calculated using Singular Value Decomposition (SVD).

SVD of \(A = U \Sigma V^T\)

Pseudo Inverse \(A^+ = V \Sigma^+ U^T\)

where \(\Sigma^+\) is the transpose of \(\Sigma\) with the reciprocals of the non-zero singular values on its diagonal.
e.g:

\[ \Sigma = \begin{bmatrix} 5 & 0 & 0 \\\\ 0 & 2 & 0 \end{bmatrix} \]

\[ \Sigma ^{+}=\left[\begin{matrix}1/5&0\\ 0&1/2\\ 0&0\end{matrix}\right]=\left[\begin{matrix}0.2&0\\ 0&0.5\\ 0&0\end{matrix}\right] \]

Note: Time complexity of SVD for an m x n matrix is O(mn^2) (for m ≥ n).
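
A small sketch, assuming NumPy: np.linalg.pinv computes the Moore–Penrose pseudo-inverse via SVD, which still works when \(X^TX\) is singular (here a duplicated column makes it singular).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.hstack([X, X[:, [0]]])        # third column duplicates the first -> X^T X is singular
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 50)

# np.linalg.inv(X.T @ X) is unreliable here (singular matrix); the pseudo-inverse still works.
w = np.linalg.pinv(X) @ y            # pseudo-inverse, computed via SVD internally
print(w)                             # minimum-norm solution: the duplicated feature's weight is split
```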



End of Section

4 - Probabilistic Interpretation

Probabilistic Interpretation of Linear Regression


Probabilistic Interpretation
Explains why we use ordinary least squares error to find the model weights/parameters.
Model Assumptions

Error = Random Noise, Un-modeled effects

\[ \begin{align*} \epsilon_i = y_i - \hat{y_i} \\ \implies y_i = \hat{y_i} + \epsilon_i \\ \because \hat{y_i} = x_i^Tw \\ \therefore y_i = x_i^Tw + \epsilon_i \\ \end{align*} \]

Actual value(\(y_i\)) = Deterministic linear predictor(\(x_i^Tw\)) + Error term(\(\epsilon_i\))

Error Assumptions
  • Independent and Identically Distributed (I.I.D):
    Each error term is independent of others.
  • Normal (Gaussian) Distributed:
    Error follows a normal distribution with mean = 0 and a constant variance \(\sigma^2\).

This implies that the target variable itself is a random variable, normally distributed around the linear relationship.

\[(y_{i}|x_{i};w) \sim N(x_{i}^{T}w,\sigma^{2})\]
images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_04_01.png
images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_05_01.png
Why is the error term’s distribution considered to be Gaussian?

The Central Limit Theorem (CLT) states that for a sequence of I.I.D random variables, the distribution of the sample mean (or sum) approaches a normal distribution, regardless of the original population distribution.

images/machine_learning/supervised/linear_regression/probabilistic_interpretation/slide_07_01.png
Probability Vs Likelihood
  • Probability (Forward View):
    Quantifies the chance of observing a specific outcome given a known, fixed model.
  • Likelihood (Backward/Inverse View):
    Inverse concept used for inference (working backward from results to causes). It is a function of the parameters and measures how ‘likely’ a specific set of parameters makes the observed data appear.
Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.'

The goal of the probabilistic interpretation is to find the parameters ‘w’ that maximize the probability (likelihood) of observing the given dataset.

Assumption: Training data is I.I.D.

\[ \begin{align*} Likelihood &= \mathcal{L}(w) \\ \mathcal{L}(w) &= p(y|x;w) \\ &= \prod_{i=1}^N p(y_i| x_i; w) \\ &= \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \end{align*} \]
Issue with Likelihood

Maximizing the likelihood function is mathematically complex due to the product term and the exponential function.

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function.

Solution: Log Likelihood
\[ \begin{align*} log \mathcal{L}(w) &= log \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}} \\ &= \sum_{i=1}^N log(\frac{1}{\sigma\sqrt{2\pi}}) + log (e^{-\frac{(y_i-x_i^Tw)^2}{2\sigma^2}}) \\ log \mathcal{L}(w) &= Nlog(\frac{1}{\sigma\sqrt{2\pi}}) - \sum_{i=1}^N \frac{(y_i-x_i^Tw)^2}{2\sigma^2} \\ \end{align*} \]

Note: The first term is constant w.r.t ‘w'.

So, we need to find parameters ‘w’ that maximize the log likelihood.

\[ \begin{align*} log \mathcal{L}(w) & \propto -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i-x_i^Tw)^2 \\ & \because \frac{1}{2\sigma^2} \text{ is constant} \\ log \mathcal{L}(w) & \propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \end{align*} \]
Ordinary Least Squares
\[ \begin{align*} log \mathcal{L}(w) &\propto -\sum_{i=1}^N (y_i-x_i^Tw)^2 \\ \underset{w}{\mathrm{max}}\ -\sum_{i=1}^N (y_i-x_i^Tw)^2 &= \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i-x_i^Tw)^2 \end{align*} \]

Maximizing the log-likelihood is equivalent to minimizing the sum of squared errors, which is the exact objective of the ordinary least squares (OLS) method.

\[ \underset{w}{\mathrm{min}}\ J(w) = \underset{w}{\mathrm{min}}\ \sum_{i=1}^N (y_i - x_i^Tw)^2 \]
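
A small numerical sanity check of this equivalence, assuming NumPy and SciPy and a synthetic dataset: minimizing the Gaussian negative log-likelihood over ‘w’ lands on (essentially) the same weights as the ordinary least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 0.5, 200)

sigma = 0.5  # assumed known noise std; it only scales/shifts the objective

def neg_log_likelihood(w):
    resid = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x     # maximize likelihood = minimize NLL
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]               # ordinary least squares
print(w_mle, w_ols)                                        # match up to optimizer tolerance
```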



End of Section

5 - Convex Function

Convex Function


Convexity

Refers to a property of a function where a line segment connecting any two points on its graph lies above or on the graph itself.

  • A convex function is curved upwards.

    images/machine_learning/supervised/linear_regression/convex_function/slide_01_01.png
Is the MSE Cost function Convex? YES ✅

The MSE cost function J(w) is convex because its Hessian (H) is always positive semi-definite.

\[\nabla J(\mathbf{w})=\frac{1}{n}\mathbf{X}^{T}(\mathbf{Xw}-\mathbf{y})\]

\[\mathbf{H}=\frac{\partial (\nabla J(\mathbf{w}))}{\partial \mathbf{w}^{T}}=\frac{1}{n}\mathbf{X}^{T}\mathbf{X}\]

\[\therefore \mathbf{H} = \nabla^2J(w) = \frac{1}{n} \mathbf{X}^{T}\mathbf{X}\]
images/machine_learning/supervised/linear_regression/convex_function/slide_05_01.png
MSE: Positive Semi Definite (Proof)

A matrix H is positive semi-definite if and only if for any non-zero vector ‘z’, the quadratic form \(z^THz \ge 0\).

For the Hessian of J(w),

\[ z^THz = z^T(\frac{1}{n}X^TX)z = \frac{1}{n}(Xz)^T(Xz) \]

\((Xz)^T(Xz) = \lVert Xz \rVert^2\) = squared L2 norm (magnitude) of the vector

Note: The squared norm of any real-valued vector is always \(\ge 0\).
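
A quick numerical illustration, assuming NumPy and a random data matrix: the eigenvalues of the Hessian \(\frac{1}{n}X^TX\) are non-negative, confirming it is positive semi-definite (and hence J(w) is convex).

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 4
X = rng.normal(size=(n, d))

H = (X.T @ X) / n                    # Hessian of the MSE cost
eigvals = np.linalg.eigvalsh(H)      # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                       # all >= 0 (up to floating-point error)
print(np.all(eigvals >= -1e-10))     # True
```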



End of Section

6 - Gradient Descent

Gradient Descent


Goal 🎯

Minimize the cost 💰function.

\[J(w)=\frac{1}{2n}(y-Xw)^{2}\]

Note: The 1/2 term is included simply to make the derivative cleaner (it cancels out the 2 from the square).

Issues with Normal Equation

Normal Equation (Closed-form solution) jumps straight to the optimal point in one step.

\[w=(X^{T}X)^{-1}X^{T}y\]

But it is not always feasible and is computationally expensive 💰 (due to the inverse calculation 🧮).

Gradient Descent 🎢

An iterative optimization algorithm that slowly nudges the parameters ‘w’ towards values that minimize the cost 💰 function.

images/machine_learning/supervised/linear_regression/gradient_descent/slide_05_01.png
Algorithm ⚙️
  1. Initialize the weights/parameters with random values.
  2. Calculate the gradient of the cost function at current parameter values.
  3. Update the parameters using the gradient. \[ w_{new} = w_{old} - \eta \frac{\partial{J(w)}}{\partial{w_{old}}} \] \( \eta \) = learning rate or step size to take for each parameter update.
  4. Repeat 🔁 steps 2 and 3 iteratively until convergence (to minima).
images/machine_learning/supervised/linear_regression/gradient_descent/slide_07_01.png
Gradient Calculation
\[ \begin{align*} &J(w) = \frac{1}{2n} (y - Xw)^2 \\ &\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{(\frac{1}{2n} (y - Xw)^2)}}{\partial{w}} \end{align*} \]

Applying chain rule:

\[ \begin{align*} &\text{Let } u = (y - Xw) \\ &\text{Derivative of } u^Tu \text{ w.r.t 'w' } = 2\left(\frac{\partial{u}}{\partial{w}}\right)^T u \\ \frac{\partial{(\frac{1}{2n} (y - Xw)^2)}}{\partial{w}} &= \frac{1}{\cancel2n}.\cancel2(-X^T)(y - Xw) \\ &= -\frac{1}{n}X^T(y - Xw) \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= \frac{1}{n}X^T(Xw - y) \end{align*} \]

Note: \(\frac{∂(a^{T}x)}{∂x}=a\)

Update the parameters using the gradient (the constant factor 1/n is absorbed into the learning rate \(\eta'\)):

\[ w_{new} = w_{old} - \eta'. X^T(Xw - y) \]
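
A minimal batch gradient descent sketch with NumPy, using this gradient; the learning rate, iteration count, and dataset are illustrative, not tuned.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for linear regression (bias column assumed to be in X)."""
    n, d = X.shape
    w = np.zeros(d)                          # step 1: initialize the weights
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n         # step 2: gradient of (1/2n)(y - Xw)^2
        w -= lr * grad                       # step 3: update the parameters
    return w

rng = np.random.default_rng(3)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = X @ np.array([4.0, 2.0, -1.0]) + rng.normal(0, 0.1, 300)
print(gradient_descent(X, y))                # ≈ [4.0, 2.0, -1.0]
```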
Learning Rate
  • Most important hyper parameter of gradient descent.
  • Dictates the size of the steps taken down the cost function surface.

Small \(\eta\) ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_01.png

Large \(\eta\) ->

images/machine_learning/supervised/linear_regression/gradient_descent/slide_11_02.png
Learning Rate Techniques
  • Learning Rate Schedule:
    The learning rate is decayed (reduced) over time.
    Large steps initially and fine-tuning near the minimum, e.g., step decay or exponential decay.
  • Adaptive Learning Rate Methods:
    Automatically adjust the learning rate for each parameter ‘w’ based on the history of gradients.
    Preferred in modern deep learning as they require less manual tuning, e.g., Adagrad, RMSprop, and Adam.
Types of Gradient Descent 🎢

Batch, Stochastic, and Mini-Batch gradient descent are classified by how much data is used for the gradient calculation in each iteration.

  • Batch Gradient Descent (BGD): Entire Dataset
  • Stochastic Gradient Descent (SGD): Random Point
  • Mini-Batch Gradient Descent (MBGD): Subset
Batch Gradient Descent 🎢 (BGD)

Computes the gradient using all the data points in the dataset for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(average of all ’n’ gradients)}\]

🔑Key Points:

  • Slow 🐢 steps towards convergence, i.e., time complexity per step = O(n).
  • Smooth, direct path towards minima.
  • Minimum number of steps/iterations.
  • Not suitable for large datasets; impractical for Deep Learning, as n = millions/billions.
Stochastic Gradient Descent 🎢 (SGD)

Uses only 1 data point selected randomly from dataset to compute gradient for parameter update in each iteration.

\[w_{new} = w_{old} - \eta.\text{(gradient at any random data point)}\]

🔑Key Points:

  • Computationally fastest 🐇 per step; TC = O(1).
  • Highly noisy, zig-zag path to minima.
  • High variance in gradient estimation makes path to minima volatile, requiring a careful decay of learning rate to ensure convergence to minima.
Mini Batch Gradient Descent 🎢 (MBGD)
  • Uses a small randomly selected subset of the dataset, called a mini-batch (1<k<n), to compute the gradient for the parameter update in each iteration (a short sketch follows after this list). \[w_{new} = w_{old} - \eta.\text{(average gradient of ‘k’ data points)}\]

🔑Key Points:

  • Moderate time ⏰ consumption per step; TC = O(k), where k < n.
  • Less noisy, and more reliable convergence than stochastic gradient descent.
  • More efficient and faster than batch gradient descent.
  • Standard optimization algorithm for Deep Learning.

Note: Vectorization on GPUs allows for parallel processing of mini-batches; GPU memory architecture is also the reason the mini-batch size is usually a power of 2.
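
A sketch of the mini-batch update loop in NumPy, reusing the same gradient as before; the batch size, learning rate, and data are illustrative.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent for linear regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)  # average gradient over the mini-batch
            w -= lr * grad
    return w

rng = np.random.default_rng(9)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(0, 0.1, 500)
print(minibatch_gd(X, y))                             # ≈ [1.0, 3.0, -2.0]
```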
BGD vs SGD vs Mini-BGD
images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_08_01.png
images/machine_learning/supervised/linear_regression/types_of_gradient_descent/slide_09_01.png



End of Section

7 - Polynomial Regression

Polynomial Regression


What if our data is more complex than a straight line?

We can use a linear model to fit non-linear data.

Add powers of each feature as new features, then train a linear model on this extended set of features.

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_02_01.png
Polynomial Regression

Linear: \(\hat{y_i} = w_0 + w_1x_{i_1} \)

Polynomial (quadratic): \(\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2\)

Polynomial (n-degree): \(\hat{y_i} = w_0 + w_1x_{i_1} + w_2x_{i_1}^2 +w_3x_{i_1}^3 + \dots + w_nx_{i_1}^n \)

The above polynomial can be re-written as a linear equation:

\[\hat{y_i} = w_0 + w_1X_1 + w_2X_2 +w_3X_3 + \dots + w_nX_n \]

where, \(X_1 = x_{i_1}, X_2 = x_{i_1}^2, X_3 = x_{i_1}^3, \dots, X_n = x_{i_1}^n\)

=> the model is still linear w.r.t. its parameters/weights \(w_0, w_1, w_2, \dots , w_n \).

e.g:

images/machine_learning/supervised/linear_regression/polynomial_regression/slide_04_01.png
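
A minimal scikit-learn sketch on synthetic quadratic data: PolynomialFeatures generates the powers as new columns, and LinearRegression then fits a model that is still linear in its weights.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0]**2 + x[:, 0] + 2 + rng.normal(0, 0.2, 200)    # quadratic ground truth

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_)               # ≈ [1.0, 0.5]
print(model.named_steps["linearregression"].intercept_)          # ≈ 2.0
```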
Strategy to find Polynomial Features
  • Fit a linear model to the data points.
  • Plot the errors.
  • If the residuals show a systematic pattern or their variance is too high, try polynomial features.

Note: Detect and remove outliers from error distribution.

High Degree Polynomial Regression
images/machine_learning/supervised/linear_regression/polynomial_regression/slide_07_01.png
Conclusion
  • High-degree polynomial model: Over-fitting ❌
  • Linear model: Under-fitting ❌
  • Quadratic model: Generalizes best ✅



End of Section

8 - Data Splitting

Data Splitting


Why is Data-Splitting Required?
To avoid over-fitting (memorization), so that the model generalizes well, improving its performance on unseen data.
Train/Validation/Test Split
  • Training Data: Learn model parameters (Textbook + Practice problems)
  • Validation Data: Tune hyper-parameters (Mock tests)
  • Test Data: Evaluate model performance (Real (final) exam)
Data Leakage

Data leakage occurs when information from the validation or test set is inadvertently used to train 🏃‍♂️ the model.

The model ‘cheats’ by learning to exploit information it should not have access to, resulting in artificially inflated performance metrics during testing 🧪.

Typical Split Ratios
  • Small datasets (1K–100K): 60/20/20, 70/15/15 or 80/10/10
  • Large datasets (>1M): 98/1/1 would suffice, as 1% of 1M is still 10K.

Note: There is no fixed rule; it’s trial and error.

Imbalanced Data
Imbalanced data refers to a dataset where the target classes are represented by an unequal or highly skewed distribution of samples, such that the majority class significantly outnumbers the minority class.
Stratified Sampling

If there is class imbalance in the dataset (e.g., 95% class A, 5% class B), a random split might result in the validation set having 99% class A.

Solution: Use stratified sampling to ensure class proportions are maintained across all splits (train🏃‍♂️/validation📋/test🧪).

Note: Non-negotiable for imbalanced data.
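
A short sketch with scikit-learn's train_test_split on a made-up imbalanced binary label y: passing stratify=y keeps the class proportions (here roughly 95/5) the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.uniform(size=1000) < 0.05).astype(int)         # ~5% minority class

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # stratified split

print(y_train.mean(), y_val.mean())                      # both ≈ 0.05
```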

Time-Series ⏳ Data
  • In time-series ⏰ data, divide the data chronologically, not randomly, i.e., the training data’s time ⏰ period should precede the validation data’s time ⏰ period.
  • We always train 🏃‍♂️ on past data to predict future data.

Golden rule: Never look 👀 into the future.



End of Section

9 - Cross Validation

Cross Validation


Core Idea 💡

Do not trust one split of the data; validate across many splits, and average the result to reduce randomness and bias.

Note: Two different splits of the same dataset can give very different validation scores.

Cross-validation

Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.

It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.

  • K-Fold Cross-Validation
  • Leave-One-Out Cross-Validation (LOOCV)
K-Fold Cross-Validation
  1. Shuffle the dataset randomly (except for time-series ⏳ data).
  2. Split the data into k equal subsets (folds).
  3. Iterate through each unique fold, using it as the validation set.
  4. Use the remaining k-1 folds for training 🏃‍♂️.
  5. Take the average of the results.

Note: Common choices are k = 5 or 10.
  • Iteration 1: [V][T][T][T][T]
  • Iteration 2: [T][V][T][T][T]
  • Iteration 3: [T][T][V][T][T]
  • Iteration 4: [T][T][T][V][T]
  • Iteration 5: [T][T][T][T][V]
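
A compact sketch with scikit-learn on a synthetic regression dataset: cross_val_score handles the splitting, training, and scoring across the k folds, and we average the per-fold scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")

print(scores)          # R^2 on each of the 5 validation folds
print(scores.mean())   # averaged cross-validation score
```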
Leave-One-Out Cross-Validation (LOOCV)

Model is trained 🏃‍♂️on all data points except one, and then tested 🧪on that remaining single observation.

LOOCV is an extreme case of k-fold cross-validation, where, k=n (number of data points).

  • Pros:
    Useful for small (<1000) datasets.

  • Cons:
    Computationally 💻 expensive 💰.



End of Section

10 - Bias Variance Tradeoff

Bias Variance Tradeoff


Total Error

Mean Squared Error (MSE) = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2\)

Total Error = Bias^2 + Variance + Irreducible Error

  • Bias = Systematic Error
  • Variance = Sensitivity to Data
  • Irreducible Error = Sensor noise, Human randomness
Bias

Systematic error from overly simplistic assumptions or a strong opinion in the model.

e.g. House 🏠 prices 💰 = Rs. 10,000 * Area (sq. ft).

Note: This is an over-simplified view, because it ignores amenities, location, age, etc.

Variance

Error from sensitivity to small fluctuations 📈 in the data.

e.g. Deep neural 🧠 network trained on a small dataset.

Note: Memorizes everything, including noise.

Say a house 🏠 in XYZ street was sold for very low price 💰.

Reason: Distress selling (outlier), or incorrect entry (noise).

Note: Model will make wrong(lower) price 💰predictions for all houses in XYZ street.

Linear (High Bias) vs Polynomial (High Variance)
images/machine_learning/supervised/linear_regression/bias_variance_tradeoff/slide_06_01.png
Bias Variance Table
images/machine_learning/supervised/linear_regression/bias_variance_tradeoff/bias_variance_table.png
Bias-Variance Trade-Off

Goal 🎯 is to minimize total error.

Find a sweet-spot balance ⚖️ between Bias and Variance.

A good model ‘generalizes’ well, i.e.,

  • Is not too simple and does not hold a strong opinion.
  • Does not memorize 🧠 everything in the data, including noise.
Fix 🩹 High Bias (Under-Fitting)
  • Make model more complex.
    • Add more features, add polynomial features.
  • Decrease Regularization.
  • Train 🏃‍♂️ longer, in case the model has not yet converged.
Fix 🩹 High Variance (Over-Fitting)
  • Add more data (most effective).
    • Harder to memorize 🧠 1 million examples than 100.
    • Use data augmentation, if getting more data is difficult.
  • Increase Regularization.
  • Early stopping 🛑, prevents memorization 🧠.
  • Dropout (DL), randomly kill neurons, prevents co-adaptation.
  • Use Ensembles.
    • Averaging reduces variance.

Note: Co-adaptation refers to a phenomenon where neurons in a neural network become highly dependent on each other to detect features, rather than learning independently.



End of Section

11 - Regularization

Regularization


Over-Fitting

Over-fitting happens when a complex model creates a highly erratic curve to fit every single data point, including the random noise.

  • Excellent performance on training 🏃‍♂️ data, but poor performance on unseen data.
How to Avoid Over-Fitting ?
  • Penalize overly complex models, i.e., models with excessively large or numerous parameters.
  • Focus on learning general patterns rather than memorizing 🧠 everything, including the noise in the training 🏃‍♂️ data.
Regularization

Set of techniques that prevents over-fitting by adding a penalty term to the model’s loss function.

\[ J_{reg}(w) = J(w) + \lambda.\text{Penalty(w)} \]

\(\lambda\) = Regularization strength hyper parameter; the bias-variance tradeoff knob.

  • High \(\lambda\): High ⬆️ penalty, forces weights towards 0, simpler model.
  • Low \(\lambda\): Weak ⬇️ penalty, closer to un-regularized model.
Regularization introduces Bias
  • By intentionally simplifying a model (shrinking weights) to reduce its complexity, which prevents it from overfitting.
  • The penalty pulls feature weights closer to zero, making the model a less faithful representation of the training data’s true complexity.
Common Regularization Techniques
  • L2 Regularization (Ridge Regression)
  • L1 Regularization (Lasso Regression)
  • Elastic Net Regularization
  • Early Stopping
  • Dropout (Neural Networks)
L2 Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^d w_j^2 \]
  • Penalty term: \(\ell_2\) norm - penalizes large weights quadratically.
  • Pushes the weights close to 0 (not exactly 0), making models more stable by distributing importance across weights.
  • Splits feature importance across correlated features.
  • Use case: Best when most features are relevant and correlated.
  • Also known as Ridge regression or Tikhonov regularization.
L1 Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.\sum_{j=1}^d |w_j| \]
  • Penalty term: \(\ell_1\) norm.
  • Shrinks some weights exactly to 0, effectively performing feature selection, giving sparse solutions.
  • For a group of highly correlated features, it arbitrarily selects one feature and shrinks the others to 0.
  • Use case: Best for high-dimensional datasets (d >> n) where we suspect many features are irrelevant or redundant, or when model interpretability matters.
  • Also known as Lasso (Least Absolute Shrinkage and Selection Operator) regression.
  • Computational hack: define \(\frac{\partial{|w_j|}}{\partial{w_j}} = 0\) at \(w_j = 0\), since the absolute value function is not differentiable at 0.
Elastic Net Regularization
\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^Tw)^2 + \lambda.((1-\alpha).\sum_{j=1}^d w_j^2 + \alpha.\sum_{j=1}^d |w_j|) \]
  • Penalty term: a linear combination of the \(\ell_1\) and \(\ell_2\) norms.
  • Sparsity (feature selection) of L1 and the stability/grouping effect of L2 regularization.
  • Use case: Best when we have high dimensional data with correlated features and we want sparse and stable solution.
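
A brief scikit-learn sketch on synthetic data with a highly correlated feature pair (the alpha values are illustrative, not tuned), comparing how the three penalties distribute the weights.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # nearly a duplicate of x1 (highly correlated)
x3 = rng.normal(size=n)                    # an independent feature
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 1 * x3 + rng.normal(0, 0.5, n)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))

# Ridge tends to spread the weight of ~3 across the correlated pair (x1, x2);
# Lasso tends to keep one of the pair and shrink the other towards 0;
# Elastic Net sits in between.
```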
L1/L2/Elastic Net Regularization Comparison
images/machine_learning/supervised/linear_regression/regularization/slide_12_01.png
Why do weights shrink exactly to 0 in L1 regularization but NOT in L2 regularization?
  • The gradient of the L1 penalty (absolute value function) is a constant, i.e., \(\pm 1\). This means a constant reduction in the weight at each step, so the weight can reach exactly 0 in a finite number of steps.
  • The gradient of the L2 penalty is proportional to the weight (\(2w_j\)); as the weight gets close to 0, the gradient also becomes very small. The weight therefore gets very close to 0, but never exactly 0.
L1 vs L2 Regularization Comparison
images/machine_learning/supervised/linear_regression/regularization/slide_14_01.png



End of Section

12 - Regression Metrics

Regression Metrics


Regression Metrics
Quantify the difference between the actual values and the predicted values.
Mean Absolute Error(MAE)

MAE = \(\frac{1}{n} \sum_{i=1}^n |y_i - \hat{y_i}|\)

  • Treats each error equally.
    • Robust to outliers.
  • Not differentiable at x = 0.
    • Using gradient descent requires a computational hack.
  • Easy to interpret, as same units as target variable.
Mean Squared Error(MSE)

MSE = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2\)

  • Heavily penalizes large errors.
    • Sensitive to outliers.
  • Differentiable everywhere.
    • Used by gradient descent and most other optimization algorithms.
  • Difficult to interpret, as it has squared units.
Root Mean Squared Error(RMSE)

RMSE = \(\sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y_i})^2}\)

  • Easy to interpret, as it has same units as target variable.
  • Useful when we need outlier-sensitivity of MSE but the interpretability of MAE.
R^2 Metric

Measures the improvement over the mean model (a baseline that always predicts \(\bar{y}\)).

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]

A good R^2 value depends upon the use case, e.g.:

  • Car 🚗 sale: R^2 = 0.8 is good enough.
  • Cancer 🧪 prediction: R^2 ≥ 0.95, as life depends on it.

Range of values:

  • Best value = 1
  • Baseline value = 0
  • Worst value = \(- \infty\)

Note: An example of a bad model: all the data points lie along the x-axis, but the model’s predictions lie along the y-axis.
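
A short sketch using scikit-learn's metrics module, computing MAE, MSE, RMSE, and R^2 for a toy set of actual vs predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 13.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```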

Huber Loss

Quadratic for small errors; Linear for large errors.

\[ L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \le \delta \\ \\ \delta (|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise} \end{cases} \]
  • Robust to outliers.
  • Differentiable at 0; smooth convergence to minima.
  • Delta (\(\delta\)) is a knob (hyper parameter) to control the transition point.
  • \(\delta\) high: behaves like MSE
  • \(\delta\) low: behaves like MAE

Note: Tune \(\delta\): errors larger than \(\delta\) are treated like MAE; errors smaller than \(\delta\) are treated like MSE.
e.g., \(\delta\) = 95th percentile of errors, or \(\delta \approx 1.35\sigma\) for standard Gaussian data.

Huber loss (Green) and Squared loss (blue)

images/machine_learning/supervised/linear_regression/regression_metrics/slide_09_01.png
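
A small NumPy sketch of the Huber loss as defined above (the value of \(\delta\) is illustrative): small residuals fall on the quadratic branch, while an outlier only contributes linearly.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Piecewise loss: quadratic for |error| <= delta, linear beyond that."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error**2                          # quadratic branch (like MSE)
    linear = delta * (np.abs(error) - 0.5 * delta)    # linear branch (like MAE)
    return np.where(small, squared, linear)

y_true = np.array([1.0, 2.0, 3.0, 100.0])             # last point is an outlier
y_pred = np.array([1.2, 1.8, 3.5, 4.0])
print(huber_loss(y_true, y_pred, delta=1.0))
# small errors are squared; the outlier contributes only linearly
```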



End of Section

13 - Assumptions

Assumptions Of Linear Regression


Assumptions

Linear Regression works reliably only when certain key 🔑 assumptions about the data are met.

  • Linearity
  • Independence of Errors (No Auto-Correlation)
  • Homoscedasticity (Equal Variance)
  • Normality of Errors
  • No Multicollinearity
  • No Endogeneity (Target not correlated with the error term)
Linearity

Relationship between features and target 🎯 is linear in parameters.

Note: Polynomial regression is linear regression.
\(y=w_0 +w_1x_1+w_2x_2^2 + w_3x_3^3\)

Independence of Errors (No Auto-Correlation)

Residuals (errors) should not have a visible pattern or correlation with one another (most common in time-series ⏰ data).

Risk:
If errors are correlated, standard errors will be underestimated, making variables look ‘statistically significant’ when they are not.

Test:

  • Durbin–Watson test

  • Autocorrelation plots (ACF)

  • Residuals vs time

    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_05_01.png
Homoscedasticity

Constant variance of errors; Var(ϵ|X) = σ²

Risk:
Standard errors become biased, leading to unreliable hypothesis tests (t-tests, F-tests).

Test:

  • Breusch–Pagan test
  • White test

Fix:

  • Log transform
  • Weighted Least Squares(WLS)
Normality of Errors

Error terms should follow a normal distribution (required mainly for small datasets).

Note: Because of Central Limit Theorem, with a large enough sample size, this becomes less critical for estimation.

Risk: Hypothesis testing (calculating p-values and confidence intervals) assumes the error terms follow a normal distribution; if they do not, these results become unreliable.

Test:

  • Q-Q plot

  • Shapiro-Wilk Test

    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_08_01.png
    images/machine_learning/supervised/linear_regression/assumptions_of_linear_regression/slide_09_01.png
No Multicollinearity

Features should not be highly correlated with each other.

Risk:

  • High correlation makes it difficult to determine the unique, individual impact of each feature. This leads to high variance in model parameter estimates: small changes in the data cause large swings in the parameters.
  • Model interpretability issues.

Test:

  • Variance Inflation Factor (VIF): VIF > 5 → concern, VIF > 10 → serious issue (see the sketch after the fixes below).

Fix:

  • PCA
  • Remove redundant features
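
A brief sketch of the VIF check, assuming statsmodels is installed and using made-up features; a constant column is added because variance_inflation_factor regresses each column on all the others, including the intercept.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(0, 0.1, 200)                  # highly correlated with x1
x3 = rng.normal(size=200)                                # independent feature
exog = sm.add_constant(np.column_stack([x1, x2, x3]))    # intercept + features

for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(exog, i), 2))

# x1 and x2 show large VIFs (well above 10); x3 stays close to 1.
# (The VIF reported for the constant column itself can be ignored.)
```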
No Endogeneity (Exogeneity)

Error term must be uncorrelated with the features; E[ϵ|X] = 0

Risk:

  • Parameters will be biased and inconsistent.

Test:

  • Hausman Test
  • Durbin-Wu-Hausman (DWH) Test

Fix:

  • Controlled experiments.



End of Section