
Logistic Regression


1 - Binary Classification

Binary Classification

Why can’t we use Linear Regression for binary classification?

Linear regression tries to find the best-fit line through continuous target values, whereas here we want to find the line, i.e., the decision boundary, that clearly separates the two classes.

Goal 🎯

Find the decision boundary, i.e., the equation of the separating hyperplane.

\[z=w^{T}x+w_{0}\]
Decision Boundary

The value of \(z = \mathbf{w^Tx} + w_0\) tells us how far the point is from the decision boundary and on which side (a short sketch follows the list below).

Note: Weight 🏋️‍♀️ vector ‘w’ is normal/perpendicular to the hyperplane, pointing towards the positive class (y=1).

Distance of Points from Separating Hyperplane
  • For points exactly on the decision boundary \[z = \mathbf{w^Tx} + w_0 = 0 \]
  • Positive (+ve) labeled points \[ z = \mathbf{w^Tx} + w_0 > 0 \]
  • Negative (-ve) labeled points \[ z = \mathbf{w^Tx} + w_0 < 0 \]
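A minimal NumPy sketch of these three cases; the weight vector `w`, bias `w0`, and the sample points are made-up values for illustration, not learned parameters:

```python
import numpy as np

# Hypothetical 2-D weights and bias (illustrative, not learned from data).
w = np.array([2.0, -1.0])
w0 = -0.5

points = np.array([
    [1.0, 1.5],   # lands exactly on the decision boundary (z = 0)
    [2.0, 0.0],   # positive side of the hyperplane (z > 0)
    [0.0, 2.0],   # negative side of the hyperplane (z < 0)
])

# z = w^T x + w0: a signed score telling side and (scaled) distance.
z = points @ w + w0
print(z)           # [ 0.   3.5 -2.5]
print(np.sign(z))  # 0 on the boundary, +1 positive class, -1 negative class
```

(The true geometric distance is \(z / \lVert w \rVert\); the sign is what decides the class.)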
Missing Link 🔗
The distance of a point from the hyperplane can range from \(-\infty\) to \(+ \infty\).
So we need a link 🔗 to transform the geometric distance to probability.
Sigmoid Function (a.k.a Logistic Function)

Maps the output of a linear equation to a value between 0 and 1, allowing the result to be interpreted as a probability.

\[\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\]
  • If the distance ‘z’ is large and positive, \(\hat{y} \approx 1\) (High confidence in the positive class).

  • If the distance ‘z’ is large and negative, \(\hat{y} \approx 0\) (High confidence in the negative class).

  • If the distance ‘z’ is 0, \(\hat{y} = 0.5\) (Maximum uncertainty).

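A short sketch of the sigmoid mapping a few illustrative distances to probabilities:

```python
import numpy as np

def sigmoid(z):
    """Map a score z in (-inf, +inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"z = {z:+5.1f}  ->  y_hat = {sigmoid(z):.4f}")
# Large negative z -> close to 0, z = 0 -> exactly 0.5, large positive z -> close to 1.
```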
Why is it called Logistic Regression ?
Because we use the logistic (sigmoid) function as the ‘link function’🔗 to map 🗺️ the continuous output of the regression into a probability space.



End of Section

2 - Log Loss

Log Loss


Log Loss = \(\begin{cases} -log(\hat{y_i}) & \text{if } y_i = 1 \\ \\ -log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases} \)

Combining the above 2 conditions into 1 equation gives:

Log Loss = \(-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\)
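A minimal sketch of the combined formula; the clipping constant `eps` is an added assumption to guard against \(log(0)\):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Per-sample binary cross-entropy; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(log_loss(1, 0.9))   # small loss (~0.105): confident and correct
print(log_loss(1, 0.1))   # large loss (~2.303): confident and wrong
```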

Cost Function
\[J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\]

We need to find the weights 🏋️‍♀️ ‘w’ that minimize the cost 💵 function.

Gradient Descent
  • Weight update: \[w_{new}=w_{old}-\eta.\frac{\partial{J(w)}}{\partial{w_{old}}}\]

We need to find the gradient of log loss w.r.t weight ‘w’.

Gradient Calculation

Chain Rule:

\[\frac{\partial{J(w)}}{\partial{w}} = \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}}\]
  • Cost Function: \(J(w) = -\frac{1}{n}\sum [y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\)
  • Prediction: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\)
  • Distance of Point: \(z = \mathbf{w^Tx} + w_0\)
Cost 💰Function Derivative
\[ J(w) = -\sum [ylog(\hat{y}) + (1-y)log(1-\hat{y})]\]

(The constant factor \(\frac{1}{n}\) is dropped here, since it does not change where the minimum lies.)

How does the loss change w.r.t the prediction?

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{\hat{y}}} &= - [\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}] \\ &= -[\frac{y- \cancel{y\hat{y}} -\hat{y} + \cancel{y\hat{y}}}{\hat{y}(1-\hat{y})}] \\ \therefore \frac{\partial{J(w)}}{\partial{\hat{y}}} &= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \end{align*} \]
Prediction Derivative
\[ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \]

How does the prediction change w.r.t the distance?

\[ \begin{align*} \frac{\partial{\hat{y}}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{z}} = \sigma'(z) \\ \sigma'(z) &= \sigma(z)(1-\sigma(z)) \\ \therefore \frac{\partial{\hat{y}}}{\partial{z}} &= \hat{y}(1-\hat{y}) \end{align*} \]
Sigmoid Derivative
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]\[ \begin{align} &\text {Let } u = 1 + e^{-z} \nonumber \\ &\implies \sigma(z) = \frac{1}{u}, \quad \text{so, } \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{z}} = \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \nonumber \\ &\frac{\partial{\sigma(z)}}{\partial{u}} = -\frac{1}{u^2} = - \frac{1}{(1 + e^{-z})^2} \\ &\text{and } \frac{\partial{u}}{\partial{z}} = \frac{\partial{(1 + e^{-z})}}{\partial{z}} = -e^{-z} \end{align} \]

from equations (1) & (2):

\[ \begin{align*} \because \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{\partial{\sigma(z)}}{\partial{u}}. \frac{\partial{u}}{\partial{z}} \\ \implies \frac{\partial{\sigma(z)}}{\partial{z}} &= - \frac{1}{(1 + e^{-z})^2}. -e^{-z} = \frac{e^{-z}}{(1 + e^{-z})^2} \\ 1 - \sigma(z) & = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} \\ \frac{\partial{\sigma(z)}}{\partial{z}} &= \frac{1}{1 + e^{-z}}.\frac{e^{-z}}{1 + e^{-z}} \\ \therefore \frac{\partial{\sigma(z)}}{\partial{z}} &= \sigma(z).(1-\sigma(z)) \end{align*} \]
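A quick numerical sanity check of \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) using a central finite difference; the point z = 0.7 and the step size h are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference slope
analytic = sigmoid(z) * (1 - sigmoid(z))                # closed-form derivative
print(numeric, analytic)   # the two values agree to ~6 decimal places
```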
Distance Derivative
\[z=w^{T}x+w_{0}\]

How does the distance change w.r.t the weight 🏋️‍♀️?

\[ \frac{\partial{z}}{\partial{w}} = \mathbf{x} \]

\[\because \frac{\partial{(a^T\mathbf{x})}}{\partial{\mathbf{x}}} = a\]
Gradient Calculation (combined)

Chain Rule:

\[ \begin{align*} \frac{\partial{J(w)}}{\partial{w}} &= \frac{\partial{J(w)}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{z}}.\frac{\partial{z}}{\partial{w}} \\ &= \frac{\hat{y} - y}{\cancel{\hat{y}(1-\hat{y})}}.\cancel{\hat{y}(1-\hat{y})}.x \\ \therefore \frac{\partial{J(w)}}{\partial{w}} &= (\hat{y} - y).x \end{align*} \]
Cost 💰Function Derivative
\[\frac{\partial{J(w)}}{\partial{w}} = \sum (\hat{y_i} - y_i).x_i\]

Gradient = Error x Input

  • Error = \((\hat{y_i}-y_i)\): how far the prediction is from the truth.
  • Input = \(x_i\): the contribution of a specific feature to the error.
Gradient Descent (update)

Weight update:

\[w_{new} = w_{old} - \eta. \sum_{i=1}^n (\hat{y_i} - y_i).x_i\]
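Putting the pieces together, a minimal batch gradient-descent sketch on a small synthetic dataset; the learning rate, iteration count, and data-generating weights are all illustrative assumptions (the gradient is averaged over samples instead of summed, which only rescales the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic 2-class dataset (illustrative only).
n = 200
X = rng.normal(size=(n, 2))
true_w, true_w0 = np.array([2.0, -3.0]), 0.5
y = (X @ true_w + true_w0 + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
w0 = 0.0
eta = 0.1          # learning rate (assumed value)

for _ in range(500):
    y_hat = sigmoid(X @ w + w0)
    error = y_hat - y                 # (y_hat - y)
    w  -= eta * (X.T @ error) / n     # gradient = error x input, averaged over samples
    w0 -= eta * error.mean()          # bias term uses input = 1

print(w, w0)   # learned weights point in roughly the same direction as true_w
```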
Why can’t MSE be used as the Loss Function?

Mean Squared Error (MSE) cannot be used to quantify error/loss in binary classification because:

  • Convexity: MSE combined with the Sigmoid is non-convex, so Gradient Descent can get trapped in local minima.
  • Penalty: MSE does not appropriately penalize confident mis-classifications in binary classification.
    • e.g.: If the actual class is 1 but the model predicts a probability of 0, then MSE = \((1-0)^2 = 1\), which is a very small penalty, whereas log loss = \(-log(0) = \infty\) (see the numeric comparison below).
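A tiny numeric comparison of the two penalties for a confidently wrong prediction; the probability \(10^{-12}\) stands in for “near 0”:

```python
import numpy as np

y_true, y_pred = 1.0, 1e-12          # actual class 1, model predicts ~0

mse = (y_true - y_pred) ** 2         # bounded at 1, no matter how wrong
log_loss = -np.log(y_pred)           # explodes as the prediction approaches 0

print(mse)        # ~1.0
print(log_loss)   # ~27.6, and -> infinity as y_pred -> 0
```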



End of Section

3 - Regularization

Regularization in Logistic Regression

What happens to the weights of Logistic Regression if the data is perfectly linearly separable?

The weights 🏋️‍♀️ will tend towards infinity, preventing a stable solution.

The model tries to make the probabilities exactly 0 or 1, but the sigmoid function never reaches these limits, so the weights 🏋️‍♀️ keep growing to push the probabilities ever closer to the extremes.

  • Distance of Point: \(z = \mathbf{w^Tx} + w_0\)

  • Prediction: \(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\)

  • Log loss: \(-[y_ilog(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})] \)

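A small sketch of this effect on a perfectly separable 1-D toy dataset: the unregularized weight keeps growing the longer we train (the data points, learning rate, and iteration counts are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A perfectly separable 1-D toy dataset (illustrative).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, eta = 0.0, 1.0
for step in range(1, 100001):
    y_hat = sigmoid(w * x)
    w -= eta * np.sum((y_hat - y) * x)   # gradient step on the log loss
    if step in (10, 100, 1000, 10000, 100000):
        print(step, round(w, 3))         # |w| keeps growing; it never settles
```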
Why is it a problem?
Overfitting:
The model becomes perfectly accurate on the training 🏃‍♂️ data but fails to generalize, performing poorly on unseen data.
Solution 🦉
Regularization:
Adds a penalty term to the loss function, discouraging weights 🏋️‍♀️ from becoming too large.
L1 Regularization
\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_1 \sum_{j=1}^d |w_j|}_{\text{L1 Regularization}} \\ \end{align*} \]
L2 Regularization
\[ \begin{align*} \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ & \underbrace{- \sum_{i=1}^n [y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})]}_{\text{Log Loss}} \\ & \underbrace{+ \lambda_2 \sum_{j=1}^d w_j^2}_{\text{L2 Regularization}} \\ \end{align*} \]

(Here \(i\) indexes the \(n\) training samples and \(j\) indexes the \(d\) model weights.)
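In practice the penalty is usually set through a library option; a short scikit-learn sketch on a synthetic dataset, where `C` is the inverse of the regularization strength \(\lambda\):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # 5 features, only 2 are informative
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# C = 1/lambda: smaller C means stronger regularization.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print(l1.coef_)   # L1 tends to drive irrelevant weights exactly to 0
print(l2.coef_)   # L2 shrinks all weights but rarely zeroes them out
```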

Read more about Regularization



End of Section

4 - Log Odds

Log Odds

What is the meaning of Odds?

Odds compare the likelihood of an event happening vs. not happening.

Odds = \(\frac{p}{1-p}\)

  • p = probability of success
Log Odds (Logit) Assumption

In logistic regression we assume that the Log-Odds (the log of the ratio of the probability of the positive class to that of the negative class) is a linear function of the inputs.

Log-Odds (Logit) = \(log_e \frac{p}{1-p}\)

Log Odds (Logit)

Log Odds = \(log_e \frac{p}{1-p} = z\)

\[z=w^{T}x+w_{0}\]

\[ \begin{align*} &log_{e}(\frac{p}{1-p}) = z \\ &\implies \frac{p}{1-p} = e^{z} \\ &\implies p = e^z - p.e^z \\ &\implies p(1 + e^z) = e^z \\ &\implies p = \frac{e^z}{1+e^z} \\ &\text { divide numerator and denominator by } e^z \\ &\implies p = \frac{1}{1+e^{-z}} \quad \text { i.e., the Sigmoid function} \end{align*} \]
Sigmoid Function
The sigmoid function is the inverse of the log-odds (logit) function: it converts log-odds back to a probability, and vice versa.
Range of Values
  • Probability: 0 to 1
  • Odds: 0 to + \(\infty\)
  • Log Odds: -\(\infty\) to +\(\infty\)
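A small sketch of the round trip between probability, odds, and log-odds; the probability 0.8 is an arbitrary example:

```python
import numpy as np

p = 0.8                             # probability of success (arbitrary example)
odds = p / (1 - p)                  # 4.0 -> success is 4x as likely as failure
log_odds = np.log(odds)             # ~1.386, the logit

# The sigmoid inverts the logit: we recover the original probability.
p_back = 1.0 / (1.0 + np.exp(-log_odds))
print(odds, log_odds, p_back)       # 4.0  1.386...  0.8
```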



End of Section

5 - Probabilistic Interpretation

Probabilistic Interpretation of Logistic Regression

Why do we use Log Loss in Binary Classification?
To understand that, let’s have a look 👀 at the statistical assumptions.
Bernoulli Assumption

We assume that our target variable ‘y’ follows a Bernoulli distribution, i.e., it has exactly 2 outcomes: success/failure.

  • P(Y=1|X) = p
  • P(Y=0|X) = 1- p

Combining the above 2 into 1 equation gives:

  • P(Y=y|X) = \(p^y(1-p)^{1-y}\)
Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.’

We want to find the weights 🏋️‍♀️‘w’ that maximize the likelihood of seeing the data.

  • Data, D = \(\{ (x_i, y_i) \}_{i=1}^n , \quad y_i \in \{0,1\}\)

We do this by maximizing likelihood function.

Likelihood Function
\[\mathcal{L}(w) = \prod_{i=1}^n [p_i^{y_i}(1-p_i)^{1-y_i}]\]

Assumption: Training data is I.I.D. (independent and identically distributed).

Problem 🦀
Multiplying many small probabilities is computationally difficult and prone to numerical underflow.
Solution🦉

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function, so maximizing the log-likelihood gives the same weights as maximizing the likelihood itself.

Log Likelihood Function
\[ \begin{align*} log\mathcal{L}(w) &= \sum_{i=1}^n log [p_i^{y_i}(1-p_i)^{1-y_i}] \\ \therefore log\mathcal{L}(w) &= \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \end{align*} \]

Maximizing the log-likelihood is the same as minimizing the negative log-likelihood.

\[ \begin{align*} \underset{w}{\mathrm{max}}\ log\mathcal{L}(w) &= \underset{w}{\mathrm{min}} - log\mathcal{L}(w) \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= - \sum_{i=1}^n [ y_ilog(p_i) + (1-y_i)log(1-p_i)] \\ \underset{w}{\mathrm{min}} - log\mathcal{L}(w) &= \text {Log Loss} \end{align*} \]
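A short numeric check that the negative log-likelihood under the Bernoulli assumption is exactly the summed log loss; the labels and predicted probabilities below are made up for illustration:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # observed labels (illustrative)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # model probabilities P(Y=1|X)

# Likelihood of the data under the Bernoulli assumption, then its negative log.
likelihood = np.prod(p**y * (1 - p)**(1 - y))
neg_log_likelihood = -np.log(likelihood)

# Summed log loss over the same samples.
log_loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(neg_log_likelihood, log_loss)        # the two numbers match
```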
Inference
Log Loss is not chosen arbitrarily; it follows directly from the Bernoulli assumption and Maximum Likelihood Estimation.



End of Section