Probabilistic Interpretation of Logistic Regression
Why do we use Log Loss in Binary Classification?
To understand that, let’s take a look 👀 at the statistical assumptions behind it.
Bernoulli Assumption
We assume that our target variable ‘y’ follows a Bernoulli distribution, i.e., it has exactly two possible outcomes: success/failure.
- \(P(Y=1 \mid X) = p\)
- \(P(Y=0 \mid X) = 1 - p\)
Combining the above two into a single equation gives:
- \(P(Y=y \mid X) = p^y(1-p)^{1-y}\)
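As a quick sanity check, here is a minimal sketch (the value of \(p\) is illustrative, not from the post) showing that the single expression reduces to \(p\) when \(y=1\) and to \(1-p\) when \(y=0\):

```python
def bernoulli_pmf(y, p):
    # Combined form: p^y * (1 - p)^(1 - y)
    return p**y * (1 - p)**(1 - y)

p = 0.7                      # assumed P(Y=1|X), for illustration only
print(bernoulli_pmf(1, p))   # 0.7  -> equals p
print(bernoulli_pmf(0, p))   # ~0.3 -> equals 1 - p
```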
Maximum Likelihood Estimate (MLE)
‘Find the most plausible explanation for what I see.’
We want to find the weights 🏋️♀️‘w’ that maximize the likelihood of seeing the data.
- Data, \(D = \{(x_i, y_i)\}_{i=1}^n, \quad y_i \in \{0,1\}\)
We do this by maximizing the likelihood function.
Likelihood Function
\[\mathcal{L}(w) = \prod_{i=1}^n [p_i^{y_i}(1-p_i)^{1-y_i}]\]
Here \(p_i = \sigma(w^\top x_i)\) is the model’s predicted probability that \(y_i = 1\).
Assumption: The training data is i.i.d. (independent and identically distributed), which is what lets us write the likelihood as a product.
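A minimal sketch of this likelihood on a tiny made-up dataset (the labels and predicted probabilities below are placeholders, not from the post):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # observed labels y_i
p = np.array([0.9, 0.2, 0.8, 0.7, 0.4])   # model's predicted P(y_i = 1 | x_i)

# L(w) = product over i of p_i^y_i * (1 - p_i)^(1 - y_i)
likelihood = np.prod(p**y * (1 - p)**(1 - y))
print(likelihood)  # ≈ 0.242 for these toy numbers
```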
Problem 🦀
Multiplying many probabilities, each less than 1, quickly underflows to zero in floating point and is prone to numerical errors.
Solution 🦉
A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.
Note: Log is a strictly monotonically increasing function, so it preserves the location of the maximum.
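A small sketch of the numerical issue (the sample size and probabilities are arbitrary): the raw product underflows to zero in float64, while the log-sum stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
# per-example probability assigned to the correct label, i.e. p_i^y_i * (1 - p_i)^(1 - y_i)
per_example = rng.uniform(0.6, 0.9, size=5000)

print(np.prod(per_example))         # 0.0 -> product of many values < 1 underflows
print(np.sum(np.log(per_example)))  # a finite negative number (roughly -1470 here)
```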
Log Likelihood Function
\[
\begin{align*}
\log\mathcal{L}(w) &= \sum_{i=1}^n \log \left[p_i^{y_i}(1-p_i)^{1-y_i}\right] \\
\therefore \log\mathcal{L}(w) &= \sum_{i=1}^n \left[ y_i\log(p_i) + (1-y_i)\log(1-p_i)\right]
\end{align*}
\]
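Continuing the toy numbers from the likelihood sketch above, a quick check that the sum form matches the log of the product (while the product is still representable):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.8, 0.7, 0.4])

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)                             # ≈ -1.419
print(np.log(np.prod(p**y * (1 - p)**(1 - y))))   # same value, as expected
```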
Maximizing the log-likelihood is the same as minimizing the negative log-likelihood.
\[
\begin{align*}
\underset{w}{\arg\max}\ \log\mathcal{L}(w) &= \underset{w}{\arg\min}\ \big(-\log\mathcal{L}(w)\big) \\
-\log\mathcal{L}(w) &= - \sum_{i=1}^n \left[ y_i\log(p_i) + (1-y_i)\log(1-p_i)\right] \\
&= \text{Log Loss}
\end{align*}
\]
Inference
Log Loss is not chosen arbitrarily; it follows directly from the Bernoulli assumption and maximum likelihood estimation.
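As a final check, the negative log-likelihood above is exactly what libraries report as log loss. Here is a sketch using scikit-learn’s `log_loss` (with `normalize=False` it returns the sum over examples rather than the mean); the data is the same toy example as before:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.8, 0.7, 0.4])

nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(nll)                              # ≈ 1.419
print(log_loss(y, p, normalize=False))  # same value: log loss is the negative log-likelihood
```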