Probabilistic Interpretation

Probabilistic Interpretation of Logistic Regression

Why do we use Log Loss in Binary Classification?
To understand that, let’s have a look 👀 at the statistical assumptions.
Bernoulli Assumption

We assume that our target variable ‘y’ follows a Bernoulli distribution, i.e., it has exactly two possible outcomes: success or failure.

  • P(Y=1|X) = p
  • P(Y=0|X) = 1- p

Combining the above two into one equation gives:

  • P(Y=y|X) = \(p^y(1-p)^{1-y}\)
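
Here’s a tiny sanity check (plain Python; the value of p is just an illustrative number) showing that the single expression above really does collapse to the two cases:

```python
def bernoulli_pmf(y, p):
    """P(Y=y | X) under the Bernoulli assumption, with y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

p = 0.8                       # hypothetical P(Y=1|X)
print(bernoulli_pmf(1, p))    # 0.8  -> equals p
print(bernoulli_pmf(0, p))    # ~0.2 -> equals 1 - p
```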
Maximum Likelihood Estimate (MLE)

‘Find the most plausible explanation for what I see.’

We want to find the weights 🏋️‍♀️‘w’ that maximize the likelihood of seeing the data.

  • Data, \(D = \{(x_i, y_i)\}_{i=1}^n, \quad y_i \in \{0,1\}\)

We do this by maximizing the likelihood function.

Likelihood Function
\[\mathcal{L}(w) = \prod_{i=1}^n \left[ p_i^{y_i}(1-p_i)^{1-y_i} \right], \quad \text{where } p_i = P(Y=1 \mid x_i)\]

Assumption: The training data is i.i.d. (independent and identically distributed), which is what lets us write the joint likelihood as a product of per-example probabilities.
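
As a minimal sketch (NumPy, with a made-up toy dataset and arbitrary weights), the likelihood for logistic regression is just the product of the per-example Bernoulli probabilities, with \(p_i = \sigma(w^Tx_i)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(w, X, y):
    """L(w) = prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}, with p_i = sigmoid(w . x_i)."""
    p = sigmoid(X @ w)
    return np.prod(p**y * (1 - p)**(1 - y))

# Hypothetical tiny dataset: 4 points, bias column folded into X.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1, 0, 1, 0])
w = np.array([0.1, 1.5])      # arbitrary candidate weights
print(likelihood(w, X, y))    # a single number in (0, 1]
```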

Problem 🦀
Multiplying many small probabilities quickly underflows floating-point precision, so the product is computationally impractical and prone to numerical errors.
Solution🦉

A common simplification is to maximize the log-likelihood function instead, which converts the product into a sum.

Note: Log is a strictly monotonically increasing function, so maximizing \(\log\mathcal{L}(w)\) yields the same \(w\) as maximizing \(\mathcal{L}(w)\).
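
A quick illustration of the problem and the fix (the 5000 probabilities below are randomly generated, just to make the underflow visible): the raw product collapses to 0.0 in floating point, while the sum of logs stays perfectly finite.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=5000)   # 5000 hypothetical per-example probabilities

print(np.prod(p))            # 0.0 -> the raw product underflows
print(np.sum(np.log(p)))     # a finite negative number
```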

Log Likelihood Function
\[ \begin{align*} \log\mathcal{L}(w) &= \sum_{i=1}^n \log\left[ p_i^{y_i}(1-p_i)^{1-y_i} \right] \\ \therefore \log\mathcal{L}(w) &= \sum_{i=1}^n \left[ y_i\log(p_i) + (1-y_i)\log(1-p_i) \right] \end{align*} \]

Maximizing the log-likelihood is the same as minimizing the negative log-likelihood.

\[ \begin{align*} \underset{w}{\arg\max}\ \log\mathcal{L}(w) &= \underset{w}{\arg\min}\ \left[-\log\mathcal{L}(w)\right] \\ -\log\mathcal{L}(w) &= -\sum_{i=1}^n \left[ y_i\log(p_i) + (1-y_i)\log(1-p_i) \right] \\ &= \text{Log Loss} \end{align*} \]
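
A small sketch of that last step (made-up predicted probabilities and labels; the scikit-learn cross-check is optional and uses `sklearn.metrics.log_loss` with `normalize=False`, which returns the summed loss):

```python
import numpy as np
from sklearn.metrics import log_loss     # optional cross-check

def negative_log_likelihood(p, y):
    """-log L(w) = -sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ] = Log Loss."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical predicted probabilities and labels.
p = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

print(negative_log_likelihood(p, y))     # summed Log Loss
print(log_loss(y, p, normalize=False))   # should agree with the line above
```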
Inference
Log Loss is not chosen arbitrarily; it follows directly from the Bernoulli assumption and maximum likelihood estimation.
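
To close the loop, here’s a minimal sketch (toy synthetic data, arbitrary learning rate and step count) that actually fits the weights by gradient descent on this Log Loss; the gradient \(X^T(p - y)/n\) comes from differentiating the negative log-likelihood above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=2000):
    """Minimize the mean Log Loss by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n     # gradient of the mean negative log-likelihood
        w -= lr * grad
    return w

# Hypothetical toy data: label depends on the second column plus noise.
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=200)]        # bias column + one feature
y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

print(fit_logistic_regression(X, y))                 # learned (bias, slope)
```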



End of Section