Cross Entropy & KL Divergence


In this section, we will understand Entropy, Cross Entropy, KL Divergence and JS Divergence.
All of these topics come from the field of ‘Information Theory’ and can be understood through the lens of
‘Information’ or ‘Surprise’.


📘

Surprise Factor:
It is a measure of the amount of information gained when a specific, individual event occurs,
and is defined as the negative logarithm of the probability of that event.

  • It is also called ‘Surprisal’.
\[ S(x) = -log(P(x)) \]

Note: The logarithm (for base > 1) is a monotonically increasing function, so as x increases, log(x) also increases.

  • So, if probability P(x) increases, then surprise factor S(x) decreases.
  • Common events have a high probability of occurrence, hence a low surprise factor.
  • Rare events have a low probability of occurrence, hence a high surprise factor.

Units:

  • With log base 2, the unit of surprise factor is bits.
  • With base ‘e’ (natural log), it is nats (natural units of information).
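
To make the definition concrete, here is a minimal Python sketch (using only the standard math module; the helper name surprisal is just an illustrative choice, not a library function) that computes the surprise factor in bits or nats:

```python
import math

def surprisal(p, base=2):
    """Surprise factor S(x) = -log(P(x)); base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

print(surprisal(0.9))                # common event   -> ~0.15 bits (low surprise)
print(surprisal(0.01))               # rare event     -> ~6.64 bits (high surprise)
print(surprisal(0.5, base=math.e))   # fair coin toss -> ~0.69 nats
```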


📘

Entropy:
It conveys how much ‘information’ we expect to gain from a random event.

  • Entropy is the average or expected value of the surprise factor of a random variable.
  • The more uniform the distribution, the greater the average surprise factor, since all possible outcomes are equally likely.
\[ H(X) = E[-log(P(x))] = -\sum_{x \in X} P(x)log(P(x)) \]
  • Case 1:
    Toss a fair coin; \(P(H) = P(T) = 0.5 = 2^{-1}\)
    Surprise factor (Heads or Tails) = \(-log(P(x)) = -log_2(0.5) = -log_2(2^{-1}) = 1~bit\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.5*1 + 0.5*1 = 1 ~bit \)

  • Case 2:
    Toss a biased coin; \(P(H) = 0.9,~ P(T) = 0.1 \)
    Surprise factor (Heads) = \(-log(P(x)) = -log_2(0.9) \approx 0.15 ~bits \)
    Surprise factor (Tails) = \(-log(P(x)) = -log_2(0.1) \approx 3.32 ~bits\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.9*0.15 + 0.1*3.32 \approx 0.47 ~bits\)

Therefore, a biased coin is less surprising on average than a fair coin, and hence has lower entropy.
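
As a quick sanity check on the two cases above, here is a minimal Python sketch (the entropy helper is illustrative, not from any particular library) that reproduces these numbers:

```python
import math

def entropy(p_dist, base=2):
    """H(X) = -sum P(x)*log(P(x)); terms with P(x) = 0 contribute 0 by convention."""
    return -sum(p * math.log(p, base) for p in p_dist if p > 0)

print(entropy([0.5, 0.5]))  # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin -> ~0.47 bits
```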



📘

Cross Entropy:
It is a measure of the average ‘information gain’ or ‘surprise’ when using an imperfect model \(Q\) to encode events from a true model \(P\).

  • It measures how surprised we are on average if the true distribution is \(P\), but we make predictions using another distribution \(Q\).
\[ H(P,Q) = E[-log(Q(x))] = -\sum_{i=1}^n P(x_i)log(Q(x_i)) \\[10pt] \text{ where true distribution of X} \sim P \text{ but the predictions are made using another distribution } Q \]

A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the cross-entropy for the outputs of 2 models A & B.

  • Model A:
    \(Q_A = [0.8 ~(cat), 0.2 ~(dog)]\)
    Cross Entropy = \(H(P, Q_A) = -\sum_{i=1}^n P(x_i)log(Q_A(x_i)) \)
    \(= -[1*log_2(0.8) + 0*log_2(0.2)] \approx 0.32 ~bits\)
    Note: This is a very low cross-entropy, since the predicted value is very close to the actual one, i.e., low surprise.

  • Model B:
    \(Q_B = [0.2 ~(cat), 0.8 ~(dog)]\)
    Cross Entropy = \(H(P, Q_B) = -\sum_{i=1}^n P(x_i)log(Q_B(x_i)) \)
    \(= -[1*log_2(0.2) + 0*log_2(0.8)] \approx 2.32 ~bits\)
    Note: Here the cross-entropy is very high, since the predicted value is quite far from the actual truth, i.e., high surprise.
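
The same two model outputs can be checked with a short Python sketch (a minimal illustration of the formula, not a library implementation; in practice ML frameworks compute cross-entropy from logits for numerical stability):

```python
import math

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum P(x)*log(Q(x)); only terms with P(x) > 0 contribute."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

P   = [1.0, 0.0]   # true label: cat
Q_A = [0.8, 0.2]   # Model A's prediction
Q_B = [0.2, 0.8]   # Model B's prediction

print(cross_entropy(P, Q_A))  # ~0.32 bits (low surprise)
print(cross_entropy(P, Q_B))  # ~2.32 bits (high surprise)
```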



📘

Kullback Leibler (KL) Divergence:
It measures the information lost when one probability distribution \(Q\) is used to approximate another distribution \(P\).
It quantifies the ‘extra cost’ in bits needed to encode data using the approximate distribution \(Q\) instead of the true distribution \(P\).

KL Divergence = Cross Entropy(P,Q) - Entropy(P)

\[ \begin{aligned} D_{KL}(P \parallel Q) &= H(P, Q) - H(P) \\ &= -\sum_{i=1}^n P(x_i)log(Q(x_i)) - [-\sum_{i=1}^n P(x_i)log(P(x_i))] \\[10pt] &= \sum_{i=1}^n P(x_i)[log(P(x_i)) - log(Q(x_i)) ] \\[10pt] D_{KL}(P \parallel Q) &= \sum_{i=1}^n P(x_i)log(\frac{P(x_i)}{Q(x_i)}) \\[10pt] \text{ For continuous case: } \\ D_{KL}(P \parallel Q) &= \int_{-\infty}^{\infty} p(x)log(\frac{p(x)}{q(x)})dx \end{aligned} \]

Note:

  • If \(P = Q\), i.e., P and Q are the same distribution, then KL Divergence = 0.
  • KL divergence is NOT symmetrical, i.e., \(D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)\).

Using the same cat vs dog classification example as above, where the true label is ‘cat’ and the true distribution is
\(P = [1.0 ~(cat), 0.0 ~(dog)]\), let’s calculate the KL divergence for the outputs of the 2 models A & B.
Let’s calculate the entropy first; the cross-entropy values calculated above can be reused.
Entropy: \(H(P) = -\sum_{x \in X} P(x)log(P(x)) = -[1*log_2(1) + 0*log_2(0)] = 0 ~bits\)
Note: \(0*log_2(0)\) is taken to be 0 by convention, since \(\lim_{p \to 0^+} p~log(p) = 0\), even though \(log(0)\) itself is undefined.

  • Model A:
    \( D_{KL}(P \parallel Q_A) = H(P, Q_A) - H(P) = 0.32 - 0 = 0.32 ~bits \)
    Model A incurs an additional 0.32 bits of surprise due to its imperfect prediction.

  • Model B:
    \( D_{KL}(P \parallel Q_B) = H(P, Q_B) - H(P) = 2.32 - 0 = 2.32 ~bits \)
    Model B has much more ‘information loss’, or incurs a higher ‘penalty’ of 2.32 bits, as its prediction was far from the truth.
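
Here is a minimal Python sketch of the discrete formula (illustrative helper names only; the fair-coin distribution at the end is an extra example, not part of the text above, added just to show the asymmetry):

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum P(x)*log(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

P   = [1.0, 0.0]   # true label: cat
Q_A = [0.8, 0.2]   # Model A's prediction
Q_B = [0.2, 0.8]   # Model B's prediction

print(kl_divergence(P, Q_A))  # ~0.32 bits (= H(P, Q_A), since H(P) = 0)
print(kl_divergence(P, Q_B))  # ~2.32 bits

# Asymmetry demo with a fair coin vs Model A's prediction:
fair = [0.5, 0.5]
print(kl_divergence(fair, Q_A))  # ~0.32 bits
print(kl_divergence(Q_A, fair))  # ~0.28 bits -> D_KL(P || Q) != D_KL(Q || P)
```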



📘

Jensen-Shannon Divergence:
It is a smoothed and symmetric version of the Kullback-Leibler (KL) divergence and is calculated by averaging the
KL divergences between each of the two distributions and their combined average distribution.

  • Symmetrical and smoothed version of KL divergence.
  • Always finite; KL divergence can be infinite if \( P(x) \neq 0 \) but \( Q(x) = 0 \) for some outcome \(x\).
    • Makes JS divergence more stable for ML models where some predicted probabilities may be exactly 0.
\[ D_{JS}(P \parallel Q) = \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\ \text{ where: } M = \frac{P + Q}{2} \]

Let’s continue the cat and dog image classification example discussed above, where the true label is ‘cat’, so the true distribution is
\(P = [1.0 ~(cat), 0.0 ~(dog)]\), and we compare it against Model A’s prediction \(Q = Q_A = [0.8 ~(cat), 0.2 ~(dog)]\).

Step 1: Calculate the average distribution M.

\[ M = \frac{P + Q}{2} = \frac{1}{2} [[1.0, 0.0] + [0.8, 0.2]] \\[10pt] => M = [0.9, 0.1] \]

Step 2: Calculate \(D_{KL}(P \parallel M)\).

\[ \begin{aligned} D_{KL}(P \parallel M) &= \sum_{i=1}^n P(x_i)log_2(\frac{P(x_i)}{M(x_i)}) \\[10pt] &= 1*log_2(\frac{1}{0.9}) + 0*log_2(\frac{0}{0.1}) \\[10pt] & = log_2(1.111) + 0 \\[10pt] => D_{KL}(P \parallel M) &\approx 0.152 ~bits \\[10pt] \end{aligned} \]

Step 3: Calculate \(D_{KL}(Q \parallel M)\).

\[ \begin{aligned} D_{KL}(Q \parallel M) &= \sum_{i=1}^n Q(x_i)log_2(\frac{Q(x_i)}{M(x_i)}) \\[10pt] &= 0.8*log_2(\frac{0.8}{0.9}) + 0.2*log_2(\frac{0.2}{0.1}) \\[10pt] &= 0.8*log_2(0.888) + 0.2*log_2(2) \\[10pt] &= 0.8*(-0.17) + 0.2*1 \\[10pt] &= -0.136 + 0.2 \\[10pt] => D_{KL}(Q \parallel M) &\approx 0.064 ~bits \\[10pt] \end{aligned} \]

Step 4: Finally, let’s put it all together to calculate \(D_{JS}(P \parallel Q)\).

\[ \begin{aligned} D_{JS}(P \parallel Q) &= \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\[10pt] &= \frac{1}{2}[0.152 + 0.064] \\[10pt] &= \frac{1}{2}*0.216 \\[10pt] => D_{JS}(P \parallel Q) &= 0.108 ~ bits \\[10pt] \end{aligned} \]

Therefore, the lower the JS divergence value, the more similar P and Q are.
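
Here is a minimal Python sketch tying the pieces together (illustrative helpers only, not a library API), which reproduces the 0.108 bits computed above and shows that JS divergence is symmetric:

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q); terms with P(x) = 0 contribute 0 by convention."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, base=2):
    """D_JS(P || Q) = 0.5*[D_KL(P || M) + D_KL(Q || M)], where M = (P + Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m, base) + kl_divergence(q, m, base))

P = [1.0, 0.0]   # true label: cat
Q = [0.8, 0.2]   # Model A's prediction

print(js_divergence(P, Q))  # ~0.108 bits, matching the hand calculation
print(js_divergence(Q, P))  # same value: JS divergence is symmetric
```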



End of Section