Purity Metrics

Pure Leaf 🍃 Node?

Decision trees recursively partition the data based on feature values.

images/machine_learning/supervised/decision_trees/purity_metrics/slide_02_01.tif

Pure Leaf 🍃 Node: Terminal node where every single data point belongs to the same class.

💡Zero Uncertainty.
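A minimal sketch of that definition (assuming the labels reaching a node arrive as a plain Python list; `is_pure` is a hypothetical helper, not a library function):

```python
# A leaf is pure when every data point in it carries the same class label.
def is_pure(labels):
    return len(set(labels)) == 1

print(is_pure(["cat", "cat", "cat"]))  # True  -> zero uncertainty
print(is_pure(["cat", "dog", "cat"]))  # False -> the node is still impure
```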

So, what logic should we use to partition the data at each step, or at each node?

The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

So, what is information gain?
How do we reduce uncertainty?

Let’s understand a few terms first, before we get to information gain.

Entropy

Measure ⏱ of uncertainty, randomness, or impurity in a dataset.

\[H(S)=-\sum _{i=1}^{n}p_{i}\log(p_{i})\]

Binary Entropy:

images/machine_learning/supervised/decision_trees/purity_metrics/slide_04_01.png
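A minimal sketch of the formula (assuming log base 2, the usual convention, which measures entropy in bits; the formula above leaves the base unspecified):

```python
import math

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 0.0 - sum(p * math.log2(p) for p in probs)  # 0.0 - ... avoids printing -0.0

print(entropy([0, 0, 0, 0]))  # 0.0    -> pure node, zero uncertainty
print(entropy([0, 1, 0, 1]))  # 1.0    -> 50/50 split, maximum binary entropy
print(entropy([0, 0, 0, 1]))  # ~0.811 -> mostly one class, low uncertainty
```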
Surprise 😮 Factor

💡Entropy can also be viewed as the ‘average surprise’.

  • A highly certain event provides little information when it occurs (low surprise).

  • An unlikely event provides a lot of information (high surprise).

    images/machine_learning/supervised/decision_trees/purity_metrics/slide_06_01.png
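To make “surprise” concrete, a short sketch of the self-information of a single event, −log₂(p) (assuming base-2 logs as above; entropy is then the expected value of this quantity over all outcomes):

```python
import math

# Surprise (self-information) of an event with probability p, in bits.
def surprise(p):
    return -math.log2(p)

print(surprise(0.99))  # ~0.014 bits -> near-certain event, almost no surprise
print(surprise(0.50))  # 1.0 bit     -> a fair coin flip
print(surprise(0.01))  # ~6.64 bits  -> rare event, highly informative
```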
Information Gain💰

⭐️Measures ⏱ the reduction in entropy (uncertainty) achieved by splitting a dataset based on a specific attribute.

\[IG=Entropy(Parent)-\left[\frac{N_{left}}{N_{parent}}Entropy(Child_{left})+\frac{N_{right}}{N_{parent}}Entropy(Child_{right})\right] \]

Note: at every node, the algorithm evaluates the candidate splits and keeps the one with the highest information gain, i.e. the one that removes the most uncertainty.
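A minimal sketch of the formula for a binary split (`information_gain` and the toy labels are illustrative, not library API; entropy uses base-2 logs as before):

```python
import math

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 0.0 - sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    # IG = H(parent) - weighted average of the child entropies.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # 1.0 -> perfect split
print(information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 -> useless split
```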

Gini 🧞‍♂️Impurity

⭐️Measures ⏱ the probability of an element being incorrectly classified if it were randomly labeled according to the distribution of labels in a node.

\[Gini(S)=1-\sum_{i=1}^{n}(p_{i})^{2}\]
  • Range: 0 (pure) to 0.5 (maximum impurity with two classes; more generally, the upper bound is 1 − 1/n for n classes)

Note: Gini is the default criterion in libraries like scikit-learn because it avoids the computationally expensive 💰 log function.
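A minimal sketch of the formula, plus the corresponding scikit-learn switch (the toy labels are illustrative; `criterion="entropy"` is the documented way to opt out of the Gini default):

```python
def gini(labels):
    # Gini(S) = 1 - sum(p_i^2) over the class proportions p_i.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 0, 0]))  # 0.0 -> pure
print(gini([0, 0, 1, 1]))  # 0.5 -> maximum impurity for two classes

# scikit-learn defaults to Gini; entropy is opt-in.
from sklearn.tree import DecisionTreeClassifier
clf_gini = DecisionTreeClassifier()                        # criterion="gini" by default
clf_entropy = DecisionTreeClassifier(criterion="entropy")
```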

Gini Impurity Vs Entropy
  • Gini Impurity is a first-order approximation of Entropy: substituting the Taylor expansion ln(p) ≈ p − 1 into −Σ pᵢ ln(pᵢ) gives Σ pᵢ(1 − pᵢ) = 1 − Σ pᵢ², which is exactly the Gini formula.

  • For most real-world cases, choosing one over the other results in the same tree structure or a negligible difference in accuracy.

  • When we plot the two functions, they follow nearly identical shapes (see the numeric comparison after this list).

    images/machine_learning/supervised/decision_trees/purity_metrics/slide_10_01.tif
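A quick numeric comparison of the two curves for a binary problem (scaling entropy by ½ so both peak at 0.5 makes the similar shapes easy to see; the scaling factor is a presentation choice, not part of either definition):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binary_gini(p):
    return 1 - p**2 - (1 - p)**2  # simplifies to 2*p*(1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p={p:.2f}  gini={binary_gini(p):.3f}  entropy/2={binary_entropy(p)/2:.3f}")
```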



End of Section