Extra Trees

2 minute read

Issue with Decision Trees/Random Forest

💡In a standard Decision Tree or Random Forest, the algorithm searches for the optimal split point (the threshold ’s’) that maximizes Information Gain or minimizes MSE.

👉This search is:

computationally expensive (sort + mid-point) and
tends to follow the noise in the training 🏃‍♂️data.

Adding Randomness

Adding randomness (right kind) in ensemble averaging reduces correlation/variance.

\[Var(f_{bag})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Extremely Randomized (ExtRa) Trees

Random Thresholds: Instead of searching for the best split point (computationally expensive 💰) for a feature, it picks a threshold at random from a uniform distribution between the feature’s local minimum and maximum.
Entire Dataset: Uses entire training dataset (default) for every tree; no bootstrapping.
Random Feature Subsets: Random subset of m<n features is used in each decision tree.

Mathematical Intuition

\[Var(f_{et})=ρ\sigma^{2}+\frac{1-ρ}{B}\sigma^{2}\]

Picking thresholds randomly has two effects:

Structural correlation between trees becomes extremely low.
Individual trees are ‘weaker’ and have higher bias than a standard optimized tree.

👉The massive drop in ‘\(\rho\)’ often outweighs the slight increase in bias, leading to an overall ensemble that is smoother and more robust to noise than a standard Random Forest.

Note: Extra Trees are almost always grown to full depth, as they may need extra splits to find the same decision boundary.

When to use Extra Trees ?

Performance: Significantly faster to train, as it does not sort data to find optimal split.
Note: If we are working with billions of rows or thousands of features, ET can be 3x to 5x faster than a Random Forest(RF).
Robustness to Noise: By picking thresholds randomly, tends to ‘handle’ the noise more effectively than RF.
Feature Importance: Because ET is so randomized, it often provides more ‘stable’ feature importance scores.

Note: It is less likely to favor a high-cardinality feature (e.g. zip-code) just because it has more potential split points.

Extremely Randomized (Extra) Trees | Decision Trees | Why Random Splitting gives Performance Gain?

Previous: Random Forest Next: Boosting

End of Section