Regularization
Over-Fitting
⭐️ Because trees 🌲 are non-parametric and ‘greedy’ 💰, they will naturally try to grow 📈 until every leaf 🍃 is pure,
effectively memorizing noise and outliers rather than learning generalizable patterns.
Tree 🌲 Size
👉As the tree 🌲 grows, the amount of data in each subtree decreases 📉, leading to splits based on statistically insignificant samples.
Regularization Techniques
- Pre-Pruning ✂️
- Post-Pruning ✂️
Pre-Pruning ✂️
⭐️ ‘Early stopping’ heuristics (hyper-parameters); a short scikit-learn sketch follows the list below.
- max_depth: Limits how many levels of ‘if else’ the tree can have; the most commonly used.
- min_samples_split: A node will only split if it has at least ‘N’ samples; smooths the model (especially in regression) by ensuring predictions are based on an average of multiple points.
- max_leaf_nodes: Limiting the number of leaves reduces the overall complexity of the tree, making it simpler and less likely to memorize the training data’s noise.
- min_impurity_decrease: A split is only made if it reduces the impurity (Gini/MSE) by at least a certain threshold.
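A minimal sketch of setting these four knobs together with scikit-learn’s DecisionTreeClassifier; the synthetic dataset and the specific values below are purely illustrative, not recommendations.

```python
# Pre-pruning sketch: all 'early stopping' heuristics are passed at
# construction time, and growth stops as soon as any of them is hit.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pre_pruned = DecisionTreeClassifier(
    max_depth=5,                  # at most 5 levels of 'if/else'
    min_samples_split=20,         # a node needs >= 20 samples to be split
    max_leaf_nodes=30,            # cap the total number of leaves
    min_impurity_decrease=0.001,  # ignore splits with negligible impurity gain
    random_state=42,
)
pre_pruned.fit(X_train, y_train)
print("Train accuracy:", pre_pruned.score(X_train, y_train))
print("Test accuracy:", pre_pruned.score(X_test, y_test))
```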
Max Depth Hyper Parameter Tuning
Below is an example of tuning the max_depth hyper-parameter.
As we can see below, the cross-validation error decreases until depth=6; beyond that, the further reduction in error is not significant.
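A minimal sketch of the kind of sweep that produces such a curve, assuming scikit-learn and a synthetic dataset (so the exact elbow, depth=6 above, will differ here):

```python
# Scan max_depth values and record the 5-fold cross-validation error for each.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

depths = range(1, 16)
cv_errors = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)  # cross-validated accuracy
    cv_errors.append(1 - scores.mean())         # report as error

best_depth = depths[int(np.argmin(cv_errors))]
print("Cross-validation error by depth:", np.round(cv_errors, 3))
print("Depth minimizing CV error:", best_depth)
```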

Post-Pruning ✂️
Let the tree🌲 grow to its full depth (overfit) and then ‘collapse’ nodes that provide little predictive value.
Most common algorithm:
- Minimal Cost Complexity Pruning
Minimal Cost Complexity Pruning ✂️
💡Define a cost-complexity 💰 measure that penalizes the tree 🌲 for having too many leaves 🍃.
\[R_\alpha(T) = R(T) + \alpha |T|\]

- \(R(T)\): total misclassification rate (or MSE) of the tree
- \(|T|\): number of terminal nodes (leaves)
- \(\alpha\): complexity parameter (the ‘tax’ 💰 on complexity)
Logic:
- If \(\alpha\)=0, the tree is the original overfit tree.
- As \(\alpha\) increases 📈, the penalty for having many leaves grows 📈.
- To minimize the total cost 💰, the model is forced to prune branches that do not significantly reduce R(T).
- Use cross-validation to find the ‘sweet spot’ \(\alpha\) that minimizes validation error, as in the sketch below.
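A minimal sketch of this procedure with scikit-learn’s cost_complexity_pruning_path and the ccp_alpha parameter; the dataset and split are illustrative.

```python
# Minimal Cost Complexity Pruning sketch: grow the full tree, extract the
# candidate alphas from its subtree sequence, and cross-validate to pick one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Grow the full (overfit) tree and get the effective alphas of its subtrees.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # drop the last alpha (prunes to a single node)

# 2. Cross-validate each candidate alpha and keep the best-scoring one.
mean_scores = []
for alpha in ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    mean_scores.append(cross_val_score(pruned, X_train, y_train, cv=5).mean())

best_alpha = ccp_alphas[int(np.argmax(mean_scores))]
best_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
best_tree.fit(X_train, y_train)
print("Best alpha:", best_alpha)
print("Leaves after pruning:", best_tree.get_n_leaves())
print("Test accuracy:", best_tree.score(X_test, y_test))
```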