Random Forest
2 minute read
💡If one feature is extremely predictive (e.g., ‘Area’ for house prices), almost every bootstrap tree will split on that feature at the root.
👉This makes the trees (models) very similar, leading to a high correlation ‘\(\rho\)’ between their predictions.
\[Var(f_{bagging})=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}\]

💡Choose a random subset of ‘m’ features from the total ‘d’ features at each split, reducing the correlation ‘\(\rho\)’ between trees.
👉By forcing trees to split on ‘sub-optimal’ features, we intentionally increase the variance of the individual trees; the bias also increases slightly (the trees are effectively simpler).
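A minimal sketch of the mechanism, assuming scikit-learn and NumPy are available: each tree is fit on a bootstrap sample and restricted to ‘m’ randomly chosen features per split. The toy dataset and names such as `n_trees` and `m` are illustrative, not from the text above.

```python
# Sketch: bagging + per-split feature subsampling (the core Random Forest idea).
# The toy dataset, n_trees, and m are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
d = X.shape[1]
m = int(np.sqrt(d))      # size of the random feature subset per split
n_trees = 100
rng = np.random.default_rng(0)

trees = []
for _ in range(n_trees):
    # Bootstrap sample of the rows (bagging).
    idx = rng.integers(0, len(X), size=len(X))
    # max_features=m makes each split consider only m randomly chosen features,
    # which decorrelates the trees even when one feature is dominant.
    tree = DecisionTreeClassifier(max_features=m,
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote over the ensemble.
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy:", (ensemble_pred == y).mean())
```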
Standard heuristics for choosing ‘m’:
- Classification: \(m = \sqrt{d}\)
- Regression: \(m = \frac{d}{3}\)
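In scikit-learn these heuristics map onto the `max_features` parameter; a hedged sketch (the library defaults differ between versions, so it is safer to set the value explicitly):

```python
# Sketch: applying the standard m heuristics via scikit-learn's max_features.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: m = sqrt(d)
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

# Regression: m = d / 3, passed as a fraction of the feature count
reg = RandomForestRegressor(n_estimators=300, max_features=1 / 3, random_state=0)
```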
💡Because the ‘\(\rho\sigma^{2}\)’ term dominates the ensemble variance when B is large, reducing ‘\(\rho\)’ pushes the overall variance \(Var(f_{rf})\) significantly below that of standard Bagging.
\[Var(f_{rf})=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}\]

💡A Random Forest does not overfit as more trees (B) are added; its variance only converges to the limit ‘\(\rho\sigma^{2}\)’.
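A quick numeric check of the formula, with made-up values of ‘\(\rho\)’ and ‘\(\sigma^{2}\)’: growing B only drives the variance down to the ‘\(\rho\sigma^{2}\)’ floor, so lowering ‘\(\rho\)’ is what lowers the floor itself.

```python
# Evaluate Var = rho*sigma^2 + (1 - rho)/B * sigma^2 for illustrative values.
sigma2 = 1.0
for rho in (0.9, 0.3):              # e.g. highly correlated trees vs. a decorrelated forest
    for B in (1, 10, 100, 10_000):
        var = rho * sigma2 + (1 - rho) / B * sigma2
        print(f"rho={rho:.1f}  B={B:>6}  Var={var:.4f}")
    print(f"  limit as B -> infinity: {rho * sigma2:.4f}")
```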
Overfitting is controlled by:
- depth of the individual trees.
- size of the feature subset ‘m’.
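A sketch of tuning those knobs with a plain grid search; the parameter grid and toy dataset below are illustrative assumptions, not recommended values.

```python
# Sketch: overfitting is controlled through the trees (depth, m, leaf size), not B.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

param_grid = {
    "max_depth": [4, 8, None],        # depth of the individual trees
    "max_features": ["sqrt", 0.5],    # size of the feature subset m
    "min_samples_leaf": [1, 5, 20],   # also smooths out noisy points
}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```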
💡When does a Random Forest shine?

- High Dimensionality: 100s or 1000s of features; RF’s feature sampling prevents a few strong features from masking the others.
- Tabular Data (with Complex Interactions): Captures non-linear relationships without needing manual feature engineering.
- Noisy Datasets: The averaging process makes RF robust to outliers (especially if min_samples_leaf is increased).
- Automatic Validation: The OOB (out-of-bag) error gives a quick estimate of generalization error without running 10-fold CV (see the sketch below).
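A sketch of the OOB estimate in scikit-learn; the dataset and settings are illustrative.

```python
# Sketch: oob_score=True reuses the ~37% of rows left out of each bootstrap sample
# to estimate generalization accuracy without a separate CV loop.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            min_samples_leaf=5, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```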