Random Forest

Problem with Bagging

💡If one feature is extremely predictive (e.g., ‘Area’ for house prices), almost every bootstrap tree will split on that feature at the root.

👉This makes the trees (models) very similar, leading to a high correlation ‘\(\rho\)’.

\[Var(f_{bagging})=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}\]
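
To make the formula concrete, here is a minimal numeric sketch; the values \(\sigma^{2}=1\) and \(\rho=0.9\) are illustrative assumptions, not figures from the text. It shows that when trees are highly correlated, adding more of them barely reduces the ensemble variance.

```python
# Minimal sketch of the bagging variance formula above.
# sigma2 = variance of a single tree, rho = pairwise correlation between trees.
# The values rho = 0.9 and sigma2 = 1.0 are illustrative assumptions.

def bagging_variance(rho: float, sigma2: float, B: int) -> float:
    """Var(f_bagging) = rho * sigma2 + (1 - rho) / B * sigma2"""
    return rho * sigma2 + (1 - rho) / B * sigma2

sigma2, rho = 1.0, 0.9  # highly correlated trees, e.g. all splitting on 'Area' first

for B in (1, 10, 100, 1000):
    print(B, round(bagging_variance(rho, sigma2, B), 4))
# 1 1.0 | 10 0.91 | 100 0.901 | 1000 0.9001 -> stuck near rho * sigma2
```
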
Feature Sub Sampling

💡Choose a random subset of ‘m’ features from the total ‘d’ features at each split, reducing the correlation ‘\(\rho\)’ between trees.

👉By sometimes forcing trees to split on ‘sub-optimal’ features, we intentionally increase the variance of individual trees and slightly increase their bias (simpler trees); the payoff is a lower correlation ‘\(\rho\)’ across the ensemble.

Standard heuristics (see the sketch after this list):

  • Classification: \(m = \sqrt{d}\)
  • Regression: \(m = \frac{d}{3}\)
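
A minimal scikit-learn sketch of these heuristics; the synthetic datasets and d = 12 features are illustrative assumptions. In scikit-learn, max_features="sqrt" gives \(m=\sqrt{d}\) and a float gives m as a fraction of d.

```python
# Minimal sketch of feature sub-sampling via scikit-learn's max_features.
# The synthetic data and d = 12 features are illustrative assumptions.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X_clf, y_clf = make_classification(n_samples=500, n_features=12, random_state=0)
X_reg, y_reg = make_regression(n_samples=500, n_features=12, random_state=0)

# Classification heuristic: m = sqrt(d)
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
clf.fit(X_clf, y_clf)

# Regression heuristic: m = d / 3 (a float is interpreted as a fraction of d)
reg = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
reg.fit(X_reg, y_reg)
```
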
Math of De-Correlation

💡Because the \(\frac{1-\rho}{B}\sigma^{2}\) term vanishes as B grows, ‘\(\rho\)’ is the dominant factor in the ensemble variance when B is large; by lowering ‘\(\rho\)’, the overall ensemble variance \(Var(f_{rf})\) drops significantly below that of standard Bagging.

\[Var(f_{rf})=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}\]
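
As an illustrative plug-in (the values \(\rho=0.9\), \(\rho=0.3\), \(\sigma^{2}=1\) and \(B=500\) are assumptions, not figures from the text), lowering the correlation moves the variance floor down:

\[\underbrace{0.9+\tfrac{1-0.9}{500}\approx 0.900}_{\text{Bagging-like, }\rho=0.9}\quad\text{vs}\quad\underbrace{0.3+\tfrac{1-0.3}{500}\approx 0.301}_{\text{RF-like, }\rho=0.3}\]
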
Over-Fitting

💡A Random Forest will never overfit by adding more trees (B).

The ensemble variance only converges to its limit, ‘\(\rho\sigma^{2}\)’.

Overfitting is controlled by:

  • depth of the individual trees.
  • size of the feature subset ‘m’ (see the sketch after this list).
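
A minimal sketch of these controls with scikit-learn's RandomForestClassifier; the specific values (depth 8, 500 trees, etc.) and the synthetic data are illustrative assumptions.

```python
# Minimal sketch: over-fitting is controlled by tree depth and feature-subset
# size, not by the number of trees. All hyper-parameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # more trees only costs compute; it does not overfit
    max_depth=8,          # limits the depth of each individual tree
    max_features="sqrt",  # size of the feature subset 'm'
    min_samples_leaf=5,   # extra regularization at the leaves
    random_state=0,
)
rf.fit(X, y)
```
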
When to use Random Forest?

  • High Dimensionality: 100s or 1000s of features; RF’s feature sampling prevents a few features from masking others.
  • Tabular Data (with Complex Interactions): Captures non-linear relationships without needing manual feature engineering.
  • Noisy Datasets: The averaging process makes RF robust to outliers (especially if using min_samples_leaf).
  • Automatic Validation: When you need a quick estimate of generalization error without running 10-fold CV, use the OOB Error (see the sketch below).
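
A minimal sketch of the OOB estimate with scikit-learn; the synthetic data and parameter values are illustrative assumptions.

```python
# Minimal sketch: out-of-bag (OOB) score as a quick generalization estimate,
# computed from the samples each tree did not see during bootstrapping.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)  # no separate 10-fold CV loop needed
```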