CatBoost

CatBoost (Categorical Boosting)
⭐️ Developed by Yandex, this gradient boosting algorithm is optimized for handling categorical features directly, without requiring extensive preprocessing (such as one-hot encoding).
Algorithmic Optimizations
🔵 Ordered Target Encoding
🔵 Symmetric (Oblivious) Trees
🔵 Handling Missing Values
Ordered Target Encoding
  • ❌ Standard target encoding can lead to target leakage: each row’s encoded value is computed from a statistic that includes that row’s own label, which is information the model will not have at inference time.
    👉 (The model ‘cheats’ by using a row’s own label to predict itself.)
  • ✅ CatBoost draws a random permutation of the training data and computes the target statistic (average target value) for each row’s category using only the rows that come before it in that permutation.
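The idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not CatBoost's internal code: the function name `ordered_target_stats`, the `prior` smoothing value, and the single fixed permutation are all assumptions made for the example. The point it shows is that a row's encoding never depends on that row's own label.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each row from the rows that precede it in a random permutation.

    `prior` is a hypothetical smoothing value used when a category has
    little or no history yet.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the random permutation

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Only earlier rows contribute; this row's own target is not used.
        encoded[idx] = (s + prior) / (c + 1)
        # Now add this row to the history for later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded, order

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc, order = ordered_target_stats(cats, ys)
```

The first row in the permutation has no history, so it falls back to the prior. Flipping any single row's label changes the encodings of later rows in the same category, but never its own.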
Symmetric (Oblivious) Trees
  • 🦋 Uses symmetric decision trees by default.
    👉 In a symmetric tree, every node at a given depth uses the same split condition (same feature, same threshold), so a tree of depth d has exactly d conditions and 2^d leaves.

  • 🦘 Does not walk down the tree using ‘if-else’ logic. Instead, it evaluates all the decision conditions, packs the results into a binary index (e.g., 101), and jumps directly to that leaf 🍃 in memory 🧠.

    images/machine_learning/supervised/decision_trees/catboost/slide_06_01.png
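The leaf lookup described above can be sketched as follows. This is an illustrative toy, not CatBoost's implementation: the names `leaf_index` and `predict`, the `(feature_index, threshold)` split format, and the choice to treat the first level as the most significant bit are assumptions for the example.

```python
def leaf_index(x, splits):
    """Build the leaf index of an oblivious tree.

    `splits` holds one (feature_index, threshold) pair per tree level.
    Each level contributes one bit: 1 if the condition holds, else 0.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit  # append the bit on the right
    return index

def predict(x, splits, leaf_values):
    # A depth-d tree stores 2**d leaf values in a flat list.
    return leaf_values[leaf_index(x, splits)]

# Depth-3 tree: three conditions, eight leaves.
splits = [(0, 0.5), (1, 10.0), (2, 3.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

x = [0.9, 4.0, 7.5]           # conditions evaluate to 1, 0, 1
idx = leaf_index(x, splits)   # binary 101
value = predict(x, splits, leaf_values)
```

Because every level shares one condition, the conditions can be evaluated independently of each other, which is what makes the direct array lookup possible.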
Handling Missing Values
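  • ⚙️ CatBoost has built-in handling of missing values in numerical features, which often require manual preprocessing in other GBDT libraries.

  • 💡 By default (‘nan_mode’ set to ‘Min’), a ‘NaN’ in a numerical feature is treated as smaller than every observed value, so a split can separate missing values from the rest. This reduces the need for imputation. For categorical features, a missing value has to be supplied as its own category value (for example, a placeholder string).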

    images/machine_learning/supervised/decision_trees/catboost/slide_08_01.png
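To make the effect on a numerical split concrete, here is a small sketch of routing a value at a threshold when missing values are treated as the smallest or the largest possible value. The function `goes_right` is hypothetical and written only for illustration; the mode names mirror CatBoost's `nan_mode` options (`Min`, `Max`, `Forbidden`), but the code does not call the library.

```python
import math

def goes_right(value, threshold, nan_mode="Min"):
    """Return True if `value` is routed to the 'greater than' side of a split."""
    if isinstance(value, float) and math.isnan(value):
        if nan_mode == "Min":
            return False  # missing behaves like a value below every threshold
        if nan_mode == "Max":
            return True   # missing behaves like a value above every threshold
        raise ValueError("missing values are not allowed in this mode")
    return value > threshold

column = [3.0, float("nan"), 8.0, 1.5]
routes_min = [goes_right(v, 2.0, "Min") for v in column]
routes_max = [goes_right(v, 2.0, "Max") for v in column]
```

With this convention no imputed value is needed: the missing row always lands on a well-defined side of every split.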


