Data Leakage

⭐️ Occurs when information ℹ️ that is NOT available at inference time is used during training 🏃‍♂️, leading to overly optimistic training and validation scores but poor real‑world 🌎 performance.
Target Leakage

⭐️ Including features that are only available after the event we are trying to predict.

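  • e.g. Including number_of_late_payments in a model that predicts whether a person applying for a bank loan 💵 will default. Late payments on the loan are only recorded after it has been granted, so this value does not exist at application time.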
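
One way to guard against target leakage is to record when each feature becomes known and keep only those available at prediction time. The sketch below assumes a hand-maintained catalogue; the dictionary `FEATURE_AVAILABILITY`, its labels, and every column name other than number_of_late_payments are hypothetical and chosen only for illustration.

```python
# Hypothetical feature catalogue: when each column becomes known.
FEATURE_AVAILABILITY = {
    "income": "at_application",
    "credit_score": "at_application",
    "loan_amount": "at_application",
    # Assumed to be recorded only after the loan has been issued.
    "number_of_late_payments": "after_outcome",
}


def select_safe_features(availability, allowed="at_application"):
    """Return the feature names known at prediction time, sorted by name."""
    return sorted(name for name, when in availability.items() if when == allowed)


safe_features = select_safe_features(FEATURE_AVAILABILITY)
print(safe_features)
```

The leaking column is dropped before any model sees it, so the check does not depend on spotting a suspiciously high validation score afterwards.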
Temporal Leakage

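⭐️ Using future data to predict the past, e.g. when a randomly shuffled split puts later observations in the training set and earlier ones in the test set.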

  • Fix: Use Time-Series ⏰ Cross-Validation (Walk-forward validation) instead of random shuffling.
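
Walk-forward validation can be sketched in plain Python as an expanding training window followed by a fixed-size test window. The helper `walk_forward_splits` and its parameters are hypothetical names, and the samples are assumed to be sorted oldest first. scikit-learn's `TimeSeriesSplit` provides a comparable splitter.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Assumes samples are ordered by time, oldest first.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every training index is strictly earlier than every test index in the same fold, which is the property that random shuffling breaks.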
Train-Test Contamination

⭐️ Fitting preprocessing (like a global StandardScaler or mean imputation) on the entire dataset before splitting, so statistics from the validation and test data leak into training.

  • Fix: Compute mean, variance, etc. only on the training 🏃‍♂️ data and reuse those same statistics to transform the validation and test data.
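
The fix can be illustrated with a minimal standard-library sketch of standardization. The helpers `fit_scaler` and `transform` and the toy data are hypothetical. In scikit-learn, the same effect is usually achieved by fitting `StandardScaler` on the training split only, or by wrapping it in a `Pipeline`.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy feature; the last value is an outlier
train, test = values[:4], values[4:]

mu, sigma = fit_scaler(train)              # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)   # test reuses the train statistics

leaky_mu = mean(values)  # what a "global" fit would have used instead
print(mu, leaky_mu)
```

In this toy data the test outlier shifts the global mean far from the training mean, which shows how statistics fitted on the full dataset carry test information into training.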


