Data Leakage
⭐️ Occurs when information ℹ️ NOT available at inference time is used during training 🏃‍♂️,
leading to overly optimistic training and validation scores, but poor real‑world 🌎 performance.
Target Leakage
⭐️ Including features that are only available after the event we are trying to predict.
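- e.g. Including number_of_late_payments in a model that predicts whether a person applying for a bank loan 💵 will default. Late payments on the loan are only recorded after it has been granted, so the feature does not exist at application time.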
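One way to guard against target leakage is to record when each feature becomes known and keep only those available at prediction time. The sketch below is illustrative only: the availability labels and every column name other than number_of_late_payments are hypothetical.

```python
# Hypothetical feature catalogue: when each column becomes known relative
# to the loan decision. Only number_of_late_payments comes from the text;
# the other names and the labels are made up for illustration.
FEATURE_AVAILABILITY = {
    "income": "at_application",
    "credit_history_length": "at_application",
    "number_of_late_payments": "after_outcome",
}


def usable_features(availability):
    """Keep only the features that are known at prediction time."""
    return [name for name, when in availability.items() if when == "at_application"]


features = usable_features(FEATURE_AVAILABILITY)
print(features)  # number_of_late_payments is excluded
```

In practice the availability labels would come from domain knowledge about how the data is collected, not from the data itself.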
Temporal Leakage
⭐️ Using future data to predict the past, e.g. training on records dated after the ones the model is evaluated on.
- Fix: Use Time-Series ⏰ Cross-Validation (Walk-forward validation) instead of random shuffling.
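A minimal sketch of walk-forward validation, assuming scikit-learn is installed and the rows are already sorted by time (oldest first). The toy array X stands in for real data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 8 observations, assumed to be in chronological order.
X = np.arange(8).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(X)
]

for train_idx, test_idx in folds:
    # Every training index precedes every test index,
    # so the model never sees data from the future.
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on a growing window of past observations and tests on the block that follows it, which a random shuffle cannot guarantee.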
Train-Test Contamination
⭐️ Fitting preprocessing (like StandardScaler or mean imputation) on the entire dataset before splitting, so statistics from the validation and test rows leak into training.
- Fix: Compute mean, variance, etc. only on the training 🏃‍♂️ data and reuse those same values to transform the validation and test data.
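The fix can be sketched as follows, assuming scikit-learn and NumPy are available. The random matrix X is synthetic and only there to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first, so the test rows never influence the fitted statistics.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Mean and variance are computed from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# The test rows are transformed with the training statistics, with no refit.
X_test_scaled = scaler.transform(X_test)
```

When cross-validating, wrapping the scaler and the model in a sklearn.pipeline.Pipeline should give the same guarantee per fold, since the pipeline is refit on each training split.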