Data Leakage

⭐️ Occurs when information ℹ️ that is NOT available at inference time is used during training 🏃‍♂️, leading to strong training and validation performance but poor real‑world 🌎 performance.

Target Leakage

⭐️ Including features that are only available after the event we are trying to predict.

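e.g. Including number_of_late_payments in a model that predicts whether a person applying for a bank loan 💵 will default. Late payments on the loan are recorded only after it has been granted, so this value is not known at application time.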
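A minimal sketch of one way to guard against this: tag each feature with when it becomes known and keep only those known before the decision. The catalogue, the tags, and every feature name other than number_of_late_payments are hypothetical, and number_of_late_payments is assumed to count late payments on the loan being applied for.

```python
# Hypothetical feature catalogue: when each column becomes known
# relative to the loan decision (all names and tags are assumptions).
FEATURE_AVAILABILITY = {
    "income": "before_decision",
    "credit_history_length": "before_decision",
    # Assumed to count late payments on this loan, so it only
    # exists after the loan has been granted.
    "number_of_late_payments": "after_decision",
}


def safe_features(availability):
    """Return the features that are known at prediction time."""
    return sorted(
        name
        for name, known_when in availability.items()
        if known_when == "before_decision"
    )


features = safe_features(FEATURE_AVAILABILITY)
print(features)
```

Keeping the availability tags next to the feature definitions makes the check easy to re-run whenever a new column is added.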

Temporal Leakage

⭐️ Using future data to predict the past.

Fix: Use Time-Series ⏰ Cross-Validation (Walk-forward validation) instead of random shuffling, so that every training fold comes strictly before its validation fold in time.
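Walk-forward validation can be sketched with plain index arithmetic, assuming the rows are already sorted by time. The helper name walk_forward_splits and its parameters are illustrative, not part of any library. scikit-learn's TimeSeriesSplit provides comparable behavior.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Assumes samples are ordered by time, oldest first.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))  # everything before the test fold
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains only on indices earlier than the ones it is evaluated on, which a random shuffle does not guarantee.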

Train-Test Contamination

⭐️ Fitting preprocessing (like a global StandardScaler or mean imputation) on the entire dataset before splitting, so that statistics from the validation and test data leak into training.

Fix: Compute mean, variance, etc. only on the training 🏃‍♂️ data and reuse those same statistics to transform the validation and test data.
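The fix can be illustrated with a small standard-library sketch: split first, estimate the statistics on the training part only, then apply them unchanged to the held-out part. The helpers fit_scaler and transform and the toy values are assumptions for illustration. With scikit-learn, the usual equivalent is to fit the scaler on the training data only.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate mean and (population) standard deviation on training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Toy data (made up): split FIRST, then fit on the training part.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

mu, sigma = fit_scaler(train)              # the test value 100.0 never influences mu or sigma
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)   # reuse the training statistics
print(mu, sigma, test_scaled)
```

Had the statistics been computed on all five values, the outlier in the test set would have shifted the mean and variance used to scale the training data.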
