Data Splitting
Why Is Data Splitting Required?
To avoid over-fitting (memorizing the training data), so that the model generalizes well, improving its performance on unseen data.
Train/Validation/Test Split
- Training Data: Learn model parameters (Textbook + Practice problems)
- Validation Data: Tune hyper-parameters (Mock tests)
- Test Data: Evaluate model performance (Real (final) exam)
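The three-way split above can be sketched in a few lines of plain Python (the helper name and ratios are illustrative; in practice a library routine such as scikit-learn's `train_test_split` is more robust):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve out test, validation, and training subsets."""
    rng = random.Random(seed)              # fixed seed for a reproducible split
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = round(len(data) * test_frac)
    n_val = round(len(data) * val_frac)
    test = [data[i] for i in indices[:n_test]]
    val = [data[i] for i in indices[n_test:n_test + n_val]]
    train = [data[i] for i in indices[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```

The three subsets are disjoint and together cover the whole dataset, mirroring the 70/15/15 ratio discussed below.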
Data Leakage
Data leakage occurs when information from the validation or test set is inadvertently used to train 🏃♂️ the model.
The model ‘cheats’ by learning to exploit information it should not have access to, resulting in artificially inflated performance metrics during testing 🧪.
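A common, easy-to-miss form of leakage is preprocessing: computing normalization statistics on the full dataset before splitting, so test-set information seeps into training. A minimal sketch (`mean_std` is a hypothetical helper, not a library function):

```python
def mean_std(xs):
    """Return the mean and standard deviation of a list of numbers."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, var ** 0.5

data = [float(i) for i in range(100)]
train, test = data[:80], data[80:]

# Leaky: statistics computed on ALL data, including points the model
# should never see during training.
leaky_mean, leaky_std = mean_std(data)

# Correct: fit the scaler on the training split only, then apply the
# same statistics to both splits.
m, s = mean_std(train)
train_scaled = [(x - m) / s for x in train]
test_scaled = [(x - m) / s for x in test]
```

The two means differ (49.5 vs 39.5 here), which is exactly the test-set information the leaky version smuggles into training.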
Typical Split Ratios
- Small datasets (1K–100K): 60/20/20, 70/15/15 or 80/10/10
- Large datasets (>1M): 98/1/1 would suffice, as 1% of 1M is still 10K.
Note: There is no fixed rule; it's largely trial and error.
Imbalanced Data
Imbalanced data refers to a dataset where the target classes are represented by an unequal or
highly skewed distribution of samples, such that the majority class significantly outnumbers the minority class.
Stratified Sampling
If there is class imbalance in the dataset (e.g., 95% class A, 5% class B), a random split might result in the validation set having 99% class A.
Solution: Use stratified sampling to ensure class proportions are maintained across all splits (train🏃♂️/validation📋/test🧪).
Note: Non-negotiable for imbalanced data.
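A minimal sketch of stratified sampling in plain Python: shuffle and split each class separately so every split keeps the original class proportions (the helper name is illustrative; scikit-learn's `train_test_split(..., stratify=y)` does this in one call):

```python
import random
from collections import defaultdict

def stratified_split(y, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)       # label -> list of sample indices
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)              # shuffle WITHIN each class only
        n_test = round(len(idxs) * test_frac)
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return train_idx, test_idx

# 95% class A, 5% class B, as in the example above
y = ["A"] * 950 + ["B"] * 50
train_idx, test_idx = stratified_split(y)
```

With a 20% test fraction, the test set gets exactly 190 A and 10 B samples, the same 95/5 ratio as the full dataset.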
Time-Series ⏳ Data
- In time-series ⏰ data, split the data chronologically, not randomly, i.e., the training data time ⏰ should precede the validation data time ⏰.
- We always train 🏃♂️ on past data to predict future data.
Golden rule: Never look 👀 into the future.
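A chronological split is just a cut point with no shuffling, as in this minimal sketch (the helper name and data are illustrative):

```python
def chronological_split(series, train_frac=0.8):
    """Train on the earliest points, validate on the later ones; never shuffle."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

# (timestamp, value) pairs, already sorted by time
prices = [(day, 100 + day) for day in range(10)]
train, val = chronological_split(prices)

# every training timestamp precedes every validation timestamp
assert max(t for t, _ in train) < min(t for t, _ in val)
```

For cross-validation on time series, scikit-learn's `TimeSeriesSplit` applies the same idea across multiple expanding windows.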