Cross Validation
2 minute read
Do not trust one split of the data; validate across many splits, and average the result to reduce randomness and bias.
Note: Two different splits of the same dataset can give very different validation scores.
Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.
It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.
The two most common variants are:
- K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV)
K-Fold Cross-Validation
- Shuffle the dataset randomly (skip this for time-series ⏳ data, where order matters).
- Split the data into k equal subsets (folds).
- Iterate through each unique fold, using it as the validation set.
- Use the remaining k-1 folds for training 🏃♂️.
- Average the results across all k iterations.
Note: Common choices are k = 5 or k = 10.
(T = training fold, V = validation fold)
- Iteration 1: [V][T][T][T][T]
- Iteration 2: [T][V][T][T][T]
- Iteration 3: [T][T][V][T][T]
- Iteration 4: [T][T][T][V][T]
- Iteration 5: [T][T][T][T][V]
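The steps above can be sketched in plain Python. The `k_fold_indices` helper below is illustrative, not a library function; in practice a library such as scikit-learn provides equivalent splitters.

```python
import random

def k_fold_indices(n, k, shuffle=True, seed=42):
    """Split indices 0..n-1 into k folds; yield (train, validation) index pairs."""
    idx = list(range(n))
    if shuffle:  # skip shuffling for time-series data
        random.Random(seed).shuffle(idx)
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(idx[start:start + size])
        start += size
    # Each fold takes one turn as the validation set.
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

# 10 data points, k = 5: every point lands in exactly one validation fold.
splits = list(k_fold_indices(10, k=5))
```

Each of the 5 iterations trains on 8 points and validates on the remaining 2; averaging a model's score over these 5 runs gives the cross-validated estimate.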
Leave-One-Out Cross-Validation (LOOCV)
The model is trained 🏃♂️ on all data points except one and then tested 🧪 on that single remaining observation. LOOCV is the extreme case of k-fold cross-validation where k = n (the number of data points).
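A minimal sketch of LOOCV splitting, using an illustrative `loocv_splits` helper (not a library API):

```python
def loocv_splits(n):
    """LOOCV: each data point serves as the validation set exactly once (k = n)."""
    for i in range(n):
        train = [j for j in range(n) if j != i]  # all points except the held-out one
        yield train, [i]

# 5 data points -> 5 train/validation pairs, one per held-out point.
splits = list(loocv_splits(5))
```

Note that the model must be refit n times, which is why LOOCV becomes impractical as the dataset grows.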
Pros:
- Useful for small (< 1,000 observations) datasets.
Cons:
- Computationally 💻 expensive 💰, since the model must be trained n times.