Cross Validation
2 minute read
Do not trust one split of the data; validate across many splits, and average the result to reduce randomness and bias.
Note: Two different splits of the same dataset can give very different validation scores.
Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.
It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.
The two most common variants are:
- K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV)
K-Fold Cross-Validation
- Shuffle the dataset randomly (skip this for time-series ⏳ data, where order matters).
- Split the data into k equal subsets (folds).
- Iterate through each unique fold, using it as the validation set.
- Use the remaining k-1 folds for training 🏃♂️.
- Average the results across all k iterations.
Note: Common choices are k = 5 or k = 10.
(T = training fold, V = validation fold)
- Iteration 1: [V][T][T][T][T]
- Iteration 2: [T][V][T][T][T]
- Iteration 3: [T][T][V][T][T]
- Iteration 4: [T][T][T][V][T]
- Iteration 5: [T][T][T][T][V]
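The steps above can be sketched in plain Python. The `k_fold_indices` helper below is illustrative, not a library function; in practice a library such as scikit-learn provides equivalent splitters.

```python
import random

def k_fold_indices(n, k, shuffle=True, seed=42):
    """Split indices 0..n-1 into k folds; yield (train, validation) index pairs."""
    idx = list(range(n))
    if shuffle:  # skip shuffling for time-series data
        random.Random(seed).shuffle(idx)
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(idx[start:start + size])
        start += size
    # Each fold takes one turn as the validation set.
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

# 10 data points, k = 5: every point lands in exactly one validation fold.
splits = list(k_fold_indices(10, k=5))
```

Each of the 5 iterations trains on 8 points and validates on the remaining 2; averaging a model's score over these 5 runs gives the cross-validated estimate.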
Leave-One-Out Cross-Validation (LOOCV)
The model is trained 🏃♂️ on all data points except one and then tested 🧪 on that single remaining observation. LOOCV is the extreme case of k-fold cross-validation where k = n (the number of data points).
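A minimal sketch of LOOCV splitting, using an illustrative `loocv_splits` helper (not a library API):

```python
def loocv_splits(n):
    """LOOCV: each data point serves as the validation set exactly once (k = n)."""
    for i in range(n):
        train = [j for j in range(n) if j != i]  # all points except the held-out one
        yield train, [i]

# 5 data points -> 5 train/validation pairs, one per held-out point.
splits = list(loocv_splits(5))
```

Note that the model must be refit n times, which is why LOOCV becomes impractical as the dataset grows.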
Pros:
- Useful for small (< 1,000 observations) datasets.
Cons:
- Computationally 💻 expensive 💰, since the model must be trained n times.