Cross Validation


Core Idea 💡

Do not trust a single split of the data; validate across many splits and average the results to reduce the variance of the performance estimate.

Note: Two different splits of the same dataset can give very different validation scores.

Cross-validation

Cross-validation is a statistical resampling technique used to evaluate how well a machine learning model generalizes to an independent, unseen dataset.

It works by systematically partitioning the available data into multiple subsets, or ‘folds’, and then training and testing the model on different combinations of these folds.

  • K-Fold Cross-Validation
  • Leave-One-Out Cross-Validation (LOOCV)
K-Fold Cross-Validation
  1. Shuffle the dataset randomly (except for time series ⏳, where order must be preserved).
  2. Split the data into k equal subsets (folds).
  3. Iterate through the folds, using each one in turn as the validation set.
  4. Train 🏃‍♂️ on the remaining k-1 folds.
  5. Average the results across all k iterations.

Note: Common choices are k = 5 or k = 10.
  • Iteration 1: [V][T][T][T][T]
  • Iteration 2: [T][V][T][T][T]
  • Iteration 3: [T][T][V][T][T]
  • Iteration 4: [T][T][T][V][T]
  • Iteration 5: [T][T][T][T][V]
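The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the helper names (`k_fold_indices`, `cross_validate`, `train_and_score`) are invented for this example. In practice you would typically use `sklearn.model_selection.KFold`.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Step 1-2: shuffle the indices, then split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # skip shuffling for time series
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_and_score):
    """Steps 3-5: each fold is the validation set exactly once."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                          # step 3
        train_idx = [j for m, f in enumerate(folds) if m != i       # step 4:
                     for j in f]                                    # the other k-1 folds
        scores.append(train_and_score(train_idx, val_idx))
    return sum(scores) / k                                          # step 5: average
```

`train_and_score` stands in for "fit the model on the training indices, return its score on the validation indices"; any callable with that shape plugs in.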
Leave-One-Out Cross-Validation (LOOCV)

The model is trained 🏃‍♂️ on all data points except one, and then tested 🧪 on that single held-out observation.

LOOCV is the extreme case of k-fold cross-validation where k = n (the number of data points).

  • Pros:
    Uses nearly all the data for training, which makes it useful for small datasets (e.g., fewer than ~1000 samples).

  • Cons:
    Computationally 💻 expensive 💰: the model must be fitted n times.
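A minimal LOOCV sketch, again with invented helper names (`loocv_scores`, `train_and_score`): each of the n points is held out exactly once, so the loop below fits n models, which is exactly why the method gets expensive as n grows.

```python
def loocv_scores(data, train_and_score):
    """Leave-one-out CV: k = n, so every point is the test set once."""
    n = len(data)
    scores = []
    for i in range(n):
        test_idx = [i]                                  # the single held-out point
        train_idx = [j for j in range(n) if j != i]     # the remaining n-1 points
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / n                              # average over n fits

# Toy usage: "model" = predict the mean of the training points,
# score = squared error on the held-out point.
data = [1.0, 2.0, 3.0, 4.0]

def mean_sq_err(train_idx, test_idx):
    pred = sum(data[j] for j in train_idx) / len(train_idx)
    return (data[test_idx[0]] - pred) ** 2

loocv_error = loocv_scores(data, mean_sq_err)
```

Note that `loocv_scores(data, f)` behaves identically to `cross_validate` with k = n, which is the sense in which LOOCV is a special case of k-fold.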



End of Section