Lab 14: Model Evaluation & CV

Objective

Implement robust model evaluation from scratch: train/val/test split, K-fold and stratified K-fold cross-validation, evaluation metrics (accuracy, precision, recall, F1, AUC-ROC, RMSE, R²), confusion matrix, learning curves, and bias-variance tradeoff analysis — benchmarking multiple models on Surface product tier prediction.

Background

A model that scores 100% on training data but 50% on unseen data has overfit — it memorised the training set. Proper evaluation requires holding out data the model never sees during training. Cross-validation gives a more reliable estimate by averaging performance across K different train/test splits. Metrics beyond accuracy matter: in imbalanced datasets, a model that always predicts the majority class can have 95% accuracy while being completely useless.
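The holdout idea above can be sketched in a few lines. This is a minimal illustration, not the lab's reference implementation; the function name and split ratios are assumptions for the example.

```python
import numpy as np

def train_val_test_split(X, y, val=0.2, test=0.2, seed=0):
    """Shuffle once, then carve off held-out validation and test partitions.

    The model trains only on the train partition; val guides tuning,
    and test is touched exactly once, at the very end.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order, reproducible via seed
    n_test = int(len(X) * test)
    n_val = int(len(X) * val)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])
```

With the default ratios, a dataset of 10 rows yields a 6/2/2 split with no overlap between partitions.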

Time

35 minutes

Prerequisites

  • Lab 02 (Logistic Regression) — classification metrics

  • Lab 05 (Decision Trees) — the model we evaluate

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 Cross-validation variance tells you about dataset size. Wide confidence intervals (e.g., 0.820±0.150) mean your dataset is too small for reliable evaluation — performance estimates swing wildly depending on which 20% ended up in the test set. The fix: more data, or nested cross-validation. Rule of thumb: you need at least 50 samples per class for K-fold to give stable estimates. With fewer than 30 total samples, leave-one-out cross-validation (K=n) is more appropriate.
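The mean±std numbers the tip refers to come from scoring the model on each of the K folds. A minimal sketch of that loop, assuming a caller-supplied `model_fit_score` function (the helper names here are illustrative, not part of the lab scaffold):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)         # k near-equal chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_scores(model_fit_score, X, y, k=5):
    """model_fit_score(X_tr, y_tr, X_te, y_te) -> float score for one fold.

    Returns (mean, std) across folds; a large std is the wide
    confidence interval the tip above warns about.
    """
    scores = [model_fit_score(X[tr], y[tr], X[te], y[te])
              for tr, te in kfold_indices(len(X), k)]
    return float(np.mean(scores)), float(np.std(scores))
```

Each sample appears in exactly one test fold, so every point is used for evaluation exactly once and for training K−1 times.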

📸 Verified Output:


Summary

| Metric    | Formula            | When to use               |
|-----------|--------------------|---------------------------|
| Accuracy  | correct/total      | Balanced classes          |
| Precision | TP/(TP+FP)         | Minimise false positives  |
| Recall    | TP/(TP+FN)         | Minimise false negatives  |
| F1        | 2PR/(P+R)          | Imbalanced classes        |
| R²        | 1 − SS_res/SS_tot  | Regression quality        |
| RMSE      | √(Σ(y−ŷ)²/n)       | Regression error magnitude |
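The formulas in the summary translate directly into code. A hedged sketch (pure-Python, binary labels with 1 as the positive class; function names are this example's, not the lab's):

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 from the four confusion-matrix cells."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0   # TP/(TP+FP)
    rec = tp / (tp + fn) if tp + fn else 0.0    # TP/(TP+FN)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

def rmse(y_true, y_pred):
    """√(Σ(y−ŷ)²/n)"""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 − SS_res/SS_tot"""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Note the zero-denominator guards: precision is undefined when the model never predicts the positive class, and conventionally reported as 0 in that case.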
