Lab 05: Model Evaluation & Metrics

Objective

Learn how to evaluate ML models properly. Accuracy alone is rarely the right metric. By the end you will know how to choose the right metric, implement cross-validation, plot ROC curves, and diagnose bias vs. variance.

Time: 45 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

A model that predicts "no attack" 100% of the time achieves 99.9% accuracy on a dataset where attacks are 0.1% of traffic. That model is useless. Accuracy hides the truth on imbalanced datasets.

Dataset: 950 benign, 50 attacks  (imbalanced)

Naive model (always predict "benign"):
  Accuracy = 950/1000 = 95% ← looks great!
  Recall   = 0/50     =  0% ← detects ZERO attacks
  Useless.
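The naive baseline above takes only a few lines to verify. This sketch builds the 950/50 dataset directly and scores the always-"benign" predictor with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 950 benign (label 0), 50 attacks (label 1) -- the dataset above
y_true = np.array([0] * 950 + [1] * 50)

# Naive model: always predict "benign"
y_pred = np.zeros(1000, dtype=int)

acc = accuracy_score(y_true, y_pred)  # 950/1000 correct
rec = recall_score(y_true, y_pred)    # 0/50 attacks detected
print(f"Accuracy: {acc:.2%}  Recall: {rec:.2%}")
```

High accuracy, zero recall: the single number you report determines whether this model looks excellent or useless.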

Step 1: Environment Setup

On the host, start the container:

docker run -it --rm zchencow/innozverse-ai:latest bash

Inside the container, launch python3 and run:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    confusion_matrix, classification_report)
from sklearn.model_selection import (cross_val_score, StratifiedKFold,
    learning_curve, validation_curve)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
import warnings; warnings.filterwarnings('ignore')
print("Ready")

📸 Verified Output:


Step 2: The Confusion Matrix — Foundation of All Metrics

📸 Verified Output:

💡 Precision vs Recall tradeoff: Raising the classification threshold → more precision (fewer false alarms) but lower recall (miss more real attacks). Choose based on cost: is a missed attack worse than a false alarm?
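The lab's exact dataset isn't reproduced here, so as a sketch this uses make_classification with a 95/5 class split as a stand-in for the traffic data (the 95% weight and random_state are assumptions, not the lab's values). The confusion matrix is the raw count table every other metric is derived from:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~95% benign, ~5% attacks
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Layout: rows = actual class, columns = predicted class
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(f"Precision: {precision_score(y_te, y_pred):.3f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_te, y_pred):.3f}")      # TP / (TP + FN)
```

Reading precision and recall straight off the matrix cells is a good habit: it keeps the threshold tradeoff concrete instead of abstract.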


Step 3: ROC Curve and AUC

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every possible classification threshold:

📸 Verified Output:

💡 ROC-AUC is threshold-independent: it measures the model's ability to rank positives above negatives regardless of where you set the cutoff. On heavily imbalanced data, though, ROC-AUC can look deceptively good; report average precision (PR-AUC) alongside it there.
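A minimal sketch of computing the curve and both AUC flavours, again on the assumed 95/5 synthetic stand-in rather than the lab's own data. The key point: ROC needs ranked scores from predict_proba, not hard labels from predict:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Continuous scores, not 0/1 labels: ROC evaluates the ranking
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
print(f"ROC-AUC:                {roc_auc_score(y_te, scores):.3f}")
print(f"Avg precision (PR-AUC): {average_precision_score(y_te, scores):.3f}")
# To plot: matplotlib's plt.plot(fpr, tpr)
```

Comparing the two numbers on the same model is instructive: on imbalanced data PR-AUC is usually noticeably lower, which is exactly the honesty you want.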


Step 4: Stratified K-Fold Cross-Validation

📸 Verified Output:

💡 The gap between train accuracy (0.9947) and validation accuracy (0.9213) indicates mild overfitting. A large train-validation gap means overfitting; low, roughly equal train and validation scores mean underfitting.
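The fold setup can be sketched as follows (the synthetic 95/5 dataset and random_state are assumptions standing in for the lab's data). StratifiedKFold preserves the class ratio inside every fold, which matters when positives are scarce; cross_validate collects several metrics at once, plus the train scores needed for the overfitting check above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)

# Stratification keeps the ~95/5 class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
res = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=cv,
                     scoring=["accuracy", "recall", "f1"],
                     return_train_score=True)

print(f"train acc:  {res['train_accuracy'].mean():.4f}")
print(f"val acc:    {res['test_accuracy'].mean():.4f}")
print(f"val recall: {res['test_recall'].mean():.4f}")
```

Note that recall per fold would be meaningless without stratification: an unlucky plain K-fold split can leave a fold with almost no positives at all.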


Step 5: Bias-Variance Tradeoff — Learning Curves

📸 Verified Output:

💡 The gap closes as training size increases. If more data doesn't help (gap stays large), you need a simpler model. If validation score is low even with lots of data, you need a more complex model.
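The diagnosis above comes from sklearn's learning_curve, which retrains the model on progressively larger slices of the data. This sketch (synthetic stand-in data, assumed parameters) prints the train/validation gap at each size instead of plotting it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)

# Retrain at 10%..100% of the training data, score with F1
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train F1={tr:.3f}  val F1={va:.3f}  gap={tr - va:.3f}")
```

A shrinking gap column as n grows is the "more data helps" signature; a flat, large gap is the cue to simplify the model.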


Step 6: Classification Threshold Tuning

📸 Verified Output:

💡 For a security system where missing an attack is catastrophic, choose threshold 0.2–0.3 (high recall, accept more false alarms). For a system where false alarms are costly, choose 0.7+ (high precision).
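Threshold tuning is just applying your own cutoff to predict_proba scores instead of accepting the 0.5 default baked into predict. A sketch on the assumed synthetic data, sweeping the thresholds the tip mentions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# .predict() hard-codes a 0.5 cutoff; sweep it yourself instead
for t in (0.2, 0.3, 0.5, 0.7):
    y_pred = (scores >= t).astype(int)
    p = precision_score(y_te, y_pred, zero_division=0)
    r = recall_score(y_te, y_pred)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Recall can only fall as the threshold rises, and precision generally climbs; the table this prints is the raw material for the cost-based choice above.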


Step 7: Multiclass Evaluation

📸 Verified Output:
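The lab's multiclass data isn't shown, so this sketch uses a hypothetical three-class dataset (think benign / scan / exploit; the class names and parameters are assumptions). The point to internalize is per-class metrics plus the difference between macro (unweighted) and weighted averaging:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical 3-class stand-in dataset
X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
y_pred = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)

# Per-class precision/recall/F1, plus macro and weighted averages
print(classification_report(y_te, y_pred, digits=3))
print(f"macro F1:    {f1_score(y_te, y_pred, average='macro'):.3f}")
print(f"weighted F1: {f1_score(y_te, y_pred, average='weighted'):.3f}")
```

Macro averaging treats every class equally, so a rare attack class dragging its F1 down is visible; weighted averaging lets the majority class mask it, the multiclass version of the accuracy trap.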


Step 8: Real-World Capstone — Security Alert Triage System

📸 Verified Output:

💡 Business impact analysis converts model metrics into money — the language stakeholders actually care about. A 5% recall improvement (catching 5 more attacks) saves £50,000 more than fixing 100 false alarms.
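The cost analysis can be sketched as a small function over the confusion matrix. The per-event costs below are hypothetical (chosen so that a missed attack costs far more than a false alarm, consistent with the tip above, but not the lab's actual figures), applied to the 950/50 dataset from the Background:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-event costs -- tune these to your own environment
COST_MISSED_ATTACK = 10_000  # per false negative
COST_FALSE_ALARM = 50        # analyst time per false positive

def business_cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * COST_MISSED_ATTACK + fp * COST_FALSE_ALARM

y_true = np.array([0] * 950 + [1] * 50)

# Model A: 80% recall, no false alarms
model_a = np.concatenate([np.zeros(950), np.ones(40), np.zeros(10)]).astype(int)
# Model B: 90% recall, but 100 false alarms
model_b = np.concatenate([np.zeros(850), np.ones(100),
                          np.ones(45), np.zeros(5)]).astype(int)

print(f"Model A cost: £{business_cost(y_true, model_a):,}")
print(f"Model B cost: £{business_cost(y_true, model_b):,}")
```

Under these assumed costs, Model B wins despite 100 extra false alarms, because catching 5 more attacks saves £50,000 while triaging the false alarms costs only £5,000.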


Summary

Metric          Best For                        Avoid When
Accuracy        Balanced classes                Any imbalanced dataset
Precision       False alarms are costly         Missing threats is dangerous
Recall          Missing threats is dangerous    False alarms are very costly
F1              Balance precision & recall      Imbalanced + you care about true negatives
ROC-AUC         General ranking quality         Rarely (always good to report)
Avg Precision   Very imbalanced datasets

Key Takeaways:

  • Always stratify your train/test split and cross-validation folds

  • Use cross_validate with multiple metrics, not just accuracy

  • Tune the classification threshold based on business cost

  • Report business impact, not just percentages
