Lab 06: AI Observability & Monitoring

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

ML models silently degrade in production. This lab covers the complete ML observability stack: data drift detection (KS test, PSI), concept drift, model performance monitoring, feature importance tracking, outlier detection, and production monitoring architectures.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                 ML Observability Stack                        │
├──────────────────────────────────────────────────────────────┤
│  Production Traffic → Feature Logging → Drift Pipeline       │
│       ↓                                      ↓               │
│  Prediction Store                      KS Test / PSI         │
│  Ground Truth (delayed)                Feature Drift Detect  │
│       ↓                                      ↓               │
│  Performance Metrics                   ALERT System          │
│  (Accuracy, F1, AUC)                  (PagerDuty / Slack)   │
├──────────────────────────────────────────────────────────────┤
│  Dashboards: Grafana + Evidently AI + Prometheus             │
└──────────────────────────────────────────────────────────────┘

Step 1: Types of ML Model Degradation

Silent Degradation is the Enemy:

  • Model accuracy drops but no error logs appear

  • Business metrics worsen weeks later

  • Root cause: data or concept drift

Four Types of Drift:

| Type | Definition | Detection Method | Frequency |
|------|------------|------------------|-----------|
| Data drift | Input distribution changed | KS test, PSI, JS divergence | Daily |
| Concept drift | P(Y\|X) relationship changed | Performance metrics monitoring | Weekly |
| Label drift | Output distribution changed | Output distribution monitoring | Daily |
| Feature drift | Individual feature statistics changed | Per-feature statistics | Daily |

Examples:
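A quick way to build intuition is to simulate each kind of drift. The sketch below (all values illustrative) contrasts a data-drift shift in the input distribution with a label-drift shift in the class prior; concept drift is noted in comments because it changes P(Y|X) rather than any single marginal:

```python
import numpy as np

rng = np.random.default_rng(42)

# Data drift: the input distribution shifts (e.g., average user age rises)
train_age = rng.normal(35, 8, 10_000)   # training-time feature values
prod_age = rng.normal(42, 8, 10_000)    # production values, mean shifted

# Label drift: the output class prior shifts (e.g., fraud rate doubles)
train_fraud_rate = 0.02
prod_fraud_rate = 0.04

# Concept drift: same inputs, but P(Y|X) changed -- e.g., an income
# threshold that used to predict default no longer does. It is invisible
# in the marginals above, which is why it needs performance monitoring.
print(f"Age mean shift: {train_age.mean():.1f} -> {prod_age.mean():.1f}")
print(f"Fraud prior shift: {train_fraud_rate:.2%} -> {prod_fraud_rate:.2%}")
```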


Step 2: KS Test for Feature Drift

The Kolmogorov-Smirnov test compares two distributions without assuming normality.

KS Test:

Interpretation:

💡 The KS test is powerful but sensitive to sample size: with N = 100,000 samples, even tiny, practically insignificant differences yield p < 0.05. Always combine the p-value with the magnitude of the KS statistic.
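A sketch of the KS check and its interpretation, using SciPy's `ks_2samp` (data synthetic, thresholds illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5_000)      # training-time feature values
drifted = rng.normal(0.5, 1, 5_000)      # production values, mean shifted

stat, p_value = ks_2samp(reference, drifted)

# Combine the p-value with an effect-size floor on the KS statistic,
# since p < 0.05 alone is not meaningful at large sample sizes
if p_value < 0.05 and stat > 0.1:
    print(f"Drift detected: KS={stat:.3f}, p={p_value:.2e}")
else:
    print(f"No drift: KS={stat:.3f}, p={p_value:.2e}")
```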


Step 3: PSI (Population Stability Index)

PSI is the industry standard for monitoring data drift in credit scoring and finance.

PSI Formula:
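PSI sums, over bins i, (actual_i − expected_i) · ln(actual_i / expected_i), where expected_i and actual_i are the bin proportions of the reference and production samples. A minimal NumPy implementation (binning and clipping choices are illustrative):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two 1-D samples.
    Bin edges are fixed from the expected (training) distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)           # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(600, 50, 20_000)   # e.g., credit scores at training time
current = rng.normal(580, 55, 20_000)    # shifted production scores
print(f"PSI = {psi(baseline, current):.3f}")
```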

PSI vs KS Test:

| Metric | Sensitivity | Directionality | Industry Use |
|--------|-------------|----------------|--------------|
| KS Test | High | Yes (which direction?) | General ML |
| PSI | Medium | No (magnitude only) | Finance/credit |
| JS Divergence | Medium | Symmetric | General ML |
| Wasserstein | High | Yes | Research |


Step 4: Concept Drift Detection

Concept drift means the relationship between features and labels has changed.

Detection Approaches:

1. Performance-based monitoring (most practical):

2. Statistical tests on predictions:

3. Challenger model comparison:

ADWIN (Adaptive Windowing):
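ADWIN itself is available in stream-learning libraries such as `river`. The sketch below instead implements the simpler performance-based approach from option 1 in pure Python; the class name and thresholds are hypothetical, and a real deployment must account for ground-truth delay:

```python
import random
from collections import deque

class RollingAccuracyMonitor:
    """Flags concept drift when recent accuracy falls well below the
    long-run baseline. A simplified stand-in for ADWIN-style detectors."""

    def __init__(self, window=500, drop_threshold=0.10):
        self.window = deque(maxlen=window)
        self.baseline = None
        self.drop_threshold = drop_threshold

    def update(self, y_true, y_pred):
        self.window.append(int(y_true == y_pred))
        if len(self.window) < self.window.maxlen:
            return False                      # not enough data yet
        acc = sum(self.window) / len(self.window)
        if self.baseline is None:
            self.baseline = acc               # freeze the first full window
            return False
        return acc < self.baseline - self.drop_threshold

monitor = RollingAccuracyMonitor(window=200)
random.seed(0)
drift_at = None
# Phase 1: model is ~95% accurate; Phase 2: concept shifts, accuracy drops to ~70%
for i in range(2_000):
    correct = random.random() < (0.95 if i < 1_000 else 0.70)
    if monitor.update(1, 1 if correct else 0) and drift_at is None:
        drift_at = i
print(f"Drift flagged at sample {drift_at}")
```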


Step 5: Feature Importance Drift

Even without overall performance drop, individual feature importances can shift.

Monitoring Feature Importances Over Time:

SHAP-based Drift:
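Where the `shap` package is unavailable, permutation importance gives a similar model-attribution signal. The sketch below (scikit-learn assumed, data synthetic) trains on a reference window, then compares per-feature attributions against a production window in which the key feature's signal has washed out:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Reference window: feature 0 drives the label
X_ref = rng.normal(size=(2_000, 3))
y_ref = (X_ref[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_ref, y_ref)

# Production window: labels no longer follow feature 0 (concept drift),
# so feature 0's contribution to model performance collapses
X_prod = rng.normal(size=(2_000, 3))
y_prod = rng.integers(0, 2, 2_000)

imp_ref = permutation_importance(model, X_ref, y_ref, random_state=0).importances_mean
imp_prod = permutation_importance(model, X_prod, y_prod, random_state=0).importances_mean

for j, (a, b) in enumerate(zip(imp_ref, imp_prod)):
    flag = "  <-- investigate" if abs(a - b) > 0.1 else ""
    print(f"feature {j}: {a:.3f} -> {b:.3f}{flag}")
```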


Step 6: Outlier Detection for ML Monitoring

Detect input samples that are far outside the training distribution.

Methods:

| Method | Type | Complexity | Best For |
|--------|------|------------|----------|
| Z-score | Statistical | Low | Univariate, normal distributions |
| Isolation Forest | ML | Medium | Multivariate, any distribution |
| Autoencoder | Deep learning | High | High-dimensional, complex patterns |
| LOF | Distance-based | Medium | Local anomalies |
| OCSVM | SVM-based | Medium | Small datasets |

Production Outlier Monitoring:
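One common production pattern is to fit an Isolation Forest on training-time feature vectors and alert when a batch's outlier rate spikes. A sketch (scikit-learn assumed, thresholds and data illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Fit the detector on training-time feature vectors
X_train = rng.normal(0, 1, size=(5_000, 4))
detector = IsolationForest(contamination=0.01, random_state=7).fit(X_train)

def outlier_rate(batch):
    """Fraction of a production batch flagged as outliers (-1 = outlier)."""
    return float(np.mean(detector.predict(batch) == -1))

normal_batch = rng.normal(0, 1, size=(500, 4))
weird_batch = rng.normal(4, 1, size=(500, 4))   # far outside training support

print(f"normal batch:  {outlier_rate(normal_batch):.1%} outliers")
print(f"shifted batch: {outlier_rate(weird_batch):.1%} outliers")
```

In production the batch-level rate, not individual flags, is what feeds the alerting system: a persistent spike above the training-time contamination level suggests the model is scoring inputs it has never seen.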


Step 7: Monitoring Architecture (Evidently + Grafana)

Evidently AI Reports:

Metrics Stack:

Shadow Mode for Safe Monitoring:

💡 Run every new model in shadow mode for at least 48 hours (covering weekday + weekend patterns) before canary promotion.
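At its core, shadow mode reduces to logging both models' outputs on the same live traffic and tracking disagreement. A toy sketch with simulated scores (variable names and the 0.5 decision threshold are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Both models score the same traffic; only the champion's answer is served
champion_scores = rng.uniform(0, 1, 10_000)
shadow_scores = np.clip(champion_scores + rng.normal(0, 0.05, 10_000), 0, 1)

champion_pred = champion_scores > 0.5
shadow_pred = shadow_scores > 0.5

agreement = float(np.mean(champion_pred == shadow_pred))
print(f"Champion/shadow agreement: {agreement:.1%}")
# Low agreement means the shadow model behaves differently on real traffic;
# review the disagreement cases before any canary promotion
```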


Step 8: Capstone — Build Drift Detection System

📸 Verified Output:

Monitoring Action Matrix:

| KS p-value | PSI | Action |
|------------|-----|--------|
| > 0.05 | < 0.1 | ✅ No action |
| < 0.05 | 0.1–0.2 | ⚠️ Investigate root cause |
| < 0.01 | > 0.2 | 🚨 Trigger retraining pipeline |
| < 0.001 | > 0.25 | 🚨 Disable model, manual review |
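The action matrix can be encoded as a simple policy function. This sketch takes a conservative reading, treating either metric crossing a threshold as sufficient to escalate (the exact combination rule is a design choice):

```python
def drift_action(ks_p: float, psi: float) -> str:
    """Map drift metrics to an action tier; thresholds follow the matrix above."""
    if ks_p < 0.001 or psi > 0.25:
        return "disable model, manual review"
    if ks_p < 0.01 or psi > 0.2:
        return "trigger retraining pipeline"
    if ks_p < 0.05 or psi >= 0.1:
        return "investigate root cause"
    return "no action"

print(drift_action(ks_p=0.3, psi=0.05))    # healthy feature
print(drift_action(ks_p=0.004, psi=0.22))  # escalates to retraining
```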


Summary

| Concept | Key Points |
|---------|------------|
| Data Drift | Input distribution changed; detect with KS test or PSI |
| Concept Drift | P(Y\|X) relationship changed; detect with performance monitoring |
| KS Test | Non-parametric; p < 0.05 suggests drift, but sensitive to sample size |
| PSI | Finance standard: <0.1 = stable, 0.1–0.2 = monitor, >0.2 = retrain |
| Feature Importance Drift | Track SHAP values over time; sudden changes warrant investigation |
| Shadow Mode | New model runs on live traffic with no user impact; safest test |
| Monitoring Stack | Evidently (reports) + Prometheus (metrics) + Grafana (dashboards) |

Next Lab: Lab 07: Federated Learning at Scale →
