Lab 14: Model Compression & Quantisation

Objective

Reduce model size and inference latency without sacrificing accuracy: knowledge distillation, weight pruning, post-training quantisation (INT8/INT4), and structured compression — applied to deploying a malware classifier on edge devices.

Time: 50 minutes | Level: Advanced | Docker Image: zchencow/innozverse-ai:latest


Background

Problem: frontier models such as GPT-4 are served from multi-GPU clusters (e.g. 8× H100), while a typical edge device has 4GB of RAM and no GPU.

Compression techniques:
  Quantisation:   FP32 → INT8 (4× smaller, 2-4× faster, <1% accuracy loss)
  Pruning:        Remove up to 90% of weights (sparse model, smaller storage)
  Distillation:   Train a small "student" to mimic a large "teacher"
  GGUF/GPTQ:      Practical quantised formats used by llama.cpp, vLLM
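
The 4× figure for INT8 follows directly from byte widths (4 bytes per FP32 value vs 1 byte per INT8 value). A standalone sketch of symmetric per-tensor quantisation on one weight matrix (the matrix shape is illustrative):

```python
import numpy as np

# Simulate one FP32 weight matrix.
rng = np.random.default_rng(42)
w = rng.normal(0, 0.1, size=(256, 128)).astype(np.float32)

# Symmetric INT8 quantisation: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale  # dequantised copy for inference

print(f"FP32 size: {w.nbytes / 1024:.1f} KB")   # → 128.0 KB (4 bytes/weight)
print(f"INT8 size: {w_q.nbytes / 1024:.1f} KB")  # → 32.0 KB (1 byte/weight, 4× smaller)
print(f"Mean abs rounding error: {np.abs(w - w_deq).mean():.6f}")
```

The per-tensor scale is the only FP32 value that must be stored alongside the INT8 weights, which is why the 4× ratio holds almost exactly in practice.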

Step 1: Baseline Teacher Model

docker run -it --rm zchencow/innozverse-ai:latest bash
# Inside the container, start a Python session:
python3
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
import warnings; warnings.filterwarnings('ignore')

np.random.seed(42)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                             weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr); X_te_s = scaler.transform(X_te)

# Large teacher model
teacher = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300, random_state=42)
teacher.fit(X_tr_s, y_tr)
teacher_auc = roc_auc_score(y_te, teacher.predict_proba(X_te_s)[:, 1])

def model_size_kb(model) -> float:
    """Estimate model size in KB (FP32 weights + biases)."""
    if hasattr(model, 'coefs_'):
        params = model.coefs_ + model.intercepts_
        return sum(w.size * 4 for w in params) / 1024  # 4 bytes per FP32 value
    return 0.0

print(f"Teacher model:  AUC={teacher_auc:.4f}  size≈{model_size_kb(teacher):.1f}KB")

📸 Verified Output:


Step 2: Knowledge Distillation
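
The lab's distillation code is not reproduced in this handout. As a minimal sketch of the mechanism, the small student below is an `MLPRegressor` fitted to the teacher's soft probabilities rather than the hard labels; the hidden-layer sizes are illustrative assumptions, and the Step 1 setup is repeated so the snippet runs standalone (the exact AUC numbers in the verified output come from the lab environment):

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
warnings.filterwarnings('ignore')

# Step 1 setup, repeated for a self-contained snippet.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)

teacher = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300,
                        random_state=42).fit(X_tr_s, y_tr)

# Baseline: same small architecture trained on hard 0/1 labels only.
baseline = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                         random_state=42).fit(X_tr_s, y_tr)

# Student: regress the teacher's soft probabilities (the "dark knowledge").
soft_targets = teacher.predict_proba(X_tr_s)[:, 1]
student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300,
                       random_state=42).fit(X_tr_s, soft_targets)

base_auc = roc_auc_score(y_te, baseline.predict_proba(X_te_s)[:, 1])
stud_auc = roc_auc_score(y_te, student.predict(X_te_s))
print(f"Small baseline AUC:  {base_auc:.4f}")
print(f"Distilled student AUC: {stud_auc:.4f}")
```

The soft targets carry the teacher's confidence on each sample, which is a richer training signal than the hard labels alone.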

📸 Verified Output:

💡 The distilled small model (0.9712) significantly outperforms the non-distilled small model (0.9523) at the same size — knowledge transfer works!


Step 3: Weight Pruning
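
The pruning code itself is not shown in this handout. A minimal sketch of global magnitude pruning on a scikit-learn MLP (zero out the smallest-magnitude weights across all layers; the sparsity levels are illustrative, and the Step 1 setup is repeated so the snippet runs standalone):

```python
import copy
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
warnings.filterwarnings('ignore')

# Step 1 setup, repeated for a self-contained snippet.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)
model = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300,
                      random_state=42).fit(X_tr_s, y_tr)

def magnitude_prune(m, sparsity):
    """Zero the fraction `sparsity` of weights with smallest |magnitude|, globally."""
    pruned = copy.deepcopy(m)
    all_w = np.concatenate([w.ravel() for w in pruned.coefs_])
    thresh = np.quantile(np.abs(all_w), sparsity)
    for w in pruned.coefs_:
        w[np.abs(w) < thresh] = 0.0
    return pruned

for s in (0.5, 0.8, 0.9):
    p = magnitude_prune(model, s)
    auc = roc_auc_score(y_te, p.predict_proba(X_te_s)[:, 1])
    nz = sum((w != 0).sum() for w in p.coefs_) / sum(w.size for w in p.coefs_)
    print(f"sparsity={s:.0%}  nonzero weights={nz:.1%}  AUC={auc:.4f}")
```

Note that dense storage is unchanged by zeroing; the storage win comes from saving the surviving weights in a sparse format.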

📸 Verified Output:


Step 4: Post-Training Quantisation
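
The quantisation code is not shown in this handout. A minimal sketch of post-training INT8 "fake quantisation" on the Step 1 MLP: each weight matrix is stored as INT8 with a per-tensor scale, then dequantised for inference so we can measure the accuracy impact (the Step 1 setup is repeated so the snippet runs standalone):

```python
import copy
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
warnings.filterwarnings('ignore')

# Step 1 setup, repeated for a self-contained snippet.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)
model = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300,
                      random_state=42).fit(X_tr_s, y_tr)

def quantise_int8(m):
    """Symmetric per-tensor INT8: round to int8, dequantise back for inference."""
    q = copy.deepcopy(m)
    for i, w in enumerate(q.coefs_):
        scale = np.abs(w).max() / 127.0
        w_q = np.round(w / scale).astype(np.int8)
        q.coefs_[i] = w_q.astype(np.float64) * scale
    return q

q_model = quantise_int8(model)
fp32_auc = roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])
int8_auc = roc_auc_score(y_te, q_model.predict_proba(X_te_s)[:, 1])
fp32_kb = sum(w.size * 4 for w in model.coefs_) / 1024  # 4 bytes/weight
int8_kb = sum(w.size * 1 for w in model.coefs_) / 1024  # 1 byte/weight (+ tiny scales)
print(f"FP32: AUC={fp32_auc:.4f}  size≈{fp32_kb:.1f}KB")
print(f"INT8: AUC={int8_auc:.4f}  size≈{int8_kb:.1f}KB (4× smaller)")
```

Real deployments would keep the INT8 tensors and run integer kernels; dequantising here simply lets scikit-learn's float inference reveal the accuracy cost of the rounding.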

📸 Verified Output:


Step 5–8: Capstone — Edge Deployment Package
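
The capstone code is not shown in this handout. One way to sketch the end-to-end package, combining the earlier steps: distil a small student from the teacher, INT8-quantise its weights for storage, and serialise a self-describing bundle that a pure-NumPy edge runtime can load (all names and sizes below are illustrative assumptions; the Step 1 setup is repeated so the snippet runs standalone):

```python
import pickle
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
warnings.filterwarnings('ignore')

# Step 1 setup, repeated for a self-contained snippet.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)

# Teacher (Step 1) + distilled student (Step 2).
teacher = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300,
                        random_state=42).fit(X_tr_s, y_tr)
student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=42)
student.fit(X_tr_s, teacher.predict_proba(X_tr_s)[:, 1])

# INT8-quantise the student's weight matrices for storage (Step 4).
layers = []
for w, b in zip(student.coefs_, student.intercepts_):
    scale = np.abs(w).max() / 127.0
    layers.append((np.round(w / scale).astype(np.int8),
                   np.float32(scale), b.astype(np.float32)))

package = {"scaler_mean": scaler.mean_.astype(np.float32),
           "scaler_scale": scaler.scale_.astype(np.float32),
           "layers": layers}
blob = pickle.dumps(package)
print(f"Edge package size: {len(blob) / 1024:.1f} KB")

def edge_predict(X_raw, pkg):
    """Pure-NumPy inference: standardise, dequantise, forward pass with ReLU."""
    h = (X_raw - pkg["scaler_mean"]) / pkg["scaler_scale"]
    for i, (wq, scale, b) in enumerate(pkg["layers"]):
        h = h @ (wq.astype(np.float32) * scale) + b
        if i < len(pkg["layers"]) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers, identity on output
    return h.ravel()

edge_auc = roc_auc_score(y_te, edge_predict(X_te, package))
print(f"Edge package AUC: {edge_auc:.4f}")
```

The package depends only on NumPy at inference time, which fits the 4GB-RAM edge constraint far better than shipping scikit-learn plus the full teacher.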

📸 Verified Output:


Summary

Technique                  Compression     Accuracy Loss   Effort
INT8 quantisation          4×              <0.5%           Low
INT4 quantisation          8×              1-3%            Medium
Magnitude pruning (80%)    ~5× (sparse)    1-2%            Low
Knowledge distillation     10-26×          1-2%            High
Combined                   15-30×          1-2%            High
