Lab 14: Model Compression & Quantisation
Objective
Background
Problem: a frontier model such as GPT-4 reportedly needs 8× H100 GPUs to serve, while a typical edge device has only 4 GB of RAM.
Compression techniques:
Quantisation: FP32 → INT8 (4× smaller, 2-4× faster, <1% accuracy loss)
Pruning: Remove 90% of weights (sparse model, smaller storage)
Distillation: Train small "student" to mimic large "teacher"
GGUF/GPTQ: Practical formats used by llama.cpp, vLLM

Step 1: Baseline Teacher Model
```shell
docker run -it --rm zchencow/innozverse-ai:latest bash
```

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
import warnings; warnings.filterwarnings('ignore')

np.random.seed(42)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr); X_te_s = scaler.transform(X_te)

# Large teacher model
teacher = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300, random_state=42)
teacher.fit(X_tr_s, y_tr)
teacher_auc = roc_auc_score(y_te, teacher.predict_proba(X_te_s)[:, 1])

def model_size_kb(model) -> float:
    """Estimate model size in KB (FP32 weight matrices only; biases are negligible)."""
    if hasattr(model, 'coefs_'):
        return sum(w.size * 4 for w in model.coefs_) / 1024  # 4 bytes per FP32 weight
    return 0.0

print(f"Teacher model: AUC={teacher_auc:.4f} size≈{model_size_kb(teacher):.1f}KB")
```

Step 2: Knowledge Distillation
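The body of this step is not shown, so here is one possible sketch, reusing the synthetic dataset and teacher from Step 1. scikit-learn has no built-in soft-label classifier, so the student is an `MLPRegressor` fitted to the teacher's predicted probabilities, a common workaround; the student's hidden-layer size here is illustrative, not the lab's official setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.metrics import roc_auc_score
import warnings; warnings.filterwarnings('ignore')

# Same dataset and teacher as Step 1
X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)

teacher = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300, random_state=42)
teacher.fit(X_tr_s, y_tr)

# Soft targets: the teacher's predicted P(y=1) on the training set
soft_targets = teacher.predict_proba(X_tr_s)[:, 1]

# Student: a much smaller net, regressed onto the soft targets
student = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=42)
student.fit(X_tr_s, soft_targets)

# The student's regression output is a probability-like score, so AUC applies directly
teacher_auc = roc_auc_score(y_te, teacher.predict_proba(X_te_s)[:, 1])
student_auc = roc_auc_score(y_te, student.predict(X_te_s))
size_kb = lambda m: sum(w.size * 4 for w in m.coefs_) / 1024
print(f"teacher AUC={teacher_auc:.4f} ({size_kb(teacher):.0f} KB) | "
      f"student AUC={student_auc:.4f} ({size_kb(student):.0f} KB)")
```

The student keeps most of the teacher's ranking ability at a small fraction of the parameter count; regressing on probabilities rather than hard labels lets it learn from the teacher's confidence, which is the core idea of distillation.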
Step 3: Weight Pruning
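This step's code is also not shown; below is a minimal sketch of global magnitude pruning applied to the Step 1 teacher. The 90% sparsity target comes from the Background section; the `prune` helper is a hypothetical name, and zeroing `coefs_` in place works because scikit-learn's MLP uses those arrays directly in its forward pass (storage stays dense here, so this demonstrates the accuracy effect, not the storage saving).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
import warnings; warnings.filterwarnings('ignore')

X, y = make_classification(n_samples=10000, n_features=20, n_informative=12,
                           weights=[0.94, 0.06], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)

model = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300, random_state=42)
model.fit(X_tr_s, y_tr)
base_auc = roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])

def prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights globally, in place."""
    all_w = np.concatenate([w.ravel() for w in model.coefs_])
    thresh = np.quantile(np.abs(all_w), sparsity)
    for w in model.coefs_:
        w[np.abs(w) < thresh] = 0.0

prune(model, sparsity=0.9)
pruned_auc = roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])
nnz = sum(int((w != 0).sum()) for w in model.coefs_)
total = sum(w.size for w in model.coefs_)
print(f"base AUC={base_auc:.4f} → pruned AUC={pruned_auc:.4f}, "
      f"nonzero weights: {nnz}/{total} ({nnz/total:.0%})")
```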
Step 4: Post-Training Quantisation
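As a sketch of the FP32 → INT8 conversion this step performs, the snippet below implements symmetric per-tensor quantisation on a random weight matrix: each tensor is represented as `scale * int8`, giving the 4× size reduction quoted in the Background. The function names and matrix shape are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantisation: w ≈ scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(256, 128)).astype(np.float32)  # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

err = np.abs(w - w_hat).max()
print(f"FP32: {w.nbytes/1024:.1f} KB → INT8: {q.nbytes/1024:.1f} KB "
      f"(4× smaller), max abs error={err:.5f}")
```

The round-trip error is bounded by half a quantisation step (`scale / 2`), which is why post-training INT8 typically costs well under 1% accuracy.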
Step 5–8: Capstone — Edge Deployment Package
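The capstone's content is not shown here; one way such a deployment package might be assembled is sketched below, assuming the combination of the earlier steps: INT8-quantised weight matrices plus their scales, written to a single compressed archive and compared against the pickled FP32 model. The archive layout (`w{i}`/`s{i}`/`b{i}` keys) is an assumption for illustration, not a standard format.

```python
import io
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
import warnings; warnings.filterwarnings('ignore')

X, y = make_classification(n_samples=2000, n_features=20, n_informative=12, random_state=42)
model = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=100, random_state=42)
model.fit(X, y)

# FP32 baseline: the whole model, pickled
fp32_bytes = len(pickle.dumps(model))

# "Deployment package": INT8 weights + per-layer scales, plus FP16 biases
package = {}
for i, w in enumerate(model.coefs_):
    scale = np.abs(w).max() / 127.0
    package[f"w{i}"] = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    package[f"s{i}"] = np.float32(scale)
for i, b in enumerate(model.intercepts_):
    package[f"b{i}"] = b.astype(np.float16)  # biases are tiny; FP16 is plenty

buf = io.BytesIO()
np.savez_compressed(buf, **package)
int8_bytes = buf.getbuffer().nbytes
print(f"pickled FP32 model: {fp32_bytes/1024:.1f} KB → INT8 package: {int8_bytes/1024:.1f} KB")
```

On the edge device the loader would reverse the transform (`w = s * q`) before inference, trading a one-off dequantisation cost for a much smaller download that fits the 4 GB RAM budget from the Background.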
Summary
| Technique | Compression | Accuracy Loss | Effort |
| --- | --- | --- | --- |
| Quantisation (FP32 → INT8) | 4× smaller, 2–4× faster | <1% | Low (post-training, no retraining) |
| Pruning | Up to 90% of weights removed | Small at moderate sparsity | Medium (pick sparsity, re-evaluate) |
| Distillation | Student much smaller than teacher | Depends on student capacity | High (train the student) |
Further Reading
