Lab 18: Self-Supervised & Contrastive Learning

Objective

Learn representations without labels: SimCLR-style contrastive learning, data augmentation strategies for security data, BYOL (Bootstrap Your Own Latent), linear evaluation protocol, and applying self-supervised pre-training to network intrusion detection with limited labels.

Time: 50 minutes | Level: Advanced | Docker Image: zchencow/innozverse-ai:latest


Background

Supervised learning needs thousands of labelled examples.
Self-supervised learning learns from the data itself — no labels needed.

Core idea: create two "views" of the same sample → learn representations
where views of the SAME sample are close, views of DIFFERENT samples are far.

SimCLR (Chen et al. 2020):
  x → augment → x1, x2 → encoder → z1, z2
  Loss: NT-Xent — maximise similarity of (z1,z2), minimise similarity to others in batch

Security motivation: labelled attack traffic is rare (only 5-10% of logs).
Self-supervised pre-training on unlabelled traffic → fine-tune with few labels.
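The NT-Xent loss sketched above can be written in a few lines of NumPy. This is a minimal illustration, not the lab's reference implementation; `tau` is the usual temperature hyperparameter:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent: for each embedding, its augmented twin is the positive;
    every other embedding in the batch is a negative."""
    z = np.concatenate([z1, z2])                    # (2N, d)
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> dot = cosine
    sim = z @ z.T / tau                             # temperature-scaled logits
    np.fill_diagonal(sim, -np.inf)                  # a sample is not its own negative
    n = len(z1)
    pos = np.r_[np.arange(n, 2 * n), np.arange(n)]  # index of each row's positive
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

Perfectly aligned views give a low loss; random embeddings give roughly log(2N−1). Training pushes the batch from the second regime toward the first.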

Step 1: Data Augmentation for Network Traffic

📸 Verified Output:
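Images get crops and colour jitter; flow records need different views. A sketch, assuming rows of pre-scaled numeric flow features — the `augment_flow` helper and its noise/dropout rates are illustrative, not the lab's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_flow(x, noise_std=0.05, drop_prob=0.1, scale_jitter=0.1):
    """Produce one stochastic 'view' of a numeric flow-feature vector."""
    v = x * (1 + rng.uniform(-scale_jitter, scale_jitter, x.shape))  # magnitude jitter
    v = v + rng.normal(0.0, noise_std, x.shape)                      # Gaussian noise
    keep = rng.random(x.shape) > drop_prob                           # feature dropout
    return v * keep

flow = np.array([0.42, 1.7, 0.0, 3.1])   # e.g. scaled duration, bytes, flags, packets
view1, view2 = augment_flow(flow), augment_flow(flow)   # two views of the same sample
```

The constraint is that augmentations must be label-preserving: a jittered port-scan flow should still look like a port scan to the encoder.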


Step 2: SimCLR Implementation

📸 Verified Output:
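One SimCLR-style step, end to end, as a sketch: a random-weight one-layer encoder, projection head, and inline NT-Xent stand in for whatever encoder and training loop the lab actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder + projection head: one random ReLU layer, then a linear map.
W_enc = rng.normal(size=(8, 32))
W_proj = rng.normal(size=(32, 16))
encode = lambda X: np.maximum(0.0, X @ W_enc)   # h = representation (kept afterwards)
project = lambda H: H @ W_proj                   # z = projection (seen only by the loss)

X = rng.normal(size=(64, 8))                     # batch of unlabelled "traffic"
v1 = X + rng.normal(0, 0.05, X.shape)            # two noisy views per sample
v2 = X + rng.normal(0, 0.05, X.shape)

z1, z2 = project(encode(v1)), project(encode(v2))
z = np.concatenate([z1, z2])
z /= np.linalg.norm(z, axis=1, keepdims=True)    # cosine similarity via dot product
sim = z @ z.T / 0.5                              # NT-Xent logits, tau = 0.5
np.fill_diagonal(sim, -np.inf)                   # exclude self-similarity
pos = np.r_[np.arange(64, 128), np.arange(64)]   # each row's positive partner
loss = np.mean(np.log(np.exp(sim).sum(1)) - sim[np.arange(128), pos])
```

In a real run you would backpropagate `loss` into the encoder; after pre-training, the projection head is discarded and the `encode(...)` outputs feed the downstream task.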


Steps 3–8: Capstone — Few-Shot Attack Detection

📸 Verified Output:
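The capstone uses the linear evaluation protocol: freeze the pre-trained encoder and fit only a linear classifier on the few labelled flows. A self-contained sketch, where a random-weight encoder and synthetic flows stand in for the lab's SimCLR-pre-trained network and dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pre-trained" encoder (stand-in): in the lab this comes from SimCLR.
W = rng.normal(size=(10, 32))
encode = lambda X: np.maximum(0.0, X @ W)

# 50 labelled flows: benign around 0, "attacks" shifted by +1 in every feature.
X = np.vstack([rng.normal(0, 1, (25, 10)), rng.normal(1, 1, (25, 10))])
y = np.r_[np.zeros(25), np.ones(25)]

# Linear evaluation: logistic regression on frozen features; encoder untouched.
Z = encode(X)
Z = (Z - Z.mean(0)) / (Z.std(0) + 1e-8)          # standardise features
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(1000):                             # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))       # predicted attack probability
    w -= 0.1 * Z.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)
train_acc = np.mean((p > 0.5) == y)
```

Note what is being measured: with the encoder frozen, only `w` and `b` are trained, so the score reflects representation quality rather than classifier capacity.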


Summary

Method                 Labels  AUC   Key Idea
Random features        50      0.62  Baseline
Supervised RF          50      0.78  Standard supervised
SimCLR (pre-trained)   50      0.81  Self-supervised
Supervised RF          4000    0.98  Full-data upper bound

With the same 50 labels, SimCLR pre-training lifts AUC from 0.78 to 0.81, closing 15% of the gap between the 50-label supervised baseline and the full-data upper bound (0.98).
