Lab 09: Text Classification with BERT

Objective

Build a complete text classification pipeline using TF-IDF, word embeddings, and BERT-style contextual representations. Classify security advisories, threat intelligence reports, and vulnerability descriptions automatically.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

Text classification evolved through three generations:

Gen 1 — Bag of Words / TF-IDF (1990s–2010s):
  "SQL injection attack" → [sql:0.4, injection:0.5, attack:0.3, ...] (sparse)
  Fast, interpretable, works well for many tasks

Gen 2 — Word2Vec / GloVe (2013–2018):
  "SQL" → [0.12, -0.34, 0.56, ...] (dense 300-dim, captures semantics)
  "injection" → [0.09, -0.41, 0.61, ...]  (similar to "SQL" in this space)

Gen 3 — BERT / Transformers (2018–now):
  "I saw the bank" → bank (financial) vs bank (river) context-aware
  Each token gets a different vector depending on surrounding context

Step 1: Environment Setup

📸 Verified Output:
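The lab image ships with the scientific Python stack; a minimal sanity check you can run inside the container (this sketch only confirms the core libraries import — it does not pin versions):

```python
# Confirm the libraries this lab relies on are importable and report versions.
import sys

import numpy as np
import sklearn

print(f"Python       : {sys.version.split()[0]}")
print(f"numpy        : {np.__version__}")
print(f"scikit-learn : {sklearn.__version__}")
```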


Step 2: Build a Security Text Dataset

📸 Verified Output:
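The lab's dataset is not reproduced here, so the sketch below builds a miniature stand-in corpus. The category names (`sql_injection`, `xss`, `buffer_overflow`) follow the classes mentioned in Step 6, but the texts themselves are invented for illustration:

```python
# A miniature labelled corpus in the spirit of this step. The real lab
# dataset is larger; these documents are illustrative stand-ins.
texts = [
    "Attacker supplies a crafted SQL statement via the login form parameter",
    "Union-based SQL injection in the search endpoint leaks database rows",
    "Stored cross-site scripting: injected <script> runs in the victim browser",
    "Reflected XSS through an unescaped URL query parameter",
    "Heap buffer overflow when parsing an oversized packet header",
    "Stack-based buffer overflow allows overwriting the return address",
]
labels = [
    "sql_injection", "sql_injection",
    "xss", "xss",
    "buffer_overflow", "buffer_overflow",
]
assert len(texts) == len(labels)
print(f"{len(texts)} documents, {len(set(labels))} classes: {sorted(set(labels))}")
```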


Step 3: TF-IDF Vectorisation

📸 Verified Output:

💡 Character n-grams are powerful for security text: they capture typos, obfuscated terms, and morphological variants. At the word level, unigrams plus bigrams typically give the best coverage.
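Both vectorisers from the tip can be sketched with scikit-learn's `TfidfVectorizer`; the example documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "sql injection attack on the login form",
    "cross site scripting payload in the comment field",
    "buffer overflow triggered by a long input string",
]

# Word-level: unigrams + bigrams, with sublinear TF scaling (see the
# takeaways) to dampen raw term-frequency counts.
word_vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_word = word_vec.fit_transform(docs)

# Character-level: 3-5 char n-grams within word boundaries -- robust to
# typos and obfuscations like "sq1 inject1on".
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X_char = char_vec.fit_transform(docs)

print("word features:", X_word.shape[1])  # sparse matrix, docs x vocab
print("char features:", X_char.shape[1])
```

The character vocabulary is typically much larger than the word vocabulary, which is why `char_wb` (character n-grams bounded at word edges) is preferred over plain `char` to keep it in check.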


Step 4: Manual Word Embeddings (Word2Vec-style)

📸 Verified Output:
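The mechanics of Word2Vec-style document vectors can be sketched with a toy lookup table. The vectors below are random placeholders, not trained embeddings — real Word2Vec learns them so that words in similar contexts end up close together:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 300  # Word2Vec's classic dimensionality

# Toy lookup table of dense word vectors (random stand-ins for learned ones).
vocab = ["sql", "injection", "attack", "buffer", "overflow"]
embeddings = {w: rng.standard_normal(DIM) for w in vocab}

def doc_vector(text: str) -> np.ndarray:
    """Average the vectors of known words -- the standard trick for turning
    static word embeddings into a single document embedding."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)

v = doc_vector("SQL injection attack")
print(v.shape)  # (300,)
```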


Step 5: Simulating BERT-Style Contextual Embeddings

📸 Verified Output:

💡 Contextual embeddings create perfectly separable clusters in 768-dimensional space — hence 100% accuracy with a linear classifier. Real BERT on real security text achieves 95–99% on well-defined categories.
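One way to simulate context-dependence without loading a real model is to give each word a static base vector plus a shift derived from its neighbours. This is a crude stand-in for BERT's attention mechanism, not the lab's actual simulation, but it reproduces the key behaviour: the same token gets different vectors in different sentences:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 768  # BERT-base hidden size

# Static base vector per word; context adds a shift, so "bank" near "money"
# lands somewhere different from "bank" near "river".
base = {w: rng.standard_normal(DIM) for w in ["bank", "river", "money"]}

def contextual(word: str, context: list[str]) -> np.ndarray:
    shift = np.mean([base[c] for c in context if c in base], axis=0)
    return base[word] + 0.5 * shift

bank_financial = contextual("bank", ["money"])
bank_river = contextual("bank", ["river"])

cos = np.dot(bank_financial, bank_river) / (
    np.linalg.norm(bank_financial) * np.linalg.norm(bank_river)
)
print(f"cosine(bank|money, bank|river) = {cos:.3f}")  # < 1: context moved the vector
```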


Step 6: Confusion Analysis

📸 Verified Output:

💡 The one XSS→buffer_overflow misclassification typically comes from text mentioning both "script execution" and "memory" — borderline documents that even human analysts might debate.
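A confusion matrix for results like those described above can be produced with `sklearn.metrics.confusion_matrix`; the `y_true`/`y_pred` lists below are hypothetical and merely mirror the single XSS→buffer_overflow slip:

```python
from sklearn.metrics import confusion_matrix

classes = ["buffer_overflow", "sql_injection", "xss"]

# Hypothetical test-set results: one XSS document predicted as buffer_overflow.
y_true = ["xss", "xss", "xss",
          "sql_injection", "sql_injection", "buffer_overflow"]
y_pred = ["xss", "xss", "buffer_overflow",
          "sql_injection", "sql_injection", "buffer_overflow"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print("rows = true, cols = predicted, order:", classes)
print(cm)
```

Off-diagonal cells are the interesting ones: reading them row by row tells you which true class is being mistaken for which predicted class.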


Step 7: Top Predictive Features

📸 Verified Output:


Step 8: Real-World Capstone — CVE Severity Classifier

📸 Verified Output:

💡 A model like this, deployed in a CVE intake pipeline, auto-assigns severity to 90%+ of incoming vulnerabilities — saving security teams hours of manual triage daily.
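A severity classifier of the kind described could be assembled as a scikit-learn `Pipeline`. The descriptions and labels below are toy stand-ins; a real intake pipeline would train on thousands of historical CVE records with CVSS-derived severity labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy CVE-style descriptions with severity labels (illustrative only).
descriptions = [
    "remote code execution with no authentication required",
    "unauthenticated attacker gains full system control",
    "denial of service via malformed request",
    "crash on malformed packet causes service restart",
    "verbose error message discloses internal path",
    "minor information disclosure in debug output",
]
severity = ["critical", "critical", "medium", "medium", "low", "low"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(descriptions, severity)

incoming = "pre-auth remote code execution in the admin service"
print(clf.predict([incoming])[0])
```

In production you would also inspect `predict_proba` and route low-confidence predictions to a human analyst instead of auto-assigning them.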


Summary

| Method | Strengths | Best For |
| --- | --- | --- |
| TF-IDF + Logistic Reg | Fast, interpretable, no GPU | Short texts, interpretability needed |
| Word embeddings + SVM | Captures semantics | Medium datasets |
| BERT embeddings | Best accuracy, context-aware | Production, GPU available |
| Char n-grams | Handles typos/obfuscation | Security evasion text |

Key Takeaways:

  • TF-IDF with unigrams+bigrams is a strong baseline — try it first

  • Sublinear TF (sublinear_tf=True) improves performance for long documents

  • Top feature weights reveal what the model actually learns

  • Real BERT: use sentence-transformers library for fast embedding extraction
