Lab 10: NER & Information Extraction

Objective

Extract structured information from unstructured security text — CVE IDs, IP addresses, malware names, attack techniques, and MITRE ATT&CK tactics. Build a pipeline that turns raw threat intelligence into actionable structured data.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

Security analysts read hundreds of threat reports, advisories, and blog posts every week. Named Entity Recognition (NER) automates the extraction of key entities from that text, for example:

Raw text:
  "APT28 exploited CVE-2023-23397 in Microsoft Outlook to steal NTLM hashes
   from victims at 192.168.1.45 using a technique mapped to T1187 in MITRE ATT&CK."

Extracted entities:
  THREAT_ACTOR: APT28
  CVE:          CVE-2023-23397
  PRODUCT:      Microsoft Outlook
  TECHNIQUE:    NTLM hash theft
  IP_ADDRESS:   192.168.1.45
  MITRE_TTP:    T1187

Step 1: Environment Setup



Step 2: Rule-Based NER — Regex Patterns

Regex patterns are the fastest and most reliable approach for well-defined entity types such as CVE IDs, IP addresses, and hashes.


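💡 Rule-based NER catches every string that matches an exact format such as a CVE ID or an IPv4 address, with no training data needed and very few false positives. Its blind spot is anything that deviates from the format, such as defanged indicators like 192[.]168[.]1[.]45. Always start with rules; add ML only where rules fail.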
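
The rule-based approach can be sketched with Python's standard `re` module. This is a minimal illustration, not the lab's original code: the label names, the `PATTERNS` table, and the `extract_entities` helper are assumptions introduced here.

```python
import re

# One compiled pattern per strictly formatted entity type.
# Label names are illustrative, not taken from the lab's own code.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    # Lookarounds reject longer dotted runs (e.g. 1.2.3.4.5) but still
    # allow a sentence-ending period after the address.
    "IP_ADDRESS": re.compile(
        r"(?<!\d)(?<!\d\.)" + OCTET + r"(?:\." + OCTET + r"){3}(?!\d|\.\d)"
    ),
    "MITRE_TECHNIQUE": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_entities(text):
    """Return (label, value, start, end) tuples ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((label, m.group(0), m.start(), m.end()))
    return sorted(found, key=lambda e: e[2])


report = (
    "APT28 exploited CVE-2023-23397 in Microsoft Outlook to steal NTLM hashes "
    "from victims at 192.168.1.45 using a technique mapped to T1187 in MITRE ATT&CK."
)
entities = extract_entities(report)
for label, value, start, end in entities:
    print(f"{label:16} {value}")
```

Keeping the patterns in a dictionary makes it cheap to add a new entity type: one more entry, no change to the extraction loop.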


Step 3: Threat Actor Recognition

Threat actor names don't follow a fixed format, so regex alone cannot find them. This is where lookup tables and ML help.
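
Before reaching for a trained model, a lookup table (gazetteer) of known actor names is a reasonable baseline. The sketch below is that baseline only, not an ML model: the `ACTOR_ALIASES` table is a small hand-written sample and `find_threat_actors` is a hypothetical helper name. A real deployment would load aliases from a maintained source.

```python
import re

# Hypothetical gazetteer: canonical name -> aliases seen in reports.
# Hard-coded here only to keep the sketch self-contained.
ACTOR_ALIASES = {
    "APT28": ["APT28", "Fancy Bear", "Sofacy"],
    "APT29": ["APT29", "Cozy Bear"],
    "Lazarus Group": ["Lazarus Group", "Lazarus", "Hidden Cobra"],
}

# Reverse index: lower-cased alias -> canonical name.
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in ACTOR_ALIASES.items()
    for alias in aliases
}

# Longest aliases first so "Lazarus Group" wins over "Lazarus".
_alternatives = sorted(ALIAS_TO_CANONICAL, key=len, reverse=True)
ACTOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(a) for a in _alternatives) + r")\b",
    re.IGNORECASE,
)


def find_threat_actors(text):
    """Return (matched_text, canonical_name) pairs in order of appearance."""
    return [
        (m.group(0), ALIAS_TO_CANONICAL[m.group(0).lower()])
        for m in ACTOR_RE.finditer(text)
    ]


hits = find_threat_actors("Fancy Bear, also tracked as APT28, reused Lazarus tooling.")
for matched, canonical in hits:
    print(f"{matched!r} -> {canonical}")
```

Normalizing aliases to one canonical name matters downstream: without it, the same group appears as several different entities. Names missing from the table are simply not found, which is the gap a trained model would be expected to close.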



Step 4: IOC (Indicator of Compromise) Classifier


💡 IOC confidence scores matter — a .ru domain has lower confidence than a SHA256 hash. Feed these scores into SIEM rules to set alert thresholds appropriately.
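
One way to attach a confidence score to each indicator is to classify it by type and assign a per-type score. The sketch below is rule-based and uses only the standard library; the numeric confidence values, the `IOC` dataclass, and the helper names are illustrative assumptions, not calibrated figures.

```python
import ipaddress
import re
from dataclasses import dataclass


@dataclass
class IOC:
    value: str
    ioc_type: str
    confidence: float


def is_ipv4(value):
    """True if value parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


# (type, predicate, confidence), checked in order. The scores are
# assumptions for illustration: hashes are unambiguous, domains are not.
RULES = [
    ("sha256", re.compile(r"[a-fA-F0-9]{64}").fullmatch, 0.99),
    ("md5", re.compile(r"[a-fA-F0-9]{32}").fullmatch, 0.95),
    ("ipv4", is_ipv4, 0.90),
    ("url", re.compile(r"https?://\S+", re.IGNORECASE).fullmatch, 0.80),
    ("domain", re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE).fullmatch, 0.60),
]


def refang(value):
    """Undo common defanging such as hxxp:// and [.] before matching."""
    return value.replace("[.]", ".").replace("hxxp", "http")


def classify_ioc(raw):
    value = refang(raw.strip())
    for ioc_type, matches, confidence in RULES:
        if matches(value):
            return IOC(value, ioc_type, confidence)
    return IOC(value, "unknown", 0.0)


def above_threshold(candidates, threshold=0.85):
    """Keep only indicators confident enough to alert on."""
    return [ioc for ioc in map(classify_ioc, candidates) if ioc.confidence >= threshold]


for candidate in ["a" * 64, "192.168.1.45", "evil-site[.]ru"]:
    print(classify_ioc(candidate))
```

With the default threshold of 0.85, the hash and the IP address would pass through to alerting while the bare domain would be held back for review.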


Step 5: Relation Extraction

Relation extraction finds the relationships between entities, such as which threat actor exploited which CVE in which product.


💡 Extracted triples can be stored in a knowledge graph (Neo4j) or fed into a threat intelligence platform like MISP. Each triple is a structured fact that can be queried: "Which APTs exploited this CVE?"
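
A simple way to produce such triples is a pattern that links an actor, a verb, and a CVE within one clause. The sketch below assumes actor names of the form `APT<n>` or `FIN<n>`, a short hand-picked verb list, and the relation names `EXPLOITS` and `AFFECTS`; all of these are illustrative choices rather than a fixed schema.

```python
import re

ACTOR = r"(?P<actor>(?:APT|FIN)\d+)"
CVE = r"(?P<cve>CVE-\d{4}-\d{4,})"
# A product is approximated as a run of capitalized words.
PRODUCT = r"(?P<product>[A-Z][\w-]*(?: [A-Z][\w-]*)*)"

EXPLOIT_RE = re.compile(
    ACTOR + r"\s+(?:exploited|exploits|leveraged)\s+" + CVE
    + r"(?:\s+in\s+" + PRODUCT + r")?"
)


def extract_triples(text):
    """Return (subject, relation, object) triples found in text."""
    triples = []
    for m in EXPLOIT_RE.finditer(text):
        triples.append((m["actor"], "EXPLOITS", m["cve"]))
        if m["product"]:
            triples.append((m["cve"], "AFFECTS", m["product"]))
    return triples


def actors_exploiting(triples, cve):
    """Answer 'which actors exploited this CVE?' over a list of triples."""
    return sorted({s for s, rel, o in triples if rel == "EXPLOITS" and o == cve})


report = (
    "APT28 exploited CVE-2023-23397 in Microsoft Outlook to steal NTLM hashes "
    "from victims at 192.168.1.45 using a technique mapped to T1187 in MITRE ATT&CK."
)
triples = extract_triples(report)
for triple in triples:
    print(triple)
print(actors_exploiting(triples, "CVE-2023-23397"))
```

Pattern-based extraction only finds relations phrased the way the pattern expects. Passive constructions ("CVE-... was exploited by ...") would need their own pattern or a dependency parser.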


Step 6: MITRE ATT&CK Technique Mapping
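
The Summary lists TF-IDF + LogReg for technique classification, which needs labeled training data and a library such as scikit-learn. As a dependency-free stand-in, the sketch below swaps the classifier for nearest-description matching: it builds TF-IDF vectors in plain Python and picks the technique whose description is closest by cosine similarity. The technique IDs and names are real ATT&CK entries; the keyword descriptions, the `min_score` cutoff, and all function names are assumptions made for this sketch.

```python
import math
import re
from collections import Counter

# Technique ID -> (name, hand-written keyword description).
TECHNIQUES = {
    "T1566": ("Phishing", "phishing email malicious attachment link lure spearphishing"),
    "T1110": ("Brute Force", "brute force password guessing spraying repeated failed login attempts"),
    "T1486": ("Data Encrypted for Impact", "ransomware encrypted files ransom note demand decryption"),
    "T1187": ("Forced Authentication", "forced authentication ntlm hash capture smb relay"),
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; tokens outside the vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


_DOCS = {tid: tokenize(desc) for tid, (_, desc) in TECHNIQUES.items()}
_IDF = build_idf(list(_DOCS.values()))
_VECTORS = {tid: tfidf(tokens, _IDF) for tid, tokens in _DOCS.items()}


def map_technique(text, min_score=0.1):
    """Return (technique_id, name, score) for the best match, or None."""
    query = tfidf(tokenize(text), _IDF)
    score, tid = max((cosine(query, vec), tid) for tid, vec in _VECTORS.items())
    if score < min_score:
        return None
    return tid, TECHNIQUES[tid][0], round(score, 3)


print(map_technique("Attacker stole NTLM hashes via forced SMB authentication"))
print(map_technique("Users received a phishing email with a malicious attachment"))
```

Returning `None` below the cutoff keeps unrelated text from being forced onto a technique, which follows the same "don't act on low-confidence extractions" principle used for IOCs.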



Step 7: Building an NER Pipeline
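
A pipeline in this sense chains several extractors over the same text and merges their results. The sketch below shows one possible shape, assuming each extractor is a function from text to a list of entities; the `Entity` dataclass, the `NERPipeline` class, and the overlap rule (earliest span wins, then longest) are design choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    text: str
    start: int
    end: int


def regex_extractor(label, pattern):
    """Build an extractor that tags every match of pattern with label."""
    compiled = re.compile(pattern)

    def extract(text):
        return [Entity(label, m.group(0), m.start(), m.end()) for m in compiled.finditer(text)]

    return extract


def lookup_extractor(label, terms):
    """Build an extractor from a list of known terms (longest first)."""
    ordered = sorted(terms, key=len, reverse=True)
    return regex_extractor(label, r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")


class NERPipeline:
    def __init__(self, extractors):
        self.extractors = list(extractors)

    def run(self, text):
        candidates = [e for extract in self.extractors for e in extract(text)]
        # Earliest span first; among spans starting together, longest first.
        candidates.sort(key=lambda e: (e.start, -(e.end - e.start)))
        kept, last_end = [], 0
        for ent in candidates:
            if ent.start >= last_end:  # skip anything overlapping a kept span
                kept.append(ent)
                last_end = ent.end
        return kept


pipeline = NERPipeline([
    regex_extractor("CVE", r"\bCVE-\d{4}-\d{4,}\b"),
    regex_extractor("MITRE_TECHNIQUE", r"\bT\d{4}(?:\.\d{3})?\b"),
    lookup_extractor("THREAT_ACTOR", ["APT28", "Fancy Bear"]),
    lookup_extractor("PRODUCT", ["Microsoft Outlook", "Outlook"]),
])

report = (
    "APT28 exploited CVE-2023-23397 in Microsoft Outlook to steal NTLM hashes "
    "from victims at 192.168.1.45 using a technique mapped to T1187 in MITRE ATT&CK."
)
result = pipeline.run(report)
for ent in result:
    print(f"{ent.label:16} {ent.text}")
```

Because every stage has the same signature, a trained model can later be dropped in as one more extractor without touching the merge logic.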



Step 8: Real-World Capstone — Threat Intelligence Enrichment Engine


💡 This engine, connected to a SIEM or ticketing system, auto-enriches every incoming alert with extracted entities, predicted severity, and MITRE mapping — saving Level 1 analysts 10–15 minutes per incident.
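
The enrichment step can be pictured as a function that takes an alert record and returns it with extra fields attached. The sketch below assumes alerts are dictionaries with a `description` field and uses a toy keyword-weighted severity score; the field names, keyword weights, and thresholds are all hypothetical and would need tuning against real incident data.

```python
import json
import re

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
TECHNIQUE_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

# Hypothetical keyword weights for a toy severity score.
SEVERITY_KEYWORDS = {"ransomware": 3, "exploited": 2, "steal": 2, "phishing": 1}


def predict_severity(text, entities):
    """Map keyword hits and CVE count onto a coarse severity label."""
    lowered = text.lower()
    score = sum(weight for kw, weight in SEVERITY_KEYWORDS.items() if kw in lowered)
    score += 2 * len(entities["cves"])
    if score >= 5:
        return "critical"
    if score >= 3:
        return "high"
    if score >= 1:
        return "medium"
    return "low"


def enrich_alert(alert):
    """Return a copy of alert with extracted entities and severity added."""
    text = alert["description"]
    entities = {
        "cves": sorted(set(CVE_RE.findall(text))),
        "mitre_techniques": sorted(set(TECHNIQUE_RE.findall(text))),
    }
    return {**alert, "entities": entities, "severity": predict_severity(text, entities)}


alert = {
    "id": "A-1",  # hypothetical alert identifier
    "description": (
        "APT28 exploited CVE-2023-23397 in Microsoft Outlook to steal NTLM hashes "
        "from victims at 192.168.1.45 using a technique mapped to T1187 in MITRE ATT&CK."
    ),
}
enriched = enrich_alert(alert)
print(json.dumps(enriched, indent=2))
```

Returning a new dictionary rather than mutating the input keeps the original alert intact for auditing, and the JSON output can be posted to a SIEM or ticketing API as-is.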


Summary

Technique            Best For                                Pros
Regex patterns       CVEs, IPs, hashes, URLs                 100% precision for exact formats
Lookup tables        Known threat actors, malware families   Fast, no training data
TF-IDF + LogReg      Category/technique classification       Lightweight, interpretable
Relation extraction  Actor→CVE→product triples               Builds knowledge graphs

Key Takeaways:

  • Always start with regex for structured IOCs (CVE, IP, hash)

  • ML classification works well for MITRE technique mapping

  • Relation extraction enables knowledge graph construction

  • Confidence scores are essential — don't act on low-confidence extractions

