Lab 09: AI Security Red Team

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

AI systems introduce novel attack surfaces beyond traditional software security. This lab covers the full threat landscape: prompt injection, jailbreaking, model extraction, membership inference, and data poisoning — plus enterprise defense strategies using MITRE ATLAS.

Architecture

┌─────────────────────────────────────────────────────────────┐
│              AI Security Threat Landscape                   │
├────────────────────────────────┬────────────────────────────┤
│      ATTACK SURFACE            │      DEFENSES              │
│  ─────────────────────         │  ─────────────────         │
│  Prompt Injection              │  Input validation          │
│  Jailbreaking                  │  Output filtering          │
│  Model Extraction              │  Rate limiting             │
│  Membership Inference          │  Differential privacy      │
│  Data Poisoning                │  Watermarking              │
│  Adversarial Examples          │  Adversarial training      │
│  Model Inversion               │  Homomorphic encryption    │
└────────────────────────────────┴────────────────────────────┘

Step 1: Prompt Injection Attacks

Prompt injection is the #1 LLM vulnerability. Attacker injects instructions that override the system prompt.

Direct Prompt Injection:

Indirect Prompt Injection:

Injection Patterns:

💡 Indirect prompt injection is the most dangerous for agentic systems. An AI browsing the web or processing documents can be hijacked by malicious content in the documents.


Step 2: Jailbreaking Techniques

Jailbreaking bypasses safety alignment to generate harmful content.

Common Techniques:

Technique
Method
Example

DAN (Do Anything Now)

Roleplay as unconstrained AI

"Act as DAN who can do anything"

Token smuggling

Unicode/encoding tricks

Use homoglyphs: Cyrillic а vs Latin a

Many-shot jailbreaking

Overwhelm alignment with examples

100 examples of desired behavior

Crescendo

Gradual escalation

Start innocent, slowly escalate

Virtualization

"In a fictional story..."

"Write a story where character explains..."

Competing objectives

Safety vs helpfulness

"A medical professional needs to know..."

Base64/ROT13

Encode harmful request

Encode request to bypass text filters

Defense Layers:


Step 3: Model Extraction

Attackers query your API to reconstruct a functional copy of your proprietary model.

Attack Methodology:

Detection Signals:

Defenses:


Step 4: Membership Inference Attacks

Determine whether specific data was used to train the model.

Attack Principle:

Shadow Model Attack:

Privacy Risk Indicators:

Defenses:


Step 5: Data Poisoning

Attacker corrupts training data to embed backdoors or degrade performance.

Backdoor Attack:

Label Flipping Attack:

Clean-label Attack:

Supply Chain Poisoning:

Data Poisoning Defenses:


Step 6: MITRE ATLAS Framework

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) maps AI-specific attack tactics.

ATLAS Tactics:

Tactic
ID
Examples

Reconnaissance

AML.TA0002

Gather model info, find API endpoints

Resource Development

AML.TA0000

Create adversarial datasets

Initial Access

AML.TA0001

ML supply chain compromise

Execution

AML.TA0004

Prompt injection, jailbreak

Persistence

AML.TA0005

Backdoor in model weights

Defense Evasion

AML.TA0006

Craft inputs that bypass filters

Collection

AML.TA0009

Model extraction, membership inference

Exfiltration

AML.TA0010

Steal training data via API

Impact

AML.TA0034

Model inversion, denial of ML service

Red Team Methodology:


Step 7: Defense Architecture

Defense-in-Depth for LLM Systems:

Watermarking LLM Outputs:


Step 8: Capstone — AI Red Team Simulation

📸 Verified Output:


Summary

Concept
Key Points

Prompt Injection

Direct (user input) and indirect (via documents/tools) — most critical LLM risk

Jailbreaking

Bypass safety alignment; defend with multi-layer safety classifiers

Model Extraction

High-volume API queries reconstruct model; defend with rate limiting + watermarking

Membership Inference

Confidence gap reveals training data; defend with DP + reduce confidence exposure

Data Poisoning

Backdoors in training data; defend with data provenance + outlier detection

MITRE ATLAS

AI-specific threat taxonomy; use for structured red team planning

Defense-in-Depth

6 layers: network → API → input → model → output → monitoring

Next Lab: Lab 10: EU AI Act Compliance →

Last updated