Lab 04: LangChain & Vector Databases — RAG at Scale
Objective
Background
Basic RAG (Practitioner lab 14): embed → store → retrieve → generate
Production RAG (this lab): chunk strategy → embed → hybrid index →
multi-query retrieval → re-rank →
contextual compression → generate → evaluate
Step 1: Environment and Document Corpus
docker run -it --rm zchencow/innozverse-ai:latest bash
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
import warnings; warnings.filterwarnings('ignore')
np.random.seed(42)
# Security knowledge base: 10 CVE-style seed documents (expanded to 500 below)
CVE_CORPUS = [
{"id": "CVE-2024-0001", "title": "SQL Injection in Authentication Module",
"text": "A SQL injection vulnerability exists in the authentication module allowing unauthenticated attackers to bypass login and extract password hashes. Affects versions < 2.3.1. CVSS: 9.8 CRITICAL. Fix: use parameterised queries, prepared statements.",
"severity": "CRITICAL", "type": "injection"},
{"id": "CVE-2024-0002", "title": "Cross-Site Scripting in User Profile",
"text": "Reflected XSS vulnerability in user profile page. Attacker can inject malicious JavaScript via name parameter. Leads to session hijacking, credential theft. Affected: all versions. CVSS: 7.4 HIGH. Fix: output encoding, CSP headers.",
"severity": "HIGH", "type": "xss"},
{"id": "CVE-2024-0003", "title": "Remote Code Execution via File Upload",
"text": "Unrestricted file upload allows execution of arbitrary server-side code. MIME type validation bypassed via double extension trick (shell.php.jpg). CVSS: 9.9 CRITICAL. Fix: allowlist extensions, rename uploads, serve from separate domain.",
"severity": "CRITICAL", "type": "rce"},
{"id": "CVE-2024-0004", "title": "JWT Algorithm Confusion Attack",
"text": "JWT tokens accepted with 'none' algorithm and RS256-to-HS256 confusion. Attacker can forge arbitrary tokens. CVSS: 9.1 CRITICAL. Fix: explicitly validate algorithm in jwt.decode(), reject 'none'.",
"severity": "CRITICAL", "type": "auth"},
{"id": "CVE-2024-0005", "title": "SSRF via PDF Generation Service",
"text": "Server-Side Request Forgery in PDF rendering endpoint. Attacker can access internal services, AWS metadata endpoint (169.254.169.254), scan internal network. CVSS: 8.6 HIGH. Fix: allowlist URLs, block RFC-1918 ranges.",
"severity": "HIGH", "type": "ssrf"},
{"id": "CVE-2024-0006", "title": "Path Traversal in File Download",
"text": "Directory traversal vulnerability allows reading arbitrary files via ../../../etc/passwd patterns. CVSS: 7.5 HIGH. Fix: resolve realpath, validate stays within base directory.",
"severity": "HIGH", "type": "traversal"},
{"id": "CVE-2024-0007", "title": "Insecure Deserialization in Session Handler",
"text": "Python pickle deserialization of untrusted user-supplied data allows RCE. Session cookies base64-decoded and unpickled without validation. CVSS: 9.8 CRITICAL. Fix: use JSON serialisation, never pickle user data.",
"severity": "CRITICAL", "type": "deserialization"},
{"id": "CVE-2024-0008", "title": "Broken Access Control in API",
"text": "IDOR vulnerability: /api/users/{id}/data returns other users' data without authorisation check. Horizontal privilege escalation. CVSS: 8.1 HIGH. Fix: verify ownership on every request, use indirect references.",
"severity": "HIGH", "type": "bac"},
{"id": "CVE-2024-0009", "title": "Race Condition in Balance Transfer",
"text": "TOCTOU race condition in payment transfer: balance checked and debited in separate transactions. Concurrent requests allow transferring more than available balance. CVSS: 7.8 HIGH. Fix: database-level atomic transactions, row locking.",
"severity": "HIGH", "type": "race"},
{"id": "CVE-2024-0010", "title": "NoSQL Injection in MongoDB Query",
"text": "MongoDB operator injection via JSON body: {\"username\": {\"$ne\": null}} bypasses authentication. All users exposed. CVSS: 9.4 CRITICAL. Fix: schema validation, reject operator keys in user input.",
"severity": "CRITICAL", "type": "injection"},
]
# Expand corpus with variations for realistic scale
expanded_corpus = []
for i, doc in enumerate(CVE_CORPUS * 50):  # 10 seed docs x 50 = 500 docs
    d = doc.copy()
    d['id'] = f"CVE-2024-{i+1:04d}"  # give every copy a unique sequential ID
    d['text'] = d['text'] + f" Reference: {d['id']}."
    expanded_corpus.append(d)
print(f"Corpus loaded: {len(expanded_corpus)} security documents")
print(f"Types: {set(d['type'] for d in expanded_corpus)}")
Step 2: Chunking Strategies
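The chunking stage of the pipeline can be sketched with two common strategies: fixed-size character windows with overlap, and sentence-aware packing. This is a minimal illustration under stated assumptions, not the lab's own implementation; the function names `chunk_fixed` and `chunk_sentences`, their parameters, and the default sizes are all hypothetical.

```python
import re

def chunk_fixed(text, size=120, overlap=30):
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop before a window that would lie entirely inside the previous overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]

def chunk_sentences(text, max_chars=160):
    """Greedily pack whole sentences into chunks of at most `max_chars`.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Sample text taken from the path traversal entry in the corpus above.
sample = ("Directory traversal vulnerability allows reading arbitrary files via "
          "../../../etc/passwd patterns. CVSS: 7.5 HIGH. "
          "Fix: resolve realpath, validate stays within base directory.")

fixed_chunks = chunk_fixed(sample, size=120, overlap=30)
sentence_chunks = chunk_sentences(sample, max_chars=160)
for label, chunks in (("fixed", fixed_chunks), ("sentence", sentence_chunks)):
    print(label, [len(c) for c in chunks])
```

Fixed windows are simple and give predictable chunk sizes, but they can cut a fix recommendation in half. Sentence packing keeps each statement intact at the cost of uneven chunk lengths.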
Step 3: Vector Store with Hybrid Search
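Hybrid search fuses a sparse keyword score with a dense vector similarity. The imports in Step 1 suggest the lab builds this from TF-IDF and truncated SVD; the sketch below instead uses standard-library stand-ins so it runs anywhere. The character-trigram vector is an assumption standing in for a learned embedding, and `DOCS`, `hybrid_search`, and the `alpha` weight are hypothetical names.

```python
import math
import re
from collections import Counter

# Tiny illustrative corpus (paraphrased from the CVE entries above).
DOCS = {
    "sqli": "SQL injection in authentication module; fix with parameterised queries.",
    "xss": "Reflected XSS in user profile page; fix with output encoding and CSP headers.",
    "ssrf": "Server-Side Request Forgery in PDF rendering endpoint reaches internal services.",
    "nosql": "MongoDB operator injection via JSON body bypasses authentication.",
}

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_scores(query, docs):
    """Sparse side: TF-IDF weighted overlap between query terms and each doc."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(tokens(text)))
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(tokens(text))
        scores[doc_id] = sum(
            tf[t] * math.log(1 + n / df[t]) for t in set(tokens(query)) if t in tf
        )
    return scores

def trigram_vector(text):
    """Dense stand-in: character trigram counts (NOT a learned embedding)."""
    s = " ".join(tokens(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_scores(query, docs):
    q = trigram_vector(query)
    return {doc_id: cosine(q, trigram_vector(text)) for doc_id, text in docs.items()}

def min_max(scores):
    """Rescale scores to [0, 1] so the two sides are comparable before fusion."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_search(query, docs, alpha=0.5, k=3):
    """Weighted sum of normalised scores; alpha is the weight of the dense side."""
    sparse = min_max(keyword_scores(query, docs))
    dense = min_max(vector_scores(query, docs))
    fused = {d: alpha * dense[d] + (1 - alpha) * sparse[d] for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]

results = hybrid_search("sql injection parameterised queries", DOCS)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Setting `alpha=0` reduces this to pure keyword search and `alpha=1` to pure vector search, which makes it easy to compare the two sides on the same queries.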
Step 4: Re-Ranking with Cross-Encoder
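Re-ranking is a second stage: a cheap retriever proposes candidates, then a more careful scorer looks at each (query, document) pair together and reorders them. A real cross-encoder is a transformer model; the sketch below swaps in a lexical `pair_score` function as a hypothetical stand-in so the two-stage flow can be shown without model downloads. The scorer, its 0.7/0.3 weights, and the candidate texts are assumptions for illustration only.

```python
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def pair_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: scores the pair jointly.

    Combines query-term coverage with a bonus for query bigrams that
    appear in the same order in the document.
    """
    q, d = tokens(query), tokens(doc)
    if not q or not d:
        return 0.0
    coverage = len(set(q) & set(d)) / len(set(q))
    doc_bigrams = set(zip(d, d[1:]))
    query_bigrams = list(zip(q, q[1:]))
    if query_bigrams:
        phrase = sum(b in doc_bigrams for b in query_bigrams) / len(query_bigrams)
    else:
        phrase = 0.0
    return 0.7 * coverage + 0.3 * phrase

def rerank(query, candidates, scorer=pair_score, top_n=2):
    """Re-score first-stage candidates (id, text) and keep the best top_n."""
    scored = [(doc_id, scorer(query, text)) for doc_id, text in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

# Pretend these came back from the first-stage retriever, in its order.
candidates = [
    ("nosql", "MongoDB operator injection via JSON body bypasses authentication."),
    ("upload", "Unrestricted file upload allows execution of arbitrary server-side code."),
    ("pickle", "Python pickle deserialization of untrusted data allows remote code execution."),
]

top = rerank("remote code execution", candidates)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

Because `rerank` accepts any `scorer(query, text)` callable, a model-backed scorer could be passed in later without changing the surrounding code.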
Step 5: Contextual Compression
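Contextual compression trims each retrieved document down to the parts that bear on the query, so the generator receives less noise. A minimal sketch, assuming a purely extractive approach that keeps the sentences sharing the most terms with the query; the `compress` function and its parameters are hypothetical, and production systems often use a model to do the filtering instead.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def compress(query, document, min_overlap=1, max_sentences=2):
    """Keep the sentences that share the most terms with the query.

    Selected sentences are returned in their original order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_terms = tokens(query)
    scored = [(len(query_terms & tokens(s)), i, s) for i, s in enumerate(sentences)]
    relevant = [item for item in scored if item[0] >= min_overlap]
    # Best overlap first; ties broken by position in the document.
    best = sorted(relevant, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(best, key=lambda item: item[1]))

# Text of the JWT entry from the corpus above.
doc = ("JWT tokens accepted with 'none' algorithm and RS256-to-HS256 confusion. "
       "Attacker can forge arbitrary tokens. CVSS: 9.1 CRITICAL. "
       "Fix: explicitly validate algorithm in jwt.decode(), reject 'none'.")

compressed = compress("how to fix jwt algorithm confusion", doc)
print(compressed)
print(f"{len(doc)} chars -> {len(compressed)} chars")
```

Simple term overlap also counts stop words such as "to", so a stop-word list or IDF weighting would be a natural refinement.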
Step 6: RAG Evaluation — RAGAS Metrics
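RAGAS-style evaluation scores both the retrieval side (did the right context come back?) and the generation side (is the answer grounded in that context?). The RAGAS library computes its metrics with LLM judges; the sketch below does not call that library and only shows simplified, label-based and token-based proxies. All three function names are hypothetical, and the labelled relevant set is an assumption made for the example.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(r in relevant_ids for r in retrieved_ids) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Share of labelled-relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(set(relevant_ids))

def faithfulness_proxy(answer, contexts):
    """Share of answer tokens that appear somewhere in the retrieved context.

    A crude lexical proxy: it cannot detect a paraphrased hallucination.
    """
    answer_terms = tokens(answer)
    if not answer_terms:
        return 0.0
    context_terms = set().union(*(tokens(c) for c in contexts))
    return len(answer_terms & context_terms) / len(answer_terms)

# Assumed labels: both injection entries are relevant to an injection query.
retrieved = ["CVE-2024-0001", "CVE-2024-0010", "CVE-2024-0002"]
relevant = {"CVE-2024-0001", "CVE-2024-0010"}
contexts = ["Fix: use parameterised queries, prepared statements."]
answer = "Use parameterised queries and prepared statements."

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
faithfulness = faithfulness_proxy(answer, contexts)
print(f"precision={precision:.2f} recall={recall:.2f} faithfulness={faithfulness:.2f}")
```

These proxies are cheap enough to run on every change to the pipeline, which makes them useful as a regression check even when a model-judged evaluation is run less often.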
Step 7: LangChain-Style Chain Architecture
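LangChain composes pipeline stages with a pipe operator and runs them with `invoke`. The sketch below imitates that style in plain Python without importing LangChain, to show how the stages of a RAG chain pass state along. The `Step` class, the `KB` dictionary, and the `fake_llm` placeholder are hypothetical; no real model is called.

```python
import re

class Step:
    """Minimal composable unit: wraps a function and supports `a | b` piping."""

    def __init__(self, fn, name=None):
        self.fn = fn
        self.name = name or getattr(fn, "__name__", "step")

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # The composed step feeds this step's output into the next one.
        return Step(lambda value: other.invoke(self.invoke(value)),
                    name=f"{self.name} | {other.name}")

# Tiny illustrative knowledge base (paraphrased from the corpus above).
KB = {
    "traversal": "Path traversal: resolve realpath and validate the path stays within the base directory.",
    "ssrf": "SSRF: allowlist URLs and block RFC-1918 ranges.",
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(state):
    """Attach the KB entry with the largest term overlap with the question."""
    question_terms = tokens(state["question"])
    best = max(KB, key=lambda key: len(question_terms & tokens(KB[key])))
    return {**state, "context": [KB[best]]}

def build_prompt(state):
    context = "\n".join(state["context"])
    prompt = f"Context:\n{context}\n\nQuestion: {state['question']}"
    return {**state, "prompt": prompt}

def fake_llm(state):
    """Placeholder generator: echoes the context instead of calling a model."""
    return {**state, "answer": f"Based on the context: {state['context'][0]}"}

chain = Step(retrieve) | Step(build_prompt) | Step(fake_llm)
result = chain.invoke({"question": "How do I prevent path traversal?"})
print(chain.name)
print(result["answer"])
```

Passing a dictionary of state between steps keeps each stage independently testable, and a re-ranking or compression step can be spliced into the chain with one more `|`.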
Step 8: Capstone — Production Security Knowledge Assistant
Summary
Component | Naive RAG | Production RAG
Further Reading
