Lab 04: LangChain & Vector Databases — RAG at Scale

Objective

Build production RAG pipelines using LangChain patterns: document chunking strategies, embedding models, vector store operations, retrieval chains, re-ranking, and hybrid search — applied to a security knowledge base with 500+ CVE documents.

Time: 55 minutes | Level: Advanced | Docker Image: zchencow/innozverse-ai:latest


Background

Basic RAG (Practitioner lab 14):   embed → store → retrieve → generate
Production RAG (this lab):         chunk strategy → embed → hybrid index →
                                   multi-query retrieval → re-rank → 
                                   contextual compression → generate → evaluate

The difference between demo RAG and production RAG lies mostly in retrieval quality. A well-tuned retriever can add 20–40% accuracy over naive top-k cosine search.
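To make that contrast concrete, here is a minimal sketch of dense-only cosine search next to a hybrid (dense + sparse) score. It assumes numpy and scikit-learn are available, uses TF-IDF reduced with SVD as a stand-in for a real embedding model, and blends the two scores with a simple linear weight. The six-document corpus, the `search` function, and the `alpha` parameter are illustrative names, not part of the lab's code.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative corpus (hypothetical; not the lab's 500-document set).
docs = [
    "SQL injection in the authentication module lets attackers bypass login.",
    "Reflected XSS in the user profile page enables session hijacking.",
    "Unrestricted file upload allows remote code execution on the server.",
    "JWT tokens accepted with the none algorithm let attackers forge tokens.",
    "SSRF in the PDF rendering endpoint exposes the AWS metadata service.",
    "MongoDB operator injection in the JSON body bypasses authentication.",
]

# Sparse index: TF-IDF term weights (exact keyword matching).
vectorizer = TfidfVectorizer()
sparse_index = vectorizer.fit_transform(docs)

# Dense index: TF-IDF reduced with SVD, a stand-in for a learned embedding model.
svd = TruncatedSVD(n_components=4, random_state=42)
dense_index = svd.fit_transform(sparse_index)


def search(query, k=3, alpha=0.5):
    """Return the top-k (doc_index, score) pairs.

    alpha=1.0 is dense-only cosine search, alpha=0.0 is sparse-only,
    and values in between blend the two scores linearly.
    """
    q_sparse = vectorizer.transform([query])
    q_dense = svd.transform(q_sparse)
    sparse_scores = cosine_similarity(q_sparse, sparse_index).ravel()
    dense_scores = cosine_similarity(q_dense, dense_index).ravel()
    scores = alpha * dense_scores + (1 - alpha) * sparse_scores
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


query = "sql injection login bypass"
print("dense only:", search(query, alpha=1.0))
print("hybrid    :", search(query, alpha=0.5))
```

The linear blend is only one way to fuse the two signals; rank-based fusion is a common alternative when the score scales differ.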


Step 1: Environment and Document Corpus

docker run -it --rm zchencow/innozverse-ai:latest bash

Then, in a Python session inside the container:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
import warnings; warnings.filterwarnings('ignore')

np.random.seed(42)

# Seed security knowledge base: 10 CVE-style documents (expanded to 500 below)
CVE_CORPUS = [
    {"id": "CVE-2024-0001", "title": "SQL Injection in Authentication Module",
     "text": "A SQL injection vulnerability exists in the authentication module allowing unauthenticated attackers to bypass login and extract password hashes. Affects versions < 2.3.1. CVSS: 9.8 CRITICAL. Fix: use parameterised queries, prepared statements.",
     "severity": "CRITICAL", "type": "injection"},
    {"id": "CVE-2024-0002", "title": "Cross-Site Scripting in User Profile",
     "text": "Reflected XSS vulnerability in user profile page. Attacker can inject malicious JavaScript via name parameter. Leads to session hijacking, credential theft. Affected: all versions. CVSS: 7.4 HIGH. Fix: output encoding, CSP headers.",
     "severity": "HIGH", "type": "xss"},
    {"id": "CVE-2024-0003", "title": "Remote Code Execution via File Upload",
     "text": "Unrestricted file upload allows execution of arbitrary server-side code. MIME type validation bypassed via double extension trick (shell.php.jpg). CVSS: 9.9 CRITICAL. Fix: allowlist extensions, rename uploads, serve from separate domain.",
     "severity": "CRITICAL", "type": "rce"},
    {"id": "CVE-2024-0004", "title": "JWT Algorithm Confusion Attack",
     "text": "JWT tokens accepted with 'none' algorithm and RS256-to-HS256 confusion. Attacker can forge arbitrary tokens. CVSS: 9.1 CRITICAL. Fix: explicitly validate algorithm in jwt.decode(), reject 'none'.",
     "severity": "CRITICAL", "type": "auth"},
    {"id": "CVE-2024-0005", "title": "SSRF via PDF Generation Service",
     "text": "Server-Side Request Forgery in PDF rendering endpoint. Attacker can access internal services, AWS metadata endpoint (169.254.169.254), scan internal network. CVSS: 8.6 HIGH. Fix: allowlist URLs, block RFC-1918 ranges.",
     "severity": "HIGH", "type": "ssrf"},
    {"id": "CVE-2024-0006", "title": "Path Traversal in File Download",
     "text": "Directory traversal vulnerability allows reading arbitrary files via ../../../etc/passwd patterns. CVSS: 7.5 HIGH. Fix: resolve realpath, validate stays within base directory.",
     "severity": "HIGH", "type": "traversal"},
    {"id": "CVE-2024-0007", "title": "Insecure Deserialization in Session Handler",
     "text": "Python pickle deserialization of untrusted user-supplied data allows RCE. Session cookies base64-decoded and unpickled without validation. CVSS: 9.8 CRITICAL. Fix: use JSON serialisation, never pickle user data.",
     "severity": "CRITICAL", "type": "deserialization"},
    {"id": "CVE-2024-0008", "title": "Broken Access Control in API",
     "text": "IDOR vulnerability: /api/users/{id}/data returns other users' data without authorisation check. Horizontal privilege escalation. CVSS: 8.1 HIGH. Fix: verify ownership on every request, use indirect references.",
     "severity": "HIGH", "type": "bac"},
    {"id": "CVE-2024-0009", "title": "Race Condition in Balance Transfer",
     "text": "TOCTOU race condition in payment transfer: balance checked and debited in separate transactions. Concurrent requests allow transferring more than available balance. CVSS: 7.8 HIGH. Fix: database-level atomic transactions, row locking.",
     "severity": "HIGH", "type": "race"},
    {"id": "CVE-2024-0010", "title": "NoSQL Injection in MongoDB Query",
     "text": "MongoDB operator injection via JSON body: {\"username\": {\"$ne\": null}} bypasses authentication. All users exposed. CVSS: 9.4 CRITICAL. Fix: schema validation, reject operator keys in user input.",
     "severity": "CRITICAL", "type": "injection"},
]

# Expand corpus with variations for realistic scale
expanded_corpus = []
for i, doc in enumerate(CVE_CORPUS * 50):  # 500 docs
    d = doc.copy()
    d['id'] = f"CVE-2024-{i+1:04d}"
    d['text'] = d['text'] + f" Reference: {d['id']}."
    expanded_corpus.append(d)

print(f"Corpus loaded: {len(expanded_corpus)} security documents")
print(f"Types: {set(d['type'] for d in expanded_corpus)}")



Step 2: Chunking Strategies
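A minimal sketch of two chunking strategies, fixed-size windows with overlap and recursive splitting on progressively finer separators. The recursive version is similar in spirit to LangChain's RecursiveCharacterTextSplitter but is not that implementation; the function names, separator list, and size defaults are assumptions chosen for illustration.

```python
def fixed_size_chunks(text, size=120, overlap=20):
    """Slice text into windows of `size` characters that share `overlap` characters."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def recursive_chunks(text, max_len=120, separators=(". ", ", ", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        return fixed_size_chunks(text, max_len, overlap=0)  # last resort: hard cut

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    chunks, current = [], ""
    for i, piece in enumerate(pieces):
        # str.split drops the separator, so re-attach it to every piece but the last.
        if i < len(pieces) - 1:
            piece += sep
        if len(current) + len(piece) <= max_len:
            current += piece  # still fits: keep packing into the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece) > max_len:
            # A single piece is too long: recurse with the finer separators.
            chunks.extend(recursive_chunks(piece, max_len, finer))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = (
    "A SQL injection vulnerability exists in the authentication module allowing "
    "unauthenticated attackers to bypass login and extract password hashes. "
    "Affects versions < 2.3.1. CVSS: 9.8 CRITICAL. "
    "Fix: use parameterised queries, prepared statements."
)

for chunk in recursive_chunks(sample, max_len=120):
    print(len(chunk), repr(chunk))
```

Fixed-size windows can cut a sentence in half, which is why the overlap exists; the recursive version only breaks at sentence, clause, or word boundaries unless a single word exceeds the limit.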


Step 4: Re-Ranking with Cross-Encoder


💡 Re-ranking reorders the candidates — the cross-encoder demotes irrelevant results that happened to match by coincidence. In production: use cross-encoder/ms-marco-MiniLM-L-6-v2 from HuggingFace.
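The re-ranking step described above can be sketched as a function that scores every (query, candidate) pair and re-sorts the candidates. To keep the example runnable offline, it uses a toy term-overlap scorer as a stand-in for a cross-encoder; `rerank`, `overlap_score`, and `toy_score_pairs` are hypothetical names. The commented lines show how the real model could be plugged in, assuming the sentence-transformers package is installed and the model can be downloaded.

```python
import re


def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query terms found in the doc."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def toy_score_pairs(pairs):
    """Score a list of (query, doc) pairs, mirroring a batch predict call."""
    return [overlap_score(q, d) for q, d in pairs]


def rerank(query, candidates, score_pairs, top_n=3):
    """Re-order first-stage candidates by pairwise relevance score."""
    scores = score_pairs([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_n]


# Candidates as a first-stage retriever might return them (illustrative).
candidates = [
    "Reflected XSS in user profile page. Fix: output encoding, CSP headers.",
    "JWT tokens accepted with 'none' algorithm. Attacker can forge arbitrary tokens.",
    "Directory traversal allows reading arbitrary files.",
]
query = "attacker can forge JWT tokens"

top = rerank(query, candidates, toy_score_pairs, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")

# With a real cross-encoder (assumption: sentence-transformers is installed):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top = rerank(query, candidates, model.predict, top_n=2)
```

Passing the scorer in as an argument keeps the re-ranking logic independent of the model, so the toy scorer can be swapped out without touching `rerank`.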


Step 5: Contextual Compression
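A minimal sketch of contextual compression: after retrieval, keep only the sentences of each document that relate to the query, so the generator sees less noise. Production systems often use an LLM or an embedding filter for this; the version below is a lexical approximation, and `compress`, `tokenize`, the stopword list, and the `min_overlap` parameter are assumptions made for illustration.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "how", "do", "i", "is", "in", "of", "to", "what", "and"}


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def compress(query, document, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` terms with the query."""
    q = tokenize(query)
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(q & tokenize(s)) >= min_overlap]
    return " ".join(kept)


document = (
    "JWT tokens accepted with 'none' algorithm and RS256-to-HS256 confusion. "
    "Attacker can forge arbitrary tokens. CVSS: 9.1 CRITICAL. "
    "Fix: explicitly validate algorithm in jwt.decode(), reject 'none'."
)
query = "How do I fix the JWT vulnerability?"

compressed = compress(query, document)
print(compressed)
print(f"{len(document)} -> {len(compressed)} characters")
```

A lexical filter like this drops sentences that are relevant but phrased differently from the query, which is the main reason to move to a semantic filter once the pipeline works end to end.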



Step 6: RAG Evaluation — RAGAS Metrics
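RAGAS reports metrics such as faithfulness, context precision, and context recall, and the library computes them with LLM judgments. The sketch below is not the RAGAS library: it uses simple set and token-overlap proxies with the same names so the ideas can be run without a model. The function definitions and the example data are assumptions for illustration.

```python
import re


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def context_precision(retrieved_ids, relevant_ids):
    """Proxy: share of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for d in retrieved_ids if d in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Proxy: share of relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in relevant_ids if d in retrieved_ids)
    return hits / len(relevant_ids)


def faithfulness(answer, contexts):
    """Proxy: share of answer tokens that also appear in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(_tokens(" ".join(contexts)))
    supported = sum(1 for t in answer_tokens if t in context_vocab)
    return supported / len(answer_tokens)


# Illustrative evaluation case (hand-labelled relevance is an assumption).
retrieved = ["CVE-2024-0001", "CVE-2024-0010", "CVE-2024-0002"]
relevant = {"CVE-2024-0001", "CVE-2024-0010"}
contexts = ["Fix: use parameterised queries, prepared statements."]

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
grounded = faithfulness("Use parameterised queries.", contexts)
ungrounded = faithfulness("Use parameterised queries and rotate keys.", contexts)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"faithfulness: grounded={grounded:.2f} ungrounded={ungrounded:.2f}")
```

Token overlap rewards copying and penalises paraphrase, so these numbers are useful for regression checks on a fixed test set rather than as absolute quality scores.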



Step 7: LangChain-Style Chain Architecture
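LangChain's expression language composes pipeline stages with the `|` operator. The sketch below imitates that pattern in plain Python so the architecture is visible without installing LangChain. The `Step` class, the two-entry knowledge base, and the `fake_llm` placeholder are all hypothetical; a real chain would call an actual retriever and model.

```python
class Step:
    """A pipeline stage wrapping a function; `a | b` runs a, then feeds its output to b."""

    def __init__(self, fn, name=""):
        self.fn = fn
        self.name = name or getattr(fn, "__name__", "step")

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        return Step(
            lambda v: other.invoke(self.invoke(v)),
            name=f"{self.name} | {other.name}",
        )


# Hypothetical two-entry knowledge base keyed by topic.
KB = {
    "jwt": "Fix: explicitly validate algorithm in jwt.decode(), reject 'none'.",
    "ssrf": "Fix: allowlist URLs, block RFC-1918 ranges.",
}


def retrieve(question):
    """Keyword lookup standing in for a vector-store retriever."""
    hits = [text for key, text in KB.items() if key in question.lower()]
    return {"question": question, "context": hits}


def build_prompt(state):
    context = "\n".join(state["context"]) or "(no context found)"
    return f"Context:\n{context}\n\nQuestion: {state['question']}\nAnswer:"


def fake_llm(prompt):
    """Placeholder for a model call: echoes the first context line of the prompt."""
    return prompt.splitlines()[1]


chain = Step(retrieve) | Step(build_prompt) | Step(fake_llm)
answer = chain.invoke("How do I fix the JWT issue?")
print(chain.name)
print(answer)
```

Because every stage exposes the same `invoke` interface, a re-ranker or compressor can be inserted between `retrieve` and `build_prompt` without changing the other stages.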



Step 8: Capstone — Production Security Knowledge Assistant
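A minimal sketch of the assistant's outer shell, showing two production features: a query-level cache and bounded session memory. Retrieval is reduced to a keyword lookup so the example stays self-contained. `SecurityAssistant`, `keyword_retriever`, and the two-entry `KNOWLEDGE` dictionary are hypothetical names; a full capstone would plug in the hybrid retriever, re-ranker, and compressor in place of the lookup.

```python
from collections import deque

# Illustrative knowledge base; the fix text is taken from the lab's CVE corpus.
KNOWLEDGE = {
    "ssrf": "Fix: allowlist URLs, block RFC-1918 ranges.",
    "path traversal": "Fix: resolve realpath, validate stays within base directory.",
}


def keyword_retriever(question):
    """Stand-in retriever: return entries whose key appears in the question."""
    q = question.lower()
    return [text for key, text in KNOWLEDGE.items() if key in q]


class SecurityAssistant:
    def __init__(self, retriever, max_history=5):
        self.retriever = retriever
        self.cache = {}                           # query-level cache
        self.history = deque(maxlen=max_history)  # bounded session memory
        self.cache_hits = 0

    @staticmethod
    def _normalise(question):
        """Lowercase and collapse whitespace so trivially different queries share a key."""
        return " ".join(question.lower().split())

    def ask(self, question):
        key = self._normalise(question)
        if key in self.cache:
            self.cache_hits += 1
            answer = self.cache[key]
        else:
            docs = self.retriever(question)
            answer = " ".join(docs) if docs else "No matching advisory found."
            self.cache[key] = answer
        self.history.append((question, answer))
        return answer


assistant = SecurityAssistant(keyword_retriever)
first = assistant.ask("How do I mitigate SSRF?")
second = assistant.ask("  how do I mitigate   ssrf? ")
print(first)
print("cache hits:", assistant.cache_hits, "| turns remembered:", len(assistant.history))
```

Normalising the cache key is a design choice: it raises the hit rate for repeated questions, at the cost of treating queries that differ only in case or spacing as identical.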



Summary

| Component  | Naive RAG     | Production RAG           |
|------------|---------------|--------------------------|
| Retrieval  | Cosine search | Hybrid (dense + sparse)  |
| Ranking    | None          | Cross-encoder re-ranking |
| Context    | Full chunks   | Contextual compression   |
| Chunking   | Fixed-size    | Recursive / semantic     |
| Evaluation | None          | RAGAS metrics            |
| Memory     | None          | Session context          |
| Caching    | None          | Query-level cache        |
