Lab 05: RAG at Scale

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

Retrieval-Augmented Generation (RAG) extends LLMs with enterprise knowledge bases. This lab covers the complete RAG architecture for production scale: document ingestion, chunking strategies, hybrid search, re-ranking, and evaluation frameworks.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     RAG at Scale Architecture                   │
├──────────────────────────────┬──────────────────────────────────┤
│     INGESTION PIPELINE       │         QUERY PIPELINE           │
│  Documents → Parse           │  Query → Embedding               │
│  → Chunk (strategy)          │  → BM25 sparse search            │
│  → Embed (model)             │  → Vector dense search           │
│  → Index (vector DB)         │  → RRF fusion                    │
│  → BM25 index                │  → Re-ranker (cross-encoder)     │
│                              │  → Context compression           │
│                              │  → LLM → Answer                  │
└──────────────────────────────┴──────────────────────────────────┘

Step 1: Document Ingestion Pipeline

Processing Flow:

Document Parsing Challenges:

  • PDF: tables, columns, headers — use PyMuPDF or Unstructured.io

  • HTML: navigation/boilerplate noise — use trafilatura

  • DOCX: style-based structure — python-docx

  • Code: language-aware splitting — tree-sitter

Metadata Strategy (Critical for Production):

💡 Metadata is your query-time filter. Without source/date metadata, you can't answer: "What does our Q1 2024 policy say about X?" — you'll get mixed results from all dates.
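The metadata point above can be sketched in a few lines. This is an illustrative sketch only: the `Chunk` class and the field names (`source`, `doc_date`) are assumptions for the example, not a fixed schema.

```python
# Sketch: stamp every chunk with filterable metadata at ingestion time.
# Field names (source, doc_date) are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def ingest(doc_text: str, source: str, doc_date: str) -> list[Chunk]:
    """Split on blank lines and attach query-time filter metadata."""
    parts = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    return [Chunk(p, {"source": source, "doc_date": doc_date}) for p in parts]

chunks = ingest("Policy A.\n\nPolicy B.",
                source="q1_policy.pdf", doc_date="2024-03-31")

# Query-time filter: answer "Q1 2024 policy" questions from Q1 docs only.
q1_chunks = [c for c in chunks if c.metadata["doc_date"].startswith("2024-0")]
```

Without the `doc_date` filter, the retriever would mix chunks from every policy revision into one result set.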


Step 2: Chunking Strategies

Chunking is the most impactful RAG parameter. Bad chunking = bad retrieval.

Fixed-Size Chunking:
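A minimal fixed-size chunker with overlap might look like the sketch below. It splits on words as a stand-in for tokens; a real pipeline would count tokenizer tokens (e.g. with tiktoken).

```python
# Fixed-size chunking sketch: word counts stand in for token counts.
def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap          # advance by size minus overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# 1200 "words" -> 3 chunks of up to 512, each sharing 50 words with the last
chunks = fixed_chunks(" ".join(str(i) for i in range(1200)),
                      size=512, overlap=50)
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in the neighboring chunk.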

Recursive Character Chunking (LangChain default):
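The idea behind recursive splitting (LangChain's `RecursiveCharacterTextSplitter` implements a production version) is: try coarse separators first, recurse into pieces that are still too long, then merge small pieces back up to the size limit. A stripped-down, dependency-free sketch of that idea:

```python
# Recursive character splitting sketch: paragraph breaks first, then lines,
# then words; small fragments are merged back up toward max_len.
def _merge(pieces: list[str], max_len: int, joiner: str) -> list[str]:
    out, buf = [], ""
    for p in pieces:
        cand = (buf + joiner + p) if buf else p
        if len(cand) <= max_len:
            buf = cand
        else:
            if buf:
                out.append(buf)
            buf = p
    if buf:
        out.append(buf)
    return out

def recursive_split(text: str, max_len: int = 200,
                    seps: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:                       # no separators left: hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = seps
    pieces: list[str] = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_len, tuple(rest)))
    return _merge(pieces, max_len, head)

doc = ("Intro paragraph. " * 5) + "\n\n" + ("Body paragraph. " * 30)
chunks = recursive_split(doc, max_len=200)
```

Because paragraph breaks are tried before word breaks, chunks tend to end at natural boundaries instead of mid-sentence.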

Semantic Chunking:
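Semantic chunking breaks wherever the similarity between adjacent sentences drops. The sketch below uses a bag-of-words cosine as a toy stand-in for real sentence embeddings; the threshold value is an illustrative assumption.

```python
# Toy semantic chunker: break where adjacent-sentence similarity drops.
# Bag-of-words vectors stand in for a real embedding model.
import math
import re
from collections import Counter

def _vec(s: str) -> Counter:
    return Counter(re.findall(r"\w+", s.lower()))

def _cos(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, buf = [], [sents[0]]
    for prev, cur in zip(sents, sents[1:]):
        if _cos(_vec(prev), _vec(cur)) < threshold:
            chunks.append(" ".join(buf))   # similarity dropped: new chunk
            buf = [cur]
        else:
            buf.append(cur)
    chunks.append(" ".join(buf))
    return chunks

doc = ("Cats purr. Cats sleep a lot. "
       "Interest rates rose. Interest payments increased.")
chunks = semantic_chunks(doc)
```

The topic shift between the cat sentences and the finance sentences produces exactly one break, so each chunk is topically coherent.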

Hierarchical Chunking (Best for Complex Docs):

| Strategy | Chunk Size | Overlap (tokens) | Best For |
| --- | --- | --- | --- |
| Fixed | 512 tokens | 50 | Quick start, homogeneous content |
| Recursive | 256-1024 | 20 | General documents |
| Semantic | Variable | 0 | High-quality retrieval |
| Hierarchical | Multi-level | 0 | Long documents, complex queries |


Step 3: Embedding Models

Embedding Model Comparison:

| Model | Dimension | Context | MTEB Score | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | 8191 | 62.3 | $0.02/1M | Cost-efficient |
| text-embedding-3-large | 3072 | 8191 | 64.6 | $0.13/1M | Highest quality |
| all-MiniLM-L6-v2 | 384 | 256 | 56.3 | Free (local) | Speed-optimized |
| BGE-M3 | 1024 | 8192 | 65.0 | Free (local) | Best open-source |
| E5-large | 1024 | 512 | 63.2 | Free (local) | Multilingual |
| Cohere embed-v3 | 1024 | 512 | 64.5 | $0.10/1M | Managed production API |

💡 BGE-M3 is the strongest free embedding model as of 2024: it produces both sparse and dense representations from a single model, so it can power hybrid search on its own. For self-hosted production deployments it is the top choice.


Step 4: Hybrid Search (BM25 + Vector)

Pure vector search misses exact keyword matches. Pure BM25 misses semantic meaning. Hybrid search combines both.

BM25 (Sparse Retrieval):
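A minimal BM25 scorer, using the standard Okapi formula with the usual defaults (k1 = 1.5, b = 0.75) and a +1-smoothed IDF, is sketched below. Production systems use an inverted index (Elasticsearch, `rank_bm25`); this brute-force version just shows the math.

```python
# Minimal Okapi BM25 sketch: scores every document against the query.
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N              # average doc length
    df = Counter(w for t in toks for w in set(t))      # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["error code E1234 in auth service",
        "semantic search with embeddings",
        "auth service deployment guide"]
scores = bm25_scores("error E1234", docs)
```

Note how the exact identifier `E1234` scores highly here, which is exactly the case pure vector search tends to miss.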

Dense Vector Search:

Reciprocal Rank Fusion (RRF):
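RRF fuses rankings without needing comparable scores: each document earns 1 / (k + rank) from every ranking it appears in, with k = 60 as in the original RRF paper. A minimal sketch:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]      # keyword ranking
vector_top = ["d1", "d5", "d3"]    # semantic ranking
fused = rrf([bm25_top, vector_top])
```

`d1` wins because it ranks well in both lists, even though neither retriever put it first with a dominant score.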

Hybrid Search Architecture:


Step 5: Re-ranking

Retrieve many, re-rank to top-few. Re-rankers are cross-encoders that consider query+document jointly.

Bi-encoder vs Cross-encoder:

Re-ranking Pipeline:
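The retrieve-many, re-rank-few flow can be sketched as below. `score_pair` is a dependency-free stand-in for a real cross-encoder (in practice you would call something like sentence-transformers' `CrossEncoder.predict` on (query, document) pairs); the candidate documents are made up for the example.

```python
# Re-ranking pipeline sketch: retrieve 100 candidates, keep the top 5.
# score_pair is a toy token-overlap stand-in for a cross-encoder.
def score_pair(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    ranked = sorted(candidates, key=lambda d: score_pair(query, d),
                    reverse=True)
    return ranked[:top_k]

candidates = [f"doc about topic {i}" for i in range(100)]  # 100 retrieved
candidates[42] = "refund policy for enterprise customers"
top = rerank("enterprise refund policy", candidates, top_k=5)
```

The expensive joint query+document scoring runs on only 100 candidates, not the whole corpus, which is what makes cross-encoders affordable at this stage.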

Re-ranking Models:

  • cross-encoder/ms-marco-MiniLM-L-6-v2 (fast, free)

  • Cohere Rerank (API, excellent quality)

  • BGE-reranker-large (open-source, competitive quality)

💡 Re-ranking alone can improve RAG answer quality by 20-30% without changing the LLM or embeddings.


Step 6: Context Compression

LLM context windows are expensive. Compress retrieved context before sending it to the LLM.

Compression Techniques:

| Technique | Method | Compression Ratio | Quality Impact |
| --- | --- | --- | --- |
| LLM extraction | Ask LLM to extract relevant sentences | 3-5x | Low |
| Embeddings filter | Remove low-similarity sentences | 2-3x | Low |
| Summarization | Summarize each chunk | 3-8x | Medium |
| Query-focused | Keep only query-relevant parts | 2-4x | Low |

Contextual Compression Pipeline:
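The embeddings-filter technique from the table can be sketched as follows: drop sentences whose similarity to the query falls below a threshold. A bag-of-words cosine stands in for real embeddings, and the 0.15 threshold is an illustrative assumption.

```python
# Embeddings-filter compression sketch: keep only sentences similar enough
# to the query. Bag-of-words cosine stands in for an embedding model.
import math
import re
from collections import Counter

def _tok(s: str) -> Counter:
    return Counter(re.findall(r"\w+", s.lower()))

def _cos(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def compress(query: str, chunk: str, threshold: float = 0.15) -> str:
    qv = _tok(query)
    sents = [s for s in re.split(r"(?<=[.!?])\s+", chunk) if s]
    keep = [s for s in sents if _cos(qv, _tok(s)) >= threshold]
    return " ".join(keep)

chunk = ("Refunds are issued within 30 days. "
         "Our office dog is named Biscuit. "
         "Refund requests go through support.")
out = compress("refunds support", chunk)
```

The off-topic sentence is dropped before the chunk reaches the LLM, shrinking the prompt without losing the answer-bearing sentences.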


Step 7: RAG Evaluation Metrics

RAGAS Framework (Key Metrics):

| Metric | Measures | Formula |
| --- | --- | --- |
| Faithfulness | Is the answer grounded in the context? | statements supported by context / total statements |
| Answer Relevancy | Does the answer address the question? | semantic sim(question, answer) |
| Context Precision | Are retrieved chunks relevant? | relevant chunks / total retrieved |
| Context Recall | Were all relevant chunks retrieved? | retrieved relevant / all relevant |

Automated Evaluation Loop:
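A toy version of the evaluation loop for the retrieval-side metrics is sketched below. The gold relevance labels and chunk IDs are made up for the example; in practice the RAGAS library computes these (and the LLM-judged metrics) for you.

```python
# Toy evaluation loop for context precision / recall over a labeled set.
# Gold labels (the "relevant" sets) are illustrative stand-ins.
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

eval_set = [
    {"query": "refund policy",
     "retrieved": ["c1", "c4", "c9"], "relevant": {"c1", "c9"}},
    {"query": "sso setup",
     "retrieved": ["c2", "c3"], "relevant": {"c2", "c7"}},
]
precisions = [context_precision(e["retrieved"], e["relevant"])
              for e in eval_set]
recalls = [context_recall(e["retrieved"], e["relevant"]) for e in eval_set]
avg_precision = sum(precisions) / len(precisions)
avg_recall = sum(recalls) / len(recalls)
```

Tracking these averages across retriever changes (chunk size, hybrid weights, re-ranker) turns RAG tuning into a measurable loop instead of guesswork.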


Step 8: Capstone — Hybrid RAG Search System
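A compact end-to-end sketch of the capstone, under stated simplifications: keyword scoring is raw term frequency rather than full BM25, character-trigram overlap stands in for the dense/embedding side, and the corpus is three toy documents. Fusion is standard RRF; a real build would swap in a vector DB plus embedding model.

```python
# Capstone sketch: hybrid retrieval end to end (keyword + "semantic" + RRF).
# Scoring functions are simplified stand-ins for BM25 and dense embeddings.
from collections import Counter

DOCS = {
    "d1": "password reset steps for the admin console",
    "d2": "quarterly revenue report fy2024",
    "d3": "how to reset a forgotten password",
}

def keyword_rank(query: str) -> list[str]:
    def score(text: str) -> int:
        tf = Counter(text.split())
        return sum(tf[w] for w in query.split())
    return sorted(DOCS, key=lambda d: score(DOCS[d]), reverse=True)

def semantic_rank(query: str) -> list[str]:
    # character-trigram overlap as a cheap stand-in for embedding similarity
    def tri(s: str) -> set[str]:
        return {s[i:i + 3] for i in range(len(s) - 2)}
    q = tri(query)
    return sorted(DOCS, key=lambda d: len(q & tri(DOCS[d])), reverse=True)

def hybrid_search(query: str, k: int = 60) -> list[str]:
    scores: Counter = Counter()
    for ranking in (keyword_rank(query), semantic_rank(query)):
        for rank, d in enumerate(ranking, start=1):
            scores[d] += 1.0 / (k + rank)          # RRF fusion
    return [d for d, _ in scores.most_common()]

results = hybrid_search("reset password")
```

Both password documents beat the unrelated revenue report regardless of which retriever an individual query favors, which is the point of fusing the two signals.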

📸 Verified Output:


Summary

| Concept | Key Points |
| --- | --- |
| Ingestion Pipeline | Parse → chunk → embed → index (vector + BM25) |
| Chunking | Fixed (simple) → Recursive (balanced) → Semantic (quality) |
| Embeddings | BGE-M3 (best open-source), text-embedding-3 (OpenAI) |
| Hybrid Search | BM25 (keywords) + Vector (semantic) + RRF (fusion) |
| Re-ranking | Cross-encoder: retrieve 100, re-rank to top-5 = +20-30% quality |
| Context Compression | Filter/compress before LLM = lower cost, better focus |
| Evaluation | RAGAS: faithfulness + answer relevancy + context precision/recall |

Next Lab: Lab 06: AI Observability & Monitoring →
