Lab 03: Vector Database Architecture

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

Vector databases power semantic search, RAG systems, and recommendation engines by storing and querying high-dimensional embeddings. This lab covers similarity metrics, index architectures, major vector DB comparisons, and dimensionality reduction techniques.

Architecture

┌──────────────────────────────────────────────────────────┐
│              Vector Database Architecture                 │
├──────────────────────────────────────────────────────────┤
│  Embedding Models (text/image/audio) → Dense Vectors     │
│  Dimension: 384 (MiniLM), 768 (BERT), 1536 (OpenAI)     │
├──────────────────────────────────────────────────────────┤
│  Index Layer:                                            │
│  ├── Flat Index (exact, small datasets <100K)            │
│  ├── IVF (inverted file, medium datasets)                │
│  └── HNSW (hierarchical, large datasets, best recall)    │
├──────────────────────────────────────────────────────────┤
│  Query: embedding → ANN search → Top-K results          │
└──────────────────────────────────────────────────────────┘

Step 1: Why Vector Databases?

Traditional databases can't efficiently answer: "Find the 10 most semantically similar documents to this query."

The RAG Architecture Need:

Vector DB vs Traditional DB:

| Dimension | Relational DB | Vector DB |
| --- | --- | --- |
| Data type | Structured rows | High-dim embeddings |
| Query type | Exact match, range | Approximate nearest neighbor |
| Scale | Billions of rows | Millions-billions of vectors |
| Latency | ms (indexed) | ms (ANN indexed) |
| Use case | OLTP, reports | Semantic search, RAG, recommenders |


Step 2: Similarity Metrics

Choose your distance/similarity metric based on how your embeddings are trained.

Cosine Similarity (most common for NLP):

Euclidean Distance (L2 norm):

Dot Product (unnormalized cosine):
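
The three metrics above can be sketched in NumPy. This is a minimal illustration with toy vectors, not tied to any particular database; it also demonstrates the cosine/dot-product equivalence on unit vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||) -- magnitude-invariant
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # L2 norm of the difference vector -- magnitude-sensitive
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    return float(np.dot(a, b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, double the magnitude

print(cosine_similarity(a, b))   # ~1.0: direction is identical
print(euclidean_distance(a, b))  # ~3.742: magnitudes differ
print(dot_product(a, b))         # 28.0

# On unit-normalized vectors, dot product equals cosine similarity
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(dot_product(a_n, b_n), cosine_similarity(a, b)))  # True
```

Note how cosine reports the two vectors as identical while Euclidean distance does not; that difference is exactly why the metric must match how the embeddings were trained.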

| Metric | Best For | Notes |
| --- | --- | --- |
| Cosine | Text, normalized embeddings | Unit sphere, magnitude-invariant |
| Euclidean | Image embeddings, coordinates | Sensitive to vector magnitude |
| Dot product | Recommendation systems | Fast; magnitude-sensitive unless vectors are normalized |
| Inner product | FAISS indexes, retrieval | Same operation as dot product; default for many embedding models |

💡 OpenAI text-embedding models are optimized for cosine similarity. Normalize vectors once at insert time; on unit vectors, dot product and cosine similarity are identical, and the dot product is cheaper to compute.


Step 3: Index Types

The index determines how vectors are stored and searched.

Flat Index (Brute Force):
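
A flat index is simply exhaustive comparison against every stored vector. A minimal NumPy sketch with synthetic data (normalized vectors, so a matrix-vector dot product gives cosine scores):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1000
db = rng.normal(size=(n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)   # normalize once at insert

def flat_search(query, k=10):
    """Exact top-k by cosine similarity: scan every vector, O(n*d) per query."""
    q = query / np.linalg.norm(query)
    scores = db @ q                   # dot product == cosine on unit vectors
    topk = np.argsort(-scores)[:k]    # best scores first
    return topk, scores[topk]

ids, scores = flat_search(rng.normal(size=d).astype(np.float32))
print(ids[:3], scores[:3])
```

Recall is perfect by construction; the linear scan is what makes flat indexes impractical past roughly 100K vectors.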

IVF (Inverted File Index):

HNSW (Hierarchical Navigable Small World):

Index Selection Guide:

| Vectors | Requirement | Recommended Index |
| --- | --- | --- |
| < 100K | Max recall | Flat |
| 100K - 10M | Balanced | IVF (nlist=1024) |
| > 1M | Speed priority | HNSW (M=16) |
| > 1B | Memory-constrained | IVF + PQ (product quantization) |


Step 4: Vector Database Comparison

| Feature | pgvector | Pinecone | Weaviate | Chroma |
| --- | --- | --- | --- | --- |
| Type | PostgreSQL extension | Managed cloud | Open-source/cloud | Open-source |
| Scale | <10M vectors | Billions | Millions-billions | Millions |
| Index types | IVF, HNSW | Proprietary | HNSW | HNSW |
| Hybrid search | With FTS | Limited | ✅ | ❌ |
| Metadata filtering | SQL | ✅ | ✅ | ✅ |
| Self-hosted | ✅ | ❌ | ✅ | ✅ |
| GDPR/compliance | ✅ (your infra) | Enterprise tier | ✅ (self-hosted) | ✅ (self-hosted) |
| Best for | Existing Postgres users | Managed/no-ops | Production RAG | Dev/prototyping |

Architecture Decision:
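
One way to express the decision is a small heuristic helper. The thresholds mirror the comparison table above and are rough guidance, not hard limits:

```python
def choose_vector_db(n_vectors: int, managed_ok: bool, has_postgres: bool,
                     production: bool) -> str:
    """Rough decision heuristic based on the comparison table above."""
    if not production:
        return "Chroma"                      # fastest path for dev/prototyping
    if has_postgres and n_vectors < 10_000_000:
        return "pgvector"                    # reuse existing Postgres ops, backups, SQL
    if managed_ok:
        return "Pinecone"                    # no-ops, scales to billions
    return "Weaviate"                        # self-hosted production RAG

print(choose_vector_db(1_000_000, managed_ok=False, has_postgres=True, production=True))
# pgvector
```

In practice the decision also weighs team expertise, existing infrastructure, and compliance constraints, which a four-flag function cannot capture.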


Step 5: Dimensionality Reduction with PCA

High-dimensional embeddings are expensive. PCA reduces dimensions while preserving variance.

When to Use Dimensionality Reduction:

  • Cost optimization (smaller storage, faster queries)

  • Visualization (reduce to 2D/3D with t-SNE/UMAP)

  • Speed optimization (lower dimensions = faster ANN)

  • Memory constraints

PCA Trade-offs:
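
The trade-off can be seen in a minimal PCA sketch using NumPy's SVD. The embeddings here are synthetic (signal concentrated in a low-dimensional subspace plus noise); in practice you would use sklearn's PCA or an embedding model's native dimension option:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "embeddings": 1000 vectors of dim 256 whose variance
# lives almost entirely in a 16-dimensional subspace
basis = rng.normal(size=(16, 256))
emb = rng.normal(size=(1000, 16)) @ basis + 0.01 * rng.normal(size=(1000, 256))

def pca_reduce(x, k):
    """Project x onto its top-k principal components; report variance retained."""
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()   # fraction of variance kept
    return centered @ vt[:k].T, explained

reduced, kept = pca_reduce(emb, k=16)
print(reduced.shape, round(float(kept), 4))   # (1000, 16), nearly all variance kept
```

The 16x reduction here is nearly lossless only because the data truly is low-dimensional; real embeddings lose recall as k shrinks, so always measure retrieval quality after reducing.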

💡 Matryoshka Representation Learning (MRL) is better than post-hoc PCA for LLM embeddings. Models like OpenAI text-embedding-3 support variable dimensions natively.


Step 6: Approximate Nearest Neighbor Trade-offs

ANN algorithms trade recall for speed. Understand the parameters.

HNSW Parameters:

Recall vs Latency Curve:

IVF Parameters:
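
The nprobe trade-off can be demonstrated with a toy IVF in plain NumPy. Random data points stand in for trained k-means centroids, and recall is measured against exact search; this is an illustration of the mechanism, not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nlist = 32, 5000, 50
db = rng.normal(size=(n, d)).astype(np.float32)
centroids = db[rng.choice(n, nlist, replace=False)]           # stand-in for k-means
assign = np.argmin(((db[:, None] - centroids) ** 2).sum(-1), axis=1)  # cell per vector

def ivf_search(q, nprobe, k=10):
    """Probe the nprobe nearest cells, then search exactly inside them."""
    cell_dist = ((centroids - q) ** 2).sum(-1)
    probed = np.argsort(cell_dist)[:nprobe]
    cand = np.where(np.isin(assign, probed))[0]
    dists = ((db[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

q = rng.normal(size=d).astype(np.float32)
exact = np.argsort(((db - q) ** 2).sum(-1))[:10]              # ground-truth top-10
for nprobe in (1, 5, 20, 50):
    found = ivf_search(q, nprobe)
    recall = len(set(found) & set(exact)) / 10
    print(f"nprobe={nprobe:2d}  recall@10={recall:.1f}")      # recall rises with nprobe
```

With nprobe equal to nlist, every cell is scanned and the search degenerates to exact flat search, which is why recall reaches 1.0 at the end of the sweep.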


Step 7: Production Vector DB Design

Sharding Strategy for Large Collections:

Replication for High Availability:

Vector DB in Multi-tenant Architecture:
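
The three concerns above share one core pattern: partition on write, scatter-gather on read. A minimal sketch of hash-partitioned inserts with query fan-out and top-k merge (real systems add replication, async I/O, and per-tenant routing on top of this):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
N_SHARDS, d = 4, 32
shards = [[] for _ in range(N_SHARDS)]      # each shard holds (id, vector) pairs

def insert(vec_id: int, vec):
    shards[hash(vec_id) % N_SHARDS].append((vec_id, vec))   # hash-based partitioning

def search_shard(shard, q, k):
    """Local exact top-k within one shard (smallest L2 distance first)."""
    scored = [(float(((v - q) ** 2).sum()), i) for i, v in shard]
    return heapq.nsmallest(k, scored)

def search(q, k=5):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partials = [r for s in shards for r in search_shard(s, q, k)]
    return [i for _, i in heapq.nsmallest(k, partials)]

for i in range(1000):
    insert(i, rng.normal(size=d))
print(search(rng.normal(size=d)))   # ids of the 5 global nearest vectors
```

Because each shard returns its own top-k, merging k results per shard is guaranteed to contain the global top-k; that is what keeps scatter-gather correct.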


Step 8: Capstone — Build Vector Similarity Engine
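
One possible skeleton for the capstone: an in-memory engine that normalizes vectors at insert time and answers cosine top-k queries with a single matrix-vector product. The lab's reference solution may differ; class and method names here are this sketch's own:

```python
import numpy as np

class SimilarityEngine:
    """Tiny in-memory vector store: add documents, query by cosine similarity."""

    def __init__(self, dim: int):
        self.dim = dim
        self.ids: list[str] = []
        self.matrix = np.empty((0, dim), dtype=np.float32)

    def add(self, doc_id: str, vector) -> None:
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)                    # normalize at insert time
        self.ids.append(doc_id)
        self.matrix = np.vstack([self.matrix, v])

    def query(self, vector, k: int = 3):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q                     # cosine via dot on unit vectors
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]

engine = SimilarityEngine(dim=4)
engine.add("doc-a", [1, 0, 0, 0])
engine.add("doc-b", [0.9, 0.1, 0, 0])
engine.add("doc-c", [0, 0, 1, 0])
print(engine.query([1, 0, 0, 0], k=2))   # doc-a first, then doc-b
```

Extending this with an IVF or HNSW index from the earlier steps, plus metadata filtering, turns it into the full lab deliverable.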

📸 Verified Output:


Summary

| Concept | Key Points |
| --- | --- |
| Vector DB role | Store embeddings, enable semantic search and RAG |
| Similarity metrics | Cosine (NLP), Euclidean (images), dot product (recommenders) |
| Index types | Flat (exact), IVF (medium scale), HNSW (large scale, best recall) |
| DB comparison | pgvector (Postgres), Pinecone (managed), Weaviate (open-source prod), Chroma (dev) |
| PCA | Reduce dims, trade variance for speed/storage |
| ANN trade-offs | Higher ef/nprobe = better recall, slower queries |

Next Lab: Lab 04: LLM Infrastructure Design →
