Lab 17: Vector Databases & pgvector

Time: 50 minutes | Level: Architect | DB: PostgreSQL + pgvector, Pinecone, Weaviate

Vector databases enable semantic search, recommendation systems, and AI-powered applications by storing and querying high-dimensional embeddings. This lab covers pgvector, similarity metrics, indexing strategies, and a comparative analysis of vector database options.


Step 1: Vector Databases — The Why

Traditional databases query exact values. Vector databases query by similarity in high-dimensional space.

Traditional SQL:
  SELECT * FROM products WHERE category = 'shoes'
  → Exact match, deterministic

Vector Search:
  SELECT * FROM products
  ORDER BY description_embedding <-> query_embedding
  LIMIT 10
  → Semantic similarity: finds "sneakers", "footwear", "trainers" too

Use cases:
  ✓ Semantic document search (find by meaning, not keywords)
  ✓ Product recommendations ("customers who liked this also liked...")
  ✓ Image similarity search
  ✓ RAG (Retrieval-Augmented Generation) for LLM applications
  ✓ Anomaly detection (distance from cluster center)
  ✓ Duplicate detection
  ✓ Facial recognition

What is an embedding?

An embedding is a dense numeric vector produced by a machine-learning model that encodes the meaning of an input (text, an image, audio) as a point in high-dimensional space. Inputs with similar meaning map to nearby points, so the distance between two vectors measures their semantic similarity.

💡 OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors. Newer models like text-embedding-3-small use 1536 dims too but with better quality. For production, always benchmark embedding quality on your specific domain data.


Step 2: pgvector — PostgreSQL Extension Setup

pgvector adds vector storage and similarity search to PostgreSQL without requiring a separate database.
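A minimal setup sketch, assuming the pgvector extension is already installed on the server; the documents table and its columns are illustrative:

```sql
-- Enable the extension (pgvector must be installed on the server first,
-- e.g. from your distro's postgresql-XX-pgvector package or from source)
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical table storing OpenAI ada-002 embeddings (1536 dims)
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)
);

-- Vectors are inserted as bracketed literals
INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.01, -0.02, ...]');  -- abridged; 1536 values in practice
```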


Step 3: Similarity Metrics — L2, Cosine, Inner Product

pgvector supports three distance operators, each suited for different use cases.

💡 For maximum performance, normalize your vectors before inserting and use inner product (<#>). On unit-length vectors it ranks identically to cosine distance but skips the normalization calculation, making it roughly 15% faster.
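A sketch of the three operators on a hypothetical items table (the 3-dimensional vectors are illustrative stand-ins for real embeddings):

```sql
-- L2 (Euclidean) distance: <->
SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- Cosine distance (1 - cosine similarity): <=>
SELECT id FROM items ORDER BY embedding <=> '[1,2,3]' LIMIT 5;

-- Negative inner product: <#>
-- (pgvector negates it so that ORDER BY ... ASC still puts best matches first)
SELECT id FROM items ORDER BY embedding <#> '[1,2,3]' LIMIT 5;
```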


Step 4: IVFFlat Index — Approximate Nearest Neighbor

Exact nearest neighbor search is O(n) — too slow for millions of vectors. IVFFlat trades a small accuracy loss for massive speedup.

IVFFlat tuning guide:

  Dataset Size     lists      probes (for 95% recall)
  < 1M rows        100        10
  1M - 10M rows    1000       100
  > 10M rows       sqrt(n)    lists/10
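A sketch of building and querying an IVFFlat index, assuming the documents table from Step 2:

```sql
-- Build an IVFFlat index; `lists` partitions the vectors into clusters.
-- Build the index AFTER loading data so the cluster centers are representative.
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- At query time, `probes` controls how many clusters are searched:
-- more probes = higher recall, slower queries.
SET ivfflat.probes = 10;

SELECT id FROM documents
ORDER BY embedding <=> '[...]'   -- query embedding (abridged)
LIMIT 10;
```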


Step 5: HNSW Index — High Recall Nearest Neighbor

HNSW (Hierarchical Navigable Small World) is the state-of-the-art approximate nearest neighbor algorithm, available in pgvector 0.5+.

💡 For RAG applications with LLMs, use HNSW with cosine distance. The higher recall means better context retrieval, which directly improves LLM response quality.
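A sketch of an HNSW index on the documents table; the WITH values shown are pgvector's defaults:

```sql
-- HNSW index (pgvector 0.5+). m = max connections per graph node,
-- ef_construction = candidate list size during build; raising either
-- trades build time and index size for recall.
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time recall knob (default 40; should be >= your LIMIT)
SET hnsw.ef_search = 100;
```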


Step 6: Real-World pgvector Patterns
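One widely used pattern is hybrid search, which blends full-text keyword ranking with vector similarity. A sketch, assuming the documents table from Step 2 plus a precomputed content_tsv tsvector column; the 0.3/0.7 weights are illustrative and should be tuned per application:

```sql
-- Hybrid search: combine ts_rank (keyword relevance) with cosine similarity.
SELECT id, content,
       0.3 * ts_rank(content_tsv, plainto_tsquery('english', 'running shoes'))
     + 0.7 * (1 - (embedding <=> '[...]'))   -- cosine similarity (abridged vector)
       AS score
FROM documents
WHERE content_tsv @@ plainto_tsquery('english', 'running shoes')
ORDER BY score DESC
LIMIT 10;
```

Filtering with @@ first keeps the candidate set small; the blended score then reorders those candidates by both signals.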


Step 7: Vector Database Comparison

💡 For most production applications < 50M vectors, pgvector with HNSW is the best choice — you get full SQL power, ACID transactions, existing PG tooling, and no additional infrastructure.


Step 8: Capstone — pgvector Similarity Demo
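A minimal end-to-end demo in the spirit of this capstone; the schema and the toy 3-dimensional vectors are illustrative (real embeddings would be 1536-dimensional and come from an embedding model):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE demo_items (
    id        serial PRIMARY KEY,
    name      text,
    embedding vector(3)
);

INSERT INTO demo_items (name, embedding) VALUES
    ('sneakers', '[1, 0, 0]'),
    ('trainers', '[0.9, 0.1, 0]'),
    ('toaster',  '[0, 0, 1]');

-- Nearest neighbors to a "shoe-like" query vector by cosine distance:
-- sneakers and trainers should rank above toaster.
SELECT name, embedding <=> '[1, 0, 0]' AS cosine_dist
FROM demo_items
ORDER BY cosine_dist
LIMIT 3;
```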

Run verification:

📸 Verified Output:


Summary

Concept                pgvector Syntax                          Notes
Create vector column   embedding vector(1536)                   1536 = OpenAI ada-002 dims
L2 distance            embedding <-> '[1,2,3]'                  Euclidean, for spatial data
Cosine distance        embedding <=> '[1,2,3]'                  Best for NLP embeddings
Inner product          embedding <#> '[1,2,3]'                  For normalized vectors
IVFFlat index          USING ivfflat (emb vector_cosine_ops)    Faster build, lower recall
HNSW index             USING hnsw (emb vector_cosine_ops)       Better recall, recommended
Query accuracy         SET hnsw.ef_search = 40                  Higher = better recall, slower
Semantic search        ORDER BY embedding <=> query LIMIT k     K-nearest neighbors
Hybrid search          ts_rank + cosine_distance                Best of keyword + semantic
