Lab 17: Vector Databases & pgvector

Time: 50 minutes | Level: Architect | DB: PostgreSQL + pgvector, Pinecone, Weaviate

Vector databases enable semantic search, recommendation systems, and AI-powered applications by storing and querying high-dimensional embeddings. This lab covers pgvector, similarity metrics, indexing strategies, and a comparative analysis of vector database options.


Step 1: Vector Databases — The Why

Traditional databases query exact values. Vector databases query by similarity in high-dimensional space.

Traditional SQL:
  SELECT * FROM products WHERE category = 'shoes'
  → Exact match, deterministic

Vector Search:
  SELECT * FROM products
  ORDER BY description_embedding <-> query_embedding
  LIMIT 10
  → Semantic similarity: finds "sneakers", "footwear", "trainers" too

Use cases:
  ✓ Semantic document search (find by meaning, not keywords)
  ✓ Product recommendations ("customers who liked this also liked...")
  ✓ Image similarity search
  ✓ RAG (Retrieval-Augmented Generation) for LLM applications
  ✓ Anomaly detection (distance from cluster center)
  ✓ Duplicate detection
  ✓ Facial recognition

What is an embedding?

An embedding is a dense numeric vector produced by a machine-learning model that encodes the meaning of an input (text, an image, audio) as a point in high-dimensional space. Inputs with similar meaning map to nearby points, so the distance between two vectors measures their semantic similarity.

💡 OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors. Newer models like text-embedding-3-small use 1536 dims too but with better quality. For production, always benchmark embedding quality on your specific domain data.


Step 2: pgvector — PostgreSQL Extension Setup

pgvector adds vector storage and similarity search to PostgreSQL without requiring a separate database.
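A minimal setup sketch, assuming the pgvector extension is already installed on the server; the documents table and its columns are illustrative:

```sql
-- Enable the extension (pgvector must be installed on the server first,
-- e.g. from your distro's postgresql-XX-pgvector package or from source)
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical table storing OpenAI ada-002 embeddings (1536 dims)
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)
);

-- Vectors are inserted as bracketed literals
INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.01, -0.02, ...]');  -- abridged; 1536 values in practice
```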


Step 3: Similarity Metrics — L2, Cosine, Inner Product

pgvector supports three distance operators, each suited for different use cases.

💡 For maximum performance, normalize your vectors before inserting and use inner product (<#>). On unit-length vectors it ranks identically to cosine distance but skips the normalization calculation, making it roughly 15% faster.
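A sketch of the three operators on a hypothetical items table (the 3-dimensional vectors are illustrative stand-ins for real embeddings):

```sql
-- L2 (Euclidean) distance: <->
SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- Cosine distance (1 - cosine similarity): <=>
SELECT id FROM items ORDER BY embedding <=> '[1,2,3]' LIMIT 5;

-- Negative inner product: <#>
-- (pgvector negates it so that ORDER BY ... ASC still puts best matches first)
SELECT id FROM items ORDER BY embedding <#> '[1,2,3]' LIMIT 5;
```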


Step 4: IVFFlat Index — Approximate Nearest Neighbor

Exact nearest neighbor search is O(n) — too slow for millions of vectors. IVFFlat trades a small accuracy loss for massive speedup.

IVFFlat tuning guide:

  Dataset Size     lists      probes (for 95% recall)
  < 1M rows        100        10
  1M - 10M rows    1000       100
  > 10M rows       sqrt(n)    lists/10
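A sketch of building and querying an IVFFlat index, assuming the documents table from Step 2:

```sql
-- Build an IVFFlat index; `lists` partitions the vectors into clusters.
-- Build the index AFTER loading data so the cluster centers are representative.
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- At query time, `probes` controls how many clusters are searched:
-- more probes = higher recall, slower queries.
SET ivfflat.probes = 10;

SELECT id FROM documents
ORDER BY embedding <=> '[...]'   -- query embedding (abridged)
LIMIT 10;
```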


Step 5: HNSW Index — High Recall Nearest Neighbor

HNSW (Hierarchical Navigable Small World) is the state-of-the-art approximate nearest neighbor algorithm, available in pgvector 0.5+.

💡 For RAG applications with LLMs, use HNSW with cosine distance. The higher recall means better context retrieval, which directly improves LLM response quality.
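A sketch of an HNSW index on the documents table; the WITH values shown are pgvector's defaults:

```sql
-- HNSW index (pgvector 0.5+). m = max connections per graph node,
-- ef_construction = candidate list size during build; raising either
-- trades build time and index size for recall.
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time recall knob (default 40; should be >= your LIMIT)
SET hnsw.ef_search = 100;
```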


Step 6: Real-World pgvector Patterns
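One widely used pattern is hybrid search, which blends full-text keyword ranking with vector similarity. A sketch, assuming the documents table from Step 2 plus a precomputed content_tsv tsvector column; the 0.3/0.7 weights are illustrative and should be tuned per application:

```sql
-- Hybrid search: combine ts_rank (keyword relevance) with cosine similarity.
SELECT id, content,
       0.3 * ts_rank(content_tsv, plainto_tsquery('english', 'running shoes'))
     + 0.7 * (1 - (embedding <=> '[...]'))   -- cosine similarity (abridged vector)
       AS score
FROM documents
WHERE content_tsv @@ plainto_tsquery('english', 'running shoes')
ORDER BY score DESC
LIMIT 10;
```

Filtering with @@ first keeps the candidate set small; the blended score then reorders those candidates by both signals.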


Step 7: Vector Database Comparison

💡 For most production applications < 50M vectors, pgvector with HNSW is the best choice — you get full SQL power, ACID transactions, existing PG tooling, and no additional infrastructure.


Step 8: Capstone — pgvector Similarity Demo
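A minimal end-to-end demo in the spirit of this capstone; the schema and the toy 3-dimensional vectors are illustrative (real embeddings would be 1536-dimensional and come from an embedding model):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE demo_items (
    id        serial PRIMARY KEY,
    name      text,
    embedding vector(3)
);

INSERT INTO demo_items (name, embedding) VALUES
    ('sneakers', '[1, 0, 0]'),
    ('trainers', '[0.9, 0.1, 0]'),
    ('toaster',  '[0, 0, 1]');

-- Nearest neighbors to a "shoe-like" query vector by cosine distance:
-- sneakers and trainers should rank above toaster.
SELECT name, embedding <=> '[1, 0, 0]' AS cosine_dist
FROM demo_items
ORDER BY cosine_dist
LIMIT 3;
```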

Run verification:

📸 Verified Output:


Summary

Concept                pgvector Syntax                          Notes
Create vector column   embedding vector(1536)                   1536 = OpenAI ada-002 dims
L2 distance            embedding <-> '[1,2,3]'                  Euclidean, for spatial data
Cosine distance        embedding <=> '[1,2,3]'                  Best for NLP embeddings
Inner product          embedding <#> '[1,2,3]'                  For normalized vectors
IVFFlat index          USING ivfflat (emb vector_cosine_ops)    Faster build, lower recall
HNSW index             USING hnsw (emb vector_cosine_ops)       Better recall, recommended
Query accuracy         SET hnsw.ef_search = 40                  Higher = better recall, slower
Semantic search        ORDER BY embedding <=> query LIMIT k     K-nearest neighbors
Hybrid search          ts_rank + cosine_distance                Best of keyword + semantic
