Lab 12: Embeddings & Semantic Search

Objective

Build a semantic search engine for security knowledge using vector embeddings. Understand the difference between keyword search and semantic search, implement vector similarity, and build a retrieval system that finds relevant content even when the exact words don't match.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

Keyword search:
  Query: "SQL injection prevention"
  Finds: documents containing "SQL injection prevention"
  Misses: "parameterised queries", "prepared statements", "input sanitisation"

Semantic search:
  Query: "SQL injection prevention"
  Finds: all of the above (they mean the same thing)
  Uses: vector embeddings to represent meaning, not words

Every piece of text can be encoded as a dense vector where similar meanings are close in vector space.
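The idea can be shown with toy vectors. These are hand-picked for illustration, not from a real model; real embeddings come from a trained encoder:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction, ~0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (hand-picked, purely illustrative)
sqli_prevention     = np.array([0.9, 0.1, 0.0])
prepared_statements = np.array([0.8, 0.3, 0.1])  # similar meaning: nearby vector
cat_pictures        = np.array([0.0, 0.2, 0.9])  # unrelated topic: distant vector

print(cosine_similarity(sqli_prevention, prepared_statements))  # close to 1
print(cosine_similarity(sqli_prevention, cat_pictures))         # close to 0
```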


Step 1: Environment Setup
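A quick sanity check that the scientific Python stack is available. This assumes the lab image ships NumPy and scikit-learn (the natural tools for the TF-IDF and SVD steps below); exact versions will vary:

```python
# Verify the libraries used in later steps import cleanly
import sys
import numpy
import sklearn

print("Python      :", sys.version.split()[0])
print("NumPy       :", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
```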

📸 Verified Output:


Step 2: TF-IDF Embeddings — The Baseline
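A minimal sketch of the baseline, using a hypothetical mini-corpus (the lab's actual documents may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of security notes
docs = [
    "Prevent SQL injection with parameterised queries",
    "Prepared statements stop injection attacks on databases",
    "Escape HTML output to block cross-site scripting",
    "Ransomware encrypts files and demands payment",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary terms

print("Matrix shape:", X.shape)     # one column per vocabulary word
print("Non-zero entries:", X.nnz, "out of", X.shape[0] * X.shape[1])
```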

📸 Verified Output:

💡 TF-IDF embeddings are very sparse (most entries are zero). Dense embeddings (Word2Vec, BERT) pack more meaning into fewer dimensions and capture synonymy.


Step 3: Cosine Similarity Search

📸 Verified Output:
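This step ranks documents against a query by cosine similarity in raw TF-IDF space: the keyword-level baseline that the Step 4 tip compares against. A sketch with a hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus
docs = [
    "Prevent SQL injection with parameterised queries",
    "Prepared statements stop injection attacks on databases",
    "Escape HTML output to block cross-site scripting",
    "Ransomware encrypts files and demands payment",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rank documents by cosine similarity to the query in TF-IDF space
query = "how do I stop SQL injection"
q = vectorizer.transform([query])
scores = cosine_similarity(q, X).ravel()

for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.2f}  {docs[i]}")
```

Only documents sharing literal terms with the query score above zero; that limitation motivates the dense embeddings of the next step.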


Step 4: Dense Embeddings — Latent Semantic Analysis
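A sketch of LSA: TF-IDF followed by truncated SVD, which projects documents into a small dense topic space. The corpus is hypothetical (three SQL injection documents, mirroring the tip below), so the similarity scores here will not reproduce the tip's exact numbers:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus with three SQL-injection-related documents
docs = [
    "Prevent SQL injection with parameterised queries",
    "Prepared statements stop injection attacks on databases",
    "Sanitise user input before building database queries",
    "Escape HTML output to block cross-site scripting",
    "Ransomware encrypts files and demands payment",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# LSA = TF-IDF + truncated SVD; 2 components is enough for this tiny corpus
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)  # dense document embeddings

# Project the query into the same topic space before comparing
query = "parameterised queries and prepared statements"
qz = svd.transform(vectorizer.transform([query]))
scores = cosine_similarity(qz, Z).ravel()

for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.2f}  {docs[i]}")
```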

📸 Verified Output:

💡 LSA finds all three SQL injection docs with high scores (0.87, 0.71, 0.53) because it captures the shared topic space. TF-IDF found only two, since the query didn't contain the exact phrase "SQL injection". This is semantic understanding.


Step 5: Document Similarity Matrix
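A sketch of the pairwise similarity matrix over LSA embeddings. The labelled corpus is hypothetical; the ids are chosen to match the naming in the tip below:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical labelled corpus
corpus = {
    "sqli-1": "Prevent SQL injection with parameterised queries",
    "sqli-2": "Prepared statements stop injection attacks on databases",
    "xss-1":  "Escape HTML output to block cross-site scripting",
    "xss-2":  "Content Security Policy mitigates cross-site scripting payloads",
}

ids = list(corpus)
X = TfidfVectorizer().fit_transform(corpus.values())
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Every-document-vs-every-document cosine similarity
sim = cosine_similarity(Z)
print("        " + "  ".join(f"{i:>7}" for i in ids))
for row_id, row in zip(ids, sim):
    print(f"{row_id:>7} " + "  ".join(f"{v:7.2f}" for v in row))
```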

📸 Verified Output:

💡 The model correctly groups related documents: sqli-* docs cluster together, xss-* cluster together, etc. This is unsupervised semantic grouping, with no labels needed.


Step 6: Embedding Clustering — Visualising the Vector Space
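One way to sketch this step: with 2-component LSA the embeddings are already 2-D coordinates you can plot, and k-means can group them without labels. The corpus and cluster count are hypothetical, and on such a tiny corpus the clusters may not align perfectly with the three topics:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus covering three topics
docs = [
    "Prevent SQL injection with parameterised queries",
    "Prepared statements stop injection attacks on databases",
    "Escape HTML output to block cross-site scripting",
    "Content Security Policy mitigates cross-site scripting payloads",
    "Ransomware encrypts files and demands payment",
    "Ransomware recovery starts with offline backups",
]

X = TfidfVectorizer().fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cluster the 2-D embeddings; 3 clusters for the 3 topics in this toy corpus
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
for label, doc, (x, y) in zip(km.labels_, docs, Z):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {doc}")
```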

📸 Verified Output:


Step 7: Real-Time Embedding Update (Incremental)
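With LSA, an incremental update can be sketched as transforming the new document through the already-fitted vectorizer and SVD, with no refit of the whole index. This is one plausible reading of the step; a caveat of the approach is noted in the comments:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical initial corpus
docs = [
    "Prevent SQL injection with parameterised queries",
    "Escape HTML output to block cross-site scripting",
    "Ransomware encrypts files and demands payment",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Embed a new document with the already-fitted models: no refit needed.
# Caveat: words unseen at fit time are ignored until the next full refit.
new_doc = "Use CSRF tokens to protect form submissions"
new_vec = svd.transform(vectorizer.transform([new_doc]))

Z = np.vstack([Z, new_vec])
docs.append(new_doc)
print("Index size:", Z.shape[0], "documents")
```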

📸 Verified Output:


Step 8: Real-World Capstone — Security Knowledge Search Engine
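The capstone ties the previous steps together. A sketch of such an engine, with illustrative class and method names (not from the lab) and a hypothetical corpus chosen to echo the queries in the tip below:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSearchEngine:
    """TF-IDF + truncated SVD (LSA) index with cosine-similarity search."""

    def __init__(self, n_components=2):
        self.vectorizer = TfidfVectorizer()
        self.svd = TruncatedSVD(n_components=n_components, random_state=0)
        self.docs = []
        self.embeddings = None

    def index(self, docs):
        # Fit both models on the corpus and store dense document embeddings
        self.docs = list(docs)
        X = self.vectorizer.fit_transform(self.docs)
        self.embeddings = self.svd.fit_transform(X)

    def search(self, query, top_k=3):
        # Project the query into the same LSA space and rank by cosine similarity
        q = self.svd.transform(self.vectorizer.transform([query]))
        scores = cosine_similarity(q, self.embeddings).ravel()
        order = np.argsort(scores)[::-1][:top_k]
        return [(float(scores[i]), self.docs[i]) for i in order]

engine = SemanticSearchEngine()
engine.index([
    "Ransomware encrypts files and demands payment in cryptocurrency",
    "Escape HTML output to block cross-site scripting in JavaScript",
    "SSRF tricks the server into making hidden internal requests",
    "Prevent SQL injection with parameterised queries",
])
for score, doc in engine.search("virus that demands payment", top_k=2):
    print(f"{score:.2f}  {doc}")
```

In production, the same interface would typically wrap sentence-transformers embeddings instead of LSA, as the summary table notes.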

📸 Verified Output:

💡 Every query found the right document despite using completely different words: "virus that demands payment" → ransomware, "JavaScript attacks" → XSS, "hidden server requests" → SSRF. This is semantic understanding.


Summary

Method            Dimensionality      Semantic Understanding       Use Case
TF-IDF (sparse)   High (vocab size)   None (exact match only)      Fast baseline
LSA (dense)       20–300 dims         Topic-level synonymy         Small-medium corpora
Word2Vec avg      50–300 dims         Word-level semantics         When BERT unavailable
Sentence-BERT     384–768 dims        Full semantic + context      Production systems

Key Takeaways:

  • Semantic search finds relevant documents even without exact keyword overlap
  • LSA (TF-IDF + SVD) provides dense embeddings without any training
  • Cosine similarity is the standard metric for comparing embeddings
  • In production: use sentence-transformers for state-of-the-art embeddings
