Lab 08: NLP & TF-IDF

Objective

Build an NLP pipeline from scratch: text tokenisation and normalisation, Term Frequency-Inverse Document Frequency (TF-IDF) vectorisation, cosine similarity for document comparison, a Naive Bayes text classifier, and a semantic search engine for Microsoft product descriptions.

Background

TF-IDF measures how important a word is to a document within a corpus. TF (term frequency) = how often a word appears in a document. IDF (inverse document frequency) = log(N/df) where df is how many documents contain the word — common words like "the" get low IDF. The product TF×IDF gives high scores to words that are frequent in a document but rare in the corpus — exactly the words that characterise that document.
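The definitions above can be sketched end to end in a few lines of plain Python. This is a minimal illustration, not the lab's reference solution; the three toy documents are placeholders for the Microsoft product descriptions used later.

```python
import math
from collections import Counter

# Toy corpus standing in for the product descriptions used in the lab.
docs = [
    "surface book has a detachable screen",
    "surface pro is a great tablet",
    "xbox is a great gaming console",
]

def tokenise(text):
    # Normalise to lowercase and split on whitespace.
    return text.lower().split()

tokenised = [tokenise(d) for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})
N = len(docs)

# df: how many documents contain each word
df = {w: sum(1 for doc in tokenised if w in doc) for w in vocab}
# idf: log(N/df) + 1, matching the formula in this lab
idf = {w: math.log(N / df[w]) + 1 for w in vocab}

def tfidf(doc):
    counts = Counter(doc)
    total = len(doc)
    # tf = count(word) / total_words, then weight each term by its idf
    return [counts[w] / total * idf[w] for w in vocab]

vectors = [tfidf(doc) for doc in tokenised]
```

Note how "great" (in two of three documents) ends up with a lower IDF than "detachable" (in only one), so the rare word dominates its document's vector.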

Time

30 minutes

Prerequisites

  • Lab 01 (Linear Regression) — numpy fundamentals

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 IDF is what separates TF-IDF from simple word counts. The word "great" appears in many product reviews — high TF, but low IDF because it's in almost every document: IDF = log(6/6)+1 = 1.0, the floor. The word "detachable" appears only in Surface Book reviews — IDF = log(6/1)+1 ≈ 2.8. When multiplied by TF, "detachable" becomes the dominant term for Surface Book documents. This is exactly why search engines rank documents with rare matching terms higher.
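The two numbers in the tip above are quick to verify (natural log, as in the lab's formula):

```python
import math

N = 6  # documents in the corpus

# "great" appears in all six documents -> IDF hits its floor of 1.0
idf_great = math.log(N / 6) + 1
# "detachable" appears in only one document -> much higher IDF
idf_detachable = math.log(N / 1) + 1

print(idf_great)       # 1.0
print(idf_detachable)  # ~2.79
```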

📸 Verified Output:


Summary

| Concept     | Formula                    | Purpose                  |
|-------------|----------------------------|--------------------------|
| TF          | count(word) / total_words  | Word frequency in doc    |
| IDF         | log(N/df) + 1              | Penalise common words    |
| TF-IDF      | TF × IDF                   | Document word importance |
| Cosine sim  | a·b / (‖a‖‖b‖)             | Document similarity      |
| Naive Bayes | P(c\|x) ∝ P(c)·ΠP(xᵢ\|c)   | Text classification      |