Lab 06: K-Nearest Neighbours

Objective

Implement KNN for both classification and regression, covering Euclidean, Manhattan, and Minkowski distances; voting weighted by distance; the curse of dimensionality; K selection via cross-validation; and a product recommendation engine built on KNN similarity search.
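The three distance metrics listed above can be sketched as one Minkowski function, since Manhattan and Euclidean are its p=1 and p=2 special cases. The function name `minkowski` is illustrative, not part of the lab's required API:

```python
def minkowski(a, b, p=2):
    # General Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
```

For example, `minkowski((0, 0), (3, 4), p=2)` returns 5.0 (the Euclidean hypotenuse), while `p=1` returns 7.0 (the Manhattan grid distance).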

Background

KNN is a lazy learner — it memorises the training set and makes predictions by looking up the K most similar training examples at query time. There is no explicit training phase. For classification, the K neighbours vote (majority wins). For regression, their target values are averaged (optionally weighted by 1/distance). KNN is powerful but slow at inference: finding the K nearest points in N-dimensional data costs O(N·D) per query.
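The lookup-vote-average loop described above can be sketched in a single function. This is a minimal illustration, not the lab's reference solution; the name `knn_predict` and its parameters are assumptions:

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, weighted=False, task="classify"):
    # Lazy learner: no training step; rank all N points by distance at query time.
    by_dist = sorted(zip(X_train, y_train), key=lambda xy: math.dist(xy[0], query))
    neighbours = [(math.dist(x, query), y) for x, y in by_dist[:k]]
    if task == "classify":
        votes = Counter()
        for d, label in neighbours:
            # Weighted voting: closer neighbours count more (1/distance).
            votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
        return votes.most_common(1)[0][0]
    # Regression: (optionally distance-weighted) mean of the K targets.
    weights = [1.0 / (d + 1e-9) if weighted else 1.0 for d, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)
```

The small epsilon added to each distance avoids division by zero when the query coincides exactly with a training point.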

Time

25 minutes

Prerequisites

  • Lab 04 (K-Means) — distance concepts

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 KNN scales poorly with data size. Finding K nearest neighbours requires computing distance to every training point — O(N·D) per query. With 1M training points and D=100 features, each prediction computes 100M multiplications. Production systems use Approximate Nearest Neighbour (ANN) libraries like FAISS (Meta), HNSW, or Annoy (Spotify) that trade a tiny accuracy loss for 100-1000× speed gains using spatial indexing. This is how Spotify's recommendations, vector databases, and semantic search work.
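The core ANN idea (scan only a small candidate set rather than all N points) can be illustrated with a toy random-hyperplane hash, a much-simplified cousin of what FAISS and HNSW actually do. The names `build_index` and `ann_query` are illustrative assumptions:

```python
import math
import random

def build_index(db, num_planes=4, seed=0):
    # Toy LSH: bucket each vector by the sign pattern of a few random projections.
    rng = random.Random(seed)
    dim = len(db[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    index = {}
    for i, v in enumerate(db):
        key = tuple(sum(p * x for p, x in zip(plane, v)) > 0 for plane in planes)
        index.setdefault(key, []).append(i)
    return planes, index

def ann_query(db, planes, index, q, k=1):
    # Scan only the query's bucket instead of all N vectors (may miss true neighbours).
    key = tuple(sum(p * x for p, x in zip(plane, q)) > 0 for plane in planes)
    candidates = index.get(key, range(len(db)))  # empty bucket: fall back to full scan
    return sorted(candidates, key=lambda i: math.dist(db[i], q))[:k]
```

Vectors on the same side of every random hyperplane land in the same bucket, so the per-query cost drops from O(N·D) to roughly O(N·D / 2^num_planes) at the price of occasionally missing the true nearest neighbour.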

📸 Verified Output:


Summary

  • Training — store all data (lazy learning)
  • Prediction — find the K nearest, then vote/average
  • Weighted KNN — weight votes by 1/distance
  • K selection — cross-validation to pick the best K
  • Scalability — O(N·D) per query; use ANN in production
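The K-selection step summarised above can be sketched as a simple fold-based search: score each candidate K on held-out folds and keep the best. The helper names `knn_classify` and `cv_select_k` are illustrative, not part of the lab's required API:

```python
import math
import random
from collections import Counter

def knn_classify(X_train, y_train, q, k):
    # Majority vote among the K nearest training points.
    nn = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], q))[:k]
    return Counter(y_train[i] for i in nn).most_common(1)[0][0]

def cv_select_k(X, y, candidate_ks, folds=5, seed=0):
    # K-fold cross-validation: for each K, average accuracy over held-out folds.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    fold_of = {i: n % folds for n, i in enumerate(idx)}
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for f in range(folds):
            train = [i for i in idx if fold_of[i] != f]
            test = [i for i in idx if fold_of[i] == f]
            Xt, yt = [X[i] for i in train], [y[i] for i in train]
            correct += sum(knn_classify(Xt, yt, X[i], k) == y[i] for i in test)
        acc = correct / len(X)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Typical practice is to sweep odd values of K (to avoid tie votes in binary classification) and plot accuracy against K before committing to one value.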
