Lab 04: K-Means Clustering

Objective

Implement K-Means from scratch: random centroid initialisation, K-Means++ smart initialisation, the assignment-update loop (an EM-style alternation), inertia (within-cluster sum of squares), the Elbow method for choosing K, and cluster quality metrics (silhouette score), applied to grouping Microsoft products into natural market segments.

Background

K-Means is an unsupervised algorithm: it finds structure in unlabelled data. It alternates between two steps: Assignment (assign each point to its nearest centroid) and Update (move each centroid to the mean of its assigned points). This repeats until the centroids stop moving (or an iteration limit is reached). K-Means++ improves initialisation by choosing each new centroid with probability proportional to its squared distance from the nearest already-chosen centroid, dramatically reducing bad random starts.
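The loop and the K-Means++ initialisation above can be sketched in a few lines of numpy. This is a minimal illustration, not the lab's reference solution; it assumes the data arrives as a 2-D array `X` of shape `(n_samples, n_features)`, and the function names are our own:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-Means++: pick each new centroid with probability proportional
    to its squared distance from the nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None] - np.array(centroids)) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        labels = ((X[:, None] - C) ** 2).sum(-1).argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (a sketch: we do not guard against empty clusters here)
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_C, C):  # converged: centroids stopped moving
            break
        C = new_C
    inertia = ((X - C[labels]) ** 2).sum()  # within-cluster sum of squares
    return C, labels, inertia
```

The convergence check compares old and new centroid positions rather than counting iterations, mirroring the "until centroids stop moving" description above.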

Time

30 minutes

Prerequisites

  • Lab 01 (Linear Regression): numpy fundamentals

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 The Elbow method looks for the "knee" in the inertia-vs-K curve. Adding more clusters always reduces inertia (more centroids = tighter fit). The elbow is where adding one more cluster yields a sharply smaller improvement; that is your natural K. The silhouette score is more rigorous: it measures how much closer each point is to its own cluster than to the next-nearest cluster. Values near 1 mean clear separation; near 0, ambiguous assignment; negative values suggest the point sits in the wrong cluster.
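The silhouette definition in the callout translates directly to code. A self-contained sketch (the function name is ours; it takes any labelling, not just one produced by K-Means): for each point, `a` is the mean distance to its own cluster and `b` the mean distance to the nearest other cluster, giving `s = (b - a) / max(a, b)`.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette over all points, in [-1, 1]; higher is better."""
    # Full pairwise Euclidean distance matrix (fine for small lab data)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    scores = []
    for i, li in enumerate(labels):
        same = labels == li
        # Mean distance to own cluster, excluding the point itself
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        # Mean distance to the nearest *other* cluster
        b = min(D[i, labels == lj].mean()
                for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

On two tight, well-separated blobs this returns a value close to 1; shuffling the labels drives it toward 0 or below, which is exactly the diagnostic the callout describes.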

📸 Verified Output:


Summary

| Concept | Detail |
| --- | --- |
| K-Means loop | Assign → Update → repeat |
| K-Means++ | Distance-weighted centroid init |
| Inertia | WCSS (within-cluster sum of squares); lower is tighter |
| Elbow method | Find K where the inertia drop slows |
| Silhouette | Quality metric in [-1, 1]; higher is better |
