Implement K-Means from scratch: random centroid initialisation, K-Means++ smart initialisation, the assignment-update EM loop, inertia (within-cluster sum of squares), the Elbow method for choosing K, and cluster quality metrics (silhouette score), applied to grouping Microsoft products into natural market segments.
Background
K-Means is an unsupervised algorithm: it finds structure in unlabelled data. It alternates between two steps: Assignment (assign each point to its nearest centroid) and Update (move each centroid to the mean of its assigned points). This repeats until the centroids stop moving. K-Means++ improves initialisation by choosing each new centroid with probability proportional to its squared distance from the nearest already-chosen centroid, which dramatically reduces the chance of a bad random start.
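The loop above can be sketched in NumPy. This is a minimal illustration, not the lab's reference solution; the function names (`kmeans_pp_init`, `kmeans`) and the tolerance/iteration defaults are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_pp_init(X, k):
    """K-Means++: each new centroid is sampled with probability
    proportional to its squared distance from the nearest existing centroid."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random point
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centroids)) ** 2).sum(axis=2), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, max_iter=100, tol=1e-6):
    C = kmeans_pp_init(X, k)
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid
        labels = np.argmin(((X[:, None] - C) ** 2).sum(axis=2), axis=1)
        # Update step: move each centroid to the mean of its assigned points
        C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(C_new, C, atol=tol):  # centroids stopped moving
            break
        C = C_new
    inertia = ((X - C[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, C, inertia
```

Note the empty-cluster guard in the update step: if no point is assigned to centroid `j`, the sketch simply keeps it in place rather than taking the mean of an empty set.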
Time
30 minutes
Prerequisites
Lab 01 (Linear Regression): numpy fundamentals
Tools
Docker: zchencow/innozverse-python:latest
Lab Instructions
💡 The Elbow method looks for the "knee" in the inertia-vs-K curve. Adding more clusters always reduces inertia (more centroids = tighter fit). The elbow is the point where adding one more cluster yields a much smaller improvement than before; that is your natural K. The silhouette score is more rigorous: for each point it compares the mean distance to its own cluster against the mean distance to the next-nearest cluster. Values near 1 mean clear separation; values near 0 mean ambiguous assignment; negative values mean the point is likely in the wrong cluster.
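The silhouette computation described above can be sketched from scratch as follows. This is a hedged illustration, not the lab's reference code; it assumes Euclidean distance and that every cluster contains at least two points (a singleton cluster would make the intra-cluster mean undefined).

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score: for each point, s = (b - a) / max(a, b), where
    a = mean distance to its own cluster (excluding itself) and
    b = smallest mean distance to any other cluster."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))  # pairwise distances
    scores = []
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean()          # intra-cluster cohesion
        b = min(D[i, labels == c].mean()                     # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

For two tight, well-separated clusters this returns a value close to 1; as clusters begin to overlap it drifts toward 0, matching the interpretation given above.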