Lab 07: PCA

Objective

Implement PCA from scratch using NumPy's eigendecomposition: covariance matrix, eigenvectors and eigenvalues, explained variance ratio, dimensionality reduction from N features to K components, and reconstruction error. Applied to compressing product feature vectors and visualising high-dimensional data in 2D.

Background

PCA finds the directions (principal components) of maximum variance in your data by computing the eigenvectors of the covariance matrix. The first PC captures the most variance, the second captures the most remaining variance, and so on. Projecting data onto the top-K PCs reduces dimensionality while preserving as much information as possible. PCA is used in face recognition (eigenfaces), preprocessing before ML, noise reduction, and data visualisation.
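The idea above can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data (the latent-factor construction and all variable names are assumptions, not the lab's starter code); it shows that when the features share one dominant direction of variance, the first eigenvalue of the covariance matrix captures almost all of it:

```python
import numpy as np

# Toy data (illustrative): 100 samples whose 3 features are driven
# by a single latent factor, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent @ np.array([[1.0, 0.8, 0.1]]) + 0.1 * rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)               # centre each feature
C = Xc.T @ Xc / (len(Xc) - 1)         # covariance matrix (3x3)
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # re-sort descending so PC1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()   # explained variance ratio per PC
```

Here `explained[0]` comes out well above 0.9, i.e. one principal component summarises nearly all of the variance in three correlated features.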

Time

30 minutes

Prerequisites

  • Lab 01 (Linear Regression): numpy matrix operations

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 PCA assumes linear relationships. If your data has nonlinear structure (clusters arranged in a circle, spiral patterns), PCA will miss it; use kernel PCA or t-SNE/UMAP for visualisation instead. PCA also requires standardisation first: if features have different scales (Price: 400–3500, Weight: 0.5–4.6), the high-variance features dominate the covariance matrix. After z-score normalisation, all features contribute equally. Always standardise before PCA.
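The standardisation step can be sketched as follows (the product-feature values are hypothetical, chosen only to match the scales mentioned above):

```python
import numpy as np

# Hypothetical product features on wildly different scales:
# column 0 = price (400-3500), column 1 = weight in kg (0.5-4.6).
X_raw = np.array([
    [ 400.0, 0.5],
    [1200.0, 2.1],
    [2100.0, 3.3],
    [3500.0, 4.6],
])

mu = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X = (X_raw - mu) / std   # z-score: every column now has mean 0 and std 1
```

After this transform both columns have unit variance, so price no longer swamps weight in the covariance matrix.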

📸 Verified Output: (screenshot not included here)


Summary

| Step | Operation | Code / Notes |
|---|---|---|
| Standardise | (X - μ) / σ | (X_raw - mu) / std |
| Covariance | XᵀX / (n-1) | X.T @ X / (n-1) |
| Eigenvectors | Solve Cv = λv | np.linalg.eigh(C) |
| Project | Z = X @ W_k | Top-K eigenvectors |
| Reconstruct | X̂ = Z @ W_kᵀ | Lossy compression |