Lab 13: Data Preprocessing

Objective

Build a complete data preprocessing pipeline from scratch: handling missing values, outlier detection and treatment, feature scaling (Min-Max, Z-score, Robust), categorical encoding (one-hot, label, target encoding), feature selection via correlation and variance, and polynomial feature generation — all applied to cleaning a real-world Surface sales dataset.

Background

"Garbage in, garbage out." A perfectly tuned model fed dirty data will perform worse than a simple model fed clean data. Feature engineering — transforming raw inputs into informative representations — accounts for the majority of performance gains in competitive ML. This lab covers the same preprocessing steps that sklearn.preprocessing implements, but built from first principles so you understand why each transformation matters.

Time

35 minutes

Prerequisites

  • Lab 01 (Linear Regression) — numpy arrays

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 Scaling choice depends on the algorithm. Distance-based models (KNN, SVM, K-Means) are highly sensitive to scale — Price ranges 399–3499 while Rating ranges 3.9–4.8, so unscaled Price dominates all distance calculations. Tree-based models (Decision Trees, Random Forests) are scale-invariant — splits are based on rank, not magnitude. Linear models need scaling for regularisation to work correctly (L2 penalty treats all weights equally only if features have equal scale). When in doubt: Z-score for linear models, Min-Max for neural networks, no scaling for trees.
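The three scalers mentioned above can be written in a few lines of numpy each. This is a minimal sketch using hypothetical `price`/`rating` arrays chosen to match the ranges quoted in the tip — your lab dataset will differ:

```python
import numpy as np

# Hypothetical values matching the ranges in the tip (Price 399-3499, Rating 3.9-4.8).
price = np.array([399.0, 999.0, 1299.0, 1999.0, 3499.0])
rating = np.array([3.9, 4.2, 4.5, 4.6, 4.8])

def min_max(x):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Centre at 0 with unit variance: (x - mean) / std."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """Median/IQR scaling — resistant to outliers that distort mean and std."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

print(min_max(price))      # every value now in [0, 1]
print(z_score(rating))     # mean ~0, std ~1
print(robust_scale(price)) # centred on the median
```

After scaling, Price and Rating contribute comparably to a Euclidean distance, instead of Price dominating by three orders of magnitude.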

📸 Verified Output:


Summary

| Issue | Technique | When to use |
|---|---|---|
| Missing values | Median imputation | Numeric with outliers |
| Outliers | IQR clip | Continuous features |
| Scale difference | Z-score / Min-Max / Robust | Distance/linear models |
| Categorical | One-hot / target encoding | Nominal categories |
| Feature creation | Polynomial features | Linear models on non-linear data |
| Feature reduction | Correlation filter | Remove irrelevant features |
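The first rows of the summary can be sketched in numpy as well. This is an illustrative example with a made-up price column (the `nan` and the 25000 outlier are assumptions, not lab data), showing median imputation, IQR clipping, and one-hot encoding:

```python
import numpy as np

# Hypothetical price column with one missing value and one extreme outlier.
price = np.array([399.0, 999.0, np.nan, 1299.0, 1999.0, 25000.0])

# Median imputation: unlike the mean, the median is not dragged up by 25000.
price = np.where(np.isnan(price), np.nanmedian(price), price)

# IQR clip: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
price = np.clip(price, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# One-hot encoding for a nominal category (hypothetical Surface colours).
colors = np.array(["Platinum", "Black", "Platinum", "Sage"])
categories = np.unique(colors)                      # sorted unique labels
one_hot = (colors[:, None] == categories).astype(int)  # shape (n_rows, n_categories)
```

Note the ordering: impute before clipping (so the median is computed once, on known values) and clip before scaling (so the outlier cannot stretch the Min-Max range).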
