Lab 15: Capstone — Full ML Pipeline

Objective

Build a complete, production-grade ML pipeline from scratch: data ingestion, preprocessing, feature engineering, model training, hyperparameter tuning via grid search, ensemble methods (bagging + voting), final evaluation, and model serialisation — predicting Surface Pro sales volume from product specs and market conditions.

Background

This capstone integrates all 14 preceding labs into a single coherent pipeline. Real ML projects are roughly 80% data work and 20% modelling. The pipeline pattern — fit on training data, then transform both train and test with the same learned parameters — is the core abstraction behind scikit-learn's `sklearn.pipeline.Pipeline`. Ensemble methods (Random Forest = bagging of decision trees; voting regressors/classifiers = combining diverse models) typically outperform any single constituent model, by reducing variance (bagging) or bias (boosting).
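The fit/transform pattern can be sketched without any library at all. Below is a toy scaler (names and data are illustrative, not from the lab) that mimics the scikit-learn interface: statistics are learned from the training split only, then reused unchanged on the test split, which is exactly what prevents test-set leakage.

```python
# Minimal sketch of the fit/transform pattern behind sklearn's Pipeline.
# Parameters (mean, std) are learned from the training data only,
# then reused as-is to transform any later data.
class ToyStandardScaler:
    def fit(self, X):
        n = len(X)
        self.mean_ = sum(X) / n
        var = sum((x - self.mean_) ** 2 for x in X) / n
        self.std_ = var ** 0.5 or 1.0   # guard against zero variance
        return self

    def transform(self, X):
        return [(x - self.mean_) / self.std_ for x in X]

train = [1.0, 2.0, 3.0, 4.0]
test = [2.5, 5.0]

scaler = ToyStandardScaler().fit(train)   # fit on train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)           # same mean/std — no leakage
```

Chaining several such objects (imputer → scaler → polynomial features), each fitted once on train and applied to both splits, is all a `Pipeline` does.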

Time

45 minutes

Prerequisites

  • All Labs 01–14

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 Ensembles win because errors are uncorrelated across diverse models. KNN errors depend on neighbourhood density; linear model errors depend on linearity violations; random forest errors depend on bootstrap sampling luck. When you average three models that each make different mistakes, the errors partially cancel. The ensemble is almost always better than any individual model — this is why XGBoost (gradient boosted ensembles) consistently tops Kaggle leaderboards. The one case where ensembles don't help: when all models make the same systematic error (shared bias from data quality issues).
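The error-cancellation argument above can be demonstrated numerically. This sketch (synthetic data, not the lab's dataset) simulates three "models" whose errors are independent zero-mean noise; averaging them shrinks RMSE by roughly 1/√3:

```python
import random
random.seed(0)

truth = [10.0] * 1000
# three "models" whose errors are independent, zero-mean, equal-variance noise
preds = [[t + random.gauss(0, 1.0) for t in truth] for _ in range(3)]
# simple (unweighted) averaging ensemble
ensemble = [sum(p[i] for p in preds) / 3 for i in range(len(truth))]

def rmse(pred):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

print([round(rmse(p), 3) for p in preds])   # each close to 1.0
print(round(rmse(ensemble), 3))             # close to 1/sqrt(3) ≈ 0.577
```

If instead the three error lists were identical (fully correlated — the shared-bias case), averaging would leave the RMSE unchanged.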

📸 Verified Output:


Capstone Summary — Pipeline Checklist

| Stage | Component      | Key Concept                          |
|-------|----------------|--------------------------------------|
| 1     | Data ingestion | Synthetic + noise injection          |
| 2     | Preprocessing  | Impute → Scale → Poly features       |
| 3     | Models         | Linear, KNN, Random Forest           |
| 4     | CV selection   | K-fold RMSE comparison               |
| 5     | Test eval      | Held-out RMSE + R²                   |
| 6     | Ensemble       | Weighted average, error correlation  |
| 7     | Inference      | New product prediction + confidence  |
| 8     | Serialisation  | JSON round-trip, MD5 checksum        |
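Stage 8's JSON round-trip with an MD5 checksum can be sketched as follows. The parameter names here (`weights`, `bias`, `features`) are placeholders, not the capstone's actual model structure; the point is the pattern of serialising to a canonical byte string, hashing it, and verifying the hash before reuse:

```python
import hashlib
import json

# Hypothetical trained-model parameters — stand-ins for the lab's real weights.
model = {"weights": [0.42, -1.3, 2.7], "bias": 0.05,
         "features": ["price", "ram_gb", "season"]}

# Serialise with sorted keys so the byte representation is canonical,
# then record an MD5 checksum of those exact bytes.
blob = json.dumps(model, sort_keys=True).encode("utf-8")
checksum = hashlib.md5(blob).hexdigest()

# Later (e.g. at inference time): reload and verify integrity first.
restored = json.loads(blob)
rehash = hashlib.md5(json.dumps(restored, sort_keys=True).encode("utf-8")).hexdigest()
assert rehash == checksum, "model file corrupted or tampered with"
assert restored == model   # lossless round-trip
```

Note that `sort_keys=True` matters: without a canonical key order, two semantically identical dicts could hash differently.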

🎉 Congratulations — you've completed all 15 Python AI labs! You now have the mathematical foundations to understand and extend any modern ML framework.

Last updated