Lab 05: Decision Trees

Objective

Build a decision tree classifier from scratch: Gini impurity and information gain, greedy best-split search, recursive tree construction, max-depth pruning, and feature importance scores, trained to predict Surface product tier from specs.

Background

A decision tree learns a hierarchy of binary questions ("Is RAM > 16GB?") that partitions the training data into increasingly pure subsets. Gini impurity measures class mixing: G = 1 - Σpᵢ² (0 = pure, 0.5 = maximally mixed for binary). At each node, the algorithm tries every feature and every threshold, picking the split that maximally reduces impurity (information gain). Trees are powerful, interpretable, and the foundation of Random Forests and XGBoost.
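The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not the lab's reference solution; the function names `gini` and `information_gain` are my own:

```python
import numpy as np

def gini(y):
    """Gini impurity G = 1 - sum(p_i^2) over the class proportions in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    """Impurity reduction: G(parent) minus the size-weighted G of the children."""
    n = len(y_parent)
    w_left, w_right = len(y_left) / n, len(y_right) / n
    return gini(y_parent) - (w_left * gini(y_left) + w_right * gini(y_right))
```

For a balanced binary node like `[0, 0, 1, 1]`, `gini` returns 0.5 (maximally mixed), and a split into `[0, 0]` and `[1, 1]` yields the full 0.5 of gain.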

Time

35 minutes

Prerequisites

  • Lab 02 (Logistic Regression): classification concepts

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 Decision trees overfit without constraints. With unlimited depth, a tree can memorise every training example (training accuracy = 100%) but fail on new data. max_depth, min_samples_split, and min_samples_leaf are the main pruning knobs. Random Forests fix overfitting by averaging many trees trained on random subsets of data and features: each tree overfits differently, and the average cancels out the noise.
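A greedy best-split search plus a depth-limited recursive builder can be sketched as follows. This is one possible shape under my own assumptions (nodes as plain dicts, splits of the form `feature <= threshold`; the names `best_split`, `grow`, and `predict_one` are illustrative, not the lab's required API):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search: try every feature and every observed threshold,
    keep the split with the largest information gain."""
    best_gain, best_feat, best_thr = 0.0, None, None
    parent = gini(y)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            mask = X[:, f] <= thr
            yl, yr = y[mask], y[~mask]
            if len(yl) == 0 or len(yr) == 0:
                continue  # degenerate split, skip
            gain = parent - (len(yl) * gini(yl) + len(yr) * gini(yr)) / len(y)
            if gain > best_gain:
                best_gain, best_feat, best_thr = gain, f, thr
    return best_feat, best_thr, best_gain

def grow(X, y, depth=0, max_depth=3):
    """Recursive construction; max_depth is the pre-pruning stop condition."""
    feat, thr, gain = best_split(X, y)
    if depth >= max_depth or feat is None or gain <= 0:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority-class leaf
    mask = X[:, feat] <= thr
    return {"feat": feat, "thr": thr,
            "left": grow(X[mask], y[mask], depth + 1, max_depth),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth)}

def predict_one(node, x):
    """Walk from the root to a leaf, answering the binary question at each node."""
    while "leaf" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["leaf"]
```

Lowering `max_depth` forces earlier majority-class leaves, trading training accuracy for generalisation, which is exactly the pruning trade-off described above.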

📸 Verified Output:


Summary

Concept             | Formula/Rule                      | Notes
Gini impurity       | 1 - Σpᵢ²                          | 0 = pure, 0.5 = max mix
Information gain    | G(parent) - weighted G(children)  | Maximised at each split
Pruning             | max_depth, min_samples            | Prevents overfitting
Feature importance  | Sum of gain × samples             | Normalised across features
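The feature importance rule in the table (sum of gain weighted by samples, then normalised) can be sketched like this, assuming each internal split node records its split `gain` and its sample count `n`; the dict-based node layout and the name `feature_importance` are my own illustration, not the lab's required format:

```python
import numpy as np

def feature_importance(root, n_features, n_total):
    """Accumulate gain * (fraction of samples reaching the node) per feature,
    then normalise so the importances sum to 1."""
    imp = np.zeros(n_features)

    def walk(node):
        if "leaf" in node:
            return
        imp[node["feat"]] += node["gain"] * node["n"] / n_total
        walk(node["left"])
        walk(node["right"])

    walk(root)
    total = imp.sum()
    return imp / total if total > 0 else imp
```

A feature used for high-gain splits near the root (large `n`) dominates the score, which is why importances from depth-limited trees are usually the most interpretable.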

Last updated