Lab 02: Decision Trees & Random Forests

Objective

Build, visualise, and tune decision trees and random forest ensembles. Understand why ensembles typically outperform single models, and how feature importances reveal what your data is telling you.

Time: 45 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Step 1: Environment Setup

docker run -it --rm zchencow/innozverse-ai:latest bash
python3 -c "from sklearn.tree import DecisionTreeClassifier; from sklearn.ensemble import RandomForestClassifier; print('OK')"

📸 Verified Output:

OK

Step 2: How Decision Trees Work

A decision tree splits data at each node by choosing the feature and threshold that maximally separates classes.

The split quality is measured by:

  • Gini impurity: G = 1 - Σ pᵢ² (default in sklearn)

  • Entropy / Information Gain: H = -Σ pᵢ log₂(pᵢ)

📸 Verified Output:

💡 A Gini impurity of 0.5 is the maximum for a binary node — the two classes are perfectly mixed (random guessing). Each split aims to reduce this towards 0 (a pure node).
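The two impurity formulas above can be checked directly in a few lines (a minimal sketch, not part of the lab's original listing):

```python
import numpy as np

def gini(p):
    """Gini impurity G = 1 - sum(p_i^2) for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy H = -sum(p_i * log2(p_i)), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5 -- maximally impure binary node
print(gini([1.0, 0.0]))     # 0.0 -- pure node
print(entropy([0.5, 0.5]))  # 1.0 bit
```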


Step 3: Train a Decision Tree

📸 Verified Output:

💡 The deep tree memorised training data (100% train accuracy) but generalised poorly. Pruning (max_depth=5) reduces overfitting — lower train accuracy but better test accuracy.


Step 4: Visualise the Decision Path

📸 Verified Output:
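One way to inspect a tree's decision path is sklearn's `export_text`, which prints the split rules as indented text. This sketch uses the built-in iris dataset as a stand-in (an assumption; the lab may visualise its own data, e.g. with `plot_tree`):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Print the learned split rules, one branch per line
print(export_text(clf, feature_names=list(iris.feature_names)))
```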


Step 5: Random Forests — The Ensemble Idea

A single decision tree is high-variance: its structure can change drastically with small changes in the training data. A random forest reduces this variance by:

  1. Training N trees on different bootstrap samples of the data

  2. Each tree only sees a random subset of features at each split

  3. Final prediction = majority vote (classification) or average (regression)

📸 Verified Output:

💡 Random Forest jumped from 0.795 (single tree) to 0.890 test accuracy. This is the power of ensembles.
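The three ingredients above are bundled into `RandomForestClassifier`. A sketch of the single-tree-vs-forest comparison, again on a synthetic `make_classification` stand-in (the lab's 0.795 → 0.890 figures come from its own data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# 200 trees, each on a bootstrap sample with feature subsampling at every split
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

print("single tree test accuracy:", tree.score(X_te, y_te))
print("random forest test accuracy:", forest.score(X_te, y_te))
```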


Step 6: Feature Importance

Random Forest scores each feature by how much it reduces impurity across all trees:

📸 Verified Output:

💡 Features 17, 16, and 5 dominate. In a real project, this tells you which columns are driving predictions — crucial for interpretability and data collection priorities.
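The per-feature scores live in the fitted model's `feature_importances_` attribute (mean impurity decrease, normalised to sum to 1). A sketch on a synthetic stand-in dataset — the dominant feature indices will differ from the lab's 17/16/5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank features by mean impurity decrease across all trees
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(f"feature {i:2d}  importance {forest.feature_importances_[i]:.3f}")
```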


Step 7: Hyperparameter Tuning

📸 Verified Output:
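A typical way to tune a random forest is a cross-validated grid search over the knobs that control tree size and ensemble size. A sketch with an assumed (illustrative) parameter grid on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Illustrative grid -- the lab's actual search space may differ
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```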


Step 8: Real-World Capstone — Intrusion Detection System

📸 Verified Output:

💡 unique_dests (number of unique destination IPs) is the most powerful attack indicator — port scanners and lateral movement tools contact many hosts rapidly.
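A sketch of the capstone's shape: a class-imbalanced flow dataset where attackers contact many destinations. `unique_dests` is from the lab; the other feature names, the generated data, and all distribution parameters are hypothetical. Note `class_weight='balanced'` (from the takeaways) to handle the benign/attack imbalance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
# Synthetic flow records (hypothetical): benign traffic contacts few hosts,
# scanners and lateral movement contact many
benign = np.column_stack([
    rng.poisson(3, n),           # unique_dests
    rng.normal(500, 150, n),     # bytes_per_flow (hypothetical feature)
    rng.normal(30, 10, n),       # flow_duration (hypothetical feature)
])
attack = np.column_stack([
    rng.poisson(40, n // 10),
    rng.normal(80, 40, n // 10),
    rng.normal(2, 1, n // 10),
])
X = np.vstack([benign, attack])
y = np.array([0] * n + [1] * (n // 10))   # ~10:1 class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42).fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))
print("importances:", dict(zip(["unique_dests", "bytes_per_flow", "flow_duration"],
                               clf.feature_importances_.round(3))))
```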


Summary

Model          | Pros                  | Cons                       | Best For
---------------|-----------------------|----------------------------|-------------------------------
Decision Tree  | Interpretable, fast   | Overfits easily            | Explainability required
Random Forest  | High accuracy, robust | Slower, less interpretable | General-purpose classification

Key Takeaways:

  • Unpruned trees tend to overfit — constrain them with max_depth and min_samples_leaf

  • Random Forest ≈ decision tree + bagging + feature randomness

  • Feature importances reveal which inputs drive predictions

  • class_weight='balanced' handles class imbalance automatically
