Build, visualise, and tune decision trees and random forest ensembles. Understand why ensembles almost always outperform single models and how feature importance reveals what your data is telling you.
A decision tree splits data at each node by choosing the feature and threshold that maximally separates classes.
The split quality is measured by:
Gini impurity: G = 1 - Σ pᵢ² (default in sklearn)
Entropy / Information Gain: H = -Σ pᵢ log₂(pᵢ)
📸 Verified Output:
💡 A Gini of 0.5 means a two-class node is maximally impure (equivalent to random guessing). Each split aims to reduce this towards 0 (a pure node).
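The two formulas above are easy to check by hand. A minimal sketch (the helper functions `gini` and `entropy` are defined here for illustration, they are not part of sklearn's public API):

```python
import numpy as np

def gini(p):
    """Gini impurity: G = 1 - sum(p_i^2)."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: H = -sum(p_i * log2(p_i)), skipping zero probabilities."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(f"Pure node  [1.0, 0.0] -> gini={gini([1.0, 0.0]):.3f}  entropy={entropy([1.0, 0.0]):.3f}")
print(f"50/50 node [0.5, 0.5] -> gini={gini([0.5, 0.5]):.3f}  entropy={entropy([0.5, 0.5]):.3f}")
```

A pure node scores 0 on both measures; a 50/50 two-class node hits the maximum (Gini 0.5, entropy 1.0 bit).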
Step 3: Train a Decision Tree
📸 Verified Output:
💡 The deep tree memorised training data (100% train accuracy) but generalised poorly. Pruning (max_depth=5) reduces overfitting — lower train accuracy but better test accuracy.
Step 4: Visualise the Decision Path
📸 Verified Output:
Step 5: Random Forests — The Ensemble Idea
A single decision tree is high-variance (results change a lot with different training data). Random Forest fixes this by:
Training N trees on different bootstrap samples of the data
Each tree only sees a random subset of features at each split
Final prediction = majority vote (classification) or average (regression)
📸 Verified Output:
💡 Random Forest jumped from 0.795 (single tree) to 0.890 test accuracy. This is the power of ensembles.
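The majority-vote step can be sketched directly, using the five hypothetical per-tree predictions from the illustration above:

```python
import numpy as np

# Hypothetical predictions from 5 trees for 5 samples (rows = trees)
tree_preds = np.array([
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 1],
])

# Majority vote per sample: predict 1 when more than half the trees say 1
votes = (tree_preds.mean(axis=0) > 0.5).astype(int)
print(votes.tolist())  # [1, 0, 1, 1, 0]
```

Each individual tree makes at least one mistake relative to the vote, yet the aggregated prediction is more stable — errors that are uncorrelated across trees tend to cancel out.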
Step 6: Feature Importance
Random Forest scores each feature by how much it reduces impurity across all trees:
📸 Verified Output:
💡 Features 17, 16, and 5 dominate. In a real project, this tells you which columns are driving predictions — crucial for interpretability and data collection priorities.
Step 7: Hyperparameter Tuning
📸 Verified Output:
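The tuning code itself isn't shown above; a minimal sketch using `GridSearchCV` on the same synthetic dataset illustrates the pattern (the grid values here are illustrative assumptions, not the ones from the original run):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical search space — small on purpose to keep runtime reasonable
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring='accuracy',
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

Cross-validated search keeps the test set untouched until the final evaluation; picking hyperparameters by test accuracy directly would leak information.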
Step 8: Real-World Capstone — Intrusion Detection System
📸 Verified Output:
💡 unique_dests (number of unique destination IPs) is the most powerful attack indicator — port scanners and lateral movement tools contact many hosts rapidly.
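The capstone dataset isn't included here; a self-contained sketch on synthetic flow data shows the same pattern (all feature names and distributions below are invented for illustration — the premise is that scanners touch many unique destinations, and `class_weight='balanced'` compensates for the rarity of attacks):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000
attack = rng.random(n) < 0.05  # ~5% attack traffic: heavy class imbalance

# Synthetic flow features: attackers contact many unique destination IPs
unique_dests = np.where(attack, rng.poisson(40, n), rng.poisson(3, n))
bytes_sent   = np.where(attack, rng.exponential(200, n), rng.exponential(1500, n))
duration     = rng.exponential(5, n)  # pure noise — same for both classes

X = np.column_stack([unique_dests, bytes_sent, duration])
y = attack.astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# class_weight='balanced' re-weights the rare attack class during training
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

print(classification_report(y_test, rf.predict(X_test), target_names=['benign', 'attack']))
for name, imp in zip(['unique_dests', 'bytes_sent', 'duration'], rf.feature_importances_):
    print(f"{name:<13} {imp:.3f}")
```

On this synthetic data, `unique_dests` dominates the importances while the noise feature `duration` scores near zero — the same shape of result the capstone reports.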
Summary
Model         | Pros                   | Cons                       | Best For
--------------|------------------------|----------------------------|-------------------------------
Decision Tree | Interpretable, fast    | Overfits easily            | Explainability required
Random Forest | High accuracy, robust  | Slower, less interpretable | General-purpose classification
Key Takeaways:
Unpruned trees always overfit — use max_depth and min_samples_leaf
Random Forest ≈ decision tree + bagging + feature randomness
Feature importances reveal which inputs drive predictions
class_weight='balanced' handles class imbalance automatically
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings; warnings.filterwarnings('ignore')
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Unconstrained tree (will overfit)
dt_deep = DecisionTreeClassifier(random_state=42)
dt_deep.fit(X_train, y_train)
# Pruned tree
dt_pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=42)
dt_pruned.fit(X_train, y_train)
print(f"Deep tree — depth: {dt_deep.get_depth():>3} train acc: {accuracy_score(y_train, dt_deep.predict(X_train)):.3f} test acc: {accuracy_score(y_test, dt_deep.predict(X_test)):.3f}")
print(f"Pruned tree — depth: {dt_pruned.get_depth():>3} train acc: {accuracy_score(y_train, dt_pruned.predict(X_train)):.3f} test acc: {accuracy_score(y_test, dt_pruned.predict(X_test)):.3f}")
Deep tree — depth: 22 train acc: 1.000 test acc: 0.768
Pruned tree — depth: 5 train acc: 0.833 test acc: 0.795
from sklearn.tree import export_text
feature_names = [f'feature_{i}' for i in range(20)]
# Print top 3 levels of the tree
tree_text = export_text(dt_pruned, feature_names=feature_names, max_depth=3)
print(tree_text[:1500])
Individual trees (noisy but diverse):
Tree 1: [1, 0, 1, 1, 0]
Tree 2: [0, 0, 1, 1, 1]
Tree 3: [1, 0, 1, 0, 0]
Tree 4: [1, 0, 1, 1, 0]
Tree 5: [1, 0, 1, 1, 1]
───────────────
Vote: [1, 0, 1, 1, 0] ← more robust than any single tree
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=10,          # max depth per tree
    max_features='sqrt',   # features per split = √n_features (default)
    random_state=42
)
rf.fit(X_train, y_train)
print(f"Random Forest — train acc: {accuracy_score(y_train, rf.predict(X_train)):.3f} test acc: {accuracy_score(y_test, rf.predict(X_test)):.3f}")
Random Forest — train acc: 0.998 test acc: 0.890
import numpy as np
feature_names = [f'feature_{i}' for i in range(20)]
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
print("Feature Importances (top 10):")
print(f"{'Feature':<15} {'Importance':>12} {'Bar'}")
print("-" * 50)
for i in range(10):
    idx = sorted_idx[i]
    bar = '█' * int(importances[idx] * 200)
    print(f"feature_{idx:<6} {importances[idx]:>12.4f} {bar}")