Lab 03: Gradient Boosting & XGBoost

Objective

Master gradient boosting — the algorithm that wins most tabular ML competitions. Learn how sequential weak learners become a powerful ensemble, then apply XGBoost to real-world classification.

Time: 45 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Step 1: The Boosting Idea

Random Forest builds trees in parallel, each independent. Boosting builds trees sequentially — each new tree corrects the errors of the previous ones.

Initial prediction: ŷ₀ = mean(y)

Round 1: Train tree₁ on residuals (y - ŷ₀)
         ŷ₁ = ŷ₀ + η * tree₁(x)

Round 2: Train tree₂ on residuals (y - ŷ₁)
         ŷ₂ = ŷ₁ + η * tree₂(x)

...

Round N: ŷ_N = ŷ₀ + η * Σ treeₙ(x)

η = learning_rate (small = better generalisation, more trees needed)

💡 This is literally gradient descent in function space. Each tree is a gradient step that reduces the loss. That is why it is called gradient boosting.
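The update rule above can be written out in a few lines of Python. This is a from-scratch sketch (not the lab's solution code): the toy dataset, η=0.1, and depth-2 trees are choices made here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y = x^2 plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

eta = 0.1                            # learning rate (η above)
n_rounds = 100
pred = np.full_like(y, y.mean())     # ŷ₀ = mean(y)

for _ in range(n_rounds):
    residuals = y - pred                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                    # each tree fits the current residuals
    pred += eta * tree.predict(X)             # ŷₙ = ŷₙ₋₁ + η · treeₙ(x)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(f"MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

Each pass through the loop is one "round" from the derivation: fit a weak learner to the residuals, then take a small step in its direction.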


Step 2: sklearn GradientBoostingClassifier
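
The lab's exact dataset isn't reproduced here, but a minimal sketch with sklearn's built-in breast-cancer dataset looks like this; the hyperparameter values are illustrative defaults, not tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=200,      # number of boosting rounds
    learning_rate=0.1,     # η from Step 1
    max_depth=3,           # shallow trees — standard for boosting
    random_state=42)
gb.fit(X_train, y_train)

acc = gb.score(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")
```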

📸 Verified Output:


Step 3: XGBoost — The Competition Winner

XGBoost (eXtreme Gradient Boosting) adds several improvements over vanilla gradient boosting:

Feature        | GradientBoosting      | XGBoost
---------------|-----------------------|----------------------------------
Regularisation | None                  | L1 + L2 on leaf weights
Missing values | Manual handling       | Built-in (learns a default split direction)
Speed          | Slow                  | 10–100× faster
Parallelism    | No                    | Yes (within each tree)
Pruning        | Pre-pruning only      | Post-pruning (gamma / min_split_loss)

📸 Verified Output:


Step 4: Learning Curves — Finding the Right n_estimators

📸 Verified Output:

💡 Early stopping prevents overfitting AND saves compute. Always use it when you have a validation set.


Step 5: Cross-Validation

📸 Verified Output:

💡 Low standard deviation (±0.008) means the model is stable across different data splits — a good sign of generalisation.


Step 6: Feature Importance (XGBoost)

XGBoost has three importance types:

  • weight: how often a feature is used in splits

  • gain: average loss reduction when the feature is used in a split (usually the most informative)

  • cover: average number of samples covered by splits on this feature

📸 Verified Output:


Step 7: Hyperparameter Tuning

Key XGBoost hyperparameters and their effect:

📸 Verified Output:


Step 8: Real-World Capstone — Malware Traffic Classifier

📸 Verified Output:

💡 Payload entropy and scanning behaviour are the top two indicators — consistent with how ransomware encrypts data (high entropy) and botnets scan for new hosts.


Summary

Algorithm                  | When to Use                              | Key Params
---------------------------|------------------------------------------|------------------------------------------
GradientBoostingClassifier | Small datasets, need sklearn pipeline    | n_estimators, learning_rate, max_depth
XGBClassifier              | Large datasets, competitions, production | + reg_alpha, colsample_bytree, early stopping

Key Takeaways:

  • Boosting = sequential trees, each correcting the last

  • Use max_depth=3–6 for boosting (shallow trees are better)

  • Always use early_stopping_rounds + eval set

  • XGBoost's gain importance is more meaningful than weight
