Master gradient boosting — the algorithm that wins most tabular ML competitions. Learn how sequential weak learners become a powerful ensemble, then apply XGBoost to real-world classification.
Random Forest builds trees in parallel, each independent. Boosting builds trees sequentially — each new tree corrects the errors of the previous ones.
```
Initial prediction: ŷ₀ = mean(y)
Round 1: Train tree₁ on residuals (y - ŷ₀)
         ŷ₁ = ŷ₀ + η * tree₁(x)
Round 2: Train tree₂ on residuals (y - ŷ₁)
         ŷ₂ = ŷ₁ + η * tree₂(x)
...
Round N: ŷ_N = ŷ₀ + η * Σₙ treeₙ(x)
```
η = learning_rate (small = better generalisation, more trees needed)
💡 This is literally gradient descent in function space. Each tree is a gradient step that reduces the loss. That is why it is called gradient boosting.
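The update rule above can be sketched from scratch in a few lines. This is an illustrative squared-error regression on synthetic data (the dataset and `max_depth=2` stumps are assumptions for the demo, not part of the lesson's examples):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine wave
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta = 0.1                          # learning rate η
pred = np.full_like(y, y.mean())   # ŷ₀ = mean(y)
trees = []
for _ in range(100):               # boosting rounds
    residuals = y - pred           # for squared loss, residuals = negative gradient
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += eta * tree.predict(X)  # ŷₙ = ŷₙ₋₁ + η · treeₙ(x)
    trees.append(tree)

print(f"Baseline MSE: {np.mean((y - y.mean())**2):.4f}")
print(f"Boosted  MSE: {np.mean((y - pred)**2):.4f}")
```

Each round fits a small tree to the current residuals and takes a step of size η toward them, which is exactly the gradient-descent-in-function-space view described above.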
Step 2: sklearn GradientBoostingClassifier
📸 Verified Output:
Step 3: XGBoost — The Competition Winner
XGBoost (eXtreme Gradient Boosting) adds several improvements over vanilla gradient boosting:
| Feature | GradientBoosting | XGBoost |
|---|---|---|
| Regularisation | None | L1 + L2 on leaf weights |
| Missing values | Manual handling | Built-in |
| Speed | Slow | 10–100× faster |
| Parallelism | No | Yes (within each tree) |
| Pruning | Pre-pruning only | Post-pruning (`gamma`) |
📸 Verified Output:
Step 4: Learning Curves — Finding the Right n_estimators
📸 Verified Output:
💡 Early stopping prevents overfitting AND saves compute. Always use it when you have a validation set.
Step 5: Cross-Validation
📸 Verified Output:
💡 Low standard deviation (±0.008) means the model is stable across different data splits — a good sign of generalisation.
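A 5-fold cross-validation sketch along these lines (the model and dataset mirror the Step 2 example; the exact mean and standard deviation will differ from the ±0.008 quoted above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=5000, n_features=30, n_informative=15, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                subsample=0.8, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserve class balance per fold
scores = cross_val_score(gb, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
print(f"ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```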
Step 6: Feature Importance (XGBoost)
XGBoost has three importance types:
- `weight`: how often a feature is used in splits
- `gain`: average improvement in loss when the feature is used (most informative)
- `cover`: average number of samples covered by splits on this feature
💡 Payload entropy and scanning behaviour are the top two indicators — consistent with how ransomware encrypts data (high entropy) and botnets scan for new hosts.
Summary
| Algorithm | When to Use | Key Params |
|---|---|---|
| `GradientBoostingClassifier` | Small datasets, need sklearn pipeline | `n_estimators`, `learning_rate`, `max_depth` |
| `XGBClassifier` | Large datasets, competitions, production | + `reg_alpha`, `colsample_bytree`, early stopping |
Key Takeaways:
- Boosting = sequential trees, each correcting the last
- Use max_depth=3–6 for boosting (shallow trees are better)
- Always use early_stopping_rounds with an eval set
- XGBoost's gain importance is more meaningful than weight
```bash
docker run -it --rm zchencow/innozverse-ai:latest bash
```
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import warnings; warnings.filterwarnings('ignore')

X, y = make_classification(n_samples=5000, n_features=30, n_informative=15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting rounds
    learning_rate=0.1,   # shrinkage: smaller = slower but better
    max_depth=3,         # shallow trees work best for boosting
    subsample=0.8,       # stochastic boosting: use 80% of data per tree
    random_state=42
)
gb.fit(X_tr, y_tr)
y_pred = gb.predict(X_te)
y_prob = gb.predict_proba(X_te)[:, 1]
print(f"GradientBoosting — Accuracy: {accuracy_score(y_te, y_pred):.4f}  ROC-AUC: {roc_auc_score(y_te, y_prob):.4f}")
```
```python
from xgboost import XGBClassifier

# Assumes X_tr, y_tr from the train/test split above
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                          eval_metric='logloss', random_state=42)
xgb_model.fit(X_tr, y_tr)

importances = xgb_model.get_booster().get_score(importance_type='gain')
sorted_imp = sorted(importances.items(), key=lambda x: -x[1])[:10]
max_gain = max(v for _, v in sorted_imp)
print("Top 10 features by gain:")
for feat, score in sorted_imp:
    bar = '█' * int(score / max_gain * 30)
    print(f"  {feat:<12} {score:>8.1f} {bar}")
```
```
Top 10 features by gain:
  f14            1247.3 ██████████████████████████████
  f10             891.2 █████████████████████
  f4              774.6 ██████████████████
  f19             663.1 ████████████████
  f2              541.8 █████████████
  ...
```