Lab 04: Feature Engineering for ML

Objective

Transform raw data into the features that actually matter. Feature engineering is the single most impactful thing you can do to improve model performance — more important than algorithm choice in most real-world tabular ML problems.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

"Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng

Raw data rarely arrives in the shape a model can learn from. Feature engineering bridges that gap:

Raw data → Feature engineering → Model-ready features

Examples:
  "2024-03-15 14:32:00"  →  hour=14, weekday=4, is_weekend=0
  "192.168.1.100"        →  is_private_ip=1, subnet=192.168
  income=50000, debt=20000  →  debt_ratio=0.40

Step 1: Environment Setup

docker run -it --rm zchencow/innozverse-ai:latest bash

📸 Verified Output:
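The docker command above drops you into a shell inside the container. One quick way to confirm the environment is usable is a version check; the library list here is an assumption (pandas, NumPy, and scikit-learn are what the later steps imply), so adjust it to whatever the image actually ships:

```python
# Assumed stack for this lab -- adjust if the image ships something different.
import numpy
import pandas
import sklearn

print("numpy       ", numpy.__version__)
print("pandas      ", pandas.__version__)
print("scikit-learn", sklearn.__version__)
```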


Step 2: Numerical Feature Transformations

📸 Verified Output:

💡 Log-transforming skewed features like bytes_transferred makes the distribution more normal — most linear and distance-based models perform better on normally distributed features.
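Since the verified output isn't reproduced here, a minimal sketch of the log transform follows. The `bytes_transferred` values are made up to illustrate a right-skewed column, not taken from the lab's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical network-flow sample; one huge transfer makes it right-skewed.
df = pd.DataFrame({"bytes_transferred": [120, 450, 900, 15_000, 2_500_000]})

# log1p(x) = log(1 + x), which stays defined at x = 0.
df["log_bytes"] = np.log1p(df["bytes_transferred"])

# Skewness drops sharply after the transform.
print(df["bytes_transferred"].skew(), df["log_bytes"].skew())
```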


Step 3: Categorical Feature Encoding

📸 Verified Output:

💡 Use label encoding for tree-based models: they split on thresholds, so the arbitrary ordering it imposes does no harm. Use one-hot encoding for linear models, which would otherwise read the integer codes as a meaningful scale. Use target encoding for high-cardinality categoricals (100+ unique values) to avoid the memory explosion of one-hot columns.
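The three encodings from the tip can be sketched on a toy `protocol` column (the column name and values are hypothetical, chosen to match the lab's networking theme):

```python
import pandas as pd

df = pd.DataFrame({"protocol": ["tcp", "udp", "tcp", "icmp"]})
y = pd.Series([1, 0, 1, 0])  # hypothetical binary target

# Label encoding: one integer per category (categories sort alphabetically).
df["protocol_label"] = df["protocol"].astype("category").cat.codes

# One-hot encoding: one binary column per category, for linear models.
onehot = pd.get_dummies(df["protocol"], prefix="protocol")

# Target encoding: mean target per category. Leakage risk -- in practice,
# compute the means on the training split only.
df["protocol_te"] = df["protocol"].map(y.groupby(df["protocol"]).mean())

print(pd.concat([df, onehot], axis=1))
```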


Step 4: Datetime Feature Extraction

📸 Verified Output:

💡 Cyclical encoding (sin/cos) is crucial for time features. Without it, a model would think hour=23 and hour=0 are far apart, when they are actually adjacent.
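A minimal sketch of the extraction plus cyclical encoding, reusing the timestamp from the Background example (the other two timestamps are invented to show the midnight boundary):

```python
import numpy as np
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2024-03-15 14:32:00",   # the Background example: hour=14, weekday=4
    "2024-03-15 23:59:00",
    "2024-03-16 00:01:00",
]))

feats = pd.DataFrame({
    "hour": ts.dt.hour,
    "weekday": ts.dt.dayofweek,
    "is_weekend": (ts.dt.dayofweek >= 5).astype(int),
})

# Cyclical encoding: map the hour onto the unit circle so 23 and 0 are
# neighbours instead of 23 units apart.
feats["hour_sin"] = np.sin(2 * np.pi * feats["hour"] / 24)
feats["hour_cos"] = np.cos(2 * np.pi * feats["hour"] / 24)
print(feats)
```

In (sin, cos) space the distance between hour 23 and hour 0 is about 0.26, versus 23 on the raw scale.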


Step 5: Feature Selection — Filter Methods

📸 Verified Output:

💡 Using only 33% of features achieved the same accuracy. Fewer features = faster inference, less memory, and sometimes better generalisation (removing noise features).
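A sketch of mutual-information filtering on synthetic data (the lab's real dataset isn't shown here). Keeping 4 of 12 features mirrors the "33% of features" figure from the tip:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: 12 features, only 4 of which carry signal.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Score each feature by mutual information with the target, keep the top 4.
selector = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))

X_small = selector.transform(X)
print(X_small.shape)
```

Unlike correlation, mutual information also catches non-linear feature-target relationships.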


Step 6: Feature Selection — Wrapper Methods (RFE)

📸 Verified Output:
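Since the verified output isn't reproduced here, a minimal RFE sketch on synthetic data shows the mechanics: fit an estimator, drop the weakest feature, and refit until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Recursive Feature Elimination around a linear model: each round, the
# feature with the smallest coefficient magnitude is pruned.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("ranking:", rfe.ranking_)   # 1 = selected; higher = eliminated earlier
print("selected mask:", rfe.support_)
```

Wrapper methods like RFE are slower than filter methods because they refit the model many times, but they account for feature interactions the model actually uses.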


Step 7: Handling Missing Values

📸 Verified Output:

💡 Never just drop rows with missing data — you lose information. Adding a _missing flag tells the model that the absence of data is itself meaningful (e.g., a sensor that only goes offline during attacks).
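The flag-then-impute pattern from the tip can be sketched as follows; the `src_bytes` column is hypothetical, echoing the capstone's narrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"src_bytes": [500.0, np.nan, 1200.0, np.nan]})

# 1) Record missingness BEFORE imputing -- absence can itself be a signal.
df["src_bytes_missing"] = df["src_bytes"].isna().astype(int)

# 2) Then impute; the median is robust to skewed distributions.
df["src_bytes"] = df["src_bytes"].fillna(df["src_bytes"].median())
print(df)
```

Order matters: impute first and the flag column can no longer be computed.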


Step 8: Real-World Capstone — Network Intrusion Feature Pipeline

📸 Verified Output:

💡 Even missing_src_bytes was selected as informative — the model learned that missing source bytes correlates with attack traffic. Missingness indicators are often surprisingly powerful.
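The capstone's pipeline isn't reproduced here, so below is one plausible shape for it: imputation with a missingness indicator, a log transform, and one-hot encoding, composed with scikit-learn's `Pipeline`/`ColumnTransformer`. The column names and the synthetic data are assumptions, not the lab's real intrusion dataset:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical flows: skewed byte counts, a categorical protocol, random labels.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "src_bytes": rng.lognormal(8, 2, n),
    "protocol": rng.choice(["tcp", "udp", "icmp"], n),
})
df.loc[rng.choice(n, 30, replace=False), "src_bytes"] = np.nan  # simulate outages
y = rng.integers(0, 2, n)

num_pipe = Pipeline([
    # add_indicator=True appends the missing_src_bytes flag column.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # log1p also touches the 0/1 indicator (maps 1 -> ~0.69); harmless for trees.
    ("log", FunctionTransformer(np.log1p)),
])

pre = ColumnTransformer([
    ("num", num_pipe, ["src_bytes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),
])

clf = Pipeline([("features", pre), ("model", RandomForestClassifier(random_state=0))])
clf.fit(df, y)
print("training accuracy:", clf.score(df, y))
```

Keeping every transform inside the pipeline means the same imputation medians and encodings learned on training data are reapplied at inference time, avoiding leakage.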


Summary

Technique              | When to Use                           | Impact
-----------------------|---------------------------------------|--------------------------------------
Log transform          | Right-skewed features                 | High for linear models
Cyclical encoding      | Hour, day, angle                      | Medium — prevents boundary artefacts
One-hot encoding       | Nominal categoricals + linear models  | Required
Target encoding        | High-cardinality categoricals         | High, but risk of leakage
Missingness indicator  | Any dataset with nulls                | Often surprisingly informative
Feature selection (MI) | High-dimensional data                 | Reduces overfitting + compute

Key Takeaways:

  • Feature engineering > algorithm choice for tabular data

  • Always encode missingness as a feature, not just impute it

  • Use mutual information for non-linear feature selection

  • Log-transform skewed distributions before linear models

Further Reading
