Lab 16: Time Series Forecasting

Objective

Build ML models to forecast security-relevant time series: network traffic, login attempts, alert volumes, and attack patterns over time. Learn feature engineering for temporal data, lag features, seasonality decomposition, and evaluation specific to time series.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

Time series data has structure that standard ML ignores:

  • Order matters: yesterday's traffic affects today's

  • Seasonality: attacks peak at certain hours/days

  • Trend: gradual growth or decline over weeks

  • Autocorrelation: today's value is correlated with yesterday's

Standard ML: treats each row as independent
Time Series ML: exploits temporal dependencies

Key features to create:
  lag_1  = value at t-1  (yesterday)
  lag_7  = value at t-7  (last week)
  rolling_mean_7 = avg of last 7 days
  hour_of_day, day_of_week → seasonality
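The features above can be sketched on a toy daily series. This is an illustrative example (the series and column names are made up for the sketch); note the `shift(1)` inside the rolling mean, which keeps the current day's value out of its own feature:

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for a real security metric.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=30, freq="D")
df = pd.DataFrame({"value": rng.poisson(100, 30).astype(float)}, index=idx)

df["lag_1"] = df["value"].shift(1)               # yesterday
df["lag_7"] = df["value"].shift(7)               # last week
# shift(1) first, so the window covers the 7 days *before* today
df["rolling_mean_7"] = df["value"].shift(1).rolling(7).mean()
df["day_of_week"] = df.index.dayofweek           # seasonality signal
print(df.head(10))
```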

Step 1: Environment Setup
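One way to get started is to launch the lab image (`docker run -it zchencow/innozverse-ai:latest`) and sanity-check the stack. The snippet below assumes pandas, NumPy, and scikit-learn are installed in the image:

```python
# Verify the forecasting stack is importable and report versions.
import pandas as pd
import numpy as np
import sklearn

print("pandas:      ", pd.__version__)
print("numpy:       ", np.__version__)
print("scikit-learn:", sklearn.__version__)
```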

📸 Verified Output:


Step 2: Generate Realistic Security Time Series
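A sketch of how such a series might be generated (the exact generator in the lab may differ; the amplitudes, spike sizes, and 60-day window here are illustrative assumptions): hourly login attempts with a daily cycle, a quieter weekend, Gaussian noise, and a few injected attack spikes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")  # 60 days, hourly

hour = idx.hour.to_numpy()
dow = idx.dayofweek.to_numpy()

# Daily cycle: traffic rises during business hours.
daily = 50 * np.sin((hour - 6) / 24 * 2 * np.pi).clip(0)
# Weekly cycle: roughly half the volume on weekends.
weekly = np.where(dow < 5, 1.0, 0.5)
noise = rng.normal(0, 5, len(idx))

login_attempts = (30 + daily) * weekly + noise

# Inject a handful of attack-like spikes at random timestamps.
spikes = rng.choice(len(idx), size=5, replace=False)
login_attempts[spikes] += rng.uniform(300, 900, size=5)

df = pd.DataFrame({"login_attempts": login_attempts.clip(0)}, index=idx)
print(df.describe())
```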

📸 Verified Output:


Step 3: Time Series Feature Engineering
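For an hourly series, the lags shift to `lag_24` (same hour yesterday) and `lag_168` (same hour last week). A minimal sketch, assuming the hourly `login_attempts` frame from the previous step (a placeholder Poisson series is used here so the block runs standalone):

```python
import numpy as np
import pandas as pd

# Placeholder hourly series standing in for the generated data.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
df = pd.DataFrame({"login_attempts": rng.poisson(30, len(idx)).astype(float)},
                  index=idx)

# Lag features: same hour yesterday, same hour last week.
df["lag_24"] = df["login_attempts"].shift(24)
df["lag_168"] = df["login_attempts"].shift(168)

# Rolling mean over the previous 24 hours; shift(1) excludes the
# current value so the feature cannot leak the target.
df["rolling_mean_24"] = df["login_attempts"].shift(1).rolling(24).mean()

# Cyclical encoding: sin/cos so hour 23 sits next to hour 0.
df["hour_sin"] = np.sin(2 * np.pi * df.index.hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * df.index.hour / 24)
df["dow_sin"] = np.sin(2 * np.pi * df.index.dayofweek / 7)
df["dow_cos"] = np.cos(2 * np.pi * df.index.dayofweek / 7)

df = df.dropna()  # the first 168 rows lack lag_168
print(df.head())
```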

📸 Verified Output:


Step 4: Time-Aware Train/Test Split
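The split is a plain cut along the time axis, never a shuffle: the most recent slice of the timeline becomes the test set. A sketch with a placeholder frame (the 80/20 ratio is an illustrative choice):

```python
import numpy as np
import pandas as pd

# Placeholder feature frame; in the lab this is the engineered df.
idx = pd.date_range("2024-01-01", periods=720, freq="h")
df = pd.DataFrame({"y": np.arange(720.0), "lag_24": np.arange(720.0)},
                  index=idx)

# Hold out the last 20% of the timeline -- never shuffle.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Every training timestamp precedes every test timestamp.
assert train.index.max() < test.index.min()
print(f"train: {train.index.min()} .. {train.index.max()}")
print(f"test:  {test.index.min()} .. {test.index.max()}")
```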

📸 Verified Output:


Step 5: Model Training and Evaluation
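A sketch of the comparison, assuming a Gradient Boosting model against a naive "same hour yesterday" baseline (the synthetic series here is a stand-in for the lab's data, and the exact MAE numbers will differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Stand-in data: strong daily cycle plus noise.
rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
y = 50 + 30 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 3, len(idx))
df = pd.DataFrame({"y": y}, index=idx)
df["lag_24"] = df["y"].shift(24)
df["lag_168"] = df["y"].shift(168)
df = df.dropna()

split = int(len(df) * 0.8)
X, target = df[["lag_24", "lag_168"]], df["y"]
X_tr, X_te = X.iloc[:split], X.iloc[split:]
y_tr, y_te = target.iloc[:split], target.iloc[split:]

# Naive baseline: predict the same hour yesterday.
mae_naive = mean_absolute_error(y_te, X_te["lag_24"])

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
mae_gb = mean_absolute_error(y_te, model.predict(X_te))

print(f"naive MAE: {mae_naive:.2f}   GB MAE: {mae_gb:.2f}")
```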

📸 Verified Output:

💡 Gradient Boosting dramatically outperforms the baseline. Lag features are the key — lag_24 (same hour yesterday) and lag_168 (same hour last week) capture the strong daily and weekly patterns.


Step 6: Anomaly Detection in Forecasts
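The detection idea: compute residuals (actual minus forecast), convert them to z-scores, and flag points beyond a threshold. A sketch with injected spikes standing in for the lab's 892 and 756 events (series values and the 3-sigma threshold are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in: actuals vs. model predictions over a test window.
rng = np.random.default_rng(2)
idx = pd.date_range("2024-03-01", periods=200, freq="h")
predicted = pd.Series(30 + rng.normal(0, 2, 200), index=idx)
actual = predicted + rng.normal(0, 5, 200)
actual.iloc[50] += 800    # injected spike, e.g. brute force at 3AM
actual.iloc[120] += 700   # injected spike, e.g. credential stuffing

# Residual z-score thresholding.
residual = actual - predicted
z = (residual - residual.mean()) / residual.std()

THRESHOLD = 3.0
anomalies = actual[z.abs() > THRESHOLD]
print(anomalies)
```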

📸 Verified Output:

💡 These anomaly timestamps are when login_attempts spiked far above the ML-predicted baseline — likely attack windows. At 3AM and 10PM, normal traffic is low (model predicts ~30), but actual was 892 and 756 — classic brute force or credential stuffing timing.


Step 7: Walk-Forward Validation
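Walk-forward validation trains on an expanding window and tests on the slice that follows it, mimicking deployment. A sketch using scikit-learn's `TimeSeriesSplit` (the dataset below is a synthetic stand-in):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Stand-in lagged dataset; in the lab this is the engineered matrix.
rng = np.random.default_rng(3)
n = 1000
y = 50 + 30 * np.sin(2 * np.pi * np.arange(n) / 24) + rng.normal(0, 3, n)
# Columns: lag_24, lag_168 (slice off the wrap-around from np.roll).
X = np.column_stack([np.roll(y, 24), np.roll(y, 168)])[168:]
y = y[168:]

tscv = TimeSeriesSplit(n_splits=5)
maes = []
for fold, (tr, te) in enumerate(tscv.split(X), 1):
    model = GradientBoostingRegressor(random_state=0).fit(X[tr], y[tr])
    mae = mean_absolute_error(y[te], model.predict(X[te]))
    maes.append(mae)
    print(f"fold {fold}: train={len(tr):4d}  test={len(te):3d}  MAE={mae:.2f}")
```

Each fold's training window is strictly larger than the last, which is why MAE tends to improve across folds.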

📸 Verified Output:

💡 MAE improves with each fold because the model has more training data. This is normal — time series models benefit greatly from more historical data.


Step 8: Real-World Capstone — Security Alert Volume Forecaster
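A sketch of the capstone's core loop: build lag and rolling features for alert volume, fit a Gradient Boosting model, and rank feature importances (the synthetic alert series and feature set here are illustrative, not the lab's exact pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in alert-volume series with daily + weekly structure.
rng = np.random.default_rng(4)
idx = pd.date_range("2024-01-01", periods=24 * 90, freq="h")
y = (40 + 25 * np.sin(2 * np.pi * idx.hour / 24)
        + 10 * np.sin(2 * np.pi * idx.dayofweek / 7)
        + rng.normal(0, 3, len(idx)))
df = pd.DataFrame({"alerts": y}, index=idx)

features = {
    "lag_1": df["alerts"].shift(1),
    "lag_24": df["alerts"].shift(24),
    "lag_168": df["alerts"].shift(168),
    "rolling_mean_24": df["alerts"].shift(1).rolling(24).mean(),
}
X = pd.DataFrame(features).dropna()
target = df["alerts"].loc[X.index]

model = GradientBoostingRegressor(random_state=0).fit(X, target)
ranking = sorted(zip(X.columns, model.feature_importances_),
                 key=lambda kv: -kv[1])
for name, imp in ranking:
    print(f"{name:16s} {imp:.3f}")
```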

📸 Verified Output:

💡 lag_24 (same hour yesterday) and lag_168 (same hour last week) are the strongest predictors — the system correctly learned daily and weekly seasonality. SOC teams can use the 24h forecast for staffing decisions and anomaly alerts for incident response.


Summary

  Technique            Purpose                         Key Consideration
  Lag features         Capture autocorrelation         lag_24 for daily, lag_168 for weekly
  Rolling statistics   Capture trend/smoothing         Shift by 1 to avoid leakage
  Cyclical encoding    Capture hour/day periodicity    Use sin/cos, not raw integers
  Temporal split       Correct evaluation              Never shuffle time series
  Walk-forward CV      Robust evaluation               Mimics real deployment
  Anomaly detection    Incident alerting               Residual z-score thresholding

Key Takeaways:

  • Never shuffle time series data for train/test split — data leakage ruins evaluation

  • Lag features (especially lag_24 and lag_168) usually dominate feature importance

  • Gradient Boosting consistently outperforms linear models for complex time patterns

  • Forecast + anomaly detection = proactive security operations
