Build ML models to forecast security-relevant time series: network traffic, login attempts, alert volumes, and attack patterns over time. Learn feature engineering for temporal data, lag features, seasonality decomposition, and evaluation specific to time series.
Time series data has structure that standard ML ignores:

- **Order matters**: yesterday's traffic affects today's
- **Seasonality**: attacks peak at certain hours and days
- **Trend**: gradual growth or decline over weeks
- **Autocorrelation**: today's value is correlated with yesterday's

Standard ML treats each row as independent; time series ML exploits these temporal dependencies.
Key features to create:

- `lag_1` = value at t-1 (yesterday)
- `lag_7` = value at t-7 (last week)
- `rolling_mean_7` = average of the last 7 days
- `hour_of_day`, `day_of_week` → seasonality
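Before engineering any of these features, it is worth confirming the autocorrelation claim on your own data. A minimal sketch using pandas' built-in `Series.autocorr` on a synthetic hourly series (the series here is illustrative, not the tutorial's dataset):

```python
import numpy as np
import pandas as pd

# Illustrative hourly series: a daily cycle plus noise
rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
traffic = pd.Series(
    100 + 50 * np.sin(2 * np.pi * hours.hour / 24) + rng.normal(0, 5, len(hours)),
    index=hours,
)

# Correlation of the series with itself shifted by k hours
for k in (1, 24, 168):
    print(f"lag {k:>3}: autocorr = {traffic.autocorr(lag=k):.2f}")
```

Values near 1.0 at lags 1, 24, and 168 indicate strong hourly, daily, and weekly dependence, which is exactly what lag features exploit.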
Step 1: Environment Setup
📸 Verified Output:
Step 2: Generate Realistic Security Time Series
📸 Verified Output:
Step 3: Time Series Feature Engineering
📸 Verified Output:
Step 4: Time-Aware Train/Test Split
📸 Verified Output:
Step 5: Model Training and Evaluation
📸 Verified Output:
💡 Gradient Boosting dramatically outperforms the baseline. Lag features are the key — lag_24 (same hour yesterday) and lag_168 (same hour last week) capture the strong daily and weekly patterns.
Step 6: Anomaly Detection in Forecasts
📸 Verified Output:
💡 These anomaly timestamps are when login_attempts spiked far above the ML-predicted baseline — likely attack windows. At 3AM and 10PM, normal traffic is low (model predicts ~30), but actual was 892 and 756 — classic brute force or credential stuffing timing.
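The Step 6 code itself is not reproduced here, but the technique named in the summary, residual z-score thresholding, can be sketched on illustrative numbers: score each new residual (actual minus forecast) against the mean and standard deviation of the training-period residuals, and flag large positive deviations.

```python
import numpy as np

# Illustrative residuals: the training period establishes the "normal"
# forecast-error distribution; new residuals are scored against it
rng = np.random.default_rng(42)
train_residuals = rng.normal(0, 15, 1000)        # typical forecast errors
mu, sigma = train_residuals.mean(), train_residuals.std()

# New hours: mostly normal errors, plus two large positive spikes
new_residuals = np.array([5.0, -12.0, 862.0, 8.0, 726.0, -3.0])
z = (new_residuals - mu) / sigma

threshold = 3.0   # flag only positive spikes (traffic far above forecast)
anomalies = np.where(z > threshold)[0]
print("Anomalous indices:", anomalies)
```

Computing `mu` and `sigma` from training-period residuals (rather than from the window being scored) keeps a burst of anomalies from inflating the threshold and hiding itself.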
Step 7: Walk-Forward Validation
📸 Verified Output:
💡 MAE improves with each fold because the model has more training data. This is normal — time series models benefit greatly from more historical data.
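The Step 7 code is likewise not shown; expanding-window walk-forward validation can be sketched with scikit-learn's `TimeSeriesSplit` on a synthetic daily-cycle series (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Illustrative data: a noisy daily cycle, like the tutorial's dataset
rng = np.random.default_rng(42)
t = np.arange(24 * 60)                          # 60 days of hourly samples
y = 100 + 50 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 10, len(t))

# Two lag features (lag_24, lag_168); drop the first 168 wrapped rows
X = np.column_stack([np.roll(y, 24), np.roll(y, 168)])[168:]
y = y[168:]

# Expanding-window walk-forward CV: each fold trains on all past data
# and tests on the next contiguous week, mimicking real deployment
tscv = TimeSeriesSplit(n_splits=4, test_size=24 * 7)
maes = []
for fold, (tr_idx, te_idx) in enumerate(tscv.split(X), 1):
    model = GradientBoostingRegressor(n_estimators=100, random_state=42)
    model.fit(X[tr_idx], y[tr_idx])
    mae = mean_absolute_error(y[te_idx], model.predict(X[te_idx]))
    maes.append(mae)
    print(f"Fold {fold}: train={len(tr_idx):>4} test={len(te_idx)} MAE={mae:.1f}")
```

Each fold's training set is strictly earlier than its test set, so no fold ever sees the future it is asked to predict.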
💡 lag_24 (same hour yesterday) and lag_168 (same hour last week) are the strongest predictors — the system correctly learned daily and weekly seasonality. SOC teams can use the 24h forecast for staffing decisions and anomaly alerts for incident response.
Summary

| Technique | Purpose | Key Consideration |
|-----------|---------|-------------------|
| Lag features | Capture autocorrelation | `lag_24` for daily, `lag_168` for weekly |
| Rolling statistics | Capture trend/smoothing | Shift by 1 to avoid leakage |
| Cyclical encoding | Capture hour/day periodicity | Use sin/cos, not raw integers |
| Temporal split | Correct evaluation | Never shuffle time series |
| Walk-forward CV | Robust evaluation | Mimics real deployment |
| Anomaly detection | Incident alerting | Residual z-score thresholding |
Key Takeaways:

- Never shuffle time series data for the train/test split: data leakage ruins the evaluation
- Lag features (especially `lag_24` and `lag_168`) usually dominate feature importance
- Gradient Boosting consistently outperforms linear models on complex temporal patterns
```shell
docker run -it --rm zchencow/innozverse-ai:latest bash
```
```python
import numpy as np, pandas as pd
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import warnings; warnings.filterwarnings('ignore')

print("Ready")
```
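The Step 2 generator itself is not included in this section. Everything downstream only needs a `df` with a DatetimeIndex and `login_attempts`, `hour`, and `day_of_week` columns, so a hypothetical stand-in with daily/weekly seasonality and injected attack spikes looks like this (all constants are illustrative and will not reproduce the exact statistics shown in the verified output):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's generator: hourly login
# attempts with daily and weekly seasonality plus injected attack spikes
rng = np.random.default_rng(42)
idx = pd.date_range("2024-01-01", periods=24 * 90, freq="h")  # 2160 hours

hour = idx.hour.to_numpy()
dow = idx.dayofweek.to_numpy()
base = 250 + 120 * np.sin(2 * np.pi * (hour - 9) / 24)   # daily cycle
base = base * np.where(dow < 5, 1.0, 0.6)                # quieter weekends
attempts = np.clip(base + rng.normal(0, 40, len(idx)), 0, None)

# Inject brute-force-style spikes at random hours
spikes = rng.choice(len(idx), size=100, replace=False)
attempts[spikes] += rng.uniform(300, 700, size=100)

df = pd.DataFrame({
    "login_attempts": attempts.round(),
    "hour": hour,
    "day_of_week": dow,
}, index=idx)
print(df["login_attempts"].describe().round(1))
```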
```
Dataset: 2160 hourly observations (2024-01-01 00:00:00 to 2024-03-31 23:00:00)

Statistics:
count    2160.0
mean      264.3
std       139.8
min         0.0
25%       157.2
50%       253.1
75%       366.8
max      1047.0

Attack spikes (>400): 108 hours
```
```python
import numpy as np, pandas as pd

def create_ts_features(df: pd.DataFrame, target: str,
                       lags: list = [1, 2, 3, 6, 12, 24, 48, 168],
                       rolling_windows: list = [6, 12, 24, 168]) -> pd.DataFrame:
    """
    Create lag and rolling features for time series ML.
    lags: how many time steps back (hours here)
    rolling_windows: window sizes for rolling statistics
    """
    feat = df.copy()
    # Lag features (autocorrelation)
    for lag in lags:
        feat[f'lag_{lag}'] = feat[target].shift(lag)
    # Rolling statistics (trend/seasonality smoothing)
    for w in rolling_windows:
        feat[f'rolling_mean_{w}'] = feat[target].shift(1).rolling(w).mean()
        feat[f'rolling_std_{w}'] = feat[target].shift(1).rolling(w).std()
        feat[f'rolling_max_{w}'] = feat[target].shift(1).rolling(w).max()
    # Time-based features (cyclical encoding)
    feat['hour_sin'] = np.sin(2 * np.pi * feat['hour'] / 24)
    feat['hour_cos'] = np.cos(2 * np.pi * feat['hour'] / 24)
    feat['dow_sin'] = np.sin(2 * np.pi * feat['day_of_week'] / 7)
    feat['dow_cos'] = np.cos(2 * np.pi * feat['day_of_week'] / 7)
    # Derived features
    feat['hour_sq'] = feat['hour'] ** 2
    feat['is_night'] = ((feat['hour'] < 7) | (feat['hour'] > 22)).astype(int)
    feat['is_monday'] = (feat['day_of_week'] == 0).astype(int)
    return feat.dropna()

featured = create_ts_features(df, 'login_attempts')
feature_cols = [c for c in featured.columns if c != 'login_attempts']
print(f"Features created: {len(feature_cols)}")
print(f"Samples after dropna: {len(featured)} (from {len(df)})")
print("\nFeature groups:")
print(f"  Lag features: {len([c for c in feature_cols if c.startswith('lag_')])}")
print(f"  Rolling features: {len([c for c in feature_cols if c.startswith('rolling_')])}")
time_feats = ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'hour_sq', 'is_night',
              'is_monday', 'is_weekend', 'is_business_hr', 'hour', 'day_of_week']
print(f"  Time features: {len([c for c in feature_cols if c in time_feats])}")
```
```
Features created: 28
Samples after dropna: 1992 (from 2160)

Feature groups:
  Lag features: 8
  Rolling features: 12
  Time features: 10
```
```python
import numpy as np

# CRITICAL: Never use a random split for time series (it causes data leakage!)
# Use a temporal split: train on the past, test on the future
test_size = 24 * 14                 # 14 days of test data
train_size = len(featured) - test_size

X = featured[feature_cols].values
y = featured['login_attempts'].values
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
test_timestamps = featured.index[train_size:]

print(f"Train: {train_size} hours ({train_size//24} days)")
print(f"Test:  {test_size} hours ({test_size//24} days)")
print(f"Train period: {featured.index[0]} to {featured.index[train_size-1]}")
print(f"Test period:  {test_timestamps[0]} to {test_timestamps[-1]}")
print()
print("⚠ Never use train_test_split(shuffle=True) for time series!")
print("  It would use future data to predict the past → data leakage")
```
```
Train: 1656 hours (69 days)
Test:  336 hours (14 days)
Train period: 2024-01-08 08:00:00 to 2024-03-18 15:00:00
Test period:  2024-03-18 16:00:00 to 2024-03-31 23:00:00

⚠ Never use train_test_split(shuffle=True) for time series!
  It would use future data to predict the past → data leakage
```
```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Linear models need scaled inputs; tree models do not
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_train)
X_te_s = scaler.transform(X_test)

def ts_metrics(y_true, y_pred, name):
    """Report MAE, RMSE, and MAPE (+1 in the denominator avoids div-by-zero)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    mape = np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + 1))) * 100
    print(f"{name:<30} MAE={mae:>8.1f} RMSE={rmse:>8.1f} MAPE={mape:>6.1f}%")
    return {'mae': mae, 'rmse': rmse, 'mape': mape}

# Baseline: predict the last known value for every test hour
baseline_pred = np.full(len(y_test), y_train[-1])
print("Model comparison:")
ts_metrics(y_test, baseline_pred, "Baseline (last value)")

# Linear model with lag features (scaled inputs)
ridge = Ridge(alpha=10.0)
ridge.fit(X_tr_s, y_train)
ts_metrics(y_test, ridge.predict(X_te_s).clip(0), "Ridge Regression")

# Gradient Boosting
gb = GradientBoostingRegressor(n_estimators=200, max_depth=4, learning_rate=0.05,
                               subsample=0.8, random_state=42)
gb.fit(X_train, y_train)
ts_metrics(y_test, gb.predict(X_test).clip(0), "Gradient Boosting")

# Random Forest
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
ts_metrics(y_test, rf.predict(X_test).clip(0), "Random Forest")

# Keep the Gradient Boosting forecasts for Step 6's anomaly detection
gb_pred = gb.predict(X_test).clip(0)
```
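The takeaway that `lag_24` and `lag_168` dominate can be checked via the trained model's `feature_importances_` attribute. A self-contained sketch on synthetic data (the feature set and constants are illustrative; the tutorial's own `gb` model would be inspected the same way):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic hourly series with a daily cycle, standing in for login_attempts
rng = np.random.default_rng(42)
t = np.arange(24 * 90)
y = 100 + 80 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 10, len(t))

# Three illustrative lag features; drop the first 168 wrapped rows
names = ["lag_1", "lag_24", "lag_168"]
X = np.column_stack([np.roll(y, k) for k in (1, 24, 168)])[168:]
y_t = y[168:]

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X, y_t)
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name:<8} {imp:.3f}")
```

With a strong daily cycle, the same-hour lags carry most of the signal, so their importances should dwarf `lag_1`.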