Lab 01: Linear & Logistic Regression for Security

Objective

Implement and understand the two most fundamental ML algorithms — linear regression for predicting continuous values and logistic regression for binary classification — both from scratch using NumPy and via scikit-learn.

Time: 45 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Prerequisites

  • Python basics (functions, loops, arrays)

  • AI Foundations Labs 1–3 (ML taxonomy, how AI works)

  • Docker installed


Step 1: Environment Setup

docker run -it --rm zchencow/innozverse-ai:latest bash

Verify imports:

python3 -c "import numpy as np; import sklearn; print('numpy', np.__version__, '| sklearn', sklearn.__version__)"

📸 Verified Output:

numpy 2.0.0 | sklearn 1.5.1

💡 zchencow/innozverse-ai:latest includes numpy, pandas, scikit-learn, scipy, xgboost — everything needed for the Practitioner track.


Step 2: The Maths Behind Linear Regression

Linear regression fits a straight line through data: ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ

The cost function (Mean Squared Error) measures how wrong our predictions are:

MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²

Gradient descent iteratively improves weights by moving in the direction that reduces MSE:

wⱼ ← wⱼ − α · ∂MSE/∂wⱼ = wⱼ − α · (2/n) Σᵢ (ŷᵢ − yᵢ) xᵢⱼ

where α is the learning rate.

Implement from scratch:
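A minimal from-scratch sketch with NumPy. The synthetic data here is an assumption chosen to match the result described below — targets generated as y = 3x₁ + 2x₂ + Gaussian noise, so the true weights are 3 and 2 with zero intercept:

```python
import numpy as np

# Assumed synthetic data: y = 3*x1 + 2*x2 + noise (true bias is zero)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Initialise parameters and learning rate
w = np.zeros(2)
b = 0.0
lr = 0.1

# Batch gradient descent on MSE
for _ in range(1000):
    y_hat = X @ w + b
    error = y_hat - y
    w -= lr * (2 / len(y)) * (X.T @ error)  # dMSE/dw
    b -= lr * (2 / len(y)) * error.sum()    # dMSE/db

print("weights:", w.round(2), "| bias:", round(b, 3))
```

With this setup the recovered weights land close to the true values of 3 and 2, and the fitted bias stays near zero.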

📸 Verified Output:

💡 The model correctly recovers the true parameters (3 and 2). The small bias term (~0.05) is noise from the random seed.


Step 3: scikit-learn Linear Regression
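A sketch of the same task via scikit-learn. The dataset here is a hypothetical noisy regression problem (the lab's actual data may differ); the point is the fit/predict/score workflow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical noisy linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.7, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit via the closed-form least-squares solver
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print("R2:", round(r2, 3), "| RMSE:", round(rmse, 3))
```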

📸 Verified Output:

💡 R² of 0.97 means the model explains 97% of the variance in the target — excellent for a noisy dataset.


Step 4: Logistic Regression — From Regression to Classification

Logistic regression adds a sigmoid function to squash outputs into [0, 1]:

σ(z) = 1 / (1 + e⁻ᶻ),  where z = w₀ + w₁x₁ + … + wₙxₙ

The binary cross-entropy loss is used instead of MSE:

BCE = −(1/n) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)],  where pᵢ = σ(zᵢ)
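A from-scratch sketch of logistic regression trained with gradient descent on binary cross-entropy. The data is an assumed toy problem (labels depend on whether x₁ + x₂ is positive):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy data: label is 1 when x1 + x2 > 0
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5

# Gradient descent on binary cross-entropy
for _ in range(2000):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)  # dBCE/dw
    b -= lr * (p - y).mean()            # dBCE/db

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print("training accuracy:", round(acc, 3))
```

Note that the BCE gradient, Xᵀ(p − y)/n, has the same shape as the MSE gradient from Step 2 — only the prediction passes through the sigmoid first.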

📸 Verified Output:


Step 5: scikit-learn Logistic Regression
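A sketch using scikit-learn, with a hypothetical `make_classification` dataset standing in for the lab's data. Features are scaled before fitting, and `classification_report` is printed alongside plain accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical binary-classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

# Scale features, then fit — logistic regression converges faster on scaled data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print("accuracy:", round(acc, 3))
print(classification_report(y_test, pred))  # precision, recall, F1 per class
```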

📸 Verified Output:

💡 classification_report gives precision, recall, and F1 per class — always examine this, not just accuracy, especially for imbalanced datasets.


Step 6: Regularisation — Preventing Overfitting

Without regularisation, models memorise training data. Add a penalty term to the loss:

  • L1 (Lasso): Penalty = λΣ|wᵢ| → drives some weights to exactly zero (feature selection)

  • L2 (Ridge): Penalty = λΣwᵢ² → keeps all weights small (preferred for most cases)
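The L1-versus-L2 contrast can be sketched on synthetic data matching the setup described below — 50 features of which only 10 are informative. The alpha values are assumptions, not the lab's exact settings:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 10 carry signal
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=3)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks weights but keeps them all non-zero;
# Lasso drives the uninformative ones to exactly zero
print("Ridge non-zero coefs:", int(np.sum(ridge.coef_ != 0)))
print("Lasso non-zero coefs:", int(np.sum(lasso.coef_ != 0)))
```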

📸 Verified Output:

💡 Lasso zeroed out 40 of 50 features (only 10 were truly informative). This is automatic feature selection — powerful for high-dimensional data.


Step 7: Hyperparameter Tuning with GridSearchCV
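A sketch of tuning logistic regression's C with GridSearchCV. The dataset, grid values, and cross-validation settings are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset
X, y = make_classification(n_samples=800, n_features=20, random_state=5)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Search C (inverse regularisation strength) over a log-spaced grid,
# scored by cross-validated ROC-AUC
grid = GridSearchCV(pipe,
                    param_grid={"clf__C": [0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="roc_auc")
grid.fit(X, y)

print("best C:", grid.best_params_["clf__C"])
print("best CV ROC-AUC:", round(grid.best_score_, 3))
```

Putting the scaler inside the pipeline matters: it is refit on each training fold, so no information from the validation fold leaks into the scaling.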

📸 Verified Output:


Step 8: Real-World Capstone — Credit Risk Classifier

Build a complete credit risk scoring model:
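A sketch of the capstone pipeline. The dataset here is synthetic and hypothetical — defaults are generated so that late payments and debt ratio raise risk while credit score and income lower it, mirroring the intuition described below; the lab's real data and feature names may differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic credit data
rng = np.random.default_rng(11)
n = 2000
df = pd.DataFrame({
    "income":        rng.normal(60_000, 15_000, n),
    "credit_score":  rng.normal(650, 80, n),
    "debt_ratio":    rng.uniform(0, 1, n),
    "late_payments": rng.poisson(1.5, n),
})
# Default probability rises with late payments / debt ratio,
# falls with credit score / income (assumed generative model)
logit = (1.8 * df["late_payments"] + 2.5 * df["debt_ratio"]
         - 0.01 * (df["credit_score"] - 650)
         - 0.00002 * (df["income"] - 60_000) - 4.0)
df["default"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X, y = df.drop(columns="default"), df["default"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=11, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("ROC-AUC:", round(auc, 3))

# Inspect coefficients to see which features drive default risk
coefs = pd.Series(model[-1].coef_[0], index=X.columns).sort_values()
print(coefs)
```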

📸 Verified Output:

💡 Late payments and debt ratio are the strongest default predictors — matching real-world credit risk intuition. Negative coefficients (credit score, income) reduce default probability.


Summary

| Algorithm | Use Case | Key Hyperparameter | Metric |
| --- | --- | --- | --- |
| Linear Regression | Continuous output | alpha (regularisation) | R², RMSE |
| Logistic Regression | Binary classification | C (inverse reg.) | Accuracy, ROC-AUC |
| Ridge | Regression + L2 penalty | alpha | RMSE |
| Lasso | Regression + feature selection | alpha | RMSE, non-zero coefs |

Key Takeaways:

  • Gradient descent is the engine behind all these models

  • Always scale features (e.g. StandardScaler) before fitting logistic regression

  • Use classification_report, not just accuracy

  • Regularisation prevents overfitting — always tune C or alpha

Further Reading
