Lab 13: AI Data Pipeline Architecture

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

ML data pipelines are the foundation of every AI system. This lab covers batch vs streaming architectures (Lambda/Kappa), feature engineering at scale, data versioning with DVC, validation with Great Expectations, lineage tracking, and feature store design patterns.

Architecture

┌──────────────────────────────────────────────────────────────┐
│               ML Data Pipeline Architecture                  │
├────────────────────────┬─────────────────────────────────────┤
│    LAMBDA ARCHITECTURE │    KAPPA ARCHITECTURE               │
│  ─────────────────     │  ──────────────────────             │
│  Batch Layer (Spark)   │  Streaming Layer only               │
│  Speed Layer (Flink)   │  (Kafka + Flink)                    │
│  Serving Layer         │  Re-processing = replay topic       │
│  Merge at query time   │  Simpler, but streaming-first       │
├────────────────────────┴─────────────────────────────────────┤
│  Feature Engineering → Data Versioning → Validation          │
│        ↓                    (DVC)           (GE)             │
│  Feature Store (Feast) → Training Data → Model               │
└──────────────────────────────────────────────────────────────┘

Step 1: Batch ETL vs Streaming Architectures

Batch ETL (Traditional): scheduled jobs (e.g., a nightly Spark run) extract from source systems, transform in bulk, and load into a warehouse or data lake. High throughput, but hours of latency.
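As a minimal sketch of the batch pattern, here is a pandas job that transforms a day's raw rows into per-user aggregates in one bulk pass (column and function names are illustrative, not from a specific system):

```python
import pandas as pd

def run_batch_etl(raw: pd.DataFrame) -> pd.DataFrame:
    """Nightly batch job: extract -> transform -> load (here: return)."""
    # Transform: drop malformed rows, then derive features in bulk
    clean = raw.dropna(subset=["user_id", "amount"])
    features = (
        clean.groupby("user_id", as_index=False)["amount"]
        .agg(total_spend="sum", txn_count="count")
    )
    return features

raw = pd.DataFrame({
    "user_id": [1, 1, 2, None],   # the None row is dropped during cleaning
    "amount": [10.0, 5.0, 7.0, 3.0],
})
print(run_batch_etl(raw))
```

In production the same shape of job runs in PySpark; only the engine changes, not the extract-transform-load structure.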

Streaming (Real-time): events flow continuously through a broker (Kafka) into a stream processor (Flink), and features update within seconds of each event.
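The core streaming idea, updating a feature incrementally per event instead of recomputing in bulk, can be sketched without any broker. This toy sliding-window counter stands in for what Flink would maintain as keyed state:

```python
from collections import deque

class SlidingWindowCounter:
    """Streaming feature: events-per-user over the last `window_seconds`."""
    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        self.events: dict[str, deque] = {}

    def update(self, user: str, ts: float) -> int:
        q = self.events.setdefault(user, deque())
        q.append(ts)
        # Evict events that have fallen out of the time window
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q)

counter = SlidingWindowCounter(window_seconds=60)
stream = [("alice", 0.0), ("alice", 30.0), ("bob", 40.0), ("alice", 90.0)]
for user, ts in stream:
    print(user, counter.update(user, ts))
```

Each event updates the feature in O(1) amortized time; a real pipeline would read `stream` from a Kafka topic and checkpoint the state.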

Lambda Architecture: runs both: a batch layer (Spark) produces complete, reprocessable views, a speed layer (Flink) covers recent events, and a serving layer merges the two at query time.

Kappa Architecture: a single streaming layer (Kafka + Flink); historical reprocessing means replaying the topic through the same streaming code.

💡 Most enterprises start with Lambda (batch + streaming) and migrate to Kappa as streaming maturity grows. Don't over-engineer — pure batch solves 80% of problems.
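Lambda's merge-at-query-time step is simple enough to show directly. A hedged sketch with hypothetical view contents:

```python
# Merge-at-query-time: the batch view is authoritative but stale;
# the speed view covers only events since the last batch run.
batch_view = {"alice": 120, "bob": 45}    # e.g., nightly Spark output
speed_view = {"alice": 3, "carol": 7}     # e.g., Flink counts since midnight

def query(user: str) -> int:
    return batch_view.get(user, 0) + speed_view.get(user, 0)

print(query("alice"))  # 123: batch total plus today's increments
```

The cost of Lambda is exactly this duplication: every feature needs both a batch and a streaming implementation that must agree.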


Step 2: Feature Engineering at Scale

Scale Challenges:

Feature Engineering Tools:

| Tool | Type | Scale | Best For |
|------|------|-------|----------|
| Pandas | Single-node | < 1M rows | Prototyping, exploration |
| Spark (PySpark) | Distributed | Billions of rows | Production batch features |
| Dask | Distributed Python | Medium | Python-native, parallel |
| Flink | Streaming | Real-time | Event-time streaming features |
| dbt | SQL transform | Millions to billions | SQL-native teams |

Feature Types: numerical (raw and scaled values), categorical (one-hot or target encodings), temporal (window aggregations), and embeddings.

Window Aggregations (Most Common ML Features):


Step 3: Data Versioning with DVC

DVC (Data Version Control) treats data like code — version, share, and reproduce.

DVC Workflow:
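A typical session looks like the following (remote URL, bucket, and file paths are illustrative; run inside an existing Git repo):

```shell
# One-time setup
dvc init
dvc remote add -d storage s3://my-bucket/dvc-cache   # bucket name illustrative

# Version a dataset: DVC stores the file content, Git tracks a small pointer
dvc add data/raw.csv
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw dataset v1 with DVC"
dvc push          # upload the data itself to the remote

# Reproduce on another machine
git clone <repo> && cd <repo>
dvc pull          # fetch exactly the data version the commit points to
```

The `.dvc` pointer file is a few lines of YAML containing a content hash, which is why multi-GB datasets don't bloat the Git history.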

DVC Pipelines:
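Pipelines are declared in a `dvc.yaml` file; `dvc repro` then re-runs only the stages whose inputs changed. A sketch with illustrative script and file names:

```yaml
stages:
  featurize:
    cmd: python featurize.py data/raw.csv data/features.parquet
    deps:
      - featurize.py
      - data/raw.csv
    outs:
      - data/features.parquet
  train:
    cmd: python train.py data/features.parquet models/model.pkl
    deps:
      - train.py
      - data/features.parquet
    outs:
      - models/model.pkl
```

Because `train` depends on `featurize`'s output, editing only `train.py` re-runs training without re-computing features.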

Benefits: reproducible experiments (data and code are pinned together in one commit), small Git repos (only lightweight pointers are committed), and datasets shared across the team via remote storage.


Step 4: Data Validation with Great Expectations

Data quality is the #1 silent killer of ML model performance.

Great Expectations (GE) Concepts:

| Concept | Definition |
|---------|------------|
| Expectation Suite | Collection of data quality rules |
| Expectation | Individual rule, e.g. "column_A must not be null" |
| Data Docs | Auto-generated data quality reports |
| Checkpoint | Runs a suite of expectations against new data |
| Validation Result | Pass/fail result with statistics |

Common Expectations: expect_column_values_to_not_be_null, expect_column_values_to_be_between, expect_column_values_to_be_in_set, expect_table_row_count_to_be_between.
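As a dependency-free sketch of what such checks do (the real GE API expresses them as methods like `expect_column_values_to_not_be_null("user_id")` on a validator), assuming only pandas and illustrative column names:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> dict:
    """Hand-rolled stand-in for a GE expectation suite + validation result."""
    results = {
        "user_id_not_null": bool(df["user_id"].notna().all()),
        "amount_in_range": bool(df["amount"].between(0, 10_000).all()),
        "status_in_set": bool(df["status"].isin({"paid", "pending", "refunded"}).all()),
    }
    results["success"] = all(results.values())
    return results

batch = pd.DataFrame({
    "user_id": [1, 2, 3],
    "amount": [9.99, 120.0, -5.0],   # -5.0 should fail the range check
    "status": ["paid", "pending", "paid"],
})
report = validate(batch)
print(report)
```

GE adds what this sketch lacks: persisted suites, auto-generated Data Docs, and checkpoint integration with orchestrators.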

GE in ML Pipeline: run a checkpoint right after ingestion and again before training; if validation fails, block the pipeline so bad data never reaches the model.


Step 5: Data Lineage Tracking

Lineage Graph Example:

What Lineage Enables: root-cause debugging (which upstream table broke this model?), impact analysis (which models consume this column?), and compliance audits (which data produced this decision?).

Lineage Tools:

  • Apache Atlas: Enterprise, integrates with Hadoop/Spark

  • OpenLineage: Open standard, vendor-neutral

  • Marquez: OpenLineage reference implementation

  • DataHub (LinkedIn): Comprehensive data catalog + lineage

  • MLflow: ML-specific lineage (model → data → code)
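Under the hood, all of these tools store some variant of a directed graph from upstream to downstream assets. A minimal sketch with illustrative asset names, showing the impact-analysis query lineage enables:

```python
# Minimal lineage graph: edges point from upstream to downstream assets.
lineage = {
    "raw_events": ["clean_events"],
    "clean_events": ["features_v3"],
    "features_v3": ["model_v12"],
    "model_v12": ["predictions"],
}

def downstream(asset: str) -> set:
    """Impact analysis: everything affected if `asset` breaks."""
    impacted, stack = set(), [asset]
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(downstream("clean_events"))  # features_v3, model_v12, predictions
```

Production systems add metadata per node (owner, schema, run IDs), but the debugging and audit queries are graph traversals like this one.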


Step 6: Feature Store Design (Feast)

Feast Architecture: an offline store (e.g., Parquet/BigQuery) serves point-in-time-correct training data, an online store (e.g., Redis/DynamoDB) serves low-latency features at inference, and a registry holds the feature definitions; materialization syncs offline features into the online store.

Feast Feature View:
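A minimal feature-view definition file, assuming a recent Feast release (the `Field`/`schema` API, roughly Feast ≥ 0.21; check your version's docs). Entity, path, and feature names are illustrative:

```python
# feature_repo/driver_features.py
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),   # how long rows stay valid in the online store
    schema=[
        Field(name="trips_7d", dtype=Int64),
        Field(name="avg_rating", dtype=Float32),
    ],
    source=stats_source,
)
```

Running `feast apply` registers the view; training code then fetches historical features point-in-time, and serving code fetches the same features from the online store.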


Step 7: Training-Serving Skew Prevention

The #1 ML Production Bug: training-serving skew, where features are computed one way in the offline training pipeline and a subtly different way in the online serving path, so the model sees inputs at inference that it never saw in training.

Prevention Strategies: define each feature once and serve it from a single feature store; share transformation code between training and serving paths; log the features actually served and compare their distributions against training data.
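The cheapest prevention tactic is structural: one transformation function imported by both the training pipeline and the serving endpoint, so the logic cannot diverge. A sketch with a hypothetical feature:

```python
# Shared module: the ONLY place this feature's logic lives.
def amount_features(amount_cents: int) -> dict:
    return {
        "amount_dollars": amount_cents / 100,
        "is_large": amount_cents >= 50_000,
    }

# Training path (offline): applied over historical rows
train_rows = [amount_features(a) for a in [1_000, 75_000]]

# Serving path (online): the same function on a live request
request_features = amount_features(75_000)

assert request_features == train_rows[1]  # identical logic, no skew
print(train_rows[0], request_features)
```

A feature store generalizes this idea: the definition is registered once, and both paths read from it instead of re-implementing it.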


Step 8: Capstone — ML Data Pipeline with Feature Versioning
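A condensed sketch of the capstone's shape, assuming pandas and illustrative column names: validate the batch (GE-style gate), engineer a feature, and stamp the output with a content hash as a lightweight, DVC-style version id:

```python
import hashlib
import pandas as pd

def run_pipeline(raw: pd.DataFrame) -> tuple[pd.DataFrame, str]:
    # 1. Validate: bad data blocks the pipeline before training
    if raw["amount"].lt(0).any() or raw["user_id"].isna().any():
        raise ValueError("validation failed: blocking pipeline")
    # 2. Feature engineering
    features = raw.groupby("user_id", as_index=False)["amount"].sum()
    # 3. Version: a hash of the content identifies this exact feature set
    version = hashlib.sha256(
        features.to_csv(index=False).encode()
    ).hexdigest()[:12]
    return features, version

raw = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})
features, version = run_pipeline(raw)
print(version, features.to_dict("records"))
```

The full lab replaces each step with the real tool (GE checkpoint, Spark job, DVC-tracked output), but the ordering and the fail-closed validation gate are the same.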

📸 Verified Output:


Summary

| Concept | Key Points |
|---------|-----------|
| Lambda Architecture | Batch + Speed layers merged at query time; complete but complex |
| Kappa Architecture | Streaming only; simpler, but suited to streaming-first teams |
| Feature Engineering at Scale | Spark (batch), Flink (streaming), Feast (feature store) |
| DVC | Version data like code: Git-trackable pointers to large files |
| Great Expectations | Automated data validation; block bad data from reaching training |
| Lineage Tracking | Know which data → which model → which decision; audit + debugging |
| Training-Serving Skew | Single feature store = same features at training and serving time |

Next Lab: Lab 14: Responsible AI Audit →
