Lab 13: AI Data Pipeline Architecture
Overview

This lab covers the data side of an ML system: batch-plus-streaming (Lambda) versus streaming-first (Kappa) pipeline designs, feature engineering at scale, dataset versioning with DVC, data validation with Great Expectations, lineage tracking, feature store design with Feast, and prevention of training-serving skew. A capstone assembles these pieces into an ML data pipeline with feature versioning.
Architecture
┌──────────────────────────────────────────────────────────────┐
│                ML Data Pipeline Architecture                 │
├────────────────────────┬─────────────────────────────────────┤
│  LAMBDA ARCHITECTURE   │  KAPPA ARCHITECTURE                 │
│  ───────────────────   │  ──────────────────                 │
│  Batch Layer (Spark)   │  Streaming Layer only               │
│  Speed Layer (Flink)   │  (Kafka + Flink)                    │
│  Serving Layer         │  Re-processing = replay topic       │
│  Merge at query time   │  Simpler, but streaming-first       │
├────────────────────────┴─────────────────────────────────────┤
│  Feature Engineering → Data Versioning → Validation          │
│           ↓                 (DVC)          (GE)              │
│  Feature Store (Feast) → Training Data → Model               │
└──────────────────────────────────────────────────────────────┘

Step 1: Batch ETL vs Streaming Architectures
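The defining move of the Lambda architecture is the query-time merge: a batch view precomputed over all history plus a speed-layer view over events that arrived since the last batch run. A dependency-free sketch (the counts and user IDs are made up for illustration):

```python
# Query-time merge of the Lambda architecture's batch and speed layers.
from collections import Counter

# Batch layer: view precomputed over all historical events (e.g. a nightly Spark job).
batch_view = Counter({"user_1": 120, "user_2": 45})

# Speed layer: counts for events that arrived after the last batch run (e.g. Flink).
speed_layer = Counter({"user_1": 3, "user_3": 7})

def query(user_id: str) -> int:
    """Serving layer: merge batch and speed results at query time."""
    return batch_view[user_id] + speed_layer[user_id]

print(query("user_1"))  # → 123 (120 from batch + 3 from the speed layer)
```

In Kappa there is nothing to merge: a single streaming job maintains the view, and "reprocessing" means replaying the Kafka topic through a new version of that job.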
Step 2: Feature Engineering at Scale
| Tool          | Type                          | Scale                        | Best For                               |
|---------------|-------------------------------|------------------------------|----------------------------------------|
| Apache Spark  | Batch / micro-batch           | TB–PB, distributed           | Large batch ETL, feature backfills     |
| Apache Flink  | Streaming, event-at-a-time    | High-throughput streams      | Low-latency, real-time features        |
| Kafka Streams | Streaming library on Kafka    | Scales with topic partitions | Lightweight transforms next to Kafka   |
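Whatever the engine, the workhorse feature type is the windowed aggregate; the same logic maps to a group-by-window in Spark or a keyed window in Flink. A pure-Python sketch, with made-up event data and a hypothetical `spend_7d` feature:

```python
# Rolling-window aggregate feature: total spend per user over the last 7 days.
from datetime import datetime, timedelta

events = [
    {"user": "u1", "ts": datetime(2024, 1, 1), "amount": 30.0},
    {"user": "u1", "ts": datetime(2024, 1, 5), "amount": 50.0},
    {"user": "u2", "ts": datetime(2024, 1, 6), "amount": 20.0},
    {"user": "u1", "ts": datetime(2024, 1, 9), "amount": 10.0},
]

def spend_7d(events, user, as_of, window=timedelta(days=7)):
    """Total spend for `user` in the window ending at `as_of` (inclusive)."""
    return sum(
        e["amount"] for e in events
        if e["user"] == user and as_of - window <= e["ts"] <= as_of
    )

print(spend_7d(events, "u1", datetime(2024, 1, 10)))  # 60.0 (Jan 5 + Jan 9; Jan 1 is outside)
```

Note the `as_of` parameter: computing the feature "as of" a timestamp is what makes the same definition usable for point-in-time-correct training data later.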
Step 3: Data Versioning with DVC
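The core DVC workflow: Git tracks small `.dvc` pointer files while DVC stores the data itself in a content-addressed cache and remote. A typical session (paths and the S3 remote URL are illustrative):

```shell
# One-time setup inside an existing Git repo.
dvc init

# Track a dataset: writes a small data/train.csv.dvc pointer for Git
# and moves the real file into DVC's cache.
dvc add data/train.csv
git add data/train.csv.dvc .gitignore
git commit -m "Track training data v1"

# Configure a default remote (hypothetical bucket) and upload the data.
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc push

# Time travel: restore the dataset that matches any past commit.
git checkout <older-commit>
dvc checkout
```

Because the `.dvc` file records a hash of the data, every Git commit pins an exact dataset version without bloating the repository.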
Step 4: Data Validation with Great Expectations
| Concept           | Definition                                                                      |
|-------------------|---------------------------------------------------------------------------------|
| Expectation       | A declarative, verifiable assertion about data (e.g. column values are non-null) |
| Expectation Suite | A named collection of expectations validated together against a batch of data    |
| Checkpoint        | A configured validation run that executes suites against data and stores results |
| Data Context      | The project-level entry point holding datasource, suite, and store configuration |
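Great Expectations ships hundreds of built-in expectations; to show the shape of the contract without the library as a dependency, here is a hand-rolled sketch of two checks using GE-style names (the result layout is simplified and the row data is made up):

```python
# Hand-rolled sketch of two Great-Expectations-style checks (not the GE API).
def expect_column_values_to_not_be_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    return {"success": not bad, "unexpected_count": len(bad)}

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    bad = [r for r in rows
           if r.get(column) is not None and not (min_value <= r[column] <= max_value)]
    return {"success": not bad, "unexpected_count": len(bad)}

rows = [{"age": 34}, {"age": None}, {"age": 250}]
print(expect_column_values_to_not_be_null(rows, "age"))         # fails: 1 null
print(expect_column_values_to_be_between(rows, "age", 0, 120))  # fails: 1 out of range
```

The payoff of the declarative style is that the same suite of checks can gate ingestion, run in CI, and document the dataset's contract, all from one definition.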
Step 5: Data Lineage Tracking
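The minimum viable form of lineage is a ledger: every transform records fingerprints of its inputs and its output, so any artifact can be traced back to its sources. A dependency-free sketch (step names and data are illustrative):

```python
# Minimal lineage ledger: fingerprint inputs and outputs of every step.
import hashlib
import json

def fingerprint(obj) -> str:
    """Stable short hash of any JSON-serializable artifact."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

lineage = []

def tracked(step_name, fn, *inputs):
    """Run fn(*inputs) and append a lineage record for the step."""
    output = fn(*inputs)
    lineage.append({
        "step": step_name,
        "inputs": [fingerprint(i) for i in inputs],
        "output": fingerprint(output),
    })
    return output

raw = [1, 2, 3, 4]
cleaned = tracked("drop_outliers", lambda xs: [x for x in xs if x < 4], raw)
features = tracked("square", lambda xs: [x * x for x in xs], cleaned)

# The second step's input hash equals the first step's output hash,
# so `features` provably traces back through `cleaned` to `raw`.
```

Production systems (e.g. OpenLineage-style tooling) add run IDs, schemas, and storage, but the chain-of-hashes idea is the same.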
Step 6: Feature Store Design (Feast)
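Feast's core design split is an offline store (full, timestamped history for building training sets) and an online store (latest feature values per entity for low-latency serving), connected by materialization. A dependency-free sketch of that split (this is the shape of the design, not the Feast API):

```python
# Toy offline/online feature store split in the style of Feast.
from datetime import datetime

class MiniFeatureStore:
    def __init__(self):
        self.offline = []   # full history: (entity_id, timestamp, features)
        self.online = {}    # latest features per entity, for serving

    def ingest(self, entity_id, ts, features):
        """Append a feature row to the offline (historical) store."""
        self.offline.append((entity_id, ts, features))

    def materialize(self):
        """Copy the latest offline row per entity into the online store."""
        for entity_id, ts, features in sorted(self.offline, key=lambda r: r[1]):
            self.online[entity_id] = features

    def get_online_features(self, entity_id):
        """Low-latency lookup at serving time."""
        return self.online.get(entity_id)

store = MiniFeatureStore()
store.ingest("u1", datetime(2024, 1, 1), {"txn_count_7d": 2})
store.ingest("u1", datetime(2024, 1, 8), {"txn_count_7d": 5})
store.materialize()
print(store.get_online_features("u1"))  # {'txn_count_7d': 5} — latest row wins
```

The offline history is what enables point-in-time-correct training joins; the online map is what keeps serving lookups to a single key read.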
Step 7: Training-Serving Skew Prevention
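Training-serving skew appears when the training pipeline and the serving path compute "the same" feature with different code. The structural fix is a single shared transform that both paths import. A sketch (the `log1p`/weekend features and field names are illustrative):

```python
# Skew prevention: feature logic lives in exactly one function,
# imported by both the training job and the serving endpoint.
import math

def transform(raw: dict) -> dict:
    """The ONLY place feature logic lives: used for training AND serving."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": 1 if raw["day_of_week"] >= 5 else 0,
    }

# Training path: build the dataset offline through the shared transform.
training_rows = [transform(r) for r in [{"amount": 100.0, "day_of_week": 6}]]

# Serving path: the live request goes through the exact same function.
serving_row = transform({"amount": 100.0, "day_of_week": 6})

assert training_rows[0] == serving_row  # identical features, by construction
```

A feature store achieves the same guarantee organizationally: both paths read precomputed values from the store instead of each reimplementing the computation.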
Step 8: Capstone — ML Data Pipeline with Feature Versioning
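One workable capstone mechanism for feature versioning is content hashing: derive the version from the feature's definition, so any change to the definition yields a new version and old training sets remain reproducible against the registry. A sketch (the definition fields and registry layout are illustrative):

```python
# Feature versioning via content hashing of the feature definition.
import hashlib
import json

def feature_version(definition: dict) -> str:
    """Deterministic short version ID derived from the definition itself."""
    blob = json.dumps(definition, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]

v1 = {"name": "txn_count", "entity": "user_id", "agg": "count", "window": "7d"}
v2 = {**v1, "window": "30d"}   # widening the window is a new feature version

registry = {
    f"txn_count:{feature_version(v1)}": v1,
    f"txn_count:{feature_version(v2)}": v2,
}

assert feature_version(v1) != feature_version(v2)        # any change = new version
assert feature_version(v1) == feature_version(dict(v1))  # same definition = same version
```

Pairing this with DVC-tracked data and a lineage ledger closes the loop: a model can name the exact feature versions, dataset hashes, and transform steps it was trained on.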
Summary
| Concept                 | Key Points                                                                        |
|-------------------------|-----------------------------------------------------------------------------------|
| Lambda vs Kappa         | Lambda: batch + speed layers merged at query time; Kappa: streaming-only, reprocess by replaying the topic |
| Feature engineering     | Windowed aggregates computed "as of" a timestamp; same logic in Spark, Flink, or Kafka Streams |
| Data versioning (DVC)   | Git tracks hash-based pointer files; every commit pins an exact dataset version    |
| Validation (GE)         | Declarative expectations grouped into suites, run via checkpoints to gate bad data |
| Data lineage            | Record input/output fingerprints per step so artifacts trace back to their sources |
| Feature store (Feast)   | Offline store for point-in-time training data, online store for low-latency serving |
| Training-serving skew   | Prevented by routing training and serving through one shared transform or store    |
| Feature versioning      | Derive versions from feature definitions so training data stays reproducible       |
