Lab 02: Model Serving at Scale
Overview
This lab walks through production model serving end to end: serving patterns, REST vs gRPC, model server architectures, latency/throughput trade-offs, safe deployment strategies (canary, shadow, A/B), SLO design, and auto-scaling.
Architecture
┌─────────────────────────────────────────────────────────────┐
│                 Model Serving Architecture                  │
├───────────────────┬──────────────────┬──────────────────────┤
│  Online Serving   │  Batch Serving   │  Streaming Serving   │
│     (< 100ms)     │ (hours/minutes)  │     (< 1 second)     │
│     REST/gRPC     │    Spark/Ray     │    Kafka + Flink     │
├───────────────────┴──────────────────┴──────────────────────┤
│         Load Balancer / API Gateway / Service Mesh          │
├─────────────────────────────────────────────────────────────┤
│      Model Server Pool (auto-scaled, K8s, GPU-enabled)      │
│        [Model v1.0 - 95%]   [Model v1.1 - 5% canary]        │
└─────────────────────────────────────────────────────────────┘
Step 1: Serving Patterns
The three patterns in the architecture diagram trade latency against throughput:

| Pattern | Latency | Throughput | Use Case | Complexity |
|---------|---------|------------|----------|------------|
| Online (REST/gRPC) | < 100 ms | Low/medium per request | Interactive predictions (fraud checks, recommendations) | Medium |
| Batch (Spark/Ray) | Minutes to hours | Very high | Periodic scoring of large datasets | Low |
| Streaming (Kafka + Flink) | < 1 second | High, continuous | Event-driven predictions on message streams | High |
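A minimal sketch of the online pattern, using only the Python standard library. The endpoint behavior, fixed linear "model," and JSON schema are illustrative assumptions, not a production server:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in model: a fixed linear scorer (hypothetical weights).
    weights = [0.4, -0.2, 0.1]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and score it synchronously.
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"score": predict(body["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging

def serve_in_background():
    # Bind port 0 so the OS picks a free port.
    server = HTTPServer(("127.0.0.1", 0), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A real deployment would sit this behind the load balancer tier from the diagram and add health checks, timeouts, and request validation.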
Step 2: REST vs gRPC Serving
| Dimension | REST | gRPC |
|-----------|------|------|
| Payload encoding | JSON (text) | Protocol Buffers (binary) |
| Transport | HTTP/1.1 (typically) | HTTP/2 (multiplexed) |
| Payload size / serialization | Larger, slower to encode/decode | Smaller, faster to encode/decode |
| Streaming | Limited (chunked responses, SSE) | Native bidirectional streaming |
| Tooling | Ubiquitous; easy to test with curl | Generated client stubs; needs gRPC-aware tools |
| Best for | Public APIs, simple integrations | Internal, low-latency service-to-service calls |
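To make the payload-size point concrete, here is a small comparison encoding the same three features as JSON versus a packed binary struct (a rough stand-in for protobuf encoding; the field layout is an assumption):

```python
import json
import struct

# One prediction request: three float32 features (hypothetical schema).
features = [0.25, 1.5, -3.0]

# REST-style body: UTF-8 JSON text.
json_payload = json.dumps({"features": features}).encode("utf-8")

# gRPC-style body approximated as three packed little-endian float32s.
binary_payload = struct.pack("<3f", *features)

# The binary encoding is a fraction of the JSON size for numeric data,
# and the gap grows with feature count.
```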
Step 3: Model Server Architectures
| Server | Framework | GPU Support | Batching | Multi-model | Best For |
|--------|-----------|-------------|----------|-------------|----------|
| NVIDIA Triton | TensorFlow, PyTorch, ONNX, TensorRT | Yes | Dynamic batching | Yes | High-throughput GPU inference |
| TorchServe | PyTorch | Yes | Yes | Yes | PyTorch-native deployments |
| TensorFlow Serving | TensorFlow (SavedModel) | Yes | Yes | Yes | TensorFlow-native deployments |
| KServe | Framework-agnostic (pluggable runtimes) | Yes | Runtime-dependent | Yes | Kubernetes-native serverless serving |
| Ray Serve | Any Python callable | Yes | Yes | Yes | Python-first model composition and pipelines |
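The multi-model column can be illustrated with a toy in-process registry that serves several models and versions and resolves a default "latest" version, loosely analogous to a server's model repository. The class and method names are hypothetical, not any real server's API:

```python
class ModelRegistry:
    """Toy multi-model registry: several (name, version) models in one
    process, defaulting to the newest version when none is requested."""

    def __init__(self):
        self._models = {}  # (name, version) -> callable

    def register(self, name, version, model_fn):
        self._models[(name, version)] = model_fn

    def predict(self, name, inputs, version=None):
        if version is None:
            # Resolve "latest": highest registered version for this name.
            version = max(v for (n, v) in self._models if n == name)
        return self._models[(name, version)](inputs)
```

Real servers add the hard parts this sketch omits: loading models from storage, per-model resource limits, and safe hot-swapping of versions.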
Step 4: Latency vs Throughput Trade-offs
| Strategy | Description | Best For |
|----------|-------------|----------|
| Dynamic batching | Buffer requests until a batch fills or a short timeout expires, then run one forward pass | GPU serving with many concurrent clients |
| Quantization (FP16/INT8) | Lower numeric precision to cut compute and memory per request | Latency-sensitive models that tolerate a small accuracy loss |
| Response caching | Memoize predictions for repeated inputs | Skewed request distributions |
| Concurrent model instances | Run multiple copies per node to overlap I/O and compute | Maximizing hardware utilization |
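Dynamic batching is the strategy most worth internalizing, so here is a minimal sketch of the core loop: collect requests up to a maximum batch size or a deadline, run one batched call, then fan results back out. Servers like Triton implement this natively; the class below is illustrative only:

```python
import queue
import threading
import time

class DynamicBatcher:
    """Illustrative dynamic batcher: trade a small queuing delay
    (max_wait_s) for larger, more efficient batches."""

    def __init__(self, model_fn, max_batch_size=8, max_wait_s=0.01):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, x):
        # Callers block until their slot in the batch is filled.
        slot = {"input": x, "done": threading.Event()}
        self.requests.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block for the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=timeout))
                except queue.Empty:
                    break
            # One batched forward pass, then fan results back out.
            outputs = self.model_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()
```

Note the explicit trade-off: every request may wait up to `max_wait_s` even under light load, which is why batching windows are kept to a few milliseconds for online serving.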
Step 5: Canary, Shadow, and A/B Deployments
| Strategy | Traffic Split | Risk | Rollback | Use Case |
|----------|---------------|------|----------|----------|
| Canary | Small slice (e.g. 5%) to the new version, increased gradually | Low | Shift traffic back to the stable version | Validating a new model on live traffic |
| Shadow | 100% mirrored to the new version; responses discarded | Minimal (users never see shadow output) | Stop mirroring | Checking latency and correctness with zero user impact |
| A/B | Fixed split (e.g. 50/50) between versions | Medium | Route all traffic to the winner | Comparing business metrics between model versions |
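Canary and A/B routing both need a split that is sticky per user, so that one user does not bounce between model versions. A common trick is hashing the user ID into a fixed bucket; a minimal sketch (the version labels match the diagram, the function itself is hypothetical):

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Sticky traffic split: hash the user into one of 100 buckets,
    so the same user always sees the same version for a given split."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "v1.1-canary" if bucket < canary_percent else "v1.0-stable"
```

Using a stable hash (not Python's randomized built-in `hash`) is what makes the assignment reproducible across processes; widening the canary is just raising `canary_percent`.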
Step 6: SLO Design for ML Systems
Express user-facing objectives as latency percentiles (e.g. p99 < 100 ms), availability (e.g. 99.9% of requests succeed), and an error budget over a rolling window. The remaining budget then governs how aggressively new model versions roll out: a nearly exhausted budget argues for freezing deployments.
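A small sketch of how those quantities are computed from raw measurements. The nearest-rank percentile and the report shape are illustrative choices, not a standard API:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def slo_report(latencies_ms, errors, window_requests,
               slo_availability=0.999):
    """Summarize one SLO window: tail latencies plus remaining
    error budget (how many more failures the window can absorb)."""
    allowed_failures = window_requests * (1 - slo_availability)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "error_budget_remaining": allowed_failures - errors,
    }
```

Percentiles, not averages, are the right latency target: a mean of 40 ms can hide a p99 of several hundred milliseconds.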
Step 7: Auto-scaling for Model Serving
Scale the model server pool on load signals that track inference cost (requests per second, GPU utilization, queue depth) rather than CPU alone, and keep a minimum replica count warm: model loading makes cold starts expensive, so scaling to zero trades money for latency spikes.
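The scaling rule used by the Kubernetes Horizontal Pod Autoscaler is worth seeing explicitly: desired replicas = ceil(current replicas x current metric / target metric), with a tolerance band so small fluctuations do not cause flapping. A sketch (the default-value choices here are assumptions):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20, tolerance=0.1):
    """HPA-style scaling decision: grow or shrink the pool so the
    per-replica metric returns to its target, within a tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas each seeing 80 RPS against a 50 RPS target scale to 7 replicas, while a load only 4% above target stays put thanks to the tolerance band.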
Step 8: Capstone — Design High-Availability Model Serving
Using Steps 1-7, record a design decision for each component below and justify it against your latency, throughput, and availability requirements.

| Component | Design Decision |
|-----------|-----------------|
| Serving pattern (online/batch/streaming) | |
| API protocol (REST vs gRPC) | |
| Model server | |
| Deployment strategy | |
| SLOs and alerting | |
| Auto-scaling policy | |
Summary

| Concept | Key Points |
|---------|------------|
| Serving patterns | Online (< 100 ms), batch, and streaming; choose by latency and throughput requirements |
| REST vs gRPC | gRPC offers binary payloads, HTTP/2, and streaming; REST is simpler and universally supported |
| Model servers | Triton, TorchServe, TensorFlow Serving, KServe, Ray Serve; compare batching and multi-model support |
| Latency vs throughput | Dynamic batching, quantization, and caching trade one against the other |
| Deployment strategies | Canary, shadow, and A/B releases limit blast radius and enable fast rollback |
| SLOs and auto-scaling | Percentile latency targets and error budgets drive scaling and rollout decisions |
Last updated
