Lab 02: Model Serving at Scale

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

Model serving is the bridge between trained models and business value. This lab covers serving patterns, model server architectures, latency-throughput trade-offs, deployment strategies, and SLO design for ML systems.

Architecture

┌───────────────────────────────────────────────────────┐
│              Model Serving Architecture               │
├────────────────┬─────────────────┬────────────────────┤
│ Online Serving │ Batch Serving   │ Streaming Serving  │
│ (< 100ms)      │ (hours/minutes) │ (< 1 second)       │
│ REST/gRPC      │ Spark/Ray       │ Kafka + Flink      │
├────────────────┴─────────────────┴────────────────────┤
│      Load Balancer / API Gateway / Service Mesh       │
├───────────────────────────────────────────────────────┤
│  Model Server Pool (auto-scaled, K8s, GPU-enabled)    │
│  [Model v1.0 - 95%] [Model v1.1 - 5% canary]          │
└───────────────────────────────────────────────────────┘

Step 1: Serving Patterns

Choose the right serving pattern based on latency requirements and data volume.

| Pattern        | Latency      | Throughput  | Use Case                           | Complexity |
|----------------|--------------|-------------|------------------------------------|------------|
| Online         | < 100ms      | Low-medium  | Fraud detection, recommendations   | Medium     |
| Batch          | Hours-days   | Very high   | Overnight scoring, ETL enrichment  | Low        |
| Streaming      | < 1 second   | Medium-high | Real-time alerts, live scoring     | High       |
| Near-real-time | 1-30 seconds | Medium      | Personalization, content filtering | Medium     |

Decision Framework:
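As a rough sketch, the decision reduces to two questions: is something blocking on the answer, and how tight is the latency budget? The thresholds below mirror the table above and are illustrative, not prescriptive:

```python
def choose_serving_pattern(latency_budget_ms: float, request_driven: bool = True) -> str:
    """Heuristic mapping from requirements to a serving pattern.
    Real decisions also weigh cost, data volume, and team maturity."""
    if request_driven:
        # A user (or upstream service) is blocking on the answer.
        return "online" if latency_budget_ms <= 100 else "near-real-time"
    # Event-driven: nobody is waiting on an individual response.
    return "streaming" if latency_budget_ms <= 1_000 else "batch"

print(choose_serving_pattern(50))                      # fraud check in the request path
print(choose_serving_pattern(8 * 3600 * 1000, False))  # overnight scoring job
```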

💡 Most teams over-engineer for online serving when batch is sufficient. Ask: "Does the user need the prediction in < 1 second?"


Step 2: REST vs gRPC Serving

| Dimension      | REST                         | gRPC                              |
|----------------|------------------------------|-----------------------------------|
| Protocol       | HTTP/1.1                     | HTTP/2                            |
| Payload        | JSON (verbose)               | Protobuf (compact)                |
| Latency        | Higher (JSON serialization)  | 5-10x lower                       |
| Streaming      | Limited                      | Native bi-directional             |
| Client support | Universal                    | Requires gRPC client              |
| Debugging      | Easy (curl/browser)          | Harder (need grpcurl)             |
| Use case       | External APIs, microservices | High-throughput internal services |

When to choose gRPC:

  • High-throughput inference (> 10k RPS)

  • Low-latency requirements (< 10ms)

  • Embedding services (large vectors)

  • Service-to-service communication

REST API Design for ML:


Step 3: Model Server Architectures

Comparison of Production Model Servers:

| Server     | Framework  | GPU Support | Batching         | Multi-model    | Best For               |
|------------|------------|-------------|------------------|----------------|------------------------|
| TorchServe | PyTorch    | ✅          | ✅ (dynamic)     | ✅             | PyTorch models         |
| TF Serving | TensorFlow | ✅          | ✅ (dynamic)     | ✅             | TensorFlow models      |
| Triton     | NVIDIA     | ✅ (CUDA)   | ✅ (dynamic)     | ✅             | Multi-framework, GPU   |
| BentoML    | Any        | ✅          | ✅ (adaptive)    | ✅             | Python-first, fast dev |
| KServe     | Any        | ✅          | ✅ (via runtime) | ✅ (ModelMesh) | Kubernetes-native      |
| vLLM       | LLMs       | ✅          | ✅ (continuous)  | ❌             | LLM inference          |

Triton Architecture (NVIDIA):
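Triton is configured per model through a `config.pbtxt` in the model repository. A sketch for a hypothetical ONNX fraud model with dynamic batching and two GPU instances (names, shapes, and timings are illustrative):

```
name: "fraud_detector"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "features", data_type: TYPE_FP32, dims: [ 64 ] }
]
output [
  { name: "score", data_type: TYPE_FP32, dims: [ 1 ] }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000   # wait at most 5 ms to fill a batch
}
instance_group [
  { count: 2, kind: KIND_GPU }         # two model instances share the GPU
]
```

The `max_queue_delay_microseconds` knob is where the latency-throughput trade-off from Step 4 becomes a concrete config value.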

💡 For LLM serving (Llama, Mistral), use vLLM with PagedAttention — it achieves 10-24x higher throughput vs naive serving through KV cache management.


Step 4: Latency vs Throughput Trade-offs

These two metrics are fundamentally in tension. Understand the trade-off curve.

Batching Effect:

Key Insight: GPU utilization improves with batch size, but P99 latency increases.

SLO-Aware Batching Strategy:
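The core loop of a dynamic batcher can be sketched as follows: take the first request, then fill the batch only as long as a wait bound derived from the latency SLO allows. This single-threaded illustration omits the dedicated batching thread a real server would use:

```python
import time
from queue import Queue, Empty

def dynamic_batcher(requests: Queue, max_batch: int = 16, max_wait_ms: float = 5):
    """Collect up to max_batch requests, but never wait longer than
    max_wait_ms for stragglers -- the wait bound protects the P99 SLO."""
    batch = [requests.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # SLO budget spent; ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch
```

Under heavy load the queue is always full and batches ship at `max_batch`; under light load requests wait at most `max_wait_ms`, so tail latency stays bounded.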

Batching Strategies:

| Strategy            | Description                     | Best For              |
|---------------------|---------------------------------|-----------------------|
| Static batching     | Fixed batch size, synchronous   | Offline/batch serving |
| Dynamic batching    | Waits max_wait_ms, fills batch  | Mixed online serving  |
| Continuous batching | Per-token for LLMs (vLLM)       | LLM inference         |
| Micro-batching      | Very small batches, low latency | Streaming ML          |


Step 5: Canary, Shadow, and A/B Deployments

Deployment Strategies Comparison:

| Strategy   | Traffic Split         | Risk   | Rollback | Use Case                    |
|------------|-----------------------|--------|----------|-----------------------------|
| Canary     | 95%/5% → gradual      | Low    | < 1 min  | General production rollouts |
| Blue-Green | 0%/100% switch        | Medium | < 30 sec | Zero-downtime deploys       |
| Shadow     | 100% mirror           | None   | N/A      | Safe new model testing      |
| A/B Test   | 50%/50% (statistical) | Medium | < 1 min  | Business metric experiments |

Canary Deployment Process:
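The promotion loop can be sketched like this. `set_traffic_split` stands in for whatever your gateway or mesh exposes (e.g. Istio VirtualService weights); the stages and error budget are illustrative:

```python
TRAFFIC = {"canary_pct": 0}

def set_traffic_split(canary_pct: int):
    """Stand-in for a real traffic API (mesh weights, ALB target groups, ...)."""
    TRAFFIC["canary_pct"] = canary_pct

def run_canary(get_error_rate, stages=(5, 25, 100), error_budget=0.001):
    """Widen the canary stage by stage; roll back instantly on any breach.
    get_error_rate is a hypothetical hook into your metrics system."""
    for pct in stages:
        set_traffic_split(pct)
        if get_error_rate() > error_budget:  # observe before widening further
            set_traffic_split(0)             # instant rollback to stable
            return "rolled_back"
    return "promoted"
```

In practice each stage also holds for a soak period (minutes to hours) so enough traffic accumulates for the error-rate comparison to be meaningful.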

Shadow Mode (Safest for ML):
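A sketch of the request path in shadow mode. In production the shadow call is usually fired asynchronously so it adds no latency; this synchronous version just shows the invariant, which is that the shadow model's output and errors can never reach the caller:

```python
import logging

def serve_with_shadow(features, primary_model, shadow_model, log=None):
    """Primary answers the caller; the shadow scores the same request and
    its output is only logged for offline comparison, never returned."""
    primary_pred = primary_model(features)
    try:
        shadow_pred = shadow_model(features)
        (log or logging.info)(f"shadow diff={abs(primary_pred - shadow_pred):.4f}")
    except Exception:
        pass  # a shadow failure must never affect the live response
    return primary_pred
```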

💡 Always run shadow mode for at least 48 hours before promoting an LLM or complex model to production. Collect real production distribution data.


Step 6: SLO Design for ML Systems

Service Level Objectives for ML have two dimensions: latency AND model quality.

Latency SLOs:
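A latency SLO is stated over percentiles of a measurement window, so the building block is a percentile check. A minimal nearest-rank version (the samples and the 100 ms target are illustrative):

```python
def percentile(samples, p):
    """Nearest-rank percentile -- good enough for SLO checks."""
    s = sorted(samples)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

# Hypothetical SLO: P99 < 100 ms over the measurement window
latencies_ms = [12, 15, 18, 22, 30, 41, 55, 70, 85, 140]
p99 = percentile(latencies_ms, 99)
print("P99 =", p99, "ms | SLO met:", p99 < 100)
```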

Model Quality SLOs:
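A quality SLO compares a rolling online metric against the offline baseline. A sketch, where the baseline AUC and the degradation budget are illustrative numbers:

```python
def quality_slo_ok(rolling_auc: float, baseline_auc: float = 0.92,
                   max_degradation: float = 0.02) -> bool:
    """Quality SLO: rolling AUC may not drop more than max_degradation
    below the offline baseline measured at deploy time."""
    return rolling_auc >= baseline_auc - max_degradation

print(quality_slo_ok(0.91))  # within budget
print(quality_slo_ok(0.88))  # breach -> page the on-call / auto-rollback
```

The hard part in practice is getting labels (fraud outcomes arrive days later), which is why quality SLOs often run on a delayed window or on proxy metrics.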

Error Budget Concept:
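The error budget is just the complement of the SLO converted into tolerable bad minutes per window:

```python
def monthly_error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Allowed unavailability per month implied by an availability SLO."""
    return days * 24 * 60 * (1 - slo_pct / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {monthly_error_budget_minutes(slo):.1f} min/month")
```

Once the budget is spent, feature rollouts pause and the team works on reliability; for ML systems the budget is consumed by latency breaches and quality breaches alike.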


Step 7: Auto-scaling for Model Serving

Kubernetes HPA for ML:
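A sketch of an `autoscaling/v2` HPA that scales on a custom queue-depth metric instead of CPU, since CPU is a poor proxy for GPU-bound inference load. The names and the metrics-adapter wiring behind `inference_queue_depth` are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-server
  minReplicas: 10          # pre-warmed floor to absorb GPU cold starts
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth   # custom metric via a metrics adapter
      target:
        type: AverageValue
        averageValue: "4"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # avoid flapping while new pods load models
```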

GPU-aware Scaling Considerations:

  • GPU pods take 30-60 seconds to start (model loading)

  • Pre-warm minimum replicas to avoid cold starts

  • Use GPU fractional allocation for small models (MIG on A100)

  • Consider GPU sharing for low-QPS models


Step 8: Capstone — Design High-Availability Model Serving

Scenario: Your fraud detection model must serve 10,000 RPS with P99 < 50ms, 99.99% availability.

Verification — Run FastAPI + SLO Simulation:
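As a minimal stand-in for the verification script, the snippet below simulates a heavy-tailed latency distribution and checks it against the P99 < 50 ms SLO. The distribution parameters are invented; in the lab container the measurements would come from load-testing the FastAPI endpoint:

```python
import random

random.seed(0)

# 99% of requests from a fast path, 1% from a slow tail
# (illustrative numbers, not measurements).
samples = [random.gauss(20, 5) for _ in range(9_900)]
samples += [random.gauss(60, 10) for _ in range(100)]

def nearest_rank(values, q):
    """Nearest-rank percentile over a list of latency samples."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q / 100 * len(s)))]

p50 = nearest_rank(samples, 50)
p99 = nearest_rank(samples, 99)
print(f"P50={p50:.1f} ms  P99={p99:.1f} ms  SLO(P99 < 50 ms) met: {p99 < 50}")
```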

Architecture for 10k RPS Fraud Detection:

| Component     | Design Decision                 |
|---------------|---------------------------------|
| Load balancer | AWS ALB / GCP HTTPS LB          |
| Model server  | Triton on GPU nodes (A10G)      |
| Batch size    | 8 (balanced latency/throughput) |
| Replicas      | 10 minimum, autoscale to 50     |
| Deployment    | Canary, 5% → 25% → 100%         |
| Monitoring    | P99 latency alert at 80ms       |
| Rollback      | Automated if error_rate > 0.1%  |


Summary

| Concept               | Key Points                                                        |
|-----------------------|-------------------------------------------------------------------|
| Serving Patterns      | Online (<100ms), Batch (hours), Streaming (<1s)                   |
| REST vs gRPC          | REST for external; gRPC for 5-10x lower latency internal          |
| Model Servers         | Triton (GPU/multi-framework), vLLM (LLMs), BentoML (Python-first) |
| Batching              | Larger batches = higher GPU utilization but higher P99 latency    |
| Deployment Strategies | Shadow (safest) → Canary (standard) → Blue-Green (instant)        |
| SLO Design            | Latency SLO + Model Quality SLO + Error Budget                    |
| Auto-scaling          | GPU pods need 30-60s warm-up; pre-warm minimum replicas           |

Next Lab: Lab 03: Vector Database Architecture →
