Lab 04: LLM Infrastructure Design
Overview
This lab walks through designing an LLM serving stack end to end: choosing inference hardware, picking a quantization level, selecting a serving framework, and tuning batching, KV cache placement, and multi-GPU sharding, all with the goal of minimizing cost per generated token.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ LLM Serving Infrastructure                                     │
├────────────────────────────────────────────────────────────────┤
│ Client → API Gateway → Load Balancer                           │
│                            ↓                                   │
│ vLLM / TGI / llama.cpp serving engine                          │
│ ├── Continuous Batching (PagedAttention)                       │
│ ├── KV Cache (GPU HBM → CPU → Disk tiering)                    │
│ └── Multi-GPU: Tensor Parallel / Pipeline Parallel             │
├────────────────────────────────────────────────────────────────┤
│ Hardware: H100 80GB / A100 80GB / A10G 24GB                    │
│ Quantization: FP32 → FP16 → INT8 → INT4                        │
└────────────────────────────────────────────────────────────────┘

Step 1: Inference Hardware Selection
Representative specs for the GPUs in the diagram above. Cloud prices are approximate on-demand rates and vary widely by provider.

| GPU  | Memory      | Bandwidth  | FP16 TFLOPS | Price/hr (cloud)  | Best For                                 |
|------|-------------|------------|-------------|-------------------|------------------------------------------|
| H100 | 80 GB HBM3  | ~3.35 TB/s | ~990        | ~$2–12            | 70B+ models, high-throughput serving     |
| A100 | 80 GB HBM2e | ~2.0 TB/s  | 312         | ~$1–4             | 13B–70B models, multi-GPU serving        |
| A10G | 24 GB GDDR6 | 600 GB/s   | ~70         | ~$1               | 7B–13B models, quantized inference       |
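Hardware selection usually starts with one question: do the weights fit? A minimal sketch of the sizing arithmetic follows; the 20% headroom factor for activations and KV cache is an assumption for illustration, not a measured value.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed just for model weights: params × bytes per param."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def fits(params_billion: float, bits: int, gpu_mem_gb: float,
         overhead: float = 1.2) -> bool:
    """overhead leaves headroom for activations and KV cache (assumed 20%)."""
    return weight_memory_gb(params_billion, bits) * overhead <= gpu_mem_gb

# LLaMA-70B in FP16 needs ~140 GB of weights: does not fit on one 80 GB GPU
print(fits(70, 16, 80))   # False
# INT4 brings weights to ~35 GB: fits on a single A100/H100 80 GB
print(fits(70, 4, 80))    # True
```

This ignores context-length-dependent KV cache growth, which Step 5 treats separately.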
Step 2: Quantization Strategies
Sizes below are direct arithmetic (70B params × bytes per param); speed and quality figures are typical ranges, not guarantees.

| Precision | Bits | LLaMA-70B Size | Speed Gain     | Quality Loss             |
|-----------|------|----------------|----------------|--------------------------|
| FP32      | 32   | ~280 GB        | 1× (baseline)  | None (reference)         |
| FP16/BF16 | 16   | ~140 GB        | ~2×            | Negligible               |
| INT8      | 8    | ~70 GB         | ~2–3×          | Minor on most tasks      |
| INT4      | 4    | ~35 GB         | ~3–4×          | Noticeable on some tasks |
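Where does the quality loss come from? A toy symmetric per-tensor INT8 quantizer makes it concrete: every weight is rounded to the nearest of 255 levels, so the reconstruction error per weight is bounded by half the scale (pure Python, illustrative only):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w_q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# max reconstruction error is bounded by scale / 2
print(max(abs(a - b) for a, b in zip(w, w_hat)))
```

Real INT8/INT4 schemes (per-channel scales, group-wise quantization, GPTQ/AWQ) refine this same idea to shrink the scale, and therefore the error, per block of weights.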
Step 3: Serving Frameworks (vLLM, llama.cpp)
| Framework | Hardware                         | Throughput                                | Latency            | Use Case                                   |
|-----------|----------------------------------|-------------------------------------------|--------------------|--------------------------------------------|
| vLLM      | NVIDIA (and AMD) GPUs            | High (continuous batching, PagedAttention)| Low under load     | Multi-user production serving              |
| TGI       | NVIDIA GPUs                      | High                                      | Low                | Hugging Face model serving                 |
| llama.cpp | CPU, Apple Silicon, consumer GPUs| Moderate                                  | Good single-stream | Local/edge deployment, GGUF quantized models|
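As a rough decision aid, the comparison can be distilled into a heuristic. The thresholds below (for example, 8 concurrent requests) are illustrative assumptions, not benchmarks; always measure on your own workload.

```python
def pick_engine(nvidia_gpu: bool, concurrent_requests: int) -> str:
    """Illustrative heuristic distilled from the comparison table above."""
    if not nvidia_gpu:
        # llama.cpp targets CPU, Apple Silicon, and consumer GPUs via GGUF
        return "llama.cpp"
    if concurrent_requests >= 8:
        # continuous batching pays off once many streams share the GPU
        return "vLLM"
    # low concurrency on an NVIDIA GPU: TGI's packaged deployment is a good fit
    return "TGI"

print(pick_engine(nvidia_gpu=False, concurrent_requests=100))  # llama.cpp
print(pick_engine(nvidia_gpu=True, concurrent_requests=64))    # vLLM
```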
Step 4: Batching (Dynamic and Continuous)
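Dynamic batching groups waiting requests and runs the whole batch to completion, so short requests are held hostage by the longest one. Continuous (iteration-level) batching instead frees a slot the moment a sequence finishes and admits a queued request on the very next decode step. A toy step-count simulation of the difference, assuming one token decoded per request per step:

```python
from collections import deque

def static_batching(request_lengths, slots):
    """Batch runs until its longest member finishes; no mid-batch refill."""
    steps = 0
    for i in range(0, len(request_lengths), slots):
        steps += max(request_lengths[i:i + slots])
    return steps

def continuous_batching(request_lengths, slots):
    """Iteration-level scheduling: finished requests free their slot
    immediately and a queued request is admitted in the same step."""
    queue = deque(request_lengths)
    active = []   # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:   # refill freed slots
            active.append(queue.popleft())
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps

# 4 requests of very different lengths, 2 GPU slots
print(static_batching([10, 2, 2, 2], slots=2))      # 12 steps
print(continuous_batching([10, 2, 2, 2], slots=2))  # 10 steps
```

The gap widens as length variance grows, which is why vLLM and TGI schedule at the iteration level.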
Step 5: KV Cache Architecture
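KV cache per token is 2 (one K and one V tensor) × layers × KV heads × head dim × bytes per element, and it grows linearly with sequence length and batch size, which is what forces the GPU → CPU → disk tiering shown in the architecture diagram. A worked example using LLaMA-2-70B's published GQA configuration (80 layers, 8 KV heads, head dim 128):

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    # factor 2 accounts for the K and V tensors; bytes_per_elem=2 is FP16
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token)                  # 327680 bytes, i.e. 320 KiB per token
print(per_token * 4096 / 2**30)   # a full 4K context: 1.25 GiB per sequence
```

Note how GQA (8 KV heads instead of 64 query heads) already shrinks this by 8×; without it, a modest batch of long sequences would exhaust an 80 GB card on cache alone.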
Step 6: Model Sharding (Tensor and Pipeline Parallelism)
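Tensor parallelism shards a layer's weight matrix across GPUs so each device computes a slice of the output in parallel; pipeline parallelism instead assigns whole groups of layers to different devices. A pure-Python sketch of output-dimension sharding for y = Wx (the layout Megatron calls column-parallel); the final concatenation stands in for the all-gather a real system performs:

```python
def matvec(W, x):
    """Row-major matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sharded_matvec(W, x, n_devices):
    """Each 'device' owns a contiguous slice of W's rows, so it produces a
    slice of the output vector; no cross-device sum is needed."""
    chunk = len(W) // n_devices
    shards = [W[i * chunk:(i + 1) * chunk] for i in range(n_devices)]
    partial = [matvec(shard, x) for shard in shards]   # one per device
    return [y for part in partial for y in part]       # the "all-gather"

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(sharded_matvec(W, x, n_devices=2) == matvec(W, x))  # True
```

Splitting along the input dimension is also possible but requires an all-reduce of partial sums; real frameworks alternate the two layouts to minimize communication.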
Step 7: Cost Per Token Optimization
Cost per token spreads the hourly GPU bill over the tokens generated in an hour: $/1M tokens = $/hr ÷ (tokens/sec × 3600) × 10⁶. The rows below are illustrative worked examples, not benchmarks; plug in your own measured throughput and prices.

| Model     | Precision | GPU       | GPUs | $/hr  | Tokens/sec | $/1M tokens |
|-----------|-----------|-----------|------|-------|------------|-------------|
| LLaMA-7B  | FP16      | A10G      | 1    | $1.00 | 1,000      | $0.28       |
| LLaMA-70B | FP16      | A100 80GB | 2    | $8.00 | 500        | $4.44       |
| LLaMA-70B | INT4      | A100 80GB | 1    | $4.00 | 600        | $1.85       |
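The arithmetic is a one-liner; a helper evaluated on illustrative numbers (an $8/hr two-GPU setup at 500 tok/s, and a $4/hr single GPU at 600 tok/s):

```python
def cost_per_million_tokens(dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    """Spread the hourly GPU cost over the tokens generated in an hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1e6

print(round(cost_per_million_tokens(8.00, 500), 2))   # 4.44
print(round(cost_per_million_tokens(4.00, 600), 2))   # 1.85
```

Note the lever ordering this exposes: quantization that halves GPU count while keeping throughput roughly flat cuts $/1M tokens far more than shaving latency does.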
Step 8: Capstone — LLM Infrastructure Cost Calculator
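A minimal sketch of what the capstone calculator could look like, tying Steps 1, 2, and 7 together. The GPU price table and the 20% memory headroom are assumptions for illustration, and throughput must come from your own benchmarks, since this toy model does not predict it.

```python
import math

GPUS = {  # assumed representative specs: (memory_gb, dollars_per_hour)
    "A10G": (24, 1.0),
    "A100-80GB": (80, 4.0),
    "H100-80GB": (80, 8.0),
}

def plan(params_billion: float, bits: int, gpu: str,
         tokens_per_sec_per_gpu: float) -> dict:
    """How many GPUs does the model need, and what does a token cost?"""
    mem_gb, price = GPUS[gpu]
    # billions of params × bytes per param gives GB directly
    weights_gb = params_billion * bits / 8
    # assumed 20% headroom for activations and KV cache, rounded up
    n_gpus = max(1, math.ceil(weights_gb * 1.2 / mem_gb))
    hourly = n_gpus * price
    tps = n_gpus * tokens_per_sec_per_gpu
    return {"gpus": n_gpus,
            "dollars_per_hour": hourly,
            "dollars_per_million_tokens": hourly / (tps * 3600) * 1e6}

# LLaMA-70B at INT4 on A100-80GB, with a measured 600 tok/s per GPU
print(plan(70, 4, "A100-80GB", tokens_per_sec_per_gpu=600))
```

Extending this into the full capstone means adding KV-cache-aware sizing (Step 5), a batching model (Step 4), and per-provider price feeds.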
Summary
| Concept         | Key Points                                                                  |
|-----------------|-----------------------------------------------------------------------------|
| Hardware        | Memory capacity and bandwidth usually bound LLM inference before compute     |
| Quantization    | FP16 → INT8 → INT4 trades small quality loss for large memory and cost savings |
| Serving engines | vLLM/TGI for GPU throughput; llama.cpp for CPU and edge deployment           |
| Batching        | Continuous batching refills slots every decode step to keep the GPU saturated |
| KV cache        | Grows linearly with sequence length and batch size; tier GPU → CPU → disk    |
| Parallelism     | Tensor parallel splits weight matrices; pipeline parallel splits the layer stack |
| Cost            | Optimize $/1M tokens = $/hr ÷ (tokens/sec × 3600) × 10⁶                      |
