Lab 04: LLM Infrastructure Design

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

Deploying Large Language Models at scale requires specialized infrastructure decisions. This lab covers inference hardware selection, quantization strategies, serving frameworks, KV cache architecture, model parallelism, and cost optimization.

Architecture

┌────────────────────────────────────────────────────────────────┐
│                   LLM Serving Infrastructure                   │
├────────────────────────────────────────────────────────────────┤
│  Client → API Gateway → Load Balancer                         │
│                              ↓                                 │
│  vLLM / TGI / llama.cpp serving engine                        │
│  ├── Continuous Batching (PagedAttention)                      │
│  ├── KV Cache (GPU HBM → CPU → Disk tiering)                  │
│  └── Multi-GPU: Tensor Parallel / Pipeline Parallel           │
├────────────────────────────────────────────────────────────────┤
│  Hardware: H100 80GB / A100 80GB / A10G 24GB                  │
│  Quantization: FP32 → FP16 → INT8 → INT4                     │
└────────────────────────────────────────────────────────────────┘

Step 1: Inference Hardware Selection

GPU Comparison for LLM Inference:

| GPU | Memory | Bandwidth | FP16 TFLOPS | Price/hr (cloud) | Best For |
|---|---|---|---|---|---|
| H100 SXM5 | 80GB HBM3 | 3.35 TB/s | 989 | $5-8 | Largest models, max throughput |
| A100 SXM4 | 80GB HBM2e | 2.0 TB/s | 312 | $3-4 | Standard production LLMs |
| A10G | 24GB GDDR6 | 600 GB/s | 125 | $1.0-1.5 | 7B-13B models, cost-optimized |
| L40S | 48GB GDDR6 | 864 GB/s | 362 | $2-3 | Mid-size models, inference |
| CPU (96-core) | 768GB RAM | 300 GB/s | N/A | $3-5 | Low-QPS, latency-tolerant |

Hardware Selection Decision Tree:

💡 Memory bandwidth is more important than TFLOPS for LLM inference. LLMs are memory-bandwidth-bound, not compute-bound at small batch sizes.
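The bandwidth-bound rule gives a useful back-of-envelope bound: at batch size 1, every generated token streams all model weights from memory once, so decode speed can never exceed bandwidth divided by model size. A quick sketch using the bandwidth figures from the table above (the 14 GB figure assumes a 7B model in FP16):

```python
# At batch 1, each decoded token reads every weight once, so the ceiling is
# tokens/sec <= memory_bandwidth / model_size.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed (memory-bandwidth-bound)."""
    return bandwidth_gb_s / model_size_gb

# LLaMA-7B in FP16 is ~14 GB of weights.
for gpu, bw in [("H100", 3350), ("A100", 2000), ("A10G", 600)]:
    print(f"{gpu}: <= {max_tokens_per_sec(bw, 14):.0f} tokens/sec (7B FP16, batch 1)")
```

Note the ceiling is hundreds of tokens/sec even on an H100, despite ~989 TFLOPS of idle compute; batching is what converts that spare compute into throughput.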


Step 2: Quantization Strategies

Quantization stores weights at lower numerical precision, reducing memory use and increasing inference speed.

Precision Levels:

| Precision | Bits | LLaMA-70B Size | Speed Gain | Quality Loss |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | 1x (baseline) | None |
| FP16/BF16 | 16 | 140 GB | 2x | Negligible |
| INT8 | 8 | 70 GB | 3-4x | ~1% accuracy |
| INT4 | 4 | 35 GB | 5-8x | 1-3% accuracy |
| GPTQ | 4 | 35 GB | 5-8x | Best INT4 quality |
| AWQ | 4 | 35 GB | 6-10x | Slightly better than GPTQ |

Quantization Impact:

Quantization Tooling:

  • bitsandbytes: Easy INT8/INT4, Hugging Face integration

  • GPTQ: Post-training quantization, high quality

  • AWQ: Activation-aware, best quality/speed

  • llama.cpp: CPU quantization (Q4_K_M, Q5_K_M, Q8_0)
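The size column in the table above is just parameter count times bits per weight; a two-line calculator reproduces it for LLaMA-70B:

```python
# Weight memory = parameter count x bits per weight / 8 bytes.
# Reproduces the LLaMA-70B size column above (70e9 parameters).

def weight_memory_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
# FP32 -> 280 GB, FP16 -> 140 GB, INT8 -> 70 GB, INT4 -> 35 GB
```

Actual checkpoints add a few percent for embeddings kept at higher precision and quantization scales, but this estimate is what drives the GPU-count decision.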


Step 3: Serving Frameworks (vLLM, llama.cpp)

vLLM (Production LLM Serving):

Key Innovation: PagedAttention — manages KV cache like OS virtual memory

vLLM Architecture:
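The core idea can be shown with a toy sketch (this is an illustration of the block-table concept, not vLLM's actual code): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand instead of reserved for the maximum context length.

```python
BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: int, position: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:          # current block full: allocate one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:        # sequence finished: reclaim blocks
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                           # a 40-token sequence
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]), "blocks used")  # ceil(40/16) = 3 blocks
```

Because blocks are only allocated as tokens arrive, no memory is wasted on unused context, which is what lets vLLM pack far more concurrent sequences into the same GPU.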

llama.cpp (CPU/Edge):

  • Pure C++, runs on CPU (Apple Silicon, x86)

  • Quantized models (GGUF format: Q4_K_M ≈ 4GB for 7B)

  • 10-30 tokens/sec on modern CPU (vs 100+ on GPU)

  • Use for: local dev, edge deployment, cost-sensitive low-QPS

Framework Comparison:

| Framework | Hardware | Throughput | Latency | Use Case |
|---|---|---|---|---|
| vLLM | GPU | Highest | Low | Production, high QPS |
| TGI (Hugging Face) | GPU | High | Low | Production, HF ecosystem |
| llama.cpp | CPU/GPU | Low | Medium | Local, edge, dev |
| Ollama | CPU/GPU | Low | Medium | Developer experience |


Step 4: Batching (Dynamic and Continuous)

Static vs Dynamic vs Continuous Batching:

Why Continuous Batching is Critical for LLMs:
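A toy utilization model makes the argument concrete (illustrative numbers, not a benchmark): with static batching the whole batch occupies the GPU until its longest request finishes, so slots that finished early sit idle; with continuous batching a finished slot is refilled from the queue immediately.

```python
# A batch slot does useful work only while its request still has tokens
# to generate; static batching keeps all slots reserved until max(lengths).

def static_utilization(lengths: list[int]) -> float:
    steps = max(lengths)                 # batch ends when the longest request ends
    busy = sum(lengths)                  # slot-steps spent on useful work
    return busy / (steps * len(lengths))

lengths = [10, 20, 200, 15]              # output tokens per request in one batch
print(f"static batch utilization: {static_utilization(lengths):.0%}")
# With continuous batching, the three short slots are refilled after
# 10/15/20 steps instead of idling until step 200.
```

One straggler request drags utilization to ~31% here; continuous batching removes exactly this failure mode, which is why it is worth 3-10x throughput in practice.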


Step 5: KV Cache Architecture

The KV (Key-Value) cache is the most critical memory component in LLM inference.

What is KV Cache?

KV Cache for LLaMA-7B:
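Plugging LLaMA-7B's shape (32 layers, 32 heads, head dimension 128) into the formula from the summary table — 2 (K and V) × layers × heads × head_dim × seq_len × batch × bytes per element — gives the numbers that matter for capacity planning. This assumes full multi-head attention in FP16; grouped-query attention models cache fewer KV heads and shrink this considerably.

```python
def kv_cache_gb(layers: int, heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x heads x head_dim x seq x batch."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# LLaMA-7B, FP16
per_token_mb = kv_cache_gb(32, 32, 128, seq_len=1, batch=1) * 1000
print(f"~{per_token_mb:.2f} MB of KV cache per token")          # ~0.52 MB
print(f"batch 32 @ 2K context: {kv_cache_gb(32, 32, 128, 2048, 32):.0f} GB")
```

At ~0.5 MB per token, a batch of 32 sequences at 2K context consumes ~34 GB — more than the model's own 14 GB of FP16 weights, which is why KV cache, not weights, usually caps batch size.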

KV Cache Tiering:
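The tiering idea can be sketched as an LRU spillover between memory tiers (a conceptual illustration, not any framework's actual swapping code): when GPU HBM fills up, the least recently used sequence's KV blocks are swapped to CPU RAM and pulled back on demand.

```python
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()   # seq_id -> KV blob, in LRU order (hot tier)
        self.cpu = {}              # overflow tier (CPU RAM; disk would be next)
        self.gpu_capacity = gpu_capacity

    def touch(self, seq_id: int, kv=None):
        """Access a sequence's KV cache, swapping it into the GPU tier."""
        if seq_id in self.cpu:                         # swap back in on demand
            kv = self.cpu.pop(seq_id)
        elif seq_id in self.gpu:
            kv = self.gpu.pop(seq_id)
        self.gpu[seq_id] = kv                          # now most recently used
        while len(self.gpu) > self.gpu_capacity:       # evict LRU to CPU tier
            victim, blob = self.gpu.popitem(last=False)
            self.cpu[victim] = blob

cache = TieredKVCache(gpu_capacity=2)
for seq in (0, 1, 2):
    cache.touch(seq, kv=f"kv-{seq}")
print(sorted(cache.gpu), sorted(cache.cpu))  # sequence 0 evicted to CPU
```

The trade-off is latency: a swapped-in sequence pays a PCIe transfer before it can decode again, so tiering suits many idle conversations, not hot ones.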

Context Length Cost:

  • KV cache memory grows linearly with context length: 2K → 4K doubles it

  • Context 4K → 128K: 32x KV cache memory

  • Long context models (Claude, Gemini) require careful KV memory management


Step 6: Model Sharding (Tensor and Pipeline Parallelism)

When a model doesn't fit on one GPU, you must split it.

Tensor Parallelism (within a layer):

Pipeline Parallelism (across layers):
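The cost of pipeline parallelism is the "bubble" while the pipeline fills and drains. For a simple fill-and-drain (GPipe-style) schedule with S stages and M micro-batches, the idle fraction is (S − 1) / (M + S − 1), so splitting a batch into more micro-batches amortizes the bubble:

```python
# Pipeline bubble for a fill-and-drain schedule: of (M + S - 1) pipeline
# ticks, (S - 1) are spent filling/draining rather than doing useful work.

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With a single micro-batch, a 4-stage pipeline idles 75% of the time; at 64 micro-batches the bubble shrinks below 5%, which is why pipeline parallelism needs large batches to pay off.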

ZeRO (Zero Redundancy Optimizer) for Training:


Step 7: Cost Per Token Optimization

Cost Calculator Framework:

| Model | Precision | GPU | GPUs | $/hr | Tokens/sec | $/1M tokens |
|---|---|---|---|---|---|---|
| LLaMA-7B | INT8 | A10G | 1 | $1.20 | ~1000 | $0.33 |
| LLaMA-13B | FP16 | A100 | 1 | $3.00 | ~600 | $1.39 |
| LLaMA-70B | INT4 | A100 | 1 | $3.00 | ~600 | $1.39 |
| GPT-175B | INT8 | H100 | 3 | $15.00 | ~400 | $10.42 |

($/1M tokens = $/hr ÷ (tokens/sec × 3600) × 10⁶.)

Cost Optimization Strategies:

  1. Quantize: INT4 vs FP16 = 4x cost reduction, minimal quality loss

  2. Cascade models: Use 7B for simple queries, 70B only for complex

  3. Prompt caching: Cache KV for repeated system prompts (80% cost savings on prefix)

  4. Speculative decoding: Draft small model + verify with large = 2-3x faster

  5. Batch offline workloads: Spot/preemptible instances = 70% cost reduction
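Strategy 2 (cascading) is easy to quantify: the blended cost is a traffic-weighted average of the two models' per-token prices. A sketch with assumed example prices of $0.33 and $1.39 per 1M tokens for the small and large model:

```python
# Blended $/1M tokens when a fraction p_small of traffic is served by the
# small model and the rest falls through to the large one.

def blended_cost(p_small: float, cost_small: float, cost_large: float) -> float:
    return p_small * cost_small + (1 - p_small) * cost_large

c_small, c_large = 0.33, 1.39   # assumed example prices, $/1M tokens
print(f"all large model: ${c_large:.2f}/1M tokens")
print(f"80/20 cascade:   ${blended_cost(0.8, c_small, c_large):.2f}/1M tokens")
```

If a cheap router can confidently send 80% of queries to the small model, the blended cost falls to roughly 40% of the all-large baseline; the router's own accuracy is the real engineering problem.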


Step 8: Capstone — LLM Infrastructure Cost Calculator
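A minimal sketch of what such a calculator might look like (assumed structure — the lab container's actual script may differ), applying the same arithmetic as the cost table: $/1M tokens is the hourly GPU bill divided by tokens generated per hour.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    model: str
    gpus: int
    gpu_cost_hr: float     # $/hr per GPU
    tokens_per_sec: float  # aggregate throughput of the deployment

    def cost_per_million_tokens(self) -> float:
        hourly_cost = self.gpus * self.gpu_cost_hr
        tokens_per_hour = self.tokens_per_sec * 3600
        return hourly_cost / tokens_per_hour * 1e6

d = Deployment("LLaMA-7B INT8 on A10G", gpus=1,
               gpu_cost_hr=1.20, tokens_per_sec=1000)
print(f"{d.model}: ${d.cost_per_million_tokens():.2f}/1M tokens")  # $0.33
```

Extending this with the quantization and KV-cache formulas from Steps 2 and 5 (to derive GPU count and feasible batch size from the model spec) is the substance of the capstone.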

📸 Verified Output:


Summary

| Concept | Key Points |
|---|---|
| Hardware | H100 > A100 > A10G; bandwidth > TFLOPS for inference |
| Quantization | FP16 (standard), INT8 (~1% loss), INT4 (~3% loss, 4x memory savings) |
| vLLM | PagedAttention = virtual memory for KV cache, 3-10x throughput |
| Continuous Batching | New requests join mid-batch, maximizing GPU utilization |
| KV Cache | 2 × layers × heads × head_dim × seq_len × batch × 2 bytes (FP16) |
| Tensor Parallelism | Split heads across GPUs (within layer, NVLink required) |
| Pipeline Parallelism | Split layers across GPUs (micro-batch to fill pipeline) |
| Cost Optimization | Quantize → cascade → cache prompts → speculative decoding |

Next Lab: Lab 05: RAG at Scale →
