Lab 19: Distributed Training Architecture

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash


Overview

Training large language models requires distributing computation across dozens to thousands of GPUs. This lab covers every layer of the distributed training stack: data parallelism (DDP), model parallelism, pipeline parallelism, ZeRO optimizer stages, gradient compression via Top-K sparsification, mixed precision training, and communication backends. You'll also simulate gradient compression achieving 90% bandwidth reduction.

What you'll build:

  • Data parallelism simulation with 4 workers

  • Top-K gradient sparsification (90% compression)

  • ZeRO optimizer stage memory analysis

  • Mixed precision FP16/BF16 trade-offs

  • Gradient checkpointing memory savings

  • Communication backend comparison (NCCL/Gloo/MPI)


Architecture

┌─────────────────────────────────────────────────────────────┐
│              Distributed Training Stack                      │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Worker 0 │  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │   │
│  │ GPU Shard│  │ GPU Shard│  │ GPU Shard│  │ GPU Shard│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │         │
│       └──────────────┴──────────────┴──────────────┘        │
│                            │                                 │
│                    ┌───────┴────────┐                        │
│                    │  AllReduce /   │                        │
│                    │  Gradient Comp │                        │
│                    │  (NCCL/Gloo)   │                        │
│                    └───────┬────────┘                        │
│                            │                                 │
│                    ┌───────┴────────┐                        │
│                    │  ZeRO Optimizer│                        │
│                    │  Stage 0/1/2/3 │                        │
│                    └────────────────┘                        │
└─────────────────────────────────────────────────────────────┘

Step 1: Data Parallelism — DDP Fundamentals

Distributed Data Parallel (DDP) is the standard approach: each worker holds a full model copy, processes a data shard, then synchronizes gradients via AllReduce.

💡 DDP vs DP: PyTorch DistributedDataParallel (DDP) outperforms DataParallel (DP): DDP runs one process per GPU and synchronizes with NCCL AllReduce, while single-process DP scatters inputs and gathers outputs through GPU 0 every step, making GPU 0 a bottleneck. Always prefer DDP for multi-GPU training.
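
DDP's synchronization loop can be simulated without GPUs. Below is a minimal NumPy sketch with 4 workers, where AllReduce is stood in for by a plain mean; all function and variable names here are illustrative, not the torch.distributed API:

```python
import numpy as np

def all_reduce_mean(local_grads):
    """Simulate AllReduce: every worker receives the average gradient."""
    return np.mean(local_grads, axis=0)

rng = np.random.default_rng(0)
n_workers, dim = 4, 8
w = np.zeros(dim)                                # model replica (identical on all workers)
data = rng.normal(size=(n_workers, 32, dim))     # one data shard per worker
targets = data @ np.ones(dim)                    # synthetic linear targets (true weights = 1)

for step in range(100):
    local_grads = []
    for rank in range(n_workers):
        X, y = data[rank], targets[rank]
        err = X @ w - y
        local_grads.append(2 * X.T @ err / len(y))   # local MSE gradient on this shard
    g = all_reduce_mean(local_grads)             # synchronization point (AllReduce)
    w -= 0.05 * g                                # identical update on every replica

print("weights after 100 synchronized steps:", np.round(w, 2))
```

In real DDP the averaging happens via ring AllReduce overlapped with the backward pass; the convergence behavior matches this sketch because every replica applies the same averaged gradient and therefore stays bit-identical.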


Step 2: Gradient Compression — Top-K Sparsification

In large clusters, gradient communication is the bottleneck. Top-K sparsification transmits only the K largest gradients, reducing bandwidth by 90%+.


Step 3: Compression Performance Benchmark

📸 Verified Output:

💡 Fidelity Trade-off: Cosine similarity of 0.66 means compressed gradients point in roughly the same direction but miss small-magnitude components. Use error feedback (accumulate dropped gradients and add to next iteration) to recover convergence speed.
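
The 0.66 figure has a simple origin: the cosine similarity of a Top-K-sparsified vector with the original equals the square root of the retained energy fraction, and for a standard-normal gradient the top 10% of entries by magnitude carry roughly 44% of the squared norm, hence √0.44 ≈ 0.66. A sketch that reproduces the measurement and adds the error-feedback loop described in the tip (names are illustrative):

```python
import numpy as np

def topk(grad, keep):
    """Zero all but the `keep` largest-magnitude entries (dense mask form)."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -keep)[-keep:]
    out[idx] = grad[idx]
    return out

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
g = rng.normal(size=1000)
c0 = cosine(g, topk(g, 100))
print(f"cosine(full, top-100): {c0:.2f}")   # ~0.66 for standard-normal gradients

# Error feedback: dropped components are carried into the next iteration,
# so over time no gradient mass is permanently discarded.
residual = np.zeros(1000)
for step in range(3):
    g = rng.normal(size=1000)
    corrected = g + residual          # re-inject previously dropped mass
    sent = topk(corrected, 100)       # transmit only the top-100
    residual = corrected - sent       # accumulate what was dropped this round
```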


Step 4: ZeRO Optimizer Stages

ZeRO (Zero Redundancy Optimizer) partitions optimizer state, gradients, and parameters across workers to eliminate memory redundancy:
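
The per-GPU footprint under each stage can be sketched with the accounting used in the ZeRO paper: 16 bytes per parameter for mixed-precision Adam (2 bytes fp16 params + 2 bytes fp16 grads + K = 12 bytes of fp32 optimizer state). Absolute numbers depend on which baseline you assume, so they differ from the capstone's figures, but the partitioning ratios are the same:

```python
def zero_memory_gb(params_billions=7.0, n_gpus=8, k=12):
    """Per-GPU training-state memory (GB) under ZeRO stages 0-3.
    ZeRO-paper accounting: 2B fp16 params + 2B fp16 grads + K=12B fp32
    optimizer state (Adam: master params, momentum, variance)."""
    psi = params_billions  # 1e9 params at 1 byte each == 1 GB
    return {
        0: (2 + 2 + k) * psi,                  # everything replicated
        1: (2 + 2) * psi + k * psi / n_gpus,   # optimizer state sharded
        2: 2 * psi + (2 + k) * psi / n_gpus,   # + gradients sharded
        3: (2 + 2 + k) * psi / n_gpus,         # + parameters sharded
    }

for stage, gb in zero_memory_gb().items():
    print(f"ZeRO-{stage}: {gb:6.2f} GB/GPU")
```

With 8-way partitioning, Stage 3 is exactly an 8x reduction over the replicated baseline, which is the 87.5% figure in the summary table.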


Step 5: Model Parallelism and Pipeline Parallelism

When a model doesn't fit on a single GPU, split it across devices:

💡 3D Parallelism: Production LLM training (GPT-4, LLaMA) uses all three in combination: Data Parallel × Tensor Parallel × Pipeline Parallel. Megatron-LM uses TP=8 (within a node via NVLink) × PP=8 (across nodes via InfiniBand) × DP=N (across replica groups).
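
Pipeline parallelism's main cost is the "bubble": stages sit idle while the pipeline fills and drains. The standard GPipe-style estimate, (p − 1) / (m + p − 1) for p stages and m microbatches, shows why production schedules use many microbatches per step (a sketch, with illustrative names):

```python
def pipeline_bubble_fraction(stages, microbatches):
    """Idle fraction of device time in a GPipe-style schedule:
    (p - 1) / (m + p - 1) for p pipeline stages and m microbatches."""
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 8, 32, 128):
    f = pipeline_bubble_fraction(stages=8, microbatches=m)
    print(f"PP=8, microbatches={m:3d}: bubble = {f:.1%}")
```

This is the "Linear - bubble" entry in the summary table: throughput scales with the number of stages only once the microbatch count is large relative to the pipeline depth.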


Step 6: Mixed Precision and Gradient Checkpointing
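
The two techniques attack activation memory from different angles: BF16 halves the bytes per element (and, unlike FP16, shares FP32's 8-bit exponent, so no loss scaling is needed), while gradient checkpointing keeps only per-layer boundary activations and recomputes interiors during backward at roughly +33% compute. A toy memory model, with assumed constants (the "10x interior" factor is illustrative, not measured):

```python
def activation_memory_gb(layers=32, tokens=32768, hidden=4096,
                         bytes_per=2, checkpoint=False):
    """Toy activation footprint for a transformer stack.
    Assumption: interior activations per layer ~10x the boundary activation."""
    boundary = tokens * hidden * bytes_per     # one layer-boundary tensor
    interior = 10 * boundary                   # assumed interior activations/layer
    if checkpoint:
        # keep all boundaries; recompute one layer's interiors at a time
        total = layers * boundary + interior
    else:
        total = layers * (boundary + interior)
    return total / 1e9

for ckpt in (False, True):
    for b, name in ((4, "FP32"), (2, "BF16")):
        gb = activation_memory_gb(bytes_per=b, checkpoint=ckpt)
        print(f"{name}, checkpointing={ckpt}: {gb:6.1f} GB")
```

With these toy constants the checkpointing saving lands near the 60-80% range quoted in the summary; real savings depend on which activations are checkpointed and on attention-specific memory like KV tensors.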


Step 7: Communication Backend Selection
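
The selection logic reduces to a short rule of thumb, mirroring common PyTorch guidance: NCCL for CUDA tensors (GPU-direct collectives over NVLink/InfiniBand), Gloo for CPU tensors or as a debugging fallback, MPI for MPI-managed or heterogeneous HPC environments. A sketch (this helper is illustrative, not a PyTorch API):

```python
def pick_backend(device: str, heterogeneous_hpc: bool = False) -> str:
    """Rule-of-thumb communication-backend choice.
    - MPI for MPI-launched / heterogeneous HPC clusters
    - NCCL for CUDA tensors (GPU-direct ring/tree collectives)
    - Gloo for CPU tensors or as a portable fallback
    """
    if heterogeneous_hpc:
        return "mpi"
    return "nccl" if device == "cuda" else "gloo"

print(pick_backend("cuda"))                          # GPU training
print(pick_backend("cpu"))                           # CPU-only / debugging
print(pick_backend("cuda", heterogeneous_hpc=True))  # HPC with an MPI launcher
```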


Step 8: Capstone — Full Gradient Compression Pipeline

📸 Verified Output:

Top-K sparsification achieves exactly 90% compression (1000 → 100 elements per worker) at ~0.66 cosine similarity, enabling a simulated 3x bandwidth speedup. ZeRO-3 with 8-way partitioning reduces a 7B model's training-state footprint from 448 GB to 56 GB per GPU, so the job fits on 8× A100 80GB GPUs that could never hold the unpartitioned state.
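
The capstone can be condensed into one end-to-end simulation: each of 4 workers Top-K-sparsifies its 1000-element gradient, the payloads are densified and averaged (standing in for a sparse AllReduce), and the result is compared against the uncompressed average. Note the raw payload shrinks ~5x here (100 values + 100 indices vs 1000 values); the quoted 3x speedup is an end-to-end figure that also absorbs latency and densification overhead. Names are illustrative:

```python
import numpy as np

def topk(grad, keep):
    """Zero all but the `keep` largest-magnitude entries."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -keep)[-keep:]
    out[idx] = grad[idx]
    return out

rng = np.random.default_rng(0)
n_workers, dim, keep = 4, 1000, 100
grads = rng.normal(size=(n_workers, dim))

full_avg = grads.mean(axis=0)                            # uncompressed AllReduce
sparse_avg = np.mean([topk(g, keep) for g in grads], axis=0)

cos = float(full_avg @ sparse_avg /
            (np.linalg.norm(full_avg) * np.linalg.norm(sparse_avg)))
ratio = 1 - keep / dim
print(f"compression: {ratio:.0%}, cosine vs full AllReduce: {cos:.2f}")
```

In production this would be paired with the error-feedback residual from Step 3, which is what keeps convergence close to the uncompressed baseline despite the per-step fidelity loss.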


Summary

Strategy | Memory Reduction | Throughput Impact | Complexity
---------|------------------|-------------------|-----------
DDP (Data Parallel) | 0% | Linear scaling | Low
Tensor Parallelism | 1/N parameters | ~Linear (NVLink) | Medium
Pipeline Parallelism | 1/N activations | Linear - bubble | Medium
ZeRO Stage 1 | 50% (optimizer state) | None | Low
ZeRO Stage 2 | 75% (optimizer + gradients) | None | Low
ZeRO Stage 3 | 87.5% (all states) | 10-20% comm overhead | High
Top-K Grad Compression | N/A | 3x bandwidth savings | Medium
Mixed Precision (BF16) | 50% activation | 1.5-2x throughput | Low
Gradient Checkpointing | 60-80% activation | -33% compute | Low

Key Takeaways:

  • DDP is the default; add ZeRO stages as model size grows

  • Top-K compression is most effective on bandwidth-bound training (slow inter-node)

  • BF16 preferred over FP16 for training stability (same exponent range as FP32)

  • ZeRO-3 + gradient checkpointing enables 70B+ model training on commodity clusters

  • Communication backend choice: NCCL for GPU clusters, Gloo as the CPU/debugging fallback, MPI for heterogeneous HPC environments


Next: Lab 20 — Capstone: Enterprise AI Security Platform
