Training large models requires distributing computation across multiple GPUs or nodes. Data Parallelism replicates the model on each GPU and splits the data batch — each GPU computes gradients independently, then they are averaged. Tensor Parallelism splits individual weight matrices across GPUs — each computes a partition of the matrix multiplication. Pipeline Parallelism assigns different layers to different GPUs — micro-batches flow through the pipeline. Modern LLM training (GPT-4, Llama, Claude) uses all three in combination: 3D parallelism.
Real-life analogy: The factory assembly line
Data Parallelism: 8 identical factories each produce the same product from different raw materials. At the end of the day they compare notes and average their quality improvements. Tensor Parallelism: one giant machine is split across 8 factories — each handles one section of the production line simultaneously. Pipeline Parallelism: 8 factories form a sequential assembly line — factory 1 does cutting, factory 2 does shaping, factory 3 does painting. Products flow through the pipeline.
Data Parallelism — the simplest approach
Data parallelism with PyTorch DDP and FSDP
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
import os
# ── DataParallel (single-node, easy but not recommended) ──
model = nn.Linear(1024, 512)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # Splits batch across all GPUs — naive, single-process
model.cuda()

# ── DDP (Distributed Data Parallel — production standard) ──
# Each GPU has a full model copy; gradients are all-reduced (averaged)
def setup_ddp(rank: int, world_size: int):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train_ddp(rank: int, world_size: int, dataset, criterion):
    setup_ddp(rank, world_size)
    model = nn.Linear(1024, 512).to(rank)
    model = DDP(model, device_ids=[rank])  # Wrap with DDP
    # Each GPU gets a different slice of the data (DistributedSampler)
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank)
    loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=32)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    for X, y in loader:
        X, y = X.to(rank), y.to(rank)
        loss = criterion(model(X), y)
        loss.backward()  # Gradients auto-sync (all-reduce) across all GPUs
        optimizer.step()
        optimizer.zero_grad()

# Launch: torchrun --nproc-per-node=8 train.py (8 GPUs on one node)
# or:     torchrun --nnodes=4 --nproc-per-node=8 train.py (4 nodes × 8 GPUs = 32 GPUs)

# ── FSDP (Fully Sharded Data Parallel — for large models) ──
# Model weights, gradients, and optimizer states are SHARDED across GPUs
# Each GPU only holds 1/N of the model — enables models larger than single-GPU VRAM
def build_fsdp(rank: int) -> FSDP:
    model = nn.Linear(1024, 512).to(rank)
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # Shard everything
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.float32,  # Gradient reduction in FP32
        ),
        device_id=rank,
    )
# With FSDP + 8 GPUs: each GPU holds ~1/8 of a 7B model's BF16 weights
# ≈ 1.75GB per GPU (instead of the full 14GB)

# ── Unsloth + HuggingFace: automatic parallelism ──
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,   # 4 per GPU
    gradient_accumulation_steps=4,   # Effective batch = 4 × 4 GPUs × 4 accum = 64
    bf16=True,
    dataloader_num_workers=4,
    # DDP is used automatically when launched with torchrun
)

Tensor and pipeline parallelism — for models too large for DDP
| Strategy | What is split | Memory saving | Communication | Best for |
|---|---|---|---|---|
| Data Parallel (DDP) | Batch data | None (full model per GPU) | Low (only gradient all-reduce) | Models that fit on 1 GPU, many GPUs |
| FSDP | Model weights + gradients + optimizer states | Up to N× (N = GPUs) | Medium (gather weights when needed) | Large models, up to ~100B params |
| Tensor Parallel | Individual weight matrices split column/row-wise | N× | High (all-reduce in forward+backward) | 100B+ models, fast interconnect required |
| Pipeline Parallel | Layers split across GPUs | N× | Low (pass activations between stages) | 100B+ models, depth ≥ num GPUs |
| 3D Parallel (DP+TP+PP) | All of the above simultaneously | N×M×K× | All types | Frontier models (GPT-4 scale, 1T+ params) |
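To make the table's memory column concrete, here is a rough per-GPU memory estimate for DDP versus fully sharded FSDP, assuming a 7B-parameter model, BF16 weights and gradients (2 bytes/param each), and FP32 Adam states (8 bytes/param for the two moments). These are back-of-envelope numbers that ignore activations, not profiler output:

```python
def per_gpu_gb(params: float, n_gpus: int,
               shard_weights: bool, shard_grads: bool, shard_opt: bool) -> float:
    """Rough training-state memory per GPU in GB (activations ignored)."""
    w = 2 * params / (n_gpus if shard_weights else 1)  # BF16 weights
    g = 2 * params / (n_gpus if shard_grads else 1)    # BF16 gradients
    o = 8 * params / (n_gpus if shard_opt else 1)      # FP32 Adam m and v
    return (w + g + o) / 1e9

ddp = per_gpu_gb(7e9, 8, False, False, False)   # full copy on every GPU
fsdp = per_gpu_gb(7e9, 8, True, True, True)     # FULL_SHARD: everything split 8 ways
print(f"DDP: {ddp:.1f} GB/GPU, FSDP: {fsdp:.1f} GB/GPU")  # DDP: 84.0, FSDP: 10.5
```

The 8× gap is exactly the table's "up to N×" saving: DDP replicates all training state, FSDP divides it across the data-parallel group.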
How Llama-3 70B is trained in practice
Meta reports training Llama 3 on clusters of up to 16,384 NVIDIA H100 GPUs, combining FSDP-style sharded data parallelism with tensor and pipeline parallelism. The training uses BF16 compute, FP32 master weights, gradient checkpointing, and flash attention. Total compute for the 70B model: roughly 6 × 10²⁴ FLOPs over ~15T training tokens. Effective batch size: 8192 sequences of 8192 tokens ≈ 67M tokens per step. This is what "large-scale training" means — not just a big model, but sophisticated distributed infrastructure.
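These figures can be sanity-checked with the standard ~6·N·D rule of thumb for dense transformer training compute (N = parameter count, D = training tokens; the ~15T-token figure is from Meta's public reports):

```python
# Back-of-envelope training compute via the ~6 * N * D approximation
# (2 FLOPs/param for the forward pass, ~4 for backward, per token).
params = 70e9            # Llama 3 70B
tokens = 15e12           # ~15T training tokens
flops = 6 * params * tokens          # ~6.3e24 FLOPs
tokens_per_step = 8192 * 8192        # 8192 sequences x 8192 tokens
print(f"{flops:.2e} FLOPs, {tokens_per_step/1e6:.0f}M tokens/step")
```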
Practice questions
- DDP with 8 GPUs and batch_size=32 per GPU. What is the effective batch size? (Answer: 8 × 32 = 256. Each GPU processes 32 examples independently; gradients are averaged (all-reduce) across all 8 GPUs, equivalent to computing the mean gradient over 256 examples.)
- FSDP vs DDP — when would you choose FSDP? (Answer: FSDP when the model does not fit on a single GPU. FSDP shards weights, gradients, and optimizer states across GPUs. A 70B model's BF16 weights alone are 140GB; adding gradients and FP32 Adam states pushes training state toward ~800GB, so even with FSDP you need 10+ A100-80GB GPUs. DDP requires each GPU to hold the full model and all training state — fine for 7B (14GB of weights on an A100-80GB) but impossible for 70B+.)
- What is gradient accumulation and why is it used in distributed training? (Answer: Instead of updating weights every step, accumulate gradients for N steps then update. Allows effective batch_size = actual_batch × N × num_GPUs without needing more GPU memory per step. Useful when per-GPU batch must be small (due to memory) but you want stable large-batch training.)
- Tensor parallelism splits a weight matrix column-wise across GPUs. If W is 4096×4096 and you use 4-way TP: (Answer: Each GPU holds a 4096×1024 column partition W_i. During the forward pass, each GPU computes XW_i for its columns in parallel, producing 1024 of the 4096 output features; the slices are concatenated with an all-gather to form the full output. (In Megatron-style TP the gather is often deferred: a column-parallel layer feeds a row-parallel layer, whose partial sums are combined with an all-reduce.) Communication happens at every layer, requiring fast NVLink interconnect.)
- Why does pipeline parallelism create "pipeline bubbles" and how does micro-batching reduce them? (Answer: With 4 pipeline stages and 1 micro-batch: stage 1 runs, then stages 2-4 are idle (bubble). Stages 2-4 run, then stage 1 is idle. Micro-batching: split each batch into M micro-batches that flow through the pipeline like an assembly line. While stage 4 processes micro-batch 1, stage 3 processes micro-batch 2, etc. Bubble fraction = (stages-1)/(micro-batches).)
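The quantitative claims in these answers can be verified numerically. A minimal sketch using numpy, with a 512×512 matrix standing in for the 4096×4096 example:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) DDP effective batch: for MSE on a linear model, grad = 2/B * X^T (Xw - y),
#    a per-example mean, so averaging 8 per-GPU gradients (32 examples each)
#    equals the gradient over one big batch of 256.
X, y, w = rng.normal(size=(256, 10)), rng.normal(size=256), rng.normal(size=10)
grad = lambda Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
per_gpu = [grad(X[i*32:(i+1)*32], y[i*32:(i+1)*32]) for i in range(8)]
assert np.allclose(np.mean(per_gpu, axis=0), grad(X, y))  # all-reduce == big batch

# 2) Gradient accumulation: averaging 4 micro-batch gradients (batch 16 each)
#    reproduces the gradient of one batch of 64 — same update, less peak memory.
acc = sum(grad(X[i:i+16], y[i:i+16]) for i in range(0, 64, 16)) / 4
assert np.allclose(acc, grad(X[:64], y[:64]))

# 3) Column-parallel TP: local products A @ W_i concatenated (all-gather)
#    recover the full A @ W.
A, W = rng.normal(size=(8, 512)), rng.normal(size=(512, 512))
partials = [A @ Wi for Wi in np.split(W, 4, axis=1)]  # one 512x128 shard per "GPU"
assert np.allclose(np.concatenate(partials, axis=1), A @ W)

# 4) Pipeline bubble: idle time relative to useful compute is (stages - 1) / m.
bubble = lambda stages, micro_batches: (stages - 1) / micro_batches
assert bubble(4, 1) == 3.0      # 1 micro-batch: bubble is 3x the useful work
assert bubble(4, 16) == 0.1875  # 16 micro-batches shrink it to under 19%
```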
On LumiChats
Training Claude requires thousands of GPUs running 3D parallelism. Understanding these techniques explains why LLM training costs millions of dollars (massive GPU infrastructure) and why fine-tuning with LoRA (which trains only 0.1% of params) on a single GPU is so valuable — it democratises access to large model adaptation.
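The "0.1% of params" figure can be made concrete. Assuming rank-16 LoRA adapters on the query and value projections (4096×4096) of a hypothetical 32-layer, 7B-parameter model — illustrative numbers, not any specific model's exact config:

```python
# LoRA replaces the update to W (d x d) with two low-rank factors
# A (d x r) and B (r x d); only A and B are trained.
d, r, layers = 4096, 16, 32          # hidden size, LoRA rank, layer count
per_matrix = 2 * d * r               # params in A and B together
adapted_per_layer = 2                # q_proj and v_proj
lora_params = layers * adapted_per_layer * per_matrix
fraction = lora_params / 7e9
print(f"{lora_params/1e6:.1f}M trainable params = {fraction:.2%} of 7B")
```

About 8.4M trainable parameters, ~0.12% of the model — small enough for a single consumer GPU.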