Module 04: 7 concepts

Distributed Training

Parallelism strategies and distributed systems for large-scale training.

01

Data Parallelism & Distributed Data Parallel (DDP)

Data parallelism replicates the entire model on every GPU and splits each training batch across them, averaging gradients across replicas after every backward pass so that all copies apply the same update and stay in lockstep.
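The gradient synchronization can be illustrated with a single-process sketch. It is not DDP itself: the one-parameter model, the helper names (`grad_mse`, `all_reduce_mean`, `data_parallel_step`), and the equal-sized shards are all assumptions made for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(weights, shards, lr):
    """Each replica computes a local gradient on its shard, then all replicas
    apply the same averaged gradient."""
    local = [grad_mse(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]
    synced = all_reduce_mean(local)
    return [w - lr * g for w, g in zip(weights, synced)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]   # two hypothetical workers

replicas = data_parallel_step([0.0, 0.0], shards, lr=0.01)
single = 0.0 - 0.01 * grad_mse(0.0, xs, ys)     # same step on the full batch
```

With equal shard sizes, the averaged shard gradients equal the full-batch gradient, so both replicas end at the same weight as a single-device step.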

02

Tensor (Model) Parallelism

Tensor parallelism splits individual layers of a neural network across multiple GPUs, so each GPU computes only a slice of every layer’s output, enabling training of models whose single layers are too large for one device.
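One common way to slice a layer is by output columns. The sketch below simulates this in one process with plain lists; the helper names are hypothetical, and a real implementation would place each weight shard on its own GPU and use a collective all-gather instead of list concatenation.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split weight matrix w into `parts` equal column slices (one per device)."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def column_parallel_linear(x, w, parts):
    """Each 'device' multiplies the full input by its column slice; the output
    slices are then concatenated along the feature dimension (all-gather)."""
    partial = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded = column_parallel_linear(x, w, parts=2)
reference = matmul(x, w)
```

Each device stores only half of `w` here, yet the concatenated result matches the unsharded product.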

03

Pipeline Parallelism

Pipeline parallelism distributes consecutive layers of a model across different GPUs like an assembly line, using micro-batching to keep all stages busy simultaneously and minimize idle time (pipeline bubbles).
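The effect of micro-batching on idle time can be estimated with a small simulation. This is a sketch under simplifying assumptions: forward passes only, every stage taking one time slot per micro-batch, and hypothetical function names.

```python
def forward_schedule(stages, micro_batches):
    """Grid of time slots per stage: stage s runs micro-batch m in slot s + m.
    Empty cells (None) are pipeline bubbles."""
    slots = stages + micro_batches - 1
    grid = [[None] * slots for _ in range(stages)]
    for s in range(stages):
        for m in range(micro_batches):
            grid[s][s + m] = m
    return grid

def bubble_fraction(grid):
    """Share of stage-time slots in which a stage sits idle."""
    cells = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / cells

no_micro_batching = bubble_fraction(forward_schedule(stages=4, micro_batches=1))
with_micro_batching = bubble_fraction(forward_schedule(stages=4, micro_batches=12))
```

Under these assumptions the idle share is (stages - 1) / (micro_batches + stages - 1), so it shrinks as the number of micro-batches grows relative to the number of stages.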

04

ZeRO & FSDP (Fully Sharded Data Parallel)

ZeRO and FSDP eliminate the memory redundancy of data parallelism by sharding optimizer states, gradients, and parameters across GPUs, enabling training of models that no single GPU can hold while preserving the simplicity of data-parallel training.
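A rough memory estimate shows what each level of sharding saves. The byte counts below are assumptions: mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). Activations and buffers are ignored, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(params, gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0-3.
    Stage 0 is plain data parallelism (everything replicated)."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage == 0:
        return weights + grads + optim
    if stage == 1:                      # shard optimizer states
        return weights + grads + optim / gpus
    if stage == 2:                      # also shard gradients
        return weights + (grads + optim) / gpus
    if stage == 3:                      # also shard parameters
        return (weights + grads + optim) / gpus
    raise ValueError("stage must be 0, 1, 2, or 3")

# Example: a 7.5B-parameter model on 64 GPUs, reported in decimal gigabytes.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the per-GPU footprint falls from 120 GB with plain data parallelism to under 2 GB with all three model states sharded.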

05

3D Parallelism & Training at Scale

3D parallelism combines data, tensor, and pipeline parallelism into a unified strategy that maps each dimension to the hardware topology, enabling the training of the largest language models (hundreds of billions to trillions of parameters) across thousands of GPUs.
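Mapping the three dimensions onto hardware amounts to giving every GPU rank a coordinate in a data x pipeline x tensor grid. The layout below is one possible convention, assumed here for illustration: tensor-parallel ranks are innermost so that the most communication-heavy group stays within a node. The function name and group sizes are hypothetical.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (data, pipeline, tensor) indices, tensor innermost."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the tp * pp * dp grid")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

# 64 GPUs arranged as 2-way data x 4-way pipeline x 8-way tensor parallelism.
tp, pp, dp = 8, 4, 2
coords = [rank_to_coords(r, tp, pp, dp) for r in range(tp * pp * dp)]
```

With 8 GPUs per node, ranks 0 to 7 share a pipeline stage and data replica and differ only in their tensor index, so tensor-parallel traffic never leaves the node in this layout.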

06

Expert Parallelism

Expert parallelism distributes the experts of a Mixture-of-Experts (MoE) model across different GPUs, using all-to-all communication to route tokens to their assigned experts and back – enabling models with trillions of total parameters (like Switch Transformer’s 1.6T) while keeping per-token compute costs manageable through sparse activation.
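The token routing round trip can be sketched in one process. Assumptions: top-1 routing with precomputed expert assignments, toy experts that are plain functions, and hypothetical helper names; the two list shuffles stand in for the two all-to-all exchanges.

```python
def dispatch(tokens, assignments, num_experts):
    """First all-to-all: group tokens by assigned expert, remembering each
    token's original position."""
    buckets = [[] for _ in range(num_experts)]
    for pos, (tok, expert) in enumerate(zip(tokens, assignments)):
        buckets[expert].append((pos, tok))
    return buckets

def combine(buckets, experts, num_tokens):
    """Each expert processes only its own bucket; the second all-to-all
    returns outputs to their original positions."""
    out = [None] * num_tokens
    for expert_id, bucket in enumerate(buckets):
        for pos, tok in bucket:
            out[pos] = experts[expert_id](tok)
    return out

experts = [lambda x: x + 1, lambda x: x * 10]   # one toy expert per "GPU"
tokens = [1, 2, 3, 4]
assignments = [0, 1, 1, 0]                      # router output (assumed given)

buckets = dispatch(tokens, assignments, num_experts=2)
outputs = combine(buckets, experts, len(tokens))
```

Each token is processed by exactly one expert, which is why compute per token stays flat even as the number of experts, and therefore total parameters, grows.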

07

Ring Attention

Ring Attention distributes a long sequence across multiple GPUs arranged in a ring, overlapping the transfer of key-value blocks with attention computation so that the maximum context length scales near-linearly with the number of devices – supporting contexts of millions of tokens with a reported communication overhead of less than 5%.
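The block rotation can be simulated in one process. This is a sketch under stated assumptions: non-causal, unscaled dot-product attention; sequential loops in place of real devices; no actual overlap of communication and compute; hypothetical function names. Each device keeps a running maximum, normalizer, and weighted sum per query, so the softmax is accumulated correctly one key-value block at a time.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(qs, ks, vs):
    """Reference: softmax(q . k) weighted sum of values, on one device."""
    out = []
    for q in qs:
        scores = [dot(q, k) for k in ks]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        out.append([sum(wi * v[d] for wi, v in zip(w, vs)) / sum(w)
                    for d in range(len(vs[0]))])
    return out

def ring_attention(q_blocks, k_blocks, v_blocks):
    n, dim = len(q_blocks), len(v_blocks[0][0])
    # Per query: (running max, running normalizer, running weighted sum).
    state = [[(-math.inf, 0.0, [0.0] * dim) for _ in qb] for qb in q_blocks]
    kv = list(zip(k_blocks, v_blocks))          # kv[i] is held by device i
    for _ in range(n):
        for dev in range(n):
            ks, vs = kv[dev]
            for i, q in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][i]
                for k, v in zip(ks, vs):
                    s = dot(q, k)
                    new_m = max(m, s)
                    scale = math.exp(m - new_m)  # rescale earlier partial sums
                    w = math.exp(s - new_m)
                    z = z * scale + w
                    acc = [a * scale + w * x for a, x in zip(acc, v)]
                    m = new_m
                state[dev][i] = (m, z, acc)
        kv = kv[-1:] + kv[:-1]                  # pass each block to the next device
    return [[[a / z for a in acc] for (_, z, acc) in dev] for dev in state]

q_blocks = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
k_blocks = [[[1.0, 2.0]], [[0.5, -1.0]], [[2.0, 0.0]]]
v_blocks = [[[1.0, 0.0]], [[0.0, 1.0]], [[3.0, 3.0]]]

ring_out = [row for dev in ring_attention(q_blocks, k_blocks, v_blocks) for row in dev]
full_out = full_attention([q for b in q_blocks for q in b],
                          [k for b in k_blocks for k in b],
                          [v for b in v_blocks for v in b])
```

No device ever holds more than one key-value block at a time, yet after a full trip around the ring every query has attended to the whole sequence.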