Training Fundamentals

Optimization, loss functions, scaling laws, and training data.

Cross-Entropy Loss

Cross-entropy loss is the objective function that drives LLM training by measuring how “surprised” the model is by the actual next token, rooted in information theory’s concept of encoding efficiency.
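
As a minimal sketch of the "surprise" idea, the loss for one prediction is the negative log of the probability the model assigned to the token that actually occurred. The four-token vocabulary and its probabilities below are hypothetical values chosen for illustration.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability (in nats) of the token that actually occurred."""
    return -math.log(probs[target_index])

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.7, 0.2, 0.05, 0.05]

confident_loss = cross_entropy(probs, 0)  # the model expected this token: low loss
surprised_loss = cross_entropy(probs, 2)  # the model did not expect it: high loss
```

The lower the probability given to the true token, the larger the loss, which is the signal training tries to reduce.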

Backpropagation and Gradient Descent

Backpropagation is the algorithm that computes how much each parameter in a neural network contributed to the prediction error, enabling gradient descent to systematically adjust billions of parameters toward better predictions.
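
A toy illustration, assuming a one-parameter model `pred = w * x` with squared-error loss: the chain rule gives the gradient by hand, and gradient descent repeatedly steps against it. Real networks apply the same rule layer by layer. The function name and default values here are illustrative only.

```python
def train_single_weight(x, y, w=0.0, lr=0.1, steps=50):
    """Fit pred = w * x to one (x, y) pair by gradient descent on squared error."""
    for _ in range(steps):
        pred = w * x
        error = pred - y
        # Chain rule: dL/dw = dL/dpred * dpred/dw = (2 * error) * x
        grad = 2.0 * error * x
        w -= lr * grad  # step against the gradient
    return w

w = train_single_weight(x=1.0, y=3.0)  # approaches 3.0
```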

Adam and AdamW Optimizer

AdamW is the near-universal optimizer for LLM training, combining adaptive per-parameter learning rates with momentum and properly decoupled weight decay to navigate the complex, high-dimensional loss landscapes of billion-parameter models.
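
The three ingredients named above can be sketched for a single scalar parameter as follows. The hyperparameter defaults are common choices assumed for illustration, not values taken from this document.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * weight_decay * param   # decoupled decay: not routed through m or v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return param, m, v

param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

Because the decay term is applied directly to the parameter, it shrinks the weight even when the gradient is zero, which is what "decoupled" refers to.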

Learning Rate Scheduling

Learning rate scheduling – gradually warming up, then systematically decaying the learning rate during training – is a critical technique that prevents early training instability and ensures the model converges to a good minimum rather than oscillating around one.
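
One common shape for such a schedule is linear warmup followed by cosine decay. The sketch below assumes that shape, and the step counts and learning rates are illustrative defaults. Other decay curves (linear, inverse square root) fit the same description.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp up linearly so early updates stay small.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: fraction of the post-warmup phase completed, capped at 1.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```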

Gradient Clipping, Accumulation, and Checkpointing

Three essential techniques for stable, memory-efficient training: gradient clipping caps the gradient norm to prevent catastrophic parameter updates from exploding gradients, gradient accumulation simulates larger batch sizes by summing gradients over several micro-batches before each update, and gradient checkpointing trades recomputation for memory savings on stored activations.
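
The first two techniques can be sketched on plain lists of gradient values. Both helper names are hypothetical. Frameworks provide their own equivalents that operate on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

def accumulate(micro_batch_grads):
    """Average gradients from several micro-batches into one large-batch gradient."""
    n = len(micro_batch_grads)
    total = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:
        for i, g in enumerate(grads):
            total[i] += g / n
    return total

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 scaled down to norm 1
averaged = accumulate([[1.0, 2.0], [3.0, 4.0]])           # two micro-batches, one update
```

Clipping keeps the gradient's direction and only shrinks its length. Accumulation holds one micro-batch of activations in memory at a time, while the update behaves as if it came from the full batch.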

Mixed Precision Training

Mixed precision training uses lower-precision number formats (FP16 or BF16) for most computations while maintaining a master copy of weights in FP32, roughly halving the memory needed for activations and substantially increasing throughput by leveraging specialized hardware tensor cores.
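
A small demonstration of why the FP32 master copy matters, using Python's `struct` module to round values to half and single precision. The weight, update size, and step count are hypothetical. An update of 1e-4 is smaller than half the FP16 spacing near 1.0, so an FP16-only weight never moves.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def apply_updates(weight, update, steps, cast):
    """Add the same small update repeatedly, rounding the weight after each step."""
    for _ in range(steps):
        weight = cast(weight + update)
    return weight

fp16_weight = apply_updates(1.0, 1e-4, 100, to_fp16)    # every update is rounded away
master_weight = apply_updates(1.0, 1e-4, 100, to_fp32)  # updates accumulate as intended
```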

Gradient Checkpointing

Gradient checkpointing trades additional computation for dramatically reduced memory during training by selectively storing activations at checkpoint layers and recomputing intermediate values during the backward pass.

Pre-Training

Pre-training is the foundational, most expensive phase of LLM development where a model learns language, facts, reasoning, and code by predicting the next token across trillions of words of text.

Training Data Curation

Training data curation – the process of collecting, filtering, deduplicating, and mixing massive text datasets – is arguably the most underappreciated factor in LLM quality, with data quality consistently proving more important than data quantity.

Data Mixing & Domain Weighting

Data mixing – the art of choosing how much of each data source to include in training – has as much impact on model quality as architecture or scale, with optimal ratios differing substantially from natural data distributions.
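
To make the gap between chosen ratios and natural proportions concrete, the sketch below computes how many passes each source receives under a given mix. The source names, token counts, and weights are made up for illustration.

```python
def epochs_per_source(source_tokens, mix_weights, budget_tokens):
    """Passes over each source when a token budget is split by mix weight."""
    total_weight = sum(mix_weights.values())
    return {
        name: budget_tokens * (mix_weights[name] / total_weight) / source_tokens[name]
        for name in source_tokens
    }

# Hypothetical corpus: code is about 9% of the raw data but 20% of the mix.
source_tokens = {"web": 1000, "code": 100}
mix_weights = {"web": 0.8, "code": 0.2}
epochs = epochs_per_source(source_tokens, mix_weights, budget_tokens=1000)
```

Here the small code source is repeated twice while the web source is not even seen once in full, which is the kind of trade-off a mixing decision encodes.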

Curriculum Learning

Curriculum learning presents training examples in a meaningful order – typically easy to hard – rather than random order, inspired by human education, enabling better final performance and faster convergence at the same compute budget.

Scaling Laws

Scaling laws are empirically discovered power-law relationships showing that LLM performance improves predictably and smoothly as you increase model parameters, training data, and compute – enabling researchers to forecast the capabilities of models costing hundreds of millions of dollars before training them.
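
A simplified sketch of the forecasting idea, assuming a pure power law L(N) = a * N^(-alpha) in parameter count N. Published scaling laws typically add an irreducible-loss term and also model data and compute, which are omitted here. The data points are synthetic and exist only to show the fit-then-extrapolate workflow.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line in log-log space; returns (a, alpha) for L = a * N**(-alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, alpha, size):
    """Loss predicted by the fitted power law at a given model size."""
    return a * size ** (-alpha)

# Synthetic "small model" results generated from a known power law.
sizes = [1e6, 1e7, 1e8]
losses = [10.0 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
forecast = predict_loss(a, alpha, 1e9)  # extrapolate to a model 10x larger
```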

Emergent Abilities

Emergent abilities are capabilities that appear to arise suddenly and unpredictably in large language models once they cross certain scale thresholds – sparking both excitement about potential breakthroughs and deep concern about our ability to forecast and control AI systems.

Grokking

Grokking is the phenomenon where a neural network suddenly generalizes to unseen data long after it has already memorized the training set, challenging assumptions about when and how models truly learn.

Model Collapse

Model collapse is the progressive degradation of model quality that occurs when AI models are recursively trained on data generated by other AI models, causing irreversible loss of distributional diversity and rare-but-valid patterns.

Catastrophic Forgetting

Catastrophic forgetting is the phenomenon where neural networks abruptly lose previously learned knowledge when trained on new tasks or data, because gradient updates for the new task overwrite parameters critical to old tasks.

Self-Play and Self-Improvement

Self-play and self-improvement methods enable language models to bootstrap stronger capabilities from their own outputs – generating reasoning traces, filtering for correctness, and training on the successes – achieving dramatic gains like GPT-J 6B jumping from 36.6% to 72.5% on CommonsenseQA without any human-written rationales.
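
The filtering step in that loop can be sketched as below, assuming each sampled output is a (question, rationale, answer) tuple and that reference answers are available. The function name and the data are hypothetical. Generation and fine-tuning are left out.

```python
def filter_correct(samples, gold_answers):
    """Keep only the generated rationales whose final answer matches the reference."""
    kept = []
    for question, rationale, answer in samples:
        if gold_answers.get(question) == answer:
            kept.append((question, rationale, answer))
    return kept

# Hypothetical model outputs: two attempts at q1, one at q2.
samples = [
    ("q1", "rationale leading to A", "A"),
    ("q1", "rationale leading to B", "B"),
    ("q2", "rationale leading to C", "C"),
]
gold_answers = {"q1": "A", "q2": "D"}

kept = filter_correct(samples, gold_answers)  # only the first q1 attempt survives
```

The surviving tuples would form the next round's fine-tuning set, so the model is trained only on reasoning that reached a correct answer.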