RL for Language Models
RLHF, reward modeling, and RL in the LLM training pipeline.
DPO as Implicit RL
Direct Preference Optimization reframes RLHF as a supervised learning problem by deriving the optimal policy of the KL-regularized objective in closed form – eliminating the reward model, PPO loop, and value function while targeting the same optimum from the same preference data.
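As a rough illustration of the supervised objective this describes, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities. The function names, the example numbers, and the default beta of 0.1 are assumptions made for illustration; a real implementation would operate on batched tensors from the policy and a frozen reference model.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and the frozen reference model. beta is the
    KL-regularization strength (0.1 is an assumed illustrative value).
    """
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the margin between the two implicit rewards.
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's likelihood relative to the rejected one lowers it.
loss_after_update = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model or value function appears anywhere in the computation: the only learned quantity is the policy's own log-probability.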
GRPO
DeepSeek’s Group Relative Policy Optimization eliminates the value function entirely by estimating advantages from groups of sampled outputs – a critic-free RL algorithm that is simpler and cheaper than PPO and that powered the reasoning breakthroughs in DeepSeek-R1.
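A minimal sketch of the group-relative advantage estimate follows: sample several outputs for one prompt, score each, and normalize every reward against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the eps constant are assumptions for illustration, not details taken from this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for outputs sampled from the same prompt.
    eps: small constant (assumed) that avoids division by zero when
         every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, two judged correct (1.0) and two not (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The group mean plays the role of the baseline that a learned critic would otherwise provide, which is why no value network is needed. When all outputs in a group score the same, every advantage is zero and that prompt contributes no gradient.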
PPO for Language Models
Adapting Proximal Policy Optimization from game environments to text generation – where actions are tokens, episodes are sequences, rewards arrive only at the end, and four full-sized neural networks must coexist in GPU memory.
Reward Modeling for LLMs
Training a neural network to predict human preferences from pairwise comparisons – the critical bottleneck in LLM alignment where Goodhart’s Law meets the impossibility of specifying what “good” means mathematically.
RLAIF and Constitutional AI
Replacing human annotators with AI-generated feedback guided by explicit principles for scalable alignment – reducing the cost per preference comparison from $1–10 to approximately $0.001 while achieving comparable quality.
RLHF Pipeline
The three-stage process (SFT, reward model, PPO) that transformed language models from text predictors into aligned assistants – the alignment breakthrough in which outputs from a 1.3B-parameter RLHF model were preferred by human raters over those of the 175B-parameter GPT-3.
RLVR
Reinforcement Learning with Verifiable Rewards uses objectively checkable outcomes – correct math answers, passing code tests, provable logical validity – as reward signals, completely bypassing learned reward models and their susceptibility to Goodhart’s Law.
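To make the idea of a verifiable reward concrete, here is a small sketch of a rule-based checker for math answers. The "Answer:" marker convention, the function names, and the exact-string comparison are all hypothetical choices for illustration; real checkers typically normalize answers or run test suites.

```python
import re


def extract_final_answer(completion: str):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption of this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference exactly."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if answer == reference.strip() else 0.0


reward_correct = verifiable_reward("2 + 2 is 4, times 10 is 40.\nAnswer: 40", "40")
reward_wrong = verifiable_reward("I think it is 41.\nAnswer: 41", "40")
reward_missing = verifiable_reward("The result is forty.", "40")
```

Because the reward is computed by a fixed rule rather than a learned model, there is no reward network for the policy to exploit, although a checker this strict can still penalize correct answers written in a different format.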