RL for Language Models

RLHF, reward modeling, and RL in the LLM training pipeline.

DPO as Implicit RL

Direct Preference Optimization reframes RLHF as a supervised learning problem by deriving the optimal policy in closed form – eliminating the reward model, PPO loop, and value function while optimizing the same KL-regularized objective from the same preference data.
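
As a rough illustration of that supervised objective, the per-pair DPO loss can be sketched in a few lines. The function and argument names and the default `beta=0.1` are illustrative assumptions; the inputs are taken to be summed token log-probabilities of each response under the trainable policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred response relative to the reference.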

GRPO

DeepSeek’s Group Relative Policy Optimization eliminates the value function entirely by estimating advantages from groups of sampled outputs – a critic-free RL algorithm that is simpler, cheaper, and powered the reasoning breakthroughs in DeepSeek-R1.
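
A minimal sketch of the group-relative advantage estimate, assuming one scalar reward per sampled output for a single prompt. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration, not details taken from this page.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own group; no critic needed."""
    mu = mean(rewards)        # the group mean plays the role of the baseline
    sigma = pstdev(rewards)   # population std of the group (an assumption)
    # eps avoids division by zero when every output received the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that beat the group average receive positive advantages and the rest negative ones; if every output in the group gets the same reward, all advantages are zero and the group contributes no learning signal.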

PPO for Language Models

Adapting Proximal Policy Optimization from game environments to text generation – where actions are tokens, episodes are sequences, rewards arrive only at the end, and four full-sized neural networks must coexist in GPU memory.

Reward Modeling for LLMs

Training a neural network to predict human preferences from pairwise comparisons – the critical bottleneck in LLM alignment where Goodhart’s Law meets the impossibility of specifying what “good” means mathematically.
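
Pairwise training of this kind is commonly framed with a Bradley-Terry model, in which the probability that one response is preferred depends only on the difference between two scalar rewards. The sketch below assumes that framing; the function names are hypothetical, and the scalar inputs stand in for the outputs of a reward network.

```python
import math

def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    d = r_chosen - r_rejected
    # Numerically stable sigmoid of the reward difference.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the human label: -log(sigmoid(r_chosen - r_rejected))."""
    x = -(r_chosen - r_rejected)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Only the reward difference matters, so the absolute scale of the scores is unconstrained by the loss. Equal rewards give a preference probability of 0.5 and a loss of log 2.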

RLAIF and Constitutional AI

Replacing human annotators with AI-generated feedback guided by explicit principles for scalable alignment – reducing the cost per preference comparison from $1–$10 to approximately $0.001 while achieving comparable quality.

RLHF Pipeline

The three-stage process (SFT, reward model, PPO) that transformed language models from text predictors into aligned assistants – the alignment breakthrough where outputs from a 1.3B parameter RLHF model were preferred by human raters over those from a 175B parameter model trained without RLHF.

RLVR

Reinforcement Learning with Verifiable Rewards uses objectively checkable outcomes – correct math answers, passing code tests, provable logical validity – as reward signals, completely bypassing learned reward models and their susceptibility to Goodhart’s Law.