Alignment & Post-Training

RLHF, DPO, reward modeling, and preference learning.

Supervised Fine-Tuning (SFT) & Instruction Tuning

Supervised fine-tuning transforms a raw language model that merely predicts the next token into an assistant that can follow instructions, by training it on curated (instruction, response) pairs.

RLHF (Reinforcement Learning from Human Feedback)

RLHF aligns language models with human preferences by training a reward model on human comparisons, then using reinforcement learning to optimize the language model’s outputs against that reward signal – while a KL penalty keeps it from straying too far from its original behavior.
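As a minimal sketch of the KL penalty mentioned above, the per-response training signal can be written as the reward model score minus a scaled estimate of how far the policy has drifted from the reference model. The function name, the argument names, and the default `beta` are illustrative assumptions, not taken from any particular library.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the policy and the frozen reference model.
    """
    # Summing per-token log-ratios over a sampled response gives a
    # single-sample estimate of KL(policy || reference).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# The policy assigns higher probability than the reference to both tokens,
# so the shaped reward ends up below the raw reward-model score of 1.0.
print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5]))
```

A larger `beta` keeps the policy closer to its original behavior at the cost of slower reward improvement.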

Reward Modeling

Reward modeling trains a neural network to predict human preferences over model outputs, producing a scalar score that serves as the optimization signal for reinforcement learning from human feedback. Because the policy is optimized against this score, reward model quality is often the main bottleneck in the alignment pipeline.
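One common way to train on human comparisons is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The sketch below assumes the reward model has already produced two scalar scores; the function names are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)) for one comparison."""
    # -log(sigmoid(m)) equals softplus(-m), with m = chosen - rejected.
    return softplus(score_rejected - score_chosen)


# The loss is small when the chosen response is scored higher,
# and large when the ranking is inverted.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```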

Process Reward Models (PRMs) vs. Outcome Reward Models (ORMs)

Process reward models evaluate each intermediate reasoning step for correctness, while outcome reward models only evaluate the final answer – a distinction that fundamentally changes how AI systems learn to reason, moving from “did you get the right answer?” to “did you reason correctly?”
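The difference can be illustrated with a toy example in which a solution reaches the right answer through a flawed step. The per-step scores here are made-up numbers standing in for PRM outputs, and taking the minimum over steps is only one possible aggregation, assumed for illustration.

```python
def outcome_reward(final_answer, reference_answer):
    """ORM-style signal: looks only at the final answer."""
    return 1.0 if final_answer == reference_answer else 0.0


def process_reward(step_scores):
    """PRM-style signal: aggregate per-step correctness scores.

    Using min() treats a reasoning chain as only as good as its weakest
    step; this aggregation is an assumption for the sketch.
    """
    return min(step_scores)


# Hypothetical solution: correct final answer, but step 2 is unsound.
step_scores = [0.95, 0.20, 0.90]
print(outcome_reward("42", "42"))   # sees nothing wrong
print(process_reward(step_scores))  # flags the weak step
```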

Direct Preference Optimization (DPO)

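DPO collapses the RLHF pipeline (reward model training followed by RL optimization) into a single supervised learning step. It shows that the KL-constrained RLHF objective has a closed-form optimal policy, so the policy can be trained directly on preference pairs with a simple classification loss, using a frozen reference model in place of an explicit reward model.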
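The classification loss can be sketched for a single preference pair as follows. The inputs are assumed to be summed log-probabilities of each full response under the trained policy and under a frozen reference model; the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of response log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Raising the chosen response's likelihood relative to the reference
# lowers the loss compared with an unchanged policy.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```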

Rejection Sampling in Alignment

Rejection sampling (Best-of-N) generates N candidate responses from a language model, scores each with a reward model, and selects the highest-scoring output. This provides an implicit KL-constrained policy improvement; it was used extensively in Llama 2’s alignment and often matches PPO while being far simpler.
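The generate, score, select loop is short enough to sketch directly. Here `generate` and `score` are hypothetical callables standing in for a language model sampler and a reward model.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


# Toy stand-ins: a fixed stream of "samples" and a length-based "reward".
samples = iter(["ok", "a longer answer", "short", "mid-size"])
best = best_of_n(
    "Explain RLHF.",
    generate=lambda prompt: next(samples),
    score=lambda prompt, response: len(response),
    n=4,
)
print(best)
```

The selected outputs can be returned directly at inference time or collected as fine-tuning data for a further training round.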

Preference Learning Variants

Alternatives to DPO that reduce data requirements, simplify training pipelines, or improve robustness – each trading off different aspects of preference optimization.

GRPO (Group Relative Policy Optimization)

GRPO is a reinforcement learning algorithm developed by DeepSeek that eliminates the critic (value) model entirely by estimating advantages through group-based relative scoring of multiple sampled outputs – dramatically reducing memory requirements while achieving stable, effective policy optimization.
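The group-based advantage estimate can be sketched as normalizing each sampled output's reward against the other outputs for the same prompt. Using the population standard deviation and the small `eps` term are assumptions of this sketch; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its own group.

    rewards: scalar rewards for several outputs sampled from one prompt.
    No learned value (critic) model is involved; the group mean acts as
    the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, two of which were rewarded.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every output in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.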

RLAIF (Reinforcement Learning from AI Feedback)

RLAIF replaces human annotators with AI models in the preference labeling stage of RLHF, using techniques like position debiasing and self-consistency voting to generate preference data that approaches human-label quality at a fraction of the cost: approximately $0.001 per comparison versus $1-10 for human annotators.
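Position debiasing can be sketched as asking the AI judge twice with the candidate order swapped and keeping the label only when both verdicts agree. The `judge` callable and its `"first"`/`"second"` return values are hypothetical stand-ins for an LLM call.

```python
def debiased_preference(judge, prompt, response_a, response_b):
    """Return "a", "b", or None if the judge's verdict depends on order.

    judge(prompt, first, second) is assumed to return "first" or "second".
    """
    verdict_ab = judge(prompt, response_a, response_b)
    verdict_ba = judge(prompt, response_b, response_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return None  # inconsistent across orderings: discard the comparison


def length_judge(prompt, first, second):
    """Toy judge that prefers the longer response regardless of position."""
    return "first" if len(first) >= len(second) else "second"


def biased_judge(prompt, first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"


print(debiased_preference(length_judge, "q", "detailed answer", "meh"))
print(debiased_preference(biased_judge, "q", "detailed answer", "meh"))
```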

Constitutional AI (CAI)

Constitutional AI aligns language models by replacing human preference labels with AI-generated feedback guided by an explicit set of principles (a “constitution”), making the alignment process more scalable, transparent, and auditable.

Synthetic Data for Training

Synthetic data generation uses existing LLMs to create training data for other (often smaller) models, offering a scalable path around the “data wall” but introducing risks of model collapse, reduced diversity, and inherited biases.

RLVR (Reinforcement Learning with Verifiable Rewards)

RLVR trains language models using reinforcement learning where the reward signal comes from objectively verifiable outcomes, such as whether a math answer is correct or code passes tests. This avoids much of the reward hacking (Goodhart’s Law) that learned reward models invite and tends to produce models with stronger reasoning on verifiable tasks.
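A verifiable reward for math can be as simple as an exact-match check on the final answer. The `Answer:` marker and the function name are assumptions made for this sketch; real setups typically add answer normalization or run unit tests for code.

```python
def verifiable_math_reward(completion, reference, marker="Answer:"):
    """Return 1.0 if the text after the last marker equals the reference."""
    if marker not in completion:
        return 0.0  # no parseable final answer, so no reward
    predicted = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference.strip() else 0.0


print(verifiable_math_reward("2 + 2 is four. Answer: 4", "4"))
print(verifiable_math_reward("I think it is five. Answer: 5", "4"))
```

Because the check is a fixed rule rather than a learned model, the policy cannot raise its reward by exploiting scoring errors in a reward model, only by producing the correct answer in the expected format.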

Chain-of-Thought Training & Reasoning Models

Chain-of-thought has evolved from a simple prompting trick into a full training paradigm, where models like OpenAI’s o1/o3 and DeepSeek-R1 are explicitly trained to produce extended internal reasoning before answering – representing a fundamental shift from “System 1” to “System 2” thinking in AI.