Policy Gradient Methods

REINFORCE, PPO, A2C, and actor-critic methods.

A2C and A3C

Parallel actor-critic training across multiple environment workers – A3C applies asynchronous gradient updates from each worker to decorrelate the training data, while A2C’s synchronous batching often matches its performance and makes better use of GPUs.

Actor-Critic Methods

An architecture that pairs a policy (the actor) with a learned value function (the critic), often as two networks or as two heads on a shared network, to reduce the high variance of pure policy gradient methods in exchange for some bias from the critic’s imperfect value estimates.

Advantage Estimation

Methods for estimating how much better a specific action is than the policy’s expected behavior in a given state – the key signal that drives stable, efficient policy gradient updates.
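
One common estimator is generalized advantage estimation (GAE), which blends one-step temporal-difference errors across time. The sketch below is a minimal illustration, not a reference implementation: the function name `gae_advantages`, the argument layout (one more value than rewards, with the last value used for bootstrapping), and the default `gamma` and `lam` are assumptions made for the example.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for a single trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. T + 1 entries; the last entry is the
             bootstrap value (use 0.0 if s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step went than the critic expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` this reduces to the one-step TD error (low variance, more bias); with `lam=1` it becomes the discounted return minus the value baseline (higher variance, less bias).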

Entropy Regularization

Adding a policy entropy bonus to the optimization objective to encourage exploration, prevent premature convergence to deterministic policies, and improve robustness – a simple technique with deep connections to maximum entropy RL.
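
For a discrete action space, the bonus can be sketched as follows. The names `categorical_entropy` and `regularized_loss` and the coefficient `entropy_coef=0.01` are illustrative assumptions; the loss is written for minimization, so the entropy term is subtracted.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Zero-probability actions contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(policy_loss, probs, entropy_coef=0.01):
    """Policy loss (to be minimized) with an entropy bonus subtracted."""
    # Higher entropy lowers the loss, so the optimizer is nudged away
    # from collapsing onto a single action too early.
    return policy_loss - entropy_coef * categorical_entropy(probs)
```

A uniform policy over two actions has entropy ln 2, while a deterministic policy has entropy 0 and so receives no bonus.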

Policy Gradient Theorem

The mathematical foundation that enables direct optimization of parameterized policies via gradient ascent on expected return, bypassing the need to differentiate through unknown environment dynamics.
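
In one commonly used form, for an undiscounted episodic task of length T, the result can be written as below. The symbols are assumptions introduced for this illustration: \theta for the policy parameters, \pi_\theta for the policy, \tau for a sampled trajectory, and G_t for the reward accumulated from step t onward.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} r_t \right],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t
    \right],
\qquad
G_t = \sum_{k=t}^{T-1} r_k .
```

The transition probabilities appear nowhere in the gradient expression: only the policy's own log-probabilities are differentiated, which is why the environment dynamics can remain unknown.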

Proximal Policy Optimization (PPO)

An algorithm built on a clipped surrogate objective that approximates a trust region constraint using only first-order optimization – one of the most widely used algorithms in modern reinforcement learning and a common engine behind RLHF for large language models.
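
The clipped objective for a single sample can be sketched as follows. The function name `clipped_surrogate` and the default `clip_eps=0.2` are assumptions for the example; `ratio` stands for the new policy's probability of the action divided by the old policy's probability.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    # Restrict the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # beyond the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In training, this quantity would typically be averaged over a batch and negated to form a loss. Note that the minimum keeps the unclipped term when it is the more pessimistic one, for example a ratio of 1.5 with a negative advantage.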

REINFORCE

The simplest policy gradient algorithm – sample a complete trajectory, weight each action’s log-probability by the return that followed it, and update the policy in the direction that reinforces successful behavior.
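
The weighting step can be sketched in plain Python. The names `returns_to_go` and `reinforce_loss` are assumptions for the example, and the code only computes the scalar surrogate loss from already-recorded log-probabilities; in practice an automatic differentiation library would differentiate this quantity with respect to the policy parameters.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted return following each time step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end so each entry is r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, rewards, gamma=1.0):
    """Surrogate loss whose gradient is the negated REINFORCE estimate."""
    returns = returns_to_go(rewards, gamma)
    # Each action's log-probability is weighted by the return that followed it.
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

Minimizing this loss raises the log-probability of actions that were followed by high returns.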

Trust Region Methods

Constraining each policy update to a “trust region” where the local approximation is reliable, preventing the catastrophic performance collapses that plague unconstrained policy gradients – realized through TRPO and natural policy gradients.
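
For discrete actions, the size of an update is usually measured by the KL divergence from the old policy to the new one. The sketch below only checks whether a proposed update stays inside the region; it is not TRPO itself, which also solves a constrained optimization problem to find the step. The names `categorical_kl` and `within_trust_region` and the threshold `max_kl=0.01` are assumptions for the example.

```python
import math


def categorical_kl(old_probs, new_probs):
    """KL(old || new) between two discrete action distributions (in nats)."""
    # Assumes new_probs is strictly positive wherever old_probs is.
    return sum(
        p * math.log(p / q)
        for p, q in zip(old_probs, new_probs)
        if p > 0.0
    )


def within_trust_region(old_probs, new_probs, max_kl=0.01):
    """True if the proposed policy stays within the KL budget."""
    return categorical_kl(old_probs, new_probs) <= max_kl
```

A rejected update would typically be scaled back, for example by a backtracking line search, until the check passes.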