BD Brain Drip
Foundations

Bellman Equations

The recursive decomposition of value into immediate reward plus discounted future value – the fundamental identity of RL.

Prerequisites | markov-decision-processes.md return-and-discount-factor.md policies.md value-functions.md.

What Are the Bellman Equations?

Imagine planning a cross-country road trip. You don’t need to evaluate the entire route at once. Instead, you reason: “The value of being in Denver equals the pleasure of driving the next leg to Salt Lake City, plus the value of being in Salt Lake City.” You decompose a global evaluation into a local step plus the value of where you end up. That recursive insight – the value of now equals the reward of the next step plus the discounted value of later – is the Bellman equation.

Named after Richard Bellman (1957), these equations are the mathematical backbone of nearly every RL algorithm. They convert the problem of evaluating an infinite sum of future rewards into a system of simultaneous equations, one for each state.

How It Works

Bellman Expectation Equation for VπV^\pi

Starting from the definition of the state-value function and using the recursive structure of the return (Gt=Rt+1+γGt+1G_t = R_{t+1} + \gamma G_{t+1}):

Vπ(s)=Eπ[GtSt=s]=Eπ[Rt+1+γGt+1St=s]V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]

Expanding the expectation over actions and next states:

Vπ(s)=aπ(as)sP(ss,a)[R(s,a,s)+γVπ(s)]V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]

This says: the value of state ss under policy π\pi equals the expected immediate reward plus the discounted value of the next state, averaged over the policy’s action choices and the environment’s stochastic transitions.

Bellman Expectation Equation for QπQ^\pi

Similarly, for the action-value function:

Qπ(s,a)=sP(ss,a)[R(s,a,s)+γaπ(as)Qπ(s,a)]Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') \, Q^\pi(s', a') \right]

This decomposes QπQ^\pi into the immediate reward plus the discounted Q-value of the next state-action pair, weighted by the policy at ss'.

The Four Bellman Equations (Summary)

EquationRecursive Form
VπV^\piVπ(s)=aπ(as)[r(s,a)+γsP(ss,a)Vπ(s)]V^\pi(s) = \sum_a \pi(a \mid s) \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') \right]
QπQ^\piQπ(s,a)=r(s,a)+γsP(ss,a)aπ(as)Qπ(s,a)Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s') Q^\pi(s',a')
VV^*V(s)=maxa[r(s,a)+γsP(ss,a)V(s)]V^*(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s') \right]
QQ^*Q(s,a)=r(s,a)+γsP(ss,a)maxaQ(s,a)Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q^*(s',a')

Bellman Optimality Equations

The Bellman optimality equation for VV^* replaces the policy average with a maximization:

V(s)=maxa[R(s,a)+γsP(ss,a)V(s)]V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]

For QQ^*:

Q(s,a)=R(s,a)+γsP(ss,a)maxaQ(s,a)Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')

The key difference: the expectation equations use π\pi to weight actions, while the optimality equations use max\max. The optimality equations are nonlinear (due to the max\max) and generally cannot be solved in closed form.

Derivation Sketch

Starting from Vπ(s)=Eπ[GtSt=s]V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]:

  1. Substitute Gt=Rt+1+γGt+1G_t = R_{t+1} + \gamma G_{t+1}
  2. Apply the law of total expectation, conditioning on AtA_t and St+1S_{t+1}
  3. Use the Markov property: E[Gt+1St+1=s]=Vπ(s)\mathbb{E}[G_{t+1} \mid S_{t+1} = s'] = V^\pi(s')
  4. The result is the Bellman expectation equation

For the optimality equation, replace the policy average aπ(as)[]\sum_a \pi(a \mid s) [\cdots] with maxa[]\max_a [\cdots], reflecting that the optimal policy always selects the best action.

Matrix Form (Finite MDPs)

For a finite MDP with n=Sn = |\mathcal{S}| states, the Bellman expectation equation for VπV^\pi is a linear system:

vπ=rπ+γPπvπ\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi

where vπRn\mathbf{v}^\pi \in \mathbb{R}^n, rπRn\mathbf{r}^\pi \in \mathbb{R}^n is the expected reward vector, and PπRn×n\mathbf{P}^\pi \in \mathbb{R}^{n \times n} is the state transition matrix under π\pi. The closed-form solution is:

vπ=(IγPπ)1rπ\mathbf{v}^\pi = (\mathbf{I} - \gamma \mathbf{P}^\pi)^{-1} \mathbf{r}^\pi

This requires O(n3)O(n^3) computation for the matrix inverse, which is feasible for small MDPs but intractable for large ones (n=106n = 10^6 states would require 101810^{18} operations).

The Bellman Operator

Define the Bellman operator TπT^\pi for policy evaluation:

(TπV)(s)=aπ(as)[R(s,a)+γsP(ss,a)V(s)](T^\pi V)(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]

And the Bellman optimality operator TT^*:

(TV)(s)=maxa[R(s,a)+γsP(ss,a)V(s)](T^* V)(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]

Both operators are γ\gamma-contractions in the \ell_\infty norm:

TπVTπVγVV\| T^\pi V - T^\pi V' \|_\infty \leq \gamma \| V - V' \|_\infty

By the Banach fixed-point theorem, each operator has a unique fixed point, and repeated application converges to it. This is the theoretical foundation for:

  • Policy evaluation (iterative application of TπT^\pi) converging to VπV^\pi.
  • Value iteration (iterative application of TT^*) converging to VV^*.

Algorithms Built on Bellman Equations

AlgorithmBellman Equation UsedApproach
Policy EvaluationVπV^\pi expectationIterative TπT^\pi
Value IterationVV^* optimalityIterative TT^*
Q-LearningQQ^* optimalitySample-based TT^*
SARSAQπQ^\pi expectationSample-based TπT^\pi
TD(0)VπV^\pi expectationSample-based TπT^\pi

Why It Matters

The Bellman equations transform the RL problem from “evaluate an infinite sum of future rewards” to “solve a system of equations relating neighboring states.” This recursive structure enables:

  • Dynamic programming algorithms that solve MDPs exactly when the model is known.
  • Temporal-difference learning that bootstraps (estimates values from other estimates), enabling online, incremental learning.
  • Convergence guarantees through the contraction mapping property.

Without Bellman equations, RL would require evaluating complete trajectories for every value estimate – the Bellman decomposition makes tractable learning possible.

Key Technical Details

  • Value iteration converges at rate O(γk)O(\gamma^k): after kk iterations, VkVγkV0V\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty. For γ=0.99\gamma = 0.99, reaching ϵ\epsilon-accuracy requires klog(1/ϵ)1γ100log(1/ϵ)k \approx \frac{\log(1/\epsilon)}{1 - \gamma} \approx 100 \log(1/\epsilon) iterations.
  • Policy iteration alternates between solving VπV^\pi (policy evaluation) and updating π\pi greedily (policy improvement). It converges in at most AS|\mathcal{A}|^{|\mathcal{S}|} iterations (the number of deterministic policies), but in practice converges much faster.
  • The deadly triad: Combining bootstrapping (Bellman equations), function approximation, and off-policy learning can cause divergence. This fundamental tension motivates algorithms like fitted Q-iteration, gradient TD, and emphatic TD.
  • Q-learning’s update rule is a stochastic approximation to the Bellman optimality equation: Q(s,a)Q(s,a)+α[r+γmaxaQ(s,a)Q(s,a)]Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)].

Common Misconceptions

“The Bellman equation is an update rule.” The Bellman equation is an identity – a condition that the true value function must satisfy. Algorithms like TD learning and value iteration use update rules inspired by the Bellman equation, but the equation itself is a mathematical relationship, not a procedure.

“You need the transition model to use Bellman equations.” Model-free methods (TD, Q-learning, SARSA) use sample-based Bellman updates, replacing the expectation over transitions with single observed transitions. The model is not required.

“Bellman optimality equations can be solved directly.” The max\max operator makes them nonlinear. Unlike the expectation equations, there is no closed-form solution. They must be solved iteratively (value iteration) or via linear programming.

Connections to Other Concepts

  • Value Functions – Bellman equations define the recursive structure of value functions.
  • Return And Discount Factor – The recursive decomposition Gt=Rt+1+γGt+1G_t = R_{t+1} + \gamma G_{t+1} is the starting point.
  • Policies – The expectation equations depend on π\pi; the optimality equations define π\pi^*.
  • Markov Decision Processes – Bellman equations exploit the Markov property.
  • Exploration Vs Exploitation – Q-learning (based on Bellman optimality) is combined with exploration strategies.

Further Reading

  • Bellman (1957)Dynamic Programming. The origin of the principle of optimality and the Bellman equation.
  • Sutton & Barto (2018)Reinforcement Learning: An Introduction, Chapters 3-4. Derivation and use of Bellman equations in dynamic programming.
  • Bertsekas (2012)Dynamic Programming and Optimal Control (4th edition). Rigorous treatment of Bellman equations, contraction mappings, and convergence theory.
  • Watkins & Dayan (1992) – “Q-learning.” Machine Learning, 8. Proves convergence of Q-learning to the Bellman optimality fixed point. [Scholar]
  • Tsitsiklis & Van Roy (1997) – “An analysis of temporal-difference learning with function approximation.” IEEE Transactions on Automatic Control. Foundational analysis of TD learning and Bellman equation approximation. [Scholar]