BD Brain Drip
Tabular Methods

N-Step Methods

Bridging Monte Carlo and TD by bootstrapping after n steps – tunable bias-variance trade-off.

Prerequisites | temporal-difference-learning.md monte-carlo-methods.md return-and-discount-factor.md value-functions.md

What Is N-Step Methods?

Imagine predicting the weather. A one-step forecast (“what will tomorrow be like based on today?”) is low variance but relies heavily on your current model’s accuracy. A full-season forecast (“average of all days this season”) uses real data but is noisy. An nn-day forecast – looking ahead a week before trusting your model – balances these extremes. N-step methods in RL work identically: instead of bootstrapping after one step (TD) or waiting until the end of the episode (MC), they wait nn steps, using real observed rewards for those steps and then bootstrapping from the estimated value at step nn.

N-step methods provide a continuous, tunable bridge between TD(0) at one extreme (n=1n = 1) and Monte Carlo at the other (n=n = \infty). The parameter nn controls the bias-variance trade-off: small nn gives low variance but higher bias (from bootstrapping off potentially inaccurate estimates), while large nn gives lower bias but higher variance (from accumulating stochastic rewards).

How It Works

The N-Step Return

The nn-step return from time tt is defined as:

Gt:t+n=Rt+1+γRt+2+γ2Rt+3++γn1Rt+n+γnV(St+n)G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

More compactly:

Gt:t+n=k=0n1γkRt+k+1+γnV(St+n)G_{t:t+n} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})

Special cases reveal the unifying nature:

  • n=1n = 1: Gt:t+1=Rt+1+γV(St+1)G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}) – the TD(0) target.
  • n=Ttn = T - t (remaining episode length): Gt:T=k=0Tt1γkRt+k+1G_{t:T} = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} – the full MC return (with V(ST)=0V(S_T) = 0 for terminal states).

N-Step TD Prediction

The value update uses the nn-step return as the target:

V(St)V(St)+α[Gt:t+nV(St)]V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]

This update can only be made at time t+nt + n (when Rt+nR_{t+n} and St+nS_{t+n} are known), introducing a delay of nn steps between experiencing a transition and updating the value.

N-Step TD Prediction:
Initialize V(s) arbitrarily
For each episode:
    Store S_0; T <- infinity
    For t = 0, 1, 2, ...:
        If t < T:
            Take action, observe R_{t+1}, S_{t+1}
            If S_{t+1} is terminal: T <- t + 1
        tau <- t - n + 1  (time of state being updated)
        If tau >= 0:
            G <- sum_{k=tau+1}^{min(tau+n, T)} gamma^{k-tau-1} * R_k
            If tau + n < T: G <- G + gamma^n * V(S_{tau+n})
            V(S_tau) <- V(S_tau) + alpha * [G - V(S_tau)]
    until tau = T - 1

Recommended visual: Performance of nn-step TD on the 19-state random walk as a function of nn and α\alpha. Source: Sutton & Barto, Reinforcement Learning: An Introduction, Figure 7.2.

N-Step SARSA

Extending to control, nn-step SARSA uses action-value nn-step returns:

Gt:t+n=Rt+1+γRt+2++γn1Rt+n+γnQ(St+n,At+n)G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})

The update becomes:

Q(St,At)Q(St,At)+α[Gt:t+nQ(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right]

This naturally extends to nn-step Expected SARSA by replacing Q(St+n,At+n)Q(S_{t+n}, A_{t+n}) with aπ(aSt+n)Q(St+n,a)\sum_a \pi(a \mid S_{t+n}) Q(S_{t+n}, a).

N-Step Off-Policy Learning

For off-policy nn-step methods, importance sampling ratios correct for the distribution mismatch:

ρt:t+n1=k=tt+n1π(AkSk)b(AkSk)\rho_{t:t+n-1} = \prod_{k=t}^{t+n-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}

The update becomes:

V(St)V(St)+αρt:t+n1[Gt:t+nV(St)]V(S_t) \leftarrow V(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V(S_t) \right]

The product of nn importance sampling ratios can have high variance for large nn, which is a practical limitation of off-policy nn-step methods.

The Bias-Variance Trade-Off

The choice of nn controls a fundamental trade-off:

PropertySmall nn (TD-like)Large nn (MC-like)
BiasHigher (bootstraps from estimates)Lower (uses more real rewards)
VarianceLower (fewer random variables)Higher (product of many random rewards)
Speed of propagationSlow (one step at a time)Fast (information travels nn steps per update)
Sensitivity to initial valuesHigher (bootstraps from them)Lower (actual returns dominate)

The optimal nn is problem-dependent. The classic result from Sutton and Barto’s 19-state random walk shows that intermediate values of nn (around 4-8) consistently outperform both extremes across a range of learning rates. This demonstrates that the best trade-off is rarely at either extreme.

Truncated N-Step Returns

Near episode boundaries, when fewer than nn steps remain, the return is naturally truncated:

Gt:T=k=0Tt1γkRt+k+1G_{t:T} = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}

This graceful degradation means nn-step methods automatically become MC-like near episode termination, using whatever steps are available.

Why It Matters

N-step methods formalize the intuition that neither pure TD nor pure MC is universally optimal. They provide practitioners with a tunable knob (nn) to adapt to the specific characteristics of their problem: use smaller nn for problems with accurate value estimates and high reward variance, and larger nn for problems with poor initial estimates or strong sequential dependencies. N-step returns also serve as the conceptual stepping stone to eligibility traces, which elegantly average over all values of nn simultaneously.

Key Technical Details

  • Optimal nn: Problem-dependent, but n[4,16]n \in [4, 16] frequently outperforms TD(0) and MC in tabular benchmarks.
  • Memory requirement: Must store the last nn states, actions, and rewards, requiring O(n)O(n) additional memory per episode.
  • Update delay: Values for StS_t cannot be updated until time t+nt + n, delaying credit assignment. This means the first n1n - 1 transitions of each episode produce no updates.
  • Computational cost: Each update costs O(n)O(n) to compute the nn-step return, compared to O(1)O(1) for TD(0).
  • Off-policy instability: The product of nn importance sampling ratios ρt:t+n1\rho_{t:t+n-1} grows exponentially in variance with nn, making off-policy nn-step methods impractical for large nn.
  • Convergence: NN-step TD converges to VπV^\pi under the same conditions as TD(0), with the rate depending on nn, α\alpha, and the environment structure.

Common Misconceptions

  • “Larger nn is always better because it reduces bias.” While larger nn reduces bias from bootstrapping, it increases variance from reward stochasticity. The optimal nn balances these effects and is almost never n=n = \infty (full MC).
  • “N-step methods are just a theoretical construct.” N-step returns are directly used in practical algorithms. A3C uses nn-step returns (typically n=5n = 5 or n=20n = 20) as its primary update target. Many deep RL implementations use nn-step returns for improved sample efficiency.
  • “You must choose a single nn in advance.” Eligibility traces (TD(λ\lambda)) and compound returns allow effectively averaging over multiple values of nn, but even fixed nn can be tuned as a hyperparameter.

Connections to Other Concepts

  • Temporal Difference Learning – TD(0) is the special case n=1n = 1. N-step methods generalize TD by deferring bootstrapping.
  • Monte Carlo Methods – MC is the special case n=n = \infty (or n=Ttn = T - t). N-step methods recover MC when nn equals the remaining episode length.
  • Eligibility Traces – TD(λ\lambda) can be viewed as a geometric average over all nn-step returns, providing a more elegant way to trade off bias and variance.
  • Sarsa – N-step SARSA extends SARSA by using nn-step returns, often substantially improving performance.
  • Q Learning – N-step Q-learning (e.g., “tree backup” algorithm) extends Q-learning to multi-step off-policy updates.
  • A2c And A3c – A3C uses nn-step returns as its default update target in deep RL, typically with n=5n = 5.

Further Reading

  1. “Reinforcement Learning: An Introduction,” Chapter 7 (Sutton & Barto, 2018) – The definitive treatment of nn-step methods, including prediction, control, off-policy corrections, and the random walk experiments. [Scholar]
  2. “Multi-Step Reinforcement Learning: A Unifying Algorithm” (De Asis et al., 2018) – Proposes a unified nn-step algorithm encompassing Q-learning, SARSA, Expected SARSA, and Tree Backup as special cases. [Scholar]
  3. “Asynchronous Methods for Deep Reinforcement Learning” (Mnih et al., 2016) – The A3C paper that popularized nn-step returns in deep RL, demonstrating their effectiveness with neural network function approximation. [Scholar]
  4. “Safe and Efficient Off-Policy Reinforcement Learning” (Munos et al., 2016) – Introduces Retrace(λ\lambda), addressing the high variance of importance sampling in multi-step off-policy learning. [Scholar]