TD(λ) learning without eligibility traces: a theoretical analysis

A common approach to learning from delayed rewards is to use temporal difference (TD) methods for predicting future reinforcement values. These methods are parameterized by a recency factor λ, which determines whether and how the outcomes of several consecutive time steps contribute to a single prediction update. TD(λ > 0) has been found to usually yield noticeably faster learning than TD(0), but its standard eligibility traces implementation suffers from some well-known deficiencies, in particular significantly increased computational expense. This article theoretically investigates two ways of implementing TD(λ) without eligibility traces, both proposed by prior work. One is the TTD procedure, which efficiently approximates the effects of eligibility traces by using truncated TD(λ) returns. The other is experience replay, which relies on replaying TD prediction updates backwards in time. We provide novel theoretical results for the former and present an original analysis of the effects of two variations of the latter. The ultimate outcome of these investigations is a unified view of the apparently different computational techniques. This contributes to TD(λ) research in general by highlighting interesting relationships between several TD-based algorithms and facilitating their further analysis.
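
For context, the following are standard definitions from the TD(λ) literature (not taken from this article's full text, which is not reproduced here): the λ-return blends n-step returns, and a TTD-style implementation replaces the infinite mixture with an m-step truncation.

\[
  R_t^{(n)} = \sum_{i=1}^{n} \gamma^{\,i-1} r_{t+i} + \gamma^{\,n} V(s_{t+n}), \qquad
  R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}, \qquad
  R_t^{\lambda,m} = (1-\lambda) \sum_{n=1}^{m-1} \lambda^{\,n-1} R_t^{(n)} + \lambda^{\,m-1} R_t^{(m)}.
\]

Because the truncated return R_t^{\lambda,m} depends only on the next m rewards and the value estimate m steps ahead, it can be computed from a short sliding window of experience rather than per-state eligibility traces, which is the intuition behind trace-free implementations of the kind analyzed in the article.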

Keywords: ELIGIBILITY TRACES; REINFORCEMENT LEARNING; TEMPORAL DIFFERENCES

Document Type: Research Article

Publication date: 01 April 1999
