The file type is application/pdf.
Reducing sampling error in batch temporal difference learning
[article]
2021
Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This thesis studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given evaluation policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch -- not the true probability of the action under
doi:10.26153/tsw/14853
fatcat:r37puwzrvfc7pexbddotlzhh4q
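
The sampling-error effect described in the abstract can be illustrated with a small sketch. It assumes a toy problem that is not from the source: one non-terminal state, two actions that each lead to a terminal state, and an evaluation policy that picks each action with probability 0.5. The function name `batch_td0`, the batch contents, and the reweighting by the ratio of policy probability to empirical action frequency are all illustrative assumptions, not necessarily the correction proposed in the thesis.

```python
from collections import Counter


def batch_td0(batch, pi=None, alpha=0.05, sweeps=2000):
    """Batch TD(0) value estimate for a single non-terminal state.

    batch: list of (action, reward) transitions, all ending in a terminal state.
    pi:    optional dict of evaluation-policy probabilities per action.
           If given, each update is reweighted by pi[a] / empirical frequency
           of a in the batch (an assumption of this sketch).
    """
    counts = Counter(a for a, _ in batch)
    n = len(batch)
    v = 0.0
    for _ in range(sweeps):
        total = 0.0
        for a, r in batch:
            # Without reweighting, an action seen k times contributes k updates.
            w = 1.0 if pi is None else pi[a] / (counts[a] / n)
            # The successor state is terminal, so its value is 0.
            total += w * (r + 0.0 - v)
        v += alpha * total / n
    return v


# Hypothetical batch: action "a0" (reward 1) appears 3 times, "a1" (reward 0) once.
batch = [("a0", 1.0)] * 3 + [("a1", 0.0)]
pi = {"a0": 0.5, "a1": 0.5}

v_plain = batch_td0(batch)
v_reweighted = batch_td0(batch, pi=pi)
```

Under the evaluation policy the true value of the state is 0.5. In this toy batch the unweighted estimate settles near 0.75, because "a0" is over-represented relative to its policy probability, while the reweighted estimate settles near 0.5.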