A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Reinforcement Learning (RL) with sparse rewards is a major challenge. We pro- pose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KLdoi:10.24963/ijcai.2021/459 fatcat:vdp5jtfod5c3fjxehwx5ss4bbi