Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback

Hal Daumé III, John Langford, Amr Sharaf
2018 International Conference on Learning Representations  
We consider reinforcement learning and bandit structured prediction problems with very sparse loss feedback: a loss is observed only at the end of an episode. We introduce a novel algorithm, RESIDUAL LOSS PREDICTION (RESLOPE), that solves such problems by automatically learning an internal representation of a denser reward function. RESLOPE operates as a reduction to contextual bandits, using its learned loss representation to solve the credit assignment problem, and a contextual bandit oracle to trade off exploration and exploitation. RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state-of-the-art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.

* Authors are listed alphabetically.
1 This problem can be (and to a large degree has been) mitigated through the task-specific and complex process of reward engineering and reward shaping. Indeed, we were surprised to find that many classic RL algorithms fail badly when incremental rewards disappear. We aim to make such problems disappear.
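The credit-assignment idea described in the abstract (decomposing a single end-of-episode loss into per-step learned costs) can be sketched in a few lines of Python. This is a toy illustration under assumed names (`ResidualLossSketch`, a tabular per-step predictor, a fixed learning rate), not the paper's actual implementation: each step's contextual-bandit cost is the observed episodic loss minus the predicted losses of all *other* steps, and each predictor is then nudged toward its own residual.

```python
class ResidualLossSketch:
    """Toy sketch of residual credit assignment: the episodic loss is
    split into per-step costs via learned per-step loss predictors.
    All names and structure here are illustrative assumptions."""

    def __init__(self, horizon, lr=0.5):
        # Hypothetical simplification: one scalar loss predictor per
        # time step, stored as a table (the paper learns a function).
        self.pred = [0.0] * horizon
        self.lr = lr

    def residual_costs(self, episode_loss):
        # Cost assigned to step t = episodic loss minus the predicted
        # losses of every other step. These costs would be handed to a
        # contextual bandit oracle in the full algorithm.
        total = sum(self.pred)
        return [episode_loss - (total - p) for p in self.pred]

    def update(self, episode_loss):
        # Move each step's predictor toward its residual target.
        costs = self.residual_costs(episode_loss)
        for t, c in enumerate(costs):
            self.pred[t] += self.lr * (c - self.pred[t])
        return costs
```

With a constant episodic loss, the per-step predictions converge so that their sum matches the observed loss, i.e. the dense internal reward accounts for the sparse external one:

```python
sketch = ResidualLossSketch(horizon=3)
for _ in range(50):
    sketch.update(3.0)
# sum(sketch.pred) is now very close to 3.0, with ~1.0 per step
```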