SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards

Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T Pokorny, Ken Goldberg
2018, The International Journal of Robotics Research
Inverse Reinforcement Learning (IRL) allows a robot to generalize from demonstrations to previously unseen scenarios by learning the demonstrator's reward function. However, in multi-step tasks, the learned rewards might be delayed and hard to optimize directly. We present Sequential Windowed Inverse Reinforcement Learning (SWIRL), a three-phase algorithm that partitions a complex task into shorter-horizon subtasks based on Switched Linear Dynamical transitions that occur consistently across demonstrations. SWIRL then learns a sequence of local reward functions that describe the motion between transitions. Once these reward functions are learned, SWIRL applies Q-learning to compute a policy that maximizes the rewards. We compare SWIRL (demonstrations to segments to rewards) with Supervised Policy Learning (SPL: demonstrations to policies) and Maximum Entropy IRL (MaxEnt-IRL: demonstrations to rewards) on standard Reinforcement Learning benchmarks: Parallel Parking with noisy dynamics, Two-Link Acrobot, and a 2D GridWorld. We find that SWIRL converges to a policy with similar success rates (60%) in 3x fewer time-steps than MaxEnt-IRL, and requires 5x fewer demonstrations than SPL. In physical experiments using the da Vinci surgical robot, we evaluate the extent to which SWIRL generalizes from linear cutting demonstrations to cutting sequences of curved paths.

Sequential Windowed Inverse Reinforcement Learning

This section describes an algorithm to infer the parameters for the proposed model.

Algorithm Description

Let D be a set of demonstration trajectories {d_1, ..., d_N} of a task with a delayed reward. SWIRL can be described in terms of three sub-algorithms:

Inputs: Demonstrations D, Dynamics (Optional) P
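
The final phase described in the abstract (Q-learning over a sequence of local rewards) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a hypothetical 4x4 grid, hand-specified segment end points in `GOALS` standing in for the learned transitions, a scaled negative squared-distance local reward, and a completion bonus `BONUS` for each segment; in SWIRL itself the segments and rewards are learned from demonstrations. The one idea it tries to show is that the learner's state is augmented with the index of the active segment, so the reward it sees changes each time a transition is reached.

```python
import random

# Hypothetical toy setup (not from the paper): a 4x4 grid with two segments.
GRID = 4
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
GOALS = [(3, 0), (0, 3)]           # assumed segment end points, visited in order
BONUS = 10.0                       # assumed bonus for completing a segment
MAX_SQ_DIST = 2 * (GRID - 1) ** 2  # largest squared distance on the grid


def step(pos, seg, action):
    """Deterministic transition on the augmented state (position, segment)."""
    x = min(max(pos[0] + action[0], 0), GRID - 1)
    y = min(max(pos[1] + action[1], 0), GRID - 1)
    gx, gy = GOALS[seg]
    # Local reward of the active segment: scaled negative squared distance.
    reward = -((x - gx) ** 2 + (y - gy) ** 2) / MAX_SQ_DIST
    if (x, y) == (gx, gy):
        reward += BONUS
        seg += 1  # switch to the next segment's local reward
    return (x, y), seg, reward, seg == len(GOALS)


def train(episodes=3000, horizon=30, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behavior policy.

    The learning rate is fixed at 1, which is only appropriate here
    because the toy dynamics and rewards are deterministic.
    """
    rng = random.Random(seed)
    q = {}  # (pos, seg) -> list of action values, missing entries count as 0
    for _ in range(episodes):
        # Random starts so every augmented state is visited.
        pos = (rng.randrange(GRID), rng.randrange(GRID))
        seg = rng.randrange(len(GOALS))
        for _ in range(horizon):
            a = rng.randrange(len(ACTIONS))
            nxt, nseg, r, done = step(pos, seg, ACTIONS[a])
            future = 0.0 if done else max(q.get((nxt, nseg), [0.0] * 4))
            q.setdefault((pos, seg), [0.0] * 4)[a] = r + gamma * future
            if done:
                break
            pos, seg = nxt, nseg
    return q


def rollout(q, start=(0, 0), max_steps=32):
    """Follow the greedy policy; return segments completed and the path."""
    pos, seg, path = start, 0, [start]
    for _ in range(max_steps):
        values = q.get((pos, seg), [0.0] * 4)
        a = max(range(len(ACTIONS)), key=values.__getitem__)
        pos, seg, _, done = step(pos, seg, ACTIONS[a])
        path.append(pos)
        if done:
            break
    return seg, path
```

Calling `rollout(train())` returns the number of segments completed and the visited cells. The completion bonus is a design choice of this sketch: without it, a discounted learner can prefer to hover next to a segment boundary rather than cross into a segment whose local reward starts out low.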
doi:10.1177/0278364918784350 fatcat:ze2skzbkfbek5fntflror45ibq