Investigating Two Policy Gradient Methods Under Different Time Discretizations

Homayoon Farrahi
2021
Continuous-time reinforcement learning tasks commonly use discrete time steps of a fixed cycle time for actions. Choosing a small action-cycle time in such tasks allows reinforcement learning agents to react quickly and to perceive the environment in finer temporal detail. The learning performance of both policy gradient and action-value methods, however, may deteriorate as the cycle time is reduced, which necessitates tuning the cycle time as a hyper-parameter.
Since tuning an additional hyper-parameter is time-consuming, especially for real-world robots, existing algorithms can benefit from having hyper-parameters that are robust to the choice of cycle time. In this thesis, we aim to study how changing the action-cycle time affects the performance of two prominent policy gradient algorithms, PPO and SAC, and investigate the efficacy of their widely used hyper-parameter values across different cycle times. We explore how changing some of these hyper-parameters based on the cycle time can help or hinder the performance of these algorithms, and we investigate the relationship between these hyper-parameters and the cycle time. These relationships are put forward as new hyper-parameters that can be adjusted based on the cycle time, and their effectiveness is examined and validated on simulated and real-world robotic tasks. We show that the new hyper-parameters, unlike the existing ones, can be more robust to different environments and cycle times and can enable hyper-parameter values tuned at one cycle time on a specific problem to be transferred to a different cycle time.

I am eternally grateful to my supervisor Prof. Rupam Mahmood for his invaluable guidance and his commitment to training rigorous scientists. He patiently explains the underlying reason for everything and encourages perseverance and long-term thinking, all of which I greatly treasure. I am grateful to Prof. Richard Sutton and Prof. Michael Bowling for their thorough examination of this thesis. I thank the Reinforcement Learning and Artificial Intelligence (RLAI) Lab, the Alberta Machine Intelligence Institute (Amii), and the Canada CIFAR AI Chairs Program for funding this research. I am thankful to Kindred Inc. for their generous donation of the UR5 robotic arm and to all of the amazing people in our Robot Lair group for the discussions. Last but not least, I extend my gratitude to my mother Sara Vaziri, my father Farrokh Farrahi, and my sister Shiva Farrahi, whose unwavering love and support made my journey less demanding.
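To give a concrete sense of the kind of cycle-time-dependent adjustment the abstract refers to, the sketch below rescales the discount factor with the action-cycle time so that the effective discounting horizon in seconds stays fixed. This is a minimal sketch based on the standard exponential-discounting relationship; the abstract does not state which hyper-parameters the thesis actually rescales, so the discount factor and the helper rescaled_discount shown here are illustrative assumptions rather than the thesis's proposal.

    # Illustrative sketch (assumption, not the thesis's stated method): keep the
    # effective discounting horizon in seconds fixed by rescaling the discount
    # factor with the action-cycle time dt.
    def rescaled_discount(gamma_ref, dt_ref, dt):
        # gamma_ref is the discount tuned at the reference cycle time dt_ref;
        # the returned value applies the same per-second discounting at cycle time dt.
        return gamma_ref ** (dt / dt_ref)

    # Example: a discount of 0.99 tuned at a 40 ms cycle time becomes about 0.995
    # when the agent instead acts every 20 ms.
    print(rescaled_discount(0.99, 0.040, 0.020))  # ~0.99499

Under this rescaling, a hyper-parameter value tuned at one cycle time implies a definite value at any other cycle time, which is the sense in which such a reparameterization can transfer across cycle times.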
doi:10.7939/r3-sttb-hb65 fatcat:3v3pmbwqyjhvjnvmgn725hdnrm