Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [article]

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry
2020 arXiv   pre-print
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO  ...  Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.  ...  ACKNOWLEDGEMENTS We would like to thank Chloe Hsu for identifying a bug in our initial implementation of PPO and TRPO.  ... 
arXiv:2005.12729v1 fatcat:4mnogm5b7zfstexlh4s5e5rgde

On Proximal Policy Optimization's Heavy-tailed Gradients [article]

Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Zachary C. Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar
2021 arXiv   pre-print
Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning.  ...  In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function.  ...  Acknowledgements We acknowledge the support of Lockheed Martin, DARPA via HR00112020006, and NSF via IIS-1909816, OAC-1934584.  ... 
arXiv:2102.10264v2 fatcat:cfqvu3kcf5dmrfqytbe35dvi24

Evolved Policy Gradients [article]

Rein Houthooft, Richard Y. Chen, Phillip Isola, Bradly C. Stadie, Filip Wolski, Jonathan Ho, Pieter Abbeel
2018 arXiv   pre-print
Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method.  ...  We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms.  ...  Acknowledgments We thank Igor Mordatch, Ilya Sutskever, John Schulman, and Karthik Narasimhan for helpful comments and conversations.  ... 
arXiv:1802.04821v2 fatcat:cxu3brnxzjbetkudpjkwnnneey

Policy Gradient in Partially Observable Environments: Approximation and Convergence [article]

Kamyar Azizzadenesheli, Yisong Yue, Animashree Anandkumar
2020 arXiv   pre-print
Policy gradient is a generic and flexible reinforcement learning approach that generally enjoys simplicity in analysis, implementation, and deployment.  ...  In this paper, we generalize a variety of these advances to partially observable settings, and similar to the fully observable case, we keep our focus on the class of Markovian policies.  ...  Anandkumar is supported in part by Bren endowed chair, DARPA PAI HR00111890035 and LwLL grants, Raytheon, Microsoft, Google, and Adobe faculty fellowships.  ... 
arXiv:1810.07900v3 fatcat:6dstjbjajnf2tozszot33twzqy

Deep Reinforcement Learning that Matters [article]

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
2019 arXiv   pre-print
We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible.  ...  In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL).  ...  Acknowledgements We thank NSERC, CIFAR, the Open Philanthropy Project, and the AWS Cloud Credits for Research Program.  ... 
arXiv:1709.06560v3 fatcat:4x7p4hrdvjbgxlftameeyz6od4

A general class of surrogate functions for stable and efficient reinforcement learning [article]

Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Mueller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux
2022 arXiv   pre-print
Common policy gradient methods rely on the maximization of a sequence of surrogate functions.  ...  Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties  ...  Acknowledgements We would like to thank Veronica Chelu for suggesting the use of the log-sum-exp mirror map in Section 5. Nicolas Le Roux and Marlos C. Machado are funded by a CIFAR chair.  ... 
arXiv:2108.05828v4 fatcat:gw6ttqp6k5dkrimmiv6nlpimay

Deep Reinforcement Learning That Matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible.  ...  In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL).  ...  Acknowledgements We thank NSERC, CIFAR, the Open Philanthropy Project, and the AWS Cloud Credits for Research Program for their generous contributions.  ... 
doi:10.1609/aaai.v32i1.11694 fatcat:2smrisva5jb3hg5tb6rs73hazm

Regularization Matters in Policy Optimization [article]

Zhuang Liu, Xuanlin Li, Bingyi Kang, Trevor Darrell
2021 arXiv   pre-print
in the same environment, and because the deep RL community focuses more on high-level algorithm designs.  ...  Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks.  ...  For BN and dropout, we also note that almost all hurting cases are in on-policy algorithms, except one case for BN in SAC.  ... 
arXiv:1910.09191v5 fatcat:ifdtx2wqbffubkxg3n352vnkjy

Marginal Policy Gradients: A Unified Family of Estimators for Bounded Action Spaces with Applications [article]

Carson Eisenach, Haichuan Yang, Ji Liu, Han Liu
2019 arXiv   pre-print
Experimental results on a popular RTS game and a navigation task show that the APG estimator offers a substantial improvement over the standard policy gradient.  ...  In the former, an agent learns a policy over R^d and in the latter, over a discrete set of actions each of which is parametrized by a continuous parameter.  ...  In addition, stochastic gradients of policy loss functions for TRPO or PPO Schulman et al. (2015; can be computed in a similar way since we can easily get the derivative of f π (θ) when M d−1 (α) and  ... 
arXiv:1806.05134v3 fatcat:luvdhxyp3vaxzmm6w57bptuiv4

Expected Policy Gradients for Reinforcement Learning [article]

Kamil Ciosek, Shimon Whiteson
2020 arXiv   pre-print
We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases.  ...  For discrete action spaces, we derive a variant of EPG based on softmax policies.  ...  We also thank Paavo Parmas and Rika Antonova for helpful feedback as well as Kaiqing Zhang and Zac Chen for pointing out typos in the original draft.  ... 
arXiv:1801.03326v2 fatcat:667mciiqpzgolbscpnflupy5wm

A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning [article]

Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, Phuong Ha Nguyen, Marten van Dijk, Quoc Tran-Dinh
2020 arXiv   pre-print
𝒪(ε^-4) and SVRPG 𝒪(ε^-10/3) in the non-composite setting.  ...  We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with another biased one, an adapted SARAH estimator for policy  ...  PPO (Schulman et al., 2017) is an extension of TRPO which uses a clipped surrogate objective, resulting in a simpler implementation.  ... 
arXiv:2003.00430v2 fatcat:u4xuiamjgrgf5ahxxdl7qzxqwm

Policy Gradient based Quantum Approximate Optimization Algorithm [article]

Jiahao Yao, Marin Bukov, Lin Lin
2020 arXiv   pre-print
Taking such constraints into account, we show that policy-gradient-based reinforcement learning (RL) algorithms are well suited for optimizing the variational parameters of QAOA in a noise-robust fashion  ...  experiments, and (iii) the values of the objective function may be sensitive to various sources of uncertainty, as is the case for noisy intermediate-scale quantum (NISQ) devices.  ...  Finally, this work only considers implementations on a classical computer.  ... 
arXiv:2002.01068v2 fatcat:uvkwnhqafbelfbbbkrofw7fgru

Deep Reinforcement Learning: A State-of-the-Art Walkthrough

Aristotelis Lazaridis, Anestis Fachantidis, Ioannis Vlahavas
2020 The Journal of Artificial Intelligence Research  
Deep Reinforcement Learning is a topic that has gained a lot of attention recently, due to the unprecedented achievements and remarkable performance of such algorithms in various benchmark tests and environmental  ...  In this work we gather the essential methods related to Deep Reinforcement Learning, extracting common property structures for three complementary core categories: a) Model-Free, b) Model-Based and c)  ...  Proximal Policy Optimization (PPO) is a subcategory of Policy Gradient methods that maximizes gradient step size, without letting it become regrettably big, similarly to TRPO.  ... 
doi:10.1613/jair.1.12412 fatcat:mlytimfmuffz7he4uwjwsce2dy

TD-regularized actor-critic methods

Simone Parisi, Voot Tangkaratt, Jan Peters, Mohammad Emtiyaz Khan
2019 Machine Learning  
The resulting method, which we call the TD-regularized actor-critic method, is a simple plug-and-play approach to improve stability and overall performance of the actor-critic methods.  ...  This is partly due to the interaction between the actor and critic during learning, e.g., an inaccurate step taken by one of them might adversely affect the other and destabilize the learning.  ...  For example, deep deterministic policy gradient (DDPG) requires implementation tricks such as target networks, and it is known to be highly sensitive to its hyperparameters (Henderson et al. 2017).  ... 
doi:10.1007/s10994-019-05788-0 fatcat:osifv5utpnft5kjlmh2xfnxktu

Investigating Generalisation in Continuous Deep Reinforcement Learning [article]

Chenyang Zhao, Olivier Sigaud, Freek Stulp, Timothy M. Hospedales
2019 arXiv   pre-print
In particular, common practice in the field is to train policies on largely deterministic simulators and to evaluate algorithms through training performance alone, without a train/test distinction to ensure  ...  In this paper we study these issues by first characterising the sources of uncertainty that provide generalisation challenges in Deep RL.  ...  Training Algorithms and Architectures We study several model-free policy gradient based Deep RL algorithms with OpenAI baseline implementations including Trust Region Policy Optimisation (TRPO) (Schulman  ... 
arXiv:1902.07015v2 fatcat:yev57e2ji5dprcutzju7nkxqfm