
Trust Region Policy Optimization [article]

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel
2017 arXiv   pre-print
By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO).  ...  We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement.  ...  Discussion We proposed and analyzed trust region methods for optimizing stochastic control policies.  ... 
arXiv:1502.05477v5 fatcat:joxhue64ejedjmsxuobgmxx4xi

Hindsight Trust Region Policy Optimization [article]

Hanbo Zhang, Site Bai, Xuguang Lan, David Hsu, Nanning Zheng
2021 arXiv   pre-print
We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards.  ...  It introduces QKL, a quadratic approximation to the KL divergence constraint on the trust region, leading to reduced variance in KL divergence estimation and improved stability in policy update.  ...  To address these challenges, we propose Hindsight Trust Region Policy Optimization (HTRPO), a hindsight form of policy optimization problem based on Trust Region Policy Optimization (TRPO) (Schulman  ... 
arXiv:1907.12439v5 fatcat:px2ffrbvjrafhe6obr2mkudgym

Boosting Trust Region Policy Optimization by Normalizing Flows Policy [article]

Yunhao Tang, Shipra Agrawal
2019 arXiv   pre-print
We propose to improve trust region policy search with normalizing flows policy.  ...  We illustrate that when the trust region is constructed by KL divergence constraints, normalizing flows policy generates samples far from the 'center' of the previous policy iterate, which potentially  ...  Trust Region Policy Optimization Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) applies information theoretic constraints instead of Euclidean constraints (as in (1)) between θ_new  ... 
arXiv:1809.10326v3 fatcat:ggr6jvew75e4bcmmfxbmfushhm

Multi-Agent Trust Region Policy Optimization [article]

Hepeng Li, Haibo He
2020 arXiv   pre-print
We extend trust region policy optimization (TRPO) to multi-agent reinforcement learning (MARL) problems.  ...  This algorithm can optimize distributed policies based on local observations and private rewards.  ...  MULTI-AGENT TRUST REGION POLICY OPTIMIZATION A. Trust Region Optimization for Multiple Agents Consider the policy optimization problem (1) for the multiagent case.  ... 
arXiv:2010.07916v2 fatcat:o376netptfdbzbv5bckqwwjo3q

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy [article]

Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang
2019 arXiv   pre-print
Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning  ...  In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate.  ...  Coupled with neural networks, proximal policy optimization (PPO) (Schulman et al., 2017) and trust region policy optimization (TRPO) (Schulman  ... 
arXiv:1906.10306v2 fatcat:sisvdcrugzbttnsdbfhousisla

Model-Ensemble Trust-Region Policy Optimization [article]

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel
2018 arXiv   pre-print
Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark  ...  In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy  ...  Second, we use Trust Region Policy Optimization (TRPO) to optimize the policy over the model ensemble.  ... 
arXiv:1802.10592v2 fatcat:2p2vevibdraf3ehcpkqjgnunni

Quasi-Newton Trust Region Policy Optimization [article]

Devesh Jha, Arvind Raghunathan, Diego Romeres
2019 arXiv   pre-print
We propose a trust region method for policy optimization that employs Quasi-Newton approximation for the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO).  ...  We investigate the use of a trust region method using dogleg step and a Quasi-Newton approximation for the Hessian for policy optimization.  ...  Quasi-Newton Trust Region Policy Optimization (QNTRPO) QNTRPO is the trust region algorithm that we propose in this paper for policy optimization. The algorithm differs from TRPO in the step that is computed  ... 
arXiv:1912.11912v1 fatcat:6b4oqzrqwzc23gmbzvp67s5jqe

Trust Region-Guided Proximal Policy Optimization [article]

Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan
2019 arXiv   pre-print
To address these issues, we proposed a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region.  ...  Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks.  ...  Trust region policy optimization (TRPO) [16] and proximal policy optimization (PPO) [18] are two representative methods to address this issue.  ... 
arXiv:1901.10314v2 fatcat:6wwgspeawjautar7mkgycwwhce

A Stochastic Trust-Region Framework for Policy Optimization [article]

Mingming Zhao, Yongfeng Li, Zaiwen Wen
2019 arXiv   pre-print
In this paper, we study a few challenging theoretical and numerical issues on the well known trust region policy optimization for deep reinforcement learning.  ...  The trust region subproblem is constructed with a surrogate function coherent to the total expected reward and a general distance constraint around the latest policy.  ...  A Trust Region Method for Policy Optimization.  ... 
arXiv:1911.11640v1 fatcat:esrh5rskdfai7llvauk2aumomm

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs [article]

Lior Shani, Yonathan Efroni, Shie Mannor
2019 arXiv   pre-print
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies  ...  We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis.  ...  Acknowledgments We would like to thank Amir Beck for illuminating discussions regarding Convex Optimization and Nadav Merlis for helpful comments.  ... 
arXiv:1909.02769v2 fatcat:r2wzzvszevcsffpbcuxggtujgi

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach [article]

James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras
2020 arXiv   pre-print
The resulting algorithm, Uncertainty-Aware Trust Region Policy Optimization, generates robust policy updates that adapt to the level of uncertainty present throughout the learning process.  ...  We leverage these techniques to propose a deep policy optimization approach designed to produce stable performance even when data is scarce.  ...  Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) is one of the most popular methods that has been developed to address these issues, utilizing a trust region in policy space to generate  ... 
arXiv:2012.10791v1 fatcat:euqqdcoi4rea3i52ww7p3poyte

Policy Gradient in Partially Observable Environments: Approximation and Convergence [article]

Kamyar Azizzadenesheli, Yisong Yue, Animashree Anandkumar
2020 arXiv   pre-print
Policy gradient is a generic and flexible reinforcement learning approach that generally enjoys simplicity in analysis, implementation, and deployment.  ...  This study also sheds light on the understanding of policy gradient approaches in real-world applications which tend to be partially observable.  ...  Generalized Trust Region Policy Optimization We propose Generalized Trust Region Policy Optimization (GTRPO), a generalization of MDP-based trust region methods in Kakade and Langford [2002], Schulman  ... 
arXiv:1810.07900v3 fatcat:6dstjbjajnf2tozszot33twzqy

Faded-Experience Trust Region Policy Optimization for Model-Free Power Allocation in Interference Channel [article]

Mohammad G. Khoshkholgh, Halim Yanikomeroglu
2020 arXiv   pre-print
We apply our method to the trust-region policy optimization (TRPO), primarily developed for locomotion tasks, and propose faded-experience (FE) TRPO.  ...  Policy gradient reinforcement learning techniques enable an agent to directly learn an optimal action policy through the interactions with the environment.  ...  Trust Region Policy Optimization (TRPO) To improve the stability, besides learning the policy, the value function needs to be learned [19].  ... 
arXiv:2008.01705v1 fatcat:rimi5rekfffypjomdufo5hmfxy

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Lior Shani, Yonathan Efroni, Shie Mannor
2020 PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE  
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies  ...  We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis.  ...  Acknowledgments We would like to thank Amir Beck for illuminating discussions regarding Convex Optimization and Nadav Merlis for helpful comments.  ... 
doi:10.1609/aaai.v34i04.6021 fatcat:4b6bmrimubejze4a6abanoz2f4

Hindsight Trust Region Policy Optimization

Hanbo Zhang, Site Bai, Xuguang Lan, David Hsu, Nanning Zheng
2021 Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence   unpublished
We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards.  ...  We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region.  ...  The optimization problem proposed in TRPO can be formalized as follows: Trust Region Policy Optimization $\max_{\theta} \; \mathbb{E}_{s,a \sim \rho_{\tilde{\theta}}(s,a)} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\tilde{\theta}}(a|s)} A_{\tilde{\theta}}(s,a) \right]$ (1) s.t.  ... 
doi:10.24963/ijcai.2021/459 fatcat:vdp5jtfod5c3fjxehwx5ss4bbi