Trust-PCL: An Off-Policy Trust Region Method for Continuous Control [article]

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
2018 arXiv   pre-print
To address this problem, we propose an off-policy trust region method, Trust-PCL.  ...  While current trust region strategies are effective for continuous control, they typically require a prohibitively large amount of on-policy interaction with the environment.  ...  The main advantage of Trust-PCL over existing trust region methods for continuous control is its ability to learn in an off-policy manner.  ... 
arXiv:1707.01891v3 fatcat:juu5x7ygdfbn7mvv7xagr3lzne

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor [article]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
2018 arXiv   pre-print
on-policy and off-policy methods.  ...  By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior  ...  Trust-PCL experiments; and George Tucker for his valuable feedback on an early version of this paper.  ... 
arXiv:1801.01290v2 fatcat:5737bv4lmzdzxbv6xreow6phfy

Implicitly Regularized RL with Implicit Q-Values [article]

Nino Vieillard, Marcin Andrychowicz, Anton Raichuk, Olivier Pietquin, Matthieu Geist
2022 arXiv   pre-print
We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods.  ...  We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the Q-value.  ...  Trust-PCL (Nachum et al., 2018), builds on PCL by adding a trust region constraint on the policy update, similar to our KL regularization term.  ... 
arXiv:2108.07041v2 fatcat:hb5ws467onbtjp5r5m7tgse3oe

On-Policy Trust Region Policy Optimisation with Replay Buffers [article]

Dmitry Kangin, Nicolas Pugeault
2019 arXiv   pre-print
In many cases, the method not only improves the results comparing to the state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy  ...  The method uses trust region optimisation, while avoiding some of the common problems of the algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics,  ...  Nachum et al. (2018) propose an off-policy trust region method, Trust-PCL, which exploits off-policy data within the trust regions optimisation framework, while maintaining stability of optimisation by  ... 
arXiv:1901.06212v1 fatcat:6xn7a2z5h5brjoh3aipgcr4u6e

Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning [article]

Seungyul Han, Youngchul Sung
2019 arXiv   pre-print
This new technique enables efficient learning for high action-dimensional tasks and reusing of old samples like in off-policy learning to increase the sample efficiency.  ...  large bias and adaptively controls the IS weight to bound policy update from the current policy.  ...  In particular, Trust-PCL (Nachum et al., 2017) applies path consistency learning to use off-policy data while maintaining the stability of trust region policy optimization.  ... 
arXiv:1905.02363v2 fatcat:a3ricidqorh6pdgt6a3wbkqoue

Dealing with Non-Stationarity in MARL via Trust-Region Decomposition [article]

Wenhao Li, Xiangfeng Wang, Bo Jin, Junjie Sheng, Hongyuan Zha
2022 arXiv   pre-print
The Multi-Agent Mirror descent policy algorithm with Trust region decomposition, called MAMT, is established by adjusting the trust-region of the local policies adaptively in an end-to-end manner.  ...  A straightforward but highly non-trivial way is to control the joint policies' divergence, which is difficult to estimate accurately by imposing the trust-region constraint on the joint policy.  ...  Industry Internet Software Collaborative Innovation Center, and the Fundamental Research Funds for the Central Universities.  ... 
arXiv:2102.10616v2 fatcat:rzla4wpuk5bpdb47f3t5gagl2u

On Principled Entropy Exploration in Policy Optimization

Jincheng Mei, Chenjun Xiao, Ruitong Huang, Dale Schuurmans, Martin Müller
2019 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence  
Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods in a set  ...  In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective  ...  For the continuous control tasks, we compare ECAC with deep deterministic policy gradient (DDPG) [Lillicrap et al., 2015], an efficient off-policy deep RL method; twin delayed deep deterministic policy  ... 
doi:10.24963/ijcai.2019/434 dblp:conf/ijcai/MeiXHS019 fatcat:37xj2p5vzfg27ijo3prdyutmoq

Cautious Actor-Critic [article]

Lingwei Zhu, Toshinori Kitamura, Takamitsu Matsubara
2021 arXiv   pre-print
We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning.  ...  The oscillating performance of off-policy learning and persisting errors in the actor-critic (AC) setting call for algorithms that can conservatively learn to suit the stability-critical applications better  ...  Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1-11, 2018.  ... 
arXiv:2107.05217v2 fatcat:ll3keij23bgpjjapfx4vazea5u

Trusted Approximate Policy Iteration with Bisimulation Metrics [article]

Mete Kemertas, Allan Jepson
2022 arXiv   pre-print
Then we describe an approximate policy iteration (API) procedure that uses ϵ-aggregation with π-bisimulation and prove performance bounds for continuous state spaces.  ...  In addition, we propose a novel trust region approach which circumvents the requirement to explicitly solve a constrained optimization problem.  ...  To test this intuition in a continuous control setting, we developed a first-order trust region method for off-policy RL.  ... 
arXiv:2202.02881v2 fatcat:7mt66geetrbglizvjoxplken5m

Deep Reinforcement Learning [article]

Yuxi Li
2018 arXiv   pre-print
Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation.  ...  We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details.  ...  The authors present an implementation with centralized training for decentralized execution, as discussed below. The authors experiment with grid world coordination, a partially observable game,  ... 
arXiv:1810.06339v1 fatcat:kp7atz5pdbeqta352e6b3nmuhy

Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods [article]

Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz, Sergey Levine
2018 arXiv   pre-print
of Monte Carlo return estimation and an off-policy correction.  ...  To answer this question, we propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects.  ...  ACKNOWLEDGEMENTS We thank Laura Downs, Erwin Coumans, Ethan Holly, John-Michael Burke, and Peter Pastor for helping with experiments.  ... 
arXiv:1802.10264v2 fatcat:apk5d3vs5ne4zd7xhzcldhzd4e

Policy Optimization as Wasserstein Gradient Flows [article]

Ruiyi Zhang, Changyou Chen, Chunyuan Li, Lawrence Carin
2018 arXiv   pre-print
Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate  ...  We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows.  ...  Acknowledgements We acknowledge Tuomas Haarnoja et al. for making their code public and thank Ronald Parr for insightful advice. This research was supported in part by DARPA, DOE, NIH, ONR and NSF.  ... 
arXiv:1808.03030v1 fatcat:i3swiw5wrvdnnk7nry6ijir4rm

Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO [article]

Mario S. Holubar, Marco A. Wiering
2020 arXiv   pre-print
This environment operates with continuous action- and state-spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack.  ...  An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated.  ...  Additionally, the performance of SPG could be investigated when compared to other off-policy methods that allow for continuous action spaces, such as NAF (Gu et al. [2016]), Trust-PCL (Nachum et al.  ... 
arXiv:2001.05270v1 fatcat:w2ajhjsoxvdbjjahiv54uty6a4

Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning [article]

Thang Doan, Bogdan Mazoure, Moloud Abdar, Audrey Durand, Joelle Pineau, R Devon Hjelm
2020 arXiv   pre-print
Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily  ...  One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem.  ...  We also thank Linda Petrini and Lucas Caccia for insightful discussions.  ... 
arXiv:1909.07543v3 fatcat:fghcnlqdqranrddpdmgyzuv7tq

Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning [article]

Toshinori Kitamura, Lingwei Zhu, Takamitsu Matsubara
2021 arXiv   pre-print
Our experiments demonstrate that GVI can effectively exploit the trade-off between learning speed and robustness over uniform averaging of a constant KL coefficient.  ...  Based on the dynamic coefficient error bound, we propose an effective scheme to tune the coefficient according to the magnitude of error in favor of more robust learning.  ...  Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1-14, 2018. Martin L Puterman and Moon Chirl Shin.  ... 
arXiv:2107.07659v2 fatcat:ko6utk5surho5l6khutywxgc4u