Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
[article]
2018
arXiv
pre-print
To address this problem, we propose an off-policy trust region method, Trust-PCL. ...
While current trust region strategies are effective for continuous control, they typically require a prohibitively large amount of on-policy interaction with the environment. ...
The main advantage of Trust-PCL over existing trust region methods for continuous control is its ability to learn in an off-policy manner. ...
arXiv:1707.01891v3
fatcat:juu5x7ygdfbn7mvv7xagr3lzne
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
[article]
2018
arXiv
pre-print
on-policy and off-policy methods. ...
By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior ...
Trust-PCL experiments; and George Tucker for his valuable feedback on an early version of this paper. ...
arXiv:1801.01290v2
fatcat:5737bv4lmzdzxbv6xreow6phfy
Implicitly Regularized RL with Implicit Q-Values
[article]
2022
arXiv
pre-print
We then evaluate our algorithm on classic control tasks, where its results compete with state-of-the-art methods. ...
We use the resulting parametrization to derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the Q-value. ...
Trust-PCL (Nachum et al., 2018) builds on PCL by adding a trust region constraint on the policy update, similar to our KL regularization term. ...
arXiv:2108.07041v2
fatcat:hb5ws467onbtjp5r5m7tgse3oe
On-Policy Trust Region Policy Optimisation with Replay Buffers
[article]
2019
arXiv
pre-print
In many cases, the method not only improves results compared to state-of-the-art on-policy trust region learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy ...
The method uses trust region optimisation, while avoiding some of the common problems of the algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, ...
Nachum et al. (2018) propose an off-policy trust region method, Trust-PCL, which exploits off-policy data within the trust region optimisation framework, while maintaining stability of optimisation by ...
arXiv:1901.06212v1
fatcat:6xn7a2z5h5brjoh3aipgcr4u6e
Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning
[article]
2019
arXiv
pre-print
This new technique enables efficient learning for high action-dimensional tasks and the reuse of old samples, as in off-policy learning, to increase sample efficiency. ...
large bias and adaptively controls the IS weight to bound policy update from the current policy. ...
In particular, Trust-PCL (Nachum et al., 2017) applies path consistency learning to use off-policy data while maintaining the stability of trust region policy optimization. ...
arXiv:1905.02363v2
fatcat:a3ricidqorh6pdgt6a3wbkqoue
Dealing with Non-Stationarity in MARL via Trust-Region Decomposition
[article]
2022
arXiv
pre-print
The Multi-Agent Mirror descent policy algorithm with Trust region decomposition, called MAMT, adjusts the trust regions of the local policies adaptively in an end-to-end manner. ...
A straightforward but highly non-trivial way is to control the joint policies' divergence, which is difficult to estimate accurately by imposing the trust-region constraint on the joint policy. ...
Industry Internet Software Collaborative Innovation Center, and the Fundamental Research Funds for the Central Universities. ...
arXiv:2102.10616v2
fatcat:rzla4wpuk5bpdb47f3t5gagl2u
On Principled Entropy Exploration in Policy Optimization
2019
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods in a set ...
In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective ...
For the continuous control tasks, we compare ECAC with deep deterministic policy gradient (DDPG) [Lillicrap et al., 2015] , an efficient off-policy deep RL method; twin delayed deep deterministic policy ...
doi:10.24963/ijcai.2019/434
dblp:conf/ijcai/MeiXHS019
fatcat:37xj2p5vzfg27ijo3prdyutmoq
Cautious Actor-Critic
[article]
2021
arXiv
pre-print
We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning. ...
The oscillating performance of off-policy learning and persisting errors in the actor-critic (AC) setting call for algorithms that learn conservatively to better suit stability-critical applications ...
Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1-11, 2018. ...
arXiv:2107.05217v2
fatcat:ll3keij23bgpjjapfx4vazea5u
Trusted Approximate Policy Iteration with Bisimulation Metrics
[article]
2022
arXiv
pre-print
Then we describe an approximate policy iteration (API) procedure that uses ϵ-aggregation with π-bisimulation and prove performance bounds for continuous state spaces. ...
In addition, we propose a novel trust region approach which circumvents the requirement to explicitly solve a constrained optimization problem. ...
To test this intuition in a continuous control setting, we developed a first-order trust region method for off-policy RL. ...
arXiv:2202.02881v2
fatcat:7mt66geetrbglizvjoxplken5m
Deep Reinforcement Learning
[article]
2018
arXiv
pre-print
Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation. ...
We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details. ...
The authors present an implementation with centralized training for decentralized execution, as discussed below. The authors experiment with grid world coordination, a partially observable game, ...
arXiv:1810.06339v1
fatcat:kp7atz5pdbeqta352e6b3nmuhy
Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods
[article]
2018
arXiv
pre-print
of Monte Carlo return estimation and an off-policy correction. ...
To answer this question, we propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects. ...
ACKNOWLEDGEMENTS We thank Laura Downs, Erwin Coumans, Ethan Holly, John-Michael Burke, and Peter Pastor for helping with experiments. ...
arXiv:1802.10264v2
fatcat:apk5d3vs5ne4zd7xhzcldhzd4e
Policy Optimization as Wasserstein Gradient Flows
[article]
2018
arXiv
pre-print
Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate ...
We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows. ...
Acknowledgements We acknowledge Tuomas Haarnoja et al. for making their code public and thank Ronald Parr for insightful advice. This research was supported in part by DARPA, DOE, NIH, ONR and NSF. ...
arXiv:1808.03030v1
fatcat:i3swiw5wrvdnnk7nry6ijir4rm
Continuous-action Reinforcement Learning for Playing Racing Games: Comparing SPG to PPO
[article]
2020
arXiv
pre-print
This environment operates with continuous action- and state-spaces and requires agents to learn to control the acceleration and steering of a car while navigating a randomly generated racetrack. ...
An extension of SPG is introduced that aims to improve learning performance by weighting action samples during the policy update step. The effect of using experience replay (ER) is also investigated. ...
Additionally, the performance of SPG could be investigated when compared to other off-policy methods that allow for continuous action spaces, such as NAF (Gu et al. [2016]), Trust-PCL (Nachum et al. ...
arXiv:2001.05270v1
fatcat:w2ajhjsoxvdbjjahiv54uty6a4
Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning
[article]
2020
arXiv
pre-print
Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily ...
One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem. ...
We also thank Linda Petrini and Lucas Caccia for insightful discussions. ...
arXiv:1909.07543v3
fatcat:fghcnlqdqranrddpdmgyzuv7tq
Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning
[article]
2021
arXiv
pre-print
Our experiments demonstrate that GVI can effectively exploit the trade-off between learning speed and robustness over uniform averaging of a constant KL coefficient. ...
Based on the dynamic coefficient error bound, we propose an effective scheme to tune the coefficient according to the magnitude of error in favor of more robust learning. ...
Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1-14, 2018.
Martin L Puterman and Moon Chirl Shin. ...
arXiv:2107.07659v2
fatcat:ko6utk5surho5l6khutywxgc4u
Showing results 1 — 15 out of 20 results