Variance Penalized On-Policy and Off-Policy Actor-Critic [article]

Arushi Jain, Gandharv Patil, Ayush Jain, Khimya Khetarpal, Doina Precup
2021 arXiv   pre-print
In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.  ...  Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.  ...  Acknowledgments The authors would like to thank Pierre-Luc Bacon, Emmanuel Bengio, Romain Laroche and anonymous AAAI reviewers for the valuable feedback on this paper draft.  ... 
arXiv:2102.01985v1 fatcat:vhhxpyhsmrfzxj3qez4ar3qicm

BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [article]

Chi Zhang, Sanmukh Rao Kuppannagari, Viktor K Prasanna
2021 arXiv   pre-print
We propose an analytical upper bound on the KL divergence as the behavior regularizer to reduce variance associated with sample based estimations.  ...  Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions and are hence unsuitable for Offline RL.  ...  National Science Foundation (NSF) under award number 2009057 and U.S. Army Research Office (ARO) under award number W911NF1910362.  ... 
arXiv:2110.00894v1 fatcat:v7lztgrn4vhe3fnb45cngyh3fm

Deep Reinforcement Learning with Robust and Smooth Policy [article]

Qianli Shen, Yan Li, Haoming Jiang, Zhaoran Wang, Tuo Zhao
2020 arXiv   pre-print
We apply the proposed framework to both on-policy (TRPO) and off-policy algorithm (DDPG).  ...  Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.  ...  ., 2015), which is an on-policy method, and the regularizer directly penalizes non-smoothness of the policy.  ... 
arXiv:2003.09534v4 fatcat:b5qinalozjhdpg6i6skou57rdq

TD-regularized actor-critic methods

Simone Parisi, Voot Tangkaratt, Jan Peters, Mohammad Emtiyaz Khan
2019 Machine Learning  
This is partly due to the interaction between the actor and critic during learning, e.g., an inaccurate step taken by one of them might adversely affect the other and destabilize the learning.  ...  The resulting method, which we call the TD-regularized actor-critic method, is a simple plug-and-play approach to improve stability and overall performance of the actor-critic methods.  ...  Only one episode is collected to update the critic and the policy.  ... 
doi:10.1007/s10994-019-05788-0 fatcat:osifv5utpnft5kjlmh2xfnxktu

Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning [article]

Xueguang Lyu, Yuchen Xiao, Brett Daley, Christopher Amato
2021 arXiv   pre-print
In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea.  ...  We show that there exist misconceptions regarding centralized critics in the current literature and show that the centralized critic design is not strictly beneficial, but rather both centralized and decentralized  ...  We also thank Andrea Baisero and Linfeng Zhao for helpful comments and discussions. This research is supported in part by the U. S.  ... 
arXiv:2102.04402v2 fatcat:3zil665fizeodfhemmfqq3sbni

Zeroth-Order Actor-Critic [article]

Yuheng Lei, Jianyu Chen, Shengbo Eben Li, Sifa Zheng
2022 arXiv   pre-print
We propose Zeroth-Order Actor-Critic algorithm (ZOAC) that unifies these two methods into an on-policy actor-critic architecture to preserve the advantages from both.  ...  We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.  ...  Conclusion In this paper, we propose Zeroth-Order Actor-Critic algorithm (ZOAC) that unifies evolution based zeroth-order and policy gradient based first-order methods into an on-policy actor-critic architecture  ... 
arXiv:2201.12518v2 fatcat:spzqzjak7fhrlfbujjzxkh6as4

Risk-Averse Offline Reinforcement Learning [article]

Núria Armengol Urpí, Sebastian Curi, Andreas Krause
2021 arXiv   pre-print
In particular, we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting.  ...  While previous work considers optimizing the average performance using offline data, we focus on optimizing a risk-averse criteria, namely the CVaR.  ...  ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program grant agreement  ... 
arXiv:2102.05371v1 fatcat:noafjzmpjbcvplrs32njvp7upm

Exploiting the Sign of the Advantage Function to Learn Deterministic Policies in Continuous Domains

Matthieu Zimmer, Paul Weng
2019 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence  
Fitted Actor Critic (NFAC).  ...  In the context of learning deterministic policies in continuous domains, we revisit an approach, which was first proposed in Continuous Actor Critic Learning Automaton (CACLA) and later extended in Neural  ...  Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence  ... 
doi:10.24963/ijcai.2019/625 dblp:conf/ijcai/ZimmerW19 fatcat:ifbl7s4775erfolxcoxbbjfcbi

Learning Value Functions in Deep Policy Gradients using Residual Variance [article]

Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux
2021 arXiv   pre-print
Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic  ...  the absolute value as in conventional actor-critic.  ...  In summary, this paper: (a) introduces Actor with Variance Estimated Critic (AVEC), an actor-critic method providing a new training objective for the critic based on the residual variance, (b) provides  ... 
arXiv:2010.04440v3 fatcat:z2d6bty4rvahvmuzvhjh2quo6y

Learning Cooperative Multi-Agent Policies with Partial Reward Decoupling

Benjamin Freed, Aditya Kapoor, Ian Abraham, Jeff Schneider, Howie Choset
2021 IEEE Robotics and Automation Letters  
We empirically demonstrate that decomposing the RL problem using PRD in an actor-critic algorithm results in lower variance policy gradient estimates, which improves data efficiency, learning stability  ...  , and asymptotic performance across a wide array of multi-agent RL tasks, compared to various other actor-critic approaches.  ...  In model-free policy gradient-style algorithms (such as Actor Critic [6] , Proximal Policy Optimization [7] , Trust Region Policy Optimization [8] , and Soft Actor-Critic [9] ), we argue that the credit  ... 
doi:10.1109/lra.2021.3135930 fatcat:jcvm5imgifcjphqikov74t7xu4

OffCon^3: What is state of the art anyway? [article]

Philip J. Ball, Stephen J. Roberts
2021 arXiv   pre-print
In reality, both approaches are remarkably similar, and belong to a family of approaches we call 'Off-Policy Continuous Generalized Policy Iteration'.  ...  To further remove any difference due to implementation, we provide OffCon^3 (Off-Policy Continuous Control: Consolidated), a code base featuring state-of-the-art versions of both algorithms.  ...  To explore this question, we split our analysis into two sections: the effect on the Critic, and the effect on the Actor.  ... 
arXiv:2101.11331v2 fatcat:swv6qbihkbguxdakt5snz5xrnm

Exploiting the Sign of the Advantage Function to Learn Deterministic Policies in Continuous Domains [article]

Matthieu Zimmer, Paul Weng
2019 arXiv   pre-print
Fitted Actor Critic (NFAC).  ...  In the context of learning deterministic policies in continuous domains, we revisit an approach, which was first proposed in Continuous Actor Critic Learning Automaton (CACLA) and later extended in Neural  ...  Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as  ... 
arXiv:1906.04556v2 fatcat:nsw46p4ubvcy5ilyfirxoxe3iy

Optimizing Medical Treatment for Sepsis in Intensive Care: from Reinforcement Learning to Pre-Trial Evaluation [article]

Luchen Li, Ignacio Albert-Smet, Aldo A. Faisal
2020 arXiv   pre-print
In our work, we build on RL approaches in healthcare ("AI Clinicians"), and learn off-policy continuous dosing policy of pharmaceuticals for sepsis treatment using historical intensive care data under  ...  We focus on infections in intensive care units which are one of the major causes of death and difficult to treat because of the complex and opaque patient dynamics, and the clinically debated, highly-divergent  ...  E OFF-POLICY EVALUATIONS We evaluate our learned policy in terms of off-policy evaluation (OPE), for which we choose three approaches with varying balances between evaluation bias and variance: weighted  ... 
arXiv:2003.06474v2 fatcat:m57z672no5evlpfnmovyziseme

Learning End-to-end Multimodal Sensor Policies for Autonomous Navigation [article]

Guan-Horng Liu, Avinash Siravuru, Sai Prabhakar, Manuela Veloso, George Kantor
2017 arXiv   pre-print
We also introduce an additional auxiliary loss on the policy network in order to reduce variance in the band of potential multi- and uni-sensory policies to reduce jerks during policy switching triggered  ...  In this work, we propose a specific customization of Dropout, called Sensor Dropout, to improve multisensory policy robustness and handle partial failure in the sensor-set.  ...  Acknowledgement The authors would like to thank Po-Wei Chou, Humphrey Hu, and Ming Hsiao for many helpful discussions, suggestions and comments on the paper.  ... 
arXiv:1705.10422v2 fatcat:xbntkh4tbvctvlzvrcmlfeb7eq

A Meta-Reinforcement Learning Approach to Process Control [article]

Daniel G. McClement, Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, Johan U. Backström, R. Bhushan Gopaluni
2021 arXiv   pre-print
We test our meta-algorithm on its ability to adapt to new process dynamics as well as different control objectives on the same process.  ...  Meta-learning appears to be a promising approach for constructing more intelligent and sample-efficient controllers.  ...  ACKNOWLEDGEMENTS We gratefully acknowledge the financial support from Natural Sciences and Engineering Research Council of Canada (NSERC) and Honeywell Connected Plant.  ... 
arXiv:2103.14060v1 fatcat:o63ew5ux3beilhcysoyqgffv5q