
Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear Function Approximation [article]

Zaiwei Chen, Sajad Khodadadian, Siva Theja Maguluri
2022 arXiv   pre-print
In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation and establish a sample complexity of 𝒪(ϵ^-3), outperforming all the previously  ...  In order to overcome the divergence due to the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an n-step TD-learning algorithm with a properly chosen  ...  We develop a variant of NAC with off-policy sampling, where both the actor and the critic use linear function approximation, and the critic uses off-policy sampling.  ... 
arXiv:2105.12540v2 fatcat:s7uf2koeobcudfyvub5nnqtpdq
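The snippet describes a critic that uses n-step TD-learning for off-policy policy evaluation under linear function approximation. As a rough illustration of the general idea only (not the paper's exact algorithm; the function name, trajectory format, and variable names here are invented), a minimal importance-weighted n-step TD update for a linear critic might look like:

```python
import numpy as np

def n_step_offpolicy_td(w, traj, gamma, alpha, n):
    """One importance-weighted n-step TD update for a linear critic.

    traj: list of (phi, a, r, phi_next, rho) tuples, where phi is the
    state feature vector and rho = pi(a|s) / mu(a|s) is the importance
    ratio between target policy pi and behavior policy mu.
    """
    phi0 = traj[0][0]
    G = 0.0
    rho_prod = 1.0
    for k in range(n):
        phi, a, r, phi_next, rho = traj[k]
        rho_prod *= rho                      # correct for off-policy sampling
        G += rho_prod * (gamma ** k) * r
    # bootstrap from the linear value estimate n steps ahead
    G += rho_prod * (gamma ** n) * (w @ traj[n - 1][3])
    td_error = G - w @ phi0
    return w + alpha * td_error * phi0
```

With rho ≡ 1 and n = 1 this reduces to ordinary on-policy TD(0); the product of ratios is what corrects the n-step return for the mismatch between behavior and target policies.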

Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [article]

Raghuram Bharadwaj Diddigi, Prateek Jain, Prabuchandran K.J., Shalabh Bhatnagar
2022 arXiv   pre-print
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency  ...  We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.  ...  To the best of our knowledge, ours is the first off-policy natural actor-critic algorithm that utilizes non-linear or deep function approximation.  ... 
arXiv:2110.10017v2 fatcat:7bqw5rbevra3xibf4lnab6qjsu

Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation [article]

Hamid Reza Maei
2018 arXiv   pre-print
To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.  ...  We present the first class of policy-gradient algorithms that work with both state-value and policy function-approximation, and are guaranteed to converge under off-policy training.  ...  of Actor-Critic with off-policy learning.  ... 
arXiv:1802.07842v1 fatcat:5nydoixyqve6fhkvml44jjqpjm

Off-Policy Actor-Critic [article]

Thomas Degris, Martha White, Richard S. Sutton
2013 arXiv   pre-print
This paper presents the first actor-critic algorithm for off-policy reinforcement learning.  ...  Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning.  ...  Appendix of Off-Policy Actor-Critic A.1.  ... 
arXiv:1205.4839v5 fatcat:sgvymosefjd5ja3wsm7p7v3yva
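Off-PAC weights each policy-gradient step by the importance ratio between the target and behavior policies, using the critic's TD error in place of the return. A minimal sketch of that style of actor update, assuming a linear softmax policy (names and signatures are illustrative, not the paper's code):

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Linear softmax policy over per-action features phi_sa (A x d)."""
    prefs = phi_sa @ theta
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def offpac_actor_step(theta, phi_sa, a, rho, td_error, alpha):
    """Off-PAC-style actor update: theta += alpha * rho * delta * grad log pi.

    rho = pi(a|s) / mu(a|s) for the behavior-sampled action a, and
    td_error is the critic's TD error for the observed transition.
    """
    pi = softmax_policy(theta, phi_sa)
    grad_log_pi = phi_sa[a] - pi @ phi_sa   # softmax score function
    return theta + alpha * rho * td_error * grad_log_pi
```

The importance ratio rho is the only place the behavior policy enters: with rho ≡ 1 this collapses to a standard on-policy actor-critic step.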

Decomposed Soft Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning [article]

Yuan Pu, Shaochen Wang, Rui Yang, Xin Yao, Bin Li
2021 arXiv   pre-print
In this paper, we propose a new decomposed multi-agent soft actor-critic (mSAC) method, which effectively combines the advantages of the aforementioned two methods.  ...  Theoretically, mSAC supports efficient off-policy learning and partially addresses the credit assignment problem in both discrete and continuous action spaces.  ...  SAC is a popular single-agent off-policy actor-critic method using the maximum entropy reinforcement learning framework.  ... 
arXiv:2104.06655v2 fatcat:ntzvgwp2zzb5vfu45z2wnopmj4

Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation [article]

Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
2020 arXiv   pre-print
We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation.  ...  With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.  ...  To address this issue, Degris et al. (2012) propose the Off-Policy Actor-Critic (Off-PAC) algorithm.  ... 
arXiv:1911.04384v9 fatcat:mpt7qwzonbb7pmwdy3qpmxh4cu

Value-Decomposition Multi-Agent Actor-Critics [article]

Jianyu Su, Stephen Adams, Peter A. Beling
2020 arXiv   pre-print
To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value-decomposition to actor-critics that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critics (VDACs).  ...  To bridge the gap between multi-agent Q-learning and multi-agent actor-critics, as well as offer a reasonable trade-off between training efficiency and algorithm performance, we propose a novel actor-critic  ... 
arXiv:2007.12306v4 fatcat:337h2yo2qjhtnggt6ggplmunse

Reinforcement Learning with Deep Quantum Neural Networks

Wei Hu, James Hu
2019 Journal of Quantum Information Science  
Using quantum photonic circuits, we implement Q-learning and actor-critic algorithms with multilayer quantum neural networks and test them in the grid world environment.  ...  When an RL algorithm can use a behavior policy different from its target policy, it is called off-policy; otherwise it is on-policy. Q-learning is off-policy.  ...  Off-policy methods can learn about an optimal policy while executing an exploratory policy.  ... 
doi:10.4236/jqis.2019.91001 fatcat:jls5a6knfbcftgig4q3cirjhru
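The off-policy/on-policy distinction the snippet draws is visible directly in the tabular update rules: Q-learning bootstraps from the greedy action (its implicit target policy) regardless of what the behavior policy does next, while SARSA bootstraps from the action actually taken. A minimal tabular sketch:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstraps from the greedy action in s_next,
    independent of the action the behavior policy takes there."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstraps from a_next, the action actually taken,
    so the learned values are those of the behavior policy itself."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

The single changed term in the target (max over actions vs. the sampled next action) is exactly what lets Q-learning learn about an optimal policy while executing an exploratory one.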

Comparison of reinforcement learning algorithms applied to the cart-pole problem

Savinay Nagendra, Nikhil Podila, Rashmi Ugarakhod, Koshy George
2017 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI)  
RL algorithms such as temporal-difference, policy gradient actor-critic, and value function approximation are compared in this context with the standard LQR solution.  ...  Actor-critic policy gradient method The policy gradient actor-critic method involves the calculation of a value function by the critic network as well as the selection of the best action for the given  ...  Thus, policy-gradient actor-critic methods are considered with the following parts [12]: the critic, which updates the action-value function or its parameters, and the actor, which updates the policy parameters  ... 
doi:10.1109/icacci.2017.8125811 dblp:conf/icacci/NagendraPUG17 fatcat:l4huizmwgzbtnooenzo5uqryly

Differentially Private Actor and Its Eligibility Trace

Kanghyeon Seo, Jihoon Yang
2020 Electronics  
We present a differentially private actor and its eligibility trace in an actor-critic approach, wherein an actor takes actions directly interacting with an environment; however, the critic estimates only  ...  In this paper, we confirm the applicability of differential privacy methods to the actors updated using the policy gradient algorithm and discuss the advantages of such an approach with regard to differentially  ...  Off-Policy Actor-Critic Off-policy actor-critic, introduced by Degris et al.  ... 
doi:10.3390/electronics9091486 fatcat:3gvviyvubrd4voe3uumkmxyhee

Parameter-Based Value Functions [article]

Francesco Faccio, Louis Kirsch, Jürgen Schmidhuber
2021 arXiv   pre-print
Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy.  ...  First we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods.  ...  The first attempt to obtain a stable off-policy actor-critic algorithm under linear function approximation was called Off-PAC (Degris et al., 2012), where the critic is updated using GTD(λ) (Maei, 2011).  ... 
arXiv:2006.09226v4 fatcat:hddfxp3qyvertj7sert4uw63eq

A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward [article]

S.A. Murphy, Y. Deng, E.B. Laber, H.R. Maei, R.S. Sutton, K. Witkiewitz
2016 arXiv   pre-print
We develop an off-policy actor-critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals.  ...  The actor algorithm is represented by the maximization steps in the batch, off-policy, actor-critic algorithm given in Algorithm 2.  ...  Here too, we learn a policy from data from multiple individuals; to our knowledge, this paper is the first off-policy, batch, actor-critic algorithm for such use.  ... 
arXiv:1607.05047v1 fatcat:vc2csijpk5eozgx3hj3jsvg2wy
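The average-reward criterion optimized here replaces the discounted TD error with one centered on a running estimate of the long-run average reward. A minimal sketch of a linear average-reward TD(0) critic step, illustrative only and not the paper's batch algorithm (names and signatures are invented):

```python
import numpy as np

def average_reward_td_step(w, r_bar, phi, r, phi_next, alpha_w, alpha_r):
    """Average-reward TD(0): delta = r - r_bar + V(s') - V(s), where
    r_bar is a running estimate of the long-run average reward and
    V(s) = w @ phi is a linear value estimate."""
    delta = r - r_bar + w @ phi_next - w @ phi
    w = w + alpha_w * delta * phi        # critic weight update
    r_bar = r_bar + alpha_r * delta      # track the average reward
    return w, r_bar
```

Subtracting r_bar plays the role that discounting plays in the discounted setting: the critic learns differential values, i.e., how much better a state is than average.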

Off-Policy Multi-Agent Decomposed Policy Gradients [article]

Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang
2020 arXiv   pre-print
This method introduces the idea of value function decomposition into the multi-agent actor-critic framework.  ...  Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces.  ...  Off-Policy Learning In our method, κ controls the "off-policyness" of training. For DOP, we set κ to 0.5.  ... 
arXiv:2007.12322v2 fatcat:h5fke67j5bfenercbqmquk7dta

An Off-policy Policy Gradient Theorem Using Emphatic Weightings [article]

Ehsan Imani, Eric Graves, Martha White
2019 arXiv   pre-print
We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem.  ...  In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings.  ...  Actor-Critic with Emphatic Weightings In this section, we develop an incremental actor-critic algorithm with emphatic weightings that uses the above off-policy policy gradient theorem.  ... 
arXiv:1811.09013v2 fatcat:kolr7jbvyrci5ogttus72xl3fu
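Emphatic weightings reweight off-policy updates through a "followon" trace accumulated along the behavior trajectory, F_t = γ ρ_{t-1} F_{t-1} + i_t, where ρ is the importance ratio and i_t is the interest in the current state. A minimal sketch of just that recursion (ACE's full actor update builds considerably more on top of it):

```python
def emphatic_trace(rhos, gammas, interest=1.0):
    """Followon trace F_t = gamma_t * rho_{t-1} * F_{t-1} + i_t, used by
    emphatic methods to weight updates by how much earlier (importance-
    corrected) states bootstrap into the current one."""
    F = []
    F_prev, rho_prev = 0.0, 1.0
    for rho, gamma in zip(rhos, gammas):
        F_t = gamma * rho_prev * F_prev + interest
        F.append(F_t)
        F_prev, rho_prev = F_t, rho
    return F
```

With ρ ≡ 1 and constant γ the trace grows toward 1/(1-γ), recovering the uniform weighting of the on-policy case.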

Zeroth-Order Actor-Critic [article]

Yuheng Lei, Jianyu Chen, Shengbo Eben Li, Sifa Zheng
2022 arXiv   pre-print
We propose the Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies these two methods into an on-policy actor-critic architecture to preserve the advantages of both.  ...  sample efficient but restricted to differentiable policies, and the learned policies are less robust.  ...  Zeroth-Order Actor-Critic From ES to ZOAC In this section, we derive an improved zeroth-order gradient by combining it with the actor-critic architecture for policy improvement.  ... 
arXiv:2201.12518v2 fatcat:spzqzjak7fhrlfbujjzxkh6as4
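ZOAC builds on evolution-strategies-style zeroth-order gradients, which estimate a policy gradient from function evaluations alone, with no differentiation through the policy. A generic antithetic zeroth-order estimator (a standard ES construction, not ZOAC's exact estimator; names are illustrative):

```python
import numpy as np

def zeroth_order_grad(f, theta, sigma=0.1, n_samples=200, rng=None):
    """Antithetic zeroth-order gradient estimate of a black-box objective f:
    grad J(theta) ~ E[ (f(theta + sigma*eps) - f(theta - sigma*eps))
                       / (2*sigma) * eps ],  eps ~ N(0, I)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = rng.standard_normal(theta.shape)
        # paired +/- perturbations cancel the zeroth-order term, reducing variance
        g += (f(theta + sigma * eps) - f(theta - sigma * eps)) / (2 * sigma) * eps
    return g / n_samples
```

Because f is only queried, the policy need not be differentiable; the cost is higher variance, which is what combining the estimator with a critic (as the result's title suggests) aims to reduce.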
Showing results 1 — 15 out of 56,390 results