
A general class of surrogate functions for stable and efficient reinforcement learning [article]

Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Mueller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux
2022 arXiv   pre-print
Common policy gradient methods rely on the maximization of a sequence of surrogate functions.  ...  Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions.  ...  Acknowledgements: We would like to thank Veronica Chelu for suggesting the use of the log-sum-exp mirror map in Section 5. Marlos C. Machado and Nicolas Le Roux are funded by a CIFAR chair.  ... 
arXiv:2108.05828v3 fatcat:yl74owiuhzd2zht4v5whkn2vvm
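
As a rough illustration of the kind of objective a functional-mirror-ascent construction produces (a generic sketch with step size η and Bregman divergence D_Φ, not necessarily the exact FMA-PG surrogate), one iteration maximises a linearisation of the return around the current policy π_k, penalised by the divergence to π_k:

    \pi_{k+1} \in \arg\max_{\pi}\; \langle \nabla_{\pi} J(\pi_k),\, \pi - \pi_k \rangle \;-\; \tfrac{1}{\eta}\, D_{\Phi}(\pi, \pi_k)

Different choices of the mirror map Φ (squared Euclidean norm, negative entropy, or the log-sum-exp map mentioned in the acknowledgements) yield different members of the surrogate family.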

On Connections between Constrained Optimization and Reinforcement Learning [article]

Nino Vieillard, Olivier Pietquin, Matthieu Geist
2019 arXiv   pre-print
We link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert Prediction) to Dual Averaging.  ...  These abstract DP schemes are representative of a number of (deep) Reinforcement Learning (RL) algorithms.  ...  With this point of view, the function q_k plays a role similar to the gradient, giving the direction of improvement.  ... 
arXiv:1910.08476v2 fatcat:n3ajeezfcvgghnztbonzstukzy
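
To make the gradient analogy in the snippet concrete: in a mirror-descent-style policy iteration (stated generically, with step size η, rather than as the paper's exact scheme), the action-value function q_k plays the role of the gradient in a KL-regularised greedy step,

    \pi_{k+1} \in \arg\max_{\pi}\; \langle q_k, \pi \rangle \;-\; \tfrac{1}{\eta}\, \mathrm{KL}\!\left(\pi \,\|\, \pi_k\right)

which has exactly the form of a mirror-descent update with the negative entropy as mirror map; dropping the KL term recovers the fully greedy policy-iteration step.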

Boosting the Actor with Dual Critic [article]

Bo Dai, Albert Shaw, Niao He, Lihong Li, Le Song
2017 arXiv   pre-print
It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function, which is named  ...  We then provide a concrete algorithm that can effectively solve the minimax optimization problem, using techniques of multi-step bootstrapping, path regularization, and stochastic dual ascent algorithm  ...  essentially parallels the view of (approximate) stochastic mirror descent algorithm (Nemirovski et al., 2009) in the primal space.  ... 
arXiv:1712.10282v1 fatcat:rd77xgj7m5f6rbke5pxf7nays4
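
For orientation, the Lagrangian-dual structure the snippet refers to can be sketched from the standard linear-programming form of the Bellman optimality equation (generic notation; the paper's parameterised, regularised version differs in its details):

    \min_{V}\; \sum_{s} \mu(s)\, V(s) \quad \text{s.t.}\quad V(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \;\;\forall\, s, a

Introducing multipliers ρ(s,a) ≥ 0 for the constraints gives a Lagrangian that is minimised over the value-like (critic) variable V and maximised over the occupancy-like (actor) variable ρ, which is the two-player minimax structure mentioned above.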

Mirror Learning: A Unifying Framework of Policy Optimisation [article]

Jakub Grudzien Kuba, Christian Schroeder de Witt, Jakob Foerster
2022 arXiv   pre-print
Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.  ...  We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate  ...  A more general class of methods can be derived from the policy gradient theorem (Sutton et al., 2000) , where an agent optimises its policy parameters through gradient ascent.  ... 
arXiv:2201.02373v3 fatcat:d2oykjiuw5ekvjravz2mbwmlr4
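
The last sentence of the snippet refers to plain gradient ascent on policy parameters via the policy gradient theorem. Below is a minimal, self-contained REINFORCE-style sketch on a toy bandit (NumPy only; the arm means and step size are hypothetical, and this illustrates the general method rather than any algorithm in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.8])   # hypothetical 3-armed bandit
    theta = np.zeros(3)                       # softmax policy parameters
    alpha = 0.1                               # step size

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    for t in range(2000):
        pi = softmax(theta)
        a = rng.choice(3, p=pi)
        r = rng.normal(true_means[a], 0.1)    # sampled reward
        # REINFORCE estimate: grad of log pi(a) times the observed reward
        grad_log = -pi
        grad_log[a] += 1.0
        theta += alpha * r * grad_log          # plain gradient ascent

    print(softmax(theta))  # probability mass should concentrate on the best arm

Replacing the raw reward with an advantage estimate and preconditioning or regularising this update is, loosely, what the mirror-learning-style methods surveyed on this page modify.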

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift [article]

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan
2020 arXiv   pre-print
Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces.  ...  This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs).  ...  Sham Kakade and Alekh Agarwal gratefully acknowledge numerous helpful discussions with Wen Sun with regards to the Q-NPG algorithm and our notion of transfer error.  ... 
arXiv:1908.00261v5 fatcat:y4elqflzebgy7bwl7rbkt3xmb4

Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces [article]

Sridhar Mahadevan, Bo Liu, Philip Thomas, Will Dabney, Steve Giguere, Nicholas Jacek, Ian Gemp, Ji Liu
2014 arXiv   pre-print
of mirror descent methods.  ...  stable region of the parameter space, (iii) how to design "off-policy" temporal difference learning algorithms in a reliable and stable manner, and finally (iv) how to integrate the study of reinforcement  ...  Acknowledgements: We would like to acknowledge the useful feedback of past and present members of the Autonomous Learning Laboratory at the University of Massachusetts, Amherst.  ... 
arXiv:1405.6757v1 fatcat:u77kqc6iyncy7fixlnrfcnqrmy
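
For readers who want the mirror-descent template the abstract builds on, the textbook update (generic, not specific to the primal-dual TD algorithms developed in the paper) is

    \theta_{k+1} \in \arg\min_{\theta}\; \langle \nabla f(\theta_k),\, \theta \rangle \;+\; \tfrac{1}{\eta}\, D_{\Phi}(\theta, \theta_k)

which reduces to ordinary gradient descent for \Phi(\theta) = \tfrac{1}{2}\|\theta\|_2^2 and to exponentiated-gradient updates for the negative-entropy mirror map.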

An operator view of policy gradient methods [article]

Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
2020 arXiv   pre-print
We cast policy gradient methods as the repeated application of two operators: a policy improvement operator ℐ, which maps any policy π to a better one ℐπ, and a projection operator 𝒫, which finds the best approximation of ℐπ in the set of realizable policies.  ...  Broader Impact: As this work has a theoretical focus, it is unlikely to have a direct impact on society at large, although it may guide future research with such an impact.  ... 
arXiv:2006.11266v3 fatcat:pfkddm36mnhgpdeib44cjff2qa
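
A hedged instantiation of the two operators in the snippet (one common choice, written here only for illustration; the paper studies several): the improvement operator can reweight action probabilities by exponentiated action values, and the projection returns the closest realizable policy in expected KL,

    (\mathcal{I}\pi)(a \mid s) \;\propto\; \pi(a \mid s)\, e^{\eta\, Q^{\pi}(s,a)}, \qquad \mathcal{P}\mu \;=\; \arg\min_{\pi_\theta} \; \mathbb{E}_{s}\!\left[\mathrm{KL}\!\left(\mu(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right]

Alternating ℐ and 𝒫 then yields updates of the same flavour as the regularised policy-gradient methods listed elsewhere on this page.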

Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system [article]

Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, Emanuel Todorov
2018 arXiv   pre-print
We use a modified form of the natural policy gradient algorithm for learning, applied to a carefully identified simulation model.  ...  However, most results have been limited to simulation due to the need for a large number of samples and the lack of automated-yet-safe data collection methods.  ...  Natural Policy Gradient: Policy gradient algorithms are a class of RL methods where the parameters of the policy are directly optimized, typically using gradient-based methods.  ... 
arXiv:1803.10371v1 fatcat:7smzux4lvnf4zknamzvrmb2igi
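
The natural policy gradient mentioned above preconditions the vanilla gradient with the inverse Fisher information of the policy distribution; in its standard form (the paper uses a modified variant),

    \theta_{k+1} = \theta_k + \eta\, F(\theta_k)^{-1} \nabla_{\theta} J(\theta_k), \qquad F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_{\theta} \log \pi_\theta(a \mid s)\, \nabla_{\theta} \log \pi_\theta(a \mid s)^{\top} \right]

In practice the step size η is often derived from a KL (trust-region) constraint rather than fixed in advance.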

Quasi-Newton policy gradient algorithms [article]

Haoya Li, Samarth Gupta, Hsiangfu Yu, Lexing Ying, Inderjit Dhillon
2021 arXiv   pre-print
In this paper, we propose a quasi-Newton method for the policy gradient algorithm with entropy regularization.  ...  For other entropy functions, this method results in brand new policy gradient algorithms.  ...  be viewed as an approximate mirror descent method and the A3C method as an MD method for the dual-averaging (Nesterov, 2009) objective.  ... 
arXiv:2110.02398v2 fatcat:wrvwzaneczdtfegpbuhgqc5auu
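
The entropy-regularised objective such methods optimise can be written generically as (the quasi-Newton preconditioner proposed in the paper is not reproduced here)

    J_{\tau}(\theta) = \mathbb{E}_{\pi_\theta}\!\Big[ \sum_{t \ge 0} \gamma^{t} \big( r(s_t, a_t) + \tau\, \mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big) \big) \Big]

with temperature τ > 0; replacing the Shannon entropy ℋ by other entropy functions gives the further variants the snippet alludes to.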

Ascent Similarity Caching with Approximate Indexes [article]

T. Si Salem, G. Neglia, D. Carra
2021 arXiv   pre-print
In this paper we present AÇAI, a new similarity caching policy which improves on the state of the art by using (i) an (approximate) index for the whole catalog to decide which objects to serve locally and which to retrieve from the remote server, and (ii) a mirror ascent algorithm to update the set of local objects with strong guarantees even when the request process does not exhibit any statistical  ...  We deviate from these works by considering k > 1, large catalog size, and the more general family of online mirror ascent algorithms (of which the usual gradient ascent method is a particular instance)  ... 
arXiv:2107.00957v3 fatcat:gkwvs7kpcbfqva76wnkinnv6fq
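
As a pointer to the algorithm family the snippet names, here is a generic online mirror ascent step on the probability simplex with the negative-entropy mirror map (multiplicative-weights form). It only illustrates the update structure; AÇAI's actual decision variables, constraints, and gradients for similarity caching differ.

    import numpy as np

    def omd_step_simplex(x, grad, eta):
        """One online mirror ascent step with the negative-entropy mirror map.

        x    : current point on the probability simplex
        grad : (sub)gradient of the reward observed this round
        eta  : learning rate
        """
        y = x * np.exp(eta * grad)   # ascent step in the dual (log) space
        return y / y.sum()           # Bregman projection back onto the simplex

    x = np.full(4, 0.25)             # uniform start over 4 hypothetical objects
    for grad in ([1.0, 0.0, 0.0, 0.0], [0.5, 1.0, 0.0, 0.0]):
        x = omd_step_simplex(x, np.array(grad), eta=0.5)
    print(x)

Choosing the squared Euclidean norm as the mirror map instead recovers plain projected gradient ascent, the "particular instance" the snippet mentions.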

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity [article]

Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, Marek Petrik
2020 arXiv   pre-print
We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point  ...  We provide experimental results showing the improved performance of our accelerated gradient TD methods.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.  ... 
arXiv:2006.03976v1 fatcat:btdxyuh3obbq3fluc3vlofb6te
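
The saddle-point reformulation the abstract refers to follows from expressing a weighted quadratic through its convex conjugate (generic notation with A, b, and a positive-definite weighting M; the paper instantiates these with TD quantities):

    \min_{\theta}\; \tfrac{1}{2}\,\| b - A\theta \|_{M^{-1}}^{2} \;=\; \min_{\theta}\, \max_{y}\; \langle b - A\theta,\, y \rangle \;-\; \tfrac{1}{2}\, \| y \|_{M}^{2}

which makes stochastic primal-dual (proximal gradient) methods directly applicable and opens the door to the sample-complexity guarantees advertised in the title.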

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, Marek Petrik
2018 The Journal of Artificial Intelligence Research  
We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point  ...  We provide experimental results showing the improved performance of our accelerated gradient TD methods.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.  ... 
doi:10.1613/jair.1.11251 fatcat:axcp56nezbeovooraextarycki

Towards Piston Fine Tuning of Segmented Mirrors through Reinforcement Learning

Dailos Guerra-Ramos, Juan Trujillo-Sevilla, Jose Manuel Rodríguez-Ramos
2020 Applied Sciences  
This approach has been used in this paper to correct piston misalignment between segments in a segmented mirror telescope.  ...  Unlike supervised machine learning methods, reinforcement learning allows an entity to learn how to deploy a task from experience rather than labeled data.  ...  Acknowledgments: This work was supported by Wooptix, a spinoff company of the Universidad de La Laguna. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/app10093207 fatcat:rayxoh3ikngcfalpkdehgjn7u4

Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms [article]

Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
2021 arXiv   pre-print
From a theoretical standpoint, we develop a policy gradient theorem for the refined update and provide a local convergence guarantee for the Stackelberg actor-critic algorithms to a local Stackelberg equilibrium  ...  Given this abstraction, we propose a meta-framework for Stackelberg actor-critic algorithms where the leader player follows the total derivative of its objective instead of the usual individual gradient  ...  Deep Deterministic Policy Gradient (DDPG). The DDPG algorithm is an off-policy method with subtly different objective functions for the actor and critic.  ... 
arXiv:2109.12286v1 fatcat:usntqn5alzdozpopbnb3ngn5zy
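
The "total derivative" in the snippet accounts for the follower's best response. In generic two-player notation, with leader cost f_1(θ_1, θ_2) and follower cost f_2(θ_1, θ_2) (not the paper's exact actor-critic losses), implicit differentiation of the follower's stationarity condition ∇_2 f_2 = 0 gives

    \frac{\mathrm{d} f_1}{\mathrm{d} \theta_1} \;=\; \nabla_{1} f_1 \;-\; \left( \nabla_{21} f_2 \right)^{\top} \left( \nabla_{22} f_2 \right)^{-1} \nabla_{2} f_1

so the leader's update differs from the usual individual gradient ∇_1 f_1 by a curvature-aware correction term.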

Efficient Baseline-Free Sampling in Parameter Exploring Policy Gradients: Super Symmetric PGPE [chapter]

Frank Sehnke
2013 Lecture Notes in Computer Science  
Policy Gradient methods that explore directly in parameter space are among the most effective and robust direct policy search methods and have drawn a lot of attention lately.  ...  The basic method from this field, Policy Gradients with Parameter-based Exploration, uses two samples that are symmetric around the current hypothesis to circumvent misleading reward in asymmetrical reward  ...  Acknowledgments: This work was funded by the Zentrum für Sonnenenergie- und Wasserstoff-Forschung, MEXT Scholarship, and Tianjin Natural Science Foundation of China: 13JCQNJC00200.  ... 
doi:10.1007/978-3-642-40728-4_17 fatcat:rqfpbz4sufc6bbcgns3r7t7skm
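
A minimal sketch of the symmetric-sampling idea described above, for a PGPE-style search over policy parameters (the black-box objective and hyperparameters are made up; the paper's super-symmetric extension and reward normalisation are not reproduced):

    import numpy as np

    rng = np.random.default_rng(0)

    def episode_return(theta):
        """Hypothetical black-box evaluation of a parameter vector
        (stand-in for running one episode with a deterministic policy)."""
        return -np.sum((theta - 1.0) ** 2)   # toy objective, maximum at theta = 1

    mu = np.zeros(5)          # mean of the Gaussian hyper-policy
    sigma = np.full(5, 2.0)   # per-parameter exploration scale
    alpha = 0.05

    for _ in range(500):
        eps = rng.normal(0.0, sigma)           # one perturbation ...
        r_plus = episode_return(mu + eps)      # ... evaluated symmetrically
        r_minus = episode_return(mu - eps)     # around the current hypothesis
        mu += alpha * 0.5 * (r_plus - r_minus) * eps / (sigma ** 2)  # mean update

    print(mu.round(2))   # should approach the optimum at 1.0

Because both evaluations share the same perturbation, any state-independent offset cancels in r_plus - r_minus, which is why this estimator needs no reward baseline.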
Showing results 1 — 15 out of 570 results