63,076 Hits in 2.9 sec

An operator view of policy gradient methods [article]

Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
2020 arXiv   pre-print
We cast policy gradient methods as the repeated application of two operators: a policy improvement operator ℐ, which maps any policy π to a better one ℐπ, and a projection operator 𝒫, which finds the  ...  We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts.  ...  Bellemare, Kavosh Asadi, Danny Tarlow, and members of the Brain Montreal team for enlightening discussions and feedback on earlier drafts of the paper. NLR is supported by a Canada CIFAR AI Chair.  ... 
arXiv:2006.11266v3 fatcat:pfkddm36mnhgpdeib44cjff2qa
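The abstract above casts policy gradient methods as alternating an improvement operator ℐ and a projection operator 𝒫. As a minimal sketch of that operator view (not the paper's algorithm), the following uses a hypothetical 2-state, 2-action MDP, takes ℐ to be greedy improvement with respect to the exact Q^π, and takes 𝒫 to be the identity, since a tabular policy needs no projection:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
P = np.array([  # P[s, a, s'] transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.05, 0.95]],
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])  # R[s, a] expected rewards
gamma = 0.9

def q_values(pi):
    """Exact Q^pi obtained by solving the Bellman linear system."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected reward under pi
    v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('sat,t->sa', P, v)

def improve(pi):
    """Improvement operator I: act greedily with respect to Q^pi."""
    q = q_values(pi)
    new = np.zeros_like(pi)
    new[np.arange(2), q.argmax(axis=1)] = 1.0
    return new

pi = np.full((2, 2), 0.5)   # uniform initial policy
for _ in range(10):         # repeated application of I (projection P = identity)
    pi = improve(pi)
print(pi)                   # a deterministic (one-hot) improved policy
```

With a parametric policy class, 𝒫 would instead map ℐπ back onto the class, e.g. by minimizing a divergence, which is the step the paper uses to recover REINFORCE- and PPO-like updates.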

Faster AutoAugment: Learning Augmentation Strategies using Backpropagation [article]

Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, Hideki Nakayama
2019 arXiv   pre-print
We introduce approximate gradients for several transformation operations with discrete parameters as well as the differentiable mechanism for selecting operations.  ...  In this paper, we propose a differentiable policy search pipeline for data augmentation, which is much faster than previous methods.  ...  An operation O_k is applied to a given image with probability p_k and magnitude μ_k. Figure 4: Schematic view of the selection of operations in a single sub-policy when K = 2.  ... 
arXiv:1911.06987v1 fatcat:euoxrg2coredvlsbfkqod3wr2i

Parametric value function approximation: A unified view

Matthieu Geist, Olivier Pietquin
2011 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)  
It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function.  ...  Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach.  ...  Such an approach has a clear advantage for methods based on stochastic gradient descent, as it speeds up the learning.  ... 
doi:10.1109/adprl.2011.5967355 dblp:conf/adprl/GeistP11 fatcat:hhntw5pucvafxcmxknr6da57ve

Policy Learning – A Unified Perspective with Applications in Robotics [chapter]

Jan Peters, Jens Kober, Duy Nguyen-Tuong
2008 Lecture Notes in Computer Science  
In this paper, we show two contributions: firstly, we show a unified perspective which allows us to derive several policy learning algorithms from a common point of view, i.e., policy gradient algorithms  ...  Policy Learning approaches are among the best suited methods for high-dimensional, continuous control systems such as anthropomorphic robot arms and humanoid robots.  ...  They are considered the fastest policy gradient methods to date and "the current method of choice" [1].  ... 
doi:10.1007/978-3-540-89722-4_17 fatcat:ypyvk5detbagblvkfpclloxdia

Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient

Junjie Cao, Weiwei Liu, Yong Liu, Jian Yang
2020 Frontiers in Neurorobotics  
Our Evolutionary Policy Gradient combines parameter perturbation with the policy gradient method in the framework of Evolutionary Algorithms (EAs) and can fuse the benefits of both, achieving effective and  ...  The experiments, carried out in robot control tasks in OpenAI Gym with dense and sparse rewards, show that our EPG is able to provide competitive performance over the original policy gradient methods and  ...  From this point of view, our EPG is an approximation to Thompson Sampling (Thompson, 1933) in the policy parameters.  ... 
doi:10.3389/fnbot.2020.00021 pmid:32372940 pmcid:PMC7188386 fatcat:lodwo6wq2ngvlcfccuhzaa5fay

Approximation Benefits of Policy Gradient Methods with Aggregated States [article]

Daniel Russo
2021 arXiv   pre-print
Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.  ...  This paper shows that a policy gradient method converges to a policy whose regret per-period is bounded by ϵ, the largest difference between two elements of the state-action value function belonging to a common  ...  In recent years, an alternative class of algorithms known as policy gradient methods has surged in popularity.  ... 
arXiv:2007.11684v2 fatcat:sni66r52s5b4rknaax6tutwgna

On Linear Convergence of Policy Gradient Methods for Finite MDPs [article]

Jalaj Bhandari, Daniel Russo
2021 arXiv   pre-print
We revisit the finite time analysis of policy gradient methods in one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact  ...  Here, we take a different perspective based on connections with policy iteration and show that many variants of policy gradient methods succeed with large step-sizes and attain a linear rate of convergence  ...  This work was done in part when JB was participating in the Theory of Reinforcement Learning program at the Simons Institute for the Theory of Computing.  ... 
arXiv:2007.11120v2 fatcat:aom34njc3zgorg7pdrzlcn6ooe

Learning to Optimize [article]

Ke Li, Jitendra Malik
2016 arXiv   pre-print
We learn an optimization algorithm using guided policy search and demonstrate that the resulting algorithm outperforms existing hand-engineered algorithms in terms of convergence speed and/or the final  ...  In this paper, we explore automating algorithm design and present a method to learn an optimization algorithm, which we believe to be the first method that can automatically discover a better algorithm  ...  We observe that the execution of an optimization algorithm can be viewed as the execution of a fixed policy in an MDP: the state consists of the current location and the objective values and gradients  ... 
arXiv:1606.01885v1 fatcat:zz5awge6sreupdwb5azps3a7hi
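The last snippet above views running an optimizer as executing a fixed policy in an MDP whose state carries the current location, objective value, and gradient. As a minimal sketch of that framing (with a hypothetical quadratic objective), plain gradient descent is exactly such a fixed policy:

```python
import numpy as np

def objective(x):            # hypothetical quadratic objective, minimized at x = 3
    return 0.5 * np.sum((x - 3.0) ** 2)

def grad(x):
    return x - 3.0

def fixed_policy(state, lr=0.1):
    """Gradient descent as a fixed policy: the 'action' is the update step."""
    _, g = state
    return -lr * g

x = np.zeros(2)
for _ in range(200):                    # executing the policy = running the optimizer
    state = (objective(x), grad(x))     # MDP state: objective value and gradient
    x = x + fixed_policy(state)
print(x)  # converges near the minimizer [3, 3]
```

In the paper's setting, this hand-designed `fixed_policy` is what guided policy search replaces with a learned mapping from such states to update steps.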

TFPnP: Tuning-free Plug-and-Play Proximal Algorithm with Applications to Inverse Imaging Problems [article]

Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Hua Huang, Carola-Bibiane Schönlieb
2021 arXiv   pre-print
Moreover, we discuss several practical considerations of PnP denoisers, which together with our learned policy yield state-of-the-art results.  ...  Plug-and-Play (PnP) is a non-convex optimization framework that combines proximal algorithms, for example, the alternating direction method of multipliers (ADMM), with advanced denoising priors.  ...  Authors also gratefully acknowledge the financial support of the CMIH and CCIMI University of Cambridge, and Graduate school of Beijing Institute of Technology.  ... 
arXiv:2012.05703v3 fatcat:jcwekn62ira73lf5etemy3vd2q

Equivalence Between Policy Gradients and Soft Q-Learning [article]

John Schulman and Xi Chen and Pieter Abbeel
2018 arXiv   pre-print
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods.  ...  setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) Q-learning is exactly equivalent to a policy gradient method.  ...  Acknowledgements We would like to thank Matthieu Geist for pointing out an error in the first version of this manuscript, Chao Gao for pointing out several errors in the second version, and colleagues  ... 
arXiv:1704.06440v4 fatcat:h4wh75zhmzb4jmvh5qwxlolz5i
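The equivalence sketched in the abstract above rests on the standard closed form of the entropy-regularized setting: the soft value is a log-sum-exp of the soft Q-values, and the induced policy is Boltzmann in Q. A minimal numerical sketch for a single state (with hypothetical Q-values and temperature):

```python
import numpy as np

tau = 0.5                         # entropy temperature (assumed for illustration)
Q = np.array([1.0, 2.0, 0.5])     # hypothetical soft Q-values for one state

V = tau * np.log(np.sum(np.exp(Q / tau)))   # soft value: V = tau * logsumexp(Q / tau)
pi = np.exp((Q - V) / tau)                  # Boltzmann policy implied by the soft Q
print(pi, pi.sum())               # a proper distribution, peaked on the best action
```

The paper's result is that gradient updates on this implied policy coincide (to first order) with soft Q-learning updates, which is why the two algorithm families behave so similarly in practice.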

Decentralized Reinforcement Learning for Multi-Target Search and Detection by a Team of Drones [article]

Roi Yehoshua, Juan Heredia-Juesas, Yushu Wu, Christopher Amato, Jose Martinez-Lorenzo
2021 arXiv   pre-print
In this paper we develop a multi-agent deep reinforcement learning (MADRL) method to coordinate a group of aerial vehicles (drones) for the purpose of locating a set of static targets in an unknown area  ...  Our reinforcement learning method, which utilized this simulator for training, was able to find near-optimal policies for the drones.  ...  All state-of-the-art policy gradient MADRL methods use some form of centralized learning.  ... 
arXiv:2103.09520v1 fatcat:3cxglomxszatdhrtt7vg3mxvvm

The Misbehavior of Reinforcement Learning

Gianluigi Mongillo, Hanan Shteingart, Yonatan Loewenstein
2014 Proceedings of the IEEE  
., via stochastic gradient ascent, without the need of an explicit representation of values.  ...  An alternative view questions the applicability of such a computational scheme to many real-life situations.  ...  An important advantage of policy-gradient methods over value-based methods is that they retain their convergence guarantees under very general conditions when applied to POMDPs [12].  ... 
doi:10.1109/jproc.2014.2307022 fatcat:s46cu5dxe5ee5hsk2eq4u6fsyq

Visual Sensor Network Reconfiguration with Deep Reinforcement Learning [article]

Paul Jasek, Bernard Abayowa
2018 arXiv   pre-print
We present an approach for reconfiguration of dynamic visual sensor networks with deep reinforcement learning (RL).  ...  To address the issue of sample inefficiency in current approaches to model-free reinforcement learning, we train our system in an abstract simulation environment that represents inputs from a dynamic scene  ...  A3C is an example of a model-free policy-based method which trains an agent to maximize R t by updating the parameters θ of the policy π(a|s; θ).  ... 
arXiv:1808.04287v1 fatcat:2b2djgz62zd5dkzft5zposchdu
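The last snippet above describes A3C's policy-based core: adjust the parameters θ of π(a|s; θ) to maximize expected return. A minimal score-function (REINFORCE-style) sketch of that idea on a hypothetical 2-armed bandit, not the paper's full actor-critic system:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                 # policy parameters; pi = softmax(theta)
mean_reward = np.array([0.0, 1.0])  # hypothetical arm rewards

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=pi)
    r = mean_reward[a] + rng.normal(scale=0.1)
    grad_log = -pi                  # d log pi(a) / d theta for a softmax policy
    grad_log[a] += 1.0
    theta += 0.1 * r * grad_log     # ascend E[R] via the score-function estimator

pi = np.exp(theta) / np.exp(theta).sum()
print(pi)  # concentrates on the better arm
```

A3C adds a learned value baseline (the critic) and asynchronous parallel actors on top of exactly this kind of update to reduce variance and decorrelate samples.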

Reinforcement learning and human behavior

Hanan Shteingart, Yonatan Loewenstein
2014 Current Opinion in Neurobiology  
an explicit world model in terms of state-action pairs.  ...  In particular, we emphasize that learning a model of the world is an essential step prior to, or in parallel with, learning the policy in RL and discuss alternative models that directly learn a policy without  ...  of states and stochastic gradient methods  ... 
doi:10.1016/j.conb.2013.12.004 pmid:24709606 fatcat:gy5qu5hn7vaojb4kbwtvafkvwm

Proximal Deterministic Policy Gradient [article]

Marco Maggipinto and Gian Antonio Susto and Pratik Chaudhari
2020 arXiv   pre-print
Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate through bootstrapping with limited increase of computational  ...  The target network plays the role of the variable of optimization and the value network computes the proximal operator.  ...  CONCLUSIONS In this paper we propose Proximal Deterministic Policy Gradient, an off-policy RL method for model-free continuous control tasks that exploits proximal gradient methods and bootstrapping to  ... 
arXiv:2008.00759v1 fatcat:vw7jij6opzfjfpfifg4x3ikqnu