20,584 Hits in 5.2 sec

Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies [article]

Xinyun Chen, Lu Wang, Yizhe Hang, Heng Ge, Hongyuan Zha
2019 arXiv   pre-print
We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies.  ...  Recent work has shown the key role played by the state or state-action stationary distribution corrections in the infinite horizon context for off-policy policy evaluation.  ...  In this paper, we propose a partially policy-agnostic method, EMP (estimated mixture policy) for infinite-horizon off-policy policy evaluation with multiple known or unknown behavior policies.  ... 
arXiv:1910.04849v1 fatcat:opmzzszinbctbhw6gwo4hmcmaa
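The abstract above turns on reweighting logged data by a behavior policy that is itself a mixture of several logging policies. As a hedged illustration (not the paper's EMP implementation), the sketch below estimates the mixture behavior policy as the data-proportion-weighted average of tabular logging policies and forms per-step importance ratios against it; the function names and the tabular setting are assumptions for exposition.

```python
import numpy as np

def mixture_behavior_policy(behavior_policies, trajectory_counts):
    """Estimate the effective behavior policy as a mixture of K logging policies.

    behavior_policies: array of shape (K, S, A), each a tabular policy pi_k(a|s).
    trajectory_counts: array of shape (K,), number of trajectories logged by each policy.
    Returns an (S, A) array: the data-weighted mixture sum_k w_k * pi_k(a|s).
    """
    weights = np.asarray(trajectory_counts, dtype=float)
    weights /= weights.sum()
    return np.einsum("k,ksa->sa", weights, np.asarray(behavior_policies, dtype=float))

def importance_ratios(target_policy, mixture_policy, states, actions):
    """Per-step ratios pi(a_t|s_t) / beta_mix(a_t|s_t) for one logged trajectory."""
    return target_policy[states, actions] / mixture_policy[states, actions]

# Toy usage: two logging policies over 3 states and 2 actions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pis = rng.dirichlet(np.ones(2), size=(2, 3))      # shape (K=2, S=3, A=2)
    beta_mix = mixture_behavior_policy(pis, [100, 300])
    target = rng.dirichlet(np.ones(2), size=3)         # shape (S=3, A=2)
    print(importance_ratios(target, beta_mix, states=[0, 1, 2], actions=[1, 0, 1]))
```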

Behavior Regularized Offline Reinforcement Learning [article]

Yifan Wu, George Tucker, Ofir Nachum
2019 arXiv   pre-print
In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline  ...  In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment.  ...  Acknowledgements We thank Aviral Kumar, Ilya Kostrikov, Yinlam Chow, and others at Google Research for helpful thoughts and discussions.  ... 
arXiv:1911.11361v1 fatcat:jodhgl4tozayxdgi4owbk3suqm
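BRAC's common thread is to keep the learned policy close to the behavior policy, either by penalizing the critic's target or the actor's objective. The snippet below is a minimal sketch of an actor-side penalty, assuming a squared distance between policy and dataset actions as a stand-in for the divergence term (BRAC itself studies KL- and MMD-style choices):

```python
import numpy as np

def brac_actor_loss(q_values, policy_actions, behavior_actions, alpha=1.0):
    """Behavior-regularized actor objective (to minimize), sketched per batch.

    q_values:         Q(s, a_pi) for actions sampled from the current policy, shape (B,).
    policy_actions:   actions proposed by the current policy, shape (B, action_dim).
    behavior_actions: actions actually logged in the dataset, shape (B, action_dim).
    alpha:            regularization strength.

    The divergence D(pi, beta) is replaced here by a mean squared distance between
    policy and dataset actions purely for illustration.
    """
    divergence = np.mean(np.sum((policy_actions - behavior_actions) ** 2, axis=-1))
    return -np.mean(q_values) + alpha * divergence
```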

Cooperative and Competitive Reinforcement and Imitation Learning for a Mixture of Heterogeneous Learning Modules

Eiji Uchibe
2018 Frontiers in Neurorobotics  
Each learning module has its own network architecture and improves the policy based on an off-policy reinforcement learning algorithm and behavior cloning from samples collected by a behavior policy that  ...  This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL) for selecting an appropriate policy from a set of multiple heterogeneous modules and training all of them in  ...  In addition, Mix & Match uses a mixture of policies and optimizes the mixing weights by a kind of evolutionary computation. Since Mix & Match needs multiple simulators, it is sample-inefficient.  ... 
doi:10.3389/fnbot.2018.00061 pmid:30319389 pmcid:PMC6170616 fatcat:2vbrrtjg7rgibng3crtj2p42iy

Offline RL Without Off-Policy Evaluation [article]

David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna
2021 arXiv   pre-print
We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against  ...  Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation.  ...  Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study.  ... 
arXiv:2106.08909v3 fatcat:3ipj5t6vhvagrpk4kudflekq6i

On Multi-objective Policy Optimization as a Tool for Reinforcement Learning [article]

Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell (+2 others)
2021 arXiv   pre-print
For offline RL, we use the MO perspective to derive a simple algorithm that optimizes for the standard RL objective plus a behavioral cloning term.  ...  Often, task reward and auxiliary objectives are in conflict with each other and it is therefore natural to treat these examples as instances of multi-objective (MO) optimization problems.  ...  support for this paper.  ... 
arXiv:2106.08199v1 fatcat:uqpsvp7u4ranvk2psymtf3ugse
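The "standard RL objective plus a behavioral cloning term" view is an instance of scalarizing multiple objectives with fixed preference weights. A minimal sketch of that scalarization, assuming per-sample objective values are already computed (the 0.7/0.3 split in the usage comment is purely illustrative):

```python
import numpy as np

def scalarized_objective(objective_returns, weights):
    """Combine per-objective returns into one scalar via fixed preference weights.

    objective_returns: shape (B, M), return (or advantage) of each of M objectives per sample.
    weights:           shape (M,), non-negative preferences (normalized internally).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.mean(objective_returns @ weights))

# Offline-RL flavour: treat task advantage and a behavioral-cloning score
# (e.g. log-likelihood of dataset actions) as two objectives:
#   combined = scalarized_objective(np.stack([task_adv, bc_logprob], axis=1), [0.7, 0.3])
```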

Top-K Off-Policy Correction for a REINFORCE Recommender System [article]

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed Chi
2021 arXiv   pre-print
learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing  ...  The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the orders of millions; (2) applying off-policy correction to address data biases in  ...  ACKNOWLEDGEMENTS We thank Craig Boutilier for his valuable comments and discussions.  ... 
arXiv:1812.02353v3 fatcat:uwdjn66aercizjnek63rjfwwiu
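Contribution (3) above rescales the usual importance-weighted REINFORCE term so the gradient accounts for an item being surfaced anywhere in a size-K slate. A sketch of the per-sample weight, using the multiplier λ_K(a|s) = K·(1 − π(a|s))^(K−1) described in the paper; the ratio clipping threshold is an illustrative variance-control assumption:

```python
import numpy as np

def topk_reinforce_weights(pi_target, pi_behavior, k, clip=10.0):
    """Per-sample weights for top-K off-policy-corrected REINFORCE.

    pi_target:   pi_theta(a_t|s_t) under the policy being trained, shape (B,).
    pi_behavior: beta(a_t|s_t) under the logging (behavior) policy, shape (B,).
    k:           slate size.
    clip:        cap on the importance ratio (illustrative variance control).

    weight = min(pi/beta, clip) * lambda_K, with lambda_K = K * (1 - pi)^(K-1).
    """
    ratio = np.minimum(pi_target / pi_behavior, clip)
    lambda_k = k * (1.0 - pi_target) ** (k - 1)
    return ratio * lambda_k

# The gradient estimate then averages
#   weight_t * reward_t * grad(log pi_theta(a_t|s_t))
# over logged (s_t, a_t, reward_t) tuples.
```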

Data-efficient Hindsight Off-policy Option Learning [article]

Markus Wulfmeier, Dushyant Rao, Roland Hafner, Thomas Lampe, Abbas Abdolmaleki, Tim Hertweck, Michael Neunert, Dhruva Tirumala, Noah Siegel, Nicolas Heess, Martin Riedmiller
2021 arXiv   pre-print
To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization.  ...  We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm.  ...  We additionally like to acknowledge the support of the DeepMind robotics lab for infrastructure and engineering support.  ... 
arXiv:2007.15588v2 fatcat:ilf6pndw5zhnphh6drdgk2kc4q

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning [article]

Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine
2019 arXiv   pre-print
to regress onto weighted target actions for the policy.  ...  AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions.  ...  ACKNOWLEDGEMENTS We thank Abhishek Gupta and Aurick Zhou for insightful discussions. This research was supported by an NSERC Postgraduate Scholarship, a Berkeley Fellowship for Graduate Study, Berkeley  ... 
arXiv:1910.00177v3 fatcat:tbrlxwnen5c5viroqsmopoqvcq
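The "regress onto weighted target actions" step in AWR amounts to weighting dataset actions by an exponentiated advantage and fitting the policy by weighted regression. A minimal sketch under an assumed Gaussian-policy, squared-error form; the temperature and weight cap are illustrative choices rather than the paper's exact settings:

```python
import numpy as np

def awr_policy_loss(policy_mean_actions, dataset_actions, advantages,
                    temperature=1.0, max_weight=20.0):
    """Advantage-weighted regression loss for the policy.

    policy_mean_actions: actions predicted by the current policy at dataset states, (B, A).
    dataset_actions:     actions stored in the replay data, (B, A).
    advantages:          estimated A(s, a) for those state-action pairs, (B,).
    """
    weights = np.minimum(np.exp(advantages / temperature), max_weight)
    sq_err = np.sum((policy_mean_actions - dataset_actions) ** 2, axis=-1)
    return float(np.mean(weights * sq_err))
```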

Off-policy Maximum Entropy Reinforcement Learning : Soft Actor-Critic with Advantage Weighted Mixture Policy(SAC-AWMP) [article]

Zhimin Hou and Kuangen Zhang and Yi Wan and Dongyu Li and Chenglong Fu and Haoyong Yu
2020 arXiv   pre-print
The optimal policy of a reinforcement learning problem is often discontinuous and non-smooth; that is, for two states with similar representations, their optimal policies can be significantly different.  ...  A common way to solve this problem, known as Mixture-of-Experts, is to represent the policy as the weighted sum of multiple components, where different components perform well on different parts of the  ...  Acknowledgments This work was supported by Agency for Science, Technology and Research, Singapore, under the National Robotics Program, with A*star SERC Grant No.: 192 25 00054.  ... 
arXiv:2002.02829v1 fatcat:rahjd3ta45guvglcgvtiynvrdu
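The Mixture-of-Experts representation mentioned in the abstract writes the policy as a state-dependent weighted sum of component policies. A small tabular sketch of that composition follows; the softmax gating over per-component scores is an assumed stand-in for the paper's advantage-weighted mixture weights:

```python
import numpy as np

def mixture_policy(component_policies, gate_scores):
    """Compose K expert policies into one policy pi(a|s) = sum_k w_k(s) * pi_k(a|s).

    component_policies: shape (K, S, A), each row-stochastic over actions.
    gate_scores:        shape (S, K), per-state scores turned into weights via softmax.
    """
    scores = np.asarray(gate_scores, dtype=float)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)              # (S, K) gating weights
    return np.einsum("sk,ksa->sa", weights, np.asarray(component_policies, dtype=float))
```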

A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems [article]

Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, Esther Luna Colombini
2022 arXiv   pre-print
pixel observations, sustaining conversations with humans, and controlling robotic agents.  ...  Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field.  ...  Here, we formalize importance sampling for offline RL as a means to evaluate our policy π_θ with samples from our behavior policy π_β.  ... 
arXiv:2203.01387v2 fatcat:euobvze7kre3fi7blalnbbgefm
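The survey's importance-sampling formalization evaluates π_θ from trajectories logged under π_β by reweighting each trajectory with the product of per-step likelihood ratios. A hedged sketch of the plain (non-per-decision, non-weighted) trajectory estimator:

```python
import numpy as np

def trajectory_is_estimate(logprobs_target, logprobs_behavior, rewards, gamma=0.99):
    """Ordinary importance-sampling estimate of the value of pi_theta.

    logprobs_target / logprobs_behavior: lists of per-trajectory arrays of
        log pi_theta(a_t|s_t) and log pi_beta(a_t|s_t).
    rewards: list of per-trajectory reward arrays.
    """
    estimates = []
    for lp_t, lp_b, r in zip(logprobs_target, logprobs_behavior, rewards):
        weight = np.exp(np.sum(lp_t) - np.sum(lp_b))          # prod_t pi_theta / pi_beta
        discounted_return = np.sum((gamma ** np.arange(len(r))) * r)
        estimates.append(weight * discounted_return)
    return float(np.mean(estimates))
```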

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets [article]

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, Sergey Levine
2021 arXiv   pre-print
While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with  ...  Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience.  ...  [41] , with the key difference from our method being that AWR uses TD(λ) on the replay buffer for policy evaluation. Monotonic Advantage Re-Weighted Imitation Learning (MARWIL).  ... 
arXiv:2006.09359v6 fatcat:yvtbbrzrmvfrpeibmwdu2h3j5q

GenDICE: Generalized Offline Estimation of Stationary Values [article]

Ruiyi Zhang, Bo Dai, Lihong Li, Dale Schuurmans
2020 arXiv   pre-print
We prove its consistency under general conditions, provide an error analysis, and demonstrate strong empirical performance on benchmark problems, including off-line PageRank and off-policy policy evaluation  ...  In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being  ...  ACKNOWLEDGMENTS The authors would like to thank Ofir Nachum, the rest of the Google Brain team and the anonymous reviewers for helpful discussions and feedback.  ... 
arXiv:2002.09072v1 fatcat:ad36kutetfg4xflh2wdaxgyi3q
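GenDICE's output is a stationary distribution correction ratio; once such ratios are in hand, the off-policy estimate is a reweighted average of logged rewards. The sketch below shows only that final reweighting step under assumed inputs (the ratio estimation itself, which is the paper's contribution, is not shown), with self-normalization of the weights as an illustrative choice:

```python
import numpy as np

def corrected_average_reward(ratio_estimates, rewards):
    """Average-reward estimate rho(pi) ~= E_data[ w(s, a) * r ].

    ratio_estimates: w(s_i, a_i) ~= d_pi(s_i, a_i) / d_data(s_i, a_i), shape (N,).
    rewards:         rewards observed at those transitions, shape (N,).
    """
    w = np.asarray(ratio_estimates, dtype=float)
    w = w / w.mean()                      # self-normalize so the weights average to 1
    return float(np.mean(w * np.asarray(rewards, dtype=float)))
```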

Batch Policy Learning under Constraints [article]

Hoang M. Le, Cameron Voloshin, Yisong Yue
2019 arXiv   pre-print
When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing  ...  To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds.  ...  More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018. Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights.  ... 
arXiv:1903.08738v1 fatcat:6rydrj3xcjgylmqvb5sbq72rey

Diverse Exploration for Fast and Safe Policy Improvement [article]

Andrew Cohen, Lei Yu, Robert Wright
2018 arXiv   pre-print
We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation.  ...  Our empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance.  ...  Acknowledgements The authors would like to thank Xingye Qiao for providing insights and feedback on theorem proofs and the Watson School of Engineering for computing support.  ... 
arXiv:1802.08331v1 fatcat:wl7nrni235cr5ktujumqdwaaue

Diverse Exploration for Fast and Safe Policy Improvement

Andrew Cohen, Lei Yu, Robert Wright
2018 Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)  
We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation.  ...  Our empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance.  ...  Acknowledgements The authors would like to thank Xingye Qiao for providing insights and feedback on theorem proofs and the Watson School of Engineering for computing support.  ... 
doi:10.1609/aaai.v32i1.11758 fatcat:rab4rdrvajafhnek6x3kvd6nj4
Showing results 1 — 15 out of 20,584 results