Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies
[article]
2019
arXiv
pre-print
We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies. ...
Recent work has shown the key role played by the state or state-action stationary distribution corrections in the infinite horizon context for off-policy policy evaluation. ...
In this paper, we propose a partially policy-agnostic method, EMP (estimated mixture policy) for infinite-horizon off-policy policy evaluation with multiple known or unknown behavior policies. ...
arXiv:1910.04849v1
fatcat:opmzzszinbctbhw6gwo4hmcmaa
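For context on how a state-action stationary distribution correction is used once it has been estimated (the estimation itself is the hard part that EMP addresses), here is a minimal sketch of the resulting off-policy estimate. The dataset layout, the placeholder ratio table, and the self-normalized averaging are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Toy logged data pooled from several behavior policies:
# (state, action, reward) tuples, here just indices and scalar rewards.
rng = np.random.default_rng(0)
states = rng.integers(0, 5, size=1000)
actions = rng.integers(0, 3, size=1000)
rewards = rng.normal(size=1000)

# Assume some estimator (e.g. a DICE-style method) has produced
# w(s, a) ~= d^pi(s, a) / d^D(s, a), the stationary distribution
# correction between the target policy and the pooled data.
w_table = rng.uniform(0.5, 1.5, size=(5, 3))   # placeholder values
w = w_table[states, actions]

# Self-normalized estimate of the target policy's average per-step reward:
# rho(pi) ~= sum_i w_i * r_i / sum_i w_i
rho_hat = np.sum(w * rewards) / np.sum(w)
print(f"estimated average per-step reward: {rho_hat:.4f}")
```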
Behavior Regularized Offline Reinforcement Learning
[article]
2019
arXiv
pre-print
In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. ...
In this work, we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline ...
Acknowledgements We thank Aviral Kumar, Ilya Kostrikov, Yinlam Chow, and others at Google Research for helpful thoughts and discussions. ...
arXiv:1911.11361v1
fatcat:jodhgl4tozayxdgi4owbk3suqm
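BRAC's central idea is to penalize divergence between the learned policy and an (estimated) behavior policy, either in the critic target ("value penalty") or in the actor objective ("policy regularization"). Below is a minimal numerical sketch of the actor-side variant; the diagonal Gaussian policies, the placeholder Q value, and the weight `alpha` are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gaussian_kl(mu_p, sig_p, mu_q, sig_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    return np.sum(np.log(sig_q / sig_p)
                  + (sig_p**2 + (mu_p - mu_q)**2) / (2.0 * sig_q**2) - 0.5)

# Learned policy and (estimated) behavior policy at one state.
mu_pi, sig_pi = np.array([0.2, -0.1]), np.array([0.3, 0.3])
mu_b,  sig_b  = np.array([0.0,  0.0]), np.array([0.5, 0.5])

q_value = 1.7   # critic's value for the policy's action (placeholder)
alpha = 0.1     # regularization weight, tuned in practice

# Behavior-regularized actor objective: maximize Q minus a divergence penalty.
actor_loss = -q_value + alpha * gaussian_kl(mu_pi, sig_pi, mu_b, sig_b)
print(f"actor loss: {actor_loss:.4f}")
```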
Cooperative and Competitive Reinforcement and Imitation Learning for a Mixture of Heterogeneous Learning Modules
2018
Frontiers in Neurorobotics
This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL) for selecting an appropriate policy from a set of multiple heterogeneous modules and training all of them in ...
Each learning module has its own network architecture and improves the policy based on an off-policy reinforcement learning algorithm and behavior cloning from samples collected by a behavior policy that ...
In addition, Mix & Match uses a mixture of policies and optimizes the mixing weights by a kind of evolutionary computation. Since Mix & Match needs multiple simulators, it is sample-inefficient. ...
doi:10.3389/fnbot.2018.00061
pmid:30319389
pmcid:PMC6170616
fatcat:2vbrrtjg7rgibng3crtj2p42iy
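As a rough illustration of the selection step described above (choosing which module acts as the behavior policy based on how well each module is currently performing), here is one simple performance-weighted rule. The softmax choice and the temperature are assumptions made for illustration only, not necessarily CRAIL's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Recent average returns of each heterogeneous learning module.
module_returns = np.array([12.0, 15.5, 9.8, 14.1])
temperature = 2.0

# Softmax over performance: better modules are chosen more often as the
# behavior policy, while weaker ones still receive some data to learn from.
logits = module_returns / temperature
probs = np.exp(logits - logits.max())
probs /= probs.sum()

behavior_module = rng.choice(len(module_returns), p=probs)
print(f"selection probabilities: {np.round(probs, 3)}")
print(f"module acting this episode: {behavior_module}")
```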
Offline RL Without Off-Policy Evaluation
[article]
2021
arXiv
pre-print
Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. ...
We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against ...
Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. ...
arXiv:2106.08909v3
fatcat:3ipj5t6vhvagrpk4kudflekq6i
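The "one-step" alternative argued for above amounts to evaluating the behavior policy once and then performing a single step of policy improvement, instead of iterating evaluation and improvement. A tabular toy sketch follows; the dataset, learning rate, and greedy improvement step are illustrative, and the constraints the paper uses to keep the improved policy close to the data are omitted.

```python
import numpy as np

# Toy tabular dataset of (s, a, r, s', a') transitions logged by a behavior policy.
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
data = [(rng.integers(n_states), rng.integers(n_actions),
         rng.normal(), rng.integers(n_states), rng.integers(n_actions))
        for _ in range(5000)]

# Step 1: evaluate the behavior policy with SARSA-style updates (no max operator,
# so this estimates Q^beta rather than the optimal Q).
Q = np.zeros((n_states, n_actions))
for _ in range(20):
    for s, a, r, s2, a2 in data:
        Q[s, a] += 0.05 * (r + gamma * Q[s2, a2] - Q[s, a])

# Step 2: a single step of policy improvement against the fixed Q^beta.
pi = Q.argmax(axis=1)
print("one-step improved (greedy) policy:", pi)
```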
On Multi-objective Policy Optimization as a Tool for Reinforcement Learning
[article]
2021
arXiv
pre-print
Often, task reward and auxiliary objectives are in conflict with each other and it is therefore natural to treat these examples as instances of multi-objective (MO) optimization problems. ...
For offline RL, we use the MO perspective to derive a simple algorithm that optimizes for the standard RL objective plus a behavioral cloning term. ...
arXiv:2106.08199v1
fatcat:uqpsvp7u4ranvk2psymtf3ugse
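The "standard RL objective plus a behavioral cloning term" mentioned above is, schematically, a weighted sum of a critic-based term and a log-likelihood term on dataset actions. The sketch below only evaluates such a combined actor loss on toy numbers; the Gaussian policy, the placeholder Q value, and the weight `bc_weight` are assumptions for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log density of a diagonal Gaussian, summed over action dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

# One minibatch element: the dataset action and the current policy at that state.
dataset_action = np.array([0.4, -0.2])
policy_mu, policy_sigma = np.array([0.1, 0.0]), np.array([0.3, 0.3])

q_value = 2.3     # critic estimate for the policy's own action (placeholder)
bc_weight = 1.0   # trade-off between the RL term and the cloning term

# Combined objective: maximize Q while keeping dataset actions likely under the policy.
actor_loss = -q_value - bc_weight * gaussian_logpdf(dataset_action, policy_mu, policy_sigma)
print(f"actor loss: {actor_loss:.4f}")
```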
Top-K Off-Policy Correction for a REINFORCE Recommender System
[article]
2021
arXiv
pre-print
The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the order of millions; (2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing ...
ACKNOWLEDGEMENTS We thank Craig Boutilier for his valuable comments and discussions. ...
arXiv:1812.02353v3
fatcat:uwdjn66aercizjnek63rjfwwiu
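The top-K correction referenced in contribution (3) multiplies the usual importance-weighted REINFORCE term by an extra factor that accounts for recommending a set of K items instead of one; for a softmax policy this factor commonly takes the form K(1 - pi_theta(a|s))^(K-1). The per-example computation below uses made-up probabilities and a placeholder gradient vector purely for illustration.

```python
import numpy as np

K = 16                    # number of items recommended at once
pi_theta = 0.02           # target policy's probability of the logged item
beta = 0.05               # behavior policy's probability of the same item
reward = 1.0              # observed feedback (e.g. a click)
grad_logp = np.array([0.3, -0.7])   # d log pi_theta(a|s) / d params (placeholder)

# Standard single-item off-policy correction ...
importance_weight = pi_theta / beta
# ... and the extra top-K factor that rescales the gradient when the item
# is one of K recommendations rather than the only one.
lambda_K = K * (1.0 - pi_theta) ** (K - 1)

corrected_grad = importance_weight * lambda_K * reward * grad_logp
print("corrected REINFORCE gradient term:", corrected_grad)
```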
Data-efficient Hindsight Off-policy Option Learning
[article]
2021
arXiv
pre-print
We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. ...
To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. ...
We would additionally like to acknowledge the DeepMind robotics lab for infrastructure and engineering support. ...
arXiv:2007.15588v2
fatcat:ilf6pndw5zhnphh6drdgk2kc4q
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
[article]
2019
arXiv
pre-print
Our proposed approach, advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. ...
AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. ...
ACKNOWLEDGEMENTS We thank Abhishek Gupta and Aurick Zhou for insightful discussions. This research was supported by an NSERC Postgraduate Scholarship, a Berkeley Fellowship for Graduate Study, Berkeley ...
arXiv:1910.00177v3
fatcat:tbrlxwnen5c5viroqsmopoqvcq
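Concretely, the two supervised steps amount to (i) regressing a value function onto return targets and (ii) regressing the policy onto dataset actions with exponentiated-advantage weights. The sketch below shows the weighting for one minibatch; the temperature `beta`, the weight clip, and the placeholder log-probabilities are assumptions for illustration rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(3)

# Return targets (e.g. Monte Carlo or TD(lambda)) and current value estimates.
returns = rng.normal(loc=1.0, size=8)
values = rng.normal(loc=0.8, size=8)

# Step 1 (not shown): fit V(s) to `returns` by plain regression.
# Step 2: weight each logged action by exp(advantage / beta), then regress the
# policy onto those actions with a weighted maximum-likelihood loss.
beta = 0.5
advantages = returns - values
weights = np.exp(advantages / beta)
weights = np.minimum(weights, 20.0)   # weight clipping, common in practice

log_probs = rng.normal(loc=-1.0, size=8)   # log pi(a_i | s_i), placeholder values
policy_loss = -np.mean(weights * log_probs)
print(f"advantage-weighted policy regression loss: {policy_loss:.4f}")
```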
Off-policy Maximum Entropy Reinforcement Learning: Soft Actor-Critic with Advantage Weighted Mixture Policy (SAC-AWMP)
[article]
2020
arXiv
pre-print
The optimal policy of a reinforcement learning problem is often discontinuous and non-smooth; that is, for two states with similar representations, their optimal policies can be significantly different. ...
A common way to solve this problem, known as Mixture-of-Experts, is to represent the policy as the weighted sum of multiple components, where different components perform well on different parts of the ...
Acknowledgments This work was supported by the Agency for Science, Technology and Research, Singapore, under the National Robotics Program, with A*STAR SERC Grant No. 192 25 00054. ...
arXiv:2002.02829v1
fatcat:rahjd3ta45guvglcgvtiynvrdu
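The mixture idea described above, a policy expressed as a state-dependent weighted sum of several component policies, looks roughly like this for diagonal Gaussian components. The advantage-based weighting is only sketched: the snippet does not spell out SAC-AWMP's exact rule, so the softmax weights and component parameters here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Three Gaussian component policies at one state (means and standard deviations).
mus = np.array([[-1.0], [0.0], [1.2]])
sigmas = np.array([[0.2], [0.3], [0.2]])

# State-dependent mixture weights, e.g. derived from per-component advantage
# estimates (placeholder values, combined with a softmax here).
advantages = np.array([0.5, 1.4, 0.2])
w = np.exp(advantages - advantages.max())
w /= w.sum()

# Sampling from the mixture: pick a component, then sample its Gaussian.
k = rng.choice(len(w), p=w)
action = rng.normal(mus[k], sigmas[k])
print(f"mixture weights: {np.round(w, 3)}, sampled action: {action}")
```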
A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems
[article]
2022
arXiv
pre-print
... pixel observations, sustaining conversations with humans, and controlling robotic agents. ...
Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field. ...
Here, we formalize importance sampling for offline RL as a means to evaluate our policy π_θ with samples from our behavior policy π_β. ...
arXiv:2203.01387v2
fatcat:euobvze7kre3fi7blalnbbgefm
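The importance-sampling formalization mentioned in the last excerpt weights each logged trajectory by the ratio of its probability under π_θ to its probability under π_β. A minimal per-trajectory estimator is sketched below (self-normalized and per-decision variance-reduction variants omitted); all the probabilities and rewards are placeholder numbers.

```python
import numpy as np

gamma = 0.99

# One logged trajectory: per-step action probabilities under the target policy
# pi_theta and the behavior policy pi_beta, plus the observed rewards.
pi_theta = np.array([0.30, 0.55, 0.20, 0.60])
pi_beta  = np.array([0.40, 0.50, 0.35, 0.45])
rewards  = np.array([0.0, 1.0, 0.0, 2.0])

# Trajectory importance weight: product of per-step probability ratios.
weight = np.prod(pi_theta / pi_beta)

# Importance-sampled estimate of the discounted return under pi_theta.
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)
v_hat = weight * discounted_return
print(f"IS weight: {weight:.3f}, estimated return: {v_hat:.3f}")
```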
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
[article]
2021
arXiv
pre-print
Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. ...
While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with ...
[41], with the key difference from our method being that AWR uses TD(λ) on the replay buffer for policy evaluation. Monotonic Advantage Re-Weighted Imitation Learning (MARWIL). ...
arXiv:2006.09359v6
fatcat:yvtbbrzrmvfrpeibmwdu2h3j5q
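AWAC's actor update, per the distinction noted in the excerpt above, weights the likelihood of dataset actions by exponentiated advantages computed from a learned Q-function rather than TD(λ) returns. The toy version of that weighting below uses placeholder critic outputs and log-probabilities, with `lam` as an assumed temperature.

```python
import numpy as np

rng = np.random.default_rng(5)

# Critic estimates for the logged actions and for the states they came from.
q_dataset_actions = rng.normal(loc=1.0, size=8)   # Q(s_i, a_i) for dataset pairs
v_estimates = rng.normal(loc=0.7, size=8)         # V(s_i) ~= E_{a~pi} Q(s_i, a)

lam = 1.0
advantages = q_dataset_actions - v_estimates
weights = np.exp(advantages / lam)

# Advantage-weighted maximum likelihood on dataset actions
# (log pi(a_i | s_i) values are placeholders here).
log_probs = rng.normal(loc=-1.2, size=8)
actor_loss = -np.mean(weights * log_probs)
print(f"AWAC-style actor loss on this batch: {actor_loss:.4f}")
```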
GenDICE: Generalized Offline Estimation of Stationary Values
[article]
2020
arXiv
pre-print
In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being ...
We prove its consistency under general conditions, provide an error analysis, and demonstrate strong empirical performance on benchmark problems, including off-line PageRank and off-policy policy evaluation ...
ACKNOWLEDGMENTS The authors would like to thank Ofir Nachum, the rest of the Google Brain team and the anonymous reviewers for helpful discussions and feedback. ...
arXiv:2002.09072v1
fatcat:ad36kutetfg4xflh2wdaxgyi3q
Batch Policy Learning under Constraints
[article]
2019
arXiv
pre-print
When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing ...
To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. ...
arXiv:1903.08738v1
fatcat:6rydrj3xcjgylmqvb5sbq72rey
Diverse Exploration for Fast and Safe Policy Improvement
[article]
2018
arXiv
pre-print
We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation. ...
Our empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance. ...
Acknowledgements The authors would like to thank Xingye Qiao for providing insights and feedback on theorem proofs and the Watson School of Engineering for computing support. ...
arXiv:1802.08331v1
fatcat:wl7nrni235cr5ktujumqdwaaue
Diverse Exploration for Fast and Safe Policy Improvement
2018
PROCEEDINGS OF THE THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE
We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation. ...
Our empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance. ...
Acknowledgements The authors would like to thank Xingye Qiao for providing insights and feedback on theorem proofs and the Watson School of Engineering for computing support. ...
doi:10.1609/aaai.v32i1.11758
fatcat:rab4rdrvajafhnek6x3kvd6nj4
Showing results 1 — 15 out of 20,584 results