
Conditional Importance Sampling for Off-Policy Learning [article]

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney
2020 arXiv   pre-print
The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios.  ...  This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms.  ...  Acknowledgements We thank Adam White for detailed feedback on an earlier version of this paper, and the anonymous reviewers for helpful comments during the review process.  ... 
arXiv:1910.07479v2 fatcat:zic3vjldffgq7bevyzflwd4y2q

sj-pdf-1-adb-10.1177_1059712321999421 – Supplemental material for Affordance as general value function: a computational model

Daniel Graves, Johannes Günther, Jun Luo
2021 Figshare  
Supplemental material, sj-pdf-1-adb-10.1177_1059712321999421 for Affordance as general value function: a computational model by Daniel Graves, Johannes Günther and Jun Luo in Adaptive Behavior  ...  Temporal difference learning is better suited for off-policy corrections since the importance sampling ratio is only applied for the immediate transition and not the entire sequence.  ...  In addition, learning GVFs off-policy where τ is different from the policy used to collect the data is very challenging with supervised learning since the importance sampling ratios at each time step are  ... 
doi:10.25384/sage.14251552.v1 fatcat:wyunoscfqjcnbiak5rsexbojfq

C-Learning: Learning to Achieve Goals via Recursive Classification [article]

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
2021 arXiv   pre-print
While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work.  ...  Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience.  ...  ACKNOWLEDGEMENTS We thank Dibya Ghosh and Vitchyr Pong for discussions about this work, and thank Vincent Vanhoucke, Ofir Nachum, and anonymous reviewers for providing feedback on early versions of this  ... 
arXiv:2011.08909v2 fatcat:c4gd4qhb6jc7jekk3jvktbxj5a

Scaling life-long off-policy learning

Adam White, Joseph Modayil, Richard S. Sutton
2012 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL)  
Finally, we use the more efficient of the two estimators to demonstrate off-policy learning at scale: the learning of value functions for one thousand policies in real time on a physical robot.  ...  This ability constitutes a significant step towards scaling life-long off-policy learning.  ...  The importance sampling ratio, π^{(i)}(φ,a)/b(φ,a), can be used to account for these effects.  ... 
doi:10.1109/devlrn.2012.6400860 dblp:conf/icdl-epirob/WhiteMS12 fatcat:y5zsoyqwdrcvdamf753252squm
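Several of the snippets above rely on the per-step importance sampling ratio ρ_t = π(a_t|s_t)/b(a_t|s_t). A minimal sketch of how such ratios are computed from target- and behavior-policy action probabilities (the function name `is_ratio` and the array layout are illustrative assumptions, not from any of the papers listed):

```python
import numpy as np

def is_ratio(target_probs, behavior_probs, actions):
    """Per-step importance sampling ratios rho_t = pi(a_t|s_t) / b(a_t|s_t).

    target_probs, behavior_probs: (T, num_actions) arrays of action
    probabilities under the target policy pi and behavior policy b.
    actions: (T,) array of the actions actually taken by b.
    """
    idx = np.arange(len(actions))
    pi = target_probs[idx, actions]   # pi(a_t|s_t)
    b = behavior_probs[idx, actions]  # b(a_t|s_t)
    return pi / b
```

Products of these per-step ratios over a trajectory give the (high-variance) trajectory-wise correction; applying them only per-transition, as in TD-style methods, keeps variance bounded.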

Scaling Life-long Off-policy Learning [article]

Adam White, Joseph Modayil, Richard S. Sutton
2012 arXiv   pre-print
Finally, we use the more efficient of the two estimators to demonstrate off-policy learning at scale: the learning of value functions for one thousand policies in real time on a physical robot.  ...  This ability constitutes a significant step towards scaling life-long off-policy learning.  ...  The importance sampling ratio, π^{(i)}(φ,a)/b(φ,a), can be used to account for these effects.  ... 
arXiv:1206.6262v1 fatcat:tduqvmqhy5df7ooj24hb2lrggm

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables [article]

Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine
2019 arXiv   pre-print
This probabilistic interpretation enables posterior sampling for structured and efficient exploration.  ...  Current methods rely heavily on on-policy experience, limiting their sample efficiency.  ...  We thank Ignasi Clavera, Abhishek Gupta, and Rowan McAllister for insightful discussions, and Coline Devin, Kelvin Xu, Vitchyr Pong, and Karol Hausman for feedback on early drafts.  ... 
arXiv:1903.08254v1 fatcat:xkvl64r7cfdlvkcaxywczmytha

A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems [article]

Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, Esther Luna Colombini
2022 arXiv   pre-print
Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets.  ...  With the widespread adoption of deep learning, reinforcement learning (RL) has experienced a dramatic increase in popularity, scaling to previously intractable problems, such as playing complex games from  ...  Importance Sampling. Importance sampling is commonly used in RL to compute off-policy policy gradients.  ... 
arXiv:2203.01387v2 fatcat:euobvze7kre3fi7blalnbbgefm
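The survey snippet above notes that importance sampling is commonly used to compute off-policy policy gradients, i.e. gradients of the form E_b[ρ · ∇log π(a|s) · G]. A minimal sketch for a softmax target policy over logits (the function name `offpolicy_pg` and the use of full returns G as the weighting signal are illustrative assumptions):

```python
import numpy as np

def offpolicy_pg(logits, behavior_probs, actions, returns):
    """IS-corrected policy gradient for a softmax target policy:
    g = mean over samples of rho * grad(log pi(a|s)) * G,
    with rho = pi(a|s) / b(a|s) and data drawn from behavior policy b.
    """
    # Softmax target policy pi from logits (numerically stabilized).
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    n, k = pi.shape
    onehot = np.eye(k)[actions]
    rho = pi[np.arange(n), actions] / behavior_probs[np.arange(n), actions]
    # For a softmax, grad of log pi(a|s) w.r.t. the logits is (onehot - pi).
    grads = (rho * returns)[:, None] * (onehot - pi)
    return grads.mean(axis=0)
```

When the behavior and target policies coincide, ρ = 1 and this reduces to the ordinary on-policy REINFORCE gradient.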

Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning [article]

Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, Shimon Whiteson
2018 arXiv   pre-print
This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint  ...  that disambiguates the age of the data sampled from the replay memory.  ...  Cloud computing GPU resources were provided through a Microsoft Azure for Research award. We thank Nando de Freitas, Yannis Assael, and Brendan Shillingford for the helpful comments and discussion.  ... 
arXiv:1702.08887v3 fatcat:6cqtkoxb7rfq7pjg2zcnasiyde

Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning [article]

Archit Sharma, Michael Ahn, Sergey Levine, Vikash Kumar, Karol Hausman, Shixiang Gu
2020 arXiv   pre-print
In this paper, we demonstrate that a recently proposed unsupervised skill discovery algorithm can be extended into an efficient off-policy method, making it suitable for performing unsupervised reinforcement  ...  Reinforcement learning provides a general framework for learning robotic skills while minimizing engineering effort.  ...  To re-use off-policy data for learning q_φ, we have to consider importance sampling corrections, as the data has been sampled from a different distribution.  ... 
arXiv:2004.12974v1 fatcat:xsroacyjfvevbjjfyaztff7tfu

Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning [article]

Mahammad Humayoo, Xueqi Cheng
2019 arXiv   pre-print
One reason for the instability of off-policy learning is a discrepancy between the target (π) and behavior (b) policy distributions.  ...  Off-policy learning is more unstable than on-policy learning in reinforcement learning (RL).  ...  Acknowledgements We would like to thank the editors and referees for their valuable suggestions and comments.  ... 
arXiv:1810.12558v6 fatcat:p5sqt3uckfhytbb4bej65evvzq

Per-decision Multi-step Temporal Difference Learning with Control Variates [article]

Kristopher De Asis, Richard S. Sutton
2018 arXiv   pre-print
Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number  ...  Our results show that including the control variates can greatly improve performance on both on and off-policy multi-step temporal difference learning tasks.  ...  Acknowledgements The authors thank Yi Wan for insights and discussions contributing to the results presented in this paper, and the entire Reinforcement Learning and Artificial Intelligence research group  ... 
arXiv:1807.01830v1 fatcat:7ojrfl74ljhnnjsqovkn34vye4
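The entry above concerns per-decision multi-step TD returns with control variates, where each step's IS correction is combined with the state value as a baseline. A minimal backward-recursion sketch of such a return, G_t = ρ_t(R_{t+1} + γG_{t+1}) + (1 − ρ_t)V(S_t) (the function name `per_decision_return` is an illustrative assumption, not the paper's code):

```python
def per_decision_return(rewards, values, rhos, gamma, bootstrap):
    """Off-policy n-step return with per-decision IS ratios and control
    variates, computed by backward recursion:
        G_t = rho_t * (R_{t+1} + gamma * G_{t+1}) + (1 - rho_t) * V(S_t)

    rewards: R_{t+1}..R_{t+n}; values: V(S_t)..V(S_{t+n-1});
    rhos: per-step IS ratios; bootstrap: V(S_{t+n}).
    """
    g = bootstrap
    for r, v, rho in zip(reversed(rewards), reversed(values), reversed(rhos)):
        g = rho * (r + gamma * g) + (1.0 - rho) * v
    return g
```

With all ρ_t = 1 this reduces to the ordinary on-policy n-step return; when ρ_t = 0 the control variate falls back to the current value estimate V(S_t) instead of contributing zero, which is what reduces update variance.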

Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting [article]

Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, Stefano Ermon
2019 arXiv   pre-print
Finally, we demonstrate its utility on representative applications in a) data augmentation for classification using generative adversarial networks, and b) model-based policy evaluation using off-policy  ...  We employ this likelihood-free importance weighting method to correct for the bias in generative models.  ...  We are thankful to Daniel Levy, Rui Shu, Yang Song, and members of the Reinforcement Learning, Deep Learning, and Adaptive Systems and Interaction groups at Microsoft Research for helpful discussions and  ... 
arXiv:1906.09531v2 fatcat:d3x5xmsc6rhrheduyi67wjqijy

One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL [article]

Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W. Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, Nando de Freitas
2018 arXiv   pre-print
MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL.  ...  MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators.  ...  An important characteristic of D4PG is that it maintains a replay memory M (possibly prioritized) that stores SARS tuples, which allows for off-policy learning.  ... 
arXiv:1810.05017v1 fatcat:dgihpbshovbidopkze2rwdr7jq

SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

Chao Wen, Xinghu Yao, Yuhui Wang, Xiaoyang Tan
Interestingly, it is revealed that there exists an inherent connection between SMIX(λ) and the previous off-policy Q(λ) approach for single-agent learning.  ...  This work presents a sample-efficient and effective value-based method, named SMIX(λ), for multi-agent reinforcement learning (MARL) within the paradigm of centralized training with decentralized  ...  Off-Policy Learning without Importance Sampling. One way to alleviate the curse-of-dimensionality issue of the joint action space and to improve exploration is off-policy learning.  ... 
doi:10.1609/aaai.v34i05.6223 fatcat:v2sdwy7faracjnk4aedanlnlpa

SOPE: Spectrum of Off-Policy Estimators [article]

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, Scott Niekum
2021 arXiv   pre-print
Many sequential decision making problems are high-stakes and require off-policy evaluation (OPE) of a new policy using historical data collected using some other policy.  ...  One of the most common OPE techniques that provides unbiased estimates is trajectory based importance sampling (IS).  ...  Conditional importance sampling for off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 45–55. PMLR, 2020.  ... 
arXiv:2111.03936v3 fatcat:2xnufhiq6jgf7pgmvpzb5ruhl4
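The SOPE entry describes trajectory-based importance sampling, the standard unbiased OPE estimator that weights each trajectory's return by the product of its per-step ratios. A minimal sketch (the function name `trajectory_is_estimate` and the (ρ, r) pair encoding are illustrative assumptions):

```python
import numpy as np

def trajectory_is_estimate(trajectories, gamma):
    """Ordinary trajectory-wise IS estimate of the target policy's value:
        V_hat = mean over trajectories of (prod_t rho_t) * (sum_t gamma^t r_t)

    Each trajectory is a list of (rho_t, r_t) pairs, where rho_t is the
    per-step IS ratio and r_t the reward at step t.
    """
    vals = []
    for traj in trajectories:
        weight = np.prod([rho for rho, _ in traj])          # prod_t rho_t
        ret = sum(gamma**t * r for t, (_, r) in enumerate(traj))
        vals.append(weight * ret)
    return float(np.mean(vals))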