
Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [article]

Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli
2021 arXiv   pre-print
We show that Q-Rex efficiently finds the optimal policy for linear MDPs (or, more generally, for MDPs with zero inherent Bellman error under linear approximation (ZIBEL)) and provide non-asymptotic bounds  ...  (online target learning, or OTL), and (ii) experience replay (ER) (Mnih et al., 2015).  ...  introduction, we incorporate RER and OTL into Q-learning and introduce the algorithms Q-Rex (Online Target Q-learning with reverse experience replay, Algorithm 1), its sample-efficient variant Q-RexDaRe  ... 
arXiv:2110.08440v2 fatcat:6jlqjmydtbfcdclybj7q76siha
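The RER component can be illustrated with a minimal tabular sketch: replaying an episode's transitions from last to first lets a terminal reward propagate back to the start in a single pass. The action set, step size, and dict-based table below are illustrative assumptions, not the paper's Q-Rex (which additionally uses an online target network):

```python
def replay_reverse(Q, episode, actions=(0, 1), alpha=0.5, gamma=0.9):
    """Replay one episode's transitions from last to first (tabular RER).

    Q: dict mapping (state, action) -> value; episode: list of
    (s, a, r, s_next, done) tuples in the order they occurred.
    """
    for s, a, r, s_next, done in reversed(episode):
        target = r
        if not done:
            target += gamma * max(Q.get((s_next, b), 0.0) for b in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (target - old)
    return Q
```

With forward replay the first transition would see an all-zero table; in reverse order the terminal reward already informs every earlier update.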

Reconciling λ-Returns with Experience Replay [article]

Brett Daley, Christopher Amato
2020 arXiv   pre-print
In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of λ-returns.  ...  By promoting short sequences of past transitions into a small cache within the replay memory, adjacent λ-returns can be efficiently precomputed by sharing Q-values.  ...  Acknowledgments We would like to thank the anonymous reviewers for their valuable feedback. We also gratefully acknowledge NVIDIA Corporation for its GPU donation.  ... 
arXiv:1810.09967v3 fatcat:k5ngdmtpbzeaznr4aftai7xuj4
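The efficiency gain from caching comes from the backward recursion G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λ G_{t+1}], which shares each bootstrap value between adjacent returns. A generic sketch of that recursion (not the paper's cache implementation; the list-based interface is an assumption):

```python
def lambda_returns(rewards, next_values, dones, gamma=0.99, lam=0.9):
    """Compute lambda-returns for a cached sequence in one backward pass.

    rewards[t], next_values[t] (bootstrap value of s_{t+1}), and dones[t]
    describe transition t; the return collapses to the plain reward at
    episode ends and to a one-step target at the sequence boundary.
    """
    T = len(rewards)
    G = [0.0] * T
    for t in reversed(range(T)):
        if dones[t]:
            G[t] = rewards[t]
        elif t == T - 1:
            G[t] = rewards[t] + gamma * next_values[t]
        else:
            G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G[t + 1])
    return G
```

Setting lam=0 recovers one-step TD targets; lam=1 recovers Monte Carlo returns, matching the usual bias–variance trade-off.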

Prioritized memory access explains planning and hippocampal replay [article]

Marcelo Gomes Mattar, Nathaniel D Daw
2017 bioRxiv   pre-print
We show that this theory offers a unifying account of a range of hitherto disconnected findings in the place cell literature, such as the balance of forward and reverse replay, biases in the replayed content, and effects of experience.  ...  Acknowledgements We thank Máté Lengyel, Daphna Shohamy, and Daniel Acosta-Kane for many helpful discussions, and Dylan Rich for his comments on an earlier draft of the manuscript.  ... 
doi:10.1101/225664 fatcat:gzqplo6o4bf6jcu3wkn5eymmgi

Prioritized memory access explains planning and hippocampal replay

Marcelo G. Mattar, Nathaniel D. Daw
2018 Nature Neuroscience  
Our theory offers a simple explanation for numerous findings about place cells; unifies seemingly disparate proposed functions of replay including planning, learning, and consolidation; and posits a mechanism  ...  We propose a normative theory predicting which memories should be accessed at each moment to optimize future decisions.  ...  Acknowledgements We thank Máté Lengyel, Daphna Shohamy, and Daniel Acosta-Kane for many helpful discussions, and Dylan Rich for his comments on an earlier draft of the manuscript.  ... 
doi:10.1038/s41593-018-0232-z pmid:30349103 pmcid:PMC6203620 fatcat:yfxdyrvy6bgnvgb6ksm5gncsly

Deep Reinforcement Learning for Autonomous Search and Rescue

Juan Gonzalo Carcamo Zuluaga, Jonathan P. Leidig, Christian Trefftz, Greg Wolffe
2018 NAECON 2018 - IEEE National Aerospace and Electronics Conference  
However, these systems were developed before advances such as Google DeepMind's breakthrough with the Deep Q-Network (DQN) technology.  ...  The main approach investigated in this research is the Deep Q-Network.  ...  The core problem of MDPs is to find an optimal policy π*: the π(s) that maximizes the cumulative reward while walking the MDP.  ... 
doi:10.1109/naecon.2018.8556642 fatcat:doraqoalezddte35idjns4dzra

Deep Inverse Q-learning with Constraints [article]

Gabriel Kalweit, Maria Huegle, Moritz Werling, Joschka Boedecker
2020 arXiv   pre-print
Popular Maximum Entropy Inverse Reinforcement Learning approaches require the computation of expected state visitation frequencies for the optimal policy under an estimate of the reward function.  ...  This is possible through a formulation that exploits a probabilistic behavior assumption for the demonstrations within the structure of Q-learning.  ...  Going through the MDP once in reverse topological order based on its model M, we can compute Q * for the succeeding states and actions, leading to a reward function for which the induced optimal action-value  ... 
arXiv:2008.01712v1 fatcat:jta6rr7a5ramrdosqtvszfa5du
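The "one pass in reverse topological order" idea can be sketched for a deterministic acyclic MDP: visiting states successors-first means every Q* a state needs is already computed. The dict representation and names below are illustrative assumptions; the paper's inverse Q-learning additionally recovers the reward function from demonstrations:

```python
def q_star_reverse_topological(mdp, order, gamma=0.99):
    """Exact Q* via one backward pass over an acyclic deterministic MDP.

    mdp: state -> {action: (reward, next_state or None)}.
    order: states in topological order (successors last); walking it in
    reverse guarantees each successor's value is known when needed.
    """
    Q, V = {}, {}
    for s in reversed(order):
        for a, (r, s_next) in mdp[s].items():
            Q[(s, a)] = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] = max(Q[(s, a)] for a in mdp[s])
    return Q
```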

UAV Autonomous Aerial Combat Maneuver Strategy Generation with Observation Error Based on State-Adversarial Deep Deterministic Policy Gradient and Inverse Reinforcement Learning

Weiren Kong, Deyun Zhou, Zhen Yang, Yiyang Zhao, Kai Zhang
2020 Electronics  
Finally, the efficiency of the aerial combat strategy generation algorithm and the performance and robustness of the resulting aerial combat strategy are verified by simulation experiments.  ...  At the same time, a reward shaping method based on the maximum entropy (MaxEnt) inverse reinforcement learning (IRL) algorithm is proposed to improve the aerial combat strategy generation algorithm's efficiency  ...  target networks Q′ and π′ with weights θ^Q′ ← θ^Q, θ^π′ ← θ^π; 3: Initialize replay buffer R; 4: for episode = 1 to M do; 5: Initialize a random process N for action exploration; 6: Receive initial observation state  ... 
doi:10.3390/electronics9071121 fatcat:cbu5qoteuzdovkyhhyyo2q6db4
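The replay buffer and slowly-updated target networks that DDPG-style methods rely on can be sketched minimally. Plain-list parameters, the capacity, and the 0.005 rate are illustrative assumptions, not the paper's SA-DDPG configuration:

```python
import random
from collections import deque

def soft_update(target, source, tau=0.005):
    """Polyak averaging for target networks:
    theta_target <- tau * theta + (1 - tau) * theta_target.
    Parameters are plain lists of floats for illustration."""
    return [tau * s + (1 - tau) * t for s, t in zip(source, target)]

class ReplayBuffer:
    """Uniform replay memory R from which minibatches are sampled."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):          # (s, a, r, s_next, done)
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)
```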

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey [article]

Sanmit Narvekar and Bei Peng and Matteo Leonetti and Jivko Sinapov and Matthew E. Taylor and Peter Stone
2020 arXiv   pre-print
Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.  ...  To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task.  ...  Part of this work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG re-  ... 
arXiv:2003.04960v2 fatcat:iacmqeb7jjeezpo27jsnzuqb7u

Batch Reinforcement Learning [chapter]

Sascha Lange, Thomas Gabel, Martin Riedmiller
2012 Adaptation, Learning, and Optimization  
online case, where the agent interacts with the environment while learning.  ...  Due to the efficient use of collected data and the stability of the learning process, this research area has attracted a lot of attention recently.  ...  For example, the growing batch approach could be classified as an online method (it interacts with the system like an online method and incrementally improves its policy as new experience becomes available) as  ... 
doi:10.1007/978-3-642-27645-3_2 fatcat:2iifzorcb5cwrfka73joocam5e
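The growing-batch pattern the excerpt describes — interact like an online method, but refit on the entire accumulated batch — can be sketched as a loop skeleton. The callable interface is an assumption for illustration, not the chapter's formalism:

```python
def growing_batch_loop(collect, fit_q, policy_from_q, n_rounds=5, n_steps=100):
    """Growing-batch skeleton: each round, collect fresh experience with
    the current policy, append it to the batch, and batch-fit the value
    function on *all* data gathered so far.

    collect(policy, n) -> list of transitions; fit_q(data) -> fitted Q;
    policy_from_q(q) -> greedy/exploratory policy (None for the first round).
    """
    data, q = [], None
    for _ in range(n_rounds):
        policy = policy_from_q(q)
        data.extend(collect(policy, n_steps))   # grow the batch
        q = fit_q(data)                         # refit on everything
    return q, data
```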

Reversely Discovering and Modifying Properties Based on Active Deep Q-Learning

Yu Lei, Huo Zhifa
2020 IEEE Access  
The TD loss is: $L(\theta) = \left( r + \gamma \, Q\big(s', \arg\max_{a' \in \mathcal{A}} Q(s', a'; \theta); \bar{\theta}\big) - Q(s, a; \theta) \right)^2$ (2). 2) PRIORITIZED EXPERIENCE REPLAY: Prioritized experience replay samples more frequently those transitions  ...  Double DQN [32], dueling architecture [33], and prioritized experience replay are able to enhance learning efficiency and stability, allowing agents to effectively use their experiences.  ... 
doi:10.1109/access.2020.3019278 fatcat:vbcdj2zsnzaxpcbu3uae2awz4e
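Proportional prioritized sampling draws transitions with probability proportional to a power of their priority, typically |TD error| plus a small constant. The exponent 0.6 follows a common default but is an assumption here, as is the list-based interface:

```python
import random

def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Draw indices with P(i) = p_i^alpha / sum_j p_j^alpha, the
    proportional variant of prioritized experience replay.
    priorities: per-transition values, e.g. |TD error| + eps."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```

In a full implementation a sum-tree replaces this linear scan, and importance-sampling weights correct the induced bias.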

Self-Consistent Models and Values [article]

Gregory Farquhar, Kate Baumli, Zita Marinho, Angelos Filos, Matteo Hessel, Hado van Hasselt, David Silver
2021 arXiv   pre-print
Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about the environment.  ...  We propose multiple self-consistency updates, evaluate these in both tabular and function approximation settings, and find that, with appropriate choices, self-consistency helps both policy evaluation  ...  Thanks also to the developers of Jax [9] and the DeepMind Jax ecosystem [3] which were invaluable to this project. The authors received no specific funding for this work.  ... 
arXiv:2110.12840v1 fatcat:5ott7uqvavhodldt6nimv2ussu

DDPG-Based Energy-Efficient Flow Scheduling Algorithm in Software-Defined Data Centers

Zan Yao, Ying Wang, Luoming Meng, Xuesong Qiu, Peng Yu, Yan Huang
2021 Wireless Communications and Mobile Computing  
To cope with a large solution space, we design a DDPG-EEFS algorithm to find the optimal scheduling scheme for flows.  ...  The flow scheduling optimization problem can be modeled as a Markov decision process (MDP).  ...  Acknowledgments This work was supported by the National Key R&D Program of China (2018YFE0205502).  ... 
doi:10.1155/2021/6629852 fatcat:kjzh4kzlyrbktfm5s62raxxlce

A Reinforcement Learning Approach for Transient Control of Liquid Rocket Engines

Gunther Waxenegger-Wilfing, Kai Dresia, Jan Deeken, Michael Oschwald
2021 IEEE Transactions on Aerospace and Electronic Systems  
In this paper, we study a deep reinforcement learning approach for optimal control of a generic gas-generator engine's continuous start-up phase.  ...  It is shown that the learned policy can reach different steady-state operating points and convincingly adapt to changing system parameters.  ...  ACKNOWLEDGMENT The authors would like to thank Wolfgang Kitsche, Robson Dos Santos Hahn, and Michael Börner for valuable discussions concerning the start-up of a gas-generator cycle liquid rocket engine  ... 
doi:10.1109/taes.2021.3074134 fatcat:tykhhkd2b5gpjnzhfaiw5zaba4

Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning [article]

Taisuke Kobayashi
2022 arXiv   pre-print
The eligibility-trace method is well known as an online learning technique for improving sample efficiency in traditional reinforcement learning with linear regressors rather than DRL.  ...  Because methods that directly reuse stored experience data cannot follow changes in the environment in robotic problems with time-varying dynamics, online DRL is required.  ... 
arXiv:2008.10040v2 fatcat:h5eb4f7gpfgyva6bezf6za3thq
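Accumulating eligibility traces with a linear value function — the classical setting the abstract contrasts with DRL — can be sketched in a single update step. Feature vectors are plain lists and the hyperparameters are illustrative:

```python
def td_lambda_step(w, e, x, x_next, r, gamma=0.99, lam=0.9, alpha=0.1):
    """One accumulating-trace TD(lambda) update with linear values
    v(s) = w . x(s):  e <- gamma*lam*e + x ;  w <- w + alpha*delta*e."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v          # TD error
    e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e
```

The trace vector e credits recently visited features online, without storing or replaying past transitions.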

A Survey of Deep Network Solutions for Learning Control in Robotics: From Reinforcement to Imitation [article]

Lei Tai and Jingwei Zhang and Ming Liu and Joschka Boedecker and Wolfram Burgard
2018 arXiv   pre-print
This survey focuses on deep learning solutions that target learning control policies for robotics applications.  ...  We carry out our discussions on the two main paradigms for learning control with deep networks: deep reinforcement learning and imitation learning.  ...  in DQN to stabilize learning: target-network and experience replay.  ... 
arXiv:1612.07139v4 fatcat:znbcze2jzjeshaciko7amxwxc4
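The two stabilizers the survey names, a target network and experience replay, can be sketched together in tabular form. The class shape, sync period, and dict-based table are illustrative assumptions, not DQN's neural implementation:

```python
import random
from collections import deque

class DQNStabilizers:
    """Minimal tabular stand-in for DQN's two stabilizers: a replay
    memory sampled uniformly, and a frozen target table synced to the
    online table only every `sync_every` updates."""
    def __init__(self, capacity=10000, sync_every=100):
        self.memory = deque(maxlen=capacity)
        self.q, self.q_target = {}, {}
        self.sync_every, self.updates = sync_every, 0

    def td_target(self, r, s_next, actions, gamma=0.99, done=False):
        """Bootstrap from the frozen target, not the online table."""
        if done:
            return r
        return r + gamma * max(self.q_target.get((s_next, a), 0.0)
                               for a in actions)

    def maybe_sync(self):
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.q_target = dict(self.q)
```

Freezing the bootstrap target between syncs decorrelates it from the parameters being updated, which is what stabilizes learning.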
Showing results 1–15 of 201.