RUDDER: Return Decomposition for Delayed Rewards [article]

Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter
2019 arXiv   pre-print
We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs).  ...  (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels.  ...  We propose RUDDER (RetUrn Decomposition for DElayed Rewards) for learning with reward redistributions that are obtained via return decompositions.  ... 
arXiv:1806.07857v3 fatcat:hu3dl6eezzbrdc3in67gcbmbau

Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution [article]

Vihang P. Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M. Blies, Johannes Brandstetter, Jose A. Arjona-Medina, Sepp Hochreiter
2020 arXiv   pre-print
However, for complex tasks, current exploration strategies as deployed in RUDDER struggle with discovering episodes with high rewards.  ...  Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards, thus speeding up learning.  ...  Reward redistribution using multiple sequence alignment. RUDDER uses an LSTM model for reward redistribution via return decomposition.  ... 
arXiv:2009.14108v1 fatcat:2agkgxiw6vbyvewf7u6mylcmqu

Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER [article]

Markus Holzleitner, Lukas Gruber, José Arjona-Medina, Johannes Brandstetter, Sepp Hochreiter
2020 arXiv   pre-print
Our framework allows showing convergence of the well known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER.  ...  For the convergence proof we employ recently introduced techniques from the two time-scale stochastic approximation theory.  ...  Acknowledgments The ELLIS Unit Linz, the LIT AI Lab, the Institute for Machine Learning, are supported by the Federal State Upper Austria. IARAI is supported by Here Technologies.  ... 
arXiv:2012.01399v1 fatcat:xtrzqeoucnbsvfvcflhqwjueoq

Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning [article]

Baicen Xiao, Bhaskar Ramasubramanian, Radha Poovendran
2022 arXiv   pre-print
The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps.  ...  In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges.  ...  An approach named RUDDER [2] used contribution analysis to decompose episodic rewards by computing the difference between predicted returns at successive time-steps.  ... 
arXiv:2201.04612v1 fatcat:wwfce65wn5d6niihkqurlvudqu

Zero-shot Policy Learning with Spatial Temporal Reward Decomposition on Contingency-aware Observation [article]

Huazhe Xu, Boyuan Chen, Yang Gao, Trevor Darrell
2021 arXiv   pre-print
We find this setting natural for biological creatures and at the same time, challenging for previous methods.  ...  Please refer to the project page for more visualized results.  ...  We use the latent representation as the observations for a behavioral cloning agent. RUDDER [32] RUDDER proposes to use LSTM to decompose the delayed sparse reward to dense rewards.  ... 
arXiv:1910.08143v2 fatcat:tvjcaayl6zcx7f5x3pe3uahyge

Off-Policy Reinforcement Learning with Delayed Rewards [article]

Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng
2021 arXiv   pre-print
We study deep reinforcement learning (RL) algorithms with delayed rewards.  ...  For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost  ...  • RUDDER [18]. It decomposes the delayed and sparse reward to a surrogate per-step reward via regression.  ... 
arXiv:2106.11854v1 fatcat:t2dbyb4bhjdebflox3pcvdjzvu

Synthetic Returns for Long-Term Credit Assignment [article]

David Raposo, Sam Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt Botvinick, Hado van Hasselt, Francis Song
2021 arXiv   pre-print
This approach suffers when delays between actions and rewards are long and when intervening unrelated events contribute variance to long-term returns.  ...  Finally, we show that our IMPALA-based SR agent solves Atari Skiing -- a game with a lengthy reward delay that posed a major hurdle to deep-RL agents -- 25 times faster than the published state-of-the-art  ...  Acknowledgements We thank Anna Harutyunyan for insightful feedback and discussion of our writeup and formal aspects of the work; Pablo Sprechmann, Adrià Puigdomènech Badia, and Steven Kapturowski for discussion  ... 
arXiv:2102.12425v1 fatcat:pr2y5v33qze7beq53tdhc7foey

Learning Guidance Rewards with Trajectory-space Smoothing [article]

Tanmay Gangwani, Yuan Zhou, Jian Peng
2020 arXiv   pre-print
the benefit of our approach when the environmental rewards are sparse or delayed.  ...  However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback.  ...  In contrast with these, our computation of the guidance rewards does not require training auxiliary networks and could be viewed as a simple uniform return decomposition.  ... 
arXiv:2010.12718v1 fatcat:7wvtolwa3nblnal2ds4h5zlc5u

A Survey of Deep Reinforcement Learning in Video Games [article]

Kun Shao, Zhentao Tang, Yuanheng Zhu, Nannan Li, Dongbin Zhao
2019 arXiv   pre-print
sparse rewards, as well as some research directions.  ...  This learning mechanism updates the policy to maximize the return with an end-to-end method.  ...  ACKNOWLEDGMENT The authors would like to thank Qichao Zhang, Dong Li and Weifan Li for the helpful comments and discussions about this work.  ... 
arXiv:1912.10944v2 fatcat:fsuzp2sjrfcgfkyclrsyzflax4

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning [article]

Yaodong Yang, Jianye Hao, Guangyong Chen, Hongyao Tang, Yingfeng Chen, Yujing Hu, Changjie Fan, Zhongyu Wei
2020 arXiv   pre-print
During centralized training, one key challenge is the multiagent credit assignment: how to allocate the global rewards for individual agent policies for better coordination towards maximizing system-level's  ...  In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values.  ...  Applying integrated gradients into RL was first studied in RUDDER (Arjona-Medina et al., 2018) to address the sparse delayed reward problem in single-agent RL and has shown excellent performance.  ... 
arXiv:2002.03950v1 fatcat:f3badwvmvjao5gekvekt54smju

Optimizing Agent Behavior over Long Time Scales by Transporting Value [article]

Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, Greg Wayne
2018 arXiv   pre-print
Existing approaches to shorter-term credit assignment in AI cannot solve tasks with long delays between actions and consequences.  ...  Here, we introduce a new paradigm for reinforcement learning where agents use recall of specific memories to credit actions from the past, allowing them to solve problems that are intractable for existing  ...  Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857 (2018). 29. Li, S., Cullen, W. K., Anwyl, R. & Rowan, M. J.  ... 
arXiv:1810.06721v2 fatcat:sos65kc5s5dcnj7q7t4ng6oxge

Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update [article]

Su Young Lee, Sungik Choi, Sae-Young Chung
2019 arXiv   pre-print
Our computationally efficient recursive algorithm allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode.  ...  RUDDER [1] introduces an LSTM network with contribution analysis for an efficient return decomposition.  ...  In many practical problems, an RL agent observes sparse and delayed rewards.  ... 
arXiv:1805.12375v3 fatcat:4jwwpp6pp5gdncwxauducl2gme

Optimality, Accuracy, and Efficiency of an Exact Functional Test

Hien H. Nguyen, Hua Zhong, Mingzhou Song
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Most asymmetric methods for causal direction inference are not driven by the function-versus-independence question.  ...  RUDDER [Arjona-Medina et al., 2019] is an online method for credit assignment based on return decomposition. The focus of RUDDER is on online credit assignment while ours is on transfer.  ...  In the reward shaping formalism, the potential function φ depends only on the state. To stay within its bounds, we define φ as the forwarded redistributed return.  ... 
doi:10.24963/ijcai.2020/368 dblp:conf/ijcai/FerretMGP20 fatcat:fpj7xo2t4naevhsiwozdwrj6sq

Technologies for distributed flight control systems: A review

M. Segvic, K. Krajcek, E. Ivanjko
2015 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)  
Described systems and technologies are represented with examples of real systems including swarms of small Unmanned Aerial Vehicles and distributed networks for Fault Detection and Isolation.  ...  These systems have the potential to be more economic than a centralized system due to the simplicity of individual components, and possibility of using the same production unit for a different role within  ...  For aircraft's FCS, there are many requirements that make implementation of PLC difficult, such as using negative return wires on the power bus instead of chassis return as usual.  ... 
doi:10.1109/mipro.2015.7160432 dblp:conf/mipro/SegvicKI15 fatcat:kyl3ddyxjnfzrj7clhvk7j24ue

Recent Advances in Deep Reinforcement Learning Applications for Solving Partially Observable Markov Decision Processes (POMDP) Problems Part 2—Applications in Transportation, Industries, Communications and Networking and More Topics

Xuanchen Xiang, Simon Foo, Huanyu Zang
2021 Machine Learning and Knowledge Extraction  
The two-part series of papers provides a survey on recent advances in Deep Reinforcement Learning (DRL) for solving partially observable Markov decision processes (POMDP) problems.  ...  The first part of the overview introduces Markov Decision Processes (MDP) problems and Reinforcement Learning and applications of DRL for solving POMDP problems in games, robotics, and natural language  ...  [128] introduced RUDDER (Return Decomposition for Delayed Rewards) to learn long-term credit assignments for delayed rewards.  ... 
doi:10.3390/make3040043 doaj:45bf00de595c44d186fa3d200589c1c5 fatcat:qx4srh7qabgjvd5l6lj6nulhxa