META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning [article]

Mingde Zhao
2020 arXiv   pre-print
To improve the sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.  ...  TD-learning with eligibility traces provides a way to do temporal credit assignment, i.e. decide which portion of a reward should be assigned to predecessor states that occurred at different previous times  ...  $\cdot \nabla_\theta \ln \pi(A|S; \theta)$ // accumulating traces for $z_\theta$; $w = w + \alpha_w \delta \cdot z_w$; $\theta = \theta + \alpha_\theta \delta \cdot z_\theta$; $I = I \cdot \gamma(S')$; $S = S'$. Chapter 3: Sample Efficiency of Temporal Difference Learning. Learning faster and more accurately  ... 
arXiv:2006.08906v1 fatcat:z4vsafqmm5dqlluues7bv26buy

Reinforcement Learning and its Connections with Neuroscience and Psychology [article]

Ajay Subramanian, Sharad Chitlangia, Veeky Baths
2021 arXiv   pre-print
In this paper, we comprehensively review a large number of findings in both neuroscience and psychology that evidence reinforcement learning as a promising candidate for modeling learning and decision  ...  While there certainly has been considerable independent innovation to produce such results, many core ideas in reinforcement learning are inspired by phenomena in animal learning, psychology and neuroscience  ...  in sample efficiency and robustness  ... 
arXiv:2007.01099v5 fatcat:mjpkztlmqnfjba3dtcwqwmmlvu

META-Learning State-based Eligibility Traces for More Sample-Efficient Policy Evaluation [article]

Mingde Zhao, Sitao Luan, Ian Porada, Xiao-Wen Chang, Doina Precup
2020 arXiv   pre-print
For better sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.  ...  TD-learning with eligibility traces provides a way to boost sample efficiency by temporal credit assignment, i.e. deciding which portion of a reward should be assigned to predecessor states that occurred  ...  We are grateful to Compute Canada for providing a shared cluster for experimentation.  ... 
arXiv:1904.11439v6 fatcat:ivaaiiqsx5dbrjo3wcfljb2amm

Learning to learn online with neuromodulated synaptic plasticity in spiking neural networks [article]

Samuel Schmidgall, Joe Hays
2022 arXiv   pre-print
We propose that in order to harness our understanding of neuroscience toward machine learning, we must first have powerful tools for training brain-like models of learning.  ...  with a framework of learning to learn through gradient descent to address challenging online learning problems.  ...  Acknowledgments The program is funded by Office of the Under Secretary of Defense (OUSD) through the Applied Research for Advancement of S&T Priorities (ARAP) Program work unit 1U64.  ... 
arXiv:2206.12520v2 fatcat:tqohncoyvrdf5n7xumenrwlwle

TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent [article]

Alex Kearney, Vivek Veeriah, Jaden B. Travnik, Richard S. Sutton, Patrick M. Pilarski
2018 arXiv   pre-print
In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning.  ...  Furthermore, adapting parameters at different rates has the added benefit of being a simple form of representation learning.  ...  Step-size adaptation in temporal-difference learning The problem of how to set step-sizes automatically is an important one for machine learning.  ... 
arXiv:1804.03334v1 fatcat:vspu4e3mg5dw3okjbfdj2mybie

A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning [article]

Martha White, Adam White
2016 arXiv   pre-print
There is no meta-learning method for λ that can achieve (1) incremental updating, (2) compatibility with function approximation, and (3) stability of learning under both on- and off-policy sampling  ...  For temporal-difference learning algorithms which we study here, there is yet another parameter, λ, that similarly impacts learning speed and stability in practice.  ...  Acknowledgements We would like to thank David Silver for helpful discussions and the reviewers for helpful comments.  ... 
arXiv:1607.00446v2 fatcat:dvvdietczjaslc4jjurey55fum

Meta-Gradient Reinforcement Learning [article]

Zhongwen Xu, Hado van Hasselt, David Silver
2018 arXiv   pre-print
Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function.  ...  We discuss a gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment.  ...  for their suggestions and comments on an early version of the paper.  ... 
arXiv:1805.09801v1 fatcat:mls5nqcgprbcpkdazc7fmnsuk4

Off-policy Learning with Eligibility Traces: A Survey [article]

Matthieu Geist, Bruno Scherrer
2013 arXiv   pre-print
Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces.  ...  In the framework of Markov Decision Processes, off-policy learning, that is the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated  ...  Given samples, well-known methods for estimating a value function are temporal difference (TD) learning and Monte Carlo (Sutton and Barto, 1998).  ... 
arXiv:1304.3999v1 fatcat:kagkj4bs7vd7nkt37nprlyogz4

Evolving interpretable plasticity for spiking networks

Jakob Jordan, Maximilian Schmidt, Walter Senn, Mihai A Petrovici
2021 eLife  
We successfully apply our approach to typical learning scenarios and discover previously unknown mechanisms for learning efficiently from rewards, recover efficient gradient-descent methods for learning  ...  How these changes can be mathematically described at the phenomenological level, as so-called 'plasticity rules', is essential both for understanding biological information processing and for developing  ...  the individual components of the eligibility trace.  ... 
doi:10.7554/elife.66273 pmid:34709176 pmcid:PMC8553337 fatcat:ekwsjg3gcndzbcgbuzb6y32q3y

Brain-inspired global-local learning incorporated with neuromorphic computing [article]

Yujie Wu, Rong Zhao, Jun Zhu, Feng Chen, Mingkun Xu, Guoqi Li, Sen Song, Lei Deng, Guanrui Wang, Hao Zheng, Jing Pei, Youhui Zhang (+2 others)
2021 arXiv   pre-print
It can meta-learn local plasticity and receive top-down supervision information for multiscale synergic learning.  ...  We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors.  ...  Data and code availability All data used in this paper are publicly available and can be accessed at for the MNIST dataset,  ... 
arXiv:2006.03226v3 fatcat:rpx3rt56lzbzrhffdcipfxtuji

Online Off-policy Prediction [article]

Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White
2018 arXiv   pre-print
The issue lies with the temporal difference (TD) learning update at the heart of most prediction algorithms: combining bootstrapping, off-policy sampling and function approximation may cause the value  ...  for decades.  ...  Sample efficient actor-critic with experience replay. ArXiv:1611.01224. Yu, H. (2015). On convergence of emphatic temporal-difference learning.  ... 
arXiv:1811.02597v1 fatcat:qqkbocmp2bbjxlcb5r5wbou3vq

Selective Credit Assignment [article]

Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt
2022 arXiv   pre-print
We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates.  ...  Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings.  ...  The Q-learning algorithms illustrated in Fig. 1 use a form of temporal difference (TD) learning (Sutton, 1988a) to learn predictions online from sampled experience by bootstrapping on other predictions  ... 
arXiv:2202.09699v1 fatcat:26zcp3tku5hqfmhtiuojcjxw4a

Darwinian embodied evolution of the learning ability for survival

Stefan Elfwing, Eiji Uchibe, Kenji Doya, Henrik I Christensen
2011 Adaptive Behavior  
Q-learning is more difficult than Sarsa to combine with eligibility traces, because the learned policy, the greedy policy, is different than the policy used for selecting actions.  ...  Examples of this approach are the Dyna algorithm by Sutton (1990), and Prioritized Sweeping by Moore and Atkeson (1993). Eligibility Traces: Eligibility traces is a basic mechanism for temporal credit  ... 
doi:10.1177/1059712310397633 fatcat:2r5mx4nh3rdvliamrqo5ve6ttq

One-shot learning with spiking neural networks [article]

Franz Scherr, Christoph Stoeckl, Wolfgang Maass
2020 bioRxiv   pre-print
in RSNNs for large families of learning tasks.  ...  The same learning approach also supports fast spike-based learning of posterior probabilities of potential input sources, thereby providing a new basis for probabilistic reasoning in RSNNs.  ...  We would like to thank Sandra Diaz from the SimLab at the FZ Jülich for enabling the use of CSCS.  ... 
doi:10.1101/2020.06.17.156513 fatcat:q2iim666rvclbpfa2ngjuyuppe

Deep Reinforcement Learning Overview of the state of the Art

Youssef Fenjiro, Houda Benbrahim
2018 Journal of Automation, Mobile Robotics & Intelligent Systems  
Artificial intelligence has made big steps forward with reinforcement learning (RL) in the last century, and with the advent of deep learning (DL) in the 90s, especially, the breakthrough of convolutional  ...  In the end, we will discuss some potential research directions in the field of deep RL, for which we have great expectations that will lead to a real human level of intelligence.  ...  more efficient way.  ... 
doi:10.14313/jamris_3-2018/15 fatcat:wn5i7y7tgfhvnhz3u5xkqlgvpe