
TD-learning with exploration

Sean P. Meyn, Amit Surana
2011 IEEE Conference on Decision and Control and European Control Conference  
We introduce exploration in the TD-learning algorithm to approximate the value function for a given policy.  ...  We are interested in performance bounds for approximate policy iteration for TD-learning or SARSA with exploration.  ...  Conclusions: We have shown how a simple restart mechanism leads to a new version of TD-learning that allows for exploration.  ...
doi:10.1109/cdc.2011.6160851 dblp:conf/cdc/MeynS11 fatcat:an6r4ssu3vfddggcg6fp6by7nu
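As a rough illustration of what exploration in TD-learning looks like, here is a minimal sketch of tabular TD(0) policy evaluation with epsilon-greedy exploration and periodic restarts on a hypothetical chain MDP. The environment, the epsilon value and the restart schedule are assumptions made for illustration; the sketch does not reproduce the specific restart mechanism analysed by Meyn and Surana.

```python
import numpy as np

# Tabular TD(0) policy evaluation on a toy chain with epsilon-greedy
# exploration and periodic restarts (illustrative only; not the paper's
# restart mechanism).
n_states, gamma, alpha, eps = 10, 0.95, 0.1, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

def step(s, a):
    """Chain dynamics: a=+1 moves right, a=-1 moves left; reward at the right end."""
    s2 = int(np.clip(s + a, 0, n_states - 1))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

s = 0
for t in range(5000):
    a = rng.choice([-1, 1]) if rng.random() < eps else 1      # explore vs. follow the policy
    s2, r = step(s, a)
    V[s] += alpha * (r + gamma * V[s2] - V[s])                # TD(0) update
    s = 0 if (s2 == n_states - 1 or t % 200 == 199) else s2   # restart at goal or periodically
print(np.round(V, 2))
```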

The improvement of Q-learning applied to imperfect information game

Jing Lin, Xuan Wang, Lijiao Han, Jiajia Zhang, Xinxin Xu
2009 2009 IEEE International Conference on Systems, Man and Cybernetics  
The truncated TD estimate improves the efficiency of returns, and the simulated annealing algorithm increases the chance of exploration.  ...  To accelerate convergence and to avoid getting stuck in a local optimum, this paper combines the Q-learning algorithm, truncated TD estimation and a simulated annealing algorithm.  ...  To tackle these problems, we combine Q-learning with the TD method to accelerate convergence, so the agent can learn quickly in the initial stages of learning.  ...
doi:10.1109/icsmc.2009.5346316 dblp:conf/smc/LinWHZX09 fatcat:nf3vru3nebgkxhpo2mrwmsiebe
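Loosely following the combination the abstract describes, the sketch below runs tabular Q-learning with Boltzmann (softmax) action selection whose temperature is annealed over episodes, a common simulated-annealing-style exploration schedule. The toy environment, the temperature schedule and the hyperparameters are assumptions; the authors' truncated TD estimation is not reproduced here.

```python
import numpy as np

# Q-learning with an annealed softmax (Boltzmann) exploration temperature.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.9, 0.1
rng = np.random.default_rng(1)

def env_step(s, a):
    """Toy chain: action 1 moves toward the rewarding terminal state."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

T = 1.0
for episode in range(500):
    T = max(0.05, 0.995 * T)                  # assumed annealing schedule
    s = 0
    for _ in range(50):
        p = np.exp(Q[s] / T)
        p /= p.sum()                          # Boltzmann action probabilities
        a = int(rng.choice(n_actions, p=p))
        s2, r = env_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # Q-learning update
        s = s2
        if r == 1.0:
            break
```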

Learning While Exploring: Bridging the Gaps in the Eligibility Traces [chapter]

Fredrik A. Dahl, Ole Martin Halck
2001 Lecture Notes in Computer Science  
With the usual implementations of exploration in TD-learning, the feedback signals are either distorted or discarded, so that the exploration hurts the algorithm's learning.  ...  The present article gives a modification of the TD-learning algorithm that allows exploration without cost to the accuracy or speed of learning.  ...  The usual way of dealing with this problem in TD-learning for control is not carrying the feedback across the exploring action - see for instance the tic-tac-toe example in the introduction to [4].  ...
doi:10.1007/3-540-44795-4_7 fatcat:tyusjeixobgv3cy5alf42tsly4
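The "not carrying the feedback across the exploring action" trick the snippet refers to is the standard one used in Watkins-style Q(lambda): eligibility traces are zeroed whenever an exploratory action is taken. The sketch below shows that standard variant on an assumed toy chain; it is not Dahl and Halck's own modification, which instead aims to explore without this cost.

```python
import numpy as np

# Watkins-style Q(lambda): traces are cut after exploratory actions so that
# feedback is not carried across them.
n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))
E = np.zeros_like(Q)                           # eligibility traces
gamma, lam, alpha, eps = 0.9, 0.8, 0.1, 0.1
rng = np.random.default_rng(2)

def env_step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for episode in range(300):
    s = 0
    E[:] = 0.0
    for _ in range(60):
        greedy = rng.random() >= eps
        a = int(Q[s].argmax()) if greedy else int(rng.integers(n_actions))
        s2, r = env_step(s, a)
        delta = r + gamma * Q[s2].max() - Q[s, a]
        E[s, a] += 1.0
        Q += alpha * delta * E                 # every traced pair is updated
        E *= (gamma * lam) if greedy else 0.0  # cut traces at exploratory actions
        s = s2
        if r == 1.0:
            break
```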

A comparison of learning speed and ability to cope without exploration between DHP and TD(0)

Michael Fairbank, Eduardo Alonso
2012 The 2012 International Joint Conference on Neural Networks (IJCNN)  
In a simple experiment, the learning speed of DHP is shown to be around 1700 times faster than TD(0).  ...  DHP solves the problem without any exploration, whereas TD(0) cannot solve it without explicit exploration. DHP requires knowledge of, and differentiability of, the environment's model functions.  ...  The critic was trained by either the DHP or TD(0) weight update, with a learning rate of α = 0.1.  ... 
doi:10.1109/ijcnn.2012.6252569 dblp:conf/ijcnn/FairbankA12 fatcat:6ps3qj6igzdincz5sxjp7kj5hy
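For reference, the TD(0) critic weight update mentioned in the snippet, with a learning rate of alpha = 0.1, looks roughly as follows for a linear function approximator. The feature map and the toy dynamics are assumptions; DHP, which additionally needs a differentiable environment model, is not shown.

```python
import numpy as np

# TD(0) critic update with linear function approximation, alpha = 0.1.
alpha, gamma, n_features = 0.1, 0.99, 4
w = np.zeros(n_features)
rng = np.random.default_rng(3)

def features(s):
    """Hypothetical feature vector of a scalar state."""
    return np.array([1.0, s, s * s, np.sin(s)])

s = 0.0
for t in range(1000):
    a = rng.normal()                            # random exploratory action
    s2 = 0.9 * s + 0.1 * a                      # assumed linear toy dynamics
    r = -s2 * s2                                # quadratic cost as negative reward
    phi, phi2 = features(s), features(s2)
    delta = r + gamma * (w @ phi2) - (w @ phi)  # TD error
    w += alpha * delta * phi                    # TD(0) weight update
    s = s2
print(np.round(w, 3))
```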

Reinforcement Learning Framework for Deep Brain Stimulation Study

Dmitrii Krylov, Remi Tachet des Combes, Romain Laroche, Michael Rosenblum, Dmitry V. Dylov
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Suppression and control of this collective synchronous activity are therefore of great importance for neuroscience, and can only rely on limited engineering trials due to the need to experiment with live  ...  We present the first Reinforcement Learning (RL) gym framework that emulates this collective behavior of neurons and allows us to find suppression parameters for the environment of synthetic degenerate  ...
doi:10.24963/ijcai.2020/390 dblp:conf/ijcai/Simmons-EdlerEY20 fatcat:z3q4ncdsfrdmhdcu3jdwsyn6pq

Episodic Multi-agent Reinforcement Learning with Curiosity-Driven Exploration [article]

Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, Chongjie Zhang
2021 arXiv pre-print
In this paper, we introduce a novel Episodic Multi-agent reinforcement learning method with Curiosity-driven exploration, called EMC.  ...  Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems.  ...  Therefore, we call our method Episodic Multi-agent reinforcement learning with Curiosity-driven exploration (EMC).  ...
arXiv:2111.11032v1 fatcat:r5kbcl6imvg2rkaeczi7gnyc24
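As a generic illustration of the curiosity-driven part, the sketch below computes an intrinsic bonus equal to the prediction error of a small learned forward model and adds it to the extrinsic reward. The linear model, its learning rate and the bonus scale are assumptions; EMC's episodic memory and multi-agent machinery are not reproduced.

```python
import numpy as np

# Curiosity-style intrinsic reward: bonus = forward-model prediction error.
rng = np.random.default_rng(4)
state_dim = 2
W = rng.normal(scale=0.1, size=(state_dim, state_dim + 1))  # hypothetical linear forward model

def intrinsic_bonus(state, action, next_state, lr=0.01, scale=0.5):
    """Return a curiosity bonus and update the forward model online."""
    x = np.concatenate([state, [action]])
    err = next_state - W @ x                   # prediction error of the forward model
    W[:] += lr * np.outer(err, x)              # online model update (in place)
    return scale * float(np.linalg.norm(err))

# Usage: the learner sees extrinsic reward plus the curiosity bonus.
s, a, s2, r_ext = np.zeros(state_dim), 1.0, np.ones(state_dim), 0.0
r_total = r_ext + intrinsic_bonus(s, a, s2)
print(r_total)
```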

Optimistic Temporal Difference Learning for 2048

Hung Guei, Lung-Pin Chen, I-Chen Wu
2021 IEEE Transactions on Games  
Temporal difference (TD) learning and its variants, such as multistage TD (MS-TD) learning and temporal coherence (TC) learning, have been successfully applied to 2048.  ...  Our experiments show that both TD and TC learning with OI significantly improve the performance. As a result, the network size required to achieve the same performance is significantly reduced.  ...  OTD+TC learning first performs TD learning with a fixed learning rate to further encourage exploration for a while, and then, in the second phase, continues with TC fine-tuning for exploitation.  ... 
doi:10.1109/tg.2021.3109887 fatcat:slzqxm4denbnxbx4u4y4hjclty
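To make the OI and TC terms concrete: optimistic initialization (OI) starts value estimates high so that unvisited positions keep being tried, and temporal coherence (TC) learning replaces the fixed TD learning rate with a per-entry adaptive rate derived from accumulated errors. The tiny tabular sketch below shows both ideas under assumed constants; it is not the paper's 2048 n-tuple network or its exact two-phase schedule.

```python
import numpy as np

# Optimistic initialization plus a temporal-coherence style adaptive rate.
n_states, gamma, alpha = 8, 1.0, 0.1
V = np.full(n_states, 10.0)         # optimistic initial values encourage exploration
err_sum = np.zeros(n_states)        # TC statistics: signed error sum ...
abs_err_sum = np.zeros(n_states)    # ... and absolute error sum

def td_update(s, r, s2, phase):
    delta = r + gamma * V[s2] - V[s]
    if phase == "TD":                                   # phase 1: fixed learning rate
        V[s] += alpha * delta
    else:                                               # phase 2: TC adaptive rate
        err_sum[s] += delta
        abs_err_sum[s] += abs(delta)
        rate = abs(err_sum[s]) / abs_err_sum[s] if abs_err_sum[s] > 0 else 1.0
        V[s] += rate * delta

# Usage: run phase "TD" first, then switch to "TC" for fine-tuning.
td_update(0, 0.0, 1, phase="TD")
td_update(0, 0.0, 1, phase="TC")
```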

Reward Prediction Error as an Exploration Objective in Deep RL [article]

Riley Simmons-Edler, Ben Eisner, Daniel Yang, Anthony Bisulco, Eric Mitchell, Sebastian Seung, Daniel Lee
2021 arXiv pre-print
However, while state-novelty exploration methods are suitable for tasks where novel observations correlate well with improved reward, they may not explore more efficiently than epsilon-greedy approaches  ...  A major challenge in reinforcement learning is exploration, when local dithering methods such as epsilon-greedy sampling are insufficient to solve a given task.  ...  This shows that off-policy Q-learning combined with TD-error exploration can result in sample-efficient as well as flexible exploration.  ...
arXiv:1906.08189v5 fatcat:rmwxf3qqp5hejaxxj4uhkepk3m
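A minimal way to picture "TD-error exploration" is to add a bonus proportional to the magnitude of the TD error to the reward used by an off-policy Q-learner, as sketched below on an assumed toy chain. The paper itself trains a separate exploration policy against a reward-prediction-error objective; that machinery is not reproduced here.

```python
import numpy as np

# Q-learning with an exploration bonus proportional to |TD error|.
n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha, beta = 0.9, 0.1, 0.2             # beta scales the TD-error bonus
rng = np.random.default_rng(5)

def env_step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

s = 0
for t in range(3000):
    a = int(Q[s].argmax()) if rng.random() > 0.1 else int(rng.integers(n_actions))
    s2, r = env_step(s, a)
    delta = r + gamma * Q[s2].max() - Q[s, a]  # extrinsic TD error
    r_aug = r + beta * abs(delta)              # exploration bonus from |TD error|
    Q[s, a] += alpha * (r_aug + gamma * Q[s2].max() - Q[s, a])
    s = 0 if r == 1.0 else s2
```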

Deep Exploration for Recommendation Systems [article]

Zheqing Zhu, Benjamin Van Roy
2021 arXiv   pre-print
We investigate the design of recommendation systems that can efficiently learn from sparse and delayed feedback.  ...  We design an algorithm based on Thompson Sampling that carries out Deep Exploration.  ...  NCF Deep Q-Learning; NCF TD(0): a less common approach for RSs is TD(0), temporal difference learning with a trace decay factor of 0.  ...
arXiv:2109.12509v1 fatcat:a244qn4k3zc5rhvxai6y2md2tu
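One common way to carry out deep exploration with Thompson-Sampling-like behaviour is to keep an ensemble of randomly initialised value estimates, sample one per episode and act greedily with respect to it, so exploration is temporally consistent within an episode. The sketch below shows that generic randomized-value-function idea on an assumed toy chain; it is not the paper's recommendation-system algorithm or its NCF-based networks.

```python
import numpy as np

# Ensemble of Q-tables as an approximate posterior; one member is sampled
# per episode (Thompson-Sampling-style deep exploration).
n_states, n_actions, n_members = 6, 2, 5
rng = np.random.default_rng(6)
ensemble = [rng.normal(scale=1.0, size=(n_states, n_actions)) for _ in range(n_members)]
gamma, alpha = 0.9, 0.1

def env_step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for episode in range(200):
    Q = ensemble[rng.integers(n_members)]      # sample one member for this episode
    s = 0
    for _ in range(40):
        a = int(Q[s].argmax())                 # act greedily w.r.t. the sampled member
        s2, r = env_step(s, a)
        for Qk in ensemble:                    # every member does its own Q-learning update
            Qk[s, a] += alpha * (r + gamma * Qk[s2].max() - Qk[s, a])
        s = s2
        if r == 1.0:
            break
```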

Temporal difference learning

Andrew Barto
2007 Scholarpedia  
Value estimate update rules: Monte Carlo (every-visit); TD control with on-policy and off-policy methods. On-policy TD control (Sarsa): state values evaluated for the behaviour policy, the soft behaviour  ...  policy being followed. Q-learning, off-policy temporal difference control: model-free, on-line, exploration insensitive, easy to implement; one of the most important breakthroughs in reinforcement learning  ...
doi:10.4249/scholarpedia.1604 fatcat:7yhrvmeoffd4zmvmxfoidocw4y
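For reference, the two update rules the entry contrasts are the standard textbook ones: Sarsa bootstraps from the action actually taken by the behaviour policy (on-policy), while Q-learning bootstraps from the greedy action (off-policy), which is why it is exploration insensitive.

```latex
% Sarsa (on-policy TD control)
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \bigl[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]

% Q-learning (off-policy TD control)
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]
```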

Deep Reinforcement Learning by Balancing Offline Monte Carlo and Online Temporal Difference Use Based on Environment Experiences

Chayoung Kim
2020 Symmetry  
Therefore, we exploited the balance between the offline Monte Carlo (MC) technique and online temporal difference (TD) with an on-policy method (state-action-reward-state-action, Sarsa) and an off-policy method (Q-learning  ...  The proposed balance of MC (offline) and TD (online) use, which is simple and applicable without a well-designed reward, is suitable for real-time online learning.  ...  Q-learning uses an off-policy TD target, whereas Sarsa, an alternative to Q-learning, uses an on-policy TD target [2].  ...
doi:10.3390/sym12101685 fatcat:kzwiywbhfzgo3lbuo4qjvcayfu
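A minimal sketch of the MC/TD balance: for each visited state, the update target is a convex combination of the episode's Monte Carlo return and the one-step TD target, with a mixing weight beta. The mixing weight here is an assumption; the paper's experience-based rule for balancing MC and TD is not reproduced.

```python
# Blend an offline Monte Carlo return with an online TD(0) target.
gamma, alpha = 0.99, 0.1
V = {}  # state -> value estimate

def mixed_update(trajectory, beta):
    """trajectory: list of (state, reward, next_state); beta in [0, 1] weights MC vs. TD."""
    G, returns = 0.0, []
    for s, r, s2 in reversed(trajectory):          # Monte Carlo returns, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, r, s2), G in zip(trajectory, returns):
        v, v2 = V.get(s, 0.0), V.get(s2, 0.0)
        td_target = r + gamma * v2                 # one-step TD target
        target = beta * G + (1.0 - beta) * td_target
        V[s] = v + alpha * (target - v)

# Usage: a two-step toy episode, weighted half MC and half TD.
mixed_update([(0, 0.0, 1), (1, 1.0, 2)], beta=0.5)
print(V)
```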

Introduction

Leslie Pack Kaelbling
1996 Machine Learning  
In particular, there are still important questions of scaling up, of exploration in general environments, of other kinds of bias, and of learning control policies with internal state.  ...  One oft-heard complaint about the TD and Q-learning algorithms is that they are slow to propagate rewards through the state space. Two of the papers in this issue address this problem with traces.  ... 
doi:10.1007/bf00114721 fatcat:vweynsjrh5hnpdpyj7zad75i6i
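The slow-propagation complaint and the trace-based fix can be seen in a few lines: with eligibility traces, a single rewarding transition updates every recently visited state at once instead of creeping back one state per episode. The chain environment below is an assumption for illustration.

```python
import numpy as np

# TD(lambda) with accumulating eligibility traces on a toy chain.
n_states, gamma, lam, alpha = 10, 0.95, 0.9, 0.1
V = np.zeros(n_states)
e = np.zeros(n_states)

for episode in range(100):
    s = 0
    e[:] = 0.0
    while s < n_states - 1:
        s2 = s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        delta = r + gamma * V[s2] - V[s]
        e[s] += 1.0                    # accumulate a trace for the visited state
        V += alpha * delta * e         # all traced states share the TD error
        e *= gamma * lam
        s = s2
print(np.round(V, 2))
```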

Procedural Sequence Learning in Attention Deficit Hyperactivity Disorder: A Meta-Analysis

Teenu Sanjeevan, Robyn E. Cardy, Evdokia Anagnostou
2020 Frontiers in Psychology  
A handful of studies have explored procedural sequence learning in ADHD, but findings have been inconsistent.  ...  The results of seven studies comprising 213 participants with ADHD and 257 participants with typical development (TD) generated an average standardized mean difference of 0.02 (95% CI: -0.35, 0.39) that was  ...  children with TD.  ...
doi:10.3389/fpsyg.2020.560064 pmid:33192824 pmcid:PMC7655644 fatcat:sn73irf2ibfnvnflhqckzt2u7i

Can We Enhance Statistical Learning? Exploring Statistical Learning Improvement in Children with Vocabulary Delay

Dongsun Yim, Yoonhee Yang
2021 Communication Sciences & Disorders  
conditions, and with visual and auditory domains; and also explores the relationship among SL, vocabulary, and quick incidental learning (QUIL). Methods: A total of 132 children between 3 and 8 years participated  ...  The present study investigated whether children with and without vocabulary delay (VD) show a difference in improving statistical learning (SL) tasks manipulated with implicit, implicit*2 and explicit  ...  VD = children with Vocabulary Delay; TD = Typically Developing children.  ...
doi:10.12963/csd.21804 fatcat:azdsynfgzfggfjxxspoyo4aqc4

Context-aware Active Multi-Step Reinforcement Learning [article]

Gang Chen, Dingcheng Li, Ran Xu
2019 arXiv pre-print
In this paper, we propose an active multi-step TD algorithm with adaptive stepsizes to learn the actor and critic.  ...  Specifically, our model consists of two components: active stepsize learning and an adaptive multi-step TD algorithm.  ...  Learning curves with exploration noise on the HalfCheetah, Hopper and Walker2d environments.  ...
arXiv:1911.04107v2 fatcat:gt4hiphzjrhvncsontoe3obzvy
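To make the "multi-step" part concrete, an n-step TD target for the critic sums n discounted rewards and bootstraps from the value estimate n steps ahead, as in the small helper below; n is fixed here, whereas the paper's contribution is choosing the stepsize actively from context.

```python
# n-step TD target: sum of n discounted rewards plus a bootstrapped value.
gamma = 0.99

def n_step_target(rewards, bootstrap_value, n):
    """Return sum_{k<n} gamma^k * rewards[k] + gamma^n * V(s_{t+n})."""
    n = min(n, len(rewards))
    target = sum((gamma ** k) * rewards[k] for k in range(n))
    return target + (gamma ** n) * bootstrap_value

# Usage: a 3-step target from a short reward sequence and a critic estimate.
print(n_step_target([0.0, 0.0, 1.0], bootstrap_value=0.5, n=3))
```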
Showing results 1–15 out of 39,541 results