21,175 Hits in 5.3 sec

Addressing Environment Non-Stationarity by Repeating Q-learning Updates

Sherief Abdallah, Michael Kaisers
2016 Journal of machine learning research  
Here, we introduce Repeated Update Q-learning (RUQL), a learning algorithm that resolves the undesirable artifact of Q-learning while maintaining simplicity.  ...  Q-learning (QL) is a popular reinforcement learning algorithm that is guaranteed to converge to optimal policies in Markov decision processes.  ...  Acknowledgments We would like to acknowledge support for this project from the Emirates Foundation Ref No: 2010-107 Science & Engineering Research grant and British University in Dubai INF009 Grant.  ... 
dblp:journals/jmlr/AbdallahK16 fatcat:33vbw36cq5b3va3uh6thpr3jum
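The repeated-update idea this entry describes can be sketched in tabular form: repeating a rate-α Q-learning update k times toward a fixed target collapses to a single update with effective rate 1 − (1 − α)^k, and RUQL sets k inversely proportional to the probability of the selected action. A minimal sketch (the function name and tabular setting are illustrative assumptions, not the paper's code):

```python
import numpy as np

def ruql_update(Q, s, a, r, s_next, pi_a, alpha=0.1, gamma=0.99):
    """RUQL-style update: repeating the standard Q-learning update
    1/pi(a|s) times collapses to one update with effective rate
    1 - (1 - alpha)**(1/pi(a|s)), so rarely chosen actions learn faster."""
    target = r + gamma * np.max(Q[s_next])
    alpha_eff = 1.0 - (1.0 - alpha) ** (1.0 / pi_a)
    Q[s, a] += alpha_eff * (target - Q[s, a])
    return Q
```

The effective-rate form avoids actually looping 1/π(a|s) times, which matters when the selection probability is tiny.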

Addressing Function Approximation Error in Actor-Critic Methods [article]

Scott Fujimoto, Herke van Hoof, David Meger
2018 arXiv   pre-print
Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation.  ...  We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance.  ...  To address this problem, we propose to simply upper-bound the less biased value estimate Q_θ2 by the biased estimate Q_θ1.  ... 
arXiv:1802.09477v3 fatcat:jeo6m2vj4bh3bnidp5hnypovoe
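The two mechanisms this entry names, the clipped double-Q target and delayed policy updates, can be sketched as follows (function names and the scalar NumPy formulation are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the minimum of the two
    target critics' next-state values to limit overestimation."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def should_update_policy(step, policy_delay=2):
    """Delayed policy updates: refresh the actor and target networks
    only every `policy_delay` critic updates (TD3 defaults to 2)."""
    return step % policy_delay == 0
```

In the full algorithm `q1_next` and `q2_next` come from target networks evaluated at a noise-perturbed action from the target policy.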

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [article]

Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Bo Cheng
2020 arXiv   pre-print
We show that the learning of a state-action return distribution function can be used to improve the Q-value estimation accuracy.  ...  In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated Q-value estimates, resulting in suboptimal policies.  ...  overestimation bias and suboptimal policy updates.  ... 
arXiv:2001.02811v2 fatcat:zswqzteo4fbfpgpqk2m5q4lsye

Reducing Estimation Bias via Weighted Delayed Deep Deterministic Policy Gradient [article]

Qiang He, Xinwen Hou
2020 arXiv   pre-print
To address this issue, TD3 takes the minimum value between a pair of critics, which introduces underestimation bias.  ...  The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, which could lead to suboptimal policies  ...  ACKNOWLEDGMENT Xinwen Hou is the corresponding author.  ... 
arXiv:2006.12622v1 fatcat:n7zcjyw2ejd5hpm57ti22vzby4

Weighted Double Deep Multiagent Reinforcement Learning in Stochastic Cooperative Environments [article]

Yan Zheng, Jianye Hao, Zongzhang Zhang
2018 arXiv   pre-print
Experiments show that the WDDQN outperforms the existing DRL and multiagent DRL algorithms, i.e., double DQN and lenient Q-learning, in terms of the average reward and the convergence rate in stochastic  ...  By utilizing the weighted double estimator and the deep neural network, WDDQN can not only reduce the bias effectively but also be extended to scenarios with raw visual inputs.  ...  Lenient Q-learning [Potter and De Jong, 1994] updates the policies of multiple agents towards an optimal joint policy simultaneously by letting each agent adopt an optimistic disposition at the initial exploration  ... 
arXiv:1802.08534v2 fatcat:qhpbcuswezcb3oyrhf4qw5fgwy
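The weighted double estimator this entry mentions can be illustrated schematically: a weight β interpolates between the single estimator (plain max, biased upward) and the double estimator (cross-evaluation, biased downward). This is a hedged sketch under that interpolation reading; the paper's actual scheme for choosing β is more elaborate:

```python
import numpy as np

def weighted_double_estimate(qa, qb, beta=0.5):
    """Interpolate between the single estimate Q_A[argmax Q_A]
    (overestimation-prone) and the double estimate Q_B[argmax Q_A]
    (underestimation-prone); beta trades the two biases off."""
    a_star = int(np.argmax(qa))
    single = qa[a_star]   # evaluate with the selecting table
    double = qb[a_star]   # evaluate with the other table
    return beta * single + (1.0 - beta) * double
```

With β = 1 this reduces to the standard max estimator; with β = 0 it is the double estimator.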

Sampling Efficient Deep Reinforcement Learning through Preference-Guided Stochastic Exploration [article]

Wenhui Huang, Cong Zhang, Jingda Wu, Xiangkun He, Jie Zhang, Chen Lv
2022 arXiv   pre-print
A large body of practical work based on the Deep Q-Network (DQN) algorithm indicates that the stochastic policy, despite its simplicity, is the most frequently used exploration approach.  ...  However, most existing stochastic exploration approaches either explore new actions heuristically regardless of Q-values or inevitably introduce bias into the learning process to couple the sampling with  ...  We formally prove that the proposed policy has an appealing property: it preserves the policy improvement guarantee of the Q-learning framework. 2) We show that the preference of actions inferred by the  ... 
arXiv:2206.09627v1 fatcat:ft53gp3ngbh2liq7bnyrtzg3ay

Multi-Agent Reinforcement Learning: A Survey

Lucian Busoniu, Robert Babuska, Bart De Schutter
2006 2006 9th International Conference on Control, Automation, Robotics and Vision  
Many tasks arising in these domains require that the agents learn behaviors online. A significant part of the research on multi-agent learning concerns reinforcement learning techniques.  ...  In this paper we aim to present an integrated survey of the field. First, the issue of the multi-agent learning goal is discussed, after which a representative selection of algorithms is reviewed.  ...  ACKNOWLEDGEMENT This research is financially supported by Senter, Ministry of Economic Affairs of the Netherlands within the BSIK-ICIS project "Interactive Collaborative Information Systems" (grant no.  ... 
doi:10.1109/icarcv.2006.345353 dblp:conf/icarcv/BusoniuBS06 fatcat:5lo6wzdlbncybbb2uu4bdrzsqq

Decorrelated Double Q-learning [article]

Gang Chen
2020 arXiv   pre-print
Inspired by the recent advance of deep reinforcement learning and Double Q-learning, we introduce the decorrelated double Q-learning (D2Q).  ...  Q-learning with value function approximation may perform poorly because of overestimation bias and imprecise estimates.  ...  A clipped Double Q-learning variant called TD3 [11] extends the deterministic policy gradient [5, 6] to address overestimation bias.  ... 
arXiv:2006.06956v1 fatcat:zkfov4qzd5h3hp5vvlycuz5s2u

Two-Sample Testing in Reinforcement Learning [article]

Martin Waltz, Ostap Okhrin
2022 arXiv   pre-print
It subsequently performs updates by adjusting the current Q-estimate towards the observed reward and the maximum of the Q-estimates of the next state.  ...  The procedure introduces maximization bias, which is addressed by approaches like Double Q-Learning.  ...  Acknowledgements We would like to thank Niklas Paulig for fruitful discussions in the early stages of this work.  ... 
arXiv:2201.08078v2 fatcat:u3tfriidjrhrfe5cm5bzru6jgm
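The update rule this entry describes, and the Double Q-learning remedy it mentions, can be sketched in tabular form (a minimal illustration, not the paper's procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Standard Q-learning: adjust the estimate towards the reward plus
    the max of the next-state estimates; the max over noisy values is
    what introduces maximization bias."""
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

def double_q_update(QA, QB, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Double Q-learning: select the greedy action with one table but
    evaluate it with the other; the table to update is chosen at random."""
    if rng.random() < 0.5:
        QA, QB = QB, QA
    a_star = QA[s2].argmax()
    QA[s, a] += alpha * (r + gamma * QB[s2, a_star] - QA[s, a])
```

Decoupling selection from evaluation removes the systematic upward bias, at the cost of some underestimation.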

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model [article]

Xinyue Chen, Che Wang, Zijian Zhou, Keith Ross
2021 arXiv   pre-print
of Q functions from the ensemble.  ...  In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based  ...  Unlike Maxmin Q-learning, REDQ controls over-estimation bias and variance of the Q estimate by separately setting M and N .  ... 
arXiv:2101.05982v2 fatcat:7h5rpq4zkvf5bhr4ao2ruzstrq
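The subset-minimum target this entry alludes to can be sketched as follows (an illustrative scalar version; in REDQ the N critics are neural networks, M is typically small, and both are tuned separately, as the snippet notes):

```python
import numpy as np

rng = np.random.default_rng(1)

def redq_target(r, done, q_next_ensemble, M=2, gamma=0.99):
    """REDQ-style target (sketch): take the min over a random subset of
    M critics drawn from the ensemble of N next-state estimates, so
    over-estimation bias and variance can be controlled independently
    via M and N."""
    q_next = np.asarray(q_next_ensemble, dtype=float)
    idx = rng.choice(len(q_next), size=M, replace=False)
    return r + gamma * (1.0 - done) * q_next[idx].min()
```

Larger M makes the min more pessimistic; larger N reduces the variance of the resulting estimate.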

Temporal Cross-Selling Optimization Using Action Proxy-Driven Reinforcement Learning

Nan Li, Naoki Abe
2011 2011 IEEE 11th International Conference on Data Mining Workshops  
Since the changes are directly tied to the reward, unconstrained formulation would result in unbounded behavior, leading us to constrain the learned policy.  ...  We propose a variant of reinforcement learning, enhanced with the notion of "action proxy", which is applicable to cross-selling pattern discovery even in the absence of actions.  ...  A variant of RL is Q-learning [7]. Q-learning uses the idea of experience-based updating. It can be done either online or in batch [17], [18].  ... 
doi:10.1109/icdmw.2011.163 dblp:conf/icdm/LiA11 fatcat:adyzc6edarb4xgz4yxeo3ley64

An Information-Theoretic Optimality Principle for Deep Reinforcement Learning [article]

Felix Leibfried, Jordi Grau-Moya, Haitham Bou-Ammar
2018 arXiv   pre-print
We methodologically address the problem of Q-value overestimation in deep reinforcement learning to handle high-dimensional state spaces efficiently.  ...  The resultant algorithm encompasses a wide range of learning outcomes containing deep Q-networks as a special case.  ...  Optimistic Overestimation: Upon careful investigation of Equation (1), one comes to recognize that Q-learning updates introduce a bias to the learning process caused by an overestimation of the optimal  ... 
arXiv:1708.01867v5 fatcat:eluvryslufcstmysxpcybtivny
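The optimistic overestimation this entry discusses follows from the max operator: for noisy estimates, E[max_a Q̂(a)] ≥ max_a E[Q̂(a)]. A quick numerical illustration with synthetic values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Five actions, all with true value 0; estimates carry zero-mean noise.
true_q = np.zeros(5)
estimates = true_q + rng.normal(0.0, 1.0, size=(10_000, 5))

# The max over noisy estimates is biased upward even though every
# true value is 0: the bias here is roughly E[max of 5 std normals].
bias = estimates.max(axis=1).mean()
print(bias)  # noticeably positive
```

The bias grows with the number of actions and the noise level, which is why it is most damaging in large action spaces with rough function approximation.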

Application of twin delayed deep deterministic policy gradient learning for the control of transesterification process [article]

Tanuja Joshi, Shikhar Makker, Hariprasad Kodamana, Harikumar Kandath
2021 arXiv   pre-print
It is expected that some of these challenges can be addressed by developing control strategies that directly interact with the process and learning from the experiences.  ...  One of the promising and renewable alternatives to fossil fuels is bio-diesel produced by means of the batch transesterification process.  ...  (i) To address the overestimation bias problem, the concept of clipped double Q-learning is used wherein two Q-values are learned, and the minimum of them is used for the approximation of the target Q-value  ... 
arXiv:2102.13012v1 fatcat:fjndr7lmybhzfe3so3rftqdlaq

Non-delusional Q-learning and value-iteration

Tyler Lu, Dale Schuurmans, Craig Boutilier
2018 Neural Information Processing Systems  
Delusional bias arises when the approximation architecture limits the class of expressible greedy policies.  ...  Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological  ...  The key update in Alg. 1 is Line 6, which jointly updates all Q-values of the relevant sets of policies in G(Θ).  ... 
dblp:conf/nips/LuSB18 fatcat:zfzkdrhmszgcblmaco5mnxxode

Self-Imitation Learning via Generalized Lower Bound Q-learning [article]

Yunhao Tang
2021 arXiv   pre-print
Self-imitation learning motivated by lower-bound Q-learning is a novel and effective approach for off-policy learning.  ...  To provide a formal motivation for the potential performance gains provided by self-imitation learning, we show that n-step lower bound Q-learning achieves a trade-off between fixed point bias and contraction  ...  Due to the max operator, sampled updates of Q-learning naturally incur over-estimation bias, which potentially leads to unstable learning.  ... 
arXiv:2006.07442v3 fatcat:lwuisb33u5ca5cx6f4w3vepwf4
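The lower-bound flavor of self-imitation this entry refers to can be sketched as a one-sided objective: the estimate is pulled up only when an observed n-step return exceeds it, so past good trajectories improve Q without adding a max over bootstrapped values. A hedged, simplified scalar sketch (not the paper's exact objective):

```python
import numpy as np

def lower_bound_q_loss(q_pred, n_step_return):
    """Self-imitation-style lower-bound loss (sketch): penalize only
    when the observed n-step return exceeds the current estimate; when
    the estimate is already above the return, the gradient is zero."""
    gap = np.maximum(n_step_return - q_pred, 0.0)
    return 0.5 * gap ** 2
```

The choice of n governs the trade-off the snippet mentions: larger n tightens the return-based lower bound but weakens the contraction of the resulting operator.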
Showing results 1 — 15 of 21,175