Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications [article]

Manuel Schneckenreither
2020 arXiv   pre-print
In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. ... Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. ... Similarly, Tadepalli and Ok [37] present an average reward RL algorithm called H-Learning, which under certain assumptions finds bias-optimal values. ...
arXiv:2004.00857v1 fatcat:m5zd24kfpvbtzg6qoxpfqpx5wa
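For reference, the "standard discounted reinforcement learning" this entry contrasts against is typically the tabular Q-learning update sketched below. The step size and discount factor are illustrative; this is a generic baseline, not the near-Blackwell-optimal algorithm proposed in the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step under the discounted criterion."""
    # Discounted target: immediate reward plus gamma-discounted best next value.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```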

Average reward reinforcement learning: Foundations, algorithms, and empirical results

Sridhar Mahadevan
1996 Machine Learning  
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework  ...  reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states.  ...  I thank Prasad Tadepalli for many discussions on average reward reinforcement learning, and for his detailed comments on this paper.  ... 
doi:10.1007/bf00114727 fatcat:btcbqaec3bexbnhd4qcybqqrya
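One of the average-reward algorithms studied in this line of work is Schwartz's R-learning. Below is a minimal tabular sketch of an R-learning-style update with illustrative step sizes; it shows how the average reward replaces discounting, and is not meant to reproduce any specific variant from the paper.

```python
import numpy as np

def r_learning_update(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One R-learning-style step: learn relative action values R and the average reward rho."""
    greedy_before = R[s, a] == np.max(R[s])   # was the executed action greedy?
    max_next = np.max(R[s_next])
    max_here = np.max(R[s])
    # Relative-value target: subtract the average reward rho instead of discounting.
    R[s, a] += alpha * (r - rho + max_next - R[s, a])
    if greedy_before:
        # Update the average-reward (gain) estimate only from greedy transitions.
        rho += beta * (r + max_next - max_here - rho)
    return R, rho
```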

Use of Reinforcement Learning as a Challenge: A Review

Rashmi Sharma, Manish Prateek, Ashok K. Sinha
2013 International Journal of Computer Applications  
This paper gives an introduction to reinforcement learning, discusses its basic model, the optimal policies used in RL, and the main optimal-policy methods used to reward the agent, including ... model-free and model-based methods: the temporal difference method, Q-learning, average reward, certainty-equivalent methods, Dyna, prioritized sweeping, and queue-Dyna. ... Mahadevan [12] showed that existing reinforcement learning algorithms for average reward do not always produce bias-optimal policies. ...
doi:10.5120/12105-8332 fatcat:ghtmafph2jdunbqqss5zinzyg4
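Since the review lists Dyna among the surveyed methods, here is a minimal tabular Dyna-Q sketch showing how a learned model is replayed for planning. The step sizes, planning budget, and deterministic tabular model are illustrative assumptions, not details taken from the review.

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, planning_steps=10):
    """One Dyna-Q step: direct RL update, model learning, then simulated planning updates."""
    # (1) Direct Q-learning update from the real transition.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # (2) Remember the transition in a deterministic tabular model.
    model[(s, a)] = (r, s_next)
    # (3) Planning: replay randomly chosen remembered transitions.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
    return Q, model
```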

An Actor-critic Algorithm Using Cross Evaluation of Value Functions

Hui Wang, Peng Zhang, Quan Liu
2018 IAES International Journal of Robotics and Automation  
In order to overcome the difficulty of learning a globally optimal policy caused by maximization bias in a continuous space, an actor-critic algorithm based on cross evaluation of double value functions is proposed ... The algorithm is more robust than the CACLA learning algorithm, and the experimental results show that our algorithm is smoother and the stability of the policy is clearly improved under the condition that the ... Temporal difference algorithms in reinforcement learning (RL) usually use a maximizing operation to solve for the optimal policy. ...
doi:10.11591/ijra.v7i1.pp39-47 fatcat:rr3n335dazfwvg4tlyoojvowyi
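The abstract compares against CACLA. The sketch below shows the CACLA-style rule of moving the actor toward an explored action only when the TD error is positive, using illustrative linear features and step sizes; it is the baseline, not the paper's cross-evaluation method.

```python
import numpy as np

def cacla_step(theta_v, theta_pi, phi_s, phi_s_next, a_explored, r,
               alpha_v=0.01, alpha_pi=0.01, gamma=0.99):
    """One CACLA-style step with linear value and policy functions.

    theta_v: value-function weights; theta_pi: deterministic-policy weights;
    phi_s / phi_s_next: feature vectors of the current and next state;
    a_explored: the (noisy) scalar action that was actually executed.
    """
    v_s, v_next = theta_v @ phi_s, theta_v @ phi_s_next
    delta = r + gamma * v_next - v_s          # TD error of the state-value critic
    theta_v += alpha_v * delta * phi_s        # the critic learns from every transition
    if delta > 0:
        # CACLA: move the policy toward the explored action only when it
        # outperformed the critic's expectation (sign, not size, of delta).
        a_pred = theta_pi @ phi_s
        theta_pi += alpha_pi * (a_explored - a_pred) * phi_s
    return theta_v, theta_pi
```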

Page 3276 of Psychological Abstracts Vol. 89, Issue 8 [page]

2002 Psychological Abstracts  
The paper also reports a comparison of discounted-reward against average-reward Q-learning in an infinite-horizon robotics task. 25628. Munakata, Yuko & Stedron, Jennifer Merva. ... (Massachusetts Inst of Technology, Lab for Information & Decision Systems, Cambridge, MA) On average versus discounted reward temporal-difference learning. ...

The reward-complexity trade-off in schizophrenia [article]

Samuel J Gershman, Lucy Lai
2020 bioRxiv   pre-print
Schizophrenia patients adopt lower complexity policies on average, and these policies are more strongly biased away from the optimal reward-complexity trade-off curve compared to healthy controls.  ...  If there is a capacity limit for policy complexity, then there will also be a trade-off between reward and complexity, since some reward will need to be sacrificed in order to satisfy the capacity constraint  ...  Acknowledgments We are indebted to Anne Collins for making her data available.  ... 
doi:10.1101/2020.11.16.385013 fatcat:c4quwyclinfitflcmclleaksje
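In this line of work, "policy complexity" is usually measured as the mutual information between states and actions; assuming that definition, a minimal computation looks like the sketch below, where the state distribution and policy matrix are illustrative inputs rather than data from the study.

```python
import numpy as np

def policy_complexity(p_state, policy):
    """Policy complexity as mutual information I(S;A) in bits.

    p_state: state probabilities, shape (n_states,)
    policy:  conditional action probabilities P(a|s), shape (n_states, n_actions)
    """
    p_action = p_state @ policy               # marginal action distribution P(a)
    mi = 0.0
    for s, ps in enumerate(p_state):
        for a, pa_s in enumerate(policy[s]):
            if ps > 0 and pa_s > 0:
                mi += ps * pa_s * np.log2(pa_s / p_action[a])
    return mi
```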

Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [article]

Baturay Saglam, Enes Duran, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat
2021 arXiv   pre-print
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies. ... Our Q-value update rule combines the notions behind Clipped Double Q-learning and Maxmin Q-learning by computing the critic objective through the nested combination of maximum and minimum operators to ... The reinforcement learning paradigm considers an agent interacting with its environment to learn the optimal, reward-maximizing behavior. ...
arXiv:2109.10736v2 fatcat:tzgthr72ufhx5jocineq255y3q
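The snippet only says that the target nests maximum and minimum operators over critic estimates; the grouping below is therefore a hedged guess at one such nesting (minimum within groups of critics, maximum across groups), not the paper's exact rule.

```python
import numpy as np

def nested_minmax_target(q_groups, r, gamma=0.99, done=False):
    """Sketch of a nested max-of-min target over an ensemble of critics.

    q_groups: list of groups of next-state Q estimates (each a 1-D array over
    actions). The minimum is taken within each group (Clipped-Double-Q style),
    then the maximum across groups (Maxmin/ensemble style); this grouping is
    an assumption for illustration only.
    """
    per_group = [np.min(np.stack(group), axis=0) for group in q_groups]
    q_next = np.max(np.stack(per_group), axis=0)   # nested combination
    best = np.max(q_next)                          # greedy next-state value
    return r + (0.0 if done else gamma * best)
```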

Deep Ordinal Reinforcement Learning [article]

Alexander Zap, Tobias Joppen, Johannes Fürnkranz
2019 arXiv   pre-print
We show how to convert common reinforcement learning algorithms to an ordinal variation, using Q-learning as an example, and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal ... Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. ... Calculations for this research were conducted on the Lichtenberg high-performance computer of TU Darmstadt. ...
arXiv:1905.02005v2 fatcat:bdeic5veezhy5elhx5oygi2lhm
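The abstract does not spell out how ordinal rewards are aggregated; one plausible ordinal criterion is to score each action by its probability of beating another action's reward rank in a pairwise comparison, sketched below over hypothetical per-action rank histograms (each action is assumed to have been tried at least once).

```python
import numpy as np

def ordinal_action_scores(counts):
    """Score actions from per-action histograms over ordinal reward ranks.

    counts: array of shape (n_actions, n_ranks); counts[a, k] is how often
    action a led to the k-th (ordered) reward. The pairwise-win-probability
    score is an illustrative ordinal criterion, not necessarily the paper's.
    """
    probs = counts / counts.sum(axis=1, keepdims=True)
    n_actions, n_ranks = probs.shape
    scores = np.zeros(n_actions)
    for a in range(n_actions):
        for b in range(n_actions):
            if a == b:
                continue
            cdf_b = np.cumsum(probs[b])
            # P(rank of a > rank of b) plus half the tie probability.
            win = sum(probs[a, k] * (cdf_b[k - 1] if k > 0 else 0.0) for k in range(n_ranks))
            tie = float(np.dot(probs[a], probs[b]))
            scores[a] += win + 0.5 * tie
    return scores / max(n_actions - 1, 1)
```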

The National Science Foundation Workshop on Reinforcement Learning

Sridhar Mahadevan, Leslie Pack Kaelbling
1996 The AI Magazine  
maximize total reward among all gain-optimal policies (bias optimality). ... He also described the standard dynamic programming algorithms, such as policy iteration and value iteration, for computing optimal policies, and variations on these algorithms, for example, modified policy ...
doi:10.1609/aimag.v17i4.1244 dblp:journals/aim/MahadevanK96 fatcat:vrz3h6o2cnb6njmtnohflsfksa
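For the dynamic programming algorithms mentioned here, a minimal value iteration sketch over a known tabular MDP looks like the following; the transition and reward arrays are illustrative inputs.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Standard value iteration on a tabular MDP.

    P: transition probabilities, shape (n_actions, n_states, n_states)
    R: expected immediate rewards, shape (n_actions, n_states)
    """
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)                  # action values, shape (n_actions, n_states)
        V_new = Q.max(axis=0)                    # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)       # optimal values and a greedy policy
        V = V_new
```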

Reinforcement Learning: A Survey

L. P. Kaelbling, M. L. Littman, A. W. Moore
1996 The Journal of Artificial Intelligence Research  
It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.  ...  This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning.  ...  Also thanks to our many colleagues in the reinforcement-learning community who have done this work and explained it to us.  ... 
doi:10.1613/jair.301 fatcat:nbo23vmu6rfz3ctpjbk7sdcnt4

Reinforcement Learning: A Survey [article]

L. P. Kaelbling, M. L. Littman, A. W. Moore
1996 arXiv   pre-print
It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.  ...  This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning.  ...  Also thanks to our many colleagues in the reinforcement-learning community who have done this work and explained it to us.  ... 
arXiv:cs/9605103v1 fatcat:ze737h6wnfdhjf52hiz4gpogxq

Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients [article]

Baturay Saglam, Furkan Burak Mutlu, Dogan Can Cicek, Suleyman Serdar Kozat
2022 arXiv   pre-print
Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. ... Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. ... The reinforcement learning paradigm considers an agent that interacts with its environment to learn the optimal, reward-maximizing behavior. ...
arXiv:2109.11788v3 fatcat:j3fbcrzgjbcvpp5r3sai5hal7q

On the Reduction of Variance and Overestimation of Deep Q-Learning [article]

Mohammed Sabry, Amr M. A. Khalifa
2019 arXiv   pre-print
The breakthrough of deep Q-learning in different types of environments revolutionized the algorithmic design of reinforcement learning and led to more stable and robust algorithms; to that end, many extensions ... to the deep Q-learning algorithm have been proposed to reduce the variance of the target values and the overestimation phenomenon. ... Finding an optimal policy is the main concern of reinforcement learning; for that reason, many algorithms have been introduced over the course of time, e.g., Q-learning [4], SARSA [5], and policy gradient ...
arXiv:1910.05983v1 fatcat:x6avbd6wfncu3oro6sultvomqm
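One well-known extension that reduces overestimation in the target values is the Double DQN target, sketched below with PyTorch. It is a standard construction given for context, not necessarily the specific remedy analyzed in this paper; `done` is assumed to be a 0/1 float tensor.

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: the online net selects the action, the target net evaluates it."""
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)    # action selection
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # action evaluation
        return r + gamma * (1.0 - done) * q_eval
```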

Some Insights into Lifelong Reinforcement Learning Systems [article]

Changjian Li
2020 arXiv   pre-print
Some insights into lifelong reinforcement learning are provided, along with a simplistic prototype lifelong reinforcement learning system. ... A lifelong reinforcement learning system is a learning system that has the ability to learn through trial-and-error interaction with the environment over its lifetime. ... Acknowledgements The author would like to thank Gaurav Sharma (Borealis AI) for his comments on a draft of the paper. ...
arXiv:2001.09608v1 fatcat:f56hobcawfbbfnxsm3dtscmduy

Automatic tuning of hyper-parameters of reinforcement learning algorithms using Bayesian optimization with behavioral cloning [article]

Juan Cruz Barsce, Jorge A. Palombarini, Ernesto C. Martínez
2021 arXiv   pre-print
Also, by tightly integrating Bayesian optimization into the design of a reinforcement learning agent, the number of state transitions needed to converge to the optimal policy for a given task is reduced. ... Therefore, the user of an RL algorithm has to rely on search-based optimization methods, such as grid search or the Nelder-Mead simplex algorithm, which are very inefficient for most RL tasks and slow down ... the average rewards for the other optimizers. ...
arXiv:2112.08094v1 fatcat:hd4bvvjrpzgn3oweaq2l747k6m
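A generic way to set up such hyper-parameter tuning is a Bayesian-optimization loop, for example with scikit-optimize's gp_minimize as sketched below. Here `train_and_evaluate` is a hypothetical user-supplied function that trains an agent and returns its average reward, and the search space is illustrative; the sketch omits the behavioral-cloning component described in the paper.

```python
from skopt import gp_minimize
from skopt.space import Real

def tune_hyperparameters(train_and_evaluate, n_calls=25):
    """Bayesian-optimization loop over two RL hyper-parameters (learning rate, discount)."""
    space = [Real(1e-4, 1e-1, prior="log-uniform", name="lr"),
             Real(0.90, 0.999, name="gamma")]
    # gp_minimize minimizes, so negate the average reward returned by the evaluator.
    result = gp_minimize(lambda x: -train_and_evaluate(lr=x[0], gamma=x[1]),
                         space, n_calls=n_calls, random_state=0)
    return {"lr": result.x[0], "gamma": result.x[1], "best_avg_reward": -result.fun}
```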