Regret Bounds for Reinforcement Learning with Policy Advice [article]

Mohammad Gheshlaghi Azar and Alessandro Lazaric and Emma Brunskill
2013 arXiv   pre-print
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand.  ...  In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors.  ...  We contribute a reinforcement learning with policy advice (RLPA) algorithm.  ... 
arXiv:1305.1027v2 fatcat:ngfawqhphrg6pdmzidmbq5e3mq

Regret Bounds for Reinforcement Learning with Policy Advice [chapter]

Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill
2013 Lecture Notes in Computer Science  
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand.  ...  In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors.  ...  We contribute a reinforcement learning with policy advice (RLPA) algorithm.  ... 
doi:10.1007/978-3-642-40988-2_7 fatcat:pvh4zm63qjbhnomepyqpsgrxd4

Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer [article]

Yusen Zhan, Haitham Bou Ammar, Matthew E. Taylor
2016 arXiv   pre-print
Policy advice is a transfer learning method where a student agent is able to learn faster via advice from a teacher.  ...  Our regret bounds justify the intuition that good teachers help while bad teachers hurt.  ...  Acknowledgements This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University.  ... 
arXiv:1604.03986v1 fatcat:ozffopv7gjgadg3jdawqt7ikka

The offset tree for learning with partial labels

Alina Beygelzimer, John Langford
2009 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09  
We present an algorithm, called the offset tree, for learning in situations where a loss associated with different decisions is not known, but was randomly probed.  ...  In particular, it has regret at most (k − 1) times the regret of the binary classifier it uses, where k is the number of decisions, and no reduction to binary classification can do better.  ...  First we formalize a learning reduction, which relies upon a binary classification oracle. The lower bound we prove below holds for all such learning reductions. Advice.  ... 
doi:10.1145/1557019.1557040 dblp:conf/kdd/BeygelzimerL09 fatcat:3n27nfdjwbhljaebdspe6qjsre

Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement

Jared Glover, Charlotte Zhu
We present a ping-pong-playing robot that learns to improve its swings with human advice.  ...  Multimodal stochastic policies can also easily be learned with this approach when the reward function is multimodal in the policy parameters.  ...  We can achieve similar bounds for the noisy task parameter case with a modified definition of regret.  ... 
doi:10.1609/aaai.v28i1.9059 fatcat:zfgeo3456bejthmaciimuueooy

Adaptive Probabilistic Policy Reuse [chapter]

Yann Chevaleyre, Aydano Machado Pamponet
2012 Lecture Notes in Computer Science  
Recently, many complex reinforcement learning problems have been successfully solved by efficient transfer learners.  ...  Transfer algorithms allow the use of knowledge previously learned on related tasks to speed-up learning of the current task.  ...  Proposition 8. For any MDP M in which rewards are bounded by r_max, any policies π and π̄, and a starting state s_0, we have d/dϕ V(ϕ) ≤ 2 r_max / (1−γ)^2. Finally, combining the above result with the regret  ... 
doi:10.1007/978-3-642-34487-9_73 fatcat:xbmvvvmsffaktcqafm4kbiz4ny

Reinforcement learning with value advice

Mayank Daswani, Peter Sunehag, Marcus Hutter
2014 Asian Conference on Machine Learning  
The problem we consider in this paper is reinforcement learning with value advice.  ...  In this setting, the agent is given limited access to an oracle that can tell it the expected return (value) of any state-action pair with respect to the optimal policy.  ...  We thank the Australian Research Council for support under grant DP120100950 and J.E. Brand for doing the voice-overs on the videos.  ... 
dblp:conf/acml/DaswaniSH14 fatcat:3onccvde75b7dmxurxyjhslsgi

The Offset Tree for Learning with Partial Labels [article]

Alina Beygelzimer, John Langford
2016 arXiv   pre-print
We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices.  ...  Experiments with the Offset Tree show that it generally performs better than several alternative approaches.  ...  We would also like to thank Shai Shalev-Shwartz for providing data and helping setup a clean comparison with the Banditron.  ... 
arXiv:0812.4044v3 fatcat:k72xczke4bg6bmpzclihofkmwm

Reinforcement Learning Algorithm Selection [article]

Romain Laroche, Raphael Feraud
2017 arXiv   pre-print
This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning.  ...  ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS.  ...  )) and Rexp3 in O(T^(2/3)), or the RL with Policy Advice's regret bounds of O(√T log(T))  ... 
arXiv:1701.08810v3 fatcat:fjr5azqqbvdwxdzv33sdnln5he

Cache Replacement as a MAB with Delayed Feedback and Decaying Costs [article]

Farzana Beente Yusuf, Vitalii Stebliankin, Giuseppe Vietri, Giri Narasimhan
2021 arXiv   pre-print
We present an improved adaptive version of LeCaR, called OLeCaR, with the learning rate set as determined by the theoretical derivation presented here to minimize regret for EXP4-DFDC.  ...  As an application, we show that LeCaR, a recent top-performing machine learning algorithm for cache replacement, can be enhanced with adaptive learning using our formulations.  ...  We acknowledge support for this project from Dr. Camilo Valdes and the rest of the OLeCaR and Cacheus groups for their insightful feedback.  ... 
arXiv:2009.11330v4 fatcat:4sg3zxbeg5duha5ev5pbylmrmm

Learning to Teach Reinforcement Learning Agents

Anestis Fachantidis, Matthew Taylor, Ioannis Vlahavas
2017 Machine Learning and Knowledge Extraction  
Second, the article studies policy learning for distributing advice under a budget.  ...  Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning one and propose a novel reinforcement learning algorithm capable of learning  ...  One possible goal for any teacher advising with a finite amount of advice would be to help minimize student's regret with respect to the reward obtained by an optimal policy.  ... 
doi:10.3390/make1010002 dblp:journals/make/FachantidisTV19 fatcat:u3vj5zzrkncg3dv62rbg2yzv5e

Online Transfer Learning in Reinforcement Learning Domains [article]

Yusen Zhan, Matthew E. Taylor
2015 arXiv   pre-print
First, the convergence of Q-learning and Sarsa with tabular representation with a finite budget is proven.  ...  This paper proposes an online transfer framework to capture the interaction among agents and shows that current transfer learning in reinforcement learning is a special case of online transfer.  ...  Acknowledgments This research has taken place in the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL research is supported in part by grants from AFRL FA8750-14-  ... 
arXiv:1507.00436v2 fatcat:tl6czeen6bephcechk46bbbshm

Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization [article]

Zheng Wen, Benjamin Van Roy
2016 arXiv   pre-print
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize  ...  We establish further efficiency and asymptotic performance guarantees that apply even if the true value function does not lie in the given hypothesis class, for the special case where the hypothesis class  ...  Regret bounds for reinforcement learning with policy advice. Machine Learning and Knowledge Discovery in Databases. Springer, 97–112. [4] Bartlett, Peter L., Ambuj Tewari. 2009.  ... 
arXiv:1307.4847v4 fatcat:pmzmgknlujg5fjbnossmkoxqlu

An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits

Isaac Sledge, José Príncipe
2018 Entropy  
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret.  ...  We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.  ...  The time-averaged regret for reinforcement comparison is relatively high in the beginning and catches up with that from VoIMix.  ... 
doi:10.3390/e20030155 pmid:33265246 fatcat:bh5csw4agzgc3gpug2mfiekzmu

Contextual Bandit Algorithms with Supervised Learning Guarantees [article]

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert E. Schapire
2011 arXiv   pre-print
Second, we give a new algorithm called VE that competes with a possibly infinite set of policies of VC-dimension d while incurring regret at most O(√(T(d ln(T) + ln(1/δ)))) with probability 1−δ.  ...  These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing supervised learning type guarantees for the contextual  ...  Acknowledgments We thank Wei Chu for assistance with the experiments and Kishore Papineni for helpful discussions. This work was done while Lev Reyzin and Robert E. Schapire were at Yahoo!  ... 
arXiv:1002.4058v3 fatcat:z53vlri3x5g2ncmul2odlek4iq