Regret Bounds for Reinforcement Learning with Policy Advice
[article]
2013
arXiv
pre-print
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. ...
In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. ...
We contribute a reinforcement learning with policy advice (RLPA) algorithm. ...
arXiv:1305.1027v2
fatcat:ngfawqhphrg6pdmzidmbq5e3mq
Regret Bounds for Reinforcement Learning with Policy Advice
[chapter]
2013
Lecture Notes in Computer Science
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. ...
In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. ...
We contribute a reinforcement learning with policy advice (RLPA) algorithm. ...
doi:10.1007/978-3-642-40988-2_7
fatcat:pvh4zm63qjbhnomepyqpsgrxd4
Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer
[article]
2016
arXiv
pre-print
Policy advice is a transfer learning method where a student agent is able to learn faster via advice from a teacher. ...
Our regret bounds justify the intuition that good teachers help while bad teachers hurt. ...
Acknowledgements This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. ...
arXiv:1604.03986v1
fatcat:ozffopv7gjgadg3jdawqt7ikka
The offset tree for learning with partial labels
2009
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09
We present an algorithm, called the offset tree, for learning in situations where a loss associated with different decisions is not known, but was randomly probed. ...
In particular, it has regret at most (k − 1) times the regret of the binary classifier it uses, where k is the number of decisions, and no reduction to binary classification can do better. ...
First we formalize a learning reduction, which relies upon a binary classification oracle. The lower bound we prove below holds for all such learning reductions. Advice. ...
doi:10.1145/1557019.1557040
dblp:conf/kdd/BeygelzimerL09
fatcat:3n27nfdjwbhljaebdspe6qjsre
Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement
2014
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
We present a ping-pong-playing robot that learns to improve its swings with human advice. ...
Multimodal stochastic policies can also easily be learned with this approach when the reward function is multimodal in the policy parameters. ...
We can achieve similar bounds for the noisy task parameter case with a modified definition of regret. ...
doi:10.1609/aaai.v28i1.9059
fatcat:zfgeo3456bejthmaciimuueooy
Adaptive Probabilistic Policy Reuse
[chapter]
2012
Lecture Notes in Computer Science
Recently, many complex reinforcement learning problems have been successfully solved by efficient transfer learners. ...
Transfer algorithms allow the use of knowledge previously learned on related tasks to speed-up learning of the current task. ...
Proposition 8. For any MDP M in which rewards are bounded by r_max, any policies π and π̂, and a starting state s_0, we have |dV(ϕ)/dϕ| ≤ 2 r_max / (1−γ)². Finally, combining the above result with the regret ...
doi:10.1007/978-3-642-34487-9_73
fatcat:xbmvvvmsffaktcqafm4kbiz4ny
Reinforcement learning with value advice
2014
Asian Conference on Machine Learning
The problem we consider in this paper is reinforcement learning with value advice. ...
In this setting, the agent is given limited access to an oracle that can tell it the expected return (value) of any state-action pair with respect to the optimal policy. ...
We thank the Australian Research Council for support under grant DP120100950 and J.E. Brand for doing the voice-overs on the videos. ...
dblp:conf/acml/DaswaniSH14
fatcat:3onccvde75b7dmxurxyjhslsgi
The Offset Tree for Learning with Partial Labels
[article]
2016
arXiv
pre-print
We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices. ...
Experiments with the Offset Tree show that it generally performs better than several alternative approaches. ...
We would also like to thank Shai Shalev-Shwartz for providing data and helping setup a clean comparison with the Banditron. ...
arXiv:0812.4044v3
fatcat:k72xczke4bg6bmpzclihofkmwm
Reinforcement Learning Algorithm Selection
[article]
2017
arXiv
pre-print
This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning. ...
ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. ...
... and Rexp3 in O(T^(2/3)), or the RL with Policy Advice's regret bounds of O(√T log(T)) ...
arXiv:1701.08810v3
fatcat:fjr5azqqbvdwxdzv33sdnln5he
Cache Replacement as a MAB with Delayed Feedback and Decaying Costs
[article]
2021
arXiv
pre-print
We present an improved adaptive version of LeCaR, called OLeCaR, with the learning rate set as determined by the theoretical derivation presented here to minimize regret for EXP4-DFDC. ...
As an application, we show that LeCaR, a recent top-performing machine learning algorithm for cache replacement, can be enhanced with adaptive learning using our formulations. ...
We acknowledge support for this project from Dr. Camilo Valdes and the rest of the OLeCaR and Cacheus groups for their insightful feedback. ...
arXiv:2009.11330v4
fatcat:4sg3zxbeg5duha5ev5pbylmrmm
Learning to Teach Reinforcement Learning Agents
2017
Machine Learning and Knowledge Extraction
Second, the article studies policy learning for distributing advice under a budget. ...
Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning one and propose a novel reinforcement learning algorithm capable of learning ...
One possible goal for any teacher advising with a finite amount of advice would be to help minimize the student's regret with respect to the reward obtained by an optimal policy. ...
doi:10.3390/make1010002
dblp:journals/make/FachantidisTV19
fatcat:u3vj5zzrkncg3dv62rbg2yzv5e
Online Transfer Learning in Reinforcement Learning Domains
[article]
2015
arXiv
pre-print
First, the convergence of Q-learning and Sarsa with a tabular representation under a finite budget is proven. ...
This paper proposes an online transfer framework to capture the interaction among agents and shows that current transfer learning in reinforcement learning is a special case of online transfer. ...
Acknowledgments This research has taken place in the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL research is supported in part by grants from AFRL FA8750-14- ...
arXiv:1507.00436v2
fatcat:tl6czeen6bephcechk46bbbshm
Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization
[article]
2016
arXiv
pre-print
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize ...
We establish further efficiency and asymptotic performance guarantees that apply even if the true value function does not lie in the given hypothesis class, for the special case where the hypothesis class ...
Regret bounds for reinforcement learning with policy advice. Machine Learning and Knowledge Discovery in Databases. Springer, 97–112.
[4] Bartlett, Peter L., Ambuj Tewari. 2009. ...
arXiv:1307.4847v4
fatcat:pmzmgknlujg5fjbnossmkoxqlu
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits
2018
Entropy
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. ...
We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls. ...
The time-averaged regret for reinforcement comparison is relatively high in the beginning and catches up with that from VoIMix. ...
doi:10.3390/e20030155
pmid:33265246
fatcat:bh5csw4agzgc3gpug2mfiekzmu
Contextual Bandit Algorithms with Supervised Learning Guarantees
[article]
2011
arXiv
pre-print
Second, we give a new algorithm called VE that competes with a possibly infinite set of policies of VC-dimension d while incurring regret at most O(√(T(d ln(T) + ln(1/δ)))) with probability 1−δ. ...
These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing supervised learning type guarantees for the contextual ...
Acknowledgments We thank Wei Chu for assistance with the experiments and Kishore Papineni for helpful discussions. This work was done while Lev Reyzin and Robert E. Schapire were at Yahoo! ...
arXiv:1002.4058v3
fatcat:z53vlri3x5g2ncmul2odlek4iq
Showing results 1 — 15 out of 11,452 results