4,802 Hits in 5.1 sec

Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits [article]

Wenshuo Guo, Kumar Krishna Agrawal, Aditya Grover, Vidya Muthukumar, Ashwin Pananjady
2022 arXiv   pre-print
We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator.  ...  In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation.  ...  AG, VM, and AP were supported by research fellowships from the Simons Institute for the Theory of Computing when part of this work was performed.  ... 
arXiv:2106.14866v2 fatcat:tqxcfjrbm5cgdpatkcikmob6yu

Generalize Robot Learning From Demonstration to Variant Scenarios With Evolutionary Policy Gradient

Junjie Cao, Weiwei Liu, Yong Liu, Jian Yang
2020 Frontiers in Neurorobotics  
Robot learning for automation from human demonstration is central to such situations.  ...  In this paper, we present Evolutionary Policy Gradient (EPG) to make robots learn from demonstration and perform goal-oriented exploration efficiently.  ...  Thompson Sampling, which originated from bandit problems, provides an elegant approach that tackles the exploration-exploitation dilemma.  ... 
doi:10.3389/fnbot.2020.00021 pmid:32372940 pmcid:PMC7188386 fatcat:lodwo6wq2ngvlcfccuhzaa5fay

A Framework for Learning from Demonstration with Minimal Human Effort

Marc Rigter, Bruno Lacerda, Nick Hawes
2020 IEEE Robotics and Automation Letters  
In our approach, we learn to predict the success probability for each controller, given the initial state of an episode.  ...  In this setting we address reinforcement learning, and learning from demonstration, where there is a cost associated with human time.  ...  Multi-Armed Bandits In a Multi-Armed Bandit (MAB), at each episode an agent must choose from a finite set Φ of arms, with unknown reward distributions r_φ for choosing each arm φ ∈ Φ.  ... 
doi:10.1109/lra.2020.2970619 fatcat:a4oflyy55bd7vcelwjsyvunbsa
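The MAB setup quoted above — an agent repeatedly choosing from a finite set of arms with unknown reward distributions — can be sketched in a few lines. This is a generic illustration with an assumed Bernoulli reward model and illustrative function names, not the controller-selection method of the paper:

```python
import random

def pull(arm_means, arm):
    """Draw a Bernoulli reward for the chosen arm (means unknown to the agent)."""
    return 1.0 if random.random() < arm_means[arm] else 0.0

def greedy_bandit(arm_means, episodes=1000, seed=0):
    """Try every arm once, then repeatedly pull the best empirical mean."""
    random.seed(seed)
    k = len(arm_means)
    counts, totals = [0] * k, [0.0] * k
    for t in range(episodes):
        arm = t if t < k else max(range(k), key=lambda a: totals[a] / counts[a])
        r = pull(arm_means, arm)
        counts[arm] += 1
        totals[arm] += r
    return counts, totals
```

The pure-greedy choice after initialization is the simplest baseline; the papers in this listing refine exactly this step with explicit exploration.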

Experimental demonstration of channel order recognition in wireless communications by laser chaos time series and confidence intervals

Mitsuhiko Shimomura, Nicolas Chauvet, Mikio Hasegawa, Makoto Naruse
2022 Nonlinear Theory and Its Applications IEICE  
Recently, a fast decision-making algorithm for the multi-armed bandit problem utilizing laser chaos time series has been demonstrated.  ...  Furthermore, recognition of the order of reward expectations across arms has been successfully developed by incorporating the notion of a confidence interval for each reward estimate.  ...  Acknowledgments The authors appreciate Shungo Takeuchi for his kind and critical support of the experimental system construction.  ... 
doi:10.1587/nolta.13.101 fatcat:ycomxokgczbbdlz7lb4iffzpuu

Demonstrator skill modulates observational aversive learning

Ida Selbing, Björn Lindström, Andreas Olsson
2014 Cognition  
An inability to discriminate threatening from safe stimuli is typical for individuals suffering from anxiety.  ...  Although learning through others is likely an efficient way of learning, observational learning also has to be applied critically, for instance by not copying the choices of someone who performs poorly  ...  This also means that for learning to be optimal, the learning rate should be sensitive to the agent's uncertainty in estimating the expected value.  ... 
doi:10.1016/j.cognition.2014.06.010 pmid:25016187 fatcat:3tbaew7mlfbtfcqapsaejt56y4

Can Q-learning solve Multi Armed Bantids? [article]

Refael Vivanti
2021 arXiv   pre-print
When a reinforcement learning (RL) method has to decide between several optional policies by solely looking at the received reward, it has to implicitly optimize a Multi-Armed-Bandit (MAB) problem.  ...  on its rewards variance, and leaving a boring, or low-variance, policy is less likely due to its low implicit exploration.  ...  Each policy has an implicit exploration rate, which is derived from its rewards variance.  ... 
arXiv:2110.10934v1 fatcat:aiy3xbjgu5duziw6gwkvygchvm

Deep Contextual Multi-armed Bandits [article]

Mark Collier, Hector Urdiales Llorens
2018 arXiv   pre-print
Here we present a deep learning framework for contextual multi-armed bandits that is both non-linear and enables principled exploration at the same time.  ...  We tackle the exploration vs. exploitation trade-off through Thompson sampling by exploiting the connection between inference time dropout and sampling from the posterior over the weights of a Bayesian  ...  Acknowledgements The authors would like to thank Marco Lagi, Adam Starikiewicz, Vedant Misra and George Banis for their helpful comments on drafts of this paper.  ... 
arXiv:1807.09809v1 fatcat:btkzma64dne3ro65ltqyifg44u
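The Thompson-sampling idea this abstract builds on — sample from a posterior over arm values and act greedily on the sample — reduces, for Bernoulli rewards, to drawing from per-arm Beta posteriors. The sketch below is that textbook variant, not the paper's dropout-based approximation; all names and parameters are illustrative:

```python
import random

def thompson_arm(successes, failures):
    """Sample each arm's value from its Beta posterior; play the argmax."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

def run_thompson(arm_means, steps=2000, seed=1):
    """Bernoulli bandit loop with Beta(1, 1) priors on every arm."""
    random.seed(seed)
    k = len(arm_means)
    s, f = [0] * k, [0] * k
    for _ in range(steps):
        a = thompson_arm(s, f)
        if random.random() < arm_means[a]:
            s[a] += 1   # reward 1: increment the Beta alpha count
        else:
            f[a] += 1   # reward 0: increment the Beta beta count
    return s, f
```

The paper's contribution is to replace the exact Beta posterior with dropout at inference time as an approximate posterior over network weights, which this sketch does not reproduce.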

Bayesian Unification of Gradient and Bandit-Based Learning for Accelerated Global Optimisation

Ole-Christoffer Granmo
2016 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA)  
We further propose an accompanying bandit driven exploration scheme that uses Bayesian credible bounds to trade off exploration against exploitation.  ...  However, for continuous optimisation problems or problems with a large number of actions, bandit based approaches can be hindered by slow learning.  ...  probability of each arm, and an "optimistic reward probability estimate" is identified for each arm.  ... 
doi:10.1109/icmla.2016.0044 dblp:conf/icmla/Granmo16 fatcat:3ep5f5abnnho7awhdrgcchfjou
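Trading off exploration against exploitation via an "optimistic reward probability estimate" per arm is the upper-confidence-bound idea. The sketch below uses the frequentist UCB1 width sqrt(2 ln t / n) rather than the paper's Bayesian credible bounds, so it is a related baseline, not the proposed scheme:

```python
import math
import random

def ucb1(arm_means, steps=1000, seed=4):
    """UCB1: pull the arm maximizing empirical mean + sqrt(2 ln t / n)."""
    random.seed(seed)
    k = len(arm_means)
    counts, totals = [0] * k, [0.0] * k
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1                      # initialization: each arm once
        else:
            arm = max(range(k), key=lambda a: totals[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts
```

A Bayesian variant would replace the closed-form width with an upper quantile of each arm's posterior, which shrinks as evidence accumulates.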

Self-Supervised Contextual Bandits in Computer Vision [article]

Aniket Anand Deshmukh, Abhimanu Kumar, Levi Boyles, Denis Charles, Eren Manavoglu, Urun Dogan
2020 arXiv   pre-print
In the usual self-supervision, we learn implicit labels from the training data for a secondary task.  ...  We provide cases where the proposed scheme doesn't perform optimally and give alternative methods for better learning in these cases.  ...  Good feature representation learning can lead to significant gains in reward optimization in bandit tasks.  ... 
arXiv:2003.08485v1 fatcat:mj4cnvp4fzbbfenbtqk7dkd4hq

On-Line Adaptation of Exploration in the One-Armed Bandit with Covariates Problem

Adam M. Sykulski, Niall M. Adams, Nicholas R. Jennings
2010 2010 Ninth International Conference on Machine Learning and Applications  
We provide simulation results for the one-armed bandit with covariates problem, which demonstrate the effectiveness of ε-ADAPT to correctly control the amount of exploration in finite-time problems and yield  ...  Many sequential decision making problems require an agent to balance exploration and exploitation to maximise long-term reward.  ...  ACKNOWLEDGEMENTS This research was undertaken as part of the ALADDIN (Autonomous Learning Agents for Decentralised Data and Information Networks) project and is jointly funded by a BAE Systems and EPSRC  ... 
doi:10.1109/icmla.2010.74 dblp:conf/icmla/SykulskiAJ10 fatcat:ayw4h7wjq5hqzaoulkacea4dpm
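Adapting the amount of exploration online, as this paper does, contrasts with the fixed-schedule ε-greedy baseline where ε simply decays as 1/t. The sketch below implements only that baseline; the paper's ε-ADAPT rule, which tunes ε from observed data, is not reproduced, and all parameter values are illustrative:

```python
import random

def eps_greedy(arm_means, steps=1000, eps0=1.0, seed=2):
    """Epsilon-greedy with a fixed eps0/t decay schedule (baseline only)."""
    random.seed(seed)
    k = len(arm_means)
    counts, totals = [0] * k, [0.0] * k
    for t in range(1, steps + 1):
        if 0 in counts or random.random() < eps0 / t:
            arm = random.randrange(k)                                 # explore
        else:
            arm = max(range(k), key=lambda a: totals[a] / counts[a])  # exploit
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals
```

A fixed schedule like this ignores what the rewards reveal about remaining uncertainty, which is the gap an adaptive rule aims to close.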

Bayesian Optimal Experimental Design for Simulator Models of Cognition [article]

Simon Valentin, Steven Kleinegesse, Neil R. Bramley, Michael U. Gutmann, Christopher G. Lucas
2021 arXiv   pre-print
In this work, we combine recent advances in BOED and approximate inference for intractable models, using machine-learning methods to find optimal experimental designs, approximate sufficient summary statistics  ...  Our simulation experiments on multi-armed bandit tasks show that our method results in improved model discrimination and parameter estimation, as compared to experimental designs commonly used in the literature  ...  Experiments In this section we demonstrate the optimization of reward probabilities for multi-armed bandit tasks, with the scientific goals of (1) model discrimination (MD) and (2) parameter estimation  ... 
arXiv:2110.15632v1 fatcat:y7ymrq6hufeb3ek3prdkhsxh5q

Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs [article]

Yibo Yang, Antoine Blanchard, Themistoklis Sapsis, Paris Perdikaris
2021 arXiv   pre-print
Finally, we provide a JAX library for efficient bandit optimization using Gaussian processes.  ...  We present a new type of acquisition function for online decision making in multi-armed and contextual bandit problems with extreme payoffs.  ...  We have developed an open-source Python package for bandit optimization using Gaussian processes 1 .  ... 
arXiv:2102.10085v2 fatcat:3r2cyu5enrhrra6pl6adsx6gk4

Top-K Ranking Deep Contextual Bandits for Information Selection Systems [article]

Jade Freeman, Michael Rawson
2022 arXiv   pre-print
We demonstrate the approach and evaluate the performance of learning from the experiments using real-world data sets in simulated scenarios.  ...  Contextual multi-armed bandits have been widely used for learning to filter contents and prioritize according to user interest or relevance.  ...  ACKNOWLEDGMENT We thank the anonymous referees for their helpful suggestions.  ... 
arXiv:2201.13287v1 fatcat:uwotvgfdezajvjqc3zvujczdk4

Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters [chapter]

Ole-Christoffer Granmo, Stian Berg
2010 Lecture Notes in Computer Science  
The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward.  ...  This paper proposes a novel solution scheme for bandit problems with non-stationary normally distributed rewards.  ...  A promising line of thought is the interval estimation methods, where a confidence interval for the unperturbed reward of each arm is estimated, and an "optimistic reward estimate" is identified for each  ... 
doi:10.1007/978-3-642-13033-5_21 fatcat:2r3a7qhekref3fxa6wqrxhbwfa
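The scheme described here — per-arm Kalman filters whose Gaussian posteriors are sampled Thompson-style to handle non-stationary normally distributed rewards — can be sketched as below. The noise parameters and diffuse prior are illustrative assumptions, not values from the paper:

```python
import random

def kalman_update(mean, var, obs, obs_var):
    """Scalar Kalman measurement update for one arm's reward estimate."""
    gain = var / (var + obs_var)
    return mean + gain * (obs - mean), (1.0 - gain) * var

def sibling_kalman_bandit(reward_fn, k, steps=500, obs_var=1.0,
                          trans_var=0.01, seed=3):
    """Thompson-style sampling from per-arm Gaussian (Kalman) posteriors."""
    random.seed(seed)
    means, variances = [0.0] * k, [100.0] * k        # diffuse priors
    pulls = [0] * k
    for _ in range(steps):
        variances = [v + trans_var for v in variances]  # random-walk drift
        draws = [random.gauss(m, v ** 0.5) for m, v in zip(means, variances)]
        a = max(range(k), key=lambda i: draws[i])
        means[a], variances[a] = kalman_update(means[a], variances[a],
                                               reward_fn(a), obs_var)
        pulls[a] += 1
    return pulls, means
```

Inflating every arm's variance each step is what keeps unplayed arms eligible for re-exploration when the environment drifts, which a stationary Beta-posterior scheme cannot do.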

Bridging Computational Neuroscience and Machine Learning on Non-Stationary Multi-Armed Bandits [article]

George Velentzas, Costas Tzafestas, Mehdi Khamassi
2017 bioRxiv   pre-print
Fast adaptation to changes in the environment requires both natural and artificial agents to be able to dynamically tune an exploration-exploitation trade-off during learning.  ...  The problem of finding an efficient exploration-exploitation trade-off has been well studied both in the Machine Learning and Computational Neuroscience fields.  ...  Nationale de la Recherche (ANR-12-CORD-0030 Roboergosum Project and ANR-11-IDEX-0004-02 Sorbonne-Universités SU-15-R-PERSU-14 Robot Parallearning Project), and by Labex SMART (ANR-11-LABX-65 Online Budgeted Learning  ... 
doi:10.1101/117598 fatcat:qb6qicn46ffuhf7dp42kirhqd4
Showing results 1 — 15 out of 4,802 results