A Note on KL-UCB+ Policy for the Stochastic Bandit [article]

Junya Honda
2019 arXiv   pre-print
A classic setting of the stochastic K-armed bandit problem is considered in this note.  ...  This note demonstrates that a simple proof of the asymptotic optimality of the KL-UCB+ policy can be given by the same technique as those used for analyses of other known policies.  ...  Acknowledgement The author thanks Mr. Kohei Takagi for his survey on the analysis of KL-UCB+. The author also thanks Professor Vincent Y. F. Tan for finding many typos in the first version.  ... 
arXiv:1903.07839v2 fatcat:hxlahohmrbayfbtgo5giiugple

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond [article]

Aurélien Garivier, Olivier Cappé
2013 arXiv   pre-print
This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems.  ...  KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviations results of independent interest which are stated and proved in the Appendix.  ...  In the stochastic bandit problem, the agent sequentially chooses, for t = 1, 2, ..., n, an arm A_t ∈ {1, ..., K}, and receives a reward X_t such that, conditionally on the arm choices A_1, A_2  ... 
arXiv:1102.2490v5 fatcat:mwn5khrgundyje4phayggllzvi
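
The entry above describes KL-UCB as an index policy. As a rough illustration of what such an index looks like, here is a minimal sketch for Bernoulli rewards. It assumes the commonly stated form of the index (the largest q with N·kl(mean, q) ≤ log t + c·log log t); the function names, the bisection tolerance `tol`, and the default `c=0.0` are illustrative choices, not taken from the listing.

```python
import math


def bernoulli_kl(p, q, eps=1e-15):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Approximate the largest q in [mean, 1] with pulls * kl(mean, q) <= log t + c log log t.

    mean  : empirical mean reward of the arm (assumed to lie in [0, 1])
    pulls : number of times the arm has been played (assumed >= 1)
    t     : current round (assumed >= 1)
    c     : weight of the log log t term (hypothetical default)
    """
    log_t = math.log(t)
    budget = (log_t + c * math.log(max(log_t, 1.0))) / pulls
    lo, hi = mean, 1.0
    # kl(mean, q) is increasing in q on [mean, 1], so bisection finds the boundary.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

A policy built on this sketch would compute `kl_ucb_index` for every arm at each round and play the arm with the largest value; arms with few pulls get a wider confidence region and therefore a higher index.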

UCBoost: A Boosting Approach to Tame Complexity and Optimality for Stochastic Bandits

Fang Liu, Sinong Wang, Swapna Buccapatnam, Ness Shroff
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
We propose a boosting approach to Upper Confidence Bound based algorithms for stochastic bandits, that we call UCBoost. Specifically, we propose two types of UCBoost algorithms.  ...  Finally, we present numerical results which show that UCBoost(epsilon) can achieve the same regret performance as the standard kl-UCB while incurring only 1% of the computational cost of kl-UCB.  ...  Acknowledgments This work has been supported in part by grants from the Army Research Office W911NF-14-1-0368 and MURI W911NF-12-1-0385, and grants from the Office of Naval Research N00014-17-1-2417 and  ... 
doi:10.24963/ijcai.2018/338 dblp:conf/ijcai/LiuWBS18 fatcat:qc3pvjugbrhhffp35aaeb66tji

Collaborative Learning of Stochastic Bandits over a Social Network [article]

Ravi Kumar Kolla, Krishna Jagannathan, Aditya Gopalan
2016 arXiv   pre-print
We consider a collaborative online learning paradigm, wherein a group of agents connected through a social network are engaged in playing a stochastic multi-armed bandit game.  ...  We also derive network-wide regret bounds for the algorithm applied to general networks. We conduct numerical experiments on a variety of networks to corroborate our analytical results.  ...  Algorithm 1: Upper-Confidence-Bound-Network (UCB-Network). Each user in G follows the UCB-user policy. UCB-user policy for a user v: Initialization: for 1 ≤ t ≤ K, play arm t. Loop: for K ≤ t ≤ n, a_v(t + 1)  ... 
arXiv:1602.08886v2 fatcat:lyvrltf6xnarnjxkx2zbtwfggy

Selfish Bandit based Cognitive Anti-jamming Strategy for Aeronautic Swarm network in Presence of Multiple Jammers

Haitao Li, Jiawei Luo, Changjun Liu
2019 IEEE Access  
The simulation results validate that the aggregate average throughput and cumulative regret obtained with the proposed anti-jamming strategy outperform those of the well-known UCB and kl-UCB++ bandit algorithms.  ...  Finally, using the jamming sensing output to calculate reward and with the objective of maximizing the throughput of each airborne radio, a decentralized selfish doubling trick kl-UCB++ anti-jamming strategy  ...  KL-UCB++ ALGORITHM The KL-UCB++ algorithm is a slight modification of the KL-UCB+ algorithm. We first present some definitions for a bandit problem with K actions indexed by a ∈ {a_1, ..., a_K}.  ... 
doi:10.1109/access.2019.2896709 fatcat:xzml5h2akbakvbx2ioyzcmwyfm

Pareto Upper Confidence Bounds algorithms: An empirical study

Madalina M Drugan, Ann Nowe, Bernard Manderick
2014 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)  
We propose a new regret metric based on the Kullback-Leibler divergence to measure the performance of a multi-objective multi-armed bandit algorithm.  ...  The goal of the improved Pareto UCB algorithm, i.e. iPUCB, is to identify the set of best arms, or the Pareto front, in a fixed budget of arm pulls.  ...  We prove logarithmic upper regret bounds for the improved Pareto UCB algorithm and we propose a KL regret metric as a performance metric for the experiments.  ... 
doi:10.1109/adprl.2014.7010620 dblp:conf/adprl/DruganNM14 fatcat:4km4qmz4kzdnzdlvehpkwtfa7m

Fairness of Exposure in Stochastic Bandits [article]

Lequn Wang, Yiwei Bai, Wen Sun, Thorsten Joachims
2021 arXiv   pre-print
We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits.  ...  To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users.  ...  All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.  ... 
arXiv:2103.02735v2 fatcat:qk6nkmsoqbaxjmx5gzxihped3m

Cascading Bandits: Learning to Rank in the Cascade Model [article]

Branislav Kveton, Csaba Szepesvari, Zheng Wen, Azin Ashkan
2015 arXiv   pre-print
We formulate our problem as a stochastic combinatorial partial monitoring problem. We propose two algorithms for solving it, CascadeUCB1 and CascadeKL-UCB.  ...  We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits.  ...  Ranked Bandits In our final experiment, we compare CascadeKL-UCB to a ranked bandit (Section 6) where the base bandit algorithm is KL-UCB. We refer to this method as RankedKL-UCB.  ... 
arXiv:1502.02763v2 fatcat:o2kila5lx5afpfmqe4en4eg5ia

A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit [article]

Giuseppe Burtini, Jason Loeppky, Ramon Lawrence
2015 arXiv   pre-print
We first explore the traditional stochastic model of a multi-armed bandit, then explore a taxonomic scheme of complications to that model, for each complication relating it to a specific requirement or  ...  We survey and synthesize the work of the online statistical learning paradigm referred to as multi-armed bandits integrating the existing research as a resource for a certain class of online experiments  ...  KL-UCB KL-UCB [94] presents a modern approach to UCB for the standard stochastic bandits problem where the padding function is derived from the Kullback-Leibler (K-L) divergence.  ... 
arXiv:1510.00757v4 fatcat:eyxqdq3yl5fpdbv53wtnkfa25a

Greedy Confidence Pursuit: A Pragmatic Approach to Multi-bandit Optimization [chapter]

Philip Bachman, Doina Precup
2013 Lecture Notes in Computer Science  
We formalize this problem in the framework of bandit optimization as follows: given a set of multiple multi-armed bandits and a budget on the total number of trials allocated among them, select the top-m  ...  arms (with high confidence) for as many of the bandits as possible.  ...  For bandit problems, posterior sampling policies π p select arms as follows: π p (a ij |H) ∝ p a ij = arg max a kl f H θ (a kl ) H , (4) in which π p (a ij |H) is the probability of π p selecting a ij  ... 
doi:10.1007/978-3-642-40988-2_16 fatcat:3gp4ikabnvgvrjdaetq4x3th64

Regret vs. Communication: Distributed Stochastic Multi-Armed Bandits and Beyond [article]

Shuang Liu, Cheng Chen, Zhihua Zhang
2015 arXiv   pre-print
In this paper, we consider the distributed stochastic multi-armed bandit problem, where a global arm set can be accessed by multiple players independently.  ...  When the time horizon is known, we propose the Over-Exploration strategy, which only requires one-round communication and whose regret does not scale with the number of players.  ...  Note that the KL-UCB adaptation can be seen as a special case of the DKLUCB policy. In fact, when α(C) = 1, DKLUCB is identical to the KL-UCB adaptation. Theorem 9.  ... 
arXiv:1504.03509v2 fatcat:cdoxxkfmbrgszma46o5qltrrna

Bandits with Budgets

Richard Combes, Chong Jiang, Rayadurgam Srikant
2015 Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS '15  
Further, we show that B-KL-UCB, a natural variant of KL-UCB, is asymptotically optimal for these cases.  ...  Numerical experiments (based on a real-world data set) further suggest that B-KL-UCB also has the same or better finite-time performance when compared to various previously proposed (UCB-like) algorithms  ...  Under policy π = B-KL-UCB, for all 0 < ε < Δ the regret admits the upper bound:  ...  Theorem 5.2. (i) Under policy π = B-KL-UCB, for all 0 < ε < Δ the regret admits the upper bound:  ...  Theorem 5.3.  ... 
doi:10.1145/2745844.2745847 dblp:conf/sigmetrics/CombesJS15 fatcat:mwatvqtcmff5dhlqk3n4shifjq

Regional Multi-Armed Bandits [article]

Zhiyang Wang, Ruida Zhou, Cong Shen
2018 arXiv   pre-print
Moreover, we propose SW-UCB-g, which is an extension of UCB-g for a non-stationary environment where the parameters slowly vary over time.  ...  This regional bandit model naturally bridges the non-informative bandit setting where the player can only learn the chosen arm, and the global bandit model where sampling one arms reveals information of  ...  Acknowledgements This work has been supported by Natural Science Foundation of China (NSFC) under Grant 61572455, and the 100 Talent Program of Chinese Academy of Sciences.  ... 
arXiv:1802.07917v1 fatcat:mf2mfsrsubfojpwedkwxzrdedq

Meta-Learning of Exploration/Exploitation Strategies: The Multi-Armed Bandit Case [article]

Francis Maes, Damien Ernst, Louis Wehenkel
2012 arXiv   pre-print
KL-UCB and epsilon greedy); they also evaluate the robustness of the learnt E/E strategies, by tests carried out on arms whose rewards follow a truncated Gaussian distribution.  ...  Our experiments, with two-armed Bernoulli bandit problems and various playing budgets, show that the meta-learnt E/E strategies outperform generic strategies of the literature (UCB1, UCB1-Tuned, UCB-v,  ...  Note that the objective function we want to optimize, in addition to being stochastic, has a complex relation with the parameters θ.  ... 
arXiv:1207.5208v1 fatcat:7b2zygar5nc37nt4z6kqoaibry

Dueling Bandits with Adversarial Sleeping [article]

Aadirupa Saha, Pierre Gaillard
2021 arXiv   pre-print
We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA).  ...  The goal is to find an optimal 'no-regret' policy that can identify the best available item at each round, as opposed to the standard 'fixed best-arm regret objective' of dueling bandits.  ...  On the other hand over the last decade, the relative feedback variants of stochastic MAB problem has seen a widespread resurgence in the form of the Dueling Bandit problem, where, instead of getting noisy  ... 
arXiv:2107.02274v1 fatcat:l4lqydiz6vff5ovp4si76shrzy