Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm [article]

Junpei Komiyama, Junya Honda, Hiroshi Nakagawa
2016 arXiv   pre-print
We propose Copeland Winners Relative Minimum Empirical Divergence (CW-RMED) and derive an asymptotically optimal regret bound for it.  ...  However, it is not known whether the algorithm can be efficiently computed or not. To address this issue, we devise an efficient version (ECW-RMED) and derive its asymptotic regret bound.  ...  Acknowledgements This work was supported in part by JSPS KAKENHI Grant Number 15J09850 and 16H00881.  ... 
arXiv:1605.01677v2 fatcat:vmgfq7rhz5hunhbh7iikyyk5eu

Advancements in Dueling Bandits

Yanan Sui, Masrour Zoghi, Katja Hofmann, Yisong Yue
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
In this survey, we review recent results in the theories, algorithms, and applications of the dueling bandits problem.  ...  As an emerging domain, the theories and algorithms of dueling bandits have been intensively studied during the past few years.  ...  Indeed, the problem of devising a computationally efficient contextual dueling bandits algorithm with optimal regret bound remains an interesting open problem.  ... 
doi:10.24963/ijcai.2018/776 dblp:conf/ijcai/SuiZHY18 fatcat:vfao6bpxt5aifbwyvtk3wg2cu4

Double Thompson Sampling for Dueling Bandits [article]

Huasen Wu, Xin Liu
2016 arXiv   pre-print
For general Copeland dueling bandits, we show that D-TS achieves O(K^2 log T) regret.  ...  In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems.  ...  bandits [5], CW-RMED and its computationally efficient version ECW-RMED for general Copeland dueling bandits [7].  ... 
arXiv:1604.07101v2 fatcat:l7mooquranbzpohyqblnwhe5ie

MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation [article]

Chang Li, Ilya Markov, Maarten de Rijke, Masrour Zoghi
2020 arXiv   pre-print
Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms.  ...  The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search.  ...  We also thank our editor and the anonymous reviewers for extensive comments and suggestions that helped us to improve the paper.  ... 
arXiv:1812.04412v2 fatcat:jgtm6ukpknh3ppmsjyhmfkvh7u

Preference-based Online Learning with Dueling Bandits: A Survey [article]

Viktor Bengs, Robert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier
2021 arXiv   pre-print
The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits.  ...  In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives  ...  Acknowledgments Eyke Hüllermeier, Adil El Mesaoudi-Paul and Viktor Bengs gratefully acknowledge financial support by the German Research Foundation (DFG).  ... 
arXiv:1807.11398v2 fatcat:jsu6gap5pbgbtm735fgf4aqwmu

Optimizing Ranking Systems Online as Bandits [article]

Chang Li
2021 arXiv   pre-print
We formulate this nonstationary online learning to rank problem as cascade non-stationary bandits and propose CascadeDUCB and CascadeSWUCB algorithms to solve the problem.  ...  Bandit is a general online learning framework and can be used in our optimization task.  ...  RMED1 is motivated by the lower bound of the Condorcet dueling bandit problem and matches the lower bound up to a factor of O(K^2), which indicates that RMED1 has low regret in small-scale problems but  ... 
arXiv:2110.05807v1 fatcat:mp3fctx6sffhjej7idwc7v33ca

Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits [article]

Shubham Gupta, Aadirupa Saha
2021 arXiv   pre-print
We study the problem of dynamic regret minimization in K-armed Dueling Bandits under non-stationary or time varying preferences.  ...  We next use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarities.  ...  Regret lower bound and optimal algorithm in dueling bandit problem. In COLT, pages 1141–1154, 2015. [17] Tor Lattimore and Csaba Szepesvári.  ... 
arXiv:2111.03917v1 fatcat:izn5aqexpngurjgxcpk47hmuvm

Regret Minimization in Stochastic Contextual Dueling Bandits [article]

Aadirupa Saha, Aditya Gopalan
2021 arXiv   pre-print
algorithms along with a matching lower bound analysis.  ...  However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal  ...  Acknowledgements Aadirupa Saha thanks Branislav Kveton for all the useful initial discussions during her internship at Google, Mountain View, and Ofer Meshi, Craig Boutilier for hosting her internship.  ... 
arXiv:2002.08583v2 fatcat:nxobgxl5jnfa7gx6hzt7qvrei4

Dueling Bandits with Dependent Arms [article]

Bangrui Chen, Peter I. Frazier
2017 arXiv   pre-print
We study dueling bandits with weak utility-based regret when preferences over arms have a total order and carry observable feature vectors.  ...  We propose an algorithm for this setting called Comparing The Best (CTB), which we show has constant expected cumulative weak utility-based regret.  ...  That work instead studies the dueling bandits assuming a Copeland winner, which is guaranteed to exist, and proposes two algorithms, CCB and SCB, which achieve O(N log(T)) strong regret in this more general  ... 
arXiv:1605.08838v2 fatcat:hkmcgyqzszh6fpokstlbpvx6kq

Dueling Bandits With Weak Regret [article]

Bangrui Chen, Peter I. Frazier
2017 arXiv   pre-print
WS-W is the first dueling bandit algorithm with weak regret that is constant in time.  ...  We study the dueling bandit problem in the Condorcet winner setting, and consider two notions of regret: the more well-studied strong regret, which is 0 only when both arms pulled are the Condorcet winner  ...  Acknowledgements The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, NSF DMR-1120296, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16  ... 
arXiv:1706.04304v1 fatcat:rztmhsnpmfebfc3tmxua22ilpu

Search Engines that Learn from Their Users

Anne Schuth
2016 SIGIR Forum  
The dueling bandit gradient descent (DBGD) algorithm by Yue and Joachims [207], which we describe in Section 2.5.1, can be seen as an algorithm to solve a variant of the K-armed dueling bandits problem  ...  The DBGD algorithm can be seen as an algorithm to solve a continuous variant of the K-armed dueling bandits problem which we introduced in Section 2.3.4.  ...  Their rankings can be updated at any time and as often as desired. Both click feedback and aggregated outcomes are made available directly and are updated constantly.  ... 
doi:10.1145/2964797.2964817 fatcat:lk24shg7dzbyzk7kkr4x6cjbna

Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [article]

Aadirupa Saha, Akshay Krishnamurthy
This resolves an open problem of Dudík et al. [2015] on oracle efficient, regret-optimal algorithms for contextual dueling bandits.  ...  The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over $\mathcal F$.  ...  Acknowledgements AK thanks Akshay Balsubramani, Alekh Agarwal, Miroslav Dudík, and Robert E. Schapire for fruitful discussions regarding the result in Section 5.  ... 
doi:10.48550/arxiv.2111.12306 fatcat:s7ujwot3dreffby47bwp6n5ufm

Adaptive Preference Learning With Bandit Feedback: Information Filtering, Dueling Bandits and Incentivizing Exploration

Bangrui Chen
For each type of feedback and application setting, we provide an algorithm and a theoretical analysis bounding its regret.  ...  We connect these settings respectively to existing work on classical multi-armed bandits, dueling bandits, and incentivizing exploration.  ...  His immense knowledge in the field, enthusiasm about research and unparalleled  ... 
doi:10.7298/x4251gcq fatcat:vbsrx3qjm5bo7fk57jz4djz2tq

Design and Evaluation of Robust Control Methods for Robotic Transfemoral Prostheses

Nitish Thatte
We also propose a pair of optimization methods that allow us to select prosthesis control parameters using qualitative preference feedback from the user.  ...  , and rough ground.  ...  Moreover, the dueling bandit algorithm is well suited to lifelong learning. Since the algorithm seeks to minimize regret, we can ensure its exploration is only as obtrusive as necessary.  ... 
doi:10.1184/r1/8397551 fatcat:ouzitvlnqfa2zgudsuwjpz26ha