A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is application/pdf.
Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm
[article] · 2016 · arXiv pre-print
We propose Copeland Winners Relative Minimum Empirical Divergence (CW-RMED) and derive an asymptotically optimal regret bound for it. ...
However, it is not known whether the algorithm can be efficiently computed or not. To address this issue, we devise an efficient version (ECW-RMED) and derive its asymptotic regret bound. ...
Acknowledgements This work was supported in part by JSPS KAKENHI Grant Number 15J09850 and 16H00881. ...
arXiv:1605.01677v2
fatcat:vmgfq7rhz5hunhbh7iikyyk5eu
Advancements in Dueling Bandits
2018 · Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
In this survey, we review recent results in the theories, algorithms, and applications of the dueling bandits problem. ...
As an emerging domain, the theories and algorithms of dueling bandits have been intensively studied during the past few years. ...
Indeed, the problem of devising a computationally efficient contextual dueling bandits algorithm with optimal regret bound remains an interesting open problem. ...
doi:10.24963/ijcai.2018/776
dblp:conf/ijcai/SuiZHY18
fatcat:vfao6bpxt5aifbwyvtk3wg2cu4
Double Thompson Sampling for Dueling Bandits
[article] · 2016 · arXiv pre-print
For general Copeland dueling bandits, we show that D-TS achieves O(K^2 log T) regret. ...
In this paper, we propose a Double Thompson Sampling (D-TS) algorithm for dueling bandit problems. ...
bandits [5], CW-RMED and its computationally efficient version ECW-RMED for general Copeland dueling bandits [7]. ...
arXiv:1604.07101v2
fatcat:l7mooquranbzpohyqblnwhe5ie
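The D-TS entry above selects both arms of each duel by posterior sampling: a first Thompson draw over the pairwise preference matrix picks a Copeland winner of the sampled matrix, and a second, independent draw picks its challenger. A minimal sketch of that double-sampling step (simplified: it omits the confidence-bound pruning the actual D-TS algorithm applies before sampling; `dts_round` and its argument names are illustrative, not from the paper):

```python
import numpy as np

def dts_round(wins, rng):
    """One round of a simplified Double Thompson Sampling step.

    wins[i][j] counts how often arm i has beaten arm j so far.
    Returns the pair (first, second) of arm indices to duel next.
    """
    K = wins.shape[0]
    # First sample: draw a full preference matrix from Beta posteriors
    # (Beta(wins[i][j]+1, wins[j][i]+1)) and take a Copeland winner of it.
    theta = rng.beta(wins + 1, wins.T + 1)
    np.fill_diagonal(theta, 0.5)
    copeland = (theta > 0.5).sum(axis=1)  # sampled Copeland scores
    first = int(np.argmax(copeland))
    # Second, independent sample: fresh Beta draws only for duels against
    # `first`; the strongest sampled challenger becomes the second arm.
    theta2 = rng.beta(wins[:, first] + 1, wins[first, :] + 1)
    theta2[first] = 0.5  # a self-duel is allowed but not favored
    second = int(np.argmax(theta2))
    return first, second
```

Sampling the two arms from independent draws is what keeps exploration broad: the challenger is not forced to be the runner-up of the first sample.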
MergeDTS: A Method for Effective Large-Scale Online Ranker Evaluation
[article] · 2020 · arXiv pre-print
Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms. ...
The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search. ...
We also thank our editor and the anonymous reviewers for extensive comments and suggestions that helped us to improve the paper. ...
arXiv:1812.04412v2
fatcat:jgtm6ukpknh3ppmsjyhmfkvh7u
Preference-based Online Learning with Dueling Bandits: A Survey
[article] · 2021 · arXiv pre-print
The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. ...
In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives ...
Acknowledgments Eyke Hüllermeier, Adil El Mesaoudi-Paul and Viktor Bengs gratefully acknowledge financial support by the German Research Foundation (DFG). ...
arXiv:1807.11398v2
fatcat:jsu6gap5pbgbtm735fgf4aqwmu
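The survey snippet above defines multi-armed bandits as problems where an agent must simultaneously explore and exploit a set of alternatives. The classic epsilon-greedy loop makes that trade-off concrete (a generic textbook sketch, not an algorithm from the survey; the function and parameter names are illustrative):

```python
import random

def epsilon_greedy(pull, K, T, eps=0.1, seed=0):
    """Minimal epsilon-greedy bandit loop over K arms for T rounds.

    With probability eps, explore a uniformly random arm; otherwise
    exploit the arm with the best empirical mean reward so far.
    `pull(arm)` returns the observed reward for that arm.
    """
    rng = random.Random(seed)
    counts = [0] * K      # pulls per arm
    sums = [0.0] * K      # total reward per arm
    for _ in range(T):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(K)  # explore (or break ties at the start)
        else:
            arm = max(range(K), key=lambda i: sums[i] / counts[i])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Dueling bandits replace the scalar `pull(arm)` with a noisy pairwise comparison `duel(arm_a, arm_b)`, which is exactly the feedback model the surveyed algorithms are built around.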
Optimizing Ranking Systems Online as Bandits
[article] · 2021 · arXiv pre-print
We formulate this nonstationary online learning to rank problem as cascade non-stationary bandits and propose CascadeDUCB and CascadeSWUCB algorithms to solve the problem. ...
Bandit is a general online learning framework and can be used in our optimization task. ...
RMED1 is motivated by the lower bound of the Condorcet dueling bandit problem and matches the lower bound up to a factor of O(K^2), which indicates that RMED1 has low regret in small-scale problems but ...
arXiv:2110.05807v1
fatcat:mp3fctx6sffhjej7idwc7v33ca
Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits
[article] · 2021 · arXiv pre-print
We study the problem of dynamic regret minimization in K-armed Dueling Bandits under non-stationary or time varying preferences. ...
We next use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarities. ...
Regret lower bound and optimal algorithm in dueling bandit problem. In COLT, pages 1141–1154, 2015.
[17] Tor Lattimore and Csaba Szepesvári. ...
arXiv:2111.03917v1
fatcat:izn5aqexpngurjgxcpk47hmuvm
Regret Minimization in Stochastic Contextual Dueling Bandits
[article] · 2021 · arXiv pre-print
algorithms along with a matching lower bound analysis. ...
However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal ...
Acknowledgements Aadirupa Saha thanks Branislav Kveton for all the useful initial discussions during her internship at Google, Mountain View, and Ofer Meshi, Craig Boutilier for hosting her internship. ...
arXiv:2002.08583v2
fatcat:nxobgxl5jnfa7gx6hzt7qvrei4
Dueling Bandits with Dependent Arms
[article] · 2017 · arXiv pre-print
We study dueling bandits with weak utility-based regret when preferences over arms have a total order and carry observable feature vectors. ...
We propose an algorithm for this setting called Comparing The Best (CTB), which we show has constant expected cumulative weak utility-based regret. ...
That work instead studies the dueling bandits assuming a Copeland winner, which is guaranteed to exist, and proposes two algorithms, CCB and SCB, which achieve O(N log(T)) strong regret in this more general ...
arXiv:1605.08838v2
fatcat:hkmcgyqzszh6fpokstlbpvx6kq
Dueling Bandits With Weak Regret
[article] · 2017 · arXiv pre-print
WS-W is the first dueling bandit algorithm with weak regret that is constant in time. ...
We study the dueling bandit problem in the Condorcet winner setting, and consider two notions of regret: the better-studied strong regret, which is 0 only when both arms pulled are the Condorcet winner ...
Acknowledgements The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, NSF DMR-1120296, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16 ...
arXiv:1706.04304v1
fatcat:rztmhsnpmfebfc3tmxua22ilpu
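The distinction the entry above draws between the two regret notions is mechanical: weak regret is 0 if either dueled arm is the Condorcet winner, strong regret only if both are. A one-function sketch of the binary (0/1) version of that check (the actual definitions weight rounds by preference gaps; `per_round_regret` is an illustrative helper, not from the paper):

```python
def per_round_regret(first, second, condorcet_winner):
    """Binary per-round regret for a duel (first, second).

    Weak regret is 0 when at least one dueled arm is the Condorcet
    winner; strong regret is 0 only when both arms are (i.e. the
    winner was dueled against itself). Gap-weighted definitions
    scale these indicators by the preference gaps.
    """
    weak = 0 if condorcet_winner in (first, second) else 1
    strong = 0 if first == second == condorcet_winner else 1
    return weak, strong
```

This is why a constant weak-regret bound, as claimed for WS-W above, is achievable: once the Condorcet winner is identified, playing it as one of the two arms every round stops weak regret from accumulating.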
Search Engines that Learn from Their Users
2016 · SIGIR Forum
The dueling bandit gradient descent (DBGD) algorithm by Yue and Joachims [207], which we describe in Section 2.5.1, can be seen as an algorithm to solve a variant of the K-armed dueling bandits problem ...
The DBGD algorithm can be seen as an algorithm to solve a continuous variant of the K-armed dueling bandits problem which we introduced in Section 2.3.4. ...
Their rankings can be updated at any time and as often as desired. Both click feedback and aggregated outcomes are made available directly and are updated constantly. ...
doi:10.1145/2964797.2964817
fatcat:lk24shg7dzbyzk7kkr4x6cjbna
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
[article] · 2021
This resolves an open problem of Dudík et al. [2015] on oracle efficient, regret-optimal algorithms for contextual dueling bandits. ...
The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over $\mathcal F$. ...
Acknowledgements AK thanks Akshay Balsubramani, Alekh Agarwal, Miroslav Dudík, and Robert E. Schapire for fruitful discussions regarding the result in Section 5. ...
doi:10.48550/arxiv.2111.12306
fatcat:s7ujwot3dreffby47bwp6n5ufm
Adaptive Preference Learning With Bandit Feedback: Information Filtering, Dueling Bandits and Incentivizing Exploration
2017
For each type of feedback and application setting, we provide an algorithm and a theoretical analysis bounding its regret. ...
We connect these settings respectively to existing work on classical multi-armed bandits, dueling bandits, and incentivizing exploration. ...
His immense knowledge in the field, enthusiasm about research and unparalleled ...
doi:10.7298/x4251gcq
fatcat:vbsrx3qjm5bo7fk57jz4djz2tq
Design and Evaluation of Robust Control Methods for Robotic Transfemoral Prostheses
2019
We also propose a pair of optimization methods that allow us to select prosthesis control parameters using qualitative preference feedback from the user. ...
, and rough ground. ...
Moreover, the dueling bandit algorithm is well suited to lifelong learning. Since the algorithm seeks to minimize regret, we can ensure its exploration is only as obtrusive as necessary. ...
doi:10.1184/r1/8397551
fatcat:ouzitvlnqfa2zgudsuwjpz26ha