Fully Gap-Dependent Bounds for Multinomial Logit Bandit [article]

Jiaqi Yang
2020 arXiv   pre-print
To our knowledge, our algorithms are the first to achieve gap-dependent bounds that fully depend on the suboptimality gaps of all items.  ...  We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most K from a pool of N items, and the buyer purchases an item from the assortment  ...  Acknowledgments Jiaqi Yang would like to thank Yuan Zhou for the invaluable comments and suggestions.  ... 
arXiv:2011.09998v1 fatcat:e4rktnkesneldc4poo5qnr5qb4

PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits [article]

Bianca Dumitrascu, Karen Feng, Barbara E Engelhardt
2018 arXiv   pre-print
PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic  ...  contextual bandits.  ...  We propose Pólya-Gamma augmented Thompson sampling (PG-TS), a fully Bayesian alternative to Laplace-TS.  ... 
arXiv:1805.07458v1 fatcat:a5rc4ujdlfh2jgj52misat6tsm

Pure Exploration with Structured Preference Feedback [article]

Shubham Gupta, Aadirupa Saha, Sumeet Katariya
2021 arXiv   pre-print
We also derive an instance-dependent lower bound of Ω((d/Δ^2) log(1/δ)) which matches our upper bound on a worst-case instance.  ...  algorithms that guarantee the detection of the best arm in Õ(d^2/(K Δ^2)) samples with probability at least 1 - δ, where d is the dimension of the arm-features and Δ is the appropriate notion of utility gap  ...  Devising a fully adaptive strategy in this setting is a promising direction for future work. Another interesting problem is bridging the gap between the upper and lower bounds in the general case.  ... 
arXiv:2104.05294v1 fatcat:ial2ci7tu5bwzjc3xo2aq5mx6q

Combinatorial Bandits with Relative Feedback [article]

Aadirupa Saha, Aditya Gopalan
2020 arXiv   pre-print
For both settings, we devise instance-dependent and order-optimal regret algorithms with regret O((n/m) ln T) and O((n/k) ln T), respectively.  ...  Specifically, we study two regret minimisation problems over subsets of a finite ground set [n], with subset-wise relative preference information feedback according to the Multinomial logit choice model  ...  Aadirupa Saha thanks Arun Rajkumar for the valuable discussions, and the Tata Trusts and ACM-India/IARCS Travel Grants for travel support.  ... 
arXiv:1903.00543v2 fatcat:jvg3546k6ndgtppj77hfck7w2u

MNL-Bandit: A Dynamic Learning Approach to Assortment Selection [article]

Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi
2018 arXiv   pre-print
logit (MNL) choice model.  ...  We refer to this exploration-exploitation formulation as the MNL-Bandit problem.  ...  As noted earlier, we assume consumer preferences are modeled using a multinomial logit (MNL) model.  ... 
arXiv:1706.03880v2 fatcat:7re5aw3sjnenrmttlg5x4x43hu

Dynamic Assortment Selection under the Nested Logit Models [article]

Xi Chen and Chao Shi and Yining Wang and Yuan Zhou
2021 arXiv   pre-print
Although the dynamic assortment planning problem has received increasing attention in revenue management, most existing work is based on the multinomial logit (MNL) choice models.  ...  We further provide a lower bound result of Ω(√(MT)), which shows the near optimality of the upper bound when T is much larger than M and N.  ...  Acknowledgement We would like to thank Paat Rusmevichientong for pointing out the problem and several useful references to us. Xi Chen would like to thank the Adobe Research Award for supporting this research.  ... 
arXiv:1806.10410v2 fatcat:pip3smb2zngmrfof2ota7twrbq

Introduction to Multi-Armed Bandits [article]

Aleksandrs Slivkins
2022 arXiv   pre-print
Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty.  ...  The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.  ...  , linear contextual bandits, and multinomial-logit bandits.  ... 
arXiv:1904.07272v7 fatcat:pptyhyyshrdyhhf7bdonz5dsv4

Preference-based Online Learning with Dueling Bandits: A Survey [article]

Viktor Bengs, Robert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier
2021 arXiv   pre-print
This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction  ...  The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits.  ...  We would also like to thank two anonymous referees for their valuable comments and suggestions, which helped to significantly improve this survey.  ... 
arXiv:1807.11398v2 fatcat:jsu6gap5pbgbtm735fgf4aqwmu

Thompson Sampling Algorithms for Cascading Bandits [article]

Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan
2021 arXiv   pre-print
TS-Cascade achieves the state-of-the-art regret bound for cascading bandits.  ...  While Thompson sampling (TS) algorithms have been shown to be empirically superior to Upper Confidence Bound (UCB) algorithms for cascading bandits, theoretical guarantees are only known for the latter  ...  For example, the posterior update in Algorithm 2 in Agrawal et al. (2017) for the multinomial logit bandit problem is not conjugate.  ... 
arXiv:1810.01187v4 fatcat:o6ptav6banhtdao6wx77a2gsjm

Sparsity-Agnostic Lasso Bandit [article]

Min-hwan Oh, Garud Iyengar, Assaf Zeevi
2021 arXiv   pre-print
Essentially all existing algorithms for sparse bandits require a priori knowledge of the value of the sparsity index s_0.  ...  The main contribution of this paper is to propose an algorithm that does not require prior knowledge of the sparsity index s_0 and establish tight regret bounds on its performance under mild conditions  ...  Thompson sampling for multinomial logit contextual bandits. In Advances in Neural Information Processing Systems, pages 3151-3161, 2019. Garvesh Raskutti, Martin J Wainwright, and Bin Yu.  ... 
arXiv:2007.08477v2 fatcat:can47dxzwvchhafh2bkfxsuk7y

Multinomial Logit Bandit with Low Switching Cost [article]

Kefan Dong, Yingkai Li, Qin Zhang, Yuan Zhou
2020 arXiv   pre-print
We study multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible when achieving almost optimal minimax regret.  ...  We present an anytime algorithm (AT-DUCB) with O(N log T) assortment switches, almost matching the lower bound Ω(N log T/loglog T).  ...  While it is not clear to us whether the dependence on N delivered by this analysis is optimal, we also discuss the relationship between the analysis and an extensively studied (but not yet fully resolved  ... 
arXiv:2007.04876v1 fatcat:s25demt6frflzaonokpi7xeadq

Online Learning and Optimization for Revenue Management Problems with Add-on Discounts [article]

David Simchi-Levi, Rui Sun, Huanan Zhang
2020 arXiv   pre-print
Recent research on assortment planning problems also focuses on the online setting where the parameters of the underlying choice models, such as multinomial logit (MNL), are not known and need to be learned  ...  One of the classic multi-armed bandit models is the stochastic bandit, where the reward for pulling each arm is assumed to be drawn i.i.d. from an unknown probability distribution.  ...  Note that the demand for each product under all allowable prices always lies in [0, 1], and thus can be interpreted as the mean of a Bernoulli random variable.  ... 
arXiv:2005.00947v1 fatcat:fsdvi3kvvnfb7bwybhhdlhpg3m

Reinforcement Learning in Economics and Finance [article]

Arthur Charpentier and Romuald Elie and Carl Remlinger
2020 arXiv   pre-print
As in multi-armed bandit problems, when an agent picks an action, he cannot infer ex-post the rewards induced by other action choices.  ...  Many problems of optimal control, popular in economics for more than forty years, can be expressed in the reinforcement learning framework, and recent advances in computational science, provided in particular  ...  Assuming that rewards have a Gumbel distribution, we obtain a multinomial logit model, where the log-odds ratios are proportional to the value function.  ... 
arXiv:2003.10014v1 fatcat:trmt5cfybbftvd5jegyu4etila

On Sample Complexity Upper and Lower Bounds for Exact Ranking from Noisy Comparisons [article]

Wenbo Ren, Jia Liu, Ness B. Shroff
2021 arXiv   pre-print
We first derive lower bounds for pairwise ranking (i.e., compare two items each time), and then propose (nearly) optimal pairwise ranking algorithms.  ...  By repeatedly and adaptively choosing items to compare, we want to fully rank the items with a certain confidence, and use as few comparisons as possible.  ...  Also, their bounds only depend on the minimal gap Δ but not the Δ_{i,j}'s or Δ_i's, and hence are not tight in most cases.  ... 
arXiv:1909.03194v3 fatcat:walxdj5e6ngfng54lzlr7boo3u

Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection [article]

Wenhao Li, Ningyuan Chen, L. Jeff Hong
2020 arXiv   pre-print
We consider a contextual online learning (multi-armed bandit) problem with high-dimensional covariate 𝐱 and decision 𝐲.  ...  They use the multinomial logit choice model and propose a pricing policy with regret O(log(d_x T)(√T + d_x log T)).  ...  The proposed "LASSO bandit" algorithm obtains regret O((d_x^*)^2 (log T + log d_x)^2), which almost only depends on the effective dimension d_x^*, compared with the regret bound O(d_x^3 log T) of  ... 
arXiv:2009.08265v1 fatcat:fmzgpyc74fdlnmkws6jmjjd2ym