680 Hits in 3.0 sec

Optimal Gradient-based Algorithms for Non-concave Bandit Optimization [article]

Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang
2021 arXiv   pre-print
Bandit problems with linear or concave reward have been extensively studied, but relatively few works have studied bandits with non-concave reward.  ...  For the low-rank generalized linear bandit problem, we provide a minimax-optimal algorithm in the dimension, refuting both conjectures in [LMT21, JWWN19].  ...  Though the reward is non-concave, we combine techniques from two bodies of work, nonconvex optimization and numerical linear algebra, to design robust gradient-based algorithms that converge to global  ... 
arXiv:2107.04518v1 fatcat:4vrtlaz67zhg3gqvwktlscwtz4

Efficient Automatic CASH via Rising Bandits

Yang Li, Jiawei Jiang, Jinyang Gao, Yingxia Shao, Ce Zhang, Bin Cui
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence and the Thirty-Second Innovative Applications of Artificial Intelligence Conference
To alleviate this issue, we propose the alternating optimization framework, where the HPO problem for each ML algorithm and the algorithm selection problem are optimized alternately.  ...  The existing Bayesian optimization (BO) based solutions turn the CASH problem into a Hyperparameter Optimization (HPO) problem by combining the hyperparameters of all machine learning (ML) algorithms.  ...  For example, for large datasets, training linear models is much faster than tree-based models such as gradient boosting. To solve the cost-aware CASH, we develop a variant of Algorithm 1.  ...
doi:10.1609/aaai.v34i04.5910 fatcat:p4prqsqh3zav3nm3z3glbzzhhi
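The alternating scheme this entry describes — optimize the algorithm-selection problem and each algorithm's HPO problem in turn — can be sketched in a few lines. Everything below (`evaluate`, a single hyperparameter in [0, 1], random local search) is an illustrative stand-in, not the paper's actual Rising Bandits method:

```python
import random

def alternating_cash(algorithms, evaluate, rounds=10, hpo_steps=5):
    """Sketch of alternating optimization for CASH: pick an algorithm
    (selection step), then locally tune its hyperparameter (HPO step).
    `evaluate(algo, hp)` returns a validation score to maximize."""
    best = {a: (0.5, evaluate(a, 0.5)) for a in algorithms}  # hp in [0, 1]
    for _ in range(rounds):
        # Selection step: focus on the currently best-scoring algorithm.
        algo = max(best, key=lambda a: best[a][1])
        hp, score = best[algo]
        # HPO step: random local search around the incumbent hp.
        for _ in range(hpo_steps):
            cand = min(1.0, max(0.0, hp + random.uniform(-0.1, 0.1)))
            s = evaluate(algo, cand)
            if s > score:
                hp, score = cand, s
        best[algo] = (hp, score)
    return max(best.items(), key=lambda kv: kv[1][1])
```

In a real CASH system the selection step would itself be a bandit (as in this paper), rather than a greedy argmax.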

Differentiable Bandit Exploration [article]

Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer
2020 arXiv   pre-print
The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach.  ...  To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is general and easy to implement.  ...  Policy Optimization We develop GradBand (Algorithm 1), an iterative gradient-based algorithm for optimizing bandit policies. GradBand is initialized with policy θ 0 ∈ Θ.  ... 
arXiv:2002.06772v2 fatcat:v7li4qi7zjgh5a7qofquu6dtqu
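The idea in this entry — parameterize the policy differentiably and improve it by policy gradients — can be illustrated with a toy two-armed Bernoulli bandit and a one-parameter softmax policy. This is a generic REINFORCE-style sketch, not the paper's GradBand algorithm; all names and constants are illustrative:

```python
import math
import random

def policy_gradient_bandit(means, iters=2000, lr=0.1, seed=0):
    """One-parameter sigmoid policy over two arms; the score-function
    (REINFORCE) gradient is estimated from a single sampled reward."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iters):
        p1 = 1.0 / (1.0 + math.exp(-theta))   # probability of pulling arm 1
        arm = 1 if rng.random() < p1 else 0
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # d log pi(arm) / d theta: (1 - p1) for arm 1, -p1 for arm 0.
        grad_logp = (1.0 - p1) if arm == 1 else -p1
        theta += lr * reward * grad_logp       # stochastic gradient ascent
    return 1.0 / (1.0 + math.exp(-theta))
```

With arm means (0.2, 0.8) the learned probability of the better arm drifts toward 1, since the expected gradient p1(1-p1)(0.8-0.2) is positive everywhere.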

Gradient Ascent for Active Exploration in Bandit Problems [article]

Pierre Ménard
2019 arXiv   pre-print
We present a new algorithm based on gradient ascent for a general Active Exploration bandit problem in the fixed confidence setting.  ...  We prove that this algorithm is asymptotically optimal and, most importantly, computationally efficient.  ...  Indeed, there is a candidate of choice for optimizing a non-smooth concave function, namely sub-gradient ascent.  ...
arXiv:1905.08165v1 fatcat:srrwpzcwefh4heoyh7pttmblte
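As a concrete instance of the sub-gradient ascent mentioned in the snippet (the toy objective here is mine, not the paper's): maximize the non-smooth concave f(x) = -|x - 2| with the standard 1/√t step size, using any valid sub-gradient at the kink.

```python
def subgradient_ascent(f_subgrad, x0=0.0, steps=200):
    """Sub-gradient ascent with a 1/sqrt(t) step size, the standard
    scheme for maximizing a non-smooth concave function."""
    x = x0
    for t in range(1, steps + 1):
        g = f_subgrad(x)        # any element of the super-differential
        x += g / t ** 0.5
    return x

# f(x) = -|x - 2| is concave but non-smooth at its maximizer x* = 2;
# a valid sub-gradient of f is -sign(x - 2), and 0 at the kink.
subgrad = lambda x: -1.0 if x > 2 else (1.0 if x < 2 else 0.0)
```

The iterate oscillates around x* = 2 with amplitude shrinking like the step size, which is why diminishing steps are essential in the non-smooth case.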

Bandits with concave rewards and convex knapsacks

Shipra Agrawal, Nikhil R. Devanur
2014 Proceedings of the fifteenth ACM conference on Economics and computation - EC '14  
We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB provides a polynomial time algorithm that has near-optimal regret guarantees for this substantially more general  ...  , online convex optimization, and the Frank-Wolfe technique for convex optimization.  ...  Here we present a primal algorithm (for BwR) that requires computing the gradient of f in each step, based on the Frank-Wolfe algorithm [Frank and Wolfe 1956].  ...
doi:10.1145/2600057.2602844 dblp:conf/sigecom/AgrawalD14 fatcat:jodo3ehfmjgwjfxjcbjxa6u3fm

Bandits with concave rewards and convex knapsacks [article]

Shipra Agrawal, Nikhil R. Devanur
2014 arXiv   pre-print
We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB provides a polynomial time algorithm that has near-optimal regret guarantees for this substantially more general  ...  , online convex optimization, and the Frank-Wolfe technique for convex optimization.  ...  Here we present a primal algorithm (for BwR) that requires computing the gradient of f in each step, based on the Frank-Wolfe algorithm [Frank and Wolfe 1956].  ...
arXiv:1402.5758v1 fatcat:rpmgelekqzgozjn3cg6bm2pc2e
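The primal step described in both versions of this paper — compute the gradient of f, query a linear optimization oracle, and take a convex combination — is the classical Frank-Wolfe iteration. A minimal sketch over the probability simplex (the objective and dimensions below are illustrative, not taken from the paper):

```python
def frank_wolfe_simplex(grad, dim, iters=200):
    """Frank-Wolfe for maximizing a concave f over the probability simplex.
    Each step needs only grad(x) and a linear maximization oracle, which
    on the simplex simply puts all mass on the best coordinate."""
    x = [1.0 / dim] * dim
    for t in range(iters):
        g = grad(x)
        i_star = max(range(dim), key=lambda i: g[i])  # LMO over the simplex
        gamma = 2.0 / (t + 2)                          # standard step size
        x = [(1 - gamma) * xi for xi in x]             # convex combination
        x[i_star] += gamma                             # with the chosen vertex
    return x
```

For the concave objective f(x) = -Σ_i (x_i - c_i)², whose maximizer over the simplex is c itself when c lies on it, the iterates converge at the usual O(1/t) rate while remaining feasible by construction.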

Optimal No-Regret Learning in Strongly Monotone Games with Bandit Feedback [article]

Tianyi Lin, Zhengyuan Zhou, Wenjia Ba, Jiawei Zhang
2021 arXiv   pre-print
Leveraging self-concordant barrier functions, we first construct an online bandit convex optimization algorithm and show that it achieves the single-agent optimal regret of Θ̃(√(T)) under smooth and strongly-concave  ...  Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves  ...  and prove that it achieves the near-optimal regret minimization property for bandit concave optimization (BCO) 2 .  ... 
arXiv:2112.02856v2 fatcat:7gol4fzeunakrfosnelcve3zdy

Regret Analysis for Continuous Dueling Bandit [article]

Wataru Kumagai
2017 arXiv   pre-print
Moreover, when considering a lower bound in convex optimization, our algorithm is shown to achieve the optimal convergence rate in convex optimization and the optimal regret in dueling bandit except for  ...  In this research, we address a dueling bandit problem based on a cost function over a continuous space.  ...  Acknowledgment We would like to thank Professor Takafumi Kanamori for helpful comments. This work was supported by JSPS KAKENHI Grant Number 17K12653.  ... 
arXiv:1711.07693v2 fatcat:di6jhn5lkze2xhohy4j2fenpg4

Meta-Learning Bandit Policies by Gradient Ascent [article]

Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier
2021 arXiv   pre-print
We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are both differentiable  ...  We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement.  ...  Algorithm Exp3 Exp3 (Auer et al., 1995) is a well-known algorithm for non-stochastic bandits.  ... 
arXiv:2006.05094v2 fatcat:5lkjkzjy55cwhmxkkjc7idjp2a
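Exp3, which this entry cites as the well-known algorithm for non-stochastic bandits, maintains exponential weights with uniform exploration and importance-weighted reward estimates. A compact sketch of the Auer et al. scheme (the two-arm test setting is illustrative):

```python
import math
import random

def exp3(reward_fn, K, T, gamma=0.1, seed=0):
    """Exp3 (Auer et al.): exponential weights mixed with uniform
    exploration. Only the played arm's reward is observed, so the update
    uses the importance-weighted estimate x / p[arm]."""
    rng = random.Random(seed)
    w = [1.0] * K
    total = 0.0
    for t in range(T):
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=p)[0]
        x = reward_fn(t, arm)          # reward in [0, 1], bandit feedback
        total += x
        w[arm] *= math.exp(gamma * x / (K * p[arm]))
    return total
```

The uniform-mixing term gamma/K keeps every p[arm] bounded away from zero, which controls the variance of the importance-weighted estimates.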

Improving Offline Contextual Bandits with Distributional Robustness [article]

Otmane Sakhi, Louis Faury, Flavian Vasile
2020 arXiv   pre-print
This paper extends the Distributionally Robust Optimization (DRO) approach for offline contextual bandits.  ...  Our approach relies on the construction of asymptotic confidence intervals for offline contextual bandits through the DRO framework.  ...  This is not the case for DRO-based algorithms.  ... 
arXiv:2011.06835v1 fatcat:hudhcyqy6jdidndspkv5aayqt4

Regret bounded by gradual variation for online convex optimization

Tianbao Yang, Mehrdad Mahdavi, Rong Jin, Shenghuo Zhu
2013 Machine Learning  
Unlike previous approaches that maintain a single sequence of solutions, the proposed algorithms maintain two sequences of solutions that make it possible to achieve a variation-based regret bound for  ...  We extend the main results three-fold: (i) we present a general method to obtain a gradual variation bound measured by general norm; (ii) we extend algorithms to a class of online non-smooth optimization  ...  Acknowledgements We thank the reviewers for their immensely helpful and thorough comments.  ... 
doi:10.1007/s10994-013-5418-8 fatcat:lou3apxvpzg35dni5shyugyu7q

Improper Reinforcement Learning with Gradient-based Policy Optimization [article]

Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, Shie Mannor
2021 arXiv   pre-print
gradient descent optimization.  ...  We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. We derive convergence rate guarantees for the approach assuming access to a gradient oracle.  ...  π_θ is non-concave.  ...
arXiv:2102.08201v3 fatcat:ycfwh2cdpfcafmryqai4uh33pu

Bandit Convex Optimization in Non-stationary Environments [article]

Peng Zhao and Guanghui Wang and Lijun Zhang and Zhi-Hua Zhou
2020 arXiv   pre-print
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point  ...  We propose a novel algorithm that achieves O(T^3/4(1+P_T)^1/2) and O(T^1/2(1+P_T)^1/2) dynamic regret respectively for the one-point and two-point feedback models.  ...  We extend the algorithm to an anytime version. Besides, we also present the algorithm for BCO problems to optimize the adaptive regret, another measure for non-stationary online learning.  ... 
arXiv:1907.12340v2 fatcat:lpxbjvd54faq5jbypcsicndqle
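The one-point feedback model this abstract refers to permits a classic trick: a single function evaluation at a randomly perturbed point gives an unbiased estimate of the gradient of a smoothed version of f. A sketch of the Flaxman–Kalai–McMahan-style estimator (the linear test function in the usage is illustrative):

```python
import math
import random

def one_point_grad(f, x, delta, rng):
    """One-point bandit gradient estimate: E[(d/delta) * f(x + delta*u) * u]
    equals the gradient of the delta-smoothed f, where u is drawn
    uniformly from the unit sphere."""
    d = len(x)
    u = [rng.gauss(0, 1) for _ in range(d)]        # Gaussian, then normalize
    norm = math.sqrt(sum(ui * ui for ui in u))
    u = [ui / norm for ui in u]
    fx = f([xi + delta * ui for xi, ui in zip(x, u)])  # single evaluation
    return [(d / delta) * fx * ui for ui in u]
```

For a linear f the estimator is exactly unbiased for the true gradient; in general it trades bias (through delta) against variance, which is what drives the T^{3/4}-type regret rates under one-point feedback.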

An adaptive stochastic optimization algorithm for resource allocation [article]

Xavier Fontaine and Shie Mannor and Vianney Perchet
2020 arXiv   pre-print
Our parameter-independent algorithm recovers the optimal rates for strongly-concave functions and the classical fast rates of multi-armed bandit (for linear reward functions).  ...  Moreover, the algorithm improves existing results on stochastic optimization in this regret minimization setting for intermediate cases.  ...  This work was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with Gaspard Monge Program for optimization, operations  ... 
arXiv:1902.04376v3 fatcat:ef6jyh4plfembexgchuiqjtyue

Fighting Contextual Bandits with Stochastic Smoothing [article]

Young Hun Jung, Ambuj Tewari
2019 arXiv   pre-print
We propose a general algorithm template that represents random perturbation based algorithms and identify several perturbation distributions that lead to strong regret bounds.  ...  Using the idea of smoothness, we provide an O(√(T)) zero-order bound for the vanilla algorithm and an O(L^*2/3_T) first-order bound for the clipped version.  ...  ALGORITHMS Our algorithm is based on the Gradient-Based Prediction Algorithm of Abernethy et al. (2015) .  ... 
arXiv:1810.05188v2 fatcat:tyudee2yljfv5gfs45572tkctu
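The "random perturbation based algorithms" this entry studies follow the perturbed-leader template: play the action whose cumulative reward estimate plus fresh noise is largest; the choice of perturbation distribution is what yields the smoothness and the regret bounds. A deliberately simplified sketch (a faithful bandit variant would update with importance-weighted reward estimates rather than the raw observed rewards used here):

```python
import random

def perturbed_leader(reward_fn, K, T, scale=10.0, seed=0):
    """Follow-the-Perturbed-Leader sketch: each round, add i.i.d. Gaussian
    noise to every arm's cumulative reward and play the argmax. Under
    bandit feedback only the played arm's total is updated."""
    rng = random.Random(seed)
    cum = [0.0] * K
    total = 0.0
    for t in range(T):
        arm = max(range(K), key=lambda i: cum[i] + scale * rng.gauss(0, 1))
        x = reward_fn(t, arm)
        total += x
        cum[arm] += x       # bandit feedback: other arms stay unchanged
    return total
```

The perturbation plays the same smoothing role as the regularizer in follow-the-regularized-leader, which is the connection the stochastic-smoothing analysis exploits.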
Showing results 1 — 15 out of 680 results