60 Hits in 8.7 sec

Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP [article]

Zihan Zhang, Jiaqi Yang, Xiangyang Ji, Simon S. Du
2021 arXiv   pre-print
This paper presents new variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).  ...  For linear mixture MDPs, we obtain an Õ(poly(d, log H)√(K)) regret bound, where d is the number of base models, K is the number of episodes, and H is the planning horizon.  ...  Du gratefully acknowledges funding from NSF Awards IIS-2110170 and DMS-2134106.  ... 
arXiv:2101.12745v4 fatcat:6jofrm4v4nd7rhnc4la3i75clq
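
As a rough illustration of the kind of estimator behind variance-aware confidence sets, here is a minimal numpy sketch of variance-weighted ridge regression with an ellipsoidal bonus; the weight floor, the regularizer lam, and the radius beta are placeholder assumptions, not the construction from this paper.

    import numpy as np

    def weighted_ridge_estimate(features, rewards, variances, lam=1.0):
        """Variance-weighted ridge regression: each round is down-weighted by
        its (estimated) noise variance, so low-noise rounds shrink the
        confidence ellipsoid faster than in a variance-oblivious estimator."""
        d = features.shape[1]
        w = 1.0 / np.maximum(variances, 1e-8)              # per-round weights (assumed floor)
        A = lam * np.eye(d) + (features * w[:, None]).T @ features
        b = (features * (w * rewards)[:, None]).sum(axis=0)
        theta_hat = np.linalg.solve(A, b)
        return theta_hat, A

    def ellipsoid_width(x, A, beta):
        """Exploration bonus beta * ||x||_{A^{-1}} for a candidate action x."""
        return beta * np.sqrt(x @ np.linalg.solve(A, x))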

Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs [article]

Yeoneung Kim, Insoon Yang, Kwang-Sung Jun
2021 arXiv   pre-print
For linear mixture MDPs, we achieve a horizon-free regret bound of Õ(d^1.5√(K) + d^3) where d is the number of base models and K is the number of episodes.  ...  bound for linear mixture Markov decision processes (MDPs).  ...  PAC reinforcement learning with rich observations. arXiv preprint arXiv:1602.02722, 2016. Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP. CoRR, abs/2101.1, 2021.  ... 
arXiv:2111.03289v1 fatcat:2aiinalr4zcudkz53tgc3fqjxi

Learning Near Optimal Policies with Low Inherent Bellman Error [article]

Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
2020 arXiv   pre-print
While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement  ...  First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work.  ...  We thank Alekh Agarwal for pointing out the connection with the low Bellman rank setting. The authors are grateful to the reviewers for their helpful comments.  ... 
arXiv:2003.00153v3 fatcat:2kojpcgskra4hjvhv4hvdtlx3q

A Model Selection Approach for Corruption Robust Reinforcement Learning [article]

Chen-Yu Wei, Christoph Dann, Julian Zimmert
2021 arXiv   pre-print
Finally, our model selection framework can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved  ...  For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of 𝒪(√((1+C)T)), and another computationally inefficient one with 𝒪(√(T)+C), improving the result  ...  Acknowledgments The authors would like to thank Liyu Chen and Thodoris Lykouris for helpful discussions.  ... 
arXiv:2110.03580v1 fatcat:us2atl3cybhbtekysz4uiqjmau

Bayesian Reinforcement Learning: A Survey

Mohammed Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar
2015 Foundations and Trends® in Machine Learning  
We first discuss models and methods for Bayesian inference in the simple single-step bandit model.  ...  We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class.  ...  Acknowledgements The authors extend their warmest thanks to Michael Littman, James Finlay, Melanie Lyman-Abramovitch and the anonymous reviewers for their insights and support throughout  ... 
doi:10.1561/2200000049 fatcat:xrgut7tqjbf5le7h5otjwcwkry
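
For the single-step Bayesian bandit model mentioned in the snippet, the textbook illustration is Thompson sampling with a Beta-Bernoulli conjugate prior; the sketch below is generic and not taken from the survey.

    import numpy as np

    def thompson_sampling(true_means, horizon, rng=np.random.default_rng(0)):
        """Beta-Bernoulli Thompson sampling: keep a Beta(a, b) posterior per
        arm, sample a mean from each posterior, and pull the argmax."""
        k = len(true_means)
        a, b = np.ones(k), np.ones(k)        # uniform Beta(1, 1) priors
        total_reward = 0.0
        for _ in range(horizon):
            arm = np.argmax(rng.beta(a, b))  # posterior sampling step
            reward = rng.binomial(1, true_means[arm])
            a[arm] += reward                 # conjugate posterior update
            b[arm] += 1 - reward
            total_reward += reward
        return total_reward

    print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))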

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [article]

Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric
2021 arXiv   pre-print
We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs).  ...  derived for specific MDP instances. 3) Finally, we show that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and it does not scale with the minimum action gap  ...  Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In NeurIPS, pages 5626-5635, 2019.  ... 
arXiv:2106.13013v1 fatcat:tc72ivbe6zdplc7tgvlfujatdi

Adaptive Multi-Goal Exploration [article]

Jean Tarbouriech, Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Michal Valko, Alessandro Lazaric
2021 arXiv   pre-print
We also readily instantiate AdaGoal in linear mixture Markov decision processes, which yields the first goal-oriented PAC guarantee with linear function approximation.  ...  s_0 in a reward-free Markov decision process.  ...  Acknowledgement We thank Evrard Garcelon, Andrea Tirinzoni and Yann Ollivier for helpful discussion.  ... 
arXiv:2111.12045v1 fatcat:jyajblwkdjf7xetikv3l3fx334

Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection [article]

Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, Matteo Pirotta
2021 arXiv   pre-print
We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.  ...  We then demonstrate that this condition is also sufficient for these classes of problems by deriving a constant regret bound for two optimistic algorithms (LSVI-UCB and ELEANOR).  ...  Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S. Du. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture MDP.  ... 
arXiv:2110.14798v1 fatcat:tivszrwcunhshndl6bnzykui2m
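
A compact sketch of one optimistic least-squares value-iteration backup, in the spirit of the LSVI-UCB algorithm cited above; the feature matrix, bonus scale beta, and regularizer lam are illustrative assumptions rather than the exact algorithm analyzed in the paper.

    import numpy as np

    def lsvi_ucb_step(phi, rewards, next_values, lam=1.0, beta=1.0):
        """One regression step of optimistic least-squares value iteration.
        phi: (n, d) features of visited (s, a) pairs; next_values: (n,)
        backed-up values max_a Q_{h+1}(s', a) from the next stage."""
        n, d = phi.shape
        Lambda = lam * np.eye(d) + phi.T @ phi            # regularized Gram matrix
        w = np.linalg.solve(Lambda, phi.T @ (rewards + next_values))

        def q_value(x):
            bonus = beta * np.sqrt(x @ np.linalg.solve(Lambda, x))
            return x @ w + bonus                          # optimistic Q estimate

        return q_value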

Robust Policy Gradient against Strong Data Corruption [article]

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, Wen Sun
2021 arXiv   pre-print
Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most ϵ-fraction of the learning episodes.  ...  Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an O(√(ϵ))-optimal policy.  ...  -Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs. Advances in Neural Information Processing Systems, 33, 2020.  ... 
arXiv:2102.05800v3 fatcat:4xzl4rurpnflhbx3y2fqmhhsne
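
For a tabular softmax policy, the natural policy gradient step discussed here reduces to a multiplicative-weights update on the advantages; a generic sketch, with the step size eta and the advantage estimates assumed given.

    import numpy as np

    def npg_softmax_update(policy, advantages, eta=0.1):
        """One NPG step for a tabular softmax policy:
        pi'(a|s) ∝ pi(a|s) * exp(eta * A(s, a)), normalized per state."""
        new_policy = policy * np.exp(eta * advantages)
        return new_policy / new_policy.sum(axis=1, keepdims=True)

    # toy usage: 2 states, 3 actions
    pi = np.full((2, 3), 1.0 / 3.0)
    A = np.array([[0.0, 1.0, -1.0], [0.5, 0.0, 0.0]])
    print(npg_softmax_update(pi, A))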

Optimizing for the Future in Non-Stationary MDPs [article]

Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip S. Thomas
2020 arXiv   pre-print
The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy  ...  To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance.  ...  On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008. Nagabandi, A., Clavera, I., Liu, S., Fearing, R.  ... 
arXiv:2005.08158v4 fatcat:er42kn4d2bbsni6xqvkni37jli
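
The idea of optimizing a forecast of future performance can be illustrated by fitting a simple trend to past per-episode returns and extrapolating one step ahead; the ordinary least-squares fit below is only a stand-in for the paper's estimator.

    import numpy as np

    def forecast_next_performance(past_returns):
        """Fit a degree-1 trend to past per-episode returns and extrapolate
        to the next episode; a non-uniform reweighting of past episodes
        could replace the plain least-squares fit used here."""
        t = np.arange(len(past_returns))
        slope, intercept = np.polyfit(t, past_returns, deg=1)
        return slope * len(past_returns) + intercept

    print(forecast_next_performance([1.0, 1.2, 1.1, 1.4, 1.5]))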

Differentially Private Exploration in Reinforcement Learning with Linear Representation [article]

Paul Luyo and Evrard Garcelon and Alessandro Lazaric and Matteo Pirotta
2021 arXiv   pre-print
We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020) (a.k.a. model-based setting) and provide a unified framework for analyzing joint and local differentially private (DP) exploration  ...  We further study privacy-preserving exploration in linear MDPs (Jin et al., 2020) (a.k.a. model-free setting) where we provide a O(K^3/5/ϵ^2/5) regret bound for (ϵ,δ)-joint DP, with a novel algorithm based  ...  For linear-mixture MDPs, we presented a unified framework that allowed us to prove a O(√(K)/ϵ) regret bound for JDP and O(K^3/4/√(ϵ)) for (ϵ, δ)-LDP.  ... 
arXiv:2112.01585v2 fatcat:ubnlne4zyrgqrfcb7gcyhzznt4
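
A common mechanism behind joint-DP exploration with linear function approximation is to let the learner touch the data only through noise-perturbed regression statistics; the Gaussian noise scale sigma below is a placeholder and not the calibration used in this paper.

    import numpy as np

    def private_regression_stats(phi, targets, sigma, lam=1.0, rng=np.random.default_rng(0)):
        """Perturb the Gram matrix and the feature-target vector with symmetric
        Gaussian noise before solving the ridge regression, so downstream value
        estimates depend on the data only through noisy statistics.
        In practice lam is inflated so Lambda stays positive definite despite the noise."""
        n, d = phi.shape
        noise = rng.normal(scale=sigma, size=(d, d))
        Lambda = lam * np.eye(d) + phi.T @ phi + (noise + noise.T) / 2
        u = phi.T @ targets + rng.normal(scale=sigma, size=d)
        return np.linalg.solve(Lambda, u), Lambda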

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism [article]

Ming Yin, Yu-Xiang Wang
2021 arXiv   pre-print
In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs.  ...  setting, single policy concentrability, and the tight problem-dependent results.  ...  MY would like to thank Chenjun Xiao for bringing up a related literature [Xiao et al., 2021] and Masatoshi Uehara for helpful suggestions.  ... 
arXiv:2110.08695v1 fatcat:3wctna57pbhvpi5qwhuri5iuuu
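
Pessimism in offline RL is typically implemented by subtracting an uncertainty penalty from estimated values; below is a minimal tabular sketch with a count-based penalty, which is an illustrative choice rather than this paper's construction.

    import numpy as np

    def pessimistic_q(reward_hat, p_hat, counts, v_next, c=1.0):
        """One backup of pessimistic value iteration for a tabular MDP:
        Q(s,a) = r_hat + P_hat V_next - penalty, with a count-based penalty
        that shrinks as state-action pairs are observed more often."""
        penalty = c / np.sqrt(np.maximum(counts, 1))
        q = reward_hat + p_hat @ v_next - penalty
        return np.maximum(q, 0.0)        # clip at zero, values assumed nonnegative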

Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination [article]

Dylan J. Foster, Akshay Krishnamurthy
2021 arXiv   pre-print
In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire asked whether first-order guarantees are even possible for contextual bandits and -- if so -- whether they can be attained  ...  While first-order guarantees are relatively well understood in statistical and online learning, adapting to low noise in contextual bandits (and more broadly, decision making) presents major algorithmic  ...  Acknowledgements We thank Sivaraman Balakrishnan, John Langford, Zakaria Mhammedi, and Sasha Rakhlin for many helpful discussions.  ... 
arXiv:2107.02237v1 fatcat:xvw6eti3frdwpnf2tt4tqt7mn4

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems [article]

Sébastien Bubeck, Nicolò Cesa-Bianchi
2012 arXiv   pre-print
Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.  ...  Although the study of bandit problems dates back to the Thirties, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing  ...  Acknowledgements We would like to thank Mike Jordan for proposing to write this survey and James Finlay for keeping us on track. The table of contents was laid down with the help of Gábor Lugosi.  ... 
arXiv:1204.5721v2 fatcat:kpclt3fswzewtcsjp7hkjncd6q
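
The basic stochastic setting covered by this survey is usually illustrated with the UCB1 index policy; a self-contained sketch with Bernoulli rewards.

    import numpy as np

    def ucb1(true_means, horizon, rng=np.random.default_rng(0)):
        """UCB1: play the arm maximizing the empirical mean plus a
        sqrt(2 log t / n) confidence bonus; pull each arm once to start."""
        k = len(true_means)
        counts = np.zeros(k)
        sums = np.zeros(k)
        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1                              # initialization round
            else:
                bonus = np.sqrt(2.0 * np.log(t) / counts)
                arm = np.argmax(sums / counts + bonus)
            reward = rng.binomial(1, true_means[arm])
            counts[arm] += 1
            sums[arm] += reward
        return sums.sum()

    print(ucb1([0.3, 0.5, 0.7], horizon=1000))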

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck, Nicolò Cesa-Bianchi
2012 Foundations and Trends® in Machine Learning  
Note that the randomization of the adversary is not very important here since we ask for bounds which hold for any opponent.  ...  Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.  ...  Acknowledgments We would like to thank Mike Jordan for proposing to write this monograph and James Finlay for keeping us on track. The table of contents was laid down with the help of Gábor Lugosi.  ... 
doi:10.1561/2200000024 fatcat:fzpfffppvrfrle6vkj7z6wzh2e
Showing results 1 — 15 out of 60 results