
Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
2020 International Conference on Machine Learning  
We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.  ...  Our key contributions are two-fold: a tighter confidence set for the transition function; and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.  ...  The environment dynamics are usually modeled as a Markov Decision Process (MDP) with a fixed and unknown transition function.  ... 
dblp:conf/icml/JinJLSY20 fatcat:ylnaejcljng3nbsjs6v3jeh5oi
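The "inversely weighted" estimator the snippet mentions can be sketched in a few lines. This is a hedged illustration of the idea, not the paper's exact implementation: `upper_occupancy` stands for an upper confidence bound on the probability of visiting each state-action pair, and `gamma` is an illustrative implicit-exploration offset.

```python
import numpy as np

def optimistic_loss_estimate(loss, visited, upper_occupancy, gamma=0.01):
    """Inverse-weighted optimistic loss estimator (sketch).

    loss: observed loss per (s, a); only meaningful where visited.
    visited: indicator, 1.0 if (s, a) was visited this episode.
    upper_occupancy: upper confidence bound on the occupancy measure.
    gamma: small offset biasing estimates downward (implicit exploration).
    """
    return loss * visited / (upper_occupancy + gamma)

# Example: two state-action pairs, only the first visited.
loss = np.array([0.5, 0.8])
visited = np.array([1.0, 0.0])
u = np.array([0.4, 0.3])
est = optimistic_loss_estimate(loss, visited, u, gamma=0.1)
# The visited pair's loss is inflated by 1/(u + gamma); the unvisited pair gets 0.
```

Dividing by an *upper* bound on the occupancy (rather than the unknown true occupancy) keeps the estimator computable while biasing it optimistically downward.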

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Aviv Rosenberg, Yishay Mansour
2019 Neural Information Processing Systems  
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.  ...  To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.  ...  Acknowledgements This work was supported in part by a grant from the Israel Science Foundation (ISF) and by the Tel Aviv University Yandex Initiative in Machine Learning.  ... 
dblp:conf/nips/0002M19 fatcat:a6wjfkw3zjffbl4sjld7ip67ue
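Algorithms in this line of work typically run online mirror descent over occupancy measures. A minimal entropic-update sketch, with the simplifying assumption that the projection onto the Bellman-flow constraints is replaced by plain renormalization (which the real algorithms do not do):

```python
import numpy as np

def omd_step(q, loss_est, eta=0.5):
    """One entropic mirror-descent step on an occupancy measure q (sketch).

    A faithful implementation would project back onto the set of valid
    occupancy measures of the MDP; here we only renormalize.
    """
    w = q * np.exp(-eta * loss_est)
    return w / w.sum()

# Uniform occupancy over three state-action pairs; pair 0 incurred loss.
q = omd_step(np.ones(3) / 3, np.array([1.0, 0.0, 0.0]), eta=1.0)
```

The multiplicative update shifts probability mass away from pairs with high estimated loss, which is the mechanism behind the regret guarantees in this setting.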

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition [article]

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu
2020 arXiv   pre-print
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses.  ...  full-information feedback.  ...  Acknowledgments HL is supported by NSF Awards IIS-1755781 and IIS-1943607. SS is partially supported by NSF-BIGDATA Award IIS-1741341 and an NSF-CAREER grant Award IIS-1846088.  ... 
arXiv:1912.01192v5 fatcat:cnim65b3wrarrfmisbc6apvpni

Adaptive demand response: Online learning of restless and controlled bandits

Qingsi Wang, Mingyan Liu, Johanna L. Mathieu
2014 2014 IEEE International Conference on Smart Grid Communications (SmartGridComm)  
Our problem has two features not commonly addressed in the bandit literature: the arms/processes evolve according to different probabilistic laws depending on the control, and the reward/feedback observed  ...  We develop an adaptive demand response learning algorithm and an extended version that works with aggregate feedback, both aimed at approximating the Whittle index policy.  ...  This further splits into two cases: In the first, the process is assumed to follow a certain probabilistic model with unknown parameters, e.g., a Markov chain with unknown transition probabilities or an  ... 
doi:10.1109/smartgridcomm.2014.7007738 dblp:conf/smartgridcomm/WangLM14 fatcat:ddsolmx3lndn7fzwv5gwda3hry
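The two features the snippet highlights can be sketched directly: arms evolve under control-dependent Markov kernels, and at decision time an (approximate) Whittle index policy activates the arms with the largest indices. Computing the indices themselves is the hard part and is assumed given here; all names are illustrative.

```python
import numpy as np

def step_arm(state, active, P_active, P_passive, rng):
    """Arms evolve under different Markov kernels depending on the control."""
    P = P_active if active else P_passive
    return rng.choice(P.shape[0], p=P[state])

def whittle_select(indices, k):
    """Activate the k arms with the largest estimated Whittle indices."""
    return np.argsort(indices)[-k:][::-1]

rng = np.random.default_rng(0)
# Under the active kernel this arm is absorbing in its current state.
nxt = step_arm(0, True, np.eye(2), np.full((2, 2), 0.5), rng)
chosen = whittle_select(np.array([0.2, 0.9, 0.5]), 2)
```

The learning problem is then to estimate the unknown kernels (or the indices) from feedback while playing the index policy greedily on the estimates.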

Learning Adversarial Markov Decision Processes with Delayed Feedback

Tal Lancewicki, Aviv Rosenberg, Yishay Mansour
2022 AAAI Conference on Artificial Intelligence  
This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback.  ...  Under bandit feedback, we prove similar (K+D)^1/2 regret assuming the costs are stochastic, and (K+D)^2/3 regret in the general case.  ...  "Known Transition" assumes dynamics are known to the learner in advance, and "Unknown Transition" means that the learner needs to learn the dynamics.  ... 
doi:10.1609/aaai.v36i7.20690 fatcat:fgscdhlmdbdjfle5lbevarhxfi

Online Markov Decision Processes with Aggregate Bandit Feedback [article]

Alon Cohen, Haim Kaplan, Tomer Koren, Yishay Mansour
2021 arXiv   pre-print
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.  ...  In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback: the trajectory is revealed along with  ...  Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition.  ... 
arXiv:2102.00490v1 fatcat:gy4pu4pz2razlluimutuxbytvy

Deterministic MDPs with Adversarial Rewards and Bandit Feedback [article]

Raman Arora, Ofer Dekel, Ambuj Tewari
2012 arXiv   pre-print
We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the  ...  In this setting, we present a novel and efficient online decision making algorithm named MarcoPolo.  ...  Acknowledgements A major portion of this work was done when RA and AT were visiting OD at MSR Redmond.  ... 
arXiv:1210.4843v1 fatcat:fyxownfssjbtdh76d7wcoty6lm

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [article]

Liyu Chen, Haipeng Luo, Chen-Yu Wei
2021 arXiv   pre-print
Our work is also the first to consider bandit feedback with adversarial costs.  ...  We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is O(√(DT^⋆ K)) and O(√(DT^⋆ SA K)) for the full-information setting and the  ...  This work is supported by NSF Award IIS-1943607 and a Google Faculty Research Award.  ... 
arXiv:2012.04053v3 fatcat:2kvjqo6ehvh75hcubaw7m3hrde

Distributed No-Regret Learning in Multi-Agent Systems [article]

Xiao Xu, Qing Zhao
2020 arXiv   pre-print
In this tutorial article, we give an overview of new challenges and representative results on distributed no-regret learning in multi-agent systems modeled as repeated unknown games.  ...  Four emerging game characteristics---dynamicity, incomplete and imperfect feedback, bounded rationality, and heterogeneity---that challenge canonical game models are explored.  ...  In a multi-player game setting with bandit feedback, no-regret learning from an individual player's perspective can be cast as a single-player non-stochastic/adversarial bandit model where the payoff of  ... 
arXiv:2002.09047v1 fatcat:igu7xdfmyzh3ddjz4li3xcmb6m
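The "single-player non-stochastic/adversarial bandit model" referenced in the snippet is the setting handled by Exp3. A minimal loss-based sketch (the learning rate `eta` and the loss sequence below are illustrative, not tuned):

```python
import numpy as np

def exp3(loss_rows, eta=0.1, seed=0):
    """Run Exp3 on a T x K matrix of adversarially chosen losses in [0, 1]."""
    rng = np.random.default_rng(seed)
    K = loss_rows.shape[1]
    weights = np.ones(K)
    total_loss = 0.0
    for losses in loss_rows:
        p = weights / weights.sum()
        arm = rng.choice(K, p=p)          # only this arm's loss is observed
        total_loss += losses[arm]
        est = np.zeros(K)
        est[arm] = losses[arm] / p[arm]   # importance-weighted loss estimate
        weights *= np.exp(-eta * est)
    return total_loss, weights

# Arm 0 always has loss 0, arm 1 always has loss 1.
total, w = exp3(np.tile([0.0, 1.0], (200, 1)))
```

The importance weighting makes the loss estimates unbiased despite observing only the pulled arm, which is the key to the no-regret guarantee under bandit feedback.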

Online Learning in Unknown Markov Games [article]

Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra
2021 arXiv   pre-print
We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable.  ...  This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces.  ...  We thank Yu Bai, Kefan Dong and Chi Jin for useful discussions.  ... 
arXiv:2010.15020v2 fatcat:w6u272f33jg7fpq4gi4pnolmuu

Robust Policy Gradient against Strong Data Corruption [article]

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, Wen Sun
2021 arXiv   pre-print
We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions.  ...  Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most ϵ-fraction of the learning episodes.  ...  Learning ad- versarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, pp. 4860-4869. PMLR, 2020a.  ... 
arXiv:2102.05800v3 fatcat:4xzl4rurpnflhbx3y2fqmhhsne

Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games [article]

Gabriele Farina, Tuomas Sandholm
2021 arXiv   pre-print
a best response, learning safe opponent exploitation, and online play against an unknown opponent/environment.  ...  We give an efficient algorithm that achieves O(T^3/4) regret with high probability for that setting, even when the agent faces an adversarial environment.  ...  We are grateful to Marc Lanctot and Marcello Restelli for their valuable feedback while preparing our manuscript, and to Marc Lanctot and Vinicius Zambaldi for their help in tuning the hyperparameters  ... 
arXiv:2103.04539v1 fatcat:3s5z2crajvamplbizn65dwn5ky

Optimistic Policy Optimization with Bandit Feedback [article]

Yonathan Efroni, Lior Shani, Aviv Rosenberg, Shie Mannor
2020 arXiv   pre-print
In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback.  ...  To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.  ...  Learning adversarial markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019. Kakade, S. and Langford, J.  ... 
arXiv:2002.08243v2 fatcat:mzkftuzxk5e2ll4cgqon3he7iu

Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs [article]

Ian A. Kash, Lev Reyzin, Zishun Yu
2022 arXiv   pre-print
Reinforcement learning (RL) generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel.  ...  We show that, under some mild assumptions, any slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can achieve near-optimal (expected) regret in non-episodic  ...  Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online markov decision processes under bandit feedback.  ... 
arXiv:2205.09056v1 fatcat:tn4t4xvcpjeq3fcwztwzsh7acy

Constrained Contextual Bandit Learning for Adaptive Radar Waveform Selection [article]

Charles E. Thornton, R. Michael Buehrer, Anthony F. Martone
2021 arXiv   pre-print
A sequential decision process in which an adaptive radar system repeatedly interacts with a finite-state target channel is studied.  ...  Stochastic and adversarial linear contextual bandit models are introduced, allowing the radar to achieve effective performance in broad classes of physical environments.  ...  This has been achieved by modeling the waveform selection problem as a Markov Decision Process (MDP) [38] .  ... 
arXiv:2103.05541v2 fatcat:bvjexjasbvfbfbdobr4olshequ
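The "linear contextual bandit" models the snippet mentions are commonly instantiated as LinUCB-style rules: score each action by its estimated linear reward plus an uncertainty bonus. A hedged per-arm sketch, not the paper's algorithm (the constrained variant adds feasibility restrictions omitted here); all names are illustrative:

```python
import numpy as np

def linucb_choose(A_list, b_list, contexts, alpha=1.0):
    """Pick the arm with the largest linear upper confidence bound.

    Per arm: A accumulates x x^T, b accumulates reward * x, so
    theta = A^{-1} b is the ridge estimate and sqrt(x^T A^{-1} x)
    measures uncertainty in direction x.
    """
    scores = []
    for A, b, x in zip(A_list, b_list, contexts):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b
        scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

# Two arms with identical contexts; arm 0 has accumulated positive reward.
A_list = [np.eye(2), np.eye(2)]
b_list = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
contexts = [np.array([1.0, 0.0])] * 2
arm = linucb_choose(A_list, b_list, contexts, alpha=1.0)
```

With equal uncertainty bonuses, the arm with the higher reward estimate wins; when estimates tie, the bonus drives exploration toward less-sampled directions.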
Showing results 1 — 15 out of 491 results