12,366 Hits in 5.0 sec

Dynamic Spectrum Access in Time-varying Environment: Distributed Learning Beyond Expectation Optimization [article]

Yuhua Xu, Jinlong Wang, Qihui Wu, Jianchao Zheng, Liang Shen and Alagan Anpalagan
2017 arXiv   pre-print
Therefore, we formulate the interactions among the users in the time-varying environment as a non-cooperative game, in which the utility function is defined as the achieved effective capacity.  ...  This article investigates the problem of dynamic spectrum access for canonical wireless networks, in which the channel states are time-varying.  ...  access game in time-varying environment.  ... 
arXiv:1502.06672v4 fatcat:ehiz6u5ep5ghnlhxvs7h7imozi
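
The utility in this formulation, the effective capacity, is commonly defined as EC(θ) = −(1/θ) ln E[exp(−θ R)], where R is the random per-slot service rate and θ is a QoS exponent. A minimal Monte-Carlo sketch of this quantity (the rate distribution and the θ values here are illustrative assumptions, not taken from the paper):

```python
import math
import random

def effective_capacity(rates, theta):
    """Monte-Carlo estimate of EC(theta) = -(1/theta) * ln E[exp(-theta * R)]
    from sampled per-slot service rates."""
    mgf = sum(math.exp(-theta * r) for r in rates) / len(rates)
    return -math.log(mgf) / theta

random.seed(0)
# Hypothetical time-varying channel: per-slot rates drawn uniformly.
rates = [random.uniform(0.5, 2.0) for _ in range(10000)]
ec = effective_capacity(rates, theta=1.0)
mean_rate = sum(rates) / len(rates)
print(ec, mean_rate)
```

As θ → 0 the effective capacity approaches the mean rate; larger θ (a stricter delay constraint) pushes it down toward the worst-case rate, which is why it is a natural utility under time-varying channel states.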

MDP Playground: A Design and Debug Testbed for Reinforcement Learning [article]

Raghu Rajan, Jessica Lizeth Borja Diaz, Suresh Guttikonda, Fabio Ferreira, André Biedenkapp, Jan Ole von Hartz, Frank Hutter
2021 arXiv   pre-print
We present MDP Playground, an efficient testbed for Reinforcement Learning (RL) agents with orthogonal dimensions that can be controlled independently to challenge agents in different ways and obtain varying degrees of hardness in generated environments.  ...  The underlying assumptions in many of these environments are that of a Markov Decision Process (MDP) [see, e.g., Puterman, 1994, Sutton and Barto, 2018] or a Partially Observable MDP (POMDP) [see, e.g  ... 
arXiv:1909.07750v4 fatcat:wcj7j7cxqzhdzb5t2hicms7yzy
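
The idea of independently controllable hardness dimensions can be illustrated with a toy chain MDP whose knobs (transition noise, reward density) vary separately. This is a hypothetical sketch of the concept, not MDP Playground's actual API:

```python
import random

def make_chain_mdp(n_states=8, transition_noise=0.0, dense_reward=False):
    """Toy generator with orthogonal hardness knobs: transition noise and
    reward density are varied independently (hypothetical sketch, not
    MDP Playground's interface)."""
    def step(state, action):  # action in {-1, +1}
        if random.random() < transition_noise:
            action = -action  # noise flips the intended move
        nxt = min(max(state + action, 0), n_states - 1)
        if dense_reward:
            reward = nxt / (n_states - 1)                 # shaped, dense reward
        else:
            reward = 1.0 if nxt == n_states - 1 else 0.0  # sparse goal reward
        return nxt, reward
    return step

random.seed(0)
step = make_chain_mdp(n_states=4, transition_noise=0.0)
s, rewards = 0, []
for _ in range(3):
    s, r = step(s, +1)
    rewards.append(r)
print(s, rewards)
```

Because each knob is independent, an experimenter can sweep noise while holding reward density fixed (or vice versa) to attribute agent failures to a single dimension of hardness.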

Active Reinforcement Learning over MDPs [article]

Qi Yang, Peng Yang, Ke Tang
2021 arXiv   pre-print
This paper proposes a framework of Active Reinforcement Learning (ARL) over MDPs to improve generalization efficiency under limited resources via instance selection.  ...  However, one of the greatest challenges in RL is generalization efficiency (i.e., generalization performance per unit time).  ...  In our context, active reinforcement learning (ARL) over MDPs decides which instances (i.e., MDPs) to train on, saving training cost.  ... 
arXiv:2108.02323v3 fatcat:cmcn36kiyvffdkprzbdb252tyy

Learning Robust State Abstractions for Hidden-Parameter Block MDPs [article]

Amy Zhang, Shagun Sodhani, Khimya Khetarpal, Joelle Pineau
2021 arXiv   pre-print
In this work, we leverage ideas of common structure from the HiP-MDP setting, and extend it to enable robust state abstractions inspired by Block MDPs.  ...  Hidden-Parameter Markov Decision Processes (HiP-MDPs) explicitly model this structure to improve sample efficiency in multi-task settings.  ...  Figure 1: Visualizations of the typical MTRL setting and the HiP-MDP setting. 1. Cartpole-Swingup-V0: the mass of the pole varies, 2. Cheetah-Run-V0: the size of the torso varies, 3.  ... 
arXiv:2007.07206v4 fatcat:mdp3x6s6ovf5znhb2oiv56y5fy

Block Contextual MDPs for Continual Learning [article]

Shagun Sodhani, Franziska Meier, Joelle Pineau, Amy Zhang
2021 arXiv   pre-print
In reinforcement learning (RL), when defining a Markov Decision Process (MDP), the environment dynamics is implicitly assumed to be stationary.  ...  In this work, we propose to examine this continual reinforcement learning setting through the block contextual MDP (BC-MDP) framework, which enables us to relax the assumption of stationarity.  ...  Finally, we discuss additional related works in multi-task RL, transfer learning, and MDP metrics in Appendix A.  ... 
arXiv:2110.06972v1 fatcat:cqmjbgeynbacjkeirikksjimtq

Optimizing for the Future in Non-Stationary MDPs [article]

Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip S. Thomas
2020 arXiv   pre-print
However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag.  ...  Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary.  ...  Learning and planning for time-varying MDPs using maximum likelihood estimation. arXiv preprint arXiv:1911.12976, 2019. Padakandla, S.  ... 
arXiv:2005.08158v4 fatcat:er42kn4d2bbsni6xqvkni37jli

Online Reinforcement Learning for Periodic MDP [article]

Ayush Aniket, Arpan Chattopadhyay
2022 arXiv   pre-print
We study learning in periodic Markov Decision Process (MDP), a special type of non-stationary MDP where both the state transition probabilities and reward functions vary periodically, under the average  ...  We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm.  ...  This was first analysed in [7] in a solely reward-varying environment.  ... 
arXiv:2207.12045v1 fatcat:tk5tgz2ylvaz3csvba63b72ggu
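
The construction described in the snippet, appending the period index to the state so the periodic MDP becomes stationary, can be sketched as a wrapper. The toy environment and class names below are invented for illustration:

```python
class PeriodicEnv:
    """Toy periodic MDP: the reward depends on t mod period."""
    def __init__(self, period=3):
        self.period = period
        self.t = 0
    def step(self, state, action):
        phase = self.t % self.period
        reward = action * (1 if phase == 0 else -1)  # reward flips with phase
        self.t += 1
        return state, reward

class PhaseAugmented:
    """Wrap a periodic MDP as a stationary one by appending the period
    index to the state, in the spirit of the augmentation above."""
    def __init__(self, env):
        self.env = env
    def step(self, state, action):
        next_state, reward = self.env.step(state, action)
        phase = self.env.t % self.env.period  # phase of the next step
        return (next_state, phase), reward

env = PhaseAugmented(PeriodicEnv(period=3))
(s, phase), r = env.step(0, +1)
print(s, phase, r)
```

With the phase in the state, the transition and reward functions no longer depend on absolute time, so standard stationary-MDP machinery applies, at the cost of a state space larger by a factor of the period.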

Decentralized MDPs with sparse interactions

Francisco S. Melo, Manuela Veloso
2011 Artificial Intelligence  
Finally, we show a reinforcement learning algorithm in which independent agents learn both individual policies and when and how to coordinate.  ...  We relate our new model to other existing models such as MMDPs and Dec-MDPs.  ...  We run our learning algorithm in each of the test environments. Table 7 summarizes the number of learning steps allowed in each environment.  ... 
doi:10.1016/j.artint.2011.05.001 fatcat:k2theoe5qvf3pbrij37jmjafky

Invariant Causal Prediction for Block MDPs [article]

Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, Doina Precup
2020 arXiv   pre-print
In this paper, we consider the problem of learning abstractions that generalize in block MDPs, families of environments with a shared latent state space and dynamics structure over that latent space, but varying observations.  ...  The authors would also like to thank Marlos Machado for helpful feedback in the writing process.  ... 
arXiv:2003.06016v2 fatcat:hnqf7cfkergp3fsaoi6lbsxa6u

Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation [article]

Melkior Ornik, Ufuk Topcu
2021 arXiv   pre-print
This paper proposes a formal approach to online learning and planning for agents operating in a priori unknown, time-varying environments.  ...  Based on the proposed method, we generalize the exploration bonuses used in learning for time-invariant Markov decision processes by introducing a notion of uncertainty in a learned time-varying model,  ...  In contrast to assuming time-invariance, accounting for time-varying changes in the environment presents a major challenge to learning and planning.  ... 
arXiv:1911.12976v2 fatcat:kqwcpk7iejctlegk5o4cds5sve
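
One way to realize the "uncertainty in a learned time-varying model" idea is a count-based exploration bonus where visit counts decay over time, so stale experience yields higher uncertainty. The decay rule below is an illustrative assumption, not the paper's exact generalization:

```python
import math

class DecayingCountBonus:
    """Sketch of an exploration bonus for time-varying settings: all visit
    counts decay on every step, so data gathered long ago contributes less
    and the bonus for rarely-revisited pairs stays high."""
    def __init__(self, decay=0.99, scale=1.0):
        self.decay = decay
        self.scale = scale
        self.counts = {}
    def observe(self, sa):
        for k in self.counts:          # age all existing counts
            self.counts[k] *= self.decay
        self.counts[sa] = self.counts.get(sa, 0.0) + 1.0
    def bonus(self, sa):
        n = self.counts.get(sa, 0.0)
        return self.scale / math.sqrt(n + 1.0)

b = DecayingCountBonus(decay=0.9)
for _ in range(10):
    b.observe(("s0", "a0"))
print(b.bonus(("s0", "a0")), b.bonus(("s0", "a1")))
```

With decay = 1.0 this reduces to the familiar time-invariant count bonus; with decay < 1.0 the effective count is bounded, so the agent never fully stops exploring, which matches the intuition that a changing environment must be re-checked.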

Inverse Reinforcement Learning in Contextual MDPs [article]

Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy
2020 arXiv   pre-print
We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs).  ...  Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them  ...  Apprenticeship Learning and Inverse Reinforcement Learning In Apprenticeship Learning (AL), the reward function is unknown, and we denote the MDP without the reward function (also commonly called a controlled  ... 
arXiv:1905.09710v5 fatcat:tluul5ast5dedk4nxsbpevr27a

Safety-Constrained Reinforcement Learning for MDPs [article]

Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, Joost-Pieter Katoen
2015 arXiv   pre-print
We consider controller synthesis for stochastic and partially unknown environments in which safety is essential.  ...  Exploiting an iterative learning procedure, the resulting policy is safety-constrained and optimal.  ...  In the learning phase, the main goal is the exploration of this MDP, as we thereby learn the cost function.  ... 
arXiv:1510.05880v1 fatcat:bqjtjzv7kngkxm2z4c7zgxzji4

Bayesian regularization of empirical MDPs [article]

Samarth Gupta, Daniel N. Hill, Lexing Ying, Inderjit Dhillon
2022 arXiv   pre-print
When applied to the environment of the underlying model, the learned policy results in suboptimal performance, thus calling for solutions with better generalization performance.  ...  Our results demonstrate the robustness of regularized MDP policies against the noise present in the models.  ...  When applying π directly to the environment modeled by the underlying MDP M , one often experiences suboptimal performance.  ... 
arXiv:2208.02362v1 fatcat:5g3fnekgbvgabatwxmjrufd3ta
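
A minimal form of such regularization replaces the empirical (MLE) transition estimate with the posterior mean under a Dirichlet prior; the symmetric prior below is an illustrative choice, not necessarily the paper's scheme:

```python
def smoothed_transitions(counts, n_states, alpha=1.0):
    """Posterior-mean transition estimate under a symmetric Dirichlet(alpha)
    prior -- one simple instance of Bayesian regularization of an empirical
    MDP (illustrative, not the paper's exact construction)."""
    total = sum(counts)
    return [(c + alpha) / (total + alpha * n_states) for c in counts]

# Sparse data: a state visited 3 times, always landing in state 0.
raw = [3, 0, 0]
p_mle = [c / sum(raw) for c in raw]            # empirical MLE puts all mass on 0
p_reg = smoothed_transitions(raw, n_states=3)  # regularized estimate spreads mass
print(p_mle, p_reg)
```

The regularized estimate never assigns zero probability to an unobserved transition, which is what protects the resulting policy against noise in a model built from few samples.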

Denoised MDPs: Learning World Models Better Than the World Itself [article]

Tongzhou Wang, Simon S. Du, Antonio Torralba, Phillip Isola, Amy Zhang, Yuandong Tian
2022 arXiv   pre-print
This framework clarifies the kinds of information removed by various prior work on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that  ...  In this work, we categorize information out in the wild into four types based on controllability and relation with reward, and formulate useful information as that which is both controllable and reward-relevant  ...  We are very thankful to Alex Lamb for suggestions and catching our typo in the conditioning of Equation (1).  ... 
arXiv:2206.15477v4 fatcat:he66y45mgfcjvp6l6d253hzkn4

Bounded Optimal Exploration in MDP [article]

Kenji Kawaguchi
2016 arXiv   pre-print
In this paper, we relax the PAC-MDP conditions to reconcile theoretically driven exploration methods and practical needs.  ...  Within the framework of probably approximately correct Markov decision processes (PAC-MDP), much theoretical work has focused on methods to attain near optimality after a relatively long period of learning  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.  ... 
arXiv:1604.01350v1 fatcat:h2szqltoknb2tbgpusrd7bdthq
Showing results 1 — 15 out of 12,366 results