8,795 Hits in 5.4 sec

Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning [article]

Odalric-Ambrym Maillard, Phuong Nguyen, Ronald Ortner, Daniil Ryabko
2013 arXiv   pre-print
This is optimal in T since O(√(T)) is the optimal regret in the setting of learning in a (single discrete) MDP.  ...  Recent regret bounds for this setting are of order O(T^2/3) with an additive term constant yet exponential in some characteristics of the optimal MDP.  ...  Introduction In Reinforcement Learning (RL), an agent has to learn a task through interactions with the environment.  ... 
arXiv:1302.2553v2 fatcat:c64pkkwhk5dpvfffdjhxp2pssa

Synergies between Evolutionary Algorithms and Reinforcement Learning

Madalina M. Drugan
2015 Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference - GECCO Companion '15  
of scalarized MOMABs in terms of upper and lower regret bounds • Scalarized / Pareto regret metric • The Kullback-Leibler divergence regret metric • Exploitation/exploration trade-off • Exploration  ...  Reinforcement learning in Evolutionary Computation II.1.  ...  real-world problems • Hyper-heuristics use reinforcement learning to select the best heuristic for a given task [Ozcan et al., 2010] • Pareto Local Search is used in combination with RL for optimising  ... 
doi:10.1145/2739482.2756582 dblp:conf/gecco/Drugan15 fatcat:5jedfs4jmfgclcpppzcjvz6yvu

Regret Balancing for Bandit and RL Model Selection [article]

Yasin Abbasi-Yadkori, Aldo Pacchiano, My Phan
2020 arXiv   pre-print
We consider model selection in stochastic bandit and reinforcement learning problems.  ...  Given a set of base learning algorithms, an effective model selection strategy adapts to the best learning algorithm in an online fashion.  ...  Regret balancing adapts to the best performing strategy (PSRL in this case). As another application, consider the problem of choosing state representation in reinforcement learning.  ... 
arXiv:2006.05491v1 fatcat:vrmql5wkavbmlaw6cfubca253i

Selecting the State-Representation in Reinforcement Learning [article]

Odalric-Ambrym Maillard, Rémi Munos, Daniil Ryabko
2013 arXiv   pre-print
The problem of selecting the right state-representation in a reinforcement learning problem is considered.  ...  Knowing neither which of the models is the correct one, nor the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the  ...  (ANR-08-COSI-004) and Lampada (ANR-09-EMER-007), by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (project CompLACS), and by Pascal-2.  ... 
arXiv:1302.2552v1 fatcat:fmwjnfv5u5fztjbn27ooia5dm4

Provably Efficient Representation Learning in Low-rank Markov Decision Processes [article]

Weitong Zhang and Jiafan He and Dongruo Zhou and Amy Zhang and Quanquan Gu
2021 arXiv   pre-print
The success of deep reinforcement learning (DRL) is due to the power of learning a representation that is suitable for the underlying exploration and exploitation task.  ...  In order to understand how representation learning can improve the efficiency of RL, we study representation learning for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel  ...  Zhou et al. (2021b) Representation Learning in Reinforcement Learning. Learning good representations in reinforcement learning enjoys a long history.  ... 
arXiv:2106.11935v1 fatcat:tur44wmigrc3nkscfhoachbcxy

Anti-Concentrated Confidence Bonuses for Scalable Exploration [article]

Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham Kakade
2022 arXiv   pre-print
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement  ...  Using this approximation, we obtain stochastic linear bandit algorithms which obtain Õ(d√(T)) regret bounds for poly(d) fixed actions.  ...  INTRODUCTION Optimism in the face of uncertainty (OFU) is a ubiquitous algorithmic principle for online decision-making in bandit and reinforcement learning problems.  ... 
arXiv:2110.11202v2 fatcat:y73umcivengpbhmwqzynokzlha

Optimal Behavior is Easier to Learn than the Truth

Ronald Ortner
2016 Minds and Machines  
While there are algorithms that are able to successfully learn optimal behavior in this setting, they do so without trying to identify the underlying true model.  ...  We consider a reinforcement learning setting where the learner is given a set of possible models containing the true model.  ...  Acknowledgments The author would like to thank two anonymous reviewers for their valuable comments which helped to improve the paper.  ... 
doi:10.1007/s11023-016-9389-y pmid:27682861 pmcid:PMC5018263 fatcat:temx33ic55bv7lpvccoivi3r4a

Selecting Near-Optimal Approximate State Representations in Reinforcement Learning [article]

Ronald Ortner, Odalric-Ambrym Maillard, Daniil Ryabko
2014 arXiv   pre-print
Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation  ...  We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP).  ...  This research was funded by the Austrian Science Fund  ... 
arXiv:1405.2652v6 fatcat:vogf5szadfhz7eisqvvasrwcj4

Deep Exploration via Randomized Value Functions [article]

Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen
2019 arXiv   pre-print
We study the use of randomized value functions to guide deep exploration in reinforcement learning.  ...  We also prove a regret bound that establishes statistical efficiency with a tabular representation.  ...  , and more broadly, students who participated in Stanford University's 2017 and 2018 offerings of Reinforcement Learning, for feedback and stimulating discussions on this work.  ... 
arXiv:1703.07608v5 fatcat:vwihhabalzfe3daoek4dih6efu

Regret Bounds for Reinforcement Learning with Policy Advice [article]

Mohammad Gheshlaghi Azar and Alessandro Lazaric and Emma Brunskill
2013 arXiv   pre-print
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand.  ...  We prove that RLPA has a sub-linear regret of Õ(√(T)) relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action space  ...  In [8] the objective is to learn the optimal policy along with a state representation which satisfies the Markov property.  ... 
arXiv:1305.1027v2 fatcat:ngfawqhphrg6pdmzidmbq5e3mq

Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback

Hal Daumé III, John Langford, Amr Sharaf
2018 International Conference on Learning Representations  
RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state-of-the-art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.  ...  We consider reinforcement learning and bandit structured prediction problems with very sparse loss feedback: only at the end of an episode.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.  ... 
dblp:conf/iclr/Daume0S18 fatcat:gz4rvzh5szgankzwvyoqnkfenm

Model-based Reinforcement Learning for Continuous Control with Posterior Sampling [article]

Ying Fan, Yifei Ming
2021 arXiv   pre-print
In this paper, we study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces theoretically and empirically.  ...  Balancing exploration and exploitation is crucial in reinforcement learning (RL).  ...  To the best of our knowledge, we are the first to show that the regret bound for PSRL in continuous state-action spaces can be polynomial in the episode length H and simultaneously sub-linear in T: For  ... 
arXiv:2012.09613v2 fatcat:7qbmuqa3ezd27biaymrxrtuasq

Regret Bounds for Reinforcement Learning with Policy Advice [chapter]

Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill
2013 Lecture Notes in Computer Science  
We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand.  ...  We prove that RLPA has a sub-linear regret of Õ(√(T)) relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action  ...  In [8] the objective is to learn the optimal policy along with a state representation which satisfies the Markov property.  ... 
doi:10.1007/978-3-642-40988-2_7 fatcat:pvh4zm63qjbhnomepyqpsgrxd4

Reinforcement Learning Algorithm Selection [article]

Romain Laroche, Raphael Feraud
2017 arXiv   pre-print
This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning.  ...  Its principle is to freeze the policy updates at each epoch, and to leave a rebooted stochastic bandit in charge of the algorithm selection.  ...  The theoretical aspects of algorithm selection for reinforcement learning in general, and Epochal Stochastic Bandit Algorithm Selection in particular, are thoroughly detailed in this section.  ... 
arXiv:1701.08810v3 fatcat:fjr5azqqbvdwxdzv33sdnln5he

Learning Robust Representations with Graph Denoising Policy Network [article]

Lu Wang, Wenchao Yu, Wei Wang, Wei Cheng, Wei Zhang, Hongyuan Zha, Xiaofeng He, Haifeng Chen
2019 arXiv   pre-print
In this paper, we propose Graph Denoising Policy Network (short for GDPNet) to learn robust representations from noisy graph data through reinforcement learning.  ...  GDPNet first selects signal neighborhoods for each node, and then aggregates the information from the selected neighborhoods to learn node representations for the down-stream tasks.  ...  In summary, our contributions in this work include: • We propose a novel model, GDPNet, for robust graph representation learning through reinforcement learning.  ... 
arXiv:1910.01784v1 fatcat:aig47osibrgallnkn3viifjoey
Showing results 1 — 15 out of 8,795 results