12 Hits in 2.3 sec

POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

Yasin Abbasi-Yadkori, Peter L. Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvári, Gellért Weisz
2019 International Conference on Machine Learning  
We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies  ...  Thus, we provide the first regret bound for a fully practical model-free method which only scales in the number of features, and not in the size of the underlying MDP.  ...  To get a regret bound for POLITEX, we also need a bound on V_T and W_T. We bound these terms under the assumption that all policies mix at the same speed.  ... 
dblp:conf/icml/X19 fatcat:kk7bwiqev5h6zh6hg3efij3dui
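
The update described in this abstract is simple enough to sketch: each phase's policy is a Boltzmann (softmax) distribution over the running sum of the action-value estimates of all previous policies. The numpy snippet below is a minimal illustration under that reading, not the authors' code; the temperature `eta` and the toy Q-estimates are hypothetical.

```python
import numpy as np

def politex_policy(q_estimates, eta=0.1):
    """Boltzmann policy over the *sum* of past action-value estimates.

    q_estimates: list of arrays of shape (num_states, num_actions),
                 one estimate per previous phase (illustrative values).
    eta:         hypothetical inverse-temperature parameter.
    Returns an array of shape (num_states, num_actions) whose rows are
    probability distributions over actions.
    """
    q_sum = np.sum(q_estimates, axis=0)           # sum over phases
    logits = eta * q_sum
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy usage: 3 phases of Q-estimates for 2 states and 3 actions.
rng = np.random.default_rng(0)
q_history = [rng.normal(size=(2, 3)) for _ in range(3)]
print(politex_policy(q_history))
```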

The Advantage Regret-Matching Actor-Critic [article]

Audrūnas Gruslys, Marc Lanctot, Rémi Munos, Finbarr Timbers, Martin Schmid, Julien Perolat, Dustin Morrill, Vinicius Zambaldi, Jean-Baptiste Lespiau, John Schultz, Mohammad Gheshlaghi Azar, Michael Bowling (+1 others)
2020 arXiv   pre-print
These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produces a new policy.  ...  In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior.  ...  In the single-agent setting, ARMAC is related to POLITEX [1] , except that it is based on regret-matching [17] and it predicts average quantities rather than explicitly summing over all the experts  ... 
arXiv:2008.12234v1 fatcat:tfuusoghqjbcjkyn2rrnxkaauy
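
The snippet above mentions combining predicted conditional advantages with regret matching. As a rough sketch (not the paper's implementation), regret matching plays each action with probability proportional to the positive part of its predicted regret; the function name, the toy input, and the fallback to uniform play are illustrative choices.

```python
import numpy as np

def regret_matching_policy(advantages):
    """Map per-action advantage/regret estimates to a policy.

    advantages: array of shape (num_actions,) with predicted regrets
                (illustrative values; in ARMAC these would come from
                a learned critic).
    """
    positive = np.maximum(advantages, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # If no action has positive regret, fall back to uniform play.
    return np.full_like(advantages, 1.0 / advantages.size)

print(regret_matching_policy(np.array([0.5, -0.2, 1.5])))  # -> [0.25 0.   0.75]
```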

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation [article]

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
2021 arXiv   pre-print
Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given  ...  Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal O(√(T)) regret and another computationally efficient  ...  Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692-3702, 2019a.  ... 
arXiv:2007.11849v2 fatcat:4eycsrb6obcz7ow3xfasggl7ru

Average-reward model-free reinforcement learning: a systematic review and literature mapping [article]

Vektor Dewanto, George Dunn, Ali Eshragh, Marcus Gallagher, Fred Roosta
2021 arXiv   pre-print
Motivated by the solo survey by Mahadevan (1996a), we provide an updated review of work in this area and extend it to cover policy-iteration and function approximation methods (in addition to the value-iteration  ...  We also identify and discuss opportunities for future work.  ...  Acknowledgments We thank Aaron Snoswell, Nathaniel Du Preez-Wilkinson, Jordan Bishop, Russell Tsuchida, and Matthew Aitchison for insightful discussions that helped improve this paper.  ... 
arXiv:2010.08920v2 fatcat:hmjm7djacncc7gh6jqeglm4iri

Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning

Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist
2020 Neural Information Processing Systems  
We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values.  ...  Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.  ...  Politex [1] is a PI scheme for the average reward case, building upon prediction with expert advice. In the discounted case, it is DA-MPI(λ,0), w/o.  ... 
dblp:conf/nips/VieillardKSPMG20 fatcat:hs3wz355qjedlnokjxys7nosli
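
The "implicitly averages q-values" claim can be checked numerically under the standard mirror-descent reading of KL regularization: a greedy step penalized by KL to the previous policy gives pi_{k+1} ∝ pi_k · exp(eta * q_k), which unrolls to a softmax of the sum of past q-estimates (an average, up to rescaling the temperature by the iteration count), i.e. the POLITEX-style update the snippet refers to. The code below is a sketch with made-up q-values and step size, not the paper's experiments.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

eta = 0.5
rng = np.random.default_rng(1)
qs = [rng.normal(size=4) for _ in range(5)]   # illustrative q-estimates

# Recursive KL-regularized update: pi_{k+1} proportional to pi_k * exp(eta * q_k).
pi = np.full(4, 0.25)
for q in qs:
    pi = pi * np.exp(eta * q)
    pi = pi / pi.sum()

# Unrolled form: softmax of the sum of past q-estimates (uniform prior).
pi_unrolled = softmax(eta * np.sum(qs, axis=0))

print(np.allclose(pi, pi_unrolled))  # True
```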

Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation [article]

Andrea Zanette, Ching-An Cheng, Alekh Agarwal
2021 arXiv   pre-print
However, the same properties also make them slow to converge and sample inefficient, as the on-policy requirement precludes data reuse and the incremental updates couple large iteration complexity into  ...  These characteristics have been observed in experiments as well as in theory in the recent work of , which provides a policy optimization method PCPG that can robustly find near optimal policies for approximately  ...  Acknowledgments The authors are grateful to the reviewers for their helpful comments.  ... 
arXiv:2103.12923v2 fatcat:azktjwvo5fg6hhe3quuuqb4jyi

Leverage the Average: an Analysis of KL Regularization in RL [article]

Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist
2021 arXiv   pre-print
We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values.  ...  Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance.  ...  Politex [1] is a PI scheme for the average reward case, building upon prediction with expert advice. In the discounted case, it is DA-MPI(λ,0), w/o.  ... 
arXiv:2003.14089v5 fatcat:xyfl2j5ygjdnbiuzu24mzbc3be

Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy [article]

Zuyue Fu, Zhuoran Yang, Zhaoran Wang
2021 arXiv   pre-print
For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear O(K^-1/2) rate, where K is the number of iterations.  ...  Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic  ...  Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning. Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C. and Weisz, G.  ... 
arXiv:2008.00483v2 fatcat:neaegxkea5gzhhaiclzqvtjvka
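
A minimal tabular sketch of the single-timescale pattern described in the abstract, assuming made-up dynamics and step sizes (this is not the paper's algorithm or its analysis setting): each iteration applies the Bellman evaluation operator to the critic exactly once, then takes one softmax policy-gradient step with that critic.

```python
import numpy as np

# Illustrative tabular setup; shapes, dynamics, and step size are hypothetical.
rng = np.random.default_rng(2)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                 # reward table
theta = np.zeros((S, A))                     # softmax policy parameters (actor)
Q = np.zeros((S, A))                         # critic estimate

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

alpha = 0.1                                  # illustrative actor step size
for _ in range(200):
    pi = softmax_rows(theta)
    # Critic: apply the Bellman evaluation operator once, Q <- r + gamma * P V^pi.
    v = (pi * Q).sum(axis=1)
    Q = r + gamma * P @ v
    # Actor: per-state softmax policy-gradient step using the current critic.
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    theta = theta + alpha * pi * adv

print(softmax_rows(theta).round(3))
```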

Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently [article]

Asaf Cassel
2020 arXiv   pre-print
On the other hand, we give a lower bound that shows that when the latter condition is violated, square root regret is unavoidable.  ...  is unknown, and when only the state-action transition matrix B is unknown and the optimal policy satisfies a certain non-degeneracy condition.  ...  Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1-26, 2011.  ... 
arXiv:2002.08095v2 fatcat:7ojyorwaufbyrjxa6hgsmt7vta

Logistic Q-Learning [article]

Joan Bas-Serrano, Sebastian Curi, Andreas Krause, Gergely Neu
2021 arXiv   pre-print
The main feature of our algorithm (called QREPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.  ...  of the output policy.  ...  Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692-3702. Abdolmaleki, A., Springenberg, J.  ... 
arXiv:2010.11151v2 fatcat:gxtbbmc2v5arpmqia36q4pahvm

An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective [article]

Yaodong Yang, Jun Wang
2021 arXiv   pre-print
We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new  ...  The idea of these methods is to predict regret directly, and the no-regret algorithm then uses these predictions in place of the true regret to define a sequence of policies.  ...  et al. (2009) proposed the famous MDP-Expert (MDP-E) algorithm, which adopts Hedge (Freund and Schapire, 1997) as the regret minimiser and achieves O(τ³√(T ln|A|)) regret, where τ is the bound on the  ... 
arXiv:2011.00583v3 fatcat:3k3smfqopvejnn4wdpnc2gbpzi
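
For context on the Hedge regret minimiser mentioned in the snippet above: Hedge maintains exponential weights over experts (here, actions) based on cumulative losses. The sketch below is generic, with an illustrative learning rate and toy losses, and is not taken from the surveyed paper.

```python
import numpy as np

def hedge_weights(cumulative_losses, eta=0.1):
    """Exponential-weights (Hedge) distribution over experts.

    cumulative_losses: array of shape (num_experts,) of summed losses
                       observed so far (illustrative values).
    eta:               illustrative learning rate.
    """
    w = np.exp(-eta * (cumulative_losses - cumulative_losses.min()))
    return w / w.sum()

print(hedge_weights(np.array([3.0, 1.0, 2.5])))
```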

Regret Minimization with Function Approximation in Extensive-Form Games

Ryan D'Orazio
2020
Online learning with self-play via Counterfactual Regret Minimization (CFR) is the leading approach for saddle point computation in large games with sequential decision making and imperfect information  ...  For very large games, CFR can be scaled in various dimensions such as sampling, subgame decomposition, and function approximation.  ...  In contrast to f-RCFR, Politex trains models to predict cumulative action values.  ... 
doi:10.7939/r3-040j-9e84 fatcat:p7ldvu442vdh5ps3kwjxvspvly