453 Hits in 11.1 sec

Policy Optimization with Stochastic Mirror Descent

Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, Gang Pan
2022 Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-22)
We prove that the proposed VRMPO needs only O(ε^-3) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization.  ...  This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent.  ...  Curves are smoothed uniformly for visual clarity. ... REINFORCE: policy π_θ(a|s) with parameter θ_0, mirror map ψ, step-size α > 0, epoch size K, m.  ... 
doi:10.1609/aaai.v36i8.20863 fatcat:xtwgwd6bfbcxhl4y5vdlaz6wbu
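
The entry above only lists the algorithm's inputs (policy π_θ, mirror map ψ, step-size α, epoch size K, m). As background, here is a minimal sketch of a mirror-descent policy update under the simplest mirror map, the squared Euclidean norm, in which case the step reduces to plain gradient ascent; it is not the authors' VRMPO procedure, and `policy_grad_estimate` and `theta0` are hypothetical placeholders.

```python
import numpy as np

def mirror_descent_step(theta, grad, alpha):
    """One mirror-descent step with the mirror map psi(x) = 0.5 * ||x||^2.
    For this choice the Bregman proximal step is the identity, so the update
    collapses to plain gradient ascent on the policy objective."""
    return theta + alpha * grad  # ascent: the policy objective is maximized

def optimize_policy(policy_grad_estimate, theta0, alpha=0.01, epochs=100):
    """Iterate the update; policy_grad_estimate(theta) is assumed to return a
    stochastic estimate of the policy gradient (hypothetical helper)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        theta = mirror_descent_step(theta, policy_grad_estimate(theta), alpha)
    return theta
```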

Policy Gradients for Contextual Recommendations

Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, Qing He
2019 The World Wide Web Conference on - WWW '19  
In this work, we put forward Policy Gradients for Contextual Recommendations (PGCR) to solve the problem without those unrealistic assumptions.  ...  The former ensures that PGCR is empirically greedy in the limit, and the latter addresses the trade-off between exploration and exploitation by using the policy network with Dropout as a Bayesian approximation  ...  Assuming the policy π leads to stationary distributions for states and contexts, the unbiased policy gradient is ∇_θ J(π_θ) = m E_{c∼ξ}[∇_θ p_θ(c) Q(c)]  (8), where Q^π(c) := Q^π(s, c) is the state-action  ... 
doi:10.1145/3308558.3313616 dblp:conf/www/PanCTZH19 fatcat:ijck6gknk5dkfd5n2npsybbxmq
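
The gradient in Eq. (8) of the snippet is a score-function (REINFORCE-style) estimator. The sketch below shows the generic Monte-Carlo form of such an estimator for a linear softmax policy over items, written with ∇_θ log p_θ rather than the paper's exact expression; the array shapes and names (`features`, `chosen`, `returns`) are assumptions of this sketch, not the PGCR implementation.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(theta, features, chosen, returns):
    """Monte-Carlo score-function gradient: average of grad log p_theta(c) * Q(c).
    features: (n_items, d) item feature matrix for a linear softmax policy,
    chosen:   indices of the items picked in each round,
    returns:  observed rewards playing the role of Q(c)."""
    grad = np.zeros_like(theta)
    probs = softmax(features @ theta)                    # p_theta(c) for every item
    for c, q in zip(chosen, returns):
        grad_log_p = features[c] - features.T @ probs    # grad of log softmax at c
        grad += grad_log_p * q
    return grad / len(chosen)
```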

Heuristic policies for the stochastic economic lot sizing problem with remanufacturing under service level constraints

Onur A. Kilic, Huseyin Tunc, S. Armagan Tarim
2018 European Journal of Operational Research  
Heuristic policies for the stochastic economic lot sizing problem with remanufacturing under service level constraints. Kilic, Onur A.  ...  Acknowledgments: We thank the editor and three anonymous reviewers for their constructive feedback on earlier drafts of the manuscript. Onur A.  ...  Kilic and Huseyin Tunc are supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant No. MAG-114M389.  ... 
doi:10.1016/j.ejor.2017.12.041 fatcat:w7qla7dqozd3fjqs3bpohdr3ya

Policy Gradients for CVaR-Constrained MDPs [article]

Prashanth L.A.
2014 arXiv   pre-print
We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients and importance sampling.  ...  The algorithms differ in the manner in which they approximate the CVaR estimates/necessary gradients - the first algorithm uses stochastic approximation, while the second employs mini-batches in the spirit  ...  Algorithm 2 PG-CVaR-mB. Input: parameterized policy π_θ(·|·), step-sizes {γ_n, β_n}, non-negative weights {a_n}, mini-batch sizes {m_n}.  ... 
arXiv:1405.2690v1 fatcat:yauljmdttrevtemqtu3ziwgq2e
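
For readers unfamiliar with the risk measure, the following is a minimal sketch of how CVaR at level α can be estimated from a mini-batch of sampled costs, in the spirit of the mini-batch variant named in the snippet; it is a generic empirical estimator, not the paper's PG-CVaR-mB update.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.95):
    """Empirical VaR and CVaR of a mini-batch of costs: VaR is the alpha-quantile,
    CVaR is the mean of the costs at or above that quantile (the worst tail)."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)        # Value-at-Risk
    cvar = costs[costs >= var].mean()      # Conditional Value-at-Risk
    return var, cvar

# Example: CVaR_0.95 of 10,000 simulated trajectory costs.
rng = np.random.default_rng(0)
print(empirical_cvar(rng.normal(size=10_000), alpha=0.95))
```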

Policy Gradients for CVaR-Constrained MDPs [chapter]

L. A. Prashanth
2014 Lecture Notes in Computer Science  
We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini-batches, policy gradients and importance sampling.  ...  The algorithms differ in the manner in which they approximate the CVaR estimates/necessary gradients - the first algorithm uses stochastic approximation, while the second employs mini-batches in the spirit  ...  If ∇Q can be written as an expectation, i.e., ∇Q(η) = E[q(η, X)], then one can hope to estimate this expectation (and hence minimize Q) using a stochastic approximation recursion.  ... 
doi:10.1007/978-3-319-11662-4_12 fatcat:szykydrgrzgs5lpe4cvckyiw6i
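
Written out, the stochastic approximation recursion mentioned at the end of the snippet is the standard Robbins-Monro scheme (a generic form, with γ_n the step-sizes listed among the algorithm's inputs):

```latex
\eta_{n+1} \;=\; \eta_n \;-\; \gamma_n \, q(\eta_n, X_n),
\qquad
\sum_{n} \gamma_n = \infty, \quad \sum_{n} \gamma_n^2 < \infty ,
```

so that the iterates track the gradient flow of Q and converge to a local minimizer under the usual conditions.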

A Convergent Off-Policy Temporal Difference Algorithm [article]

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar
2019 arXiv   pre-print
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation.  ...  However, it has been well established in the literature that off-policy TD algorithms under linear function approximation can diverge.  ...  Since then, there have been a lot of improvements on the GTD algorithm under various settings like prediction, control, and non-linear function approximation [7]-[10].  ... 
arXiv:1911.05697v1 fatcat:sqgxx34bpvd2jbb33yhicaghp4
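
The divergence issue the abstract refers to is what the gradient-TD family was designed to avoid. As background, here is a sketch of one classical member of that family, an off-policy TDC update with linear features and importance-sampling ratio ρ; it illustrates the setting, not the new algorithm proposed in the paper.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma=0.99, alpha=1e-2, beta=1e-1):
    """One TDC (TD with gradient correction) update under linear function approximation.
    theta: value weights, w: auxiliary weights, phi/phi_next: feature vectors of the
    current and next state, rho: importance-sampling ratio pi(a|s) / mu(a|s)."""
    delta = reward + gamma * phi_next @ theta - phi @ theta          # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w
```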

Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning [article]

Brett Daley, Cameron Hickert, Christopher Amato
2021 arXiv   pre-print
This suggests that outdated experiences somehow impact the performance of deep RL, which should not be the case for off-policy methods like DQN.  ...  Deep Reinforcement Learning (RL) methods rely on experience replay to approximate the minibatched supervised learning setting; however, unlike supervised learning where access to lots of training data  ...  Motivation: We begin by showing how bias arises under experience replay with function approximation by comparing Q-Learning [16] with its deep analog, DQN [11].  ... 
arXiv:2102.11319v1 fatcat:uopf7ybtobdaresx3zlzikmotm
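
To make the idea in the title concrete, the sketch below stratifies replay sampling by state so that transitions stored many times are not proportionally over-sampled. It is a minimal illustration of stratified sampling under an assumed hashable state key, not the authors' exact scheme.

```python
import random
from collections import defaultdict

class StratifiedReplayBuffer:
    """Samples a state group uniformly, then a transition within that group,
    instead of sampling transitions uniformly from the whole buffer."""

    def __init__(self):
        self.groups = defaultdict(list)        # state key -> stored transitions

    def add(self, state_key, transition):
        self.groups[state_key].append(transition)

    def sample(self, batch_size):
        keys = list(self.groups.keys())
        return [random.choice(self.groups[random.choice(keys)])
                for _ in range(batch_size)]
```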

Policy Optimization with Stochastic Mirror Descent [article]

Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, Jun Wen, Gang Pan
2022 arXiv   pre-print
We prove that the proposed 𝚅𝚁𝙼𝙿𝙾 needs only 𝒪(ϵ^-3) sample trajectories to achieve an ϵ-approximate first-order stationary point, which matches the best sample complexity for policy optimization.  ...  This paper proposes the 𝚅𝚁𝙼𝙿𝙾 algorithm: a sample-efficient policy gradient method with stochastic mirror descent.  ...  Formally, we are satisfied with finding an ϵ-approximate first-order stationary point (ϵ-FOSP) θ_ϵ such that ‖G^ψ_{α,T(θ_ϵ)}(θ_ϵ)‖² ≤ ϵ. (13) Particularly, for policy optimization (2), we would choose T(θ) =  ... 
arXiv:1906.10462v5 fatcat:tjrmnn7herd6dd3epq37o5x4ya
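
Spelled out, the stationarity criterion in the fragment above reads, in one standard mirror-descent convention (a reconstruction based on the usual definition of the gradient mapping, not a verbatim quote of the paper's Eq. (13)):

```latex
\bigl\| G^{\psi}_{\alpha,\, T(\theta_\epsilon)}(\theta_\epsilon) \bigr\|^{2} \;\le\; \epsilon ,
```

where G^ψ_{α,T} denotes the gradient mapping induced by the mirror map ψ, the step-size α, and the gradient estimator T(θ); it reduces to the ordinary gradient when ψ is the squared Euclidean norm and the problem is unconstrained.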

Non-Parametric Stochastic Policy Gradient with Strategic Retreat for Non-Stationary Environment [article]

Apan Dastider, Mingjie Lin
2022 arXiv   pre-print
In modern robotics, effectively computing optimal control policies under dynamically varying environments poses substantial challenges to off-the-shelf parametric policy gradient methods, such as the  ...  Specifically, our non-parametric kernel-based methodology embeds a policy distribution as the features in a non-decreasing Euclidean space, therefore allowing its search space to be defined as a very high  ...  Related Works: Non-Stationary Reinforcement Learning - Very recently, a lot of attention has been drawn to addressing non-stationary environments through DRL.  ... 
arXiv:2203.14905v1 fatcat:3bwzsnecmvghffqcay4k7kqdia

Least-Squares Policy Iteration

Michail G. Lagoudakis, Ronald Parr
2003 Journal of machine learning research  
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration.  ...  Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration  ...  Acknowledgments We would like to thank Jette Randløv and Preben Alstrøm for making the bicycle simulator available.  ... 
dblp:journals/jmlr/LagoudakisP03 fatcat:5svocd5wxjbejlp7x2fl2qdrh4
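
As a concrete reminder of how LSPI works, the sketch below pairs the LSTDQ least-squares fit with a greedy policy-improvement loop; it follows the standard description of the method, but the helper names (`phi`, `samples`, `actions`) and the ridge term are assumptions of this sketch.

```python
import numpy as np

def lstdq(samples, phi, policy, n_features, gamma=0.99, ridge=1e-6):
    """Fit weights w so that phi(s, a) @ w approximates Q^policy from a fixed batch
    of (s, a, r, s_next) samples."""
    A = ridge * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))         # next action chosen by `policy`
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, n_features, gamma=0.99, n_iters=20):
    """Approximate policy iteration: re-fit Q with LSTDQ, then act greedily on the fit."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w = lstdq(samples, phi, greedy, n_features, gamma)
    return w
```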

On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction [article]

Jiawei Huang, Nan Jiang
2022 arXiv   pre-print
In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density-ratio correction under the function approximation setting, where the objective function  ...  We prove that O-SPIM converges to a stationary point with total complexity O(ϵ^-4), which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.  ...  Among the plethora of works studying off-policy policy evaluation with linear function approximation, [Liu et al., 2020] connected the GTD family and stochastic gradient optimization, and established  ... 
arXiv:2106.00993v2 fatcat:jyp57emhqfdznk7i4qqkdiojfi
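
For orientation, a density-ratio-corrected off-policy policy gradient is usually written as below, re-weighting data from the behavior distribution d^μ by the state-action density ratio; this is the generic form such methods build on, not necessarily the exact O-SPIM objective:

```latex
\nabla_\theta J(\pi_\theta)
  \;=\; \mathbb{E}_{(s,a)\sim d^{\mu}}
  \left[ \frac{d^{\pi_\theta}(s,a)}{d^{\mu}(s,a)}\,
         Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right].
```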

Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning [article]

Lingwei Zhu, Toshinori Kitamura, Takamitsu Matsubara
2022 arXiv   pre-print
CPP leverages this lower bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation.  ...  In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning.  ...  We can first compute Q^{π_K}(s, a) − V^{π_K}(s), ∀ s, a for the current policy, and then update the policy to obtain π_{K+1}(a|s).  ... 
arXiv:2107.05798v3 fatcat:g3uicog4tzhgjph2g6xwlujkby
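
The quoted step (compute Q − V for the current policy, then update the policy) is, in its generic form, a KL-regularized softmax improvement. The sketch below implements that generic step for a tabular policy; it omits CPP's cautious interpolation coefficient and is not the authors' algorithm.

```python
import numpy as np

def kl_regularized_improvement(pi_k, q_k, tau=1.0):
    """pi_{K+1}(a|s) proportional to pi_K(a|s) * exp(A^{pi_K}(s, a) / tau),
    with A = Q - V and V(s) = sum_a pi_K(a|s) * Q(s, a).
    pi_k, q_k: arrays of shape (n_states, n_actions)."""
    v_k = np.sum(pi_k * q_k, axis=1, keepdims=True)    # V^{pi_K}(s)
    advantage = q_k - v_k                               # Q^{pi_K}(s, a) - V^{pi_K}(s)
    unnorm = pi_k * np.exp(advantage / tau)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```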

Learning cost-efficient control policies with XCSF

Didier Marin, Jérémie Decock, Lionel Rigoux, Olivier Sigaud
2011 Proceedings of the 13th annual conference on Genetic and evolutionary computation - GECCO '11  
Furthermore, we show that an additional Cross-Entropy Policy Search method can improve the global performance of the parametric controller.  ...  In this paper we present a method based on the "learning from demonstration" paradigm to get a cost-efficient control policy in a continuous state and action space.  ...  This makes it possible to learn stationary policies from the model.  ... 
doi:10.1145/2001576.2001743 dblp:conf/gecco/MarinDRS11 fatcat:jb67mwejk5huxmtygggbjw2hbq
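
The Cross-Entropy Policy Search mentioned in the snippet refines the parameters of the learned controller. Below is a generic cross-entropy-method loop over controller parameters, given a hypothetical `score(theta)` evaluation; it illustrates the family of methods, not the exact procedure used in the paper.

```python
import numpy as np

def cross_entropy_search(score, dim, iters=50, pop=64, elite_frac=0.2, seed=0):
    """Sample a Gaussian population of parameter vectors, keep the best-scoring
    fraction, and re-fit the Gaussian to those elites."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        thetas = mu + sigma * rng.standard_normal((pop, dim))
        scores = np.array([score(t) for t in thetas])
        elites = thetas[np.argsort(scores)[-n_elite:]]   # highest scores survive
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu
```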

Markovian inventory policy with application to the paper industry

K. Karen Yin, Hu Liu, Neil E. Johnson
2002 Computers and Chemical Engineering  
Using data collected from a large paper manufacturer, we develop inventory policies for the finished products.  ...  algorithm to obtain the optimal policy.  ...  Stochastic processes can be classified by their index, their state space, and other properties such as stationary vs. non-stationary and jump vs. smooth sample path, etc.  ... 
doi:10.1016/s0098-1354(02)00113-8 fatcat:dzztkddlbzccnin4oxqkl7ycee

Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning [article]

Lionel Blondé, Pablo Strasser, Alexandros Kalousis
2022 arXiv   pre-print
We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well.  ...  We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method.  ...  Appendix B for a review of sequential decision making under uncertainty in non-stationary MDPs).  ... 
arXiv:2006.16785v3 fatcat:vtb6fvqrqbf35hnbyzob3utz2u
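
Local Lipschitz-continuity of a learned reward is commonly encouraged with a gradient penalty on the reward network's inputs. The sketch below shows that generic (WGAN-GP-style) penalty in PyTorch; `reward_net`, the input layout, and the target constant k are assumptions of this illustration, not the authors' exact regularizer.

```python
import torch

def lipschitz_penalty(reward_net, states, actions, k=1.0):
    """Penalize input-gradient norms of the learned reward that exceed k,
    encouraging local k-Lipschitz-continuity around the sampled points."""
    inputs = torch.cat([states, actions], dim=-1).detach().requires_grad_(True)
    rewards = reward_net(inputs)
    grads, = torch.autograd.grad(rewards.sum(), inputs, create_graph=True)
    grad_norm = grads.norm(2, dim=-1)
    return ((grad_norm - k).clamp(min=0.0) ** 2).mean()
```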
Showing results 1 — 15 out of 453 results