Policy Optimization with Stochastic Mirror Descent
2022
Proceedings of the AAAI Conference on Artificial Intelligence, Volume 36 (AAAI-22)
We prove that the proposed VRMPO needs only O(ε^-3) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization. ...
This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. ...
Curves are smoothed uniformly for visual clarity.
REINFORCE policy π_θ(a|s) with parameter θ_0, mirror map ψ, step-size α > 0, epoch sizes K, m. ...
doi:10.1609/aaai.v36i8.20863
fatcat:xtwgwd6bfbcxhl4y5vdlaz6wbu
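For orientation on this entry: the update VRMPO builds on is stochastic mirror descent applied to a policy-gradient estimate. The sketch below is not the authors' implementation; it shows one mirror-descent step under the simplifying assumption of a Euclidean mirror map ψ(x) = 0.5·||x||^2, in which case the proximal step collapses to an ordinary gradient step on a REINFORCE estimate. Function names (reinforce_gradient, grad_log_pi) and hyperparameters are illustrative.

```python
import numpy as np

def reinforce_gradient(theta, trajectories, grad_log_pi):
    """Monte-Carlo policy-gradient estimate: average over trajectories of
    sum_t grad_theta log pi_theta(a_t|s_t) * return_t."""
    grad = np.zeros_like(theta, dtype=float)
    for states, actions, rewards in trajectories:
        returns = np.cumsum(rewards[::-1])[::-1]   # undiscounted returns-to-go
        for s, a, g in zip(states, actions, returns):
            grad += grad_log_pi(theta, s, a) * g
    return grad / len(trajectories)

def mirror_ascent_step(theta, grad, alpha=0.05):
    """With the Euclidean mirror map psi(x) = 0.5 * ||x||^2, the mirror step
    argmin_u <-grad, u> + (1/alpha) * D_psi(u, theta) is a plain gradient step."""
    return theta + alpha * grad
```

With a non-Euclidean mirror map the last line would be replaced by the corresponding Bregman proximal step.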
Policy Gradients for Contextual Recommendations
2019
The World Wide Web Conference on - WWW '19
In this work, we put forward Policy Gradients for Contextual Recommendations (PGCR) to solve the problem without those unrealistic assumptions. ...
The former ensures that PGCR is empirically greedy in the limit, and the latter addresses the trade-off between exploration and exploitation by using the policy network with Dropout as a Bayesian approximation ...
Assuming the policy π leads to stationary distributions for states and contexts, the unbiased policy gradient is ∇_θ J(π_θ) = m E_{c∼ξ}[∇_θ p_θ(c) Q(c)], (8) where Q^π(c) := Q^π(s, c) is the state-action ...
doi:10.1145/3308558.3313616
dblp:conf/www/PanCTZH19
fatcat:ijck6gknk5dkfd5n2npsybbxmq
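In the spirit of the gradient in Eq. (8) quoted above, the following is a minimal, hypothetical score-function estimator for a contextual policy: a linear-softmax policy over candidate items, with the observed reward standing in for Q. It is not the PGCR implementation (which additionally uses Dropout for exploration); names such as pgcr_style_gradient and reward_fn are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pgcr_style_gradient(theta, contexts, reward_fn, rng=np.random.default_rng(0)):
    """Score-function estimate of E_c[ grad_theta log p_theta(a|c) * Q(c, a) ] for a
    linear-softmax policy over candidate items (illustrative, not the PGCR code)."""
    grad = np.zeros_like(theta, dtype=float)
    for c in contexts:                     # c: (num_candidates, dim) feature matrix
        probs = softmax(c @ theta)
        a = rng.choice(len(probs), p=probs)
        q = reward_fn(c, a)                # observed reward used as a Q(c, a) estimate
        score = c[a] - probs @ c           # grad_theta log softmax score function
        grad += score * q
    return grad / len(contexts)
```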
Heuristic policies for the stochastic economic lot sizing problem with remanufacturing under service level constraints
2018
European Journal of Operational Research
Heuristic policies for the stochastic economic lot sizing problem with remanufacturing under service level constraints. Kilic, Onur A. ...
Acknowledgments We thank the editor and three anonymous reviewers for their constructive feedback on earlier drafts of the manuscript. Onur A. ...
Kilic and Huseyin Tunc are supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under Grant No. MAG-114M389. ...
doi:10.1016/j.ejor.2017.12.041
fatcat:w7qla7dqozd3fjqs3bpohdr3ya
Policy Gradients for CVaR-Constrained MDPs
[article]
2014
arXiv
pre-print
We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini batches, policy gradients and importance sampling. ...
The algorithms differ in the manner in which they approximate the CVaR estimates/necessary gradients - the first algorithm uses stochastic approximation, while the second employs mini-batches in the spirit ...
Algorithm 2 PG-CVaR-mB. Input: parameterized policy π_θ(·|·), step-sizes {γ_n, β_n}, non-negative weights {a_n}, mini-batch sizes {m_n}. ...
arXiv:1405.2690v1
fatcat:yauljmdttrevtemqtu3ziwgq2e
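The central quantity the PG-CVaR-mB updates revolve around is a mini-batch estimate of the VaR/CVaR of the trajectory cost. Below is a minimal sketch, assuming per-trajectory costs are already available; it omits the stochastic-approximation updates, importance sampling, and step-size schedules {γ_n, β_n} of the actual algorithm.

```python
import numpy as np

def var_cvar_from_batch(costs, alpha=0.95):
    """Empirical Value-at-Risk and Conditional Value-at-Risk of sampled costs:
    VaR_alpha is the alpha-quantile, CVaR_alpha the mean cost at or beyond it."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)
    tail = costs[costs >= var]
    cvar = tail.mean() if tail.size else var
    return var, cvar

# usage: per-trajectory costs from one mini-batch of rollouts
rng = np.random.default_rng(0)
print(var_cvar_from_batch(rng.normal(10.0, 2.0, size=256)))
```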
Policy Gradients for CVaR-Constrained MDPs
[chapter]
2014
Lecture Notes in Computer Science
We propose two algorithms that obtain a locally risk-optimal policy by employing four tools: stochastic approximation, mini batches, policy gradients and importance sampling. ...
The algorithms differ in the manner in which they approximate the CVaR estimates/necessary gradients - the first algorithm uses stochastic approximation, while the second employs mini-batches in the spirit ...
If ∇Q can be written as an expectation, i.e., ∇Q(η) = E[q(η, X)], then one can hope to estimate this expectation (and hence minimize Q) using a stochastic approximation recursion. ...
doi:10.1007/978-3-319-11662-4_12
fatcat:szykydrgrzgs5lpe4cvckyiw6i
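The stochastic approximation recursion alluded to in the last snippet is the classical Robbins-Monro scheme η_{n+1} = η_n − γ_n q(η_n, X_n). A toy sketch with an illustrative quadratic objective (not taken from the paper) follows.

```python
import numpy as np

def sa_minimize(q_sample, eta0, n_iters=10_000):
    """Robbins-Monro recursion eta_{n+1} = eta_n - gamma_n * q(eta_n, X_n),
    where q(eta, X) is an unbiased sample of grad Q(eta)."""
    eta = float(eta0)
    for n in range(1, n_iters + 1):
        gamma = 1.0 / n                    # sum gamma_n = inf, sum gamma_n^2 < inf
        eta -= gamma * q_sample(eta)
    return eta

# toy objective Q(eta) = 0.5 * E[(eta - X)^2] with X ~ N(3, 1), so grad Q(eta) = eta - E[X]
rng = np.random.default_rng(1)
print(sa_minimize(lambda eta: eta - rng.normal(3.0, 1.0), eta0=0.0))   # converges near 3
```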
A Convergent Off-Policy Temporal Difference Algorithm
[article]
2019
arXiv
pre-print
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. ...
However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. ...
Since then, there have been a lot of improvements on the GTD algorithm under various settings like prediction, control, and non-linear function approximation [7]-[10]. ...
arXiv:1911.05697v1
fatcat:sqgxx34bpvd2jbb33yhicaghp4
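For context on the GTD family mentioned in the snippet, here is a minimal sketch of one TDC-style update with linear features and an importance ratio ρ; it illustrates the two-timescale correction idea, not the specific algorithm proposed in this paper.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho,
               gamma=0.99, alpha=0.01, beta=0.1):
    """One TDC / GTD-family update for off-policy linear TD learning.
    rho = pi(a|s) / mu(a|s) is the importance ratio; phi, phi_next are feature vectors."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi        # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi                     # auxiliary weights
    return theta, w
```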
Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning
[article]
2021
arXiv
pre-print
This suggests that outdated experiences somehow impact the performance of deep RL, which should not be the case for off-policy methods like DQN. ...
Deep Reinforcement Learning (RL) methods rely on experience replay to approximate the minibatched supervised learning setting; however, unlike supervised learning where access to lots of training data ...
MOTIVATION: We begin by showing how bias arises under experience replay with function approximation by comparing Q-Learning [16] with its deep analog, DQN [11]. ...
arXiv:2102.11319v1
fatcat:uopf7ybtobdaresx3zlzikmotm
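To make the Q-Learning/DQN comparison in the MOTIVATION snippet concrete, here is a minimal, generic replay buffer with uniform sampling and the shared bootstrapped target; the paper's stratified sampling scheme is not reproduced here and would only change how sample() draws transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Generic uniform experience replay; a stratified scheme would only change
    how sample() draws transitions from the stored data."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):             # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def bootstrapped_target(r, q_next_max, done, gamma=0.99):
    """The same target used by tabular Q-Learning and by DQN on replayed minibatches:
    r + gamma * max_a' Q(s', a') for non-terminal transitions."""
    return r + (0.0 if done else gamma * q_next_max)
```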
Policy Optimization with Stochastic Mirror Descent
[article]
2022
arXiv
pre-print
We prove that the proposed VRMPO needs only O(ϵ^-3) sample trajectories to achieve an ϵ-approximate first-order stationary point, which matches the best sample complexity for policy optimization. ...
This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. ...
Formally, we are satisfied with finding an ϵ-approximate first-order stationary point (ϵ-FOSP) θ such that ‖G^ψ_{α,T(θ)}(θ)‖^2 ≤ ϵ. (13) Particularly, for policy optimization (2), we would choose T(θ) = ...
arXiv:1906.10462v5
fatcat:tjrmnn7herd6dd3epq37o5x4ya
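As a companion to the ϵ-FOSP criterion in (13) quoted above, the sketch below computes a gradient mapping and the ‖G‖^2 ≤ ϵ test under the simplifying assumption of a Euclidean mirror map (where the mapping reduces to the negative gradient). It is illustrative only; the paper's mirror map ψ and operator T(θ) are more general.

```python
import numpy as np

def euclidean_gradient_mapping(theta, grad, alpha=0.05):
    """Gradient mapping G(theta) = (theta - theta_plus) / alpha, where theta_plus is
    the mirror (here: plain gradient-ascent) step; with psi = 0.5*||.||^2 this is -grad."""
    theta_plus = theta + alpha * grad
    return (theta - theta_plus) / alpha

def is_eps_fosp(theta, grad, eps=1e-3, alpha=0.05):
    """Check the ||G||^2 <= eps stopping test of (13) under the Euclidean simplification."""
    g = euclidean_gradient_mapping(theta, grad, alpha)
    return float(g @ g) <= eps
```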
Non-Parametric Stochastic Policy Gradient with Strategic Retreat for Non-Stationary Environment
[article]
2022
arXiv
pre-print
In modern robotics, effectively computing optimal control policies under dynamically varying environments poses substantial challenges to the off-the-shelf parametric policy gradient methods, such as the ...
Specifically, our non-parametric kernel-based methodology embeds a policy distribution as the features in a non-decreasing Euclidean space, therefore allowing its search space to be defined as a very high ...
RELATED WORKS: Non-Stationary Reinforcement Learning. Very recently, a lot of attention has been drawn to addressing non-stationary environments through DRL. ...
arXiv:2203.14905v1
fatcat:3bwzsnecmvghffqcay4k7kqdia
Least-Squares Policy Iteration
2003
Journal of machine learning research
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. ...
Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration ...
Acknowledgments We would like to thank Jette Randløv and Preben Alstrøm for making the bicycle simulator available. ...
dblp:journals/jmlr/LagoudakisP03
fatcat:5svocd5wxjbejlp7x2fl2qdrh4
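The core of LSPI is the LSTDQ solve that fits linear Q-weights for the current policy from a fixed batch of samples. The sketch below follows the standard A w = b construction; variable names, the ridge term, and the sample format are illustrative rather than the authors' code.

```python
import numpy as np

def lstdq(samples, phi, policy, dim, gamma=0.99, ridge=1e-6):
    """One LSTDQ solve: fit linear Q-weights w (so that Q(s,a) ~ w @ phi(s,a)) for the
    current policy from a fixed batch of (s, a, r, s_next) samples."""
    A = ridge * np.eye(dim)                          # small ridge keeps A invertible
    b = np.zeros(dim)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))         # next action chosen by current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

# LSPI alternates lstdq() with the greedy policy  s -> argmax_a  w @ phi(s, a)
```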
On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction
[article]
2022
arXiv
pre-print
In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction under the function approximation setting, where the objective function ...
We prove that O-SPIM converges to a stationary point with total complexity O(ϵ^-4), which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting. ...
Among the plethora of works studying off-policy policy evaluation with linear function approximation, [Liu et al., 2020] connected the GTD family and stochastic gradient optimization, and established ...
arXiv:2106.00993v2
fatcat:jyp57emhqfdznk7i4qqkdiojfi
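A rough sketch of the idea behind state-action density-ratio correction: re-weight each off-policy gradient term by an estimate of d^π(s,a)/d^μ(s,a). How the ratio is learned (the harder part, and the focus of this line of work) is not shown; ratio_fn, q_fn, and grad_log_pi are assumed to be given.

```python
import numpy as np

def ratio_corrected_gradient(theta, batch, grad_log_pi, q_fn, ratio_fn):
    """Off-policy gradient estimate from behavior-policy data in which each (s, a)
    term is re-weighted by an estimated state-action density ratio d_pi(s,a)/d_mu(s,a).
    ratio_fn is assumed to be learned by a separate procedure (not shown)."""
    grad = np.zeros_like(theta, dtype=float)
    for s, a in batch:
        grad += ratio_fn(s, a) * grad_log_pi(theta, s, a) * q_fn(s, a)
    return grad / len(batch)
```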
Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning
[article]
2022
arXiv
pre-print
CPP leverages this lower bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation. ...
In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. ...
We can first compute Q^{π_K}(s, a) − V^{π_K}(s), ∀s, a for the current policy, and then update the policy to obtain π_{K+1}(a|s). ...
arXiv:2107.05798v3
fatcat:g3uicog4tzhgjph2g6xwlujkby
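The snippet above describes computing the advantage Q^{π_K} − V^{π_K} and then updating the policy. A minimal sketch of the kind of softmax-of-advantage (KL-regularized) update CPP moderates is given below for the tabular case; the paper's cautious interpolation coefficient, which scales the update to guarantee monotonic improvement, is deliberately omitted.

```python
import numpy as np

def softmax_of_advantage_update(pi, q, beta=1.0):
    """Tabular KL-regularized improvement pi_{K+1}(a|s) ~ pi_K(a|s) * exp(A(s,a)/beta),
    where A = Q - V and V(s) = sum_a pi(a|s) Q(s,a). CPP additionally scales such an
    update to keep the improvement monotonic; that scaling is omitted here."""
    v = (pi * q).sum(axis=1, keepdims=True)          # V(s) under the current policy
    new_pi = pi * np.exp((q - v) / beta)             # unnormalized softmax of advantage
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```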
Learning cost-efficient control policies with XCSF
2011
Proceedings of the 13th annual conference on Genetic and evolutionary computation - GECCO '11
Furthermore, we show that an additional Cross-Entropy Policy Search method can improve the global performance of the parametric controller. ...
In this paper we present a method based on the "learning from demonstration" paradigm to get a cost-efficient control policy in a continuous state and action space. ...
This results in the possibility to learn stationary policies from the model. ...
doi:10.1145/2001576.2001743
dblp:conf/gecco/MarinDRS11
fatcat:jb67mwejk5huxmtygggbjw2hbq
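The entry mentions an additional Cross-Entropy Policy Search step to improve the parametric controller. Below is a generic cross-entropy method sketch over controller parameters, assuming a black-box score function; it is not the paper's XCSF-based pipeline, and all names and hyperparameters are illustrative.

```python
import numpy as np

def cross_entropy_search(score_fn, dim, iters=50, pop=64, elite_frac=0.2, seed=0):
    """Generic cross-entropy search over controller parameters: sample a Gaussian
    population, keep the highest-scoring fraction, refit the mean and std."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([score_fn(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]   # top-scoring parameter vectors
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```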
Markovian inventory policy with application to the paper industry
2002
Computers and Chemical Engineering
Using data collected from a large paper manufacturer, we develop inventory policies for the finished products. ...
algorithm to obtain the optimal policy. ...
Stochastic processes can be classified by their index, their state space, and other properties such as stationary vs. non-stationary and jump vs. smooth sample path, etc. ...
doi:10.1016/s0098-1354(02)00113-8
fatcat:dzztkddlbzccnin4oxqkl7ycee
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
[article]
2022
arXiv
pre-print
We show that forcing the learned reward function to be local Lipschitz-continuous is a sine qua non condition for the method to perform well. ...
We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. ...
APPENDIX B for a review of sequential decision making under uncertainty in non-stationary MDPs). ...
arXiv:2006.16785v3
fatcat:vtb6fvqrqbf35hnbyzob3utz2u
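One common way to push a learned reward (discriminator) toward local Lipschitz-continuity, in the spirit of this entry, is a gradient penalty on its inputs. The PyTorch-style sketch below is a generic penalty term, not the paper's exact regularizer; reward_net, the input concatenation, and the target norm of 1.0 are assumptions.

```python
import torch

def gradient_penalty(reward_net, states, actions, target=1.0):
    """Gradient-penalty regularizer pushing the learned reward toward local
    Lipschitz-continuity in its inputs: penalize (||grad_x r(x)||_2 - target)^2."""
    x = torch.cat([states, actions], dim=1).detach().requires_grad_(True)
    r = reward_net(x).sum()                                  # scalar for autograd
    grads = torch.autograd.grad(r, x, create_graph=True)[0]  # d r / d x, per sample
    norms = grads.norm(2, dim=1)
    return ((norms - target) ** 2).mean()
```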
Showing results 1 — 15 out of 453 results