62,533 Hits in 2.4 sec

Dual Policy Iteration [article]

Wen Sun, Geoffrey J. Gordon, Byron Boots, J. Andrew Bagnell
2019 arXiv   pre-print
In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory.  ...  Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]).  ...  Recently, a new class of API algorithms, which we call Dual Policy Iteration (DPI), has begun to emerge.  ... 
arXiv:1805.10755v2 fatcat:4hm5l5aq4rgk3mzic4oze54ama

An Analysis of Primal-Dual Algorithms for Discounted Markov Decision Processes [article]

Randy Cogill
2016 arXiv   pre-print
Finally, we show that the iterations of the primal-dual algorithm can be interpreted as repeated application of the policy iteration algorithm to a special class of Markov decision processes.  ...  When considered alongside recent results characterizing the computational complexity of the policy iteration algorithm, this observation could provide new insights into the computational complexity of  ...  These recent results for policy iteration provide a promising direction for analyzing the number of iterations required by the primal-dual algorithm.  ... 
arXiv:1601.04175v1 fatcat:hxyxogbaobe3xl2uy3sb7q6tlq

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning [article]

Qingkai Liang, Fanyu Que, Eytan Modiano
2018 arXiv   pre-print
In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while  ...  Existing methods for CMDPs only use on-policy data for dual updates, which results in sample inefficiency and slow convergence.  ...  The off-policy training is executed for 5 × 10 5 primal-dual iterations.  ... 
arXiv:1802.06480v1 fatcat:sc32g4kvxjew5elivuoaofwwuu

The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

Yinyu Ye
2011 Mathematics of Operations Research  
We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm  ...  The result is surprising since the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem, the Simplex (or simple policy-iteration)  ...  by the policy-iteration method after T iterations.  ... 
doi:10.1287/moor.1110.0516 fatcat:elehu5k54jewvcy3fozo6xzrgu
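The classic Howard policy iteration referenced in the entry above alternates exact policy evaluation with greedy improvement. A minimal sketch for a finite discounted MDP follows; the two-state, two-action transition tensor `P[a, s, s']` and reward table `R[s, a]` are made-up toy data for illustration, not taken from any paper listed here.

```python
import numpy as np

# Hypothetical toy MDP: P[a, s, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]      # row s is P[policy[s], s, :]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead on Q(s, a).
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # fixed point reached
            return policy, V
        policy = new_policy
```

Ye's result above is that, for a fixed discount rate, this outer loop terminates after a number of iterations that is strongly polynomial in the numbers of states and actions.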

Stochastic iterative dynamic programming: a Monte Carlo approach to dual control

Adrian M. Thompson, William R. Cluett
2005 Automatica  
Practical exploitation of optimal dual control (ODC) theory continues to be hindered by the difficulties involved in numerically solving the associated stochastic dynamic programming (SDP) problems.  ...  Also, being a generalization of iterative dynamic programming (IDP) to the stochastic domain, the new algorithm exhibits reduced sensitivity to the hyper-state dimension and, consequently, is particularly  ...  Algorithms such as value iteration, policy iteration, Q-learning and neuro-dynamic programming are well-known dynamic programming approaches that employ Monte Carlo sampling in stochastic settings  ... 
doi:10.1016/j.automatica.2004.12.003 fatcat:licxql2z75cbreyqbmcjnrmyku
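Value iteration, named in the snippet above alongside policy iteration and Q-learning, repeatedly applies the Bellman optimality backup until the value estimate stops changing. A minimal sketch for a finite discounted MDP; the two-state toy data `P[a, s, s']` and `R[s, a]` below are illustrative assumptions, not from the cited paper.

```python
import numpy as np

# Made-up toy MDP: P[a, s, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-10):
    V = np.zeros(R.shape[0])
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * E[V(s')].
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)
        V = V_new
```

Because the backup is a gamma-contraction in the sup norm, the loop converges geometrically from any starting value vector.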

On Connections between Constrained Optimization and Reinforcement Learning [article]

Nino Vieillard, Olivier Pietquin, Matthieu Geist
2019 arXiv   pre-print
We link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert Prediction) to Dual Averaging.  ...  We have made this connection clear in three cases: Frank-Wolfe and Conservative Policy Iteration, Mirror Descent and Mirror-Descent Modified Policy Iteration, and Dual Averaging and Politex.  ...  Politex and Dual Averaging Finally, we show a new connection between the recent Politex (Policy Iteration Using Expert Prediction) algorithm [Lazic et al., 2019], and the Dual Averaging (DA, Nesterov  ... 
arXiv:1910.08476v2 fatcat:n3ajeezfcvgghnztbonzstukzy

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time [article]

Mengdi Wang
2017 arXiv   pre-print
By leveraging the value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates.  ...  We show that it finds an ϵ-optimal policy using nearly-linear run time in the worst case.  ...  Let us also mention that there are two prior attempts (by the author of this paper) to use primal-dual iteration for online policy estimation of MDP.  ... 
arXiv:1704.01869v3 fatcat:674xgzshbrhajmzdhlqobcxdde
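The "value-policy duality" these primal-dual entries rely on is the standard linear-programming formulation of a discounted MDP. A sketch in conventional notation (here $\alpha$ is an initial-state distribution and $\mu$ an occupancy measure; the exact normalizations vary by paper):

```latex
% Primal (value) LP:
\min_{v} \; \sum_{s} \alpha(s)\, v(s)
\quad \text{s.t.} \quad
v(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s')
\qquad \forall\, (s,a)

% Dual (occupancy-measure) LP:
\max_{\mu \ge 0} \; \sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) \;=\; \alpha(s') + \gamma \sum_{s,a} P(s' \mid s, a)\, \mu(s,a)
\qquad \forall\, s'
```

An optimal dual solution recovers an optimal policy via $\pi(a \mid s) \propto \mu(s,a)$, which is why sampled primal-dual updates on this saddle-point structure yield policy estimates directly.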

Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning [article]

Yichen Chen, Mengdi Wang
2016 arXiv   pre-print
The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use small storage and have low computational complexity per iteration.  ...  The SPD methods find an absolute-ϵ-optimal policy, with high probability, using O(|S|^4 |A|^2 σ^2 / ((1-γ)^6 ϵ^2)) iterations/samples for the infinite-horizon discounted-reward MDP and O(|S|^4 |A|^2 H^6 σ^2 /  ...  Theorem 4 shows that the averaged dual iterate λ̄ gives a randomized policy that approximates the optimal policy π*.  ... 
arXiv:1612.02516v1 fatcat:ykd32bzwjrgh3k5kkh4az7vdgi

Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems [article]

Mengdi Wang
2017 arXiv   pre-print
The π learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed.  ...  between the value and policy.  ...  Primal-Dual Convergence Each iteration of Algorithm 1 performs a primal-dual update for the minimax problem (3). Our first result concerns the convergence of the primal-dual iteration.  ... 
arXiv:1710.06100v1 fatcat:3xvcmfhqffgo5k42yiukewfi4y

Dual Representations for Dynamic Programming and Reinforcement Learning

Tao Wang, Michael Bowling, Dale Schuurmans
2007 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning  
With this reformulation, we then derive novel dual forms of dynamic programming, including policy evaluation, policy iteration and value iteration.  ...  Finally, we scale these techniques up to large domains by introducing approximation, and develop new approximate off-policy learning algorithms that avoid the divergence problems associated with the primal  ...  Nevertheless, as we will show, there exists a dual form for every standard DP and RL algorithm, including policy evaluation, policy iteration, Bellman iteration, temporal difference (TD) estimation, Sarsa  ... 
doi:10.1109/adprl.2007.368168 fatcat:bqzocyh4fbdgpp7vjs5y4oxcay

Implicit dual control based on particle filtering and forward dynamic programming

David S. Bayard, Alan Schumitzky
2008 International Journal of Adaptive Control and Signal Processing  
The proposed implicit dual control approach is novel in that it combines a particle filter with a policy-iteration method for forward dynamic programming.  ...  Implicit dual control methods synthesize stochastic control policies by systematically approximating the stochastic dynamic programming equations of Bellman, in contrast to explicit dual control methods  ...  DEFINITION 4.1: A policy is said to be a policy iteration with respect to a policy if at every k and I_k they are related as in (4.1). The policy iteration formula (4.1) is  ... 
doi:10.1002/acs.1094 pmid:21132112 pmcid:PMC2994585 fatcat:of2t2irwfnaftlcmesvrubtigm

Approximate dynamic programming using support vector regression

Brett Bethke, Jonathan P. How, Asuman Ozdaglar
2008 2008 47th IEEE Conference on Decision and Control  
This paper presents a new approximate policy iteration algorithm based on support vector regression (SVR).  ...  A key contribution of this paper is to present an extension of the SVR problem to carry out approximate policy iteration by minimizing the Bellman error at selected states.  ...  Fig. 2: Comparison of the optimal policy (blue) with the policy found by the support vector policy iteration algorithm (green) in 5 iterations.  ... 
doi:10.1109/cdc.2008.4739322 dblp:conf/cdc/BethkeHO08 fatcat:rq2h43bfffc67aqzccllac2524


Jong Min Lee, Jay H. Lee
2005 IFAC Proceedings Volumes  
Bellman equation is iterated.  ...  An optimal control policy of a dual adaptive control problem can be derived by solving a stochastic dynamic programming problem, which is computationally intractable using conventional solution methods  ...  Even though the optimal controller will have the desired dual feature, the DP formulation is intractable in all but the simplest cases if the conventional solution approach (e.g., value iteration, policy iteration  ... 
doi:10.3182/20050703-6-cz-1902.00938 fatcat:ecynvoodaja33khb2ajipkaiya

A unified view of entropy-regularized Markov decision processes [article]

Gergely Neu and Anders Jonsson and Vicenç Gómez
2017 arXiv   pre-print
In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et  ...  Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions.  ...  of Mirror Descent and Dual Averaging, respectively, and that both can be interpreted as regularized policy iteration methods.  ... 
arXiv:1705.07798v1 fatcat:zcpgtleogfbw3lbknxdj5tjgim

Design Of Real-Time Implementable Distributed Suboptimal Control: An LQR Perspective

Hassan Jaleel, Jeff S. Shamma
2017 IEEE Transactions on Control of Network Systems  
We show through simulations that the performance under the proposed framework is close to the optimal performance and the suboptimal policy can be efficiently implemented online.  ...  We assume that iter is selected such that μ_iter is a stable policy. For policy μ_iter, the associated cost-to-go function is J_{μ_iter} = lim_{k→∞} T^k_{μ_iter} J.  ...  The dual decomposition algorithm was implemented with number of iterations iter = 2.  ... 
doi:10.1109/tcns.2017.2754362 fatcat:g5iwjjvfz5fmhdz32ogrqq7zgq
Showing results 1 — 15 out of 62,533 results