373 Hits in 7.8 sec

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs [article]

Lior Shani and Yonathan Efroni and Shie Mannor
2019-12-12 · arXiv · pre-print
Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of Õ(1/N), much like results in convex optimization.  ...  Then, we consider sample-based TRPO and establish Õ(1/√(N)) convergence rate to the global optimum.  ...  Acknowledgments We would like to thank Amir Beck for illuminating discussions regarding Convex Optimization and Nadav Merlis for helpful comments.  ... 
arXiv:1909.02769v2 · fatcat:r2wzzvszevcsffpbcuxggtujgi

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Lior Shani, Yonathan Efroni, Shie Mannor
2020-04-03 · Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference (AAAI)
Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of Õ(1/N), much like results in convex optimization.  ...  Then, we consider sample-based TRPO and establish Õ(1/√N) convergence rate to the global optimum.  ...  Acknowledgments We would like to thank Amir Beck for illuminating discussions regarding Convex Optimization and Nadav Merlis for helpful comments.  ... 
doi:10.1609/aaai.v34i04.6021 · fatcat:4b6bmrimubejze4a6abanoz2f4

Convex Optimization for Parameter Synthesis in MDPs [article]

Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, Ufuk Topcu
2021-06-30 · arXiv · pre-print
The techniques improve the runtime and scalability by multiple orders of magnitude compared to black-box CCP and SCP by merging ideas from convex optimization and probabilistic model checking.  ...  We develop two approaches that iteratively obtain locally optimal solutions.  ...  Therefore, we are not aware of any convergence rate results for the CCP method in the parameter synthesis problem. b) Convergence properties of SCP: The convergence rate statements of the trust region  ... 
arXiv:2107.00108v1 · fatcat:xmbhqovctzamjgojyt62xj6chy

Policy Optimization for Constrained MDPs with Provable Fast Global Convergence [article]

Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian
2022-02-03 · arXiv · pre-print
Previous results have shown that a primal-dual approach can achieve an 𝒪(1/√(T)) global convergence rate for both the optimality gap and the constraint violation.  ...  We propose a new algorithm called policy mirror descent-primal dual (PMD-PD) algorithm that can provably achieve a faster 𝒪(log(T)/T) convergence rate for both the optimality gap and the constraint violation  ...  We thank Dongsheng Ding and Tengyu Xu for generously sharing their code in [12, 29] as baselines.  ... 
arXiv:2111.00552v2 · fatcat:tzqu43v64rbd3b7iojwvwvy5te

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift [article]

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, Gaurav Mahajan
2020-10-14 · arXiv · pre-print
We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering  ...  However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error  ...  We thank Nan Jiang, Bruno Scherrer, and Matthieu Geist for their comments with regards to the relationship between concentrability coefficients, the condition number, and the transfer error; this discussion  ... 
arXiv:1908.00261v5 · fatcat:y4elqflzebgy7bwl7rbkt3xmb4

Truly Deterministic Policy Optimization [article]

Ehsan Saleh, Saba Ghaffari, Timothy Bretl, Matthew West
2022-05-30 · arXiv · pre-print
Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes.  ...  We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is  ...  This led to the design of the Trust Region Policy Optimization (TRPO) [47] algorithm.  ... 
arXiv:2205.15379v1 · fatcat:7rs7epav7zborkmvfuusuqkqlm

RLOC: Terrain-Aware Legged Locomotion using Reinforcement Learning and Optimal Control [article]

Siddhant Gangapurwala, Mathieu Geisert, Romeo Orsolino, Maurice Fallon, Ioannis Havoutis
2020-12-05 · arXiv · pre-print
These policies account for changes in physical parameters and external perturbations.  ...  Additionally, we introduce two ancillary RL policies for corrective whole-body motion tracking and recovery control.  ...  efficiency in comparison to the widely used on-policy alternatives, trust region policy optimization (TRPO) [42] and proximal policy optimization (PPO) [43] , and has further been demonstrated to perform  ... 
arXiv:2012.03094v1 · fatcat:ynp2g3ng6rbe3jsmqopz6rrwu4

Optimistic Policy Optimization with Bandit Feedback [article]

Yonathan Efroni, Lior Shani, Aviv Rosenberg, Shie Mannor
2020-06-18 · arXiv · pre-print
For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish Õ(√(S^2 A H^4 K)) regret for stochastic rewards.  ...  To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.  ...  Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. arXiv preprint arXiv:1909.02769, 2019. Sutton, R. S., McAllester, D. A., Singh, S.  ... 
arXiv:2002.08243v2 · fatcat:mzkftuzxk5e2ll4cgqon3he7iu

CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [article]

Tengyu Xu, Yingbin Liang, Guanghui Lan
2021-05-31 · arXiv · pre-print
To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an 𝒪(1/√(T)) convergence rate to the global optimal policy  ...  In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy.  ...  Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. arXiv preprint arXiv:1909.02769, 2019.  ... 
arXiv:2011.05869v3 · fatcat:bsuvnjh4y5cw3f46g5a3vsc6la

A Survey of Optimization Methods from a Machine Learning Perspective [article]

Shiliang Sun, Zehui Cao, Han Zhu, Jing Zhao
2019-10-23 · arXiv · pre-print
Finally, we explore and give some challenges and open problems for the optimization in machine learning.  ...  The systematic retrospect and summary of the optimization methods from the perspective of machine learning are of great significance, which can offer guidance for both developments of optimization and  ...  Δ_t > 0 is the radius of the trust region.  ... 
arXiv:1906.06821v2 · fatcat:rcaas4ccpbdffhuvzcg2oryxr4

Learning to Score Behaviors for Guided Policy Optimization [article]

Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, Michael I. Jordan
2020-03-04 · arXiv · pre-print
We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in  ...  We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space.  ...  Trust Region Policy Optimization: Though the original TRPO [31] constructs the trust region based on the KL divergence, we propose to construct the trust region with the WD.  ... 
arXiv:1906.04349v4 · fatcat:xzibkri3efagxov56tpnpcimxq

Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation [article]

Andrea Zanette, Ching-An Cheng, Alekh Agarwal
2021-06-29 · arXiv · pre-print
However, the same properties also make them slow to converge and sample inefficient, as the on-policy requirement precludes data reuse and the incremental updates couple large iteration complexity into  ...  Policy optimization methods are popular reinforcement learning algorithms, because their incremental and on-policy nature makes them more stable than the value-based counterparts.  ...  Acknowledgments The authors are grateful to the reviewers for their helpful comments.  ... 
arXiv:2103.12923v2 · fatcat:azktjwvo5fg6hhe3quuuqb4jyi

Provably Efficient Exploration in Policy Optimization [article]

Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
2020-07-06 · arXiv · pre-print
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.  ...  In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration.  ...  Yang, Yining Wang, and Simon S. Du for helpful discussions.  ... 
arXiv:1912.05830v3 · fatcat:foczqpauincjdn5zmnobz46zta

Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences [article]

Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White
2022-04-18 · arXiv · pre-print
...  our policy optimization algorithms.  ...  Throughout, we highlight that many policy gradient methods can be seen as an instance of API, with either the forward or reverse KL for the policy update, and discuss next steps for understanding and improving  ...  Special thanks as well to Nicolas Le Roux for comments on an earlier version of this work.  ... 
arXiv:2107.08285v2 · fatcat:ci736wwrsnhbxa3xxppnyqb4va

Policy Optimization as Wasserstein Gradient Flows [article]

Ruiyi Zhang, Changyou Chen, Chunyuan Li, Lawrence Carin
2018-08-09 · arXiv · pre-print
Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate  ...  We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows.  ...  Acknowledgements We acknowledge Tuomas Haarnoja et al. for making their code public and thank Ronald Parr for insightful advice. This research was supported in part by DARPA, DOE, NIH, ONR and NSF.  ... 
arXiv:1808.03030v1 · fatcat:i3swiw5wrvdnnk7nry6ijir4rm
Showing results 1–15 out of 373 results