Geometric Insights into the Convergence of Nonlinear TD Learning
[article]
2020
arXiv
pre-print
While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples ...
More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. ...
We also thank our lab mates, especially Will Whitney, Aaron Zweig, and Min Jae Song, who provided useful discussions and feedback. This work was partially supported by the Alfred P. ...
arXiv:1905.12185v4
fatcat:qd5dhbji6ja7fpmbqks435s5pe
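As a concrete illustration of the expected TD(0) dynamics this abstract refers to, the sketch below takes Euler steps on the mean update E[(r + gamma V(s') - V(s)) grad V(s)] for a small nonlinear value model. The toy chain, features, tanh parameterization, and step size are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Minimal sketch of the *expected* TD(0) dynamics on a toy 2-state Markov chain
# with a nonlinear value model V_theta(s) = tanh(theta . phi(s)).
# Chain, features, and rewards are illustrative assumptions.

P = np.array([[0.9, 0.1],      # transition matrix of the Markov chain
              [0.2, 0.8]])
r = np.array([0.0, 1.0])        # expected reward in each state
phi = np.array([[1.0, 0.0],     # feature vector phi(s) for each state
                [0.5, 1.0]])
mu = np.array([0.5, 0.5])       # state distribution used in the expectation
gamma = 0.9

def value(theta):
    return np.tanh(phi @ theta)                               # V_theta(s) for all s

def grad_value(theta):
    return (1.0 - np.tanh(phi @ theta) ** 2)[:, None] * phi   # dV/dtheta, one row per state

def expected_td0_direction(theta):
    """E_mu[(r + gamma * V(s') - V(s)) * grad V(s)], the mean TD(0) update."""
    v = value(theta)
    delta = r + gamma * (P @ v) - v                           # expected TD error per state
    return (mu * delta) @ grad_value(theta)

theta = np.zeros(2)
for _ in range(2000):                                          # Euler steps on the expected dynamics
    theta += 0.1 * expected_td0_direction(theta)
print("theta:", theta, "V:", value(theta))
```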
A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications
2011
Journal of Control Theory and Applications
We then describe some recent research by the authors on approximate policy iteration algorithms that offer convergence guarantees (with technical assumptions) for both parametric and nonparametric architectures ...
We review the literature on approximate dynamic programming, with the goal of better understanding the theory behind practical algorithms for solving dynamic programs with continuous and vector-valued ...
Residual gradient algorithm: To overcome the instability of Q-learning or value iteration when implemented directly with general function approximation, residual gradient algorithms, which perform gradient ...
doi:10.1007/s11768-011-0313-y
fatcat:ea6l7fzscjdbflgrft3b33b7ve
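The residual gradient idea named in this record differentiates the squared TD error through both V(s) and V(s'), whereas semi-gradient TD(0) treats the bootstrapped target as a constant. A minimal one-transition sketch of the distinction, with a linear value model and made-up data, follows; all concrete quantities are illustrative assumptions.

```python
import numpy as np

# One sampled transition (s, r, s') with a linear value model V_w(s) = w . phi(s).
# Semi-gradient TD(0) ignores the dependence of the target on w; the residual
# gradient step differentiates through V_w(s') as well.

gamma, alpha = 0.9, 0.05
phi_s  = np.array([1.0, 0.0])
phi_s2 = np.array([0.0, 1.0])
reward = 0.5
w = np.zeros(2)

delta = reward + gamma * w @ phi_s2 - w @ phi_s           # TD error on this transition

w_semi = w + alpha * delta * phi_s                        # semi-gradient TD(0) step
w_rg   = w + alpha * delta * (phi_s - gamma * phi_s2)     # residual gradient step
print(w_semi, w_rg)
```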
Faster Gradient-TD Algorithms
2013
Gradient-TD methods are a new family of learning algorithms that are stable and convergent under a wider range of conditions than previous reinforcement learning algorithms. ...
In this thesis, we examine this slowness through on- and off-policy experiments and introduce several variations of existing gradient-TD algorithms in search of faster gradient-TD methods. ...
These algorithms solve the problem of gradient-TD methods being slower than conventional-TD methods on on-policy problems and show promise in providing faster convergence on off-policy problems. ...
doi:10.7939/r3js95
fatcat:l7yxrw764zarrovhg6tvbqjiyq
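Gradient-TD methods of the kind studied here maintain a second weight vector that estimates the expected TD error and use it to correct the main update. Below is a TDC-style sketch for linear features on a single off-policy transition; the features, importance ratio, and step sizes are illustrative assumptions and not taken from the thesis.

```python
import numpy as np

# TDC-style gradient-TD update on one transition (s, r, s') with linear
# features and importance-sampling ratio rho. All numbers are illustrative.

gamma, alpha, beta = 0.9, 0.05, 0.1
phi, phi_next = np.array([1.0, 0.5]), np.array([0.0, 1.0])
rho, reward = 1.2, 0.3                       # importance ratio and sampled reward
theta = np.zeros(2)                          # value weights
w = np.zeros(2)                              # auxiliary weights (estimate of expected TD error given s)

delta = reward + gamma * theta @ phi_next - theta @ phi
theta += alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))   # corrected TD step
w     += beta  * rho * (delta - phi @ w) * phi                        # auxiliary-weight step
print(theta, w)
```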
Beyond Target Networks: Improving Deep Q-learning with Functional Regularization
[article]
2022
arXiv
pre-print
This leads to a faster yet more stable training method. ...
We analyze the convergence of our method theoretically and empirically validate our predictions on simple environments as well as on a suite of Atari environments. ...
While we have a theoretical understanding (Schoknecht & Merke, 2003) of why and how TD(0) may converge faster than its symmetric alternative, residual gradient (Baird, 1995), this is not the case for ...
arXiv:2106.02613v3
fatcat:r55ll26mr5b6xohq4plplx7nci
TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
[article]
2020
arXiv
pre-print
... always be better than SGD. ...
Our theoretical findings demonstrate that including this additional preconditioning information is, surprisingly, comparable to normal semi-gradient TD if the optimal learning rate is found for both via ...
TD(0) converges provably faster than the residual gradient algorithm. In International Conference on Machine Learning, pp. 680-687, 2003.
Sutton, R. S. ...
arXiv:2007.02786v1
fatcat:iud2qvbqyzegpa2mkqem4rehji
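The Jacobi preconditioning asked about in this title amounts to scaling the TD update element-wise by an estimate of the diagonal of the update's Jacobian. The sketch below shows a generic diagonally preconditioned semi-gradient TD(0) step with linear features; it illustrates the idea only and is not necessarily the exact TDprop update from the paper, and all constants are illustrative assumptions.

```python
import numpy as np

# Generic diagonally ("Jacobi") preconditioned semi-gradient TD(0) with linear
# features: scale the update by the inverse of a running diagonal estimate.
# Illustration of the idea in the title, not the paper's exact algorithm.

gamma, alpha, eps, decay = 0.9, 0.1, 1e-8, 0.99
theta = np.zeros(2)
diag = np.zeros(2)                                   # running diagonal estimate

def td_step(theta, diag, phi, reward, phi_next):
    delta = reward + gamma * theta @ phi_next - theta @ phi
    g = delta * phi                                  # semi-gradient TD direction
    h = np.abs(phi * (phi - gamma * phi_next))       # |diagonal| of the per-sample TD Jacobian phi (phi - gamma phi')^T
    diag = decay * diag + (1 - decay) * h
    theta = theta + alpha * g / (diag + eps)         # element-wise preconditioned step
    return theta, diag

theta, diag = td_step(theta, diag, np.array([1.0, 0.0]), 0.5, np.array([0.0, 1.0]))
print(theta)
```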
Two Timescale Convergent Q-learning for Sleep--Scheduling in Wireless Sensor Networks
[article]
2014
arXiv
pre-print
Our proposed algorithm incorporates a policy gradient update using a one-simulation simultaneous perturbation stochastic approximation (SPSA) estimate on the faster timescale, while the Q-value parameter ...
This algorithm, unlike the two-timescale variant, does not possess theoretical convergence guarantees. ...
Our algorithms are simple, efficient and, in the case of the two-timescale on-policy Q-learning based schemes, also provably convergent. ...
arXiv:1312.7292v2
fatcat:ktdruc6fpzerjfalepev576zxm
Two timescale convergent Q-learning for sleep-scheduling in wireless sensor networks
2014
Wireless networks
Our proposed algorithm incorporates a policy gradient update using a one-simulation simultaneous perturbation stochastic approximation estimate on the faster timescale, while the Q-value parameter (arising ...
This algorithm, unlike the two-timescale variant, does not possess theoretical convergence guarantees. ...
Our algorithms are simple, efficient and, in the case of the two-timescale on-policy Q-learning based schemes, also provably convergent. ...
doi:10.1007/s11276-014-0762-6
fatcat:5gcavzxh4bempep7x4uty57p5e
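Both records above rely on a one-simulation (one-measurement) SPSA estimate on the faster timescale: all coordinates are perturbed simultaneously with random plus/minus signs and a single noisy function evaluation yields a per-coordinate gradient estimate. The sketch below demonstrates that estimator on an assumed quadratic objective; the two-timescale coupling with the Q-value updates is not shown.

```python
import numpy as np

# One-measurement ("one-simulation") SPSA gradient estimate: perturb every
# coordinate at once with independent +/-1 signs and use a single noisy
# objective evaluation. Objective, noise level, and constants are assumptions.

rng = np.random.default_rng(0)
theta = np.array([0.5, 1.5, 1.0])

def noisy_objective(x):
    return float(np.sum((x - 1.0) ** 2) + 0.01 * rng.standard_normal())

def spsa_one_measurement_grad(x, c=0.1):
    delta = rng.choice([-1.0, 1.0], size=x.shape)    # simultaneous perturbation
    return noisy_objective(x + c * delta) / (c * delta)

# The estimator is high-variance but nearly unbiased: averaging many
# independent estimates approximately recovers the true gradient 2*(theta - 1).
avg = np.mean([spsa_one_measurement_grad(theta) for _ in range(20000)], axis=0)
print("averaged SPSA estimate:", avg)
print("true gradient         :", 2.0 * (theta - 1.0))
```

In the papers, such an estimate would drive a stochastic-approximation recursion with a faster-decaying step size than the Q-value recursion; only the estimator itself is sketched here.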
Full Gradient DQN Reinforcement Learning: A Provably Convergent Scheme
[article]
2021
arXiv
pre-print
We analyze the DQN reinforcement learning algorithm as a stochastic approximation scheme using the o.d.e. (for 'ordinary differential equation') approach and point out certain theoretical issues. ...
We then propose a modified scheme called Full Gradient DQN (FG-DQN, for short) that has a sound theoretical basis and compare it with the original scheme on sample problems. ...
Acknowledgement The authors are greatly obliged to Prof ...
arXiv:2103.05981v3
fatcat:ejfvc6ps7zdptbwd3awkku4mqq
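FG-DQN is described as replacing the usual semi-gradient update with the full gradient of the Bellman error. The sketch below contrasts the two update directions on a single transition using a linear Q-function; the model and data are illustrative assumptions and do not reproduce the paper's scheme in detail.

```python
import numpy as np

# Semi-gradient DQN-style update (gradient only through Q(s,a), target treated
# as a constant) versus a full-gradient update that also differentiates through
# the bootstrapped max term. Linear Q(s,a) = w[a] . phi(s); data is made up.

gamma, alpha = 0.99, 0.05
n_actions, dim = 2, 3
w = np.zeros((n_actions, dim))
phi_s, a, reward, phi_s2 = np.array([1.0, 0.0, 0.5]), 0, 1.0, np.array([0.0, 1.0, 0.5])

q_next = w @ phi_s2
a_star = int(np.argmax(q_next))                         # greedy action at s'
delta = reward + gamma * q_next[a_star] - w[a] @ phi_s  # Bellman/TD error

w_semi = w.copy()
w_semi[a] += alpha * delta * phi_s                      # semi-gradient step

w_full = w.copy()                                       # gradient of 0.5 * delta^2
w_full[a] += alpha * delta * phi_s
w_full[a_star] -= alpha * delta * gamma * phi_s2        # extra term through the max
print(w_semi, w_full, sep="\n")
```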
Asynchronous Approximation of a Single Component of the Solution to a Linear System
[article]
2019
arXiv
pre-print
Our algorithm relies on the Neumann series characterization of the component x_i, and is based on residual updates. ...
This is equivalent to solving for x_i in x = Gx + z for some G and z such that the spectral radius of G is less than 1. ...
... distributions exhibit faster convergence rates. ...
arXiv:1411.2647v4
fatcat:lobd2mcnmrfulbdkwdgava4am4
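The setup quoted above, solving for a single component x_i of x = Gx + z when the spectral radius of G is below 1, can be illustrated by truncating the Neumann series x = sum_{k>=0} G^k z. The synchronous sketch below shows only that characterization; the paper's asynchronous, residual-based scheme is not reproduced, and the matrix and vector are illustrative assumptions.

```python
import numpy as np

# Estimate a single component x_i of the solution of x = G x + z via the
# Neumann series x = sum_{k>=0} G^k z (convergent when the spectral radius of
# G is below 1). G and z are random illustrative data.

rng = np.random.default_rng(1)
n, i = 50, 7
G = rng.uniform(-1.0, 1.0, size=(n, n))
G *= 0.8 / np.max(np.abs(np.linalg.eigvals(G)))    # rescale so spectral radius is about 0.8
z = rng.standard_normal(n)

row = np.zeros(n)
row[i] = 1.0                                        # row vector e_i^T G^k, starting at k = 0
x_i = 0.0
for _ in range(200):
    x_i += row @ z                                  # accumulate e_i^T G^k z
    row = row @ G
print("Neumann estimate:", x_i)
print("direct solve    :", np.linalg.solve(np.eye(n) - G, z)[i])
```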
PID Accelerated Value Iteration Algorithm
2021
International Conference on Machine Learning
We present the error dynamics of these variants of VI, and provably (for certain classes of MDPs) and empirically (for more general classes) show that the convergence rate can be significantly improved ...
The key insight is the realization that the evolution of the value function approximations (V_k)_{k≥0} in the VI procedure can be seen as a dynamical system. ...
Acknowledgements We would like to thank the anonymous reviewers for their feedback. AMF acknowledges the funding from the Canada CIFAR AI Chairs program. ...
dblp:conf/icml/FarahmandG21
fatcat:3omjpn7pc5b5pcteswcm46jaqm
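The dynamical-system view mentioned in the snippet treats V_{k+1} = T V_k as a discrete-time system whose error dynamics can be reshaped with controller-style feedback terms. The sketch below adds only a simple derivative (momentum-like) term to plain value iteration on an assumed random policy-evaluation problem; it illustrates the view, not the authors' PID-accelerated algorithm, and the gain kappa_d is an assumption.

```python
import numpy as np

# Value iteration V_{k+1} = T V_k as a dynamical system, plus a variant with a
# simple derivative (momentum-like) feedback term. Random MDP and gain assumed.

rng = np.random.default_rng(0)
n, gamma = 20, 0.95
P = rng.dirichlet(np.ones(n), size=n)               # transition matrix under a fixed policy
r = rng.standard_normal(n)
V_true = np.linalg.solve(np.eye(n) - gamma * P, r)

def T(V):                                            # policy-evaluation Bellman operator
    return r + gamma * P @ V

V = np.zeros(n)
for _ in range(100):
    V = T(V)                                         # plain VI
print("plain VI error    :", np.max(np.abs(V - V_true)))

V, V_prev, kappa_d = np.zeros(n), np.zeros(n), 0.2
for _ in range(100):
    V, V_prev = T(V) + kappa_d * (V - V_prev), V     # VI plus a derivative-style term
print("with D-term error :", np.max(np.abs(V - V_true)))
```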
Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms
[article]
2021
arXiv
pre-print
... the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. ...
Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one agent, which naturally falls into the realm of ...
Some other common policy evaluation algorithms with convergence guarantees include gradient TD methods with linear [61, 62, 63] and nonlinear function approximation [64]. ...
arXiv:1911.10635v2
fatcat:ihlhtjlhnrdizbkcfzsnz5urfq
Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence
[article]
2021
arXiv
pre-print
In contrast to the generic (super)linear convergence guarantees for RGN in the literature, we prove the first quadratic convergence guarantee of RGN for low-rank tensor estimation under some mild conditions ...
A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. ...
The simulation studies show RGN offers much faster convergence compared to the existing approaches in the literature. ...
arXiv:2104.12031v2
fatcat:nnqncngurfg2pc23qlwbvjmafq
Temporal Difference Learning with Neural Networks - Study of the Leakage Propagation Problem
[article]
2018
arXiv
pre-print
Temporal-Difference learning (TD) [Sutton, 1988] with function approximation can converge to solutions that are worse than those obtained by Monte-Carlo regression, even in the simple case of on-policy ...
For reversible policies, the result can be interpreted as the tension between two terms of the loss function that TD minimises, as recently described by [Ollivier, 2018]. ...
Sutton et al. [2009b,a] introduce two modified algorithms for TD with linear function approximation that provably converge in the off-policy setting. ...
arXiv:1807.03064v1
fatcat:to63dgnhnbf47ml5zdmyxqnpxi
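The opening claim of this record, that on-policy TD with function approximation can converge to a worse solution than Monte-Carlo regression, can be checked directly on a tiny example: with aliased linear features, the TD(0) fixed point and the Monte-Carlo (least-squares) projection differ, and only the latter minimizes the weighted value error. The chain and features below are illustrative assumptions.

```python
import numpy as np

# Compare the Monte-Carlo (least-squares) solution and the TD(0) fixed point
# under linear function approximation on a small deterministic cycle with
# aliased features. Both are projections of the true values, but only the MC
# solution minimizes the weighted value error.

gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])                      # deterministic 3-state cycle
r = np.array([0.0, 0.0, 1.0])
V_true = np.linalg.solve(np.eye(3) - gamma * P, r)

Phi = np.array([[1.0], [1.0], [0.0]])                # aliased 1-d features
d = np.full(3, 1.0 / 3.0)                            # stationary distribution
D = np.diag(d)

w_mc = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V_true)             # MC regression
w_td = np.linalg.solve(Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi,
                       Phi.T @ D @ r)                                    # TD(0) fixed point

def weighted_err(w):
    return float(np.sqrt(d @ (Phi @ w - V_true) ** 2))

print("MC value error :", weighted_err(w_mc))
print("TD value error :", weighted_err(w_td))
```

In this particular aliased example the TD(0) fixed point sets the shared weight to zero, so its weighted value error is strictly larger than the Monte-Carlo regression error.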
META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning
[article]
2020
arXiv
pre-print
Temporal-Difference (TD) learning is a standard and very successful reinforcement learning approach, at the core of both algorithms that learn the value of a given policy and algorithms that learn ...
To improve the sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner. ...
Note that the convergence of the linear semi-gradient TD(0) algorithm presented in Algorithm 7 does not follow from general results on SGD but from a separate theorem. ...
arXiv:2006.08906v1
fatcat:z4vsafqmm5dqlluues7bv26buy
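The eligibility-trace parameter this paper proposes to adapt enters the standard TD(lambda) update through the trace decay. Below is a minimal sketch of accumulating-trace TD(lambda) with linear features and a state-dependent lambda(s); the meta-learning of lambda itself is not shown, and the chain, features, and lambda values are illustrative assumptions.

```python
import numpy as np

# Accumulating-trace TD(lambda) with linear features and a state-dependent
# trace-decay parameter lambda(s). The meta-update for lambda is not included.

rng = np.random.default_rng(0)
gamma, alpha = 0.95, 0.05
phi = np.array([[1.0, 0.0], [0.7, 0.3], [0.0, 1.0]])   # features of 3 states
lam = np.array([0.9, 0.5, 0.0])                        # state-dependent lambda(s)
P = np.array([[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]])
r = np.array([0.0, 0.0, 1.0])

theta, e = np.zeros(2), np.zeros(2)
s = 0
for _ in range(5000):
    s_next = rng.choice(3, p=P[s])
    delta = r[s] + gamma * theta @ phi[s_next] - theta @ phi[s]
    e = gamma * lam[s] * e + phi[s]                    # trace decays with lambda of the current state
    theta += alpha * delta * e
    s = s_next
print("theta:", theta)
```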
RISE: An Incremental Trust-Region Method for Robust Online Sparse Least-Squares Estimation
2014
IEEE Transactions on robotics
As a trust-region method, RISE is naturally robust to objective function nonlinearity and numerical ill-conditioning, and is provably globally convergent for a broad class of inferential cost functions ...
Consequently, RISE maintains the speed of current state-of-the-art online sparse least-squares methods while providing superior reliability. ...
ACKNOWLEDGMENTS The authors would like to thank F. Dellaert and R. Roberts for the RISE2 implementation in the GTSAM library. ...
doi:10.1109/tro.2014.2321852
fatcat:7p2fgpqchbb3fea4l5yct4wyai