17,742 Hits in 3.8 sec

Bounds on sample size for policy evaluation in Markov environments [article]

Leonid Peshkin, Sayan Mukherjee
2001 arXiv   pre-print
Typically, the value of a policy is estimated from results of simulating that very policy in the environment.  ...  Stochastic optimization algorithms used in the field rely on estimates of the value of a policy.  ...  This reduction in a sample size could be explained by the fact that the former algorithm uses all trajectories for evaluation of any policy, while the latter uses just a subset of trajectories.  ... 
arXiv:cs/0105027v1 fatcat:xwrrsybtavdlje6juni63oj3ry
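
The snippet above describes the standard Monte Carlo approach of estimating a policy's value by simulating that same policy in the environment. A minimal Python sketch of this idea, assuming a generic `env`/`policy` interface (the `reset`/`step` methods and the `policy(state)` call are placeholders, not the paper's code):

```python
def rollout_return(env, policy, horizon, gamma=0.99):
    """Simulate one trajectory under `policy` and return its discounted return."""
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward, done = env.step(action)  # hypothetical interface
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def monte_carlo_value(env, policy, n_trajectories, horizon):
    """Estimate the policy's value as the mean return over independent rollouts."""
    returns = [rollout_return(env, policy, horizon) for _ in range(n_trajectories)]
    return sum(returns) / len(returns)
```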

Bounds on Sample Size for Policy Evaluation in Markov Environments [chapter]

Leonid Peshkin, Sayan Mukherjee
2001 Lecture Notes in Computer Science  
This implies both that the same set of samples can be evaluated on any hypothesis, and that the observed error is a good estimate of the true error.  ...  However in many cases the environment state is described by a vector of several variables, which makes the environment state size exponential in the number of variables.  ...  The reduction in a sample size could be explained by the fact that the former algorithm uses all trajectories for evaluation of any policy, while the latter uses just a subset of trajectories.  ... 
doi:10.1007/3-540-44581-1_41 fatcat:4zs2nlmw25bipdkpadv277vo6e
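
The sample-size bounds in this entry are of the concentration-inequality type: how many independent trajectories suffice for the empirical value estimate to be within ε of the true value with high probability. A simplified Hoeffding-style calculation of such a bound (illustrative only; not the paper's exact constants):

```python
import math

def hoeffding_sample_size(return_range, epsilon, delta):
    """Number of i.i.d. trajectories so that the empirical mean return deviates
    from the true value by at most epsilon with probability 1 - delta, for
    returns bounded in an interval of width `return_range` (Hoeffding bound)."""
    return math.ceil((return_range ** 2) * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# e.g. returns in [0, 1], accuracy 0.05, confidence 95%
print(hoeffding_sample_size(1.0, 0.05, 0.05))  # -> 738
```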

Verified Probabilistic Policies for Deep Reinforcement Learning [article]

Edoardo Bacci, David Parker
2022 arXiv   pre-print
In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs  ...  Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent's interaction with its environment.  ...  As can be seen, the various configurations result in different safety probability bounds and runtimes for the same environments, so we are primarily interested in the impact that these choices have on  ... 
arXiv:2201.03698v1 fatcat:6q6tle2d45aphn6gicqci7h5f4

Learning from Scarce Experience [article]

Leonid Peshkin, Christian R. Shelton
2002 arXiv   pre-print
The latter performs optimization on this estimate. We show positive empirical results and provide the sample complexity bound.  ...  Searching the space of policies directly for the optimal policy has been one popular method for solving partially observable reinforcement learning problems.  ...  ACKNOWLEDGMENTS The authors would like to thank Leslie Kaelbling for helpful discussions and comments on the manuscript. C.S. was supported by grants from ONR contracts Nos.  ... 
arXiv:cs/0204043v1 fatcat:x52vyrqrcnd2jcocd6qfyd5wau
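
The estimator discussed here reuses trajectories collected under one policy to evaluate others, which is typically done with importance sampling. A basic, undiscounted importance-sampling sketch, assuming hypothetical `eval_policy_prob` and `behavior_policy_prob` callables; the paper's own estimator may differ in weighting and normalisation:

```python
def importance_sampled_value(trajectories, eval_policy_prob, behavior_policy_prob):
    """Estimate the value of an evaluation policy from trajectories collected
    under a behavior policy, reweighting each trajectory's return by the
    likelihood ratio of its actions (plain, non-weighted importance sampling).

    Each trajectory is a list of (state, action, reward) tuples; the two
    *_prob arguments return pi(action | state) and must be positive."""
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            weight *= eval_policy_prob(state, action) / behavior_policy_prob(state, action)
            ret += reward
        total += weight * ret
    return total / len(trajectories)
```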

Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting [article]

Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu
2018 arXiv   pre-print
In this paper, in the realistic Markov setting, we derive the finite sample bounds for the general convex-concave saddle point problems, and hence for the GTD algorithms.  ...  To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in Markov setting.  ...  In Theorem 1, we present our finite sample bound for the general convex-concave saddle point problem; in Theorem 2, we provide the finite sample bounds for GTD algorithms in both on-policy and off-policy  ... 
arXiv:1809.08926v1 fatcat:qi7vr52fzvc7zila7kcjdqlih4
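
GTD-style algorithms maintain a second set of weights alongside the value parameters, which is what casts policy evaluation as the convex-concave saddle-point problem analysed in this entry. A sketch of one GTD2 update with linear value features (a standard member of the gradient TD family; the feature vectors and the two step sizes are illustrative, not the paper's settings):

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update with linear features.  theta: value weights, w: auxiliary
    weights; alpha and beta are the two time-scale step sizes."""
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    w = w + beta * (delta - phi @ w) * phi                    # auxiliary update
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    return theta, w
```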

Inferring the Optimal Policy using Markov Chain Monte Carlo [article]

Brandon Trabucco, Albert Qu, Simon Li, Ganeshkumar Ashokavardhanan
2019 arXiv   pre-print
In order to resolve these problems, we propose a technique using Markov Chain Monte Carlo to generate samples from the posterior distribution of the parameters conditioned on being optimal.  ...  Existing methods for estimating the optimal stochastic control policy rely on high variance estimates of the policy gradient.  ...  The On Policy MH Algorithm depends on having an accurate estimate of the expected future reward, and this may not be available in certain environments, where the reward samples have noise, or high variance  ... 
arXiv:1912.02714v1 fatcat:mflokdp2kjda7mvrlpdc7z4sxi
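
Sampling policy parameters from a posterior "conditioned on being optimal" can be done with a random-walk Metropolis-Hastings chain. A generic MH sketch over policy parameters; the `log_posterior` argument here is a placeholder (e.g. a scaled estimate of expected return plus a log prior), not the paper's specific target distribution:

```python
import numpy as np

def mh_policy_search(log_posterior, theta_init, n_samples, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings over policy parameters."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta_init, dtype=float)
    logp = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        logp_new = log_posterior(proposal)
        if np.log(rng.random()) < logp_new - logp:   # accept with prob min(1, ratio)
            theta, logp = proposal, logp_new
        samples.append(theta.copy())
    return samples
```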

Fast stochastic motion planning with optimality guarantees using local policy reconfiguration

Ryan Luna, Morteza Lahijanian, Mark Moll, Lydia E. Kavraki
2014 2014 IEEE International Conference on Robotics and Automation (ICRA)  
During the abstraction, an efficient sampling-based method for stochastic optimal control is used to construct several policies within a discrete region of the state space in order for the system to transit  ...  The motion of the system is abstracted to a class of uncertain Markov models known as bounded-parameter Markov decision processes (BMDPs).  ...  Vardi for his helpful discussions and insights, as well as Ryan Christiansen and the other Kavraki Lab members for valuable input on this work.  ... 
doi:10.1109/icra.2014.6907293 dblp:conf/icra/LunaLMK14 fatcat:3zev3k23dvbazmf735bbwnfbru

PAC-Bayesian Policy Evaluation for Reinforcement Learning [article]

Mahdi Milani Fard, Joelle Pineau, Csaba Szepesvari
2012 arXiv   pre-print
We show how this bound can be used to perform model-selection in a transfer learning scenario.  ...  This paper introduces the first PAC-Bayesian bound for the batch reinforcement learning problem with function approximation.  ...  Acknowledgements This work was supported in part by AICML, AITF (formerly iCore and AIF), the PASCAL2 Network of Excellence under EC (grant no. 216886), the NSERC Discovery Grant program and the National  ... 
arXiv:1202.3717v1 fatcat:eeqovahf3jfn3pq2gl4yizw47y
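
PAC-Bayesian bounds trade empirical error against the KL divergence between a posterior over hypotheses and a prior. The textbook McAllester-style form of such a bound, shown only to illustrate the quantities involved (the paper's RL-specific bound for batch policy evaluation has a different shape):

```python
import math

def pac_bayes_bound(empirical_risk, kl_divergence, n, delta):
    """Generic McAllester-style PAC-Bayes bound: with probability 1 - delta,
    the expected true risk under posterior Q is at most the empirical risk
    plus a slack that grows with KL(Q || P) and shrinks with sample size n."""
    slack = math.sqrt((kl_divergence + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return empirical_risk + slack
```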

Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning [article]

Pratik Ramprasad, Yuantong Li, Zhuoran Yang, Zhaoran Wang, Will Wei Sun, Guang Cheng
2022 arXiv   pre-print
The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical  ...  In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise.  ...  [Figure: (a) sensitivity to initial step size α₀; (b) sensitivity to learning rate parameter η.] On-Policy Value Inference for the FrozenLake RL Environment: Next, we consider the FrozenLake environment from OpenAI  ... 
arXiv:2108.03706v2 fatcat:djoqecbvnbazxi2rabz26k4bfe
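
The online bootstrap idea is to run several randomly perturbed copies of the TD recursion in parallel and use their spread for confidence intervals. A rough multiplier-bootstrap sketch for linear TD(0); the perturbation scheme, step size, and feature handling here are assumptions for illustration, not the authors' algorithm:

```python
import numpy as np

def bootstrap_td_step(thetas, phi, phi_next, reward, gamma, alpha, rng):
    """Update B perturbed linear TD(0) estimates; each replicate scales its
    step with an independent random multiplier (mean 1), so the spread of the
    replicates can drive online confidence intervals."""
    for b in range(len(thetas)):
        delta = reward + gamma * phi_next @ thetas[b] - phi @ thetas[b]
        weight = rng.exponential(1.0)          # random multiplier
        thetas[b] += alpha * weight * delta * phi
    return thetas
```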

Evaluating the Performance of Reinforcement Learning Algorithms [article]

Scott M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, Philip S. Thomas
2020 arXiv   pre-print
Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning.  ...  both on a single environment and when aggregated across environments.  ...  on various versions of this manuscript.  ... 
arXiv:2006.16958v2 fatcat:wnppcnymcfafffsaxa46o5veoi

Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits [article]

Qinghua Liu, Yuanhao Wang, Chi Jin
2022 arXiv   pre-print
While most existing works in Markov games focus exclusively on the former objective, it remains open whether we can achieve both objectives simultaneously.  ...  To address this problem, this work studies no-regret learning in Markov games with adversarial opponents when competing against the best fixed policy in hindsight.  ...  Acknowledgements We thank Zhuoran Yang for valuable discussions.  ... 
arXiv:2203.06803v2 fatcat:bhcp63awpvgx7nfwncpqu5pe4q

Unknown mixing times in apprenticeship and reinforcement learning [article]

Tom Zahavy, Alon Cohen, Haim Kaplan, Yishay Mansour
2020 arXiv   pre-print
In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound.  ...  We derive and analyze learning algorithms for apprenticeship learning, policy evaluation, and policy gradient for average reward criteria.  ...  In this case, the learner has to know a bound on the diameter in order to bound the sample complexity.  ... 
arXiv:1905.09704v2 fatcat:7hfh5ygulzcrfjzocpzh3q464e

Bayesian Reinforcement Learning via Deep, Sparse Sampling [article]

Divya Grover, Debabrota Basu, Christos Dimitrakakis
2020 arXiv   pre-print
We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal policy, with a lower computational  ...  Experimental results on different environments show that in comparison to the state-of-the-art, our algorithm is both computationally more efficient, and obtains significantly higher reward in discrete  ...  We choose Policy Iteration (PI) and a variant of Real Time Dynamic Programming (RTDP) for different sizes of environments.  ... 
arXiv:1902.02661v4 fatcat:bwofz6wmjfamdm7avnwkrzcfm4

Value Function Approximation in Zero-Sum Markov Games [article]

Michail Lagoudakis, Ron Parr
2012 arXiv   pre-print
We demonstrate the viability of value function approximation for Markov games by using the Least-Squares Policy Iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control  ...  We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games.  ...  Acknowledgements We are grateful to Carlos Guestrin and Michael Littman for helpful discussions. Michail G. Lagoudakis was partially supported by the Lilian Boudouri Foundation.  ... 
arXiv:1301.0580v1 fatcat:jcezxeag7zhgpkvfovmttq67cm
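
Extending value-based methods such as LSPI from MDPs to zero-sum Markov games replaces the max over actions with the solution of a small matrix game at each state. The standard linear program for that minimax step (a self-contained SciPy sketch; the surrounding LSPI machinery is omitted):

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value and maximizer strategy of a zero-sum matrix game with payoff
    matrix Q (rows: maximizer actions, columns: minimizer actions)."""
    m, n = Q.shape
    # variables: x_1..x_m (mixed strategy) and v (game value); minimise -v
    c = np.concatenate([np.zeros(m), [-1.0]])
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])        # v - x^T Q[:, j] <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # sum_i x_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]                      # game value, mixed strategy
```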

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [article]

Yiming Zhang, Keith W. Ross
2021 arXiv   pre-print
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies.  ...  Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion.  ...  We also thank Shuyang Ling, Che Wang, Zining (Lily) Wang, and Yanqiu Wu for the insightful discussions on this work.  ... 
arXiv:2106.07329v1 fatcat:wfajd73gtbhnndh2mr4lo37ep4
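
Under the average-reward criterion, a policy is scored by its long-run reward per step rather than a discounted sum. A minimal single-trajectory estimate of that quantity, again assuming a generic `env`/`policy` interface as a placeholder rather than the authors' implementation:

```python
def average_reward_estimate(env, policy, n_steps):
    """Estimate the long-run average reward of a policy as the running mean
    reward along one long trajectory (restarting the episode if it ends)."""
    state = env.reset()
    total = 0.0
    for _ in range(n_steps):
        state, reward, done = env.step(policy(state))  # hypothetical interface
        total += reward
        if done:
            state = env.reset()
    return total / n_steps
```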
Showing results 1 — 15 out of 17,742 results