Variance-Aware Off-Policy Evaluation with Linear Function Approximation [article]

Yifei Min and Tianhao Wang and Dongruo Zhou and Quanquan Gu
2022 arXiv   pre-print
We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected  ...  More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman  ...  The main contributions of this paper are summarized as follows: • We develop VA-OPE (Variance-Aware Off-Policy Evaluation), an algorithm for OPE that effectively utilizes the variance information from  ... 
arXiv:2106.11960v2 fatcat:d6qwsxcddjcrhdagwezsrprm3e

The Mirage of Action-Dependent Baselines in Reinforcement Learning [article]

George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, Sergey Levine
2018 arXiv   pre-print
To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over  ...  Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance.  ...  To fix one deficiency with the value function approximator, we propose a new horizon-aware parameterization of the value function.  ... 
arXiv:1802.10031v3 fatcat:w5fjkx2tunhini6kn6qm32w2ie

Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning [article]

Shangtong Zhang, Bo Liu, Shimon Whiteson
2022 arXiv   pre-print
MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings.  ...  We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable.  ...  In complicated environments with function approximation, pursuing the exact policy that minimizes the variance of the total reward is usually intractable.  ... 
arXiv:2004.10888v6 fatcat:op4ehao6mng5tgfcg65vsgvmnu

Pareto-efficient Acquisition Functions for Cost-Aware Bayesian Optimization [article]

Gauthier Guinet, Valerio Perrone, Cédric Archambeau
2020 arXiv   pre-print
It efficiently tunes machine learning algorithms under the implicit assumption that hyperparameter evaluations cost approximately the same.  ...  finer control over the cost-accuracy trade-off.  ...  Most acquisition functions used in BO implicitly assume that all hyperparameter evaluations cost approximately the same.  ... 
arXiv:2011.11456v2 fatcat:tavpve6sjjd5pgp5fqba2gnbbi

Reinforcement Learning with the Use of Costly Features [chapter]

Robby Goetschalckx, Scott Sanner, Kurt Driessens
2008 Lecture Notes in Computer Science  
To this end, we introduce a new cost-sensitive sparse linear regression paradigm for value function approximation in reinforcement learning where the learner is able to select only those costly features  ...  For example, search-based features may be useful for value prediction, but their computational cost must be traded off with their impact on value accuracy.  ...  To do this, we approximate the value function using cost-sensitive sparse linear regression techniques, trading off prediction errors with the costs induced by using a feature.  ... 
doi:10.1007/978-3-540-89722-4_10 fatcat:js6bji2qyjbcxizohb2lsehgie

RAPTOR: End-to-end Risk-Aware MDP Planning and Policy Learning by Backpropagation [article]

Noah Patton, Jihwan Jeong, Michael Gimelfarb, Scott Sanner
2021 arXiv   pre-print
We address this shortcoming by optimizing for risk-aware Deep Reactive Policies (RaDRP) in our framework.  ...  We evaluate and compare these two forms of RAPTOR on three highly stochastic domains, including nonlinear navigation, HVAC control, and linear reservoir control, demonstrating the ability to manage risk  ...  Also, the optimal policy $\pi^*_t$ can be well-approximated provided that $\pi_\theta$ is a sufficiently expressive class of function approximators.  ... 
arXiv:2106.07260v1 fatcat:3fg4fwrzevhp7ayizd4etdfp5a

Multiplicative Controller Fusion: Leveraging Algorithmic Priors for Sample-efficient Reinforcement Learning and Safe Sim-To-Real Transfer

Krishan Rana, Vibhavari Dasagi, Ben Talbot, Michael Milford, Niko Sunderhauf
2020 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)  
The end-to-end based approach is shown to exhibit the worst performance with very high variance. The baseline approach also shows very high variance and is shown to converge to a suboptimal policy.  ...  and an unseen environment to evaluate its performance when presented with unknown states.  ... 
doi:10.1109/iros45743.2020.9341372 fatcat:57nxu3gtxva7njwhifdemki6em

Learning Value Functions in Deep Policy Gradients using Residual Variance [article]

Yannis Flet-Berliac, Reda Ouhamma, Odalric-Ambrym Maillard, Philippe Preux
2021 arXiv   pre-print
In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than  ...  Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.  ...  the bias-variance trade-off towards reducing the variance while introducing a small bias.  ... 
arXiv:2010.04440v3 fatcat:z2d6bty4rvahvmuzvhjh2quo6y

Gradient-Aware Model-Based Policy Search

Pierluca D'Oro, Alberto Maria Metelli, Andrea Tirinzoni, Matteo Papini, Marcello Restelli
Then, we integrate this procedure into a batch policy improvement algorithm, named Gradient-Aware Model-based Policy Search (GAMPS), which iteratively learns a transition model and uses it, together with  ...  the approximate transition model.  ...  Therefore, the MVG limits the bias effect of $p$ to the Q-function approximation $Q^{\pi,p}$. At the same time, it enjoys a smaller variance w.r.t. a Monte Carlo estimator, especially in an off-policy setting  ... 
doi:10.1609/aaai.v34i04.5791 fatcat:amvdhtdpqzgclc6ce4upl4lz3u

Adaptive Estimator Selection for Off-Policy Evaluation [article]

Yi Su, Pavithra Srinath, Akshay Krishnamurthy
2020 arXiv   pre-print
We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings.  ...  In both case studies, our method compares favorably with existing methods.  ...  For off-policy evaluation, we are not aware of any other approaches that achieve any form of oracle inequality.  ... 
arXiv:2002.07729v2 fatcat:lvmmwowrezgiraflrlxxxdffpm

Stochastic Variance Reduction Methods for Policy Evaluation [article]

Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, Dengyong Zhou
2017 arXiv   pre-print
In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset.  ...  Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy.  ...  In this paper, we study policy evaluation by minimizing the mean squared projected Bellman error (MSPBE) with linear approximation of the value function.  ... 
arXiv:1702.07944v2 fatcat:iiu3yrlnsjbvta7d7sspeeky6m

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach [article]

James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras
2020 arXiv   pre-print
The resulting algorithm, Uncertainty-Aware Trust Region Policy Optimization, generates robust policy updates that adapt to the level of uncertainty present throughout the learning process.  ...  When combined with small sample sizes, these methods can result in unstable learning due to their reliance on high-dimensional sample-based estimates.  ...  Note that (10) is a minimization of a linear function of u subject to a convex quadratic constraint in u.  ... 
arXiv:2012.10791v1 fatcat:euqqdcoi4rea3i52ww7p3poyte

Uncertainty aware grasping and tactile exploration

Stanimir Dragiev, Marc Toussaint, Michael Gienger
2013 2013 IEEE International Conference on Robotics and Automation  
When we develop algorithms for grasping with robotic hands, it is not enough to assume the best estimate of the environment; if there is a measure of uncertainty, we need to account for it.  ...  This paper presents a control law which augments a grasp controller with the ability to prefer known or unseen regions of an object; this leads to the introduction of two motion primitives: an explorative  ...  In order to define a policy for such grasp series, we need to decide what to trade off. We can support such decisions by local and global variance measures.  ... 
doi:10.1109/icra.2013.6630564 dblp:conf/icra/DragievTG13 fatcat:lry6pzrq3jcd5b5sqkhukcegci

Counterfactual Learning of Stochastic Policies with Continuous Actions: from Models to Offline Evaluation [article]

Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert, Julien Mairal
2021 arXiv   pre-print
along with multiple synthetic, yet realistic, evaluation setups.  ...  Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset  ...  Additionally, we introduce a benchmark protocol for reliably evaluating policies using off-policy evaluation.  ... 
arXiv:2004.11722v5 fatcat:wjalqh2eujc6fhk4j6354wcyqq

Multiplicative Controller Fusion: Leveraging Algorithmic Priors for Sample-efficient Reinforcement Learning and Safe Sim-To-Real Transfer [article]

Krishan Rana, Vibhavari Dasagi, Ben Talbot, Michael Milford, Niko Sünderhauf
2020 arXiv   pre-print
Importantly, the policy can learn to improve beyond the performance of the sub-optimal prior since the prior's influence is annealed gradually.  ...  However, learning long-horizon tasks on real robot hardware can be intractable, and transferring a learned policy from simulation to reality is still extremely challenging.  ...  The end-to-end based approach is shown to exhibit the worst performance with very high variance. The baseline approach also shows very high variance and is shown to converge to a suboptimal policy.  ... 
arXiv:2003.05117v3 fatcat:mla6parifnbh7jbu7cqcfylv6i