
Importance Sampling Policy Evaluation with an Estimated Behavior Policy [article]

Josiah P. Hanna, Scott Niekum, Peter Stone
2019 arXiv   pre-print
In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate.  ...  We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from  ...  in D by importance sampling with an estimated behavior policy.  ... 
arXiv:1806.01347v3 fatcat:hoaz73rfd5hqlpgp7hin5epz24
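
The recipe described above can be sketched for the tabular case: estimate the behavior policy by maximum likelihood from the logged trajectories themselves, then run ordinary importance sampling with that estimate in the denominator. A minimal Python sketch, assuming discrete states and actions and a known target policy pi_e; it illustrates the idea rather than the authors' implementation:

    import numpy as np
    from collections import defaultdict

    def is_with_estimated_behavior_policy(trajectories, pi_e):
        """trajectories: list of [(s, a, r), ...]; pi_e(a, s) -> probability."""
        # 1. Count-based maximum-likelihood estimate of the behavior policy,
        #    computed from the same data used for the estimate below.
        sa_counts, s_counts = defaultdict(float), defaultdict(float)
        for tau in trajectories:
            for s, a, _ in tau:
                sa_counts[(s, a)] += 1.0
                s_counts[s] += 1.0
        pi_b_hat = lambda a, s: sa_counts[(s, a)] / s_counts[s]

        # 2. Ordinary importance sampling with the estimated denominator.
        estimates = []
        for tau in trajectories:
            weight, ret = 1.0, 0.0
            for s, a, r in tau:
                weight *= pi_e(a, s) / pi_b_hat(a, s)
                ret += r  # undiscounted return, for simplicity
            estimates.append(weight * ret)
        return np.mean(estimates)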

Data-Efficient Policy Evaluation Through Behavior Policy Search [article]

Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum
2017 arXiv   pre-print
We derive an analytic expression for the optimal behavior policy --- the behavior policy that minimizes the mean squared error of the resulting estimates.  ...  We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique  ...  Importance Sampling Importance Sampling is a method for reweighting returns from a behavior policy, θ, such that they are unbiased returns from the evaluation policy.  ... 
arXiv:1706.03469v1 fatcat:x3jiksf7gzac5fstknxata2n7m
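
For reference, the ordinary importance sampling estimator that this snippet refers to can be written, for n logged trajectories of horizon H (standard notation, not taken verbatim from the paper), as

    \hat{v}_{IS}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} g(\tau_i) \prod_{t=0}^{H-1} \frac{\pi_e(a_t^{(i)} \mid s_t^{(i)})}{\theta(a_t^{(i)} \mid s_t^{(i)})},

where g(\tau_i) is the return of trajectory \tau_i and \theta is the behavior policy; the estimator is unbiased for the expected return under \pi_e when \theta is known. The behavior policy search idea is then to choose \theta to minimize the mean squared error of this estimator rather than simply setting \theta = \pi_e.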

Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation [article]

Zhaohan Daniel Guo, Philip S. Thomas, Emma Brunskill
2017 arXiv   pre-print
We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems  ...  We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance  ...  Acknowledgements: The research reported here was supported in part by an ONR Young Investigator award, an NSF CAREER award, and by the Institute of Education Sciences, U.S. Department of Education.  ... 
arXiv:1703.03453v2 fatcat:ymxepyiinrfzbkp3hhwxhwmeae
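
The covariance rule mentioned above rests on the fact that each per-step likelihood ratio has conditional expectation 1, so a weight that is (nearly) uncorrelated with the rest of the weighted return can be dropped with little or no bias while reducing variance. A rough illustrative heuristic in that spirit, not the authors' algorithm; the threshold eps is an arbitrary parameter introduced here:

    import numpy as np

    def drop_negligible_weights(step_ratios, returns, eps=1e-3):
        """step_ratios: array (n, H) of per-step ratios pi_e/pi_b;
        returns: array (n,) of episode returns."""
        ratios = np.asarray(step_ratios, dtype=float).copy()
        returns = np.asarray(returns, dtype=float)
        H = ratios.shape[1]
        for t in range(H):
            rest = np.prod(np.delete(ratios, t, axis=1), axis=1) * returns
            cov = np.cov(ratios[:, t], rest)[0, 1]
            if abs(cov) < eps:
                # Weight t is (empirically) uncorrelated with the rest of the
                # weighted return, so drop it to cut variance.
                ratios[:, t] = 1.0
        weights = np.prod(ratios, axis=1)
        return np.mean(weights * returns)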

Importance sampling in reinforcement learning with an estimated behavior policy

Josiah P. Hanna, Scott Niekum, Peter Stone
2021 Machine Learning  
In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different  ...  Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy.  ...  sampling with an estimated behavior policy.  ... 
doi:10.1007/s10994-020-05938-9 fatcat:djw7yjw5gzec3ggb22xropu6du

Importance Sampling for Fair Policy Selection

Shayan Doroudi, Philip S. Thomas, Emma Brunskill
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
Finally, we provide a practical importance sampling-based estimator to help mitigate the unfairness due to varying trajectory lengths.  ...  We then give an example that shows importance sampling is systematically unfair in a practically relevant setting; namely, we show that it unreasonably favors shorter trajectory lengths.  ...  We showed that importance sampling is unfair when used for policy selection even though it is an unbiased estimator for policy evaluation.  ... 
doi:10.24963/ijcai.2018/729 dblp:conf/ijcai/DoroudiTB18 fatcat:urunwvz5brf3jehokounp4nb34
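
The length effect described above is easy to reproduce: a trajectory's importance weight is a product of per-step ratios, and even when each ratio has mean 1 under the behavior policy the product falls below any fixed threshold with probability that grows with the horizon, so long trajectories almost always receive near-zero weight in a finite sample. A toy simulation (illustrative only, not the paper's example):

    import numpy as np

    rng = np.random.default_rng(0)

    def fraction_of_tiny_weights(horizon, n=10_000):
        # Two actions, behavior policy uniform, target policy (0.8, 0.2):
        # each per-step ratio is 1.6 or 0.4 with equal probability (mean 1.0).
        ratios = rng.choice([1.6, 0.4], size=(n, horizon))
        weights = np.prod(ratios, axis=1)
        return np.mean(weights < 0.1)

    for h in (1, 5, 20, 50):
        print(h, fraction_of_tiny_weights(h))
    # The fraction of near-zero trajectory weights grows rapidly with horizon,
    # which is the mechanism behind IS favoring shorter trajectories.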

Case-based off-policy policy evaluation using prototype learning [article]

Anton Matsson, Fredrik D. Johansson
2021 arXiv   pre-print
Importance sampling (IS) is often used to perform off-policy policy evaluation but is prone to several issues, especially when the behavior policy is unknown and must be estimated from data.  ...  an accuracy comparable to baseline estimators.  ...  In this work, we study OPPE of sequential decision-making policies using importance sampling with an unknown behavior policy.  ... 
arXiv:2111.11113v1 fatcat:rnf2uiodmff7xeipkm76uvqgye

A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems [article]

Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, Esther Luna Colombini
2022 arXiv   pre-print
pixel observations, sustaining conversations with humans, and controlling robotic agents.  ...  Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets.  ...  Importance Sampling: With importance sampling, we first fit an estimate of the behavior policy π_β(a|s) using D_e.  ... 
arXiv:2203.01387v2 fatcat:euobvze7kre3fi7blalnbbgefm
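
The first step named in the snippet, fitting an estimate π_β(a|s) from the offline dataset, is ordinarily a supervised maximum-likelihood fit (behavior cloning). A minimal sketch using scikit-learn as a stand-in density model on synthetic data; all variable names here are ours:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-in for the offline dataset D_e: state features and logged discrete actions.
    states = rng.normal(size=(1000, 4))        # shape (N, d)
    actions = rng.integers(0, 3, size=1000)    # action ids in {0, 1, 2}

    # Maximum-likelihood estimate of the behavior policy (behavior cloning).
    bc = LogisticRegression(max_iter=1000).fit(states, actions)

    def pi_beta_hat(a, s):
        """Estimated behavior probability pi_beta(a | s)."""
        probs = bc.predict_proba(np.asarray(s).reshape(1, -1))[0]
        return probs[list(bc.classes_).index(a)]

    # A logged step (s, a) then contributes the ratio pi_e(a, s) / pi_beta_hat(a, s)
    # to the importance sampling estimate.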

Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies [article]

Xinyun Chen, Lu Wang, Yizhe Hang, Heng Ge, Hongyuan Zha
2019 arXiv   pre-print
We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies.  ...  With careful analysis, we show that EMP gives rise to estimates with reduced variance for estimating the state stationary distribution correction while it also offers a useful induction bias for estimating  ...  Importance Sampling Policy Evaluation Using Exact and Estimated Behavior Policy: As for short-horizon off-policy policy evaluation, importance sampling policy evaluation (IS) methods (Precup et al., 2001  ... 
arXiv:1910.04849v1 fatcat:opmzzszinbctbhw6gwo4hmcmaa
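
When the pooled trajectories come from several known behavior policies, a standard short-horizon IS baseline (the kind the snippet contrasts EMP with) treats each trajectory as a draw from the mixture of those policies and puts the mixture likelihood in the denominator; the shared dynamics terms cancel. A sketch of that baseline, not of the EMP estimator proposed in the paper:

    import numpy as np

    def mixture_is_estimate(trajectories, pi_e, behavior_policies, mix_weights):
        """trajectories: pooled list of [(s, a, r), ...];
        behavior_policies: list of functions pi_k(a, s);
        mix_weights: fraction of trajectories collected under each pi_k."""
        estimates = []
        for tau in trajectories:
            # Trajectory-level action likelihoods; transition dynamics cancel
            # because all policies act in the same environment.
            lik_e = np.prod([pi_e(a, s) for s, a, _ in tau])
            lik_ks = [np.prod([pi_k(a, s) for s, a, _ in tau])
                      for pi_k in behavior_policies]
            weight = lik_e / sum(w * lk for w, lk in zip(mix_weights, lik_ks))
            ret = sum(r for _, _, r in tau)
            estimates.append(weight * ret)
        return np.mean(estimates)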

Off-Policy Evaluation of the Performance of a Robot Swarm: Importance Sampling to Assess Potential Modifications to the Finite-State Machine That Controls the Robots

Federico Pagnozzi, Mauro Birattari
2021 Frontiers in Robotics and AI  
In this paper, we propose a technique based on off-policy evaluation to estimate how the performance of an instance of control software—implemented as a probabilistic finite-state machine—would be impacted  ...  To evaluate the technique, we apply it to control software generated with an AutoMoDe method, Chocolate−6S .  ...  Given a set of episodes E generated with policy b, there are two main ways of using ρ τ(s) to estimate v π (s): ordinary importance sampling and weighted importance sampling (WIS).  ... 
doi:10.3389/frobt.2021.625125 pmid:33996923 pmcid:PMC8117342 fatcat:an5pgp7a4naxzp2yevzit2xnqi
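
The two estimators named at the end of the snippet differ only in how the importance-weighted returns are normalized. Given one weight and one return per episode, they can be computed as follows (generic definitions, not the paper's code):

    import numpy as np

    def ordinary_is(weights, returns):
        """Unbiased, but the variance can be very large."""
        w, g = np.asarray(weights, dtype=float), np.asarray(returns, dtype=float)
        return np.mean(w * g)

    def weighted_is(weights, returns):
        """Biased but consistent; normalizing by the total weight
        typically reduces variance substantially."""
        w, g = np.asarray(weights, dtype=float), np.asarray(returns, dtype=float)
        return np.sum(w * g) / np.sum(w)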

High-Confidence Off-Policy Evaluation

Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh
2015 Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15)
Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the  ...  In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.  ...  Our approach is straightforward: for each trajectory, we use importance sampling to generate an importance weighted return, which is an unbiased estimate of the expected return of the evaluation policy  ... 
doi:10.1609/aaai.v29i1.9541 fatcat:liehfsaknvbbtbdihxduvtz6mi
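
The lower bound in the paper comes from concentration inequalities applied to the per-trajectory importance-weighted returns. As a much simpler stand-in that shows the shape of the computation, a one-sided Student-t bound on their mean can be used; this assumes approximate normality and is not the authors' bound:

    import numpy as np
    from scipy import stats

    def approximate_lower_bound(is_returns, delta=0.05):
        """1 - delta lower confidence bound on the mean of per-trajectory
        importance-weighted returns (one-sided Student-t approximation)."""
        x = np.asarray(is_returns, dtype=float)
        n = len(x)
        sem = x.std(ddof=1) / np.sqrt(n)
        return x.mean() - stats.t.ppf(1.0 - delta, df=n - 1) * sem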

Off-Policy Policy Gradient with Stationary Distribution Correction

Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
2019 Conference on Uncertainty in Artificial Intelligence  
Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization  ...  We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach.  ...  Acknowledgements: We acknowledge an NSF CAREER award, an ONR Young Investigator Award, and support from Siemens.  ... 
dblp:conf/uai/LiuSAB19 fatcat:oyaqya3m4bc6bhq72c4tpls3vm

Off-Policy Policy Gradient with State Distribution Correction [article]

Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
2019 arXiv   pre-print
Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization  ...  We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach.  ...  We then evaluate these policies using the off-policy policy evaluation (OPPE) method in Liu et al. [2018a] . The evaluation is performed with an additional dataset sampled from the behavior policy.  ... 
arXiv:1904.08473v2 fatcat:dynblg47ezekblujbtqekjpyda
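
The correction described above reweights each logged transition by an estimated state-distribution ratio w(s) = d_pi_e(s) / d_pi_b(s) in addition to the usual per-action likelihood ratio, and averages the resulting score-function terms. An illustrative numpy sketch, assuming w_hat comes from a separate density-ratio estimator and q_hat from a critic (names and interfaces are ours, not the authors'):

    import numpy as np

    def off_policy_policy_gradient(batch, w_hat, q_hat,
                                   pi_e_prob, pi_b_prob, grad_log_pi_e):
        """batch: iterable of logged (s, a) pairs from the behavior policy.
        w_hat(s): estimated state-distribution ratio d_pi_e(s) / d_pi_b(s).
        q_hat(s, a): critic estimate of the action value under pi_e.
        grad_log_pi_e(s, a): score function, array of shape (n_params,)."""
        terms = [w_hat(s)
                 * (pi_e_prob(a, s) / pi_b_prob(a, s))
                 * q_hat(s, a)
                 * grad_log_pi_e(s, a)
                 for s, a in batch]
        return np.mean(terms, axis=0)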

Stacked calibration of off-policy policy evaluation for video game matchmaking

Eric Thibodeau-Laufer, Raul Chandias Ferrari, Li Yao, Olivier Delalleau, Yoshua Bengio
2013 IEEE Conference on Computational Intelligence in Games (CIG)
We consider an industrial strength application of recommendation systems for video-game matchmaking in which off-policy policy evaluation is important but where standard approaches can hardly be applied  ...  Furthermore, we observe that when the estimated reward function and the policy are obtained by training from the same off-policy dataset, the policy evaluation using the estimated reward function is biased  ...  the importance sampling estimator becomes meaningless.  ... 
doi:10.1109/cig.2013.6633642 dblp:conf/cig/Thibodeau-LauferFYDB13 fatcat:g2qr6jdu2zcplfnxoikk3bv5m4

Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning [article]

Riashat Islam, Raihan Seraj, Samin Yeasar Arnob, Doina Precup
2019 arXiv   pre-print
Off-policy actor-critic algorithms require an off-policy critic evaluation step, to estimate the value of the new policy after every policy gradient update.  ...  We extend the doubly robust estimator from off-policy policy evaluation (OPE) to actor-critic algorithms that consist of a reward estimator performance model.  ...  Such an estimate of the behavior policy leads to a lower mean squared error for off policy evaluation compared with the true behavior policy.  ... 
arXiv:1912.05109v1 fatcat:plnttwxjrncz3fh25ig4h5q2a4
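
For context, the doubly robust OPE estimator that the abstract extends combines a learned value model with per-step importance ratios through a backward recursion over each trajectory. A minimal sketch of the standard per-trajectory form, assuming q_hat and v_hat come from a fitted model:

    def doubly_robust_return(trajectory, pi_e, pi_b, q_hat, v_hat, gamma=0.99):
        """trajectory: list of (s, a, r) tuples; pi_e(a, s) and pi_b(a, s) give
        action probabilities; q_hat(s, a) and v_hat(s) are model estimates."""
        dr = 0.0
        # At each step the model supplies a baseline, and the importance ratio
        # corrects the model's error on the action that was actually taken.
        for s, a, r in reversed(trajectory):
            rho = pi_e(a, s) / pi_b(a, s)
            dr = v_hat(s) + rho * (r + gamma * dr - q_hat(s, a))
        return dr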

Diverse Exploration for Fast and Safe Policy Improvement

Andrew Cohen, Lei Yu, Robert Wright
2018 Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)
We study an important yet under-addressed problem of quickly and safely improving policies in online reinforcement learning domains.  ...  We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation.  ...  HCOPE applies importance sampling (Precup, Sutton, and Singh 2000) to produce an unbiased estimator of ρ(π p ) from a trajectory generated by a behavior policy, π q .  ... 
doi:10.1609/aaai.v32i1.11758 fatcat:rab4rdrvajafhnek6x3kvd6nj4
Showing results 1-15 of 359,883.