Importance Sampling Policy Evaluation with an Estimated Behavior Policy
[article]
2019
arXiv
pre-print
In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. ...
We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from ...
in D by importance sampling with an estimated behavior policy. ...
arXiv:1806.01347v3
fatcat:hoaz73rfd5hqlpgp7hin5epz24
Data-Efficient Policy Evaluation Through Behavior Policy Search
[article]
2017
arXiv
pre-print
We derive an analytic expression for the optimal behavior policy --- the behavior policy that minimizes the mean squared error of the resulting estimates. ...
We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique ...
Importance Sampling is a method for reweighting returns from a behavior policy, θ, such that they are unbiased returns from the evaluation policy. ... (a minimal sketch of this reweighting follows this entry)
arXiv:1706.03469v1
fatcat:x3jiksf7gzac5fstknxata2n7m
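The snippet above describes importance sampling as reweighting returns from a behavior policy so that they become unbiased estimates for the evaluation policy. A minimal statement of the standard trajectory-wise estimator, written in generic notation rather than this paper's own, is

\hat{v}_{\mathrm{IS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=0}^{T_i - 1} \frac{\pi_e(a_t^{(i)} \mid s_t^{(i)})}{\pi_b(a_t^{(i)} \mid s_t^{(i)})} \right) G^{(i)},

where G^{(i)} is the return of the i-th trajectory collected under the behavior policy \pi_b. The estimate is unbiased as long as \pi_b(a \mid s) > 0 wherever \pi_e(a \mid s) > 0.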
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
[article]
2017
arXiv
pre-print
We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems ...
We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance ...
Acknowledgements The research reported here was supported in part by an ONR Young Investigator award, an NSF CAREER award, and by the Institute of Education Sciences, U.S. Department of Education. ...
arXiv:1703.03453v2
fatcat:ymxepyiinrfzbkp3hhwxhwmeae
Importance sampling in reinforcement learning with an estimated behavior policy
2021
Machine Learning
In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different ...
Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. ...
sampling with an estimated behavior policy. ...
doi:10.1007/s10994-020-05938-9
fatcat:djw7yjw5gzec3ggb22xropu6du
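The abstract above explains that importance sampling needs the likelihood ratio between the target policy's action probabilities and those of the data-producing behavior policy, and that the behavior policy can itself be estimated from the logged data. The sketch below illustrates that idea in a tabular setting, fitting a count-based maximum-likelihood estimate to the same trajectories used for evaluation; the function and variable names are illustrative assumptions, not the paper's code.

from collections import defaultdict

def is_with_estimated_behavior(trajectories, pi_e):
    """Trajectory-wise importance sampling where the behavior policy is
    replaced by a count-based estimate fit to the same logged data.

    trajectories: list of episodes, each a list of (state, action, reward).
    pi_e: dict-like, pi_e[(s, a)] = evaluation policy's probability of a in s.
    """
    # Estimate the behavior policy by empirical action frequencies.
    sa_counts = defaultdict(int)
    s_counts = defaultdict(int)
    for episode in trajectories:
        for s, a, _ in episode:
            sa_counts[(s, a)] += 1
            s_counts[s] += 1
    pi_b_hat = {sa: c / s_counts[sa[0]] for sa, c in sa_counts.items()}

    # Reweight each episode's return by the likelihood ratio between the
    # evaluation policy and the estimated behavior policy.
    estimates = []
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in episode:
            weight *= pi_e[(s, a)] / pi_b_hat[(s, a)]
            ret += r  # undiscounted return, for simplicity
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)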
Importance Sampling for Fair Policy Selection
2018
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Finally, we provide a practical importance sampling-based estimator to help mitigate the unfairness due to varying trajectory lengths. ...
We then give an example that shows importance sampling is systematically unfair in a practically relevant setting; namely, we show that it unreasonably favors shorter trajectory lengths. ...
We showed that importance sampling is unfair when used for policy selection even though it is an unbiased estimator for policy evaluation. ...
doi:10.24963/ijcai.2018/729
dblp:conf/ijcai/DoroudiTB18
fatcat:urunwvz5brf3jehokounp4nb34
Case-based off-policy policy evaluation using prototype learning
[article]
2021
arXiv
pre-print
Importance sampling (IS) is often used to perform off-policy policy evaluation but is prone to several issues, especially when the behavior policy is unknown and must be estimated from data. ...
an accuracy comparable to baseline estimators. ...
In this work, we study OPPE of sequential decision-making policies using importance sampling with an unknown behavior policy. ...
arXiv:2111.11113v1
fatcat:rnf2uiodmff7xeipkm76uvqgye
A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems
[article]
2022
arXiv
pre-print
pixel observations, sustaining conversations with humans, and controlling robotic agents. ...
Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. ...
With importance sampling, we first fit an estimate of the behavior policy π_β(a|s) using D_e. ...
arXiv:2203.01387v2
fatcat:euobvze7kre3fi7blalnbbgefm
Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies
[article]
2019
arXiv
pre-print
We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies. ...
With careful analysis, we show that EMP gives rise to estimates with reduced variance for estimating the state stationary distribution correction while it also offers a useful inductive bias for estimating ...
Importance Sampling Policy Evaluation Using Exact and Estimated Behavior Policy: As for short-horizon off-policy policy evaluation, importance sampling policy evaluation (IS) methods (Precup et al., 2001 ...
arXiv:1910.04849v1
fatcat:opmzzszinbctbhw6gwo4hmcmaa
Off-Policy Evaluation of the Performance of a Robot Swarm: Importance Sampling to Assess Potential Modifications to the Finite-State Machine That Controls the Robots
2021
Frontiers in Robotics and AI
In this paper, we propose a technique based on off-policy evaluation to estimate how the performance of an instance of control software—implemented as a probabilistic finite-state machine—would be impacted ...
To evaluate the technique, we apply it to control software generated with an AutoMoDe method, Chocolate−6S . ...
Given a set of episodes E generated with policy b, there are two main ways of using ρ_τ(s) to estimate v_π(s): ordinary importance sampling and weighted importance sampling (WIS). ... (a brief sketch contrasting the two follows this entry)
doi:10.3389/frobt.2021.625125
pmid:33996923
pmcid:PMC8117342
fatcat:an5pgp7a4naxzp2yevzit2xnqi
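The snippet above distinguishes ordinary and weighted importance sampling. A generic sketch of the two estimators, given per-episode importance weights and returns (not the paper's code), follows.

import numpy as np

def ordinary_and_weighted_is(weights, returns):
    """Compute both estimators from per-episode importance weights and returns."""
    weights = np.asarray(weights, dtype=float)
    returns = np.asarray(returns, dtype=float)
    ois = np.mean(weights * returns)                    # unbiased, but variance can be large
    wis = np.sum(weights * returns) / np.sum(weights)   # biased, typically lower variance
    return ois, wis

WIS normalizes by the sum of the weights, trading a small bias for substantially lower variance when the weights are heavy-tailed.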
High-Confidence Off-Policy Evaluation
2015
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the ...
In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy. ...
Our approach is straightforward: for each trajectory, we use importance sampling to generate an importance weighted return, which is an unbiased estimate of the expected return of the evaluation policy ... (a simplified confidence-bound sketch follows this entry)
doi:10.1609/aaai.v29i1.9541
fatcat:liehfsaknvbbtbdihxduvtz6mi
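The abstract above describes computing a lower confidence bound on the evaluation policy's expected return from importance-weighted returns. The sketch below shows the overall recipe using Hoeffding's inequality for returns assumed bounded in [0, b_max]; that boundedness assumption and the choice of inequality are illustrative stand-ins, since the paper develops tighter concentration inequalities suited to importance-weighted returns.

import math

def is_lower_confidence_bound(weighted_returns, delta, b_max):
    """1 - delta lower bound on the expected return, from i.i.d.
    importance-weighted returns assumed to lie in [0, b_max]
    (Hoeffding-based illustration, not the paper's bound)."""
    n = len(weighted_returns)
    mean = sum(weighted_returns) / n
    return mean - b_max * math.sqrt(math.log(1.0 / delta) / (2.0 * n))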
Off-Policy Policy Gradient with Stationary Distribution Correction
2019
Conference on Uncertainty in Artificial Intelligence
Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization ...
We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. ...
Acknowledgements We acknowledge a NSF CAREER award, an ONR Young Investigator Award, and support from Siemens. ...
dblp:conf/uai/LiuSAB19
fatcat:oyaqya3m4bc6bhq72c4tpls3vm
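The abstract above refers to estimating the ratio of the state distributions under the behavior and evaluation policies. In rough, generic notation (a sketch of the general idea, not this paper's exact objective), such corrections reweight individual transitions rather than entire trajectories:

\hat{v}(\pi_e) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{d_{\pi_e}(s_i)}{d_{\pi_b}(s_i)} \cdot \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)} \, r_i,

where d_\pi denotes the stationary (or discounted) state distribution induced by \pi; estimating the state-distribution ratio avoids the product of per-step ratios whose variance grows with the horizon.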
Off-Policy Policy Gradient with State Distribution Correction
[article]
2019
arXiv
pre-print
Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization ...
We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. ...
We then evaluate these policies using the off-policy policy evaluation (OPPE) method in Liu et al. [2018a]. The evaluation is performed with an additional dataset sampled from the behavior policy. ...
arXiv:1904.08473v2
fatcat:dynblg47ezekblujbtqekjpyda
Stacked calibration of off-policy policy evaluation for video game matchmaking
2013
2013 IEEE Conference on Computational Intelligence in Games (CIG)
We consider an industrial strength application of recommendation systems for video-game matchmaking in which off-policy policy evaluation is important but where standard approaches can hardly be applied ...
Furthermore, we observe that when the estimated reward function and the policy are obtained by training from the same off-policy dataset, the policy evaluation using the estimated reward function is biased ...
the importance sampling estimator becomes meaningless. ...
doi:10.1109/cig.2013.6633642
dblp:conf/cig/Thibodeau-LauferFYDB13
fatcat:g2qr6jdu2zcplfnxoikk3bv5m4
Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning
[article]
2019
arXiv
pre-print
Off-policy actor-critic algorithms require an off-policy critic evaluation step, to estimate the value of the new policy after every policy gradient update. ...
We extend the doubly robust estimator from off-policy policy evaluation (OPE) to actor-critic algorithms that consist of a reward estimator performance model. ...
Such an estimate of the behavior policy leads to a lower mean squared error for off policy evaluation compared with the true behavior policy. ...
arXiv:1912.05109v1
fatcat:plnttwxjrncz3fh25ig4h5q2a4
Diverse Exploration for Fast and Safe Policy Improvement
2018
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
We study an important yet under-addressed problem of quickly and safely improving policies in online reinforcement learning domains. ...
We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation. ...
HCOPE applies importance sampling (Precup, Sutton, and Singh 2000) to produce an unbiased estimator of ρ(π_p) from a trajectory generated by a behavior policy, π_q. ...
doi:10.1609/aaai.v32i1.11758
fatcat:rab4rdrvajafhnek6x3kvd6nj4
Showing results 1 — 15 out of 359,883 results