A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is application/pdf
.
Debiasing Samples from Online Learning Using Bootstrap
[article]
2021
arXiv
pre-print
It has been recently shown in the literature that the sample averages from online learning experiments are biased when used to estimate the mean reward. To correct the bias, off-policy evaluation methods, including importance sampling and doubly robust estimators, typically calculate the conditional propensity score, which is ill-defined for non-randomized policies such as UCB. This paper provides a procedure to debias the samples using bootstrap, which doesn't require the knowledge of the
arXiv:2108.00236v2
fatcat:un6od643qrgmjky3xwifpguuii