100,581 Hits in 3.5 sec

Supervised Off-Policy Ranking [article]

Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, Tie-Yan Liu
2022 arXiv   pre-print
Inspired by the two observations, in this work, we study a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of target policies based on supervised learning by leveraging off-policy  ...  Off-policy evaluation (OPE) is to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy.  ...  Supervised Off-Policy Evaluation/Ranking In this section, we first give some notations and then formally describe the problems of supervised off-policy evaluation and supervised off-policy ranking.  ... 
arXiv:2107.01360v2 fatcat:h5qriluafjdavaiyapgy63igee

Ranking Policy Gradient [article]

Kaixiang Lin, Jiayu Zhou
2019 arXiv   pre-print
Towards the sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions.  ...  These results lead to a general off-policy learning framework, which preserves the optimality, reduces variance, and improves the sample-efficiency.  ...  Ablation Study The effectiveness of pairwise ranking policy and off-policy learning as supervised learning.  ... 
arXiv:1906.09674v3 fatcat:h33pqii4nnec3lavzrbfjp6iry
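The pairwise ranking idea mentioned in the abstract above can be illustrated with a small, self-contained sketch. This is not the paper's implementation: the linear scorer, hyperparameters, and toy data below are assumptions, and the loss simply pushes the score of a logged (near-)optimal action above the scores of every other discrete action, which is the flavor of objective a ranking-based policy method builds on.

    # Minimal illustrative sketch (not the authors' code): a pairwise ranking
    # loss over discrete action scores.
    import numpy as np

    def pairwise_ranking_loss(scores, best_action):
        """Sum of logistic pairwise losses pushing the chosen action's score
        above every other action's score."""
        others = np.delete(scores, best_action)
        diffs = scores[best_action] - others
        return float(np.sum(np.log1p(np.exp(-diffs))))

    # Toy usage: 4 discrete actions, with action 2 treated as the logged
    # (near-)optimal action for this state.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8))      # one row of scorer weights per action
    state = rng.normal(size=8)
    scores = W @ state               # one score per action
    print(pairwise_ranking_loss(scores, best_action=2))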

Towards Off-Policy Learning for Ranking Policies with Logged Feedback

Teng Xiao, Suhang Wang
2022 Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence  
In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in  ...  We theoretically and empirically show that the EM process guides the learned policy to enjoy the benefit of integrating the future reward and the ranking metric, and to learn without any online interactions  ...  Off-Policy Value Ranking In this section, we propose an EM-style algorithm to approximate the posterior, which results in our off-policy value ranking algorithm.  ... 
doi:10.1609/aaai.v36i8.20849 fatcat:yv6xq6blgzgj7eyp6mziaukvfe

Page 88 of Educational Research Bulletin Vol. 27, Issue 4 [page]

1948 Educational Research Bulletin  
In only 9 of the 29 institutions using off-campus laboratory schools did the principal hold academic rank.  ...  Fifty-three per cent of the off-campus teachers were paid for their supervision.  ... 

Self-Supervised Reinforcement Learning for Recommender Systems [article]

Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose
2020 arXiv   pre-print
As a result, learning the policy from logged implicit feedback is of vital importance, which is challenging due to the pure off-policy setting and lack of negative rewards (feedback).  ...  Based on such an approach, we propose two frameworks namely Self-Supervised Q-learning(SQN) and Self-Supervised Actor-Critic(SAC).  ...  But their method doesn't address the off-policy problem.  ... 
arXiv:2006.05779v2 fatcat:azyxfyn7m5ds3b75oyohqkexze

Supervised Advantage Actor-Critic for Recommender Systems [article]

Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose
2021 arXiv   pre-print
However, the direct use of RL algorithms in the RS setting is impractical due to challenges like off-policy training, huge action spaces and lack of sufficient reward signals.  ...  We call this method Supervised Negative Q-learning (SNQN).  ...  [1] proposed to calculate a propensity score to perform off-policy correction for off-policy learning.  ... 
arXiv:2111.03474v1 fatcat:yyus7232sjecjhxzvyyj42pkou

A Sample-Efficient Actor-Critic Algorithm for Recommendation Diversification

Shuang Li, Yanghui Yan, Ju Ren, Yuezhi Zhou, Yaoxue Zhang
2020 Chinese journal of electronics  
To further stabilize and improve the performance, we also add policy-filtered critic supervision loss.  ...  The actor acts as the ranking policy, while the introduced critic predicts the expected future rewards of each candidate action.  ...  For A3C-GAE and AC-QSA, we use the same LSTM network to model the ranking policy and critic network just like our methods. For the supervised methods, i.e.  ... 
doi:10.1049/cje.2019.10.004 fatcat:uehoqrz4lnbpboezjua6i7o47u

Work-life Balance by Area, Actual Situation and Expectations – the Overlapping Opinions of Employers and Employees in Slovenia

Tatjana Kozjek, Nina Tomaževič, Janez Stare
2014 Organizacija  
Results: The results of our research show that Slovenian organisations must pay more attention to flexible working time, the employees' ability to take time off to care for family members, time and stress  ...  test and assigned rank.  ...  Ranks were assigned in such a way that the area with the highest mean and median assessment of WLB was given a rank of 1.  ... 
doi:10.2478/orga-2014-0004 fatcat:e24uiodrtzaxxdahphofwxkynm

DRL4IR: 3rd Workshop on Deep Reinforcement Learning for Information Retrieval

Xiangyu Zhao, Xin Xin, Weinan Zhang, Li Zhao, Dawei Yin, Grace Hui Yang
2022 Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval  
and re-ranking, etc.  ...  Based on the ranked information list, users can provide their feedback.  ...  As a result, the training of DRL-based IR policies or value functions can only be off-policy or offline [24].  ... 
doi:10.1145/3477495.3531703 fatcat:5gmafvsikrb7njqhke4x2kmfou

Large-scale Validation of Counterfactual Learning Methods: A Test-Bed [article]

Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, Maarten de Rijke
2017 arXiv   pre-print
Our results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.  ...  Recent approaches for off-policy evaluation and learning in these settings appear promising.  ...  The policy π behaves like a uniformly random ranking policy with probability , and with probability 1 − , behaves like the logging policy.  ... 
arXiv:1612.00367v2 fatcat:nsulmff7yvdhvgpikutypbmrgu

Model Selection for Offline Reinforcement Learning: Practical Considerations for Healthcare Settings [article]

Shengpu Tang, Jenna Wiens
2021 arXiv   pre-print
In this work, we investigate a model selection pipeline for offline RL that relies on off-policy evaluation (OPE) as a proxy for validation performance.  ...  To balance this trade-off between accuracy of ranking and computational efficiency, we propose a simple two-stage approach to accelerate model selection by avoiding potentially unnecessary computation.  ...  The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the National Science  ... 
arXiv:2107.11003v1 fatcat:lgaccst6vrc3reshwqa7sqao2m
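The two-stage idea described in the abstract above, ranking all candidate policies with a cheap off-policy estimate and re-scoring only a shortlist with a more expensive one, can be sketched in a few lines. The function names, the candidate list, and the stand-in scoring functions below are assumptions made for illustration, not the authors' pipeline.

    # Illustrative two-stage model selection: a cheap OPE score ranks every
    # candidate policy, and an expensive OPE re-scores only the top few.
    def two_stage_selection(candidates, cheap_ope, expensive_ope, top_k=5):
        shortlist = sorted(candidates, key=cheap_ope, reverse=True)[:top_k]
        return max(shortlist, key=expensive_ope)

    # Toy usage with stand-in scoring functions (placeholders, not real OPE).
    candidates = [f"policy_{i}" for i in range(20)]
    cheap_score = lambda p: hash(p) % 100            # placeholder cheap estimate
    accurate_score = lambda p: (hash(p) * 31) % 97   # placeholder costly estimate
    print(two_stage_selection(candidates, cheap_score, accurate_score))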

Off-policy evaluation for slate recommendation [article]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni
2017 arXiv   pre-print
This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation.  ...  A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance  ...  We then use our estimator for off-policy optimization, i.e., to learn ranking policies, competitively with supervised learning that uses more information.  ... 
arXiv:1605.04812v3 fatcat:ldjoyer6c5dd5fydduexm5c27i
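As background for the entry above, the basic inverse propensity scoring (IPS) estimator for off-policy evaluation from logged bandit feedback fits in a few lines. This is the generic textbook estimator, not the slate-specific pseudoinverse estimator the paper proposes; the variable names and toy numbers are assumptions.

    # Generic IPS estimator: reweight each logged reward by the ratio of the
    # target policy's and the logging policy's probability of the logged action.
    import numpy as np

    def ips_estimate(rewards, target_probs, logging_probs):
        weights = target_probs / logging_probs
        return float(np.mean(weights * rewards))

    # Toy logged data, made up for illustration.
    rewards = np.array([1.0, 0.0, 1.0, 1.0])
    logging_probs = np.array([0.50, 0.25, 0.25, 0.50])  # logging policy probs
    target_probs = np.array([0.60, 0.10, 0.40, 0.70])   # target policy probs
    print(ips_estimate(rewards, target_probs, logging_probs))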

Modelling MTPL insurance claim events: Can machine learning methods overperform the traditional GLM approach?

Dávid Burka, László Kovács, László Szepesváry
2021 Hungarian Statistical Review  
We define cut-off values in such a way that the prediction Ŷ = 1 (claim) is set for the policy whenever the estimated claim probability P reaches the cut-off.  ...  in a given year, then we can easily fit a supervised learning model to this target variable based on some features of the policies.  ... 
doi:10.35618/hsr2021.02.en034 fatcat:oylxwv6xxrbbfdwnhukrlp2gn4

A Deep Reinforcement Learning Chatbot [article]

Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Rajeshwar, Alexandre de Brebisson (+5 others)
2017 arXiv   pre-print
We evaluate the policies Supervised AMT, Off-policy REINFORCE and Q-learning AMT.  ...  We tested six dialogue manager policies: Evibot + Alicebot, Supervised AMT, Supervised Learned Reward, Off-policy REINFORCE, Off-policy REINFORCE Learned Reward and Q-learning AMT.  ... 
arXiv:1709.02349v2 fatcat:ocymhb6py5cyjpubii2kiwc7u4

CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning [article]

Yi Su and Lequn Wang and Michele Santacatterina and Thorsten Joachims
2019 arXiv   pre-print
Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy.  ...  The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines  ...  In particular, unlike in supervised learning, the counterfactual estimator can have vastly different bias and variance for different policies in Π, such that trading off bias and variance of the estimator  ... 
arXiv:1811.02672v4 fatcat:3yczwbrmfzgoxlje43nxaqvuny
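The bias and variance trade-off mentioned in the abstract above is commonly handled by clipping the propensity weights: clipping lowers variance at the cost of added bias. The sketch below shows that generic clipped-IPS baseline, not the CAB estimator itself, which blends estimates more adaptively; names and toy data are assumptions.

    # Generic clipped IPS (not CAB itself): importance weights are capped at a
    # threshold, trading extra bias for lower variance. Toy data is made up.
    import numpy as np

    def clipped_ips_estimate(rewards, target_probs, logging_probs, clip=10.0):
        weights = np.minimum(target_probs / logging_probs, clip)
        return float(np.mean(weights * rewards))

    rewards = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
    logging_probs = np.array([0.01, 0.50, 0.20, 0.40, 0.05])
    target_probs = np.array([0.60, 0.40, 0.30, 0.10, 0.50])
    # The first weight (0.6 / 0.01 = 60) is clipped to 10, damping its variance.
    print(clipped_ips_estimate(rewards, target_probs, logging_probs, clip=10.0))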
Showing results 1 — 15 out of 100,581 results