94,328 Hits in 4.1 sec

IMO^3: Interactive Multi-Objective Off-Policy Optimization [article]

Nan Wang, Hongning Wang, Maryam Karimzadehgan, Branislav Kveton, Craig Boutilier
2022 arXiv   pre-print
We theoretically show that IMO^3 identifies a near-optimal policy with high probability, depending on the amount of feedback from the designer and training data for off-policy estimation.  ...  Most real-world optimization problems have multiple objectives. A system designer needs to find a policy that trades off these objectives to reach a desired operating point.  ...  Multi-Objective Off-Policy Evaluation and Optimization In this section, we discuss how to evaluate a policy π using logged data generated by another (say, production) policy, and optimize π w.r.t. any  ... 
arXiv:2201.09798v2 fatcat:rzstagf2cbhghmtuznopsvz434
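
A common way to formalize the objective trade-off described in this abstract is to scalarize per-objective off-policy value estimates with a designer-chosen weight vector. The notation below is generic and illustrative, not taken from the paper: \hat{V}_k(\pi) denotes an off-policy estimate of the k-th objective.

\hat{V}(\pi; w) = \sum_{k=1}^{K} w_k \, \hat{V}_k(\pi), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1

The designer's feedback can then be read as information about which weight vector w corresponds to the desired operating point.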

Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation [article]

Carolin Lawrence, Artem Sokolov, Stefan Riezler
2017 arXiv   pre-print
The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another  ...  We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning.  ...  The crucial trick to obtain unbiased estimators to evaluate and to optimize the off-policy system is to correct the sampling bias of the logging policy.  ... 
arXiv:1707.09118v3 fatcat:zfr4mwzo6je35bdx7tu5hdk364

Counterfactual Learning from Bandit Feedback under Deterministic Logging : A Case Study in Statistical Machine Translation

Carolin Lawrence, Artem Sokolov, Stefan Riezler
2017 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing  
The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another  ...  We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning.  ...  The crucial trick to obtain unbiased estimators to evaluate and to optimize the off-policy system is to correct the sampling bias of the logging policy.  ... 
doi:10.18653/v1/d17-1272 dblp:conf/emnlp/LawrenceSR17 fatcat:5v3ostiaujcydfif5zehaghsyq
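
The "crucial trick" both versions of this paper refer to is conventionally implemented as inverse propensity scoring (IPS): each logged reward is reweighted by the ratio of the target policy's probability of the logged action to the logging policy's propensity. The sketch below is a generic Python reference, not the authors' code; the log format (context, action, logging propensity, reward) and the helper target_prob are assumptions for illustration.

import numpy as np

def ips_estimate(logs, target_prob, clip=10.0):
    """Clipped IPS estimate of a target policy's value from logged bandit feedback.

    logs: iterable of (context, action, logging_propensity, reward) tuples.
    target_prob: function (context, action) -> probability under the target policy.
    clip: cap on the importance weight; one simple way to keep the estimator
          stable when logging propensities are small or (near-)deterministic.
    """
    values = []
    for x, a, p_log, r in logs:
        w = target_prob(x, a) / max(p_log, 1e-6)   # importance weight pi(a|x) / pi_0(a|x)
        values.append(min(w, clip) * r)            # clipped IPS term
    return float(np.mean(values))

Clipping trades bias for variance; the paper's actual remedy for deterministic logging differs in detail (it smooths out deterministic components of the system), so treat this only as a baseline for the unbiasedness argument.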

On Multi-objective Policy Optimization as a Tool for Reinforcement Learning [article]

Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell (+2 others)
2021 arXiv   pre-print
optimization step.  ...  that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives, or constraints, in the policy  ...  When we evaluate a trade-off-conditioned policy, we condition it on trade-offs linearly spaced from 0.05 to 1.0.  ... 
arXiv:2106.08199v1 fatcat:uqpsvp7u4ranvk2psymtf3ugse

Mirror Descent Policy Optimization [article]

Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
2021 arXiv   pre-print
Inspired by this, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO).  ...  We derive on-policy and off-policy variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL.  ...  We derive on-policy and off-policy variants of MDPO and perform a thorough empirical evaluation against multiple well established algorithms.  ... 
arXiv:2005.09814v5 fatcat:22paaybxjrbg3lardmh2dflg6i
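
For reference, the mirror-descent view mentioned in the snippet leads to a KL-regularized policy update of roughly the following form (schematic; the step size t_k and the on-policy versus off-policy instantiations are design choices the paper discusses, and the notation here is the standard one rather than a quote from the paper):

\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim \rho_{\pi_k}} \Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ A^{\pi_k}(s, a) \big] \;-\; \tfrac{1}{t_k} \, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big) \Big]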

Large-scale Validation of Counterfactual Learning Methods: A Test-Bed [article]

Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, Maarten de Rijke
2017 arXiv   pre-print
Recent approaches for off-policy evaluation and learning in these settings appear promising.  ...  This paper presents our test-bed, the sanity checks we ran to ensure its validity, and shows results comparing state-of-the-art off-policy learning methods like doubly robust optimization, POEM, and reductions  ...  Recent approaches for off-policy evaluation and learning in these settings appear promising [1, 2, 4] , but highlight the need for accurately logging propensities of the logged actions.  ... 
arXiv:1612.00367v2 fatcat:nsulmff7yvdhvgpikutypbmrgu
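
The doubly robust estimator named in the snippet combines a learned reward model \hat{q} with an IPS correction term; in the standard notation (not copied from the paper), with logging policy \pi_0 and logged tuples (x_i, a_i, r_i):

\hat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{E}_{a \sim \pi(\cdot \mid x_i)}\big[ \hat{q}(x_i, a) \big] + \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \big( r_i - \hat{q}(x_i, a_i) \big) \Big)

The estimator remains consistent if either the reward model or the logged propensities are accurate, which is why accurate propensity logging is emphasized in the snippet.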

OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching [article]

Hana Hoshino, Kei Ota, Asako Kanezaki, Rio Yokota
2022 arXiv   pre-print
However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance.  ...  To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts off-policy data distribution instead of on-policy and enables significant reduction of the number  ...  Off-Policy Learning from Observation Our approach takes its inspiration from OPOLO [17] which, similar to prior off-policy RL algorithms, removes all on-policy transitions and incorporates replay buffers  ... 
arXiv:2109.04307v2 fatcat:bebegsycufgwxktxrxa5prh7ty

Automatic Enforcement of Data Use Policies with DataLawyer

Prasang Upadhyaya, Magdalena Balazinska, Dan Suciu
2015 Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15  
You may show content from multiple providers, but Yelp data should stand on its own (Yelp [18]); disallow aggregations but allow joins and unions  ...  We introduce novel algorithms to efficiently evaluate policies that can cut policy-checking overheads to only a few percent of the total query runtime.  ...  For comparison, we provide the runtime with this optimization turned off.  ... 
doi:10.1145/2723372.2723721 dblp:conf/sigmod/UpadhyayaBS15 fatcat:eqbeaqdk2fbfzhmrhyaff5uvxu

Off-policy Learning over Heterogeneous Information for Recommendation

Xiangmeng Wang, Qian Li, Dianer Yu, Guandong Xu
2022 Proceedings of the ACM Web Conference 2022  
As a result, the policy learned from such off-line logged data tends to be biased from the true behaviour policy.  ...  Off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has been a popular research topic in reinforcement learning.  ...  This counterfactual question is not easy to address, since the target policy is different from the historical logging policy in the off-policy setting [51, 52].  ... 
doi:10.1145/3485447.3512072 fatcat:gxnqvtttjbexvgtufcfk4knvpy

Personalization for Web-based Services using Offline Reinforcement Learning [article]

Pavlos Athanasios Apostolopoulos, Zehui Wang, Hanson Wang, Chad Zhou, Kittipat Virochsiri, Norm Zhou, Igor L. Markov
2021 arXiv   pre-print
We address challenges of learning such policies through model-free offline Reinforcement Learning (RL) with off-policy training.  ...  We articulate practical challenges, compare several ML techniques, provide insights on training and evaluation of RL models, and discuss generalizations.  ...  Hence, policies are trained offline using logged interactions from any type of prior policies (Offline RL in Section 3).  ... 
arXiv:2102.05612v1 fatcat:sj6ba75lrrecpc7h7xn6e46e34

Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits [article]

Aaron David Tucker, Thorsten Joachims
2022 arXiv   pre-print
policy is very different from the target policy being evaluated.  ...  To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem  ...  Off-policy evaluation.  ... 
arXiv:2202.01721v1 fatcat:63e5kmj5njg2lkaxuwnujjyd4e
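
The motivating observation in the snippet, that counterfactual evaluation degrades when the logging policy is far from the target policy, can be read off the variance of the plain IPS estimator (textbook form, not MVAL's objective):

\mathrm{Var}\big[\hat{V}_{\mathrm{IPS}}(\pi)\big] \;=\; \frac{1}{n}\,\mathrm{Var}_{(x,a,r) \sim \pi_0}\!\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}\, r\right]

The importance weights \pi / \pi_0, and hence the variance, blow up whenever the target policy puts substantial probability on actions the logging policy rarely takes; per the abstract, MVAL constructs logging policies to minimize this downstream variance.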

Trust-PCL: An Off-Policy Trust Region Method for Continuous Control [article]

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
2018 arXiv   pre-print
Thus, Trust-PCL is able to maintain optimization stability while exploiting off-policy data to improve sample efficiency.  ...  To address this problem, we propose an off-policy trust region method, Trust-PCL.  ...  Trust-PCL is off-policy, so to evaluate its performance we alternate between collecting experience and training on batches of experience sampled from the replay buffer.  ... 
arXiv:1707.01891v3 fatcat:juu5x7ygdfbn7mvv7xagr3lzne

Policy Evaluation and Optimization with Continuous Treatments [article]

Nathan Kallus, Angela Zhou
2018 arXiv   pre-print
We study the problem of policy evaluation and learning from batched contextual bandit data when treatments are continuous, going beyond previous work on discrete treatments.  ...  Our policy estimator is consistent and we characterize the optimal bandwidth.  ...  Figure 4: Out-of-sample error of the empirically optimal policy from off-policy evaluation as n increases.  ... 
arXiv:1802.06037v1 fatcat:figmrszcefcgbo3zhsq44mpely
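
For continuous treatments, the discrete propensity ratio is typically replaced by a kernel-smoothed weight with bandwidth h, which is what the snippet's "optimal bandwidth" refers to. A generic form of such an estimator (illustrative notation, with K a kernel, f the treatment density under logging, and \pi(x) the treatment chosen by the target policy) is:

\hat{V}_h(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, K\!\left(\frac{\pi(x_i) - t_i}{h}\right) \frac{r_i}{f(t_i \mid x_i)}

The bandwidth h controls a bias-variance trade-off: small h reduces smoothing bias but inflates variance, and large h does the opposite.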

Trajectory-Based Off-Policy Deep Reinforcement Learning [article]

Andreas Doerr, Michael Volpp, Marc Toussaint, Sebastian Trimpe, Christian Daniel
2019 arXiv   pre-print
This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies.  ...  Incorporation of previous rollouts via importance sampling greatly improves data-efficiency, whilst stochastic optimization schemes facilitate the escape from local optima.  ...  Typically, Importance Sampling (IS) techniques are employed to evaluate a target policy based on rollouts obtained from behavioural policies (i.e. from off-policy samples).  ... 
arXiv:1905.05710v1 fatcat:6po2azo7yndsrjmh4ewcdnfmum
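
The importance-sampling evaluation mentioned in the snippet usually takes a trajectory-level form: each rollout collected under a behavioural policy \beta is reweighted by the product of per-step likelihood ratios (generic template; the paper additionally works with deterministic policies and parameter-space exploration, which changes the details):

w(\tau) = \prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}, \qquad \hat{J}(\pi_\theta) = \frac{1}{n} \sum_{i=1}^{n} w(\tau_i)\, R(\tau_i)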

Off-policy Learning for Remote Electrical Tilt Optimization [article]

Filippo Vannella, Jaeseong Jeong, Alexandre Proutiere
2020 arXiv   pre-print
We formulate the problem of devising such a policy using the off-policy CMAB framework. We propose CMAB learning algorithms to extract optimal tilt update policies from the data.  ...  We address the problem of Remote Electrical Tilt (RET) optimization using off-policy Contextual Multi-Armed-Bandit (CMAB) techniques.  ...  Specifically, we aim at learning an optimal policy from offline data collected by another policy, referred to as the logging policy.  ... 
arXiv:2005.10577v1 fatcat:7na2gn6cjnb7pe4fgaipleqegy
Showing results 1 — 15 out of 94,328 results