4,297 Hits in 5.1 sec

Batch Policy Gradient Methods for Improving Neural Conversation Models [article]

Kirthevasan Kandasamy, Yoram Bachrach, Ryota Tomioka, Daniel Tarlow, David Carter
2017 arXiv   pre-print
We demonstrate empirically that such strategies are not appropriate for this setting and develop an off-policy batch policy gradient method (BPG).  ...  Previous reinforcement learning work for natural language processing uses on-policy updates and/or is designed for on-line learning settings.  ...  ACKNOWLEDGEMENTS We would like to thank Christoph Dann for the helpful conversations and Michael Armstrong for helping us with the Amazon Mechanical Turk experiments.  ... 
arXiv:1702.03334v1 fatcat:ay4lppskufbf3mg4p7xo7s4q6q

Learning to Expand: Reinforced Pseudo-relevance Feedback Selection for Information-seeking Conversations [article]

Haojie Pan, Cen Chen, Minghui Qiu, Liu Yang, Feng Ji, Jun Huang, Haiqing Chen
2020 arXiv   pre-print
Intelligent personal assistant systems for information-seeking conversations are increasingly popular in real-world applications, especially for e-commerce companies.  ...  We have also deployed our method in an online production system at an e-commerce company, which shows a significant improvement over the existing online ranking system.  ...  The policy network can be updated by the gradient as follows: Θ ← Θ + η · (1/B) ∑_{k=1}^{B} r_k ∇_Θ log π_Θ(S_k). (11) Here, η is the learning rate and B is the batch size.  ... 
arXiv:2011.12771v1 fatcat:frpxvpshr5bb3jksiqrkv7lzku
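The update in Eq. (11) above is a standard batch REINFORCE step. The following is a minimal sketch of such an update in PyTorch, assuming a toy discrete-action policy; the state size, action count, and reward values are invented placeholders, not the paper's setup.

    # Batch policy-gradient update: Theta <- Theta + eta * (1/B) * sum_k r_k * grad log pi_Theta(S_k)
    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))  # pi_Theta
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)              # eta

    states = torch.randn(16, 8)           # batch of B=16 states S_k
    actions = torch.randint(0, 4, (16,))  # actions sampled earlier from pi_Theta
    rewards = torch.randn(16)             # scalar feedback r_k per sample

    log_probs = torch.log_softmax(policy(states), dim=-1)
    log_pi = log_probs[torch.arange(16), actions]   # log pi_Theta(a_k | S_k)
    loss = -(rewards * log_pi).mean()               # minimizing this ascends Eq. (11)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()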

Model-Ensemble Trust-Region Policy Optimization [article]

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel
2018 arXiv   pre-print
In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy  ...  Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark  ...  ACKNOWLEDGEMENT The authors thank Stuart Russell, Abishek Gupta, Carlos Florensa, Anusha Nagabandi, Haoran Tang, and Gregory Kahn for helpful discussions and feedback. T.  ... 
arXiv:1802.10592v2 fatcat:2p2vevibdraf3ehcpkqjgnunni

Learning from Easy to Complex: Adaptive Multi-curricula Learning for Neural Dialogue Generation [article]

Hengyi Cai, Hongshen Chen, Cheng Zhang, Yonghao Song, Xiaofang Zhao, Yangxi Li, Dongsheng Duan, Dawei Yin
2020 arXiv   pre-print
The noise and uneven complexity of query-response pairs impede the learning efficiency and effects of the neural dialogue generation models.  ...  generation model.  ...  Acknowledgments This work is supported by the National Natural Science Foundation of China-Joint Fund for Basic Research of General Technology under Grant U1836111 and U1736106.  ... 
arXiv:2003.00639v2 fatcat:dmsquwgmefe6ngrijzlbdl6q5u

Human-centric Dialog Training via Offline Reinforcement Learning [article]

Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Shane Gu, Rosalind Picard
2020 arXiv   pre-print
The novel offline RL method is viable for improving any existing generative dialog model using a static dataset of human feedback.  ...  We start by hosting models online, and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (RL).  ...  part of the model can improve conversation quality (See et al., 2019; Mehri and Eskenazi, 2020).  ... 
arXiv:2010.05848v1 fatcat:fxelzo2gubahrfjvk34jdwthfi

Learning from Easy to Complex: Adaptive Multi-Curricula Learning for Neural Dialogue Generation

Hengyi Cai, Hongshen Chen, Cheng Zhang, Yonghao Song, Xiaofang Zhao, Yangxi Li, Dongsheng Duan, Dawei Yin
2020 Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)  
The noise and uneven complexity of query-response pairs impede the learning efficiency and effects of the neural dialogue generation models.  ...  generation model.  ...  Acknowledgments This work is supported by the National Natural Science Foundation of China-Joint Fund for Basic Research of General Technology under Grant U1836111 and U1736106.  ... 
doi:10.1609/aaai.v34i05.6244 fatcat:f5wq42cuereglej4yv2h2oqixm

Sample-efficient Deep Reinforcement Learning for Dialog Control [article]

Kavosh Asadi, Jason D. Williams
2016 arXiv   pre-print
For RL, a policy gradient approach is natural, but is sample inefficient.  ...  On two tasks, these methods reduce the number of dialogs/episodes required by about a third, vs. standard policy gradient methods.  ...  Method 3: Experience replay for policy network Our third method improves over the second method by applying experience replay to the policy network.  ... 
arXiv:1612.06000v1 fatcat:hvypp6lirzfghdbrws23nrv53e
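The third method mentioned in the snippet applies experience replay to the policy network. One common way to make replayed (and therefore off-policy) samples usable in a policy-gradient update is to store the behaviour policy's action probability and reweight each sample by π_θ(a|s)/μ(a|s). The sketch below illustrates that general idea with made-up shapes and data; it is not the paper's implementation.

    # Hypothetical replay buffer of (state, action, return, behaviour_prob) tuples,
    # replayed with importance weights so the policy-gradient estimate remains usable off-policy.
    import random
    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    replay_buffer = []

    def store(state, action, ret, behaviour_prob):
        replay_buffer.append((state, action, ret, behaviour_prob))

    def replay_update(batch_size=8):
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        states = torch.stack([b[0] for b in batch])
        actions = torch.tensor([b[1] for b in batch])
        returns = torch.tensor([b[2] for b in batch])
        mu = torch.tensor([b[3] for b in batch])
        probs = torch.softmax(policy(states), dim=-1)
        pi = probs[torch.arange(len(batch)), actions]
        weights = (pi / mu).detach()                  # importance weights, held constant
        loss = -(weights * returns * torch.log(pi)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Log a few fictitious dialog turns, then replay them.
    for _ in range(32):
        store(torch.randn(8), random.randrange(4), random.random(), 0.25)
    replay_update()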

Online Hyper-parameter Learning for Auto-Augmentation Strategy [article]

Chen Lin, Minghao Guo, Chuming Li, Yuan Xin, Wei Wu, Dahua Lin, Wanli Ouyang, Junjie Yan
2019 arXiv   pre-print
Our proposed OHL-Auto-Aug eliminates the need for re-training and dramatically reduces the cost of the overall search process, while establishing significant accuracy improvements over baseline models.  ...  In this paper, we propose Online Hyper-parameter Learning for Auto-Augmentation (OHL-Auto-Aug), an economical solution that learns the augmentation policy distribution along with network training.  ...  For a fair comparison, we compute the total training iterations converted to a common batch size of 1024 and denote this as '#Iterations'.  ... 
arXiv:1905.07373v2 fatcat:tknjbqyk6bhtfikdmecx7uxjqa

Online Hyper-Parameter Learning for Auto-Augmentation Strategy

Chen Lin, Minghao Guo, Chuming Li, Xin Yuan, Wei Wu, Junjie Yan, Dahua Lin, Wanli Ouyang
2019 IEEE/CVF International Conference on Computer Vision (ICCV)  
Our proposed OHL-Auto-Aug eliminates the need for re-training and dramatically reduces the cost of the overall search process, while establishing significant accuracy improvements over baseline models.  ...  In this paper, we propose Online Hyper-parameter Learning for Auto-Augmentation (OHL-Auto-Aug), an economical solution that learns the augmentation policy distribution along with network training.  ...  For a fair comparison, we compute the total training iterations converted to a common batch size of 1024 and denote this as '#Iterations'.  ... 
doi:10.1109/iccv.2019.00668 dblp:conf/iccv/LinGLYWYLO19 fatcat:phqk7plgf5h45cmini2et47mqi

Domain Transfer in Dialogue Systems without Turn-Level Supervision [article]

Joachim Bingel, Victor Petrén Bach Hansen, Ana Valeria Gonzalez, Paweł Budzianowski, Isabelle Augenstein, Anders Søgaard
2019 arXiv   pre-print
We also show our method can improve models trained using turn-level supervision by subsequent fine-tuning optimization toward dialog-level rewards.  ...  To address these limitations, we propose a method, based on reinforcement learning, for transferring DST models to new domains without turn-level supervision.  ...  When applying policy gradient methods in practice, larger batch sizes have been shown to lead to more accurate policy updates (Papini et al., 2017), but due to the relatively small training sets we found  ... 
arXiv:1909.07101v1 fatcat:logglgkwfrd7hazddpbrs26lhm

Say What I Want: Towards the Dark Side of Neural Dialogue Models [article]

Haochen Liu, Tyler Derr, Zitao Liu, Jiliang Tang
2019 arXiv   pre-print
Neural dialogue models have been widely adopted in various chatbot applications because of their good performance in simulating and generalizing human conversations.  ...  However, there exists a dark side of these models -- due to the vulnerability of neural networks, a neural dialogue model can be manipulated by users to say what they want, which brings in concerns about  ...  Therefore, previous works have proposed many methods to estimate it and its gradient, which is then used to update the parameters θ of the policy (i.e., π_θ).  ... 
arXiv:1909.06044v3 fatcat:lbnzfin3knazrg42wfcsrm73jy

The UMD Neural Machine Translation Systems at WMT17 Bandit Learning Task [article]

Amr Sharaf, Shi Feng, Khanh Nguyen, Kianté Brantley, Hal Daumé III
2017 arXiv   pre-print
Targeting these two challenges (adaptation and bandit learning), we built a standard neural machine translation system and extended it in two ways: (1) robust reinforcement learning techniques to learn  ...  Acknowledgements The authors thank the anonymous reviewers for many helpful comments.  ...  We would like to thank the task organizers: Pavel Danchenko, Hagen Fuerstenau, Julia Kreutzer, Stefan Riezler, Artem Sokolov, Kellen Sunderland, and Witold Szymaniak for organizing the task and for their  ... 
arXiv:1708.01318v2 fatcat:5a2nf5liyzdq7fpukscwxsag2i

Learning What Data to Learn [article]

Yang Fan and Fei Tian and Tao Qin and Jiang Bian and Tie-Yan Liu
2017 arXiv   pre-print
Taking neural network training with stochastic gradient descent (SGD) as an example, comprehensive experiments with respect to various neural network models (e.g., multi-layer perceptron networks, convolutional  ...  In contrast to previous studies in data selection that are mainly based on heuristic strategies, NDF is quite generic and thus widely applicable to many machine learning tasks.  ...  ., 2016) , in which we randomly sample a subset of training data to train the policy of NDF (Steps 1 and 2) with a policy gradient method, and apply the data filtration model to the training process on the  ... 
arXiv:1702.08635v1 fatcat:hbjjcmqza5fwpknnqwe2a6vwuy
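The NDF step described in the snippet amounts to a data-filtration policy trained with a policy gradient: a small network decides, per training example, whether to keep it for the SGD update, and is itself rewarded by a scalar signal such as a validation improvement. A generic, hypothetical illustration (the feature size and reward value are placeholders, not NDF's actual design):

    # A keep/drop filtration policy updated with a REINFORCE-style step.
    import torch
    import torch.nn as nn

    filter_policy = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(filter_policy.parameters(), lr=1e-3)

    features = torch.randn(64, 10)                     # per-example state features
    keep_prob = torch.sigmoid(filter_policy(features)).squeeze(-1)
    keep = torch.bernoulli(keep_prob)                  # sampled keep/drop decisions
    # ... train the main model on the kept examples, then measure a reward ...
    reward = 0.05                                      # placeholder, e.g. dev-accuracy gain

    log_prob = keep * torch.log(keep_prob + 1e-8) + (1 - keep) * torch.log(1 - keep_prob + 1e-8)
    loss = -(reward * log_prob).mean()                 # policy gradient on the filtration policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()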

Learning to Selectively Transfer

Chen Qu, Feng Ji, Minghui Qiu, Liu Yang, Zhiyu Min, Haiqing Chen, Jun Huang, W. Bruce Croft
2019 Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining - WSDM '19  
However, the emerging deep transfer models do not fit well with most existing data selection methods, because the data selection policy and the transfer learning model are not jointly trained, leading  ...  We further investigate different settings of states, rewards, and policy optimization methods to examine the robustness of our method.  ...  ACKNOWLEDGMENTS This work was supported in part by the Center for Intelligent Information Retrieval.  ... 
doi:10.1145/3289600.3290978 dblp:conf/wsdm/QuJQYMCHC19 fatcat:sgu2qoa4v5fvpewakuhl3ix36i

Iterative Policy Learning in End-to-End Trainable Task-Oriented Neural Dialog Models [article]

Bing Liu, Ian Lane
2017 arXiv   pre-print
Our experiment results show that the proposed method leads to promising improvements on task success rate and total task reward compared to supervised training and single-agent RL training baseline models  ...  Both the dialog agent and the user simulator are designed with neural network models that can be trained end-to-end.  ...  Policy Gradient RL For policy optimization with RL, a policy gradient method is preferred over Q-learning in our system as the policy network parameters can be initialized with the ActDist parameters learned  ... 
arXiv:1709.06136v1 fatcat:jv3qrel6yjebda3ubrvsfpbhke
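The rationale in the snippet, that a policy-gradient method lets the policy network be warm-started from supervised parameters, can be illustrated roughly as follows; the network shapes and data are hypothetical stand-ins, not the paper's ActDist model.

    # Copy supervised weights into the policy network, then take one policy-gradient
    # fine-tuning step on task reward. Everything here is a toy placeholder.
    import copy
    import torch
    import torch.nn as nn

    supervised_net = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
    # ... supervised training on labelled dialog turns would happen here ...

    policy = copy.deepcopy(supervised_net)      # warm-start pi_theta from supervised weights
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

    states, actions = torch.randn(8, 8), torch.randint(0, 4, (8,))
    returns = torch.randn(8)                    # task reward per sampled dialog
    log_pi = torch.log_softmax(policy(states), dim=-1)[torch.arange(8), actions]
    loss = -(returns * log_pi).mean()           # fine-tune from the supervised initialization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()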
Showing results 1 — 15 out of 4,297 results