KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents

Ygor Gallina, Florian Boudin, Béatrice Daille
2019 Proceedings of the 12th International Conference on Natural Language Generation  
Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights on how well they perform on the news domain. The dataset is available online at https://github.com/ygorg/KPTimes.
doi:10.18653/v1/w19-8617 dblp:conf/inlg/GallinaBD19 fatcat:dlasakuin5ge5gx4uqbvjyodsi