Multi-document summarization of scientific corpora

Ozge Yeloglu, Evangelos Milios, Nur Zincir-Heywood
2011 Proceedings of the 2011 ACM Symposium on Applied Computing - SAC '11  
In this paper, we investigated four approaches to summarizing scientific corpora when only gold-standard keyterms are available. MEAD with its built-in default vocabulary, MEAD with a corpus-specific vocabulary extracted by the Keyphrase Extraction Algorithm (KEA), LexRank (a state-of-the-art summarization algorithm based on random walks), and W3SS (a summarization algorithm based on keyword density) are tested on two Computer Science research paper collections. We use a content evaluation method, the pyramid method, instead of the well-known ROUGE metrics, since no gold-standard summaries are available for our data. Evaluation with the pyramid method indicates that adding a corpus-specific vocabulary to the traditional summarization methods improves performance, but not significantly. On the other hand, visual inspection shows that current content evaluation methods, which use only the gold-standard keyterm information, are not intuitive, and that focus must turn to better evaluation techniques, especially for the multi-document summarization problem. Even though the pyramid method looks for important keyterms in the resulting summaries, it cannot distinguish between a general introductory sentence about the area and a specific sentence on the core idea if both contain the same keyterm. Our results also show that the state-of-the-art summarization method LexRank is not feasible for scientific corpus summarization because of its high computational cost.
doi:10.1145/1982185.1982243 dblp:conf/sac/YelogluMZ11 fatcat:zklj7gjueffdtgrgughbx34e54