SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization [article]

Yang Gao, Wei Zhao, Steffen Eger
2020 arXiv   pre-print
We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations (e.g. preferences, ratings, etc.).  ...  Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18-39%.  ...  However, as their method is designed for evaluating single-document summaries, it correlates poorly with the Pyramid scores for multi-document summaries (see §3). Unsupervised Evaluation.  ... 
arXiv:2005.03724v1 fatcat:eu3l5nvln5f7dlp2pnx6ldg5sy

SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [article]

Forrest Sheng Bao, Hebi Li, Ge Luo, Minghui Qiu, Yinfei Yang, Youbiao He, Cen Chen
2022 arXiv   pre-print
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.  ...  In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a great advantage in gauging linguistic qualities over all metrics.  ...  Acknowledgments Bao, Luo, Li, and He's work in this paper is partially supported by National Science Foundation (NSF) grants No. MCB-1821828 and No. CNS-1817089.  ... 
arXiv:2005.06377v3 fatcat:z3nb67e5tjbpzea2kirmwutdvy

SummEval: Re-evaluating Summarization Evaluation [article]

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
2021 arXiv   pre-print
We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments  ...  and unified API for evaluating summarization models across a broad range of automatic metrics, 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments  ...  Acknowledgements We thank all authors for sharing model outputs and Tony Wong for assistance with annotations.  ... 
arXiv:2007.12626v4 fatcat:imi3aqmlszehxlzffivbpq4mam

FFCI: A Framework for Interpretable Automatic Evaluation of Summarization [article]

Fajri Koto, Timothy Baldwin, Jey Han Lau
2022 arXiv   pre-print
We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.  ...  In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary  ...  Acknowledgments In this research, the first author is supported by the Australia Awards Scholarship (AAS), funded by the Department of Foreign Affairs and Trade (DFAT), Australia.  ... 
arXiv:2011.13662v3 fatcat:e3ks7fc6cvdp7jyseteiqygimq

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing, Zhiting Hu
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, and Alexey Tikhonov. 2020.  ...  In Proceedings of the Workshop on New Frontiers  ...  Association for Computational Linguistics.  ...  ric for news article summarization. In Proceed-  ... 
doi:10.18653/v1/2021.emnlp-main.599 fatcat:g63kgfi7ijbxndie5sxw4kgud4

Semantically Driven Sentence Fusion: Modeling and Evaluation

Eyal Ben-David, Orgad Keller, Eric Malmi, Idan Szpektor, Roi Reichart
2020 Findings of the Association for Computational Linguistics: EMNLP 2020   unpublished
Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants.  ...  We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation.  ...  Acknowledgments We would like to thank the members of the IE@Technion NLP group and Roee Aharoni, for their valuable feedback and advice.  ... 
doi:10.18653/v1/2020.findings-emnlp.135 fatcat:445ahblot5aypc7eymwkl7cv7q

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, Mitesh M. Khapra
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
Evaluation of text generation: A survey. CoRR, abs/2006.14799.  ...  SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the As-  ...  Leshem Choshen and Omri Abend. 2018.  ... 
doi:10.18653/v1/2021.emnlp-main.575 fatcat:nvicwmagqrfx7malsmalooj67i

Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors

Marvin Kaster, Wei Zhao, Steffen Eger
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization.  ...  2021. Scientific credibility of machine translation  ...  side the field of machine translation; for example, SUPERT (Gao et al., 2020) for summarization.  ...  This yields insights into which linguistic information signals  ... 
doi:10.18653/v1/2021.emnlp-main.701 fatcat:swvph7cvtbap7p7vjwiku76xza