New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen
2022 ACM Transactions on Asian and Low-Resource Language Information Processing  
Machine reading comprehension is a natural language understanding task where the computing system is required to read a text and then find the answer to a specific question posed by a human. Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models. Furthermore, machine reading comprehension (MRC) for the health sector has potential for practical applications; nevertheless, MRC research in this domain is currently scarce. This paper presents
more » ... sQA, a new corpus for the Vietnamese language to evaluate MRC models for the healthcare textual domain. The corpus consists of 22,057 human-generated question-answer pairs. Crowd-workers create the questions and answers on a collection of 4,416 online Vietnamese healthcare news articles, where the answers are textual spans extracted from the corresponding articles. We introduce a process for creating a high-quality corpus for the Vietnamese machine reading comprehension task. Linguistically, our corpus accommodates diversity in question and answer types. In addition, we conduct experiments and compare the effectiveness of different MRC methods based on the neural networks and transformer architectures. Experimental results on our corpus show that the MRC system based on ALBERT architecture outperforms the neural network architectures and the BERT-based approach, an exact match score of 65.26% and an F1-score of 84.89%. The best machine model achieves about 10.90% F1-score less efficiently than humans, which proves that exploring machine models on UIT-ViNewsQA to surpass humans is challenging for researchers in the future. Our corpus is publicly available on our website 1 for research purposes.
doi:10.1145/3527631 fatcat:mgnypmkaavce7k6hyp4j7c5ce4