Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

Wenhan Xiong, Hong Wang, William Yang Wang
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021
Commonly used information retrieval methods such as TF-IDF in open-domain question answering (QA) systems are insufficient to capture deep semantic matching that goes beyond lexical overlaps. Some recent studies consider the retrieval process as maximum inner product search (MIPS) using dense question and paragraph representations, achieving promising results on several information-seeking QA datasets. However, the pretraining of the dense vector representations is highly resource-demanding, requiring a very large batch size and many training steps. In this work, we propose a sample-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we use an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a simple progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three open-domain QA datasets, our method consistently outperforms a strong dense retrieval baseline that uses 6 times more computation for training. On two of the datasets, our method achieves more than a 4-point absolute improvement in answer exact match.
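The retrieval-as-MIPS formulation the abstract refers to amounts to scoring every paragraph vector against a question vector by inner product and returning the top hits. A minimal sketch of this lookup using FAISS's exact inner-product index is below; the dimensionality, corpus size, and random vectors are illustrative stand-ins for real encoder outputs.

```python
# Sketch of retrieval as maximum inner product search (MIPS) over a
# dense corpus index. Assumes paragraph and question vectors come from
# trained encoders; random data is used here only for illustration.
import faiss
import numpy as np

dim = 768
paragraph_vecs = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexFlatIP(dim)   # exact inner-product search
index.add(paragraph_vecs)

question_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(question_vec, 5)  # top-5 paragraphs by inner product
print(ids[0], scores[0])
```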
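The first contribution, generating pseudo question-paragraph pairs with a pretrained sequence-to-sequence model, could look roughly like the sketch below. This is an assumption-laden illustration: the paper's generator is built by adapting an existing seq2seq model, whereas the off-the-shelf BART checkpoint used here is only a placeholder and would need question-generation fine-tuning before its outputs resemble real questions.

```python
# Hypothetical sketch of creating a pseudo question-paragraph pair with
# a seq2seq model. "facebook/bart-large" is a stand-in checkpoint; the
# actual generator in the paper is fine-tuned for question generation.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

paragraph = (
    "The Eiffel Tower is a wrought-iron lattice tower on the "
    "Champ de Mars in Paris, France."
)

# Encode the paragraph and decode a synthetic question via beam search.
inputs = tokenizer(paragraph, return_tensors="pt", truncation=True)
question_ids = model.generate(
    inputs["input_ids"], num_beams=4, max_length=64, early_stopping=True
)
question = tokenizer.decode(question_ids[0], skip_special_tokens=True)
print(question)  # pseudo question paired with `paragraph` for pretraining
```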
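The second contribution concerns negatives within each training batch. In the standard dense-retrieval setup, every other paragraph in a batch acts as a negative for a given question, so small or easy batches yield weak training signal; the sketch below shows this in-batch contrastive loss under that standard formulation (encoder details and dimensions are assumptions, not the paper's exact configuration).

```python
# Minimal sketch of dense retrieval training with in-batch negatives.
# q_vecs and p_vecs stand in for outputs of question and paragraph
# encoders; row i of each is a positive question-paragraph pair.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_vecs, p_vecs):
    """q_vecs, p_vecs: [batch, dim] tensors of encoder outputs.

    Every off-diagonal paragraph serves as a negative, so the loss is
    only as useful as the negatives the batch happens to contain --
    the motivation for the paper's progressive pretraining schedule.
    """
    scores = q_vecs @ p_vecs.T              # inner products, [batch, batch]
    labels = torch.arange(q_vecs.size(0))   # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random vectors standing in for encoder outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, p))
```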
doi:10.18653/v1/2021.eacl-main.244