
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [article]

Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, Matt Gardner
2019 arXiv   pre-print
Reading comprehension is one of the crucial tasks for furthering research in natural language understanding. A lot of diverse reading comprehension datasets have recently been introduced to study various phenomena in natural language, ranging from simple paraphrase matching and entity typing to entity tracking and understanding the implications of the context. Given the availability of many such datasets, comprehensive and reliable evaluation is tedious and time-consuming for researchers working on this problem. We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating testing a single model's capability in understanding a wide variety of reading phenomena. The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning for general reading facility. As more suitable datasets are released, they will be added to the evaluation server. We also collect and include synthetic augmentations for these datasets, testing how well models can handle out-of-domain questions.
arXiv:1912.12598v1 fatcat:qmdmyoj73zhfldcllcybxqr3da
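
ORB's submission format and API are not described in the abstract above, so the following is only a minimal sketch of the underlying idea: score one model, unchanged, across several reading-comprehension datasets with a shared metric. All names here (`predict_fn`, `exact_match`, the dataset layout) are hypothetical stand-ins.

```python
# Illustrative only: ORB's actual submission format and API are not described
# in the abstract, so every name below (predict_fn, exact_match, the dataset
# layout) is a hypothetical stand-in for "evaluate one model on many datasets".
from typing import Callable, Dict, List, Tuple


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the prediction matches any gold answer after lowercasing, else 0.0."""
    return float(prediction.strip().lower() in {a.strip().lower() for a in gold_answers})


def evaluate_across_datasets(
    predict_fn: Callable[[str, str], str],
    datasets: Dict[str, List[Tuple[str, str, List[str]]]],
) -> Dict[str, float]:
    """Score a single model on several reading-comprehension datasets.

    Each dataset maps to (passage, question, gold_answers) triples; the same
    model is evaluated on all of them, mirroring the idea of one shared
    evaluation suite for general reading facility.
    """
    scores: Dict[str, float] = {}
    for name, examples in datasets.items():
        per_example = [
            exact_match(predict_fn(passage, question), gold)
            for passage, question, gold in examples
        ]
        scores[name] = sum(per_example) / max(len(per_example), 1)
    return scores
```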

Evaluating Models' Local Decision Boundaries via Contrast Sets [article]

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi (+14 others)
2020 arXiv   pre-print
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
arXiv:2004.02709v2 fatcat:zwreyqnxiveyvpktpwazmczfv4
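
A minimal sketch of the contrast-set idea from the entry above: each test instance is paired with small, manually written perturbations that flip the gold label, and a model is credited only if it handles the whole local neighborhood. The sentiment examples and the `model` interface are invented for illustration; they are not taken from the released contrast sets.

```python
# Hypothetical illustration of a contrast set for IMDb-style sentiment analysis:
# one original instance plus manual perturbations that flip the gold label.
from typing import Callable, List, Tuple

original: Tuple[str, str] = (
    "The plot was predictable, but the acting made it a joy to watch.", "positive",
)
contrast_set: List[Tuple[str, str]] = [
    ("The plot was predictable, and the acting could not save it.", "negative"),
    ("The plot was gripping, but the acting made it a chore to watch.", "negative"),
]


def contrast_consistency(model: Callable[[str], str]) -> bool:
    """True only if the model labels the original AND every contrast correctly."""
    return all(model(text) == gold for text, gold in [original] + contrast_set)
```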

Dynamic Sampling Strategies for Multi-Task Reading Comprehension

Ananth Gottumukkala, Dheeru Dua, Sameer Singh, Matt Gardner
2020 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics   unpublished
Building general reading comprehension systems, capable of solving multiple datasets at the same time, is a recent aspirational goal in the research community. Prior work has focused on model architectures or generalization to held-out datasets, and largely passed over the particulars of the multi-task learning setup. We show that a simple dynamic sampling strategy, selecting instances for training proportional to the multi-task model's current performance on a dataset relative to its single-task performance, gives substantive gains over prior multi-task sampling strategies, mitigating the catastrophic forgetting that is common in multi-task learning. We also demonstrate that allowing instances of different tasks to be interleaved as much as possible between each epoch and batch has a clear benefit in multi-task performance over forcing task homogeneity at the epoch or batch level. Our final model shows greatly increased performance over the best model on ORB, a recently released multi-task reading comprehension benchmark.
doi:10.18653/v1/2020.acl-main.86 fatcat:mcjii6it6fda3gmnqlpiggpz4i
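
The sampling strategy described above lends itself to a short sketch: weight each dataset by how far the multi-task model currently lags behind a reference (e.g., single-task) score on it, then draw the next batch's dataset from that distribution. The function names, score values, and the uniform floor below are illustrative assumptions, not the paper's exact recipe.

```python
# Rough sketch of dynamic sampling: datasets where the multi-task model lags
# its reference (single-task) score get sampled more often. Names and numbers
# are illustrative, not taken from the paper.
import random
from typing import Dict


def sampling_weights(
    current_scores: Dict[str, float],    # multi-task model's current dev score per dataset
    reference_scores: Dict[str, float],  # e.g. single-task score per dataset
    floor: float = 1e-3,                 # keep every dataset sampleable
) -> Dict[str, float]:
    gaps = {
        name: max(reference_scores[name] - current_scores.get(name, 0.0), floor)
        for name in reference_scores
    }
    total = sum(gaps.values())
    return {name: gap / total for name, gap in gaps.items()}


def sample_dataset(weights: Dict[str, float]) -> str:
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]


# Example: the model lags most on "drop", so "drop" is drawn most often.
weights = sampling_weights(
    current_scores={"drop": 0.48, "squad": 0.80, "quoref": 0.60},
    reference_scores={"drop": 0.60, "squad": 0.85, "quoref": 0.70},
)
print(sample_dataset(weights))
```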

Evaluating Models' Local Decision Boundaries via Contrast Sets

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi (+14 others)
2020 Findings of the Association for Computational Linguistics: EMNLP 2020   unpublished
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
doi:10.18653/v1/2020.findings-emnlp.117 fatcat:lnvj4ujjozh5pocryw7b233sne

Towards Interpretable Reasoning over Paragraph Effects in Situation [article]

Mucheng Ren, Xiubo Geng, Tao Qin, Heyan Huang, Daxin Jiang
2020 arXiv   pre-print
Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Sameer Singh, and Matt Gardner. 2019a.  ...  Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020.  ... 
arXiv:2010.01272v1 fatcat:ojs4ocwvtje7xfhd2wu4iezhf4

Towards Debiasing NLU Models from Unknown Biases [article]

Prasetya Ajie Utama, Nafise Sadat Moosavi, Iryna Gurevych
2020 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020.  ... 
arXiv:2009.12303v4 fatcat:to6ealbpvjftdc2ksrfzp6xqau

Easy, Reproducible and Quality-Controlled Data Collection with Crowdaq [article]

Qiang Ning, Hao Wu, Pradeep Dasigi, Dheeru Dua, Matt Gardner, Robert L. Logan IV, Ana Marasovic, Zhen Nie
2020 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel  ... 
arXiv:2010.06694v1 fatcat:5jnkjtuz4vehzea7dmntepwnja

Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation [article]

Prakhar Gupta, Yulia Tsvetkov, Jeffrey P. Bigham
2021 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco,  ... 
arXiv:2106.05894v1 fatcat:hpesb4bivbfbzoyjrfu3u3urhe

Can NLI Models Verify QA Systems' Predictions? [article]

Jifan Chen, Eunsol Choi, Greg Durrett
2021 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco,  ... 
arXiv:2104.08731v2 fatcat:dw5m5vg7ebh2poidozbgtwrabi

Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection

Sihao Chen, Fan Zhang, Kazoo Sone, Dan Roth
2021 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies   unpublished
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel  ... 
doi:10.18653/v1/2021.naacl-main.475 fatcat:u6ur4mn5yjbwrjgyyfcoenvpa4

A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions [article]

Takuma Udagawa, Takato Yamazaki, Akiko Aizawa
2020 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020.  ... 
arXiv:2010.03127v1 fatcat:s3bsa7qb6fdalopkntel2szdpa

SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning [article]

Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, Parisa Kordjamshidi
2021 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco,  ... 
arXiv:2104.05832v1 fatcat:et2jdbr5tjh45hgkekyk74tify

Model Agnostic Answer Reranking System for Adversarial Question Answering

Sagnik Majumder, Chinmoy Samant, Greg Durrett
2021 Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop   unpublished
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020.  ... 
doi:10.18653/v1/2021.eacl-srw.8 fatcat:2lyhyprp3be57mlee764dzlrce

Benchmarking Machine Reading Comprehension: A Psychological Perspective [article]

Saku Sugawara, Pontus Stenetorp, Akiko Aizawa
2021 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco,  ... 
arXiv:2004.01912v2 fatcat:lyypngwm4vbk7igfcjfmhkn5ja

NOPE: A Corpus of Naturally-Occurring Presuppositions in English [article]

Alicia Parrish, Sebastian Schuster, Alex Warstadt, Omar Agha, Soo-Hwan Lee, Zhuoye Zhao, Samuel R. Bowman, Tal Linzen
2021 arXiv   pre-print
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020.  ... 
arXiv:2109.06987v1 fatcat:36x7qdrd6fdqnjxdgh4hjrhioy
Showing results 1–15 of 45