On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Vojtěch Kovář, Miloš Jakubíček, Aleš Horák
Proceedings of the 8th International Conference on Agents and Artificial Intelligence (ICAART 2016)
The paper discusses problems with state-of-the-art evaluation methods used in natural language processing (NLP). Usually, some form of gold standard data is used for the evaluation of various NLP tasks, ranging from morphological annotation to semantic analysis. We discuss the problems and validity of this type of evaluation for various tasks and illustrate the problems with examples. We then propose using application-driven evaluation wherever possible. Although it is more expensive, more complicated and not so precise, it is the only way to find out whether a particular tool is useful at all.

STATE OF THE ART: GOLD STANDARDS

A "gold standard" for an NLP task is a data set of natural language texts annotated by humans with correct solutions of that particular task (a minimal sketch of such an evaluation follows the list below). Examples include:
• treebanks, for syntactic analysis - natural language corpora where every sentence is annotated with its correct syntactic tree (Marcus et al., 1993; Hajič, 2006)
• parallel corpora, for machine translation - corpora where each sentence or segment in the source language is paired with its translation in the target language
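A minimal sketch of such a gold-standard evaluation, assuming a toy part-of-speech tagging task: the system's predicted tags are compared token by token against the human "gold" annotation and summarized as accuracy. The tag names and sentence below are invented for illustration; real evaluations run over full annotated corpora such as the treebanks cited above.

    # Gold-standard evaluation in its simplest form: count how often the
    # system's prediction matches the human annotation.
    def tag_accuracy(gold_tags, predicted_tags):
        """Fraction of tokens whose predicted tag equals the gold tag."""
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("gold and predicted sequences must have equal length")
        correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
        return correct / len(gold_tags)

    # Toy data for "The dog chased the cat" (tags are illustrative only).
    gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    predicted = ["DET", "NOUN", "NOUN", "DET", "NOUN"]  # one tagging error
    print(f"accuracy = {tag_accuracy(gold, predicted):.2f}")  # prints 0.80

The paper's argument is that a high score on such a comparison shows only agreement with one particular annotation scheme, not that the tool is useful in an application, which is why application-driven evaluation is proposed instead.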
doi:10.5220/0005824805400545