Impact of Review-Set Selection on Human Assessment for Text Classification

Adam Roegiest, Gordon V. Cormack
Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), 2016
In a laboratory study, human assessors were significantly more likely to judge the same documents as relevant when they were presented for assessment within the context of documents selected using random or uncertainty sampling, as compared to relevance sampling. The effect is substantial and significant (0.54 vs. 0.42, p < 0.0002) across a population of documents including both relevant and non-relevant documents, for several definitions of ground truth. This result is in accord with Smucker and Jethani's SIGIR 2010 finding that documents were more likely to be judged relevant when assessed within low-precision versus high-precision ranked lists. Our study supports the notion that relevance is malleable, and that one should take care in assuming any labeling to be ground truth, whether for training, tuning, or evaluating text classifiers.
doi:10.1145/2911451.2914709 dblp:conf/sigir/RoegiestC16