MRG_UWaterloo and WaterlooCormack Participation in the TREC 2017 Common Core Track

Maura R. Grossman, Gordon V. Cormack
2017 Text Retrieval Conference  
The MRG_UWaterloo group from the University of Waterloo used a Continuous Active Learning ("CAL") approach [1] to identify and manually review a substantial fraction of the relevant documents for each of the 250 Common Core topics. Our primary goal was to create, with less effort, a set of relevance assessments ("qrels") comparable to the official Common Core Track qrels (cf. [12, 2]). To this end, we adapted for live, human-in-the-loop use the AutoTAR CAL implementation, which had demonstrated superior effectiveness as a baseline for the TREC 2016 Total Recall Track [9, 5].
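The AutoTAR CAL procedure itself is not reproduced in this paper; for readers unfamiliar with it, the following is a minimal sketch of an AutoTAR-style loop, not the code we actually ran. It assumes a list of document strings (corpus), the topic statement used as a synthetic seed document (topic), and an interactive review() callback supplied by the assessor; scikit-learn's TfidfVectorizer and LogisticRegression stand in for the features and learner of the real implementation.

```python
# Minimal sketch of an AutoTAR-style CAL loop (illustrative, not the authors' code).
# Assumes: `corpus` is a list of document strings, `topic` is the topic statement
# used as a synthetic seed document, and `review(doc_id)` returns the assessor's
# judgment (truthy = relevant).
import math
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(corpus, topic, review, budget=1000):
    vec = TfidfVectorizer(sublinear_tf=True)
    X = vec.fit_transform(corpus + [topic])       # last row is the synthetic seed
    seed = X.shape[0] - 1
    labels = {seed: 1}                            # seed is provisionally relevant
    batch = 1
    while len(labels) - 1 < budget:
        pool = [i for i in range(seed) if i not in labels]
        if not pool:
            break
        # AutoTAR pads training with random docs provisionally labeled non-relevant
        pad = random.sample(pool, min(100, len(pool)))
        train = list(labels) + pad
        y = [labels[i] for i in labels] + [0] * len(pad)
        clf = LogisticRegression(max_iter=1000).fit(X[train], y)
        scores = clf.decision_function(X[:seed])  # score only the real documents
        ranked = scores.argsort()[::-1]
        todo = [i for i in ranked if i not in labels][:batch]
        for i in todo:
            labels[i] = 1 if review(i) else 0     # live human-in-the-loop judgment
        batch += math.ceil(batch / 10)            # grow the batch size by ~10%
    labels.pop(seed)
    return labels                                 # doc index -> judged relevance
```

In the live setting described above, review() would display the document to the assessor and record a relevance keystroke; the loop continues until the assessor is satisfied that substantially all relevant documents have been found or the review budget is exhausted.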
In total, for the 250 topics, the authors spent 64.1 hours assessing 42,587 documents (on average, 15.4 minutes per topic; 5.4 seconds per document), judging 30,124 of them (70.7%) to be relevant.

While the principal outcome of the MRG_UWaterloo effort was a set of relevant documents for each topic, it was necessary to submit ranked lists of 10,000 documents for each topic, to be evaluated using the standard rank-based measures calculated by "trec_eval." In theory, according to the probability ranking principle, the optimal strategy to maximize these measures is to construct a ranked list of the 10,000 most-likely-relevant documents, with the documents ordered by their likelihood of relevance. In practice, the official qrels used for TREC evaluation are influenced by the submitted runs, confounding the theoretically optimal strategy. Participants were asked to prioritize their runs, and each participant was assured only that the ten highest-ranked documents from their highest-priority submission would be assessed for relevance and included in the qrels. An unspecified number of additional highly ranked documents were also to be included, depending on the results of assessing the higher-ranked documents relative to the results of other participants' runs (cf. [6]). Overall, the qrels for each topic represent a non-statistical sample of the document population, biased heavily toward documents that one or more runs deemed to have a high likelihood of relevance.

To estimate the precision of our alternate qrels according to the TREC assessors, we applied a random permutation to the documents we had assessed as relevant. These documents, in the order determined by the random permutation, were afforded the highest ranks in our highest-priority run ("MRGrandrel"), thus assuring that a random sample (i.e., the first ten) would be assessed by TREC. The remaining documents were scored using the final AutoTAR model and ranked from highest to lowest score. Our secondary and tertiary runs ("MRGrankrel" and "MRGrankall") were ordered slightly differently. The ranked lists in MRGrankrel consisted of all documents that we assessed as relevant, ordered by score, followed by the top-scoring documents that we did not assess or assessed as non-relevant, also ordered by score. The ranked lists in MRGrankall consisted of the top-scoring documents, ordered by score, regardless of whether or not we had assessed them as relevant.
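The run construction just described is mechanical; the sketch below shows how the three ranked lists could be assembled, under the assumption that scores holds the final-model score for every document and assessed holds our binary judgments for the documents we reviewed. The function and variable names are illustrative only, not the code we submitted.

```python
# Illustrative sketch (not the submitted code) of assembling the three runs
# from the assessment labels and final AutoTAR model scores described above.
# Assumes `scores` maps docno -> final-model score for every document, and
# `assessed` maps docno -> True/False for the documents we reviewed.
import random

def build_runs(scores, assessed, k=10000, rng_seed=0):
    relevant = [d for d, r in assessed.items() if r]
    rel_set = set(relevant)
    by_score = sorted(scores, key=scores.get, reverse=True)

    # MRGrandrel: a random permutation of our relevant documents first, so that
    # the ten documents TREC is assured to judge form a random sample of them;
    # the remaining documents follow, ordered by final-model score.
    shuffled = relevant[:]
    random.Random(rng_seed).shuffle(shuffled)
    mrg_randrel = shuffled + [d for d in by_score if d not in rel_set]

    # MRGrankrel: our relevant documents ordered by score, followed by the
    # top-scoring documents we did not assess or assessed as non-relevant.
    mrg_rankrel = (sorted(relevant, key=scores.get, reverse=True)
                   + [d for d in by_score if d not in rel_set])

    # MRGrankall: the top-scoring documents by score alone, ignoring our labels.
    mrg_rankall = by_score

    return mrg_randrel[:k], mrg_rankrel[:k], mrg_rankall[:k]
```

Each list would then be truncated to 10,000 documents per topic and written in the standard trec_eval run format for submission.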