TREC 2015 Total Recall Track Overview

Adam Roegiest, Gordon V. Cormack, Charles L. A. Clarke, Maura R. Grossman
2015 Text Retrieval Conference  
The primary purpose of the Total Recall Track is to evaluate, through controlled simulation, methods designed to achieve very high recall (as close as practicable to 100%) with a human assessor in the loop. Motivating applications include, among others, electronic discovery in legal proceedings [2], systematic review in evidence-based medicine [11], and the creation of fully labeled test collections for information retrieval ("IR") evaluation [8]. A secondary, but no less important, purpose is to develop a sandboxed virtual test environment within which IR systems may be tested, while preventing the disclosure of sensitive test data to participants. At the same time, the test environment also operates as a "black box," affording participants confidence that their proprietary systems cannot easily be reverse engineered.

The task to be solved in the Total Recall Track is the following: given a simple topic description, like those used for ad-hoc and Web search, identify the documents in a corpus, one at a time, such that, as nearly as possible, all relevant documents are identified before all non-relevant documents. Immediately after each document is identified, its ground-truth relevance or non-relevance is disclosed.

* Current affiliation: University of Waterloo. The views expressed herein are solely those of the author and should not be attributed to her former firm or its clients.

Summary results were supplied to participants for their own runs, as well as for the BMI runs. The system architecture for the Track is detailed in a separate Notebook paper titled Total Recall Track Tools Architecture Overview [16].

The TREC 2015 Total Recall Track attracted 10 participants: three industrial groups that submitted "manual at-home" runs, two academic groups that submitted only "automatic at-home" runs, and five academic groups that submitted both "automatic at-home" and "sandbox" runs.

The 2015 At-Home collections consisted of three datasets and 30 topics. The Jeb Bush emails were collected and assessed for 10 topics by the Track coordinators. The "Illicit Goods" and "Local Politics" datasets, along with 10 topics for each, were derived from the Dynamic Domain datasets and assessed by the Total Recall coordinators. These collections continue to be available through the Total Recall Server to 2015 participants, and were made available to 2016 participants for training purposes.
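The one-document-at-a-time protocol with immediate relevance feedback can be sketched as a simple simulation loop. This is a minimal illustration under my own assumptions, not the Track's actual server API: the fixed priority scores and document names are hypothetical, and a real participant system would update its model after each disclosed judgment.

```python
# Minimal simulation of the Total Recall feedback protocol:
# a system submits one document at a time, and the ground-truth
# relevance of that document is disclosed immediately.
# The static `score` dict is purely illustrative (hypothetical).

def run_simulation(corpus, relevant, score):
    """corpus: list of doc ids; relevant: set of relevant doc ids;
    score: dict mapping doc id -> current priority."""
    judged = []                # (doc_id, is_relevant) in submission order
    remaining = set(corpus)
    while remaining:
        # Submit the highest-priority unjudged document.
        doc = max(remaining, key=lambda d: score[d])
        remaining.discard(doc)
        is_rel = doc in relevant           # ground truth disclosed here
        judged.append((doc, is_rel))
        # A real system would retrain / rescore using this feedback.
    return judged

corpus = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4"}
score = {"d1": 0.1, "d2": 0.9, "d3": 0.2, "d4": 0.8}
order = [d for d, _ in run_simulation(corpus, relevant, score)]
# Documents are submitted in descending score order: d2, d4, d3, d1
```

With a static scoring function the loop degenerates to a fixed ranking; the point of the protocol is that the disclosed judgments allow the ranking to be revised continuously.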
The Sandbox collections consisted of two datasets and 23 topics. On-site access to former Governor Tim Kaine's email collection at the Library of Virginia was arranged by the Track coordinators; a "Sandbox appliance" was used to conduct and evaluate participant runs according to topics that corresponded to archival category labels previously applied by the Library's Senior State Records Archivist: "Not a Public Record," "Open Public Record," "Restricted Public Record," and "Virginia Tech Shooting Record." The coordinators also secured approval to use the MIMIC II clinical dataset as the second Sandbox dataset. The textual documents from this dataset (consisting of discharge summaries, nurses' notes, and radiology reports) were used as the corpus; the 19 top-level codes in the ICD-9 hierarchy were used as the "topics."

The principal tool for comparing runs was a gain curve, which plots recall (i.e., the proportion of all relevant documents submitted to the Web server for review) as a function of effort (i.e., the total number of documents submitted to the Web server for review). A run that achieves higher recall with less effort demonstrates superior effectiveness, particularly at high recall levels. The traditional recall-precision curve conveys similar information, plotting precision (i.e., the proportion of documents submitted to the Web server that are relevant) as a function of recall. Both curves convey similar information, but are influenced differently by prevalence or richness (i.e., the proportion of documents in the collection that are relevant), and convey different impressions when averaged over topics with different richness. A gain curve or recall-precision curve is blind to the important consideration of when to stop a retrieval effort.
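Both curves can be computed directly from the sequence of relevance judgments returned in submission order. A minimal sketch (function and variable names are my own, not part of the Track's tooling):

```python
def curves(judgments, R):
    """judgments: list of booleans, in submission order;
    R: total number of relevant documents in the collection.
    Returns (gain_curve, pr_curve): gain_curve[i] is recall after
    i+1 submissions (recall as a function of effort); pr_curve
    pairs recall with precision at each point."""
    gain, pr = [], []
    found = 0
    for effort, rel in enumerate(judgments, start=1):
        found += rel
        recall = found / R          # fraction of all relevant docs found
        precision = found / effort  # fraction of submissions that are relevant
        gain.append(recall)
        pr.append((recall, precision))
    return gain, pr

# Toy run: 4 submissions against a collection with R = 4 relevant docs.
gain, pr = curves([True, True, False, True], R=4)
# gain -> [0.25, 0.5, 0.5, 0.75]; final point has recall = precision = 0.75
```

Because recall is normalized by R while precision is normalized by effort, the two curves respond differently to richness, which is why averaging either one over topics of very different richness can mislead.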
In general, the density of relevant documents diminishes as effort increases, and at some point, the benefit of identifying more relevant documents no longer justifies the review effort required to find them. Participants were asked to "call their shot," or to indicate when they thought a "reasonable" result had been achieved; that is, to specify the point at which they would recommend terminating the review process because further effort would be "disproportionate." They were not actually required to stop at this point; they were simply given the option to indicate, contemporaneously, when they would have chosen to stop had they been required to do so. For this point, we report traditional set-based measures such as recall, precision, and F1.

To evaluate the appropriateness of various possible stopping points, the Track coordinators devised a new parametric measure: recall @ aR + b, for various values of a and b. Recall @ aR + b is defined to be the recall achieved when aR + b documents have been submitted to the Web server, where R is the number of relevant documents in the collection. In its simplest form (a = 1, b = 0), recall @ aR + b is equivalent to R-precision, which has been used since TREC 1 as an evaluation measure for relevance ranking. R-precision might equally well be called R-recall, as precision and recall are, by definition, equal when R documents have been reviewed. The parameters a and b allow us to explore the recall that might be achieved when a times as many documents, plus an additional b documents, are reviewed. The parameter a admits that it may be reasonable to review more than one document for every relevant one that is found; the parameter b admits that it may be reasonable to review a fixed number of additional documents, over and above the number that are relevant.
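The parametric measure and the set-based measures at a stopping point can be sketched as follows (a minimal illustration; function names are my own). The example also checks the identity noted above: with a = 1 and b = 0, recall @ aR + b coincides with R-precision.

```python
def recall_at(judgments, R, a, b):
    """Recall @ aR + b: the recall achieved once a*R + b documents
    have been submitted (capped at the length of the run)."""
    cutoff = min(a * R + b, len(judgments))
    return sum(judgments[:cutoff]) / R

def set_measures(judgments, R, stop):
    """Traditional set-based measures at a chosen stopping point."""
    found = sum(judgments[:stop])
    recall = found / R
    precision = found / stop
    f1 = (2 * precision * recall / (precision + recall)
          if found else 0.0)
    return recall, precision, f1

# Toy run with R = 4 relevant documents in the collection.
judgments = [True, True, False, True, False, False, True, False]
R = 4

# a = 1, b = 0: recall after R submissions, i.e. R-precision.
r_prec = recall_at(judgments, R, a=1, b=0)
rec, prec, f1 = set_measures(judgments, R, stop=R)
# At effort R, recall and precision are equal by definition (both 0.75 here).
```

The cap at the run length matters in practice: for generous (a, b) settings, aR + b can exceed the number of documents a participant actually submitted.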
For example, if there are 100 relevant documents in the collection, it may be reasonable to review 200 documents (a = 2), plus an additional 100 documents (b = 100), for a total of 300 documents, in order to achieve high recall. In this Track Overview paper, we report all combinations of a ∈ {1, 2, 4} and b ∈ {0, 100, 1000}.

At the time of the 2015 Total Recall Track, the coordinators had hoped to be able to implement facet-based variants of the recall measures described above (see Cormack & Grossman [3]), but suitable relevance assessments for the facets were not available in time. We therefore decided to implement such measures in a future Track. The rationale for facet-based measures derives from the fact that, due to a number of factors including assessor disagreement,
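As a quick check of the arithmetic, the nine (a, b) cutoffs reported in this paper can be enumerated for the R = 100 example:

```python
# Effort cutoffs aR + b for every reported (a, b) combination,
# illustrated for a collection with R = 100 relevant documents.
R = 100
cutoffs = {(a, b): a * R + b
           for a in (1, 2, 4)
           for b in (0, 100, 1000)}
# e.g. (a=2, b=100) -> 300 documents, matching the example above;
# the cutoffs range from 100 (a=1, b=0) to 1400 (a=4, b=1000).
```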