Communications of the ACM
In contrast to most areas of computer science research, information retrieval research has a rich tradition of experimentation. In the 1960s, the librarian Cyril Cleverdon and his colleagues at the College of Aeronautics, Cranfield, England, UK, ran a series of tests to determine appropriate indexing languages for information retrieval [Cle67]. The findings were highly controversial at the time, though the tests are better known today for the experimental methodology they introduced. This so-called Cranfield methodology was picked up by other research groups, most notably by Gerard Salton's SMART group at Cornell University [Sal71], and was sufficiently established by 1981 that Karen Spärck Jones edited the book Information Retrieval Experiment [Spä81]. The Text REtrieval Conference (TREC) [VH05], started in 1992, is a modern manifestation of the Cranfield methodology that attests to the power of appropriate experimentation. The state of the art in retrieval system effectiveness has doubled since TREC began, and most commercial retrieval systems, including web search engines, contain technology originally developed in TREC.

The fundamental goal of a retrieval system is to help its users find information contained in large stores of free text. The problem is challenging because natural language is rich and complex: searchers and authors can easily express the same concept in widely different ways. Algorithms must also be efficient because of the amount of text to be searched. The situation is further complicated by the fact that different information-seeking tasks are best supported in different ways, and individual users have different opinions as to precisely what information should be retrieved.

The core of the Cranfield methodology is to abstract away from the details of particular tasks and users to a benchmark task called a test collection. A test collection consists of a set of documents; a set of information need statements called topics; and relevance judgments, a mapping of which documents should be retrieved for which topics. The abstract retrieval task is to produce a ranking of the document set for each topic such that relevant documents are ranked above nonrelevant documents. The Cranfield methodology facilitates research by providing a convenient paradigm for comparing retrieval technologies in a laboratory setting.
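To make the three parts of a test collection concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from the Cranfield or TREC tooling: the `Collection` class, the toy word-overlap ranker `rank`, the sample documents and topic, and the helper `fraction_pairs_ordered`, which simply reports what share of (relevant, nonrelevant) document pairs a ranking puts in the right order. It is not one of the evaluation measures used in those experiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Collection:
    """Hypothetical container for the three parts of a test collection."""
    documents: Dict[str, str]   # document id -> document text
    topics: Dict[str, str]      # topic id -> information need statement
    # Relevance judgments: topic id -> ids of documents that should be retrieved.
    relevant: Dict[str, Set[str]] = field(default_factory=dict)


def rank(collection: Collection, topic_id: str) -> List[str]:
    """Toy ranker: order all documents by the number of topic words they share."""
    topic_terms = set(collection.topics[topic_id].lower().split())

    def overlap(doc_id: str) -> int:
        return len(topic_terms & set(collection.documents[doc_id].lower().split()))

    # Highest overlap first; ties are broken by document id for determinism.
    return sorted(collection.documents, key=lambda d: (-overlap(d), d))


def fraction_pairs_ordered(collection: Collection, topic_id: str,
                           ranking: List[str]) -> float:
    """Share of (relevant, nonrelevant) pairs with the relevant document ranked higher."""
    relevant = collection.relevant.get(topic_id, set())
    position = {doc_id: i for i, doc_id in enumerate(ranking)}
    pairs = [(r, n) for r in ranking if r in relevant
             for n in ranking if n not in relevant]
    if not pairs:
        return 1.0  # nothing to misorder
    return sum(position[r] < position[n] for r, n in pairs) / len(pairs)


# Made-up sample data, for illustration only.
collection = Collection(
    documents={
        "d1": "aircraft wing flutter analysis",
        "d2": "library indexing languages",
        "d3": "wing design for supersonic aircraft",
    },
    topics={"t1": "aircraft wing design"},
    relevant={"t1": {"d1", "d3"}},
)

ranking = rank(collection, "t1")
print(ranking, fraction_pairs_ordered(collection, "t1", ranking))
```

Because the documents, topics, and judgments are fixed, any two rankers can be run against the same `collection` and compared directly, which is the kind of laboratory comparison the methodology is meant to support.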
The methodology is useful since the ability to perform the abstract task well is necessary (though not sufficient) to support a wide range of information-seeking tasks.

The original Cranfield experiments created a test collection consisting of 1400 documents and a set of 225 requests. Many retrieval experiments were run in the twenty years following the Cranfield tests and several other test collections were built, but by 1990 there was growing dissatisfaction with the methodology. While some research groups did use the same test collections, there was no concerted effort to work with the same data, to use the same evaluation measures, or to compare results across systems to consolidate findings. The available test collections were so small that operators of commercial retrieval systems were unconvinced that the techniques developed using test collections would scale to their much larger document sets. Even some experimenters were questioning whether test collections had outlived their usefulness.

At this time, the National Institute of Standards and Technology (NIST) was asked to build a large test collection for use in evaluating text retrieval technology developed as part of the Defense Advanced Research Projects Agency's (DARPA) TIPSTER project. NIST proposed that in addition to building a large test collection, it would also organize a workshop to investigate the larger issues surrounding test collection use. DARPA agreed, and TREC was born.