Including summaries in system evaluation

Andrew Turpin, Falk Scholer, Kalervo Järvelin, Mingfang Wu, J. Shane Culpepper
Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), 2009
In batch evaluation of retrieval systems, performance is calculated based on predetermined relevance judgements applied to a list of documents returned by the system for a query. This evaluation paradigm, however, ignores the current standard operation of search systems, which require the user to view summaries of documents prior to reading the documents themselves. In this paper we modify the popular IR metrics MAP and P@10 to incorporate the summary reading step of the search process, and examine the effects on system rankings using TREC data. Based on a user study, we establish likely disagreements between relevance judgements of summaries and of documents, and use these values to seed simulations of summary relevance in the TREC data. Re-evaluating the runs submitted to the TREC Web Track, we find the average correlation between system rankings and the original TREC rankings is 0.8 (Kendall τ), which is lower than commonly accepted for system orderings to be considered equivalent. The system that has the highest MAP in TREC generally remains amongst the highest MAP systems when summaries are taken into account, but many other systems become equivalent to the top-ranked system depending on the simulated summary relevance. Given that system orderings alter when summaries are taken into account, the small amount of effort required to judge summaries in addition to documents (19 seconds vs. 88 seconds on average in our data) should be undertaken when constructing test collections.
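The abstract does not spell out how MAP and P@10 are modified, so the Python sketch below assumes the simplest gating model consistent with the description: a document contributes to the score only when its summary is judged relevant (so the user clicks through) and the document itself is judged relevant. The function names and toy judgements are illustrative, not the paper's code.

```python
def summary_gated_ap(ranked_docs, doc_rel, summary_rel):
    """Average precision where a document counts as a hit only if its
    summary is judged relevant (user clicks) AND the document itself
    is judged relevant. Gating model assumed, not taken from the paper."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if summary_rel.get(doc, False) and doc_rel.get(doc, False):
            hits += 1
            precision_sum += hits / rank
    total_relevant = sum(doc_rel.values())
    return precision_sum / total_relevant if total_relevant else 0.0

def summary_gated_p10(ranked_docs, doc_rel, summary_rel):
    """P@10 counting only documents the user both clicks and finds relevant."""
    return sum(summary_rel.get(d, False) and doc_rel.get(d, False)
               for d in ranked_docs[:10]) / 10.0

# Toy example: d2 is relevant but its summary is not, so it is never read.
ranked = ["d1", "d2", "d3", "d4"]
doc_rel = {"d1": True, "d2": True, "d3": False, "d4": True}
sum_rel = {"d1": True, "d2": False, "d3": True, "d4": True}
print(summary_gated_ap(ranked, doc_rel, sum_rel))   # 0.5 under this model
```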
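The re-evaluation step compares system orderings with Kendall τ after seeding simulated summary judgements. A minimal sketch follows, assuming a single flip probability stands in for the summary/document disagreement rates measured in the user study; the scores and the rate are placeholder values, and SciPy's kendalltau supplies the correlation.

```python
import random
from scipy.stats import kendalltau

def simulate_summary_judgements(doc_rel, disagreement_rate, rng):
    """Flip each document judgement with the given probability, modelling
    summary/document disagreement. The rate is a placeholder; the paper
    derives its values empirically from the user study."""
    return {d: (not r) if rng.random() < disagreement_rate else r
            for d, r in doc_rel.items()}

rng = random.Random(42)
doc_rel = {"d1": True, "d2": False, "d3": True}
summary_rel = simulate_summary_judgements(doc_rel, disagreement_rate=0.2, rng=rng)

# Illustrative per-system scores; in the paper these would be MAP values
# for TREC Web Track runs before and after summary-aware re-evaluation.
original_scores = [0.31, 0.28, 0.27, 0.22, 0.19]
reevaluated_scores = [0.24, 0.26, 0.20, 0.21, 0.15]

tau, _ = kendalltau(original_scores, reevaluated_scores)
print(f"Kendall tau between system orderings: {tau:.2f}")
```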
doi:10.1145/1571941.1572029 dblp:conf/sigir/TurpinSJWC09