How does clickthrough data reflect retrieval quality?

Filip Radlinski, Madhu Kurup, Thorsten Joachims
2008 Proceeding of the 17th ACM conference on Information and knowledge management - CIKM '08
Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflect retrieval quality for the sample sizes we consider. However, we find that paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and we find that both give accurate and consistent results. We conclude that both paired comparison tests give substantially more accurate and sensitive evaluation results than absolute usage metrics in our domain.
doi:10.1145/1458082.1458092 dblp:conf/cikm/RadlinskiKJ08 fatcat:kfdvw7cbjjaw7gwmclem3v4x7u