Evaluating Web Search with a Bejeweled Player Model
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '17
The design of a Web search evaluation metric is closely related to how the user's interaction process is modeled. Each behavioral model results in a different metric used to evaluate search performance. In these models and the user behavior assumptions behind them, the point at which a user ends a search session is a prime concern because it is highly related to both benefit and cost estimation. Existing metric designs usually adopt simplified criteria to decide the stopping point: (1) an upper limit for benefit (e.g. RR, AP); (2) an upper limit for cost (e.g. Precision@N, DCG@N). However, in many practical search sessions (e.g. exploratory search), the stopping criterion is more complex than these simplified cases. Analyzing the benefit and cost of actual users' search sessions, we find that the stopping criteria vary with search tasks and are usually the combined effect of both benefit and cost factors. Inspired by a popular computer game named Bejeweled, we propose a Bejeweled Player Model (BPM) to simulate users' search interaction processes and evaluate their search performance. In the BPM, a user stops when he/she either has found sufficient useful information or has no more patience to continue. Given this assumption, a new evaluation framework based on upper limits (either fixed or changeable as search proceeds) for both benefit and cost is proposed. We show how to derive a new metric from the framework and demonstrate that it can be adopted to revise traditional metrics like Discounted Cumulative Gain (DCG), Expected Reciprocal Rank (ERR) and Average Precision (AP). To show the effectiveness of the proposed framework, we compare it with a number of existing metrics in terms of correlation between user satisfaction and the metrics, based on a dataset that collects users' explicit satisfaction feedback and assessors' relevance judgements. Experimental results show that the framework is better correlated with user satisfaction feedback.
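The stopping assumption described above (stop once enough benefit has been collected or once patience runs out) can be illustrated with a minimal sketch. This is not the paper's metric definition: the function name `stopping_rank`, the per-result `gains` and `costs` lists, and the fixed limits `benefit_limit` and `cost_limit` are all hypothetical names chosen for illustration, and the sketch covers only the fixed-limit case, not limits that change as the search proceeds.

```python
from typing import Sequence, Tuple


def stopping_rank(
    gains: Sequence[float],
    costs: Sequence[float],
    benefit_limit: float,
    cost_limit: float,
) -> Tuple[int, float, float]:
    """Simulate a user scanning a ranked list from the top.

    Illustrative assumption: the user stops at the first rank where the
    accumulated benefit reaches `benefit_limit` OR the accumulated cost
    reaches `cost_limit`. Returns (1-based stopping rank, accumulated
    benefit, accumulated cost). If neither limit is reached, the user
    is taken to stop at the end of the list.
    """
    total_gain = 0.0
    total_cost = 0.0
    rank = 0
    for rank, (gain, cost) in enumerate(zip(gains, costs), start=1):
        total_gain += gain
        total_cost += cost
        if total_gain >= benefit_limit or total_cost >= cost_limit:
            break
    return rank, total_gain, total_cost


# Hypothetical relevance gains for five results, unit cost per result.
example_gains = [0, 1, 0, 2, 1]
example_costs = [1, 1, 1, 1, 1]

# Stops because of benefit (enough useful information found).
by_benefit = stopping_rank(example_gains, example_costs, benefit_limit=3, cost_limit=10)

# Stops because of cost (no more patience).
by_cost = stopping_rank(example_gains, example_costs, benefit_limit=100, cost_limit=2)
```

In this sketch, setting `benefit_limit` to a very large value with unit costs and `cost_limit = N` makes the simulated user always read exactly the top N results, which mirrors the cost-only cutoff used by metrics such as Precision@N; setting `cost_limit` very large instead leaves only the benefit-based cutoff.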