Continuous Result Delta Evaluation of IR Systems

Gabriela González-Sáez
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022
Classical evaluation of information retrieval systems relies on a static test collection. In the case of Web search, the evaluation environment (EE) changes continuously, and the assumption of a static test collection does not represent this changing reality. Moreover, changes in the evaluation environment, such as the document set, the topic set, the relevance judgments, and the chosen metrics, have an impact on the performance measurement [1, 4]. To the best of our knowledge, there is no way to evaluate two versions of a search engine under evolving EEs. We aim to propose a continuous framework for evaluating different versions of a search engine in different evaluation environments. The classical paradigm relies on a controlled test collection (i.e., a set of topics, a corpus of documents, and relevance assessments) as a stable and meaningful EE that guarantees the reproducibility of system results. We propose to take multiple EEs into account for the evaluation of systems, in a dynamic test collection (DTC). A DTC is a list of test collections based on a controlled evolution of a static test collection. The DTC allows us to quantify and relate the differences between the test collection elements, called the Knowledge delta (𝐾Δ), and the performance differences between systems evaluated on these varying test collections, called the Result delta (𝑅Δ). Finally, the continuous evaluation is characterized by 𝐾Δs and 𝑅Δs. Relating the changes in both deltas allows the evaluated system performances to be interpreted. The expected contributions of the thesis are: (i) a pivot strategy based on 𝑅Δ to compare systems evaluated in different EEs; (ii) a formalization of the DTC to simulate continuous evaluation and provide significant 𝑅Δ values in evolving contexts; and (iii) a continuous evaluation framework that incorporates 𝐾Δ to explain the 𝑅Δ of evaluated systems. It is not possible to measure the 𝑅Δ of two systems evaluated in different EEs directly, because the performance variations depend on the changes in the EEs [1]. To obtain an estimate of this 𝑅Δ measure, we propose to use a reference system, called the pivot system, which is evaluated within both EEs under consideration. The 𝑅Δ value is then measured using the relative distance between the pivot system and each evaluated system. Our results [2, 3] show that using the pivot strategy we improve the correctness of the ranking of systems (RoS) evaluated in two EEs (i.e., the similarity with the RoS evaluated in the ground truth), compared to the RoS constructed without the pivot strategy.
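To illustrate the pivot strategy described in the abstract, the following Python sketch shows how a pivot-relative 𝑅Δ could be computed between two systems evaluated in different EEs. The function name, its signature, and the use of a plain score difference as the "relative distance" are illustrative assumptions; the abstract does not specify the exact distance measure or implementation.

    # Illustrative sketch (not the thesis implementation) of the pivot strategy:
    # the pivot system is evaluated in both EEs, and each system is compared to
    # the pivot within its own EE, making the two results comparable.

    def result_delta_via_pivot(sys_a_score_ee1: float,
                               pivot_score_ee1: float,
                               sys_b_score_ee2: float,
                               pivot_score_ee2: float) -> float:
        """Estimate the Result delta (RΔ) between system A (evaluated in EE1)
        and system B (evaluated in EE2), using the pivot as a common reference.

        The 'relative distance' to the pivot is taken here as a simple score
        difference, which is an assumption made for illustration only.
        """
        delta_a = sys_a_score_ee1 - pivot_score_ee1   # A relative to the pivot in EE1
        delta_b = sys_b_score_ee2 - pivot_score_ee2   # B relative to the pivot in EE2
        return delta_a - delta_b                      # estimated RΔ between A and B


    # Example with hypothetical nDCG scores: the raw scores would rank B above A,
    # but relative to the pivot, A gains more in its EE than B does in its own.
    r_delta = result_delta_via_pivot(0.42, 0.40, 0.45, 0.47)
    print(r_delta)  # 0.04 > 0: A is estimated to outperform B once the EE shift is factored out

Ranking several systems by such pivot-relative scores would yield the ranking of systems (RoS) referred to above; this is only a sketch of the idea under the stated assumptions.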
doi:10.1145/3477495.3531686