A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2022; you can also visit the original URL.
IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set
dblp:conf/ntcir/Sakai17 fatcat:gul3wm7conheppwz5sfcy2zcba
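As a rough illustration of the idea of a worst-case CI width, the sketch below takes per-topic scores for several systems, builds a CI for the mean score difference of every system pair, and reports the widest one. The abstract does not say how the CI is constructed, so the paired-difference interval with a normal approximation used here is an assumption, as are the function names `ci_width` and `worst_case_width` and the toy scores. It is not the paper's procedure.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, stdev


def ci_width(scores_a, scores_b, confidence=0.95):
    """Width of a CI for the mean per-topic difference between two systems.

    Assumption: a normal approximation on paired differences; the paper
    may construct its CIs differently.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Full width is twice the margin of error.
    return 2 * z * stdev(diffs) / sqrt(len(diffs))


def worst_case_width(scores, confidence=0.95):
    """Largest CI width over all pairs of systems (hypothetical helper)."""
    return max(
        ci_width(scores[x], scores[y], confidence)
        for x, y in combinations(scores, 2)
    )


# Toy per-topic scores (made up for illustration), one list per system.
scores = {
    "sysA": [0.40, 0.55, 0.30, 0.62],
    "sysB": [0.35, 0.50, 0.45, 0.60],
    "sysC": [0.10, 0.80, 0.20, 0.70],
}

print(worst_case_width(scores))
```

Under these assumptions, a narrower worst-case width means that every pairwise difference is estimated at least that precisely on the given topics.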