Collecting high quality overlapping labels at low cost

Hui Yang, Anton Mityagin, Krysta M. Svore, Sergey Markov
Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10), 2010
This paper studies the quality of human labels used to train search engines' rankers. Our specific focus is the performance improvement obtained by using overlapping relevance labels, that is, collecting multiple human judgments for each training sample. The paper explores whether, when, and for which samples one should obtain overlapping training labels, as well as how many labels per sample are needed. The proposed scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge. Our experiments show that this labeling scheme improves the NDCG of two Web search rankers on several real-world test sets, with a low labeling overhead of around 1.4 labels per sample. This labeling scheme also outperforms several other ways of using overlapping labels, such as simple k-overlap, majority vote, the highest label, etc. Finally, the paper presents a study of how many overlapping labels are needed to obtain the best improvement in retrieval accuracy.
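To make the selective scheme concrete, here is a minimal sketch of the "collect more labels only for apparently relevant samples" policy described above. The label grades, function names, and judge interface are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed names and label scale, not the paper's code):
# ask for extra judgments only when the first judge rates the sample
# "Good" or above.
from typing import Callable, List

RELEVANT = {"Good", "Excellent", "Perfect"}  # assumed "Good or above" grades


def selective_labels(sample, ask_judge: Callable[[object], str], k: int = 3) -> List[str]:
    """Collect 1 label, or k labels if the first one is 'Good' or above."""
    labels = [ask_judge(sample)]
    if labels[0] in RELEVANT:  # only seemingly relevant samples get more labels
        labels.extend(ask_judge(sample) for _ in range(k - 1))
    return labels
```

Because most query-url pairs receive a non-relevant first label, the expected number of labels per sample stays close to 1, consistent with the roughly 1.4 labels per sample reported above.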
The paper compares the following experimental settings for using overlapping labels:

11-overlap: The k-overlap method with k=11. This setting uses all 11 labels in the Clean label set as training data. It is not applicable to Clean+.

Mv3: The majority vote method over 3 overlapping labels, drawn randomly from the Clean label set. Not applicable to Clean+.

Mv11: The majority vote method over 11 overlapping labels, all from the Clean label set. Not applicable to Clean+.

If-good-3: The if-good-k labeling scheme with k=3. The 3 overlapping labels for Clean are drawn randomly from the 11 labels for each query-url pair. Applicable to both Clean and Clean+.

If-good-x3: Combines selective labeling with label weighting. If a label is "Good or above", it is assigned a weight θ times the weight of other labels; here θ=3. Applicable to both Clean and Clean+.

Highest-3: Uses the most relevant label of each query-url pair for training, i.e., the highest label among k=3 overlapping labels. The 3 overlapping labels for Clean are drawn randomly from the 11 labels per sample. Not applicable to Clean+.

Good-till-bad: The Good-till-bad labeling scheme, with an upper limit of k overlapping labels (k=11 for the Clean label set). Not applicable to Clean+.

Note that for the Clean label set, which contains 11 labels for each query-url pair, there is a constraint of k≤11. We performed a random sampling from these 11 labels if k<11; due to the randomness of drawing k labels when k<11, the reported results are averaged over multiple runs.
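The sketch below gives one plausible reading of the collection and aggregation rules named in these settings (k-overlap, majority vote, highest label, if-good-xk weighting, and Good-till-bad). The 5-grade scale and all identifiers are assumptions introduced for illustration; how the resulting labels and weights feed into ranker training is defined by the paper, not by this code.

```python
# Illustrative sketches of the labeling settings listed above.
# The ordinal grade scale and all names are assumptions, not the paper's code.
from collections import Counter
from typing import Callable, List, Tuple

GRADES = ["Bad", "Fair", "Good", "Excellent", "Perfect"]  # assumed scale
RANK = {g: i for i, g in enumerate(GRADES)}


def k_overlap(labels: List[str]) -> List[Tuple[str, float]]:
    """k-overlap: use all k labels as training data, each with unit weight."""
    return [(label, 1.0) for label in labels]


def majority_vote(labels: List[str]) -> str:
    """Mv-k: keep the most frequent grade among the k overlapping labels."""
    return Counter(labels).most_common(1)[0][0]


def highest(labels: List[str]) -> str:
    """Highest-k: keep the most relevant of the k overlapping labels."""
    return max(labels, key=lambda label: RANK[label])


def if_good_weight(label: str, theta: float = 3.0) -> Tuple[str, float]:
    """If-good-xk: weight a 'Good or above' label theta times other labels."""
    return label, (theta if RANK[label] >= RANK["Good"] else 1.0)


def good_till_bad(sample, ask_judge: Callable[[object], str], k: int = 11) -> List[str]:
    """Good-till-bad: keep collecting labels until one falls below 'Good'
    or the cap of k labels is reached."""
    labels: List[str] = []
    while len(labels) < k:
        label = ask_judge(sample)
        labels.append(label)
        if RANK[label] < RANK["Good"]:
            break
    return labels
```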
doi:10.1145/1835449.1835526 dblp:conf/sigir/YangMSM10 fatcat:snhupbfacffg5caoh5xbal6xgm