Pruning policies for two-tiered inverted index with correctness guarantee

Alexandros Ntoulas, Junghoo Cho
2007 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '07  
The Web search engines maintain large-scale inverted indexes which are queried thousands of times per second by users eager for information. In order to cope with the vast amounts of query loads, search engines prune their index to keep documents that are likely to be returned as top results, and use this pruned index to compute the first batches of results. While this approach can improve performance by reducing the size of the index, if we compute the top results only from the pruned index we
more » ... the pruned index we may notice a significant degradation in the result quality: if a document should be in the top results but was not included in the pruned index, it will be placed behind the results computed from the pruned index. Given the fierce competition in the online search market, this phenomenon is clearly undesirable. In this paper, we study how we can avoid any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit. Our contribution is a number of modifications in the pruning techniques for creating the pruned index and a new result computation algorithm that guarantees that the top-matching pages are always placed at the top search results, even though we are computing the first batch from the pruned index most of the time. We also show how to determine the optimal size of a pruned index and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
doi:10.1145/1277741.1277776 dblp:conf/sigir/NtoulasC07 fatcat:jijpelxkfvem5ep4ilajszzsji