Selective Search

Anagha Kulkarni, Jamie Callan
2015 ACM Transactions on Information Systems  
The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and
searches only a few shards that are estimated to contain relevant documents for the query. We propose shard creation techniques that are scalable, efficient, self-reliant, and create topic-based shards with low variance in size and a high density of relevant documents. The experimental results demonstrate that selective search's effectiveness is on par with that of exhaustive search, while its search costs are substantially lower. The majority of queries also perform as well as or better with selective search. An oracle experiment that uses an optimal shard ranking for each query indicates that selective search can outperform exhaustive search's effectiveness. Comparison with a query optimization technique shows higher improvements in efficiency with selective search, and the overall best efficiency is achieved when the two techniques are combined in an optimized selective search approach.

With the availability of large and information-rich collections, an increasing number of organizations and enterprises need search solutions that can process large volumes of data. The search approaches adopted by commercial search engines, however, cannot be prescribed as-is to these new search applications, where the operational requirements and constraints are often drastically different. For instance, commercial search engines can assume the availability of large computing clusters when designing a search solution, but most small-scale organizations and research groups cannot. Search engines typically divide large datasets into smaller subsets (shards) that are distributed across multiple computing nodes and searched in parallel to provide rapid interactive search. We refer to this as the exhaustive search approach. The strict requirements on query response time and system throughput enforced by commercial search engines are also not shared by many other search applications; often batch query processing or more relaxed query response times are acceptable.
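The exhaustive search approach described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names are hypothetical, a toy term-count scorer stands in for a real retrieval model such as BM25, and the fan-out loop would run in parallel across index nodes in a production system.

```python
def search_shard(shard, query_terms):
    """Score every document in one shard with a toy term-count score."""
    scores = {}
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return scores

def exhaustive_search(shards, query, k=10):
    """Fan the query out to every shard and merge the partial result lists."""
    query_terms = query.lower().split()
    merged = {}
    for shard in shards:  # in a real deployment, each shard is searched in parallel
        merged.update(search_shard(shard, query_terms))
    # merge step: one global ranking over all shards' results
    return sorted(merged.items(), key=lambda kv: -kv[1])[:k]
```

The key cost property is visible in the loop: every shard is touched for every query, regardless of whether it holds any relevant documents.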
Such observations motivate the work presented in this article, whose goal is to study the problem of searching large textual collections in low-resource environments. Our previous work introduced selective search, an approach that can process large datasets using few computational resources [Kulkarni and Callan 2010a; 2010b; Kulkarni 2013]. This search technique first partitions the corpus, based on document similarity, into topic-based shards. During query evaluation, only the few shards that are estimated to contain relevant documents for the query are searched. Lastly, the results from the searched shards are merged to compile the final result list. Since the query is processed against only a few shards, rather than all shards as in exhaustive search, the computational requirements of selective search are much lower.

The first step, dividing the collection into topic-based shards, is motivated by the cluster hypothesis, according to which closely associated documents tend to be relevant to the same requests. This hypothesis suggests that if similar documents are grouped together, then the relevant documents for a query are also concentrated in a few shards. Such an organization of the dataset creates shards that are semantically homogeneous, each of which can be seen as representing a distinct topic; thus the name topic-based. We develop document allocation policies that divide large datasets into topic-based shards efficiently and scale well with collection size. The proposed allocation policies are also widely applicable because they do not require knowledge of the query traffic and do not use external knowledge sources. Furthermore, one of the approaches is designed to create shards of nearly uniform size. This is important because such shards support better load balancing and provide low variance in query run times. Most existing collection partitioning techniques lack some or all of these properties.
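The similarity-based partitioning step can be illustrated with a small k-means sketch over term-frequency vectors. This is a toy version under strong simplifying assumptions (tiny vocabulary, in-memory vectors, no control over shard sizes); the paper's allocation policies are specifically designed to scale to very large collections and to bound shard-size variance, which this sketch does not attempt.

```python
import random

def tf_vector(text, vocab):
    """Term-frequency vector of a document over a fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_shards(docs, k, iters=10, seed=0):
    """Group documents into k topic-based shards by similarity (k-means sketch)."""
    vocab = sorted({w for d in docs.values() for w in d.lower().split()})
    vecs = {doc_id: tf_vector(text, vocab) for doc_id, text in docs.items()}
    rng = random.Random(seed)
    centroids = [vecs[d] for d in rng.sample(list(vecs), k)]
    for _ in range(iters):
        shards = [[] for _ in range(k)]
        for doc_id, v in vecs.items():
            # assign each document to its most similar topic centroid
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            shards[best].append(doc_id)
        for c in range(k):
            if shards[c]:  # recompute centroid as the mean vector of the shard
                centroids[c] = [sum(vecs[d][i] for d in shards[c]) / len(shards[c])
                                for i in range(len(vocab))]
    return shards
```

Documents about the same topic share vocabulary, so they gravitate toward the same centroid, yielding the semantically homogeneous shards that the cluster hypothesis relies on.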
When a dataset is partitioned into topical shards, selective search proposes that the relevant shards for a query can be identified using resource selection or resource ranking algorithms, which have been studied extensively in the field of distributed information retrieval (DIR), also referred to as federated search [Callan 2000; Shokouhi and Si 2011]. We employ a widely used resource ranking algorithm, ReDDE [Si and Callan 2003], to operationalize the step of estimating the relevant shards for the query. Selective search evaluates the query against the top T shards in this ranking, and the returned results are merged to compile the final ranked list of documents.

We undertake a thorough empirical analysis of the selective search approach using some of the largest available document collections. The experimental results show that, compared to the other shard creation techniques used in prior work, the topic-based allocation policy consistently supports the best selective search effectiveness. The topic-based shards exhibit the highest concentration of relevant documents for a query, which validates the cluster hypothesis. The shards created with the size-
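The shard ranking step can be sketched as follows. In the ReDDE family of algorithms, a central index holds a small random sample of documents from each shard; the query is run against that sample, and each top-ranked sampled document votes for its source shard with a weight proportional to how many shard documents it represents (shard size divided by sample size). The sketch below uses a toy term-count scorer and hypothetical names, not the actual ReDDE implementation from the paper.

```python
def redde_rank(sample_index, query, shard_sizes, sample_sizes, n=100):
    """Rank shards for a query, ReDDE style, using a central sample index.

    sample_index maps a sampled doc id to (shard_id, text).
    """
    query_terms = query.lower().split()
    scored = []
    for doc_id, (shard_id, text) in sample_index.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, shard_id))
    scored.sort(reverse=True)          # retrieval ranking over the sampled docs
    votes = {}
    for _, shard_id in scored[:n]:     # the top n sampled docs vote for shards
        # each sampled doc stands for |shard| / |sample| documents of its shard
        weight = shard_sizes[shard_id] / sample_sizes[shard_id]
        votes[shard_id] = votes.get(shard_id, 0.0) + weight
    return sorted(votes, key=votes.get, reverse=True)
```

Selective search would then evaluate the query against only the top T shards of this ranking and merge their results into the final document list.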
doi:10.1145/2738035