Efficient multi-keyword search over p2p web

Hanhua Chen, Hai Jin, Jiliang Wang, Lei Chen, Yunhao Liu, Lionel M. Ni
2008 Proceeding of the 17th international conference on World Wide Web - WWW '08  
Current search mechanisms of DHT-based P2P systems handle single-keyword search well. Beyond single-keyword search, multi-keyword search is popular and useful in many real applications. Simply reusing the single-keyword solution requires distributed intersection/union operations across wide area networks, leading to unacceptable traffic cost. Because the Bloom Filter (BF) is well known to be effective in reducing traffic, we would like to use BF encoding to handle multi-keyword search. Applying BF is not difficult, but obtaining optimal results is not trivial. In this study we show, through mathematical proof, that the optimal setting of BF in terms of traffic cost is determined by the global statistical information of keywords, not by the minimized false positive rate as claimed by previous methods. Through extensive experiments, we demonstrate how to obtain optimal settings. We further argue that the intersection order between sets is important for multi-keyword search. Thus, we design optimal order strategies based on BF for both "and" and "or" queries. To better evaluate the performance of this design, we conduct extensive simulations on the TREC WT10G test collection and the query log of a commercial search engine. Results show that our design significantly reduces the search traffic of the existing approach by 73%.

We conduct comprehensive trace-driven simulations on the TREC WT10G [12] test collection and the query log of a commercial search engine to evaluate the performance of this design.

Results

We first show how we achieve the optimal settings of BF by analyzing, with Matlab, the target function defined in Section 3.2 for minimizing communication cost. Based on the analysis results, we then use comprehensive simulations to compare our optimal BF design with the straightforward BF algorithm [25], which reduces the communication cost by minimizing the false positive rate of a BF.

Optimal setting of bloom filter

In this section, we show how to achieve the minimized communication cost defined in Section 3.2 by using optimal settings of BFs. We analyze the communication cost quantified by Eq. (4) with Matlab. We consider three typical situations: |X|<|Y|, |X|=|Y|, and |X|>|Y|. We set r to 250 bits, based on measurements on the Google search engine showing that the average URL length is 31.2 characters [13], which is roughly 250 bits at 8 bits per character. We adjust the parameters m and k and examine how the value of f(m, k) changes.
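The sweep over m and k described above can be sketched in a few lines. Eq. (4) is not reproduced in this excerpt, so the cost function below is an assumed stand-in rather than the paper's formula: it charges m bits for sending the filter of X, plus r bits for every element of Y that passes the filter, using the standard false positive approximation (1 - e^(-k|X|/m))^k. The names `comm_cost` and `best_m`, the `overlap` parameter, and the search bound are hypothetical choices made for illustration.

```python
import math

def false_positive_rate(m, k, n):
    """Standard approximation of a Bloom filter's false positive rate
    for m bits, k hash functions, and n inserted elements."""
    return (1.0 - math.exp(-k * n / m)) ** k

def comm_cost(m, k, size_x, size_y, overlap, r=250):
    """Assumed cost in bits (a stand-in, not the paper's Eq. (4)):
    send BF(X) of m bits, then return every element of Y that passes it.
    `overlap` is the assumed size of the true intersection of X and Y."""
    fp = false_positive_rate(m, k, size_x)
    candidates = overlap + fp * (size_y - overlap)
    return m + r * candidates

def best_m(k, size_x, size_y, overlap, r=250, step=64):
    """Brute-force scan over m for the cheapest filter size."""
    upper = 32 * size_x  # generous search bound, chosen for illustration
    return min(range(step, upper + 1, step),
               key=lambda m: comm_cost(m, k, size_x, size_y, overlap, r))

if __name__ == "__main__":
    k = 4
    for size_x, size_y in [(1000, 10000), (10000, 1000)]:
        m = best_m(k, size_x, size_y, overlap=100)
        cost = comm_cost(m, k, size_x, size_y, overlap=100)
        print(f"|X|={size_x} |Y|={size_y} best m={m} cost={cost:.0f} bits")
```

Under this assumed model, building the filter from the smaller set comes out cheaper than building it from the larger one, which is in line with the observation about intersection order that follows.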
We find that the intersection order is critical for minimizing the communication cost: the cost can be minimized when |X| is not greater than |Y|. The value of f(m, k) is significantly influenced by the variable m, and its minimum is achieved when m is set to an optimal value. With m fixed, the minimal communication cost changes only slightly as k is adjusted. The results demonstrate that the optimal BF is determined by the popularities of the keywords and by the intersection order. Much benefit can be gained by transmitting the BF for the set of a less popular keyword to the DHT node responsible for a more popular keyword during distributed intersection. Based on these observations, given |X|, |Y| and k, the objective of our optimal BF-based intersection algorithm is to enable each node to intelligently choose the optimal m and the intersection order that achieve the minimal communication cost. In this design we first sort the keywords for an intersection operation in increasing order of popularity, so that |X|<|Y|. By varying the values of |X| and |Y| we obtain a set of sample values for the optimal m.
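The ordering idea can be illustrated with a minimal, single-process sketch of a BF-based "and" query. It is not the authors' implementation: the `BloomFilter` class, the `bf_intersect` function, the SHA-256 based hashing, and the fixed `bits_per_item` sizing (used here in place of the optimal m) are all assumptions for illustration, and the network transfer between DHT nodes is only simulated by passing the filter to a loop over the larger set.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter with m bits and k hash functions (illustrative)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def bf_intersect(postings, bits_per_item=16, k=4):
    """postings: dict mapping keyword -> set of document URLs (non-empty dict).
    Returns the exact intersection, processing the least popular keyword first."""
    ordered = sorted(postings.values(), key=len)  # increasing popularity
    result = set(ordered[0])
    for larger in ordered[1:]:
        # The node holding the smaller set encodes it and "sends" the filter.
        bf = BloomFilter(max(8, bits_per_item * len(result)), k)
        for url in result:
            bf.add(url)
        # The node holding the larger set returns every element that passes;
        # false positives may be included at this point.
        candidates = {url for url in larger if url in bf}
        # The sender checks candidates against its own set to drop them.
        result &= candidates
    return result

if __name__ == "__main__":
    postings = {
        "p2p": {"a.org", "b.org", "c.org"},
        "web": {"b.org", "c.org", "d.org", "e.org"},
        "search": {"b.org", "c.org", "x.org", "y.org", "z.org"},
    }
    print(sorted(bf_intersect(postings)))
```

Because a Bloom filter never produces false negatives, the final check against the sender's own set yields the exact intersection; the filter size only affects how many spurious candidates are sent back.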
doi:10.1145/1367497.1367631 dblp:conf/www/ChenJWCLN08 fatcat:v7vx7lx4yzdjlcjzmi55et3epi