Diversified caching for replicated web search engines

Chuanfei Xu, Bo Tang, Man Lung Yiu
2015 IEEE 31st International Conference on Data Engineering (ICDE)
Commercial web search engines adopt a parallel and replicated architecture in order to support high query throughput. In this paper, we investigate the effect of caching on throughput in such a setting. A simple scheme, called uniform caching, replicates the cache content to all servers. Unfortunately, it does not exploit the variations among queries, and thus wastes memory by storing the same cache content redundantly on multiple servers. To tackle this limitation, we propose the diversified caching problem, which aims to diversify the types of queries served by different servers and to maximize the sharing of terms among queries assigned to the same server. We show that finding the optimal diversified caching scheme is NP-hard, and identify intuitive properties that guide the search for good solutions. We then present a framework with a suite of techniques and heuristics for diversified caching. Finally, we evaluate the proposed solution against competitors on a real dataset and a real query log.

The query throughput is |Q| / T_total(Q), where the total processing time T_total(Q) is defined in terms of Q_i ⊂ Q, the subset of queries assigned to server S_i, and T_Ci(Q_i), the processing time of Q_i using the cache content C_i (on S_i). As shown in Figure 1(b), the majority of time is spent on processing at the servers rather than at the broker, so we ignore the broker time in T_total(Q). Our objective is to minimize T_total(Q) in order to maximize the query throughput. This leads to two subproblems: (i) deciding the cache content C_i of each server, and (ii) deciding the subset Q_i of the workload assigned to each server. To the best of our knowledge, existing caching techniques [4], [18], [25] have not considered the above architecture or exploited its optimization opportunities.
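To make the two subproblems concrete, the following sketch greedily assigns each query (a set of terms) to the server whose already-assigned queries share the most terms with it, and computes T_total(Q) as the maximum per-server processing time, since servers work in parallel. This is an illustrative heuristic under assumed definitions, not the algorithm or cost model from the paper; the tie-breaking rule and the toy cost function are my own assumptions.

```python
from collections import Counter

def assign_queries(queries, num_servers):
    """Greedy illustration (not the paper's algorithm): send each query,
    a set of terms, to the server with the largest term overlap against
    queries already assigned there, breaking ties by lighter load."""
    assignments = [[] for _ in range(num_servers)]
    term_counts = [Counter() for _ in range(num_servers)]  # terms seen per server

    for q in queries:
        def score(i):
            overlap = sum(1 for t in q if term_counts[i][t] > 0)
            return (overlap, -len(assignments[i]))  # prefer overlap, then light load
        best = max(range(num_servers), key=score)
        assignments[best].append(q)
        term_counts[best].update(q)
    return assignments

def total_time(assignments, cost):
    """T_total(Q) as a makespan: servers run in parallel, so the total
    time is the maximum per-server cost (cost maps a query list to time)."""
    return max(cost(qs) for qs in assignments)

# Toy cost model (an assumption): processing a server's queries costs the
# number of distinct terms, so shared terms are effectively cached once.
distinct_term_cost = lambda qs: len(set().union(*qs)) if qs else 0
```

With four two-term queries and two servers, the heuristic groups the two queries sharing term "a" on one server and the two sharing term "d" on the other, so each server processes only three distinct terms instead of four.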
doi:10.1109/icde.2015.7113285 dblp:conf/icde/XuTY15 fatcat:4sq3jh6pdjf4jlv62jpzqj4bvy