Scaling Out All Pairs Similarity Search with MapReduce

Gianmarco De Francisci Morales, Claudio Lucchese, Ranieri Baraglia
2010 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval  
Given a collection of objects, the All Pairs Similarity Search problem involves discovering all pairs of objects whose similarity is above a certain threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-the-art pruning techniques. This is the first work that explores the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing the communication costs and the load imbalance. The resulting algorithm gives exact results up to 5x faster than the current best-known solution that employs MapReduce.
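The inverted index approach mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' algorithm: it omits all pruning and only mimics the map and reduce phases in plain Python. The function names (`normalize`, `build_index`, `all_pairs`), the choice of cosine similarity, and the dictionary-based document representation are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(vec):
    """Scale a sparse term->weight vector to unit length (assumed cosine similarity)."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """'Map' step: emit (term, (doc_id, weight)) and group by term into posting lists."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def all_pairs(docs, threshold):
    """Return {(doc_a, doc_b): similarity} for pairs scoring at least `threshold`."""
    docs = {d: normalize(v) for d, v in docs.items()}
    index = build_index(docs)
    scores = defaultdict(float)
    # Each posting list contributes one partial product per pair of documents
    # sharing that term; the 'reduce' step sums these contributions per pair.
    for postings in index.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            scores[(a, b)] += wa * wb
    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    toy = {
        "d1": {"map": 1.0, "reduce": 1.0},
        "d2": {"map": 1.0, "reduce": 1.0},
        "d3": {"index": 1.0},
    }
    print(all_pairs(toy, threshold=0.9))
```

Only documents that share at least one term ever meet in a posting list, which is why sparse collections keep the number of candidate pairs small. Without pruning, a long posting list still generates a quadratic number of partial products, which is the cost that pruning is meant to reduce.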
dblp:conf/sigir/MoralesLB10 fatcat:juqfdmkv4belzbijwlk2uq4mxm