Incremental all pairs similarity search for varying similarity thresholds

Amit Awekar, Nagiza F. Samatova, Paul Breimyer
2009 Proceedings of the 3rd Workshop on Social Network Mining and Analysis - SNA-KDD '09  
All Pairs Similarity Search (APSS) is a ubiquitous problem in many data mining applications and involves finding all pairs of records with similarity scores above a specified threshold. In this paper, we introduce the problem of Incremental All Pairs Similarity Search (IAPSS), where APSS is performed multiple times over the same dataset by varying the similarity threshold. To the best of our knowledge, this is the first work that addresses the IAPSS problem. All existing solutions for APSS perform redundant computations by invoking APSS independently for each threshold value. In contrast, our solution to the IAPSS problem avoids redundant computations by storing the history of previous APSS invocations and using index splitting. While offering obvious benefits, the computation- and I/O-intensive nature of the IAPSS solution raises two key research challenges: (1) to develop efficient I/O techniques to manage computation history and (2) to efficiently identify and prune redundant computations. We address these challenges through the proposed (a) history binning technique that clusters record pairs based on similarity values and performs I/O during the similarity computation, and (b) splitting of the inverted index that maps each dimension to a list of records that have a non-zero projection along that dimension. We evaluate the effectiveness of our techniques by demonstrating speed-ups on the order of 2X to over 10^5X over the state-of-the-art APSS algorithm for four real-world large-scale datasets.
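The two ideas named in the abstract, an inverted index from dimensions to records and a history of pair similarities grouped into bins, can be illustrated with a small in-memory sketch. This is not the authors' algorithm: it omits index splitting, pruning, and disk I/O entirely. The class name `BinnedHistoryAPSS`, the parameter `num_bins`, the dot-product similarity, and the assumption that scores lie in [0, 1] are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Map each dimension to the (record_id, weight) entries with non-zero weight."""
    index = defaultdict(list)
    for rid, vec in enumerate(records):
        for dim, weight in vec.items():
            if weight != 0:
                index[dim].append((rid, weight))
    return index


class BinnedHistoryAPSS:
    """Toy illustration only: compute pair scores once, then answer many thresholds.

    Assumes similarity is a dot product with scores in [0, 1].
    """

    def __init__(self, records, num_bins=10):
        self.index = build_inverted_index(records)
        self.num_bins = num_bins
        self.bins = None          # history: bin number -> {pair: score}
        self.computations = 0     # how many times the full scoring pass ran

    def _bin_of(self, score):
        return min(int(score * self.num_bins), self.num_bins - 1)

    def _compute_history(self):
        # Accumulate dot products for pairs that share at least one dimension.
        scores = defaultdict(float)
        for postings in self.index.values():
            for (a, wa), (b, wb) in combinations(postings, 2):
                scores[(a, b)] += wa * wb
        # Group pairs by similarity value so a query touches only relevant bins.
        self.bins = defaultdict(dict)
        for pair, score in scores.items():
            self.bins[self._bin_of(score)][pair] = score
        self.computations += 1

    def query(self, threshold):
        if self.bins is None:
            self._compute_history()
        boundary = self._bin_of(threshold)
        # Bins strictly above the boundary bin qualify without inspection.
        result = set()
        for b in range(boundary + 1, self.num_bins):
            result.update(self.bins.get(b, {}))
        # The boundary bin may mix qualifying and non-qualifying pairs.
        for pair, score in self.bins.get(boundary, {}).items():
            if score >= threshold:
                result.add(pair)
        return result


records = [
    {"a": 1.0},
    {"a": 0.5, "b": 0.5},
    {"b": 1.0},
    {"a": 1.0},
]
searcher = BinnedHistoryAPSS(records)
high = searcher.query(0.9)   # first call builds the history
low = searcher.query(0.5)    # later calls reuse it for a different threshold
```

In this sketch, repeated calls to `query` with different thresholds reuse the stored bins instead of rescoring every pair, which is the kind of redundancy the abstract says independent APSS invocations incur.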
doi:10.1145/1731011.1731012 dblp:conf/kdd/AwekarSB09 fatcat:dajqomvt25hx5iarcib3gpwyym