Semantic indexing in structured peer-to-peer networks
Journal of Parallel and Distributed Computing
The past few years have seen tremendous advances in distributed storage infrastructure. Unstructured and structured overlay networks have been successfully used in a variety of applications, ranging from file-sharing to scientific data repositories. While unstructured networks benefit from low maintenance overhead, the associated search costs are high. On the other hand, structured networks have higher maintenance overheads, but facilitate bounded-time search of installed keywords. When dealing with typical data sets, though, it is infeasible to install every possible search term as a keyword into the structured overlay. State-of-the-art semantic indexing techniques have been successfully integrated into peer-to-peer (P2P) systems using semantic overlays. However, existing approaches are based on the premise that the fundamental ingredient of semantic indexing, a semantic basis for the underlying data, is globally available, which is not likely to be the case in practice. Therefore, development of techniques to efficiently compute basis vectors for data distributed across peers is important for large-scale deployment of semantic indexing in P2P systems. In this paper, we present a novel structured overlay that integrates aspects of semantic indexing using non-orthogonal matrix decompositions with the hash structure of the overlay. We adopt PROXIMUS, a recursive decomposition method for computing concise representations for binary data sets, to locally identify latent patterns in data distributed across peers. To enable efficient consolidation of patterns, we rely on distributed hash tables (DHTs), commonly used in various applications in P2P networks. The discrete nature of non-orthogonal matrix decomposition is well suited to the binary key structure of DHTs, resulting in an indexing method, PMINER, that enables the network to deliver efficient and accurate responses to semantic queries. We present the algorithmic underpinnings of PMINER and demonstrate its excellent performance characteristics on real, as well as synthetic data sets.
These large data sets are generally multi-attributed (high dimensional), discrete, and distributed. For typical applications, collecting all of the data to a single central location is infeasible, due to associated overheads, privacy concerns, and resource constraints. Therefore, querying and analyzing these data sets require the development of distributed techniques.
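To make the idea of a discrete, non-orthogonal decomposition of binary data more concrete, the following is a minimal sketch of a rank-one binary approximation, the kind of building block a recursive method such as PROXIMUS applies repeatedly. It is an illustration under stated assumptions, not the authors' implementation: the function name `rank_one_binary`, the densest-row seeding, and the iteration cap are hypothetical choices, since the text above does not specify them. Given a binary matrix A, the sketch alternates between choosing a presence vector x (which rows contain the pattern) and a pattern vector y (which attributes form the pattern) so as to reduce the squared error between A and the outer product of x and y.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def rank_one_binary(A: Matrix, max_iter: int = 50) -> Tuple[List[int], List[int]]:
    """Alternating heuristic for a binary rank-one approximation A ~ x y^T.

    x has one entry per row and marks the rows that contain the pattern.
    y has one entry per column and is the pattern itself.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Assumption (not from the source text): seed the pattern with the densest row.
    y = list(max(A, key=sum))
    x = [0] * n_rows
    for _ in range(max_iter):
        # Fix y, choose x: setting x_i = 1 lowers the squared error
        # exactly when 2 * <row_i, y> >= number of ones in y.
        y_ones = sum(y)
        new_x = [
            1 if y_ones > 0 and 2 * sum(a * b for a, b in zip(row, y)) >= y_ones else 0
            for row in A
        ]
        # Fix x, choose y: symmetric rule over columns, restricted to selected rows.
        x_ones = sum(new_x)
        new_y = [
            1
            if x_ones > 0
            and 2 * sum(A[i][j] for i in range(n_rows) if new_x[i]) >= x_ones
            else 0
            for j in range(n_cols)
        ]
        if new_x == x and new_y == y:
            break  # Fixed point reached: neither vector changed.
        x, y = new_x, new_y
    return x, y
```

In a recursive scheme, the rows selected by x and the remaining rows would each be decomposed again until every group is adequately represented by its pattern vector. Because each pattern y is itself a binary vector, it is plausible to see why such patterns fit naturally with the binary keys of a DHT, although the exact mapping used by PMINER is not described in this section.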
Peer-to-peer (P2P) systems have emerged as attractive solutions for a number of distributed applications, such as file sharing, archival storage, backup, web caching, and information retrieval systems, among others. More recently, researchers have explored the integration of data analysis and mining techniques into P2P systems. Given the large amount of data already residing in P2P networks, decentralized solutions to data mining problems have the potential for new and exciting applications.