Guest Editors' Introduction: Special Section on Peer-to-Peer-Based Data Management
IEEE Transactions on Knowledge and Data Engineering
Peer-to-peer (P2P) computing has attracted much attention from both the academic community and industry. This interest is fueled by the successful deployment and adoption of many domain-specific P2P systems. For example, Freenet and Gnutella enable users to share any digital files (e.g., music files, video, images), Napster allows sharing of MP3 music files, ICQ facilitates exchanges of personal messages, SETI@home makes computing cycles of participants available, and LOCKSS pools storage
... to archive document collections.

In P2P systems, autonomous peers (computers) are treated as equals, i.e., they perform the same functions. They can join and leave the system at any time. These peers pool together their resources (data, storage, computing cycles) to enable new capabilities greater than the sum of the parts. Data can be exchanged between peers directly and underutilized resources can be tapped. The potential of such a highly distributed and decentralized system is tremendous.

Interestingly, existing P2P systems lack the data management capabilities that are typically found in a DBMS. Although research in distributed (and heterogeneous) databases has been pursued for many years, the database community has not been as aggressive in enhancing P2P systems with data management capabilities. We would add that the current P2P paradigm offers challenges beyond what has been previously done in the distributed database context. To list a few: the system may scale to thousands or tens of thousands of peers, which existing techniques cannot adequately handle; the dynamism of the system raises issues of information quality (e.g., completeness, consistency) that have not been previously considered; and the trustworthiness of the participating peers poses security threats not seen before.

This special section aims to bring together current research activities that address some of these problems. The section contains six papers covering topics on data integration, search, consistency, trust, and identity. We hope this section will whet the appetite of our community to pursue this exciting field further.

In a peer-based data management system, it is practically impossible to construct a global schema that mediates semantic differences of shared data across a large number of autonomous peers. The first paper, "The Piazza Peer Data Management System" by Alon Y. Halevy, Zachary G.
Ives, Jayant Madhavan, Peter Mork, Dan Suciu, and Igor Tatarinov, proposes a solution to facilitate ad hoc, decentralized sharing and administration of data, and the definition of semantic relationships. Every peer can contribute new data, relate the data to existing concepts and schemas, and define new schemas for other peers to use as a frame of reference for their queries. The paper also discusses query answering and optimization algorithms.

Replication and caching are very effective mechanisms that can bring the data/results closer to the users to improve performance. However, these mechanisms also introduce new challenges: data that are replicated or cached have to be coherent with the source; updates to the data must be carefully disseminated from sources to their cached/replicated copies in other peers to minimize communication and computation overhead; and, in a P2P environment, the network should be resilient to failures so that data coherency is not completely lost even in the midst of failures. The second paper, "Resilient and Coherency ...

... presents a three-tier framework to support semantic-based retrieval of documents. The framework summarizes the information content at different granularities: the individual document level; the peer level, where all documents within a peer are summarized; and the superpeer level, where all summaries of peers managed by a superpeer are further combined and summarized. With the framework, queries can be routed quickly to peers with similar content. Corresponding to each tier of the framework is an index structure to facilitate speedy retrieval.

A critical issue in P2P systems is trust management: without a good solution to this problem, P2P systems are not likely to be deployed for serious applications. Essentially, peers need to manage the risk of communicating or cooperating with each other without prior experience and knowledge about each other.
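
As an illustration of the three-tier summarization framework described earlier (document, peer, and superpeer summaries, with queries routed toward the most similar content), the following is a minimal sketch. It is not the method of the paper in this section: the bag-of-words summaries, the cosine similarity measure, and all names (`summarize`, `combine`, `build_index`, `route`) are assumptions chosen only to make the tiering concrete.

```python
from collections import Counter
from math import sqrt


def summarize(text):
    """Document-level summary: a bag-of-words term count (an assumption)."""
    return Counter(text.lower().split())


def combine(summaries):
    """Merge lower-tier summaries into one higher-tier summary."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


def cosine(a, b):
    """Cosine similarity between two term-count summaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(network):
    """network: {superpeer: {peer: {doc_name: text}}} (hypothetical layout).

    Returns the three tiers of summaries: documents, peers, superpeers.
    """
    docs = {sp: {p: {d: summarize(t) for d, t in files.items()}
                 for p, files in peers.items()}
            for sp, peers in network.items()}
    peers = {sp: {p: combine(ds.values()) for p, ds in ps.items()}
             for sp, ps in docs.items()}
    supers = {sp: combine(ps.values()) for sp, ps in peers.items()}
    return docs, peers, supers


def route(query, index):
    """Descend tier by tier, picking the best-matching summary at each level."""
    docs, peers, supers = index
    q = summarize(query)
    sp = max(supers, key=lambda s: cosine(q, supers[s]))
    p = max(peers[sp], key=lambda x: cosine(q, peers[sp][x]))
    d = max(docs[sp][p], key=lambda x: cosine(q, docs[sp][p][x]))
    return sp, p, d


if __name__ == "__main__":
    network = {
        "sp1": {"peerA": {"a1": "jazz music album review"}},
        "sp2": {"peerC": {"c1": "distributed database index"}},
    }
    print(route("database index", build_index(network)))
```

The point of the sketch is only that a query is compared against a handful of coarse summaries before any individual document is examined; a real system would replace the linear scans with the per-tier index structures mentioned above.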