Scalable percolation search on complex networks

Nima Sarshar, Oscar Boykin, Vwani Roychowdhury
2006 Theoretical Computer Science  
We introduce a scalable searching protocol for locating and retrieving content in random networks with heavy-tailed and, in particular, power-law (PL) degree distributions. The proposed algorithm is capable of finding any content in the network with probability one in time O(log N), with a total traffic that provably scales sub-linearly with the network size, N. Unlike other proposed solutions, there is no need to assume that the network has multiple copies of contents; the protocol finds all contents reliably, even if every node in the network starts with a unique content. The scaling behavior of the size of the giant connected component of a random graph with heavy-tailed degree distributions under bond percolation is at the heart of our results. The percolation search algorithm can be directly applied to make unstructured peer-to-peer (P2P) networks, such as Gnutella, Limewire and other file-sharing systems (which naturally display heavy-tailed degree distributions and approximate scale-free network structures), scalable. For example, simulations of the protocol on the Limewire crawl number 5 network [Ripeanu et al., Mapping the Gnutella network: properties of large-scale peer-to-peer systems and implications for system design, IEEE Internet Comput. J. 6 (1) (2002)], consisting of over 65,000 links and 10,000 nodes, show that even for this snapshot network, the traffic can be reduced by a factor of at least 100 while still achieving a hit-rate greater than 90%.
... networks, optimal utilization of the heterogeneity of the resources available to the members of the network calls for nonuniform connectivity distributions. To optimally utilize the resources of the network, nodes with more resources and capabilities should assume more central roles, often by acquiring more connections.
Group (iii): A fairly recent class of work based on the idea that network formation protocols can be designed to ensure the existence of PL connectivity structure in the emergent complex networks. This group [21, 23] argues that random PL (or other heavy-tailed) networks are in fact desirable topologies for communication networks of various kinds, and that such topologies can be actively exploited to provide scalable global services in highly dynamic and ad hoc environments.
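The abstract's central object, the giant connected component of a heavy-tailed random graph under bond percolation, can be illustrated with a small simulation. This is a minimal sketch and not the authors' search protocol: the configuration-model construction, the exponent tau = 2.5, the degree cutoff k_max, and all function names are assumptions chosen for illustration.

```python
import random
from collections import Counter


def power_law_degrees(n, rng, tau=2.5, k_max=None):
    """Sample n degrees with P(k) proportional to k**(-tau), 1 <= k <= k_max."""
    if k_max is None:
        k_max = max(2, int(n ** 0.5))  # assumed cutoff, not taken from the paper
    ks = list(range(1, k_max + 1))
    weights = [k ** (-tau) for k in ks]
    degrees = rng.choices(ks, weights=weights, k=n)
    if sum(degrees) % 2:  # the stub count must be even so stubs can be paired
        degrees[0] += 1
    return degrees


def configuration_model(degrees, rng):
    """Pair stubs uniformly at random; drop self-loops and repeated edges."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return sorted(edges)


def largest_component_size(n, edges, q, rng):
    """Bond percolation: keep each edge independently with probability q,
    then return the size of the largest connected component (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        if rng.random() < q:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values())


if __name__ == "__main__":
    rng = random.Random(0)
    n = 5000
    degrees = power_law_degrees(n, rng)
    edges = configuration_model(degrees, rng)
    for q in (0.1, 0.3, 0.6, 1.0):
        size = largest_component_size(n, edges, q, rng)
        print(f"q={q:.1f}  largest component: {size} of {n} nodes")
```

Running the script for several values of q gives a rough picture of how the largest surviving component changes as fewer links are retained, which is the quantity the abstract identifies as driving the traffic scaling.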
The work of Group (iii) was motivated, among other things, by the controversy over the accuracy of dynamical models in accounting for characteristics observed in real networks, or even the very existence of some of those characteristics. For instance, a couple of years after the discovery of PL relations in the Internet [12], the authors of [10] reexamined the data on which the results in [12] were based and found that these data provide only an incomplete view of the Internet topology. They showed that a more complete data set reveals that while the distribution of the connectivity in the Internet is still heavy-tailed, it deviates significantly from a strict PL. These findings thus challenge the extent of the validity of many PL relations found in the works of Group (i), as well as the accuracy with which dynamical systems proposed in the works of Group (ii) (which predict the emergence of strict PL distributions) can model existing systems. Group (iii) avoids this debate and addresses the issue of Designer Complex Networks, i.e., variants of the dynamical rules proposed in the work of Group (ii) can be used as the basic protocols for large-scale networks, resulting in the emergence of desired complex networks (e.g., networks with tunable PL exponents and multiple scale-free distributions), which can then be harnessed to provide global services in a robust fashion. Whether or not PL distributions exist in existing complex networks, one thing is certain: these PL distributions will emerge if someone actually builds a network from scratch with the dynamical rules proposed in the works of Group (ii). Thus, if PL (or other heavy-tailed) distributions are in fact desirable for a communication network, autonomous dynamical protocols can be designed to produce such connectivity structures in the emerging network.
The work of Group (iii) is thus based on this idea: network formation protocols can be designed to ensure the existence of PL connectivity structure in the emergent complex networks. The work presented in this paper follows this philosophy. We show that an unstructured P2P communication system can exploit the heavy-tailed connectivity structure of the network for scalable search. In separate works [21], we devise protocols that ensure the emergence of scale-free connectivity structures even in ad hoc and unreliable dynamical environments. Furthermore, we show how these protocols can take into account the heterogeneous distribution of the resources among the nodes of the network. For instance, the high-connectivity nodes (the hubs) in the emergent network can be guaranteed to be chosen from nodes with higher resources. Therefore, throughout this paper, we assume the network topology is a random PL network, and we are concerned only with the performance of the proposed search algorithm on such networks.
P2P networking systems consist of a large number of nodes or computers that operate in a decentralized manner to provide reliable global services, such as query resolutions (i.e., database searches), ad hoc point-to-point communications, and cluster or P2P computing. The existing P2P schemes can be broadly categorized into two types:
(1) Unstructured P2P networks: Such networks include the popular music and video download services, such as Gnutella [13], Limewire [19], Kazaa [1], Morpheus [2], and Imesh [3]. Together they account for millions of users dynamically connected in an ad hoc fashion, creating a giant federated database. The salient feature of such networks is that the data objects do not have globally unique IDs, and queries are made via a set of keywords.
(2) Structured P2P networks: These include systems under development, including Tapestry [25], Chord [24], PASTRY [20, 11], and Viceroy [15], and are characterized by the fact that each content/item has a unique identification tag or key; e.g., an m-bit hash of the content is a common choice, leading to the popular characterization of such networks as Distributed Hash Table (DHT) P2P systems. As opposed to the unstructured networks, which are already being used by millions of users, most of the structured systems are in various stages of development, and it is not at all clear which system is best suited to provide a reliable, load-balanced, and fault-tolerant network. Moreover, unstructured searches using keywords constitute a dominant mechanism for locating content and resources, and for merging/mining already existing heterogeneous sets of databases. Thus, unstructured P2P networking will continue to remain an important application domain. In spite of the great popularity of the unstructured P2P networks, systematic designs of provably robust and scalable networks have not been proposed, and most of the networks currently being used are still ad hoc (even though ingenious)
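The m-bit content key mentioned in item (2) can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a hash function, so SHA-1 is an assumed choice here, and the function name `content_key` and the default m = 32 are hypothetical.

```python
import hashlib


def content_key(content: bytes, m: int = 32) -> int:
    """Return an m-bit integer key: the top m bits of the SHA-1 digest.

    SHA-1 is an assumption made for this sketch; the source only says
    "an m-bit hash of the content".
    """
    if not 1 <= m <= 160:  # SHA-1 produces 160 bits
        raise ValueError("m must be between 1 and 160")
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest, "big") >> (160 - m)


if __name__ == "__main__":
    print(content_key(b"example file contents", m=16))
```

A key of this kind identifies one exact item, so a lookup needs the full content identity in advance. A keyword query in an unstructured network has no such key to hash, which is why the two network types call for different search mechanisms.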
doi:10.1016/j.tcs.2005.12.014 fatcat:okmrh73chncs7mmq4iwkkvspz4