Guidelines for Online Network Crawling

Katchaguy Areekijseree, Ricky Laishram, Sucheta Soundarajan
2018 Proceedings of the 10th ACM Conference on Web Science - WebSci '18  
In recent years, researchers and data analysts have increasingly used online social network data to study human behavior. Before such study can begin, one must first obtain appropriate data. This process poses many challenges: e.g., a platform may provide a public API for accessing data, but such APIs are often rate limited, restricting the amount of data that an individual can collect in a given amount of time. Thus, in order for the data collector to efficiently collect data, she needs to make intelligent use of her limited API queries. The network science literature has proposed numerous network crawling methods, but it is not always easy for the data collector to select an appropriate method: methods that are successful on one network may fail on other networks. In this work, we demonstrate that the performance of network crawling methods is highly dependent on the structural properties of the network. To do that, we perform a detailed, hypothesis-driven analysis of the performance of eight popular crawling methods with respect to the task of maximizing node coverage. We perform experiments on both directed and undirected networks, under five different query response models: complete, paginated, partial, in-out, and out responses. We identify three important network properties: community separation, average community size, and average node degree. We begin by performing controlled experiments on synthetic networks, and then verify our observations on real networks. Finally, we provide guidelines to data collectors on how to select an appropriate crawling method for a particular network.
doi:10.1145/3201064.3201066 dblp:conf/websci/AreekijsereeLS18 fatcat:uu5czxm7cnevhlzcywzg72i2la