Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics - NDA '16
In this paper, we describe NSCALESPARK, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NSCALESPARK is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks. These systems espouse a "think-like-a-vertex" (TLV) paradigm; example systems include Pregel, Apache Giraph, GPS,
... , and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks, which typically require processing information aggregated over neighborhoods or subgraphs of the underlying graph. NSCALESPARK is instead based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding"). Here, users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NSCALESPARK builds upon our prior work on the NSCALE system, which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NSCALE on the Apache Spark platform, the key challenges therein, and the design decisions we made. NSCALESPARK uses a series of RDD transformations, guided by a cost-based optimizer, to extract the relevant subgraphs and hold them in distributed memory with a minimal footprint. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real-world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
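To make the "think-like-a-subgraph" paradigm concrete, the following is a minimal single-machine sketch in plain Python. It is not the NSCALESPARK API and does not use Spark RDDs; the function names (`k_hop_subgraph`, `run_on_neighborhoods`, `local_clustering`) are hypothetical and chosen only for illustration. It assumes an undirected graph given as an edge list, extracts the k-hop neighborhood of every vertex, and hands each extracted subgraph to a user-supplied function, here a local clustering coefficient.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def k_hop_subgraph(adj, center, k):
    """Return (vertices, edges) of the subgraph induced by all
    vertices within k hops of `center` (breadth-first search)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    vertices = set(dist)
    # Keep every edge whose two endpoints are both inside the neighborhood.
    edges = {frozenset((u, v)) for u in vertices for v in adj[u] if v in vertices}
    return vertices, edges


def local_clustering(center, vertices, edges):
    """Example user computation: local clustering coefficient of `center`,
    computed only from its extracted 1-hop subgraph."""
    degree = len(vertices - {center})
    if degree < 2:
        return 0.0
    # Edges among the neighbors are those that do not touch the center.
    links = sum(1 for e in edges if center not in e)
    return 2.0 * links / (degree * (degree - 1))


def run_on_neighborhoods(edges, k, compute):
    """Apply `compute` to the k-hop subgraph around every vertex."""
    adj = build_adjacency(edges)
    return {v: compute(v, *k_hop_subgraph(adj, v, k)) for v in list(adj)}


# A triangle (1, 2, 3) with a pendant vertex 4 attached to vertex 3.
scores = run_on_neighborhoods([(1, 2), (1, 3), (2, 3), (3, 4)], 1, local_clustering)
```

The user function sees a whole subgraph rather than a single vertex and its incoming messages, which is the contrast with the TLV paradigm drawn above. In a distributed setting the extraction and the per-subgraph computation would run in parallel over partitions; this sketch deliberately leaves that out.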