Distributed Set Reachability
Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16
In this paper, we focus on the efficient and scalable processing of set-reachability queries over a distributed, directed data graph. A set-reachability query is a generalized form of a reachability query in which two sets S and T of source and target vertices, respectively, are given as the query. The result of a set-reachability query consists of all pairs of source and target vertices (s, t), with s ∈ S and t ∈ T, where t is reachable from s (denoted as S ⇝ T). In case the data graph is partitioned into multiple, edge- and vertex-disjoint subgraphs (e.g., when distributed across multiple compute nodes in a cluster), we refer to the resulting set-reachability problem as distributed set reachability. The key goals in processing a distributed set-reachability query over a partitioned data graph both efficiently and in a scalable manner are (1) to avoid redundant computations within the local compute nodes as much as possible, (2) to partially evaluate the local components of a set-reachability query S ⇝ T among all compute nodes in parallel, and (3) to minimize both the size and number of messages exchanged among the compute nodes. Distributed set reachability has a plethora of applications in graph analytics and query processing. The current W3C recommendation for SPARQL 1.1, for example, introduces a notion of labeled property paths, which resolves to processing a form of generalized graph-pattern queries with set-reachability predicates. Moreover, analyzing dependencies among social-network communities inherently involves reachability checks between large sets of source and target vertices. Our experiments confirm very significant performance gains of our approach in comparison to state-of-the-art graph engines such as Giraph++ over a variety of graph collections with up to 1.4 billion edges.
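To make the query semantics concrete, the following Python sketch first evaluates S ⇝ T naively on a single machine and then mimics goal (2) by letting each partition summarize its local reachability before a coordinator combines the summaries. It is a minimal illustration under our own assumptions, not the algorithm evaluated in this paper: the function names, the representation of a partition as a (vertex set, adjacency dict) pair, and the explicit list of cut edges between partitions are all hypothetical, and the per-partition loop runs sequentially here although it would run in parallel on the compute nodes.

```python
from collections import deque


def reachable_from(adj, source):
    """Return every vertex reachable from `source` (including itself) via BFS."""
    seen = {source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def set_reachability(adj, sources, targets):
    """Naive centralized baseline: all pairs (s, t) with t reachable from s."""
    targets = set(targets)
    return {(s, t) for s in sources for t in reachable_from(adj, s) & targets}


def local_summary(local_vertices, local_adj, sources, targets, cut_edges):
    """Partial evaluation on one partition (hypothetical helper).

    Entries are local sources and local heads of cut edges; exits are local
    targets and local tails of cut edges. Only entry -> exit reachability
    is reported, so messages stay small.
    """
    entries = {s for s in sources if s in local_vertices}
    entries |= {v for (_, v) in cut_edges if v in local_vertices}
    exits = {t for t in targets if t in local_vertices}
    exits |= {u for (u, _) in cut_edges if u in local_vertices}
    return {e: reachable_from(local_adj, e) & exits for e in entries}


def distributed_set_reachability(partitions, cut_edges, sources, targets):
    """Combine the local summaries and the cut edges into one small graph."""
    summary_adj = {}
    for local_vertices, local_adj in partitions:  # parallel in a real cluster
        summary = local_summary(local_vertices, local_adj,
                                sources, targets, cut_edges)
        for entry, exits in summary.items():
            summary_adj.setdefault(entry, set()).update(exits)
    for u, v in cut_edges:
        summary_adj.setdefault(u, set()).add(v)
    return set_reachability(summary_adj, sources, targets)


# Toy graph: partition A holds 1 -> 2 -> 3, partition B holds 4 -> 5 <- 6,
# and the single cut edge 3 -> 4 connects them.
partition_a = ({1, 2, 3}, {1: [2], 2: [3]})
partition_b = ({4, 5, 6}, {4: [5], 6: [5]})
cut_edges = [(3, 4)]
full_adj = {1: [2], 2: [3], 3: [4], 4: [5], 6: [5]}
S, T = {1, 6}, {2, 5}

central = set_reachability(full_adj, S, T)
distributed = distributed_set_reachability(
    [partition_a, partition_b], cut_edges, S, T)
```

On this toy graph both evaluations yield the pairs (1, 2), (1, 5), and (6, 5). The summary graph contains only entry, exit, and cut-edge vertices, which hints at why exchanging boundary-level reachability rather than full subgraphs can keep the message volume low.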