SPARQL Graph Pattern Processing with Apache Spark

Hubert Naacke, Bernd Amann, Olivier Curé
In Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems (GRADES'17), 2017
A common way to achieve scalability for processing SPARQL queries is to use MapReduce frameworks like Hadoop or Spark. Processing basic graph pattern (BGP) expressions, which generate large join plans over distributed data partitions, is a major challenge in these frameworks. In this article, we study the use of two distributed join algorithms, partitioned join and broadcast join, for the evaluation of BGP expressions on top of Apache Spark. We compare five possible implementations and illustrate the importance of carefully choosing the physical data storage layer, as well as of being able to use both join algorithms to efficiently take into account existing data partitioning schemes. Our experiments with different SPARQL benchmarks over real-world and synthetic workloads show that hybrid join plans introduce more flexibility and often achieve better performance than single-kind join plans.
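The contrast between the two join strategies can be illustrated without Spark. The sketch below (not the authors' code; all function names are illustrative) simulates both on a toy triple store, joining the BGP patterns `?s knows ?x . ?x age ?o`: the partitioned join shuffles both sides on the join key, while the broadcast join replicates the small side to every partition of the large side, leaving the large side's existing partitioning untouched.

```python
# Conceptual sketch of partitioned vs. broadcast joins (not Spark API).
# Triples are (s, p, o) tuples; "partitions" are plain Python lists.

def hash_partition(tuples, key, n):
    """Distribute tuples into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[key]) % n].append(t)
    return parts

def partitioned_join(left, right, lkey, rkey, n=4):
    """Both sides are repartitioned (shuffled) on the join key,
    then each partition pair is joined locally."""
    lparts = hash_partition(left, lkey, n)
    rparts = hash_partition(right, rkey, n)
    out = []
    for lp, rp in zip(lparts, rparts):
        index = {}
        for t in rp:
            index.setdefault(t[rkey], []).append(t)
        for t in lp:
            for u in index.get(t[lkey], []):
                out.append(t + u)
    return out

def broadcast_join(big_partitions, small, bkey, skey):
    """The small side is replicated to every partition of the big side;
    the big side's existing partitioning is reused (no shuffle)."""
    index = {}
    for t in small:
        index.setdefault(t[skey], []).append(t)
    out = []
    for part in big_partitions:
        for t in part:
            for u in index.get(t[bkey], []):
                out.append(t + u)
    return out

# Toy data: match the two triple patterns, then join on ?x.
triples = [("a", "knows", "b"), ("a", "knows", "c"),
           ("b", "age", "30"), ("c", "age", "25"), ("d", "age", "40")]
knows = [(s, o) for s, p, o in triples if p == "knows"]   # (?s, ?x)
age = [(s, o) for s, p, o in triples if p == "age"]       # (?x, ?o)

r1 = partitioned_join(knows, age, 1, 0)
r2 = broadcast_join(hash_partition(knows, 0, 3), age, 1, 0)
# Both strategies produce the same bindings, but with different data movement.
```

The trade-off this captures is the one the paper studies: a broadcast join avoids shuffling the large side and can exploit an existing partitioning scheme, but only pays off when the other side is small; a hybrid plan picks per-join whichever is cheaper.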
doi:10.1145/3078447.3078448 dblp:conf/grades/NaackeAC17