Towards scalable RDF graph analytics on MapReduce

Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu
2010 Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud - MDAC '10  
In order to exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs, namely pattern matching, grouping, and aggregation, and require several join operations to reassemble them into n-ary relations relevant to the given query,
... traditional OLAP systems where data is suitably organized. MapReduce-based parallel processing systems like Pig have gained success in processing scalable analytical workloads. However, these systems offer only relational-algebra-style operators, which would require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impact performance. In this paper, we propose UDFs that (i) refactor analytical processing on RDF graphs in a way that enables more parallelized processing and (ii) perform look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library, and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.
Processing of RDF data usually requires several joins and grouping operations, which cannot effectively be pushed to the database. Yet another approach optimizes multi-way joins [10] by providing strategies to efficiently partition and replicate the tuples of a relation on reducer processes in a way that minimizes the communication cost. This work is complementary to our approach, and by integrating the partitioning scheme into Pig, we can further improve the performance of join operations. The RDF community has also recently embraced the parallel data processing paradigm described by the MapReduce model, and there have been efforts to perform scalable RDF reasoning [12] by materializing the closure of the related graph and hence performing efficient reasoning using the resultant ordering of inferring rules. There have also been MapReduce-based approaches to pattern matching [13], [14] that decompose graphs into RDF molecules.
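To make the three constructs named in the abstract concrete (pattern matching, grouping, aggregation), the following is a minimal single-machine sketch in Python. It is not the UDFs proposed in the paper and does not use Pig or MapReduce; the toy triples, property names, and function names are all hypothetical, chosen only to show how binary triples are reassembled into n-ary records before they can be grouped and aggregated.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not data from the paper.
TRIPLES = [
    ("p1", "type", "Product"), ("p1", "vendor", "v1"), ("p1", "price", "20"),
    ("p2", "type", "Product"), ("p2", "vendor", "v1"), ("p2", "price", "30"),
    ("p3", "type", "Product"), ("p3", "vendor", "v2"), ("p3", "price", "5"),
    ("v1", "country", "US"),
]


def reassemble_by_subject(triples):
    """Group triples by subject so that each subject yields one n-ary record.

    Simplification: a repeated property on the same subject overwrites the
    earlier value; real RDF data can be multi-valued.
    """
    records = defaultdict(dict)
    for subject, prop, obj in triples:
        records[subject][prop] = obj
    return dict(records)


def match_star(records, required):
    """Pattern matching: keep subjects that carry every required property."""
    return {s: r for s, r in records.items() if required.issubset(r)}


def total_price_per_vendor(triples):
    """Grouping and aggregation over the reassembled product records."""
    records = reassemble_by_subject(triples)
    products = match_star(records, {"type", "vendor", "price"})
    totals = defaultdict(int)
    for rec in products.values():
        if rec["type"] == "Product":
            totals[rec["vendor"]] += int(rec["price"])
    return dict(totals)


print(total_price_per_vendor(TRIPLES))
```

In a relational-style plan, each property in the pattern would typically contribute one join over the triple relation, which is the iterative reassembly cost the abstract refers to. Grouping by subject in one pass, as sketched here, is only one illustrative way to avoid some of those joins.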
doi:10.1145/1779599.1779604 fatcat:tvk3s4hhhrazbo5gxn4i4pus44