Efficient Distributed SPARQL Queries on Apache Spark

Saleh Albahli
2019 International Journal of Advanced Computer Science and Applications  
RDF is a widely accepted framework for describing metadata on the web, owing to its simplicity and its universal graph-like data model. The abundance of RDF data has rendered existing query techniques unsuitable. To this end, we use the processing power of Apache Spark to load and query a large dataset much more quickly than classical approaches. In this paper, we design experiments to evaluate the performance of several queries, ranging from single-attribute selection to [...] filtering and sorting multiple attributes in the dataset. We further measure query performance using distributed SPARQL queries on Apache Spark GraphX and study the stages involved in this pipeline. Executing distributed SPARQL queries on Apache Spark GraphX let us study their performance and gave insight into which stages of the pipeline can be improved. The query pipeline comprises three stages: graph loading, basic graph pattern matching, and result calculation. Our goal is to minimize the time spent in the graph loading stage in order to improve overall performance and cut the cost of data loading.
doi:10.14569/ijacsa.2019.0100874 fatcat:lolh5mvrhrgttjdfm3uqbrqdam
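
The three pipeline stages named in the abstract (graph loading, basic graph pattern, result calculation) can be illustrated with a minimal single-machine sketch. This is not the paper's implementation and does not use Spark or GraphX: it is a plain-Python stand-in, and the toy triples, the function names (`load_graph`, `match_bgp`, `compute_results`, `run_query`) and the whitespace-separated triple format are all assumptions made for illustration. It times each stage separately, which is the kind of per-stage measurement the abstract describes.

```python
import time

# Hypothetical toy dataset: one "subject predicate object" triple per line.
TRIPLES_TEXT = """\
alice knows bob
bob knows carol
alice age 30
bob age 25
carol age 41
"""


def load_graph(text):
    """Stage 1 (graph loading): parse lines into (s, p, o) tuples."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]


def match_pattern(triples, pattern, binding):
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must bind consistently across all patterns.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield candidate


def match_bgp(triples, patterns):
    """Stage 2 (basic graph pattern): join the triple patterns one at a time."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for extended in match_pattern(triples, pattern, binding)
        ]
    return bindings


def compute_results(bindings, select, order_by):
    """Stage 3 (result calculation): sort the bindings and project variables."""
    ordered = sorted(bindings, key=lambda b: b[order_by])
    return [tuple(b[var] for var in select) for b in ordered]


def run_query(text, patterns, select, order_by):
    """Run the three stages and record the wall-clock time of each one."""
    timings = {}

    start = time.perf_counter()
    triples = load_graph(text)
    timings["graph_loading"] = time.perf_counter() - start

    start = time.perf_counter()
    bindings = match_bgp(triples, patterns)
    timings["basic_graph_pattern"] = time.perf_counter() - start

    start = time.perf_counter()
    results = compute_results(bindings, select, order_by)
    timings["result_calculation"] = time.perf_counter() - start

    return results, timings


if __name__ == "__main__":
    # Roughly: SELECT ?x ?y ?a WHERE { ?x knows ?y . ?y age ?a } ORDER BY ?a
    query = [("?x", "knows", "?y"), ("?y", "age", "?a")]
    rows, stage_times = run_query(TRIPLES_TEXT, query, ("?x", "?y", "?a"), "?a")
    for row in rows:
        print(row)
    for stage, seconds in stage_times.items():
        print(f"{stage}: {seconds:.6f}s")
```

In a distributed setting the loading stage would read and partition the data across a cluster, so its share of the total time can differ greatly from this toy case; the sketch only shows where the stage boundaries fall and how each one can be timed on its own.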