Scalable distributed indexing and query processing over Linked Data
Journal of Web Semantics
Linked Data is becoming a core part of modern Web applications, and thus efficient access to structured information expressed in RDF gains paramount importance. A number of efficient local RDF stores already exist, while distributed indexing and distributed query processing over Linked Data, with efficiency and data management features similar to those known from traditional database and data integration systems, are only starting to develop. Distributed approaches will necessarily co-exist with centralized
schemes, as data will be owned by different stakeholders who may not want to provide their complete data sets to a central place. Additionally, central or integrated storage may be prohibited for organizational or legal reasons in certain areas. Only a few attempts at supporting decentralized schemes exist so far, and they are limited in terms of capabilities and in the trade-off between the degree of distribution and efficiency, query expressivity, and scalability. To remedy this situation, the approach and proof-of-concept prototype presented in this paper provide a solution for these open challenges. As we argue for widely distributed systems as a possible answer to scalability issues, we first identify and discuss the main challenges. Based on this analysis, we propose an approach for efficient and scalable query processing over distributed Linked Data sources, taking into account the latest advances in database technology. Our system is based on a layered architecture that exploits the advantages of decentralized indexing and query processing approaches, which have been researched and have matured over the last decade. Our approach builds on a logical algebra for queries over RDF data and a related physical query algebra, which enables optimization on both the logical and the physical layers of query processing. The introduced operators and strategies for processing complex query plans make extensive use of parallelism and other optimization paradigms of distributed query processing. Our query processing framework includes a sophisticated cost model to enable cost-efficient query planning and query execution. We evaluate our approach extensively on a real proof-of-concept deployment; the experiments demonstrate the efficiency, applicability, and correctness of the proposed concepts.