RDF-3X

Thomas Neumann, Gerhard Weikum
2008 Proceedings of the VLDB Endowment  
RDF is a data representation format for schema-free structured information that is gaining momentum in the context of Semantic-Web corpora, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a streamlined, RISC-style architecture with carefully designed, puristic data structures and operations. The salient points of RDF-3X are: 1) a generic solution for storing and indexing RDF triples that completely eliminates the need for physical-design tuning, 2) a powerful yet simple query processor that leverages fast merge joins to the largest possible extent, and 3) a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths. The performance of RDF-3X, in comparison to the previously best systems, has been measured on several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching and long join paths in the underlying data graphs.

Copyright 2008 VLDB Endowment, ACM, ISBN 978-1-60558-305-1

Select ?title Where { ?m ?title. ?m ?c. ?c ?a. ?a "Johnny Depp" }

Here each of the conjunctions, denoted by a dot, corresponds to a join. The whole query can also be seen as a graph pattern that needs to be matched in the RDF data graph. In SPARQL, predicates can also be variables or wildcards, thus allowing schema-agnostic queries.

RDF engines for storing, indexing, and querying have been around for quite a few years; in particular, the Jena framework by HP Labs has gained significant popularity [46], and Oracle also provides RDF support for semantic data integration in life sciences and enterprises [11, 29]. However, with the exception of the VLDB 2007 paper by Abadi et al. [1], none of the prior implementations could demonstrate convincing efficiency, failing to scale up towards large datasets and high load. The approach of [1] achieves good performance by grouping triples with the same property name into property tables, mapping these onto a column store, and creating materialized views for frequent joins. Managing large-scale RDF data includes technical challenges for the storage layout, indexing, and query processing:
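
To illustrate the earlier remark that each dot-separated triple pattern in the example query corresponds to a join, the following is a minimal Python sketch that evaluates a conjunctive pattern query over a small in-memory triple list. It is not how RDF-3X works internally (the engine relies on merge joins); it is a naive binding-extension join, chosen only for readability. The toy data, the predicate names (`title`, `casts`, `actor`, `name`), and the helper names `match` and `evaluate` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Toy data; identifiers and predicate names are hypothetical.
TRIPLES: List[Triple] = [
    ("m1", "title", "Movie A"),
    ("m1", "casts", "c1"),
    ("c1", "actor", "a1"),
    ("a1", "name", "Johnny Depp"),
    ("m2", "title", "Movie B"),
    ("m2", "casts", "c2"),
    ("c2", "actor", "a2"),
    ("a2", "name", "Someone Else"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Variable: bind it, or check consistency with an earlier binding.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            # Constant: must match exactly.
            return None
    return result


def evaluate(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Join the patterns one at a time; shared variables act as join keys."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for b in bindings:
            for t in triples:
                ext = match(pattern, t, b)
                if ext is not None:
                    extended.append(ext)
        bindings = extended
    return bindings


# Four triple patterns, hence three joins (on ?m, ?c, and ?a).
QUERY: List[Triple] = [
    ("?m", "title", "?title"),
    ("?m", "casts", "?c"),
    ("?c", "actor", "?a"),
    ("?a", "name", "Johnny Depp"),
]

titles = [b["?title"] for b in evaluate(QUERY, TRIPLES)]
print(titles)
```

Because `match` treats any term starting with `?` as a variable regardless of position, the same sketch also covers the schema-agnostic case mentioned above, for example the single pattern `("?s", "?p", "Johnny Depp")` with a variable predicate.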
doi:10.14778/1453856.1453927 fatcat:sexj2dl7crfihaij4bic7zkmnu