Semantics and complexity of SPARQL

Jorge Pérez, Marcelo Arenas, Claudio Gutierrez
2009 ACM Transactions on Database Systems  
SPARQL is the standard language for querying RDF data. In this article, we systematically address the formal study of the database aspects of SPARQL, concentrating on its graph pattern matching facility. We provide a compositional semantics for the core part of SPARQL, and study the complexity of the evaluation of several fragments of the language. Among other complexity results, we show that the evaluation of general SPARQL patterns is PSPACE-complete. We identify a large class of SPARQL
patterns, defined by imposing a simple and natural syntactic restriction, where the query evaluation problem can be solved more efficiently. This restriction gives rise to the class of well-designed patterns. We show that the evaluation problem is coNP-complete for well-designed patterns. Moreover, we provide several rewriting rules for well-designed patterns whose application may have a considerable impact on the cost of evaluating SPARQL queries.
16:2 • J. Pérez et al.
with its release in 1998 as Recommendation of the World Wide Web Consortium (W3C), the natural problem of querying RDF data was raised. Since then, several designs and implementations of RDF query languages have been proposed (see Haase et al. [2004] and Furche et al. [2006] for detailed comparisons of RDF query languages). In 2004, the RDF Data Access Working Group, part of the W3C Semantic Web Activity, released a first public working draft of a query language for RDF, called SPARQL [Prud'hommeaux and Seaborne 2008]. Since then, SPARQL has been rapidly adopted as the standard for querying Semantic Web data. In January 2008, SPARQL became a W3C Recommendation.
RDF is a directed labeled graph data format and, thus, SPARQL is essentially a graph-matching query language. SPARQL queries are composed of three parts. The pattern matching part includes several interesting features of pattern matching of graphs, like optional parts, union of patterns, nesting, filtering values of possible matchings, and the possibility of choosing the data source to be matched by a pattern. Once the output of the pattern has been computed (in the form of a table of values of variables), the solution modifiers allow these values to be modified by applying classical operators such as projection, distinct, order, and limit.
Finally, the output of a SPARQL query can be of different types: yes/no queries, selections of values of the variables which match the patterns, construction of new RDF data from these values, and descriptions of resources.
The definition of a formal semantics for SPARQL has played a key role in the standardization process of this query language. Although the features of SPARQL, taken one by one, are intuitive and simple to describe and understand, their combination makes SPARQL a complex language. Reaching a consensus in the W3C standardization process about a formal semantics for SPARQL was not an easy task. The initial efforts to define SPARQL were driven by use cases, mostly by specifying the expected output for particular example queries. In fact, the interpretations of examples and the exact outcomes of cases not covered in the initial drafts of the SPARQL specification were a matter of long discussions in the W3C mailing lists. In the conference version of this article (see Pérez et al. [2006a]), we presented one of the first formalizations of a semantics for a fragment of the language. Currently, the official specification of SPARQL [Prud'hommeaux and Seaborne 2008], endorsed by the W3C, formalizes a semantics based on our work [Pérez et al. 2006a, 2006b; Arenas et al. 2007].
A formalization of a semantics for SPARQL is beneficial for several reasons: it serves as a tool to identify and derive relations among the constructors that stay hidden in the use cases, helps to identify redundant and contradicting notions, drives and helps the implementation of query engines, and supports the study of complexity, expressiveness, and further natural database questions such as rewriting and optimization.
The broad goal of our work is the formalization and study of the database aspects of SPARQL. In this article, we present a thorough study of the pattern-matching facility of SPARQL, which constitutes the core of the language.
In this direction,
Organization of the Article. Section 2 presents a formalized algebraic syntax and a compositional semantics for SPARQL. Section 3 presents the complexity study of the language. Section 4 introduces the fragment of well-designed patterns and presents its properties. Finally, Section 5 discusses related work and Section 6 gives some concluding remarks. For the sake of readability, some proofs and technical results are included in the Appendix.
SYNTAX AND SEMANTICS OF SPARQL
In this section, we give an algebraic formalization of the core fragment of SPARQL over simple RDF, that is, RDF without RDFS vocabulary and literal rules. This allows us to take a close look at the core components of the language and identify some of its fundamental properties. We introduce first the necessary notions about RDF (for details on RDF formalization see Gutierrez et al. [2004], or Marin [2004] for a complete reference including RDFS vocabulary).
Assume there are pairwise disjoint infinite sets I, B, and L (IRIs [Durst and Suignard 2005], Blank nodes, and Literals, respectively). A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple. In this tuple, s is the subject, p the predicate, and o the object. Assume additionally the existence of an infinite set V of variables disjoint from the previous sets.
Definition 2.1. An RDF graph [Klyne et al. 2004] is a set of RDF triples. In our context, we refer to an RDF graph as an RDF dataset, or simply a dataset.
SPARQL is essentially a graph-matching query language. A SPARQL query is of the form H ← B, where B, the body of the query, is a complex RDF graph pattern expression that may include RDF triples with variables, conjunctions, disjunctions, optional parts, and constraints over the values of the variables, and H, the head of the query, is an expression that indicates how to construct the answer to the query.
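The definitions of RDF triple and RDF dataset given in this section can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the class names `IRI`, `BlankNode`, and `Literal`, the helpers `is_rdf_triple` and `make_dataset`, and the `example.org` IRIs are hypothetical stand-ins for the sets I, B, and L, not part of the article or of any RDF library.

```python
from dataclasses import dataclass


# Hypothetical term classes standing in for the disjoint sets I, B, and L.
@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


def is_rdf_triple(s, p, o) -> bool:
    """Check that (s, p, o) is in (I ∪ B) × I × (I ∪ B ∪ L)."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


def make_dataset(triples):
    """Build an RDF graph (dataset): a set of RDF triples."""
    triples = list(triples)
    for t in triples:
        if not is_rdf_triple(*t):
            raise ValueError(f"not an RDF triple: {t!r}")
    # A frozenset gives set semantics: duplicate triples collapse.
    return frozenset(triples)


# Illustrative data only; the IRIs are made up for this sketch.
alice = IRI("http://example.org/alice")
name = IRI("http://example.org/name")
dataset = make_dataset([
    (alice, name, Literal("Alice")),
    (BlankNode("b1"), name, Literal("Bob")),
    (alice, name, Literal("Alice")),  # duplicate, absorbed by the set
])
```

Modeling the three sets as separate frozen classes keeps them pairwise disjoint by construction, and using a `frozenset` mirrors the fact that an RDF graph is a set rather than a list of triples.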
The evaluation of a query Q against a dataset D is done in two steps: the body of Q is matched against D to obtain a set of bindings for the variables in the body, and then, using the information on the head of Q, these bindings are processed applying classical relational operators (projection, distinct, etc.) to produce the answer to the query, which can have different forms, such as a yes/no answer, a table of values, or a new RDF dataset. In this article, we concentrate on the body of SPARQL queries, that is, on the graph pattern-matching facility.
Syntax of SPARQL Graph Pattern Expressions
The official syntax of SPARQL [Prud'hommeaux and Seaborne 2008] considers operators OPTIONAL, UNION, and FILTER, and concatenation via a point symbol (.), to construct graph pattern expressions. The syntax also considers { } to group patterns, and some implicit rules of precedence and association. For example, the point symbol (.) has precedence over OPTIONAL, and OPTIONAL is left associative. In order to avoid ambiguities in the parsing, we present the syntax of SPARQL graph patterns in a more traditional algebraic formalism, using binary operators AND (.), UNION (UNION), OPT (OPTIONAL), and FILTER (FILTER). We fully parenthesize expressions making explicit the precedence and association of operators.
Ω₁ \ Ω₂ = {μ ∈ Ω₁ | for all μ′ ∈ Ω₂, μ and μ′ are not compatible}.
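The difference of two sets of mappings keeps exactly those mappings of the first set that are compatible with no mapping of the second. The excerpt does not spell out compatibility, so the following minimal Python sketch assumes the usual notion: two mappings are compatible when they agree on every variable they share. Mappings are modeled as plain dictionaries from variable names to values, and the function names `compatible` and `difference` are chosen for illustration only.

```python
def compatible(mu1: dict, mu2: dict) -> bool:
    """True if the two mappings agree on every variable they share.

    Assumption: this is the usual notion of compatibility; mappings
    with disjoint domains are therefore always compatible.
    """
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())


def difference(omega1: list, omega2: list) -> list:
    """Return the mappings of omega1 compatible with no mapping of omega2."""
    return [
        mu for mu in omega1
        if all(not compatible(mu, nu) for nu in omega2)
    ]


# Illustrative mappings; variable names and values are made up.
omega1 = [{"?x": "a", "?y": "b"}, {"?x": "c"}]
omega2 = [{"?x": "a", "?z": "d"}]
result = difference(omega1, omega2)
```

Here the first mapping of `omega1` shares `?x` with the mapping in `omega2` and agrees on it, so it is dropped; the second disagrees on `?x`, so it is kept. When `omega2` is empty, every mapping of `omega1` survives.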
doi:10.1145/1567274.1567278 fatcat:f5qdvjxvlncahixa34s2v2pkqe