Definable relations and first-order query languages over strings

Michael Benedikt, Leonid Libkin, Thomas Schwentick, Luc Segoufin
2003 Journal of the ACM  
We study analogs of classical relational calculus in the context of strings. We start by studying string logics. Taking a classical model-theoretic approach, we fix a set of string operations and look at the resulting collection of definable relations. These form an algebra: a class of n-ary relations for every n, closed under projection and Boolean operations. We show that by choosing the string vocabulary carefully, we get string logics that have desirable properties: computable evaluation
[...] normal forms. We identify five distinct models and study the differences in their model theory and complexity of evaluation. We identify a subset of these models which have additional attractive properties, such as finite VC dimension and quantifier elimination. Once you have a logic, the addition of free predicate symbols gives you a string query language. The resulting languages have attractive closure properties from a database point of view: while SQL does not allow the full composition of string pattern-matching expressions with relational operators, these logics yield compositional query languages that can capture common string-matching queries while remaining tractable. For each of the logics studied in the first part of the paper, we study properties of the corresponding query languages. We give bounds on the data complexity of queries, extend the normal form results from logics to queries, and show that the languages have corresponding algebras expressing safe queries.

[...] $\Sigma = \{a_1, \ldots, a_n\}$ as first-order structures in the signature $(P_{a_1}, \ldots, P_{a_n}, <)$, so that the structure $M_s$ for a string $s$ of length $k$ has the universe $\{1, \ldots, k\}$, with $<$ being the usual ordering, and $P_{a_i}$ being the set of the positions $l$ such that the $l$th character in $s$ is $a_i$. Then a sentence $\Phi$ of some logic $\mathcal{L}$ defines a language $L(\Phi) = \{s \in \Sigma^* \mid M_s \models \Phi\}$. Two classical results on logic and language theory state that languages thus definable in monadic second-order logic (MSO) are precisely the regular languages [20], and the languages definable in first-order logic (FO) are precisely the star-free languages [54]. For a survey, see [65, 67]. An alternative approach to definability of strings, based on classical infinite model theory rather than finite model theory, dates back to the 1960s [20, 19]. One considers an infinite structure $M$ consisting of $\langle \Sigma^*, \Omega \rangle$, where $\Omega$ is a set of functions, predicates and constants on $\Sigma^*$.
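As a minimal sketch of the encoding just described, the following Python builds the structure $M_s$ for a string and evaluates one first-order sentence over it by brute-force quantification. The function names and the choice of sentence are illustrative assumptions, not taken from the paper. Over the alphabet {a, b}, the sentence "every a-position precedes every b-position" defines the star-free language a*b*.

```python
from typing import Dict, Set, Tuple


def string_structure(s: str, alphabet: str) -> Tuple[range, Dict[str, Set[int]]]:
    """Build M_s: the universe {1, ..., k} and one unary predicate P_a per letter.

    P_a is the set of positions (1-based) whose character is a.
    """
    universe = range(1, len(s) + 1)
    preds = {a: {i for i in universe if s[i - 1] == a} for a in alphabet}
    return universe, preds


def satisfies_a_before_b(s: str) -> bool:
    """Evaluate: forall x forall y ((P_a(x) and P_b(y)) -> x < y) in M_s.

    The implication is written out as a disjunction, and both quantifiers
    range over the whole universe of positions.
    """
    universe, P = string_structure(s, "ab")
    return all(
        (x not in P["a"]) or (y not in P["b"]) or x < y
        for x in universe
        for y in universe
    )
```

For instance, `satisfies_a_before_b("aabb")` holds, while `satisfies_a_before_b("aba")` does not, because the a at position 3 follows the b at position 2.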
One can then look at definable sets, those of the form $\{\bar{a} \mid M \models \varphi(\bar{a})\}$, where $\varphi$ is a first-order formula in the language of $M$. A well-known result links definability with traditional formal language theory. Let $\Omega_{reg}$ consist of unary functions $l_a$, $a \in \Sigma$, and binary predicates $el(x, y)$ and $x \preceq y$, where $l_a(x) = x \cdot a$, $el(x, y)$ states that $x$ and $y$ have the same length, and $x \preceq y$ states that $x$ is a prefix of $y$. Let $S_{len}$ be the model $\langle \Sigma^*, \Omega_{reg} \rangle$ (we will explain the notation later). Then subsets of $\Sigma^*$ definable in $S_{len}$ are precisely the regular languages [20, 19, 14]; moreover, this implies decidability of the first-order theory of $S_{len}$ [45, 14]. The key advantage of the "model-theoretic approach" is that one immediately gets an extension of the notion of recognizability from string languages to n-ary string relations for arbitrary n. One gets an algebra of n-ary string relations for every n, and these algebras automatically have closure under projection and product, in addition to the Boolean operations. In the case of the model $S_{len}$ above, this algebra is not new: in fact, the definable n-ary relations are exactly the ones recognizable under a natural notion of automaton running over n-tuples [19, 29]. We will refer to these automata-definable relations as the regular relations: the formal definition is given in subsection 3.1.1. We show here that by taking restrictions of the model $S_{len}$, one gets new algebras of regular relations which behave better, in many ways, than the full algebra of recognizable relations given by $S_{len}$. We introduce four such models here, and show that the definable sets in these models enjoy superior model-theoretic properties relative to the full algebra of recognizable relations associated with $S_{len}$. A key motivation for finding closed algebras of string relations comes from the field of databases, in particular, the study of query languages with interpreted operations [8, 10, 37, 50].
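The primitives of $\Omega_{reg}$ and the idea of an automaton running over tuples can be sketched in Python as follows. This is an illustration under stated assumptions: the padding symbol `PAD`, the function names, and the state names are hypothetical, and padding the shorter string is only one common way to formalize letter-to-letter automata; the paper's own definition is in its subsection 3.1.1, which is not reproduced here.

```python
from itertools import zip_longest

PAD = "#"  # hypothetical padding symbol, assumed not to occur in the alphabet


def l(a: str, x: str) -> str:
    """l_a(x) = x . a : append the letter a on the right of x."""
    return x + a


def el(x: str, y: str) -> bool:
    """el(x, y): x and y have the same length."""
    return len(x) == len(y)


def prefix(x: str, y: str) -> bool:
    """x is a prefix of y."""
    return y.startswith(x)


def run_pair_automaton(delta, start, accepting, x: str, y: str) -> bool:
    """Run a deterministic automaton that reads x and y in lockstep.

    At each step it consumes one pair of letters; the shorter string is
    padded with PAD so that both tapes end together.
    """
    state = start
    for pair in zip_longest(x, y, fillvalue=PAD):
        state = delta(state, pair)
    return state in accepting


def prefix_delta(state: str, pair) -> str:
    """Transition function recognizing the prefix relation."""
    cx, cy = pair
    if state == "eq" and cx == cy:
        return "eq"        # letters still agree
    if state in ("eq", "x_done") and cx == PAD:
        return "x_done"    # x has ended, y may continue freely
    return "fail"          # mismatch, or y ended before x


def el_delta(state: str, pair) -> str:
    """Transition function recognizing the equal-length relation."""
    return "ok" if state == "ok" and PAD not in pair else "fail"
```

With these definitions, `run_pair_automaton(prefix_delta, "eq", {"eq", "x_done"}, x, y)` agrees with `prefix(x, y)`, and `run_pair_automaton(el_delta, "ok", {"ok"}, x, y)` agrees with `el(x, y)`.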
String manipulation facilities have long been recognized as a critical component of a realistic database query language. In SQL, for example, the WHERE clause can contain string pattern-matching expressions, such as FACULTY.NAME LIKE 'Nyk%nen'. These expressions can themselves be seen as queries over string relations: the above clause, for example, can be seen as a selection performed on a projection of the FACULTY relation. While the Relational Calculus gives a satisfactory formal model for SQL queries in the absence of built-in datatypes, there has been thus far no satisfactory model that fully accounts for string queries. The lack of an adequate formal model is related to the fact that SQL restricts the interaction of string operations and relational operations in a number of ad hoc ways: one cannot apply the LIKE operator to a subquery to build up a new query, nor can one take the product of two string expressions built with LIKE. The natural way to obtain a calculus on string relations where one can freely compose string operations and relational operators is to start with a decidable structure on strings, like those mentioned above, and extend it to a query language by adding free predicate symbols, in the same way that traditional Relational Calculus can be obtained from first-order logic over pure equality. Using this approach, corresponding to $S_{len}$ and each of the four restricted models mentioned above, we obtain five interesting compositional query languages on strings.

The paper has two main parts. In the first part, we study definable algebras of string relations, that is, model-theoretic structures on $\Sigma^*$ and definability in these structures. We focus on five structures, of which the model $S_{len}$ mentioned above is the richest. In the second part of the paper, we deal with database applications, and study the corresponding query languages for string databases given by each of the five structures.
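To make the reading of a LIKE clause as "a selection performed on a projection" concrete, here is a small Python sketch. It assumes the standard LIKE wildcards (`%` for any sequence of characters, `_` for exactly one character) and ignores ESCAPE clauses and collation. The `FACULTY` rows and all helper names are hypothetical and serve only as illustration.

```python
import re


def like_to_regex(pattern: str):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any sequence, possibly empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.DOTALL)


def project(relation, column: int):
    """Projection onto a single column, kept as a relation of 1-tuples."""
    return [(row[column],) for row in relation]


def select_like(relation, column: int, pattern: str):
    """Selection: keep the rows whose given column matches the LIKE pattern."""
    rx = like_to_regex(pattern)
    return [row for row in relation if rx.fullmatch(row[column])]


# Hypothetical sample data, for illustration only.
FACULTY = [("Nykanen", "CS"), ("Nykvist-Hanen", "Math"), ("Smith", "CS")]

# The LIKE clause as a selection applied to a projection of FACULTY.
matches = select_like(project(FACULTY, 0), 0, "Nyk%nen")
```

Because `select_like` takes and returns plain relations, it can be applied to the output of any other operator here, which is the kind of free composition that the text says SQL restricts.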
This can be thought of as definability over model-theoretic structures and a finite relational database. Naturally, the results of the first part form the basis for reasoning about string query languages. We now summarize the developments in both parts of the paper. As mentioned above, we know that there exists a regular string algebra [20, 19, 14], i.e., an algebra which exactly captures the regular sets when restricted to unary relations. An obvious question to ask, then, is whether new algebras of string relations arise through the model-theoretic approach. In particular, if we restrict the signature to be less expressive than $\Omega_{reg}$, do we get new relation algebras lying within the recognizable relations? A natural starting point would be to find a signature that captures properties of the star-free sets. Here again, a simple example leaps out: consider the signature $\Omega_{sf} = (\preceq, (l_a)_{a \in \Sigma})$, and let $S = \langle \Sigma^*, \Omega_{sf} \rangle$. One can easily show that the definable subsets of $\Sigma^*$ in $S$ are exactly the star-free ones. Furthermore, we will show that the definable n-ary relations of this model are exactly those definable by regular prefix automata (cf. [4]) whose underlying string automata are counter-free. Just as there is a significant difference between the complexity-theoretic behavior of regular languages and star-free languages (the latter are in $AC^0$ whereas the former are not), we find that the model $S$ is much more tractable, in terms of its model theory and its complexity, than $S_{len}$. In particular, we show that $S$ has quantifier-elimination in a natural relational extension, while $S_{len}$ does not. It would be tempting to think of $S$ and $S_{len}$ as canonical extensions of the notions of regularity and star-free to n-ary relations. However, we will show that in fact there are many choices for $\Omega$ that share the same one-dimensional definable sets (either star-free or regular).
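As a small illustration of an n-ary relation definable with the prefix predicate alone, consider the ternary relation "z is the longest common prefix of x and y", given by the formula $z \preceq x \wedge z \preceq y \wedge \forall w\,((w \preceq x \wedge w \preceq y) \rightarrow w \preceq z)$. This formula and the Python names below are an assumed example, not taken from the paper. The sketch evaluates the formula directly; the universal quantifier only needs to range over prefixes of x, since every other w satisfies the implication vacuously.

```python
def is_prefix(x: str, y: str) -> bool:
    """The predicate: x is a prefix of y."""
    return y.startswith(x)


def prefixes(x: str):
    """All prefixes of x, from the empty string up to x itself."""
    return [x[:i] for i in range(len(x) + 1)]


def is_lcp(z: str, x: str, y: str) -> bool:
    """Evaluate the first-order definition of 'z is the longest common prefix'.

    The quantified variable w is restricted to prefixes of x, which is
    enough because the antecedent of the implication requires w <= x.
    """
    return (
        is_prefix(z, x)
        and is_prefix(z, y)
        and all(is_prefix(w, z) for w in prefixes(x) if is_prefix(w, y))
    )
```

For example, `is_lcp("ab", "abc", "abd")` holds, while `is_lcp("a", "abc", "abd")` does not, since the common prefix "ab" is not a prefix of "a".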
Furthermore, algebras of definable sets may be identical in terms of the string languages they define, but differ considerably in the n-ary string relations in the definable algebra. We thus say that an algebra of definable sets based on $\langle \Sigma^*, \Omega \rangle$, with $\Omega$ [...] $\Omega_{reg}$, is a regular algebra of definable sets if the subsets of $\Sigma^*$ in it (i.e., the one-dimensional definable sets of $\langle \Sigma^*, \Omega \rangle$) are exactly the regular sets. We likewise say that the algebra based on definable sets for $\langle \Sigma^*, \Omega \rangle$ is a star-free algebra of definable sets if the subsets of $\Sigma^*$ in the algebra are exactly the star-free sets. We then study new examples of regular and star-free definable algebras. We give an example of a star-free algebra with considerably more expressive power than the basic star-free algebra $S$. This model, which we denote by $S_{left}$ (as it allows one to add characters on the left of a string), shares most of the desirable properties of $S$: in particular, it has quantifier-elimination in a natural language, and the membership test in this algebra has low complexity. More surprisingly, perhaps, we give examples of regular algebras (which we denote $S_{reg}$ and $S_{reg,left}$) that are strictly contained in $S_{len} = \langle \Sigma^*, \Omega_{reg} \rangle$. Although the one-dimensional sets in these algebras are still the regular sets, the algebra as a whole shares many of the attractive properties of the star-free languages. In particular, we give quantifier-elimination results for these algebras. In contrast to this, we present a result giving a partial answer to open question 0 in [55], which asks whether $S_{len}$ itself has quantifier-elimination in a reasonable signature. We show that it does not have quantifier-elimination in any relational signature of bounded arity but does have quantifier-elimination [...]

[...] RC($S$), however, is unable to express certain natural queries, e.g., SELECT $a \cdot x$ FROM R, where $a$ is a fixed character. We contrast this to the query language RC($S_{len}$) formed over the richest model.
This extension has much greater expressiveness: it enables additional operations such as trimming/adding symbols on both left and right of a string, and the SIMILAR pattern-matching for checking membership in a regular language [41]. We show that this language also satisfies criteria 2 and 3 above, but in RC($S_{len}$) one can express NP-complete and coNP-complete problems. This leads us to the consideration of the three intermediate languages, RC($S_{left}$), RC($S_{reg}$), and RC($S_{reg,left}$). We find that each of these languages satisfies all three of the required criteria, while considerably extending the expressive power of RC($S$).

Related Work: One motivation of our approach was the study of automatic structures [48, 14], which are a subclass of recursive structures [43], and were introduced as a generalization of automatic groups [30]. In an automatic structure $M = \langle \Sigma^*, \Omega \rangle$, every predicate in $\Omega$ is definable by a finite automaton. More precisely, an n-ary predicate $P$ is given by a letter-to-letter n-automaton [29, 34]. These structures were also studied in [45] in connection with decidability questions for first-order theories. It is known [19, 14] that a structure is automatic iff it can be interpreted in the structure $S_{len}$; hence $S_{len}$ is in some sense the universal automatic structure. The first part of this paper can be seen as a study of subclasses of automatic structures definable within $S_{len}$ that are significantly more restrictive, and that might have stronger model-theoretic or computational properties than a rich structure like $S_{len}$. The structure $S_{left}$, without the prefix relation, is useful for modeling queues and it first appeared in the verification context [16], where an algorithm for deciding existential sentences was given. That algorithm was extended to the full theory in [60], but still without the prefix relation.
On the database side, several approaches toward unifying string algebras with relational algebra have been developed in the prior literature. Most of them are based on the concatenation operator, or other operations that make logics undecidable in general. [36] studied the consequences of adding pattern-matching features to SQL. Papers [39, 42, 38] proposed an extension of the relational calculus with alignment logics and studied their complexity and expressive power. Without restrictions, they can define an arbitrary r.e. set [39]. Another approach was proposed in [17, 18], which considered Datalog extended with appropriate transducers for string operations, and proved a number of completeness results. In [24] arbitrary regions (substrings) can be queried; this, when coupled with relational calculus, gives the power of string concatenation. Closer to our approach, [40, 59] study the relational calculus/algebra extended with an operation for concatenating strings. [25] studies first-order logic over term algebras and extends expressive
doi:10.1145/876638.876642 fatcat:dnyytdskv5holgdahoekdxucki