Generating Plans from Proofs

Michael Benedikt, Balder Ten Cate, Efthymia Tsamoura
2016 ACM Transactions on Database Systems  
We present algorithms for answering queries making use of information about source integrity constraints, access restrictions, and access costs. Our method can exploit the integrity constraints to find plans even when there is no direct access to relations appearing in the query. We look at different kinds of plans, depending on the kind of relational operators that are permitted within their commands. To each type of plan we associate a semantic property that is necessary for having a plan of
more » ... hat type. The key idea of our method is to move from a search for a plan to a search for a proof of the corresponding semantic property, and then generate a plan from a proof. We provide algorithms for converting proofs to plans, and show that they will find a plan of the desired type whenever such a plan exists. We show that while discovery of one proof allows us to find a single plan that answers the query, we can explore alternative proofs to find lower-cost plans. The information available via an interface may or may not be sufficient to answer a query. The reformulation problem in this case is to determine whether there is sufficient information, and if so to generate a query making use of the view predicates. Example 1.1. Consider a database with a table Professor containing ids, last names, and departments of professors, as well as a table Student listing the id and last name of each students, as well as their advisor's id. The database does not allow users to access the Professor and Student table directly, but instead exposes a view Professor where the id attribute is dropped, and a table Student where the advisor's id is replaced with the advisor's last name. That is, Professor is a view defined by the query: {lname, dname | ∃profid Professor(profid, lname, dname)} or equivalently by the constraints: Student is a view defined by the query: {studid, lname, profname | ∃profid ∃dname Student(studid, lname, profid) ∧ Professor(profid, profname, dname)} or equivalently by constraints: Consider a query asking for the last names of all students that have an advisor in the history department. This query can not be answered using the information in the views, since knowing the advisor's name is not enough to identify the department. On the other hand, the views are clearly sufficient to answer a query asking for the last names of students whose advisor has last name Jones, and we can reformulate that query as a selection over Student on profname "Jones".2 Naturally, constraints need not come from views. A natural use of constraints is to represent relationships between sources, such as overlap in the data. This overlap can be exploited to take a query that is specified over a source that a priori does not have sufficient data, and reformulate it over a source that provides the necessary data. Example 1.2. We consider an example schema from [Onet 2013] with a relation Employee where a row contains an employee's id, the employee's name, and the id of the employee's department, and also a relation Department, with each row containing the department's id, the department's name, and the id of the department's manager. The schema also contains the following two constraints: ∀deptid ∀dname ∀mgrid Department(deptid, dname, mgrid) → ∃N Employee(mgrid, N, deptid) ∀eid ∀ename ∀deptid Employee(eid, ename, deptid) → ∃D∃M Department(deptid, D, M ) That is, every department has a manager, and every employee works in a department. Suppose further that only the relation Department is accessible to a certain class A:3 of users. Intuitively, it should still be possible to answer some questions that one could ask concerning the relation Employee, making use of the accessible relation Department. For example, suppose a user poses the query asking for all department ids of employees, writing it like this: Q = {deptid | ∃eid ∃ename Employee(eid, ename, deptid)} Renaming the variables, the query can be reformulated as: Access methods and binding patterns. We will look at a finer notion of interface based on binding patterns, which state that a relation can only be accessed via lookups where certain arguments must be given. The most obvious example is a relation that can only be accessed via an indexed lookup on a certain subset of the attributes. Another example of restricted interfaces that can be modeled using relations with binding patterns comes from web forms. Thinking of the form as exposing a virtual table, the mandatory fields must be filled in by the user, while submitting the form returns all tuples that match the entered values. A third example comes from web services, where the mandatory fields correspond to arguments of a function call. Example 1.3. Consider a Profinfo table containing information about faculty, including their last names, office number, and id, but with a restricted interface that requires giving an id as an input. The query Q asking for ids of faculty named "Smith" cannot be answered over this schema. That is, there is no query over the schema that will return exactly the set of tuples satisfying Q. But suppose another source has a Udirectory table containing the id and last name of every university employee, with an interface that allows one to access the entire contents of the table. Then we can reason that Q has a plan that answers it: a plan would pull tuples from the Udirectory table, select those corresponding to "Smith", and check them within the Profinfo table. 2 In the above example, reasoning about access considerations was straightforward, but in the presence of more complex schemas we may have to chain several inferences, resulting in a plan that may make use of several auxiliary accesses. Example 1.4. We consider two telephone directory datasources with overlapping information. One source exposes information from Direct1(uname, addr, uid) via an access requiring a uname and uid. There is also a table Ids(uid) with no access restriction, that makes available the set of uids (hence a referential constraint from Direct1 into Ids on uid). The other source exposes Direct2(uname, addr, phone), requiring a uname and addr, and also a table Names(uname) with no access restriction that reveals all unames in Direct2 (that is, a referential constraint from Direct2 to Names). There is also a referential constraint from Direct2 to Direct1 on uname and addr. Consider a query asking for all phone numbers in the second directory: Q = {phone | ∃ uname ∃addr Direct2(uname, addr, phone)}. There is a plan that answers this query: it gets all the uids from Ids and unames from Names first, puts them into the access on Direct1, then uses the uname and addr of the resulting tuples to get the phone numbers in Direct2.2 We emphasize that our goal in this work is getting plans which give complete answers to queries. This means that if we have a query asking for the office number of all professors with last name "Smith", the plan produced should return all tuples in the answer, even if access to the Professor relation is limited.
doi:10.1145/2847523 fatcat:szrzwhessvhs5coqcsebtwqeva