An expressive and efficient language for XML information retrieval
Journal of the American Society for Information Science and Technology
Existing XML query languages do not support ranked query results based on textual similarity. For example, Fig. 1 shows an XML database containing books and CDs. We are interested in information-retrieval-style queries such as "order the items by similarity to the phrase 'traditional Ukrainian"' or "find books and CDs with similar titles. Unlike similar efforts, our expressive and efficient language for XML information retrieval (ELIXIR) allows such queries. ELIXIR extends XML-QL  with a
... ual similarity operator. Fig. 1 shows an ELIXIR query Q1 that finds books and CDs with similar titles. Some related query languages (e.g. [5, 7] ) provide only boolean keyword filtering, not ranked retrieval based on textual similarity. Other languages (e.g. [9, 6]) permit similarity comparisons only between a data value and a constant, but not similarity joins across two data values, and thus cannot express Q1. A more complicated ELIXIR query is shown in Fig. 2 . As can be seen from the output, this query finds recent SIGMOD Record publications, New Testament verses, and lines from Macbeth that are similar. Similarity joins are not merely of theoretical interest. For example, in data integration applications involving reconciling heterogeneous textual identifiers, similarity joins can eliminate the need for either common domains or handcrafted normalization routines. A naive implementation of a similarity join between two variables would generate the full cross product of the variable bindings, and then compute the similarity of every pair. Our ELIXIR query processing algorithm avoids this pitfall by: 1. rewriting an ELIXIR query Q1 into a series of XML-QL Q i 2 queries that generate intermediate relational data; 2. invoking WHIRL  to efficiently evaluate Q1's similarity predicates on this intermediate data; and 3. translating WHIRL's output into the XML structure specified by Q1 with a final XML-QL query Q4. Note that ELIXIR queries XML data in its native format, without the potentially expensive operation of flattening it into relations [4, 8] . Fig. 1 shows how the ELIXIR query processing would rewrite Q1 into a series of XML-QL queries Q i 2 , a WHIRL query Q3, and a final XML-QL query Q4. Also shown are the intermediate data that would be passed between the queries, and the final XML output. The algorithm first generates two XML-QL Q i 2 queries. Q 1 2 retrieves books from the XML database, and Q 2 2 retrieves CDs. Note that ELIXIR retrieves these items separately to avoid computing the cross product of book/CD pairs. The algorithm then generates the WHIRL query Q3, which applies Q1's similarity predicate. Q3's output is the desired result, but it is transformed into XML format by executing Q4. ELIXIR's query rewriting algorithm runs in time polynomial in the size of the query, and thus ELIXIR's additional functionality adds negligible overhead to the cost of the underlying XML-QL and WHIRL implementations. Experiments with our ELIXIR prototype demonstrate that our query processing algorithm scales reasonably well with respect to the size and complexity of an ELIXIR query, the size of the intermediate data generated by WHIRL, and the number of similarity-join predicates. Our implementation of ELIXIR extends XML-QL, but we are currently exploring ways of extending the ELIXIR algorithm to other XML query languages such as Quilt  . Another area of future research is query optimization. The ELIXIR query processor follows a strict three-stage policy for rewriting the original query, but this may be sub-optimal in some cases. We are exploring techniques for automatically searching the space of query rewritings in order to find one that is optimal with respect to standard metrics such as intermediate data sizes.