Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, Jens Teubner
2006 Proceedings of the 2006 ACM SIGMOD international conference on Management of data - SIGMOD '06  
Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational
query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state of the art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system are evaluated on the XMark benchmark up to data sizes of 11 GB. The performance section also provides an extensive comparison of all major XMark results published previously, which confirms that the goal of purely relational XQuery processing, namely speed and scalability, was met.

We evaluate and compare performance on the XMark benchmark and some synthetic tests to benchmark document shredding and serialization performance, and also include a survey of previously published XMark results.

Contributions. The main contribution of our system is to show that the relational XQuery paradigm can indeed leverage the power of mature relational database technology to deliver speed and scalability in the XML domain. Specifically, our system stores XML and manipulates XQuery sequence data purely using relational algebra on relational tables, without data type extensions or changes to the RDBMS storage manager. We do extend the relational query evaluator with a staircase join [19] operator, but this is not strictly needed; it only accelerates XPath location steps. In building the system, we learned valuable lessons regarding RDBMS functionality that can help to improve the performance of relational XQuery.
One prominent opportunity is the use of positional algorithms to support lookup into SQL autoincrement key columns. We also present here the two most significant technical innovations to the relational XQuery paradigm (i.e., those that improve performance by more than an order of magnitude). First, we discovered that staircase join, a technique originally developed for XPath evaluation, falls short of adequately evaluating XQuery, as it cannot efficiently deal with XPath expressions embedded in nested for-loops. The new loop-lifted staircase join presented here addresses this problem as a fast execution algorithm suitable for an XQuery processor providing the full axis feature. Second, we formulate relational query optimization strategies that are specifically suited to efficiently evaluate the query plans originating from XQuery compilation. We define a small number of column properties that are used to drive a peephole optimization stage just before relational code generation. This stage recognizes join patterns in a way that is immune to syntactic variance in XQuery queries, and it also avoids expensive sorting operations. We also formulate relational XQuery join evaluation strategies that exploit the existential semantics of general comparisons in XQuery (in contrast to plain relational joins).

Outline. Section 2 introduces the basic concepts of relational XML storage and the XQuery compilation scheme we use. Sections 3 and 4 discuss our contributions in the area of loop-lifted staircase join and join optimization, respectively. In Section 5, we fill in a number of details of our implementation regarding the storage scheme used, and the way query optimization and updates are handled. Section 6 focuses on XMark and provides performance results up to XML documents of 11 GB. We also summarize all previously published XMark results to put our system in its proper perspective.
We wrap up by discussing related work in Section 7 and outlining our conclusions in Section 8.
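
To make the idea of a range-based encoding concrete, the following minimal Python sketch shreds a small XML document into rows and answers XPath-like steps with range selections. It is an illustration under stated assumptions, not the system's actual storage scheme: the column names `pre`, `size`, and `level`, the helper names `shred`, `descendants`, and `children`, and the restriction to element nodes are all choices made for this example only.

```python
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Encode element nodes as rows (pre, size, level, tag) in document order.

    pre   = preorder rank of the node (dense, starting at 0)
    size  = number of descendants of the node
    level = depth of the node (root = 0)
    Assumption of this sketch: text and attribute nodes are ignored.
    """
    rows = []

    def visit(elem, level):
        pre = len(rows)
        rows.append(None)  # reserve the slot so that pre stays dense
        for child in elem:
            visit(child, level + 1)
        size = len(rows) - pre - 1
        rows[pre] = (pre, size, level, elem.tag)

    visit(ET.fromstring(xml_text), 0)
    return rows


def descendants(rows, pre):
    """Descendant step as a range selection: pre < pre' <= pre + size."""
    _, size, _, _ = rows[pre]  # positional lookup, because pre is dense
    return rows[pre + 1 : pre + size + 1]


def children(rows, pre):
    """Child step: descendants that are exactly one level deeper."""
    _, _, level, _ = rows[pre]
    return [r for r in descendants(rows, pre) if r[2] == level + 1]


table = shred("<a><b><c/><d/></b><e><f/></e></a>")
```

Because `pre` is a dense key in this sketch, looking up a node is a plain array index rather than a search, which is in the spirit of the positional lookup on autoincrement key columns mentioned above.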
doi:10.1145/1142473.1142527 dblp:conf/sigmod/BonczGKMRT06 fatcat:ebu7v3o5m5f3jkynowhlsh6sf4