Query processing in multistore systems: an overview

Carlyna Bondiombouy, Patrick Valduriez
2016 International Journal of Cloud Computing  
Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.), each optimised for one kind of data and tasks. However, the wide diversification of data store interfaces makes it difficult to access and integrate data from multiple data stores. This important problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or
more ... query languages. In this paper, we give an overview of query processing in multistore systems. We start by introducing the recent cloud data management solutions and query processing in multidatabase systems. Then, we describe and analyse some representative multistore systems, based on their architecture, data model, query languages and query processing techniques. To ease comparison, we divide multistore systems based on the level of coupling with the underlying data stores, i.e., loosely-coupled, tightly-coupled and hybrid. Our analysis reveals some important trends, which we discuss. We also identify some major research issues.

In this paper, we give an overview of query processing in multistore systems. The objective is not to give an exhaustive survey of all systems and techniques, but to focus on the main solutions and trends, based on the study of nine representative systems (three for each class).

The rest of the paper is organized as follows. In Section 2, we introduce cloud data management, including distributed file systems, NoSQL systems and data processing frameworks. In Section 3, we review the main query processing techniques for multidatabase systems, based on the mediator-wrapper architecture. In Section 4, we analyze the three kinds of multistore systems, based on their architecture, data model, query languages and query processing techniques. Finally, Section 5 concludes and discusses open issues.

2 Cloud Data Management

A cloud architecture typically consists of multiple sites, i.e. data centers at different geographic locations, each one providing computing and storage resources as well as various services such as application (AaaS), infrastructure (IaaS), platform (PaaS), etc. To provide reliability and availability, there is always some form of data replication between sites.
For managing data at a cloud site, we could rely on RDBMSs, all of which have distributed and parallel versions. However, RDBMSs have lately been criticized for their "one size fits all" approach [SAD+10]. Although they have been able to integrate support for all kinds of data (e.g. multimedia objects, XML documents) and new functions, this has resulted in a loss of performance, simplicity and flexibility for applications with specific, tight performance requirements. Therefore, it has been argued that more specialized DBMS engines are needed. For instance, column-oriented DBMSs [AMH08], which store the values of each column together rather than storing whole rows together as in traditional row-oriented RDBMSs, have been shown to perform more than an order of magnitude better on Online Analytical Processing (OLAP) workloads. Similarly, Data Stream Management Systems (DSMSs) are specifically architected to deal efficiently with data streams, which RDBMSs cannot even support [NPP13].

The "one size does not fit all" argument generally applies to cloud data management as well. However, internal clouds used by enterprise information systems, in particular for Online Transaction Processing (OLTP), may use traditional RDBMS technology. On the other hand, for OLAP workloads and web-based applications on the cloud, RDBMSs provide both too much (e.g. transactions, complex query language, lots of tuning parameters) and too little (e.g. specific optimizations for OLAP, flexible programming model, flexible schema, scalability) [Ram09].

Some important characteristics of cloud data have been considered for designing data management solutions. Cloud data can be very large, unstructured or semi-structured, and typically append-only (with rare updates). Cloud users and application developers may be numerous, but they are typically not DBMS experts.
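To make the row-oriented versus column-oriented distinction mentioned above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sales table, its attribute names and all function names are hypothetical and do not come from the paper or from any particular DBMS, and in-memory Python lists stand in for on-disk layouts.

```python
from typing import Dict, List, Tuple

# Hypothetical record type for illustration: (order_id, region, amount).
Row = Tuple[int, str, float]


def to_row_store(rows: List[Row]) -> List[Row]:
    """Row layout: the fields of each record are kept together."""
    return list(rows)


def to_column_store(rows: List[Row]) -> Dict[str, list]:
    """Column layout: all values of one attribute are kept together."""
    return {
        "order_id": [r[0] for r in rows],
        "region": [r[1] for r in rows],
        "amount": [r[2] for r in rows],
    }


def total_amount_rows(row_store: List[Row]) -> float:
    # Visits every full record although only one field is needed.
    return sum(r[2] for r in row_store)


def total_amount_columns(col_store: Dict[str, list]) -> float:
    # Reads only the "amount" column; other attributes are never touched.
    return sum(col_store["amount"])


# Assumed sample data, not taken from the paper.
rows = [(1, "EU", 10.0), (2, "US", 20.5), (3, "EU", 4.5)]
row_store = to_row_store(rows)
col_store = to_column_store(rows)
```

Both functions compute the same aggregate. The difference lies in how much data each one has to scan: an OLAP-style aggregate over one attribute only needs that attribute's values in the column layout, which is the intuition behind the advantage of column-oriented DBMSs on such workloads.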
Therefore, current cloud data management solutions have traded ACID (Atomicity, Consistency, Isolation, Durability) transactional properties for scalability, performance, simplicity and flexibility. The preferred approach of cloud providers is to exploit a shared-nothing cluster [ÖV11], i.e. a set of loosely connected computer servers with a very fast, extensible interconnect (e.g. InfiniBand). When using commodity servers with internal direct-attached storage, this approach provides scalability with an excellent performance-cost ratio. Compared to traditional DBMSs, cloud data management uses a different software stack with the following layers: distributed storage, database management and distributed processing. In the rest of this section, we introduce this software stack and present the different layers in more detail.

Cloud data management (see Figure 1) relies on a distributed storage layer, whereby data is typically stored in files or objects distributed over the nodes of a shared-nothing cluster. This is one major difference from the software stack of current DBMSs, which relies on block storage. Interestingly, the software stack of the first DBMSs was not very different from that used now in the cloud. The history of DBMSs helps in understanding the evolution of this software stack. The very first DBMSs, based on the hierarchical or network models, were built as extensions of a file system, such as COBOL, with inter-file links. The first RDBMSs were also built on top of a file system. For instance, the famous Ingres RDBMS [SKWH76] was implemented atop the Unix file system. But using a general-purpose file system made data access quite inefficient, as the DBMS could have no control over data clustering on disk or cache management in main memory. The main criticism of this file-based approach was the lack of operating system support for database management (at that time) [Sto81].
As a result, the architecture of RDBMSs evolved from file-based to block-based, using a raw disk interface provided by the operating system. A block-based interface provides direct, efficient access to disk blocks (the unit of storage allocation on disks). Today all RDBMSs are block-based, and thus have full control over disk management. The evolution towards parallel DBMSs kept the same approach, in particular, to ease the transition from centralized systems. Parallel DBMSs use either a shared-nothing or shared-disk architecture. With
doi:10.1504/ijcc.2016.080903 fatcat:etteedwysbcyfidbaeh7x2lrne