Chabok: a Map-Reduce based method to solve data warehouse problems

Mohammadhossein Barkhordari, Mahdi Niamanesh
2018 Journal of Big Data  
Abstract
Currently, immense quantities of data cannot be managed by traditional database management systems. Instead, they must be managed by big data solutions using shared nothing architectures. Data warehouse systems are systems that address very large amounts of information. The most prominent data warehouse model is the star schema, which consists of a fact table and a number of dimension tables. Executing queries on the data warehouse requires joining the facts and dimensions. In a shared nothing architecture, all of the required information is not placed on a single node, so information must be retrieved from other nodes, which causes network congestion and slow query execution. To avoid this problem and achieve maximum parallelism, dimensions can be replicated over nodes if they are not too large. However, if a single dimension exceeds the capacity of a node, or the dimensions together exceed node capacity, query execution confronts serious problems. In big data problems, the amount of data is immense, so replicating it cannot be considered an appropriate method. In this paper, we propose a method called Chabok, which uses two-phased Map-Reduce to solve the data warehouse problem. In this method, aggregation is performed completely on Mappers, and intermediate results are sent to the Reducer. Chabok does not need data replication for join omission. The proposed method was implemented on Hadoop, and TPC-DS queries were executed for benchmarking. The query execution time on Chabok surpassed prominent big data products for data warehousing.

Introduction
Existing information is a valuable asset for many different types of organizations. Storing and analysing information can solve many problems within an organization [1]. The results from data analyses help organizations make correct decisions and provide better services for customers. Thus, high-speed storage and retrieval of the large volumes of data generated by electronic devices and software systems are critical issues [2] [3] [4]. Many organizations consider big data solutions because they cannot manage their data with traditional database management systems [5]; therefore, they must seek drastic measures for the design and implementation of new systems according to big data architectures. These organizations must change their architectures from …
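The abstract's key idea, that each Mapper performs its aggregation completely and sends only small intermediate results to the Reducer, can be illustrated with a minimal sketch. This is not the authors' implementation: the table layout (a fact table of `(store_id, sales_amount)` rows partitioned across nodes) and all names here are hypothetical, and the sketch simulates the map and reduce phases with plain Python functions.

```python
# Illustrative sketch (assumed example, not Chabok's actual code):
# each mapper fully aggregates its local fact-table partition, so the
# reducer only merges small per-mapper partials -- no fact/dimension
# join data is shuffled between nodes.
from collections import defaultdict

def mapper(partition):
    """Aggregate locally: emit one (key, partial_sum) pair per group."""
    partials = defaultdict(float)
    for store_id, sales_amount in partition:
        partials[store_id] += sales_amount
    return dict(partials)

def reducer(all_partials):
    """Merge the small per-mapper partials into the final aggregate."""
    totals = defaultdict(float)
    for partial in all_partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)

# Each node holds one partition of the fact table.
partitions = [
    [(1, 10.0), (2, 5.0), (1, 2.5)],
    [(1, 1.5), (2, 4.0), (3, 7.0)],
]
result = reducer(mapper(p) for p in partitions)
print(result)  # {1: 14.0, 2: 9.0, 3: 7.0}
```

Because each mapper's output is bounded by the number of groups rather than the number of fact rows, the data sent to the reducer stays small regardless of partition size, which is the property the paper relies on to avoid replicating large dimensions.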
doi:10.1186/s40537-018-0144-5