A Systematic Review of SQL-on-Hadoop by Using Compact Data Formats

Daiga Plase
2017 Baltic Journal of Modern Computing  
There are massive amounts of data generated by IoT devices, online transactions, click streams, emails, logs, posts, social-networking interactions, sensors, mobile phones and their applications. The question is where and how to store these data in order to provide faster access. Understanding and handling Big Data is a major challenge. Research on Big Data projects using Hadoop technology, MapReduce-style frameworks and compact data formats such as RCFile, SequenceFile, ORC, Avro and Parquet shows that only two of these formats (Avro and Parquet) support schema evolution and compression, allowing them to use less storage space. In this paper, a systematic review of SQL-on-Hadoop using compact data formats (Avro and Parquet) has been performed over the past six years (2010-2015). Following the search strategy, 94 research papers were identified, of which 17 were analyzed as relevant. This work outlines the usage of the Avro and Parquet data formats in publications from conference proceedings, journals and magazines of IEEE Xplore, the ACM Digital Library and ScienceDirect. The review concludes that a direct comparison of Avro and Parquet in terms of compactness and speed does not yet exist in the data-science literature.
doi:10.22364/bjmc.2017.5.2.06