A Comparison of ORC-Compress Performance with Big Data Workload on Virtualization

Kritwara Rattanaopas, Sureerat Kaewkeerat, Yanapat Chuchuen
2016 Applied Mechanics and Materials  
Big Data is now widely used in many organizations. Hive is an open-source data warehouse system for managing large data sets. It provides a SQL-like interface to Hadoop on top of the Map-Reduce framework. Big Data solutions are beginning to adopt HiveQL to improve the execution time of queries over relational data. In this paper, we investigate query execution time, comparing two compression algorithms for ORC files: ZLIB and SNAPPY. The results show that ZLIB can compress data by up to 87% compared to uncompressed (NONE) data. This is better than SNAPPY, which achieves a space saving of 79%. However, the key to reducing execution time is Map-Reduce: query execution time was lower when the number of mappers equaled the number of data nodes. For example, all query suites on 6 nodes (ZLIB/SNAPPY) with a 250-million-row table had execution times quite similar to those on 9 nodes (ZLIB/SNAPPY) with a 350-million-row table.
doi:10.4028/www.scientific.net/amm.855.153 fatcat:vdy2ovsjbzamhly4zkwpiywtgm
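
The space-saving percentages quoted in the abstract can be read as the fraction of storage saved relative to uncompressed data. The following is a minimal sketch in Python, assuming the paper uses this usual definition. The function name `space_saving` and the byte sizes are hypothetical, chosen only so that the arithmetic reproduces the reported 87% and 79%; they are not measurements from the paper.

```python
def space_saving(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """Return the fraction of storage saved relative to the uncompressed size."""
    if uncompressed_bytes <= 0:
        raise ValueError("uncompressed size must be positive")
    return 1.0 - compressed_bytes / uncompressed_bytes


# Hypothetical table sizes in bytes (NOT from the paper), picked to match
# the percentages reported in the abstract.
sizes = {"NONE": 100_000, "ZLIB": 13_000, "SNAPPY": 21_000}

# Space saving of each ORC compression setting relative to NONE, in percent.
savings = {
    codec: round(space_saving(sizes["NONE"], size) * 100)
    for codec, size in sizes.items()
}

for codec, percent in savings.items():
    print(f"{codec}: {percent}% space saving")
```

Under this definition, a higher percentage means a smaller file on disk, which is the sense in which ZLIB is said to outperform SNAPPY on storage.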