Enabling scientific data storage and processing on big-data systems

Saman Biookaghazadeh, Yiqi Xu, Shujia Zhou, Ming Zhao
2015 IEEE International Conference on Big Data (Big Data), 2015
Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to directly store and process scientific data. Specifically, it enables Hadoop to efficiently store NetCDF data on HDFS and process it in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also presents an evaluation of the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves substantial speedup (up to 20 times) and space saving (83% reduction) compared to the traditional approach, which must convert NetCDF data to CSV format before Hadoop and Hive can use it.
doi:10.1109/bigdata.2015.7363978 dblp:conf/bigdataconf/BiookaghazadehX15 fatcat:cgp433manne55daz6n4ll2jqdi