Scalable in situ scientific data encoding for analytical query processing

Sriram Lakshminarasimhan, David A. Boyuka, Saurabh V. Pendse, Xiaocheng Zou, John Jenkins, Venkatram Vishwanath, Michael E. Papka, Nagiza F. Samatova
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), 2013
The process of scientific data analysis in high-performance computing environments has been evolving along with the advancement of computing capabilities. With the onset of exascale computing, the widening gap between compute performance and I/O bandwidth has made the traditional method of post-simulation processing increasingly tedious. Despite the challenges posed by increased data production, there is an opportunity to leverage "cheap" computing power to perform query-driven exploration and visualization during simulation time. To accelerate such analyses, applications traditionally augment raw data with large indexes, generated post-simulation, which are then repeatedly utilized for data exploration. However, generating current state-of-the-art indexes involves compute- and memory-intensive processing, rendering them inapplicable in an in situ context. In this paper we propose DIRAQ, a parallel in situ, in-network data encoding and reorganization technique that transforms simulation output into a query-efficient form with negligible runtime overhead to the simulation run. DIRAQ begins with an effective core-local, precision-based encoding approach, which incorporates an embedded compressed index that is 3-6x smaller than current state-of-the-art indexing schemes. DIRAQ then applies an in-network index merging strategy, enabling the creation of aggregated indexes ideally suited for spatial-context querying that speed up query responses by up to 10x versus alternative techniques. We also employ a novel aggregation strategy that is topology-, data-, and memory-aware, resulting in efficient I/O and yielding an overall end-to-end encoding and I/O time that is less than the time required to write the raw data with MPI collective I/O.
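To make the core-local, precision-based encoding step concrete, the sketch below bins double-precision values by their high-order IEEE-754 bits and builds an inverted index from bin keys to record ids, so that a range query only needs to examine the bins whose key range overlaps its predicate before refining candidates against the full-precision values. This is a minimal single-core sketch under assumed details: the bit width SIG_BITS, the naive flat bin table, and all function and type names are illustrative rather than DIRAQ's actual layout or API.

/* Sketch of precision-based encoding: values are binned by their high-order
 * IEEE-754 bits, and an inverted index maps each bin to the record ids that
 * fall in it. Names and constants are illustrative, not DIRAQ's API. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIG_BITS 16   /* number of high-order bits kept as the bin key (assumed) */

typedef struct {
    uint64_t *rids;   /* record ids belonging to this bin */
    size_t count;
} Bin;

/* Extract the top SIG_BITS of a double's bit pattern as its bin key. */
static uint32_t bin_key(double v) {
    uint64_t bits;
    memcpy(&bits, &v, sizeof(bits));
    return (uint32_t)(bits >> (64 - SIG_BITS));
}

int main(void) {
    double data[] = { 1.5, 1.51, 2.0, -3.25, 2.1, 1.49 };
    size_t n = sizeof(data) / sizeof(data[0]);

    /* Naive flat bin table: one slot per possible key (fine for a sketch). */
    size_t nbins = (size_t)1 << SIG_BITS;
    Bin *bins = calloc(nbins, sizeof(Bin));

    for (size_t i = 0; i < n; i++) {
        uint32_t k = bin_key(data[i]);
        bins[k].rids = realloc(bins[k].rids, (bins[k].count + 1) * sizeof(uint64_t));
        bins[k].rids[bins[k].count++] = i;   /* inverted index: bin -> record ids */
    }

    /* A range query scans only the bins overlapping the predicate, then
     * refines the surviving record ids against the full-precision values. */
    for (size_t k = 0; k < nbins; k++) {
        if (bins[k].count) {
            printf("bin 0x%04x: %zu records\n", (unsigned)k, bins[k].count);
            free(bins[k].rids);
        }
    }
    free(bins);
    return 0;
}

DIRAQ's actual encoder additionally compresses the embedded index and merges the per-core indexes in-network into aggregated, spatially contiguous indexes before I/O; both of those steps are omitted from this sketch.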
doi:10.1145/2462902.2465527