Efficient document analytics on compressed data

Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen
2018 Proceedings of the VLDB Endowment  
Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable direct document analytics on compressed data. We present how the concept can be materialized on Sequitur, a compression algorithm that produces hierarchical grammar-like representations. We discuss the major challenges in applying the idea to various document analytics tasks, and reveal a set of guidelines and also assistant software modules for developers to effectively apply compression-based direct processing. Experiments show that our proposed techniques save 90.8% storage space and 77.5% memory usage, while speeding up data processing significantly, i.e., by 1.6X on sequential systems, and 2.2X on distributed clusters, on average.
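
To make the idea of analytics on compressed data concrete, here is a minimal Python sketch of a word count computed directly on a Sequitur-style grammar. It is an illustration only, not the authors' implementation: the toy grammar, the rule numbering, and the function names `word_counts` and `decompress` are all hypothetical. The point it tries to show is that each rule's counts are computed once and reused wherever the rule is referenced, so repeated text is never re-scanned.

```python
from collections import Counter

# Hypothetical toy grammar in the style of Sequitur output (an assumption,
# not data from the paper). Rule 0 is the root; each rule body mixes
# words (str) with references to other rules (int).
GRAMMAR = {
    0: [1, "c", 1, 2],
    1: ["a", "b", 2],
    2: ["d", "e"],
}


def word_counts(grammar, root=0):
    """Count words by traversing the grammar, without expanding the text."""
    memo = {}  # rule id -> Counter of the words that rule expands to

    def counts(rule_id):
        if rule_id not in memo:
            total = Counter()
            for sym in grammar[rule_id]:
                if isinstance(sym, int):
                    # Reuse the cached counts of the referenced rule.
                    total.update(counts(sym))
                else:
                    total[sym] += 1
            memo[rule_id] = total
        return memo[rule_id]

    return counts(root)


def decompress(grammar, root=0):
    """Expand the grammar into the full word sequence (baseline for checking)."""
    out = []
    for sym in grammar[root]:
        if isinstance(sym, int):
            out.extend(decompress(grammar, sym))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    print(word_counts(GRAMMAR))
    print(Counter(decompress(GRAMMAR)))
```

Both calls should report the same counts. The first visits each rule body once, while the second walks the fully expanded text.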
doi:10.14778/3236187.3236203 fatcat:jlcchu6kqja3tir4zo6y32prhy