HadoopXML

Hyebong Choi, Kyong-Ha Lee, Soo-Hyong Kim, Yoon-Joon Lee, Bongki Moon
2012 Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12  
The volume of XML data is tremendous in many areas, especially in data logging and scientific domains. XML data in these areas accumulate over time as new data are continuously collected. Processing massive XML data against multiple twig pattern queries issued by multiple users in a timely manner is a challenge. We showcase HadoopXML, a system that simultaneously processes many twig pattern queries over a massive volume of XML data with Hadoop. Specifically, HadoopXML provides an efficient way to process a single large XML file in parallel. It processes multiple twig pattern queries simultaneously with a shared input scan, so users do not need to iterate M/R jobs for each query. HadoopXML also saves many I/Os by enabling twig pattern queries to share their path solutions with each other. Moreover, HadoopXML provides a sophisticated runtime load balancing scheme for fairly assigning multiple twig pattern joins across nodes. With synthetic and real-world XML datasets, we demonstrate how efficiently HadoopXML processes many twig pattern queries in a shared and balanced way.
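As a rough illustration of the shared input scan mentioned in the abstract, the sketch below answers several path queries in a single pass over one XML stream instead of re-reading the input once per query. It is not HadoopXML's implementation: it runs on a single machine with Python's standard library, it handles only absolute root-to-node paths rather than full twig patterns, and the names `shared_scan`, `queries`, and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def shared_scan(xml_stream, queries):
    """Answer several absolute path queries in one pass over the input.

    `queries` maps a query name to a path such as "/log/entry/user".
    Assumption: only plain root-to-node paths are supported here; real
    twig patterns also have branches and ancestor-descendant axes.
    """
    wanted = {name: tuple(path.strip("/").split("/"))
              for name, path in queries.items()}
    results = {name: [] for name in queries}
    stack = []  # tags of the currently open elements, root first

    for event, elem in ET.iterparse(xml_stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        # On "end" the element is complete, so its text is available.
        current = tuple(stack)
        for name, path in wanted.items():
            if path == current:
                results[name].append((elem.text or "").strip())
        stack.pop()
        elem.clear()  # drop content we no longer need to bound memory
    return results


if __name__ == "__main__":
    doc = (b"<log>"
           b"<entry><user>ann</user><action>read</action></entry>"
           b"<entry><user>bob</user><action>write</action></entry>"
           b"</log>")
    queries = {"users": "/log/entry/user", "actions": "/log/entry/action"}
    print(shared_scan(io.BytesIO(doc), queries))
```

Every registered query is checked against each element as it closes, so adding a query adds matching work but no extra read of the input, which is the cost that dominates for very large files.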
doi:10.1145/2396761.2398745 dblp:conf/cikm/ChoiLKLM12 fatcat:mni5ey5xfjfuriwvt7ubxu7ei4