Semantic-based Structural and Content indexing for the efficient retrieval of queries over large XML data repositories

Norah Saleh Alghamdi, Wenny Rahayu, Eric Pardede
2014 Future Generation Computer Systems
Highlights
• We exploit the semantics of XML Schema and data in building our index.
• We trim the search space using an object-based intersection technique for large data.
• We eliminate irrelevant portions of data by discarding irrelevant objects.
• We demonstrate efficiency by measuring CPU cost and scalability.
• We show the high precision and recall of query results.

Abstract
The emergence of XML adoption as a semi-structured data representation in multi-disciplinary domains has highlighted the need to support the optimization of complex data retrieval processing. In a Big Data environment, the need to speed up data retrieval processes has grown significantly. In this paper, we adopt an optimization approach that takes into consideration the semantics of the dataset in order to deal with the complexity of multi-disciplinary domains in Big Data, in particular when the data is represented as XML documents. Our method specifically addresses the twig XML query (or branched path query), as it is one of the most costly query tasks due to the complexity of the join operation between multiple paths. Our work focuses on optimizing the structural and the content parts of XML queries by presenting a method for indexing and processing XML data based on the concept of objects, which are formed from the semantic connectivity between XML data nodes. Our method performs object-based data partitioning, which aims at leveraging the notion of frequently accessed data subsets and putting these subsets together into adjacent partitions. It then evaluates branched queries through two essential components: (i) Structural and Content indexing, which uses an object-based connection to construct indices, i.e. the Schema Index, Data Index and Value Index; and (ii) query processing to produce the final results in optimal time. At the end of this paper, a set of experimental results for the proposed approach on a range of real and synthetic XML data, as well as a comparative study with other related work in the area, are presented to demonstrate the effectiveness of our proposed method in terms of CPU cost, matching and merging cost, scalability (size and number of branches) and total number of scanned elements. Our evaluation demonstrates the benefit of the proposed index in terms of performance speed as well as scalability, which is critical in a large data repository.

in multi-disciplinary domains [1] [2] [3].
The metadata in XML documents provides a semantically rich structure which can be leveraged for various information system applications. The metadata also opens up opportunities to improve techniques to access and process XML data. In this paper, we focus on processing XML queries efficiently by taking into consideration the semantic connectivity of the underlying XML documents. In particular, we focus on XML twig queries with or without value predicates. A twig query is a type of query which accesses XML trees with multiple branches and
doi:10.1016/j.future.2014.02.010 fatcat:liwz6tqt4bfbbjrawrazwbb2zu
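
To make the notion of a twig (branched path) query with value predicates more concrete, the following is a minimal sketch, not the authors' index or algorithm. It assumes a hypothetical `library`/`book` document and treats each `book` element as an "object"; the helper names `match_branch` and `twig_query` are illustrative only. Each branch of the twig is matched separately and the per-branch sets of object ids are then intersected, which loosely mirrors the idea of trimming the search space by object-based intersection.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data (an assumption for illustration, not from the paper).
XML = """
<library>
  <book id="b1"><title>XML Indexing</title><author>Lee</author><year>2012</year></book>
  <book id="b2"><title>Query Processing</title><author>Kim</author><year>2014</year></book>
  <book id="b3"><title>Twig Joins</title><author>Lee</author><year>2014</year></book>
</library>
"""


def match_branch(root, obj_tag, child_tag, value=None):
    """Return ids of obj_tag elements that have a child_tag child.

    If value is given, the child's text must equal it (a value predicate).
    """
    matched = set()
    for obj in root.iter(obj_tag):
        for child in obj.findall(child_tag):
            if value is None or (child.text or "").strip() == value:
                matched.add(obj.get("id"))
                break
    return matched


def twig_query(root, obj_tag, branches):
    """Evaluate a twig rooted at obj_tag.

    branches is a list of (child_tag, value_or_None) pairs. Each branch is
    matched on its own, and the resulting id sets are intersected.
    """
    result = None
    for child_tag, value in branches:
        ids = match_branch(root, obj_tag, child_tag, value)
        result = ids if result is None else result & ids
        if not result:
            break  # no object satisfies all branches seen so far
    return sorted(result or [])


root = ET.fromstring(XML)
# Roughly the XPath twig //book[author='Lee'][year='2014'][title]
answer = twig_query(root, "book", [("author", "Lee"), ("year", "2014"), ("title", None)])
print(answer)  # ['b3']
```

This naive version rescans the document for every branch; an index such as the one the paper proposes would aim to avoid exactly that cost.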