Enabling generic keyword search over raw XML data

Manoj K Agarwal, Krithi Ramamritham
2015 IEEE 31st International Conference on Data Engineering
XML and JSON have become the default formats for exchanging information in web applications and within enterprises. Keyword search over XML data is motivated by the need to relieve users from writing difficult XQueries, which otherwise require them to know the complex XML schema. In existing XML keyword search techniques, the XML nodes returned for a keyword query are the Lowest Common Ancestor (LCA) nodes for the query keywords. In this paper, we argue that the LCA-based techniques ... still require users to be well versed in the XML schema and also the data to be able to obtain meaningful query results. To address these shortcomings, we present a novel system, Generic Keyword Search (GKS): for a given keyword query Q, instead of identifying (and returning information from) only LCA nodes, GKS returns 'meaningful' information from any XML node that contains a subset of the keywords in the search query Q. The GKS response includes the LCA nodes, if any, that would have been returned by LCA-based techniques. GKS is also able to find highly relevant keywords and XML schema elements, that is, deeper analytical insights (called DI), in the XML data in the context of the user query. DI enables users to navigate the XML data and to refine their queries even if they are not familiar with the data and the schema. Our experiments on real data sets show that GKS is able to return highly relevant responses to keyword queries efficiently.

The AND-semantics constraints underlying LCA-based techniques are further highlighted by the following example. Example 1: Consider keyword queries Q1, Q2, Q3 on the XML document in Figure 1(i). Each leaf node in the XML document is a text node (a text node is an XML element directly containing its value). We have represented the document as shown in Figure 1(i) for brevity. Response of SLCA and ELCA based algorithms are
doi:10.1109/icde.2015.7113410 dblp:conf/icde/AgarwalR15 fatcat:kci7dg7rrrew5d3hcmyw3ukrzq