Duplication Free Minimal Keyword Search in Graph using Top-K Algorithm

Miss. Payal P. Thakur, Prof. Borkar N. R.
2017 IJARCCE  
Keyword search over a graph searches for a subgraph that contains a set of query keywords. A problem with most existing keyword search methods is that they may produce duplicate answers that contain the same set of content nodes (i.e., nodes containing a query keyword), although these nodes may be connected differently in different answers. Thus, users may be presented with many similar answers with trivial differences. In addition, some of the nodes in an answer may contain query keywords that are all covered by other nodes in the answer. Removing these nodes does not change the coverage of the answer but can make the answer more compact. The answers in which each content node contains at least one unique query keyword are called minimal answers in this paper. We define the problem of finding duplication-free and minimal answers, and propose algorithms for finding such answers efficiently. Extensive performance studies using two large real data sets confirm the efficiency and effectiveness of the proposed methods.
However, all the four trees on the left have the same set of content nodes. Since users usually want to see different groups of content nodes that are close to each other, and might not be interested in browsing multiple relations to see how the nodes that contain input keywords are related to each other, the above search results might not be desirable [1]. Producing results with distinct sets of content nodes can prevent the search engine from overwhelming the user with many similar answers [1].
In this paper, we first propose a new approach to keyword search in graphs that produces duplication-free answers. Each answer produced by our approach has a unique set of content files. We also define minimal answers, in which each file contains at least one unique input keyword. We propose two algorithms that convert an answer to a minimal answer. We prove that the problem of finding a minimal answer while minimizing the proximity function that we use is NP-hard. Thus, one of the algorithms we propose is a greedy algorithm that searches for a sub-optimal minimal answer. We prove that this greedy algorithm has a bounded approximation ratio. Finally, for finding top-k duplication-free and minimal answers, we propose a Top-K algorithm. Our extensive experiments show the efficiency and effectiveness of the proposed methods.
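The two definitions above (minimal answers and duplication-free answers) can be illustrated with a short Python sketch. This is not the paper's algorithm: the function names `make_minimal` and `deduplicate`, the dictionary layout of an answer, and the "fewest keywords first" removal order are all assumptions made for illustration, and the sketch ignores the proximity function that the paper's greedy algorithm minimizes.

```python
from typing import Dict, List, Set


def make_minimal(content_nodes: Dict[str, Set[str]], query: Set[str]) -> Dict[str, Set[str]]:
    """Drop content nodes whose query keywords are all covered by the other nodes.

    `content_nodes` maps a node (or file) name to the keywords it contains.
    Hypothetical helper: a simple heuristic, not the paper's greedy algorithm.
    """
    # Keep only the query keywords of each node.
    kept = {node: kws & query for node, kws in content_nodes.items()}
    # Assumption: try to remove nodes covering the fewest keywords first.
    for node in sorted(kept, key=lambda n: (len(kept[n]), n)):
        others: Set[str] = set().union(*(k for m, k in kept.items() if m != node))
        if kept[node] <= others:
            # Every keyword of this node is covered elsewhere, so coverage is unchanged.
            del kept[node]
    return kept


def deduplicate(answers: List[dict]) -> List[dict]:
    """Keep only the first answer for each distinct set of content nodes."""
    seen = set()
    unique = []
    for answer in answers:
        key = frozenset(answer["content_nodes"])
        if key not in seen:
            seen.add(key)
            unique.append(answer)
    return unique


query = {"graph", "search", "keyword"}
nodes = {"a": {"graph", "search"}, "b": {"graph"}, "c": {"keyword"}}
minimal = make_minimal(nodes, query)  # node "b" is redundant and is removed

answers = [
    {"content_nodes": ["a", "c"], "edges": [("a", "r1"), ("r1", "c")]},
    {"content_nodes": ["c", "a"], "edges": [("a", "r2"), ("r2", "c")]},
]
distinct = deduplicate(answers)  # both answers share content nodes {a, c}
```

After a kept node survives its check, later removals can only shrink what the other nodes cover, so every remaining node still holds at least one keyword that no other remaining node contains.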
Our goal is to find the exact file the user wants, together with its extension, based on the keywords entered, and to display the graph of the keyword search. We also compute the time required for searching, the frequency, and the size of the file. Keyword searches are an alternative means of querying databases; they are simple and familiar to most internet users because they only require the input of some keywords. While keyword searches have proven effective for text documents (e.g., hypertext markup language (HTML) documents), the problem of keyword search on structured data (e.g., relational databases) or semi-structured data (e.g., XML databases) is neither straightforward nor well studied. Keyword searches in text documents find the documents that are most closely related to the input keywords, while in relational databases they search for the correlated tuples that contain all or some of the keywords. However, defining the results of keyword searches in XML documents is more complex. Keyword search on graph data usually returns a set of connected sub-structures, such as sub-trees or sub-graphs, showing which nodes include query keywords and how they are interconnected in the graph database [4]. Many approaches find minimal connected sub-trees containing query keywords as succinct answers to a given query [7] [8]. Since there can be a significant number of answer sub-trees in a large graph database, a relevance scoring function is often used to rank candidate answers and select the top-k ones with the highest relevance. Several approaches have been proposed based on distinct root semantics, where for each node in the graph, at most one sub-tree rooted at the node is considered a possible answer to the query [9]. The answer tree consists of a set of content nodes containing all the query keywords as well as the nodes and edges on the shortest paths from the root to each content node.
Its relevance is usually computed by a function of the shortest paths, such as the sum of the path lengths. By significantly reducing the number of sub-trees to be explored in the graph, search methods based on distinct root semantics can process keyword queries over a large volume of data more efficiently than other approaches. Distinct root semantics also facilitates exploiting indexes on graph data to improve query performance [9].
... the data model of choice for representing semi-structured, self-describing data. Semi-structured query languages provide features, such as flexible path expressions, that allow one to query semi-structured data, i.e., graph data that are not characterized by a rigid structure. However, one still needs sufficient knowledge of the structure, the role of the requested objects, and XQuery in order to formulate a meaningful query.
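The distinct root semantics described above (one candidate answer per root, scored by the sum of shortest-path lengths from the root to a content node for each keyword) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of [9] or of this paper: the graph is taken to be unweighted and stored as an adjacency dictionary, shortest paths are computed with plain breadth-first search rather than an index, and the names `bfs_distances` and `top_k_distinct_roots` are hypothetical.

```python
import heapq
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def bfs_distances(graph: Dict[str, Iterable[str]], root: str) -> Dict[str, int]:
    """Hop counts from `root` to every reachable node (unweighted graph assumed)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def top_k_distinct_roots(
    graph: Dict[str, Iterable[str]],
    node_keywords: Dict[str, Set[str]],
    query: Set[str],
    k: int,
) -> List[Tuple[int, str, Dict[str, str]]]:
    """Return up to k (score, root, keyword -> content node) triples, lowest score first."""
    candidates = []
    for root in graph:
        dist = bfs_distances(graph, root)
        total, chosen = 0, {}
        for kw in sorted(query):
            reachable = [(d, n) for n, d in dist.items() if kw in node_keywords.get(n, ())]
            if not reachable:
                break  # this root cannot cover every keyword
            d, n = min(reachable)  # nearest content node for this keyword
            total += d
            chosen[kw] = n
        else:
            candidates.append((total, root, chosen))
    # Smaller sum of path lengths means higher relevance; ties broken by root name.
    return heapq.nsmallest(k, candidates, key=lambda c: (c[0], c[1]))


# A path graph a - b - c - d; nodes a and d contain "x", node c contains "y".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
node_keywords = {"a": {"x"}, "c": {"y"}, "d": {"x"}}
results = top_k_distinct_roots(graph, node_keywords, {"x", "y"}, k=2)
```

In this small example the two best roots, `c` and `d`, both select the content nodes `c` and `d`, so distinct roots alone do not guarantee distinct sets of content nodes.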
doi:10.17148/ijarcce.2017.6525 fatcat:4rqebebx4fdodlmv26oa5qze5q