Constraint based frequent pattern mining for generalized query templates from web log

RV Pujeri, GM Karthik
2011 International Journal of Engineering, Science and Technology  
The World-Wide Web gives every Internet user access to an abundance of information, but identifying the relevant piece of information is increasingly difficult. Popular search engines use logs to keep track of user activity, including queries, click-throughs, and browsing behavior. Research in web mining tries to address this problem by discovering knowledge from user logs. We propose an approach to discover patterns that can predict a user's search without the aid of a remote server. Our ... d analyses users' interactions by constructing an FP-tree, which facilitates producing templates from the user logs. Growth of the consensus tree is restricted, and templates are obtained from its leaves, which assists the user's search process with precision. We show the effectiveness of our method on realistic web logs and explore the tradeoff between the accuracy and the usefulness of predictions. Test results show that the improved algorithm has lower time and space complexity and fits within memory capacity.
Keywords: frequent pattern mining, web log mining, CBFP mining, FP-tree, data mining.
under a hierarchical structure of several levels. It is well known that queries for web search engines are often too short to contain sufficient information to discriminate between ambiguous documents. Attempts have been made to infer this information from server-side web logs (Mobasher et al., 2000). The information contained in a web log includes the IP address of the client, the page that was retrieved, the time at which the request was initiated, the page from which the link originated, the browsing agent used, etc. The log keeps users' queries and clicks, as well as their browsing activities. The file is quite large: a one-day log can exceed several hundred megabytes. The purpose of web-log mining is to improve web performance by utilizing the mined patterns. Unless additional information is available about informative search terms, there is no way to determine whether the information a user browses is relevant. Queries posted directly to a search engine return information based solely on the keywords. In this paper we try to reduce search time and unnecessary queries to the remote search engine. Instead, our technique provides the required information from past log data obtained from a local proxy server.
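The log fields listed above (client IP address, retrieved page, request time, referring page, browsing agent) can be pulled out of a raw log line with a small parser. The sketch below assumes the Apache "combined" log format, which the paper does not specify; the names `LOG_PATTERN`, `LogEntry`, `parse_line`, and the sample line are all hypothetical and serve only as illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: logs follow the Apache "combined" layout. The paper does not
# state its log format, so this pattern is illustrative only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


class LogEntry(NamedTuple):
    ip: str        # IP address of the client
    time: str      # time at which the request was initiated
    page: str      # page that was retrieved
    referrer: str  # page from which the link originated
    agent: str     # browsing agent used


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the fields of one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return LogEntry(m["ip"], m["time"], m["page"], m["referrer"], m["agent"])


# Made-up sample line, not taken from the paper's data.
SAMPLE = ('192.0.2.1 - - [10/Oct/2010:13:55:36 +0000] '
          '"GET /search?q=fp+tree HTTP/1.1" 200 2326 '
          '"http://example.com/start" "Mozilla/5.0"')

entry = parse_line(SAMPLE)
```

Lines that do not match the assumed layout yield `None`, so malformed entries can be skipped before mining.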
In particular, we apply data mining to extract useful and implicit knowledge from web logs, which keep traces of information about users' visits to web pages on web servers. If the same query is posted again, our technique provides a direct link obtained from the mined log information; otherwise the query is sent to the remote server. Data mining is the process of discovering or extracting implicit and useful knowledge from large data sets. It is very promising here, since popular web sites get millions of hits every day and traditional or manual methods cannot feasibly analyze such logs. The log used in this paper is from a popular commercial search engine that is currently in use.
In this paper, we develop and integrate two techniques to find interesting frequent patterns or templates in web logs, in order to support the informative search terms of the search engine. First, a novel, compact, FP-tree-like trie data structure is constructed (Grahne et al.; White and Drucker, 2007). It is an extended prefix-tree structure that stores quantitative information about the frequent hits of a web page from the web logs. Each node of the consensus tree describes how entries are classified for easier access to the patterns or templates, and the path from the root to a leaf gives information about keywords, location, time, etc., for each web page or web object obtained from the web logs. Each leaf of the consensus tree holds an index that references the bucket storing the templates for each frequently accessed web object. Restriction of tree growth, keyword mutation, and prediction of the existence of a template are discussed later in this paper. The FP-growth method finds frequent patterns or templates, along with the related keyword list, using the least frequent item as a suffix, which offers good selectivity.
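As a rough illustration of the prefix-tree idea described above, the following sketch stores hit counts on each node and keeps, at the end of each query path, an index into a list of buckets. It is a simplified stand-in, not the authors' data structure: the class and method names are hypothetical, keywords are ordered alphabetically rather than by frequency as an FP-tree would, and each bucket holds plain URLs instead of templates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    count: int = 0                               # how many queries passed through
    children: Dict[str, "Node"] = field(default_factory=dict)
    bucket: Optional[int] = None                 # index into QueryTrie.buckets


class QueryTrie:
    """Hypothetical prefix tree over query keywords with per-path buckets."""

    def __init__(self) -> None:
        self.root = Node()
        self.buckets: List[List[str]] = []

    def insert(self, keywords: List[str], url: str) -> None:
        node = self.root
        # Assumption: a canonical (alphabetical) order so that queries with
        # the same keywords share one path regardless of word order.
        for kw in sorted(keywords):
            node = node.children.setdefault(kw, Node())
            node.count += 1
        if node.bucket is None:
            node.bucket = len(self.buckets)
            self.buckets.append([])
        if url not in self.buckets[node.bucket]:
            self.buckets[node.bucket].append(url)

    def lookup(self, keywords: List[str]) -> Optional[List[str]]:
        """Return stored URLs for a repeated query, or None on a miss."""
        node = self.root
        for kw in sorted(keywords):
            child = node.children.get(kw)
            if child is None:
                return None
            node = child
        return None if node.bucket is None else self.buckets[node.bucket]


trie = QueryTrie()
trie.insert(["fp", "tree"], "http://example.com/a")
trie.insert(["tree", "fp"], "http://example.com/b")
trie.insert(["fp", "growth"], "http://example.com/c")
```

A `None` result from `lookup` corresponds to the case in which the query has not been seen before and would be forwarded to the remote server.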
Second, using the strategy of constraint-based mining (Mannila and ...), we restrict the growth of the FP-tree-like trie using user-specified constraints (Ng et al., 1998). Constraint-based mining allows us to restrain the growth of the consensus tree by providing additional mining constraints such as level constraints (Han et al., 1999) and rule constraints (Lee and DeRaedt, 2004). The level constraint focuses on the mutation of keywords for a web object, along with pruning the least accessed levels. The rule constraint prunes the templates available in the leaves of the FP-tree (Srikant and Agrawal, 1996; Pei et al., 2001; Grahne et al., 2000).
Integrating the two techniques yields a new mechanism, the CBFP (Constraint Based Frequent Pattern) mining algorithm, which mines web logs for generalized queries with reasonable time and space complexity. The CBFP mining algorithm constructs a trie structure that provides the templates, keywords, and articles relating to a query and, based on the downward closure property, narrows the search down to a level-wise search strategy. The algorithm is proposed for finding useful generalized templates that associate the keywords of queries with the targeted articles. It is intended to process queries much faster than traditional search methods. The consensus tree path will be the same for a particular article reached by many users with similar queries, and the summarised templates will be a direct answer to future searches by users keying in similar queries. The algorithm generalizes keywords so that they can match new queries not made previously, making templates more general and precise. It provides a clear view of the templates and locates the articles in which users are most interested. The CBFP mining algorithm is based on two points. The first is that we search for frequent keywords of any length among URL sequences from a web log.
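The downward closure property mentioned above says that every subset of a frequent keyword set is itself frequent, so a candidate can be discarded as soon as one of its subsets falls below the support threshold. The sketch below shows this with a generic Apriori-style level-wise search, not the CBFP algorithm itself. The function name, the `min_support` and `max_level` parameters (the latter standing in for a level constraint), and the sample queries are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def levelwise_frequent(queries: List[List[str]],
                       min_support: int,
                       max_level: int) -> Dict[FrozenSet[str], int]:
    """Return keyword sets of size <= max_level seen in >= min_support queries."""
    transactions = [frozenset(q) for q in queries]

    # Level 1: count individual keywords.
    counts: Dict[FrozenSet[str], int] = {}
    for t in transactions:
        for kw in t:
            key = frozenset([kw])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    level = 1
    while current and level < max_level:   # max_level caps the search depth
        level += 1
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Downward closure: keep a candidate only if every subset one
            # keyword smaller was frequent at the previous level.
            if len(union) == level and all(
                frozenset(sub) in current
                for sub in combinations(union, level - 1)
            ):
                candidates.add(union)
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
    return frequent


# Made-up queries, not taken from the paper's logs.
QUERIES = [["fp", "tree", "mining"], ["fp", "tree"],
           ["fp", "growth"], ["tree", "mining"]]
result = levelwise_frequent(QUERIES, min_support=2, max_level=2)
```

Raising `min_support` or lowering `max_level` shrinks the search space, which is the general effect the level and rule constraints in the text aim for.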
The second is that we search for all instances of a URL in the input logs. An implicit user-defined constraint plays a vital role in pruning the search space of the FP-tree and reduces search time.
Previous work focused mainly on mining to change the web structure for easier browsing (Craven et al., 1999; Sundaresan and Yi, 2000), on predicting browsing behaviours for prefetching (Zaine et al., 1998; Boyan, 1996), or on predicting user preferences for active advertising (Pei et al., 2000; Perkowitz, 1997). Some work has been done on data mining systems that discover useful patterns or templates from web server logs (Ling et al., 2001), that describe users' interests based on semantics (Li et al., 2008), that consider pattern length together with the position of a web page (Ou et al., 2008), or that have been designed specifically as an Apache module (Heung et al., 2009). A few works used data mining to gather statistical information for better weighting and ranking of documents. In previous techniques, results are obtained from data newly appended to the web log without details of the older data, and the minimum support and confidence values are set trivially when handling large log data. Tracking the dependence between page accesses and calculating the frequency of user access sequences is not effective across varied users. All these works suffer from requiring a large search space and from ineffectiveness in handling long patterns.
In this paper, we focus on two factors that characterize the performance of our technique: the order of dependencies between page accesses and the calculation of the frequencies of user access sequences. The aim and scope of this paper is to offer recommendations for developing future personalization services and to report initial findings on a specific aspect that is highly relevant for personalization: the study of user sessions. We propose a novel data mining technique that discovers useful and
doi:10.4314/ijest.v2i11.64551 fatcat:hvz4pd2zkzhjvosampn6ovusdu