A Novel Technique for Web Log mining with Better Data Cleaning and Transaction Identification

2011 Journal of Computer Science  
Problem statement: In the internet era, web sites are a useful source of information for almost every activity. As a result, the World Wide Web has grown rapidly in its volume of traffic and in the size and complexity of web sites. Web mining is the application of data mining, artificial intelligence, chart technology and so on to web data; it traces users' visiting behavior and extracts their interests from usage patterns. Because of its direct application in e-commerce, Web ..., e-learning and information retrieval, web mining has become one of the important areas in computer and information science. Several techniques exist, such as web usage mining, but each has its own disadvantages. This study focuses on providing techniques for better data cleaning and transaction identification from the web log. Approach: Log data is usually noisy and ambiguous, so preprocessing is an important step for an efficient mining process. In preprocessing, data cleaning removes records of graphics, videos and format information, records with a failed HTTP status code, and robot requests. Sessions are reconstructed and paths are completed by appending missing pages. Transactions that depict user behavior are also constructed accurately during preprocessing by calculating the reference lengths of user accesses, taking the byte rate into account. Results: In terms of the number of records, for example, a log of 1000 records is reduced to only 350 records by data cleaning. In terms of execution time, the initial log takes 119 seconds for execution, whereas only 52 seconds are required with the proposed technique. Conclusion: The experimental results demonstrate the performance of the proposed algorithm, which gives good results for web usage mining compared to existing approaches.
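The data cleaning step described in the Approach can be illustrated with a minimal Python sketch. This is not the authors' algorithm. It assumes the log follows the Apache Common or Combined Log Format, and the extension list, the robot signatures and the names `LOG_PATTERN`, `is_relevant` and `clean_log` are hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import re

# Assumed input: Apache Common Log Format, optionally followed by the
# Combined Log Format fields "referer" and "user-agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical lists: the source does not enumerate file types or robots.
MEDIA_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico",
                    ".css", ".js", ".mp4", ".avi")
ROBOT_HINTS = ("bot", "crawler", "spider")


def is_relevant(line):
    """Return True if a log line survives the three cleaning rules."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return False  # unparseable line, treated as noise

    # Rule 1: drop graphics, videos and format (style/script) resources.
    path = match.group("path").split("?", 1)[0].lower()
    if path.endswith(MEDIA_EXTENSIONS):
        return False

    # Rule 2: drop failed requests (assumed here to mean 4xx and 5xx).
    if int(match.group("status")) >= 400:
        return False

    # Rule 3: drop robot traffic, detected by a robots.txt request
    # or by a robot-like user-agent string.
    agent = (match.group("agent") or "").lower()
    if path == "/robots.txt" or any(hint in agent for hint in ROBOT_HINTS):
        return False

    return True


def clean_log(lines):
    """Keep only the log lines that pass all cleaning rules."""
    return [line for line in lines if is_relevant(line)]
```

Calling `clean_log` on a list of raw log lines returns the subset that would be passed on to session reconstruction. The simple keyword check for robots is only a heuristic; a production cleaner would more likely rely on a maintained list of crawler signatures.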
doi:10.3844/jcssp.2011.683.689 fatcat:dognijmcknfqnmtwhmipynhafq