XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning

Lavanya Pamulaparty, C.V. Guru Rao, M. Sreenivasa Rao
2015 Procedia Computer Science  
The WWW has witnessed exponential growth in the number of web documents. People from all walks of life depend on the Internet, the electronic superhighway, to retrieve information. Search engines retrieve data. Detecting and handling near-duplicate documents can help search engines improve performance. In this paper, we propose two algorithms. The first performs unsupervised probabilistic clustering of documents, while the second detects near duplicates that can handle ... offline processing of search engines. Clustering the documents avoids unnecessary comparisons, while the near-duplicate detection algorithm involves local feature selection in a given document based on weights assigned to terms. A classifier is built to provide supervised learning for discriminating documents. We propose a framework named eXtensible Near Duplicate Detection Framework (XNDDF), whose components leave room for flexible duplicate detection solutions and which also shows the offline and online processing required by a search engine. Our future work is to implement the framework components through a prototype application.
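The abstract describes near-duplicate detection through local feature selection based on term weights, but it gives no formulas. The following is only a minimal sketch of that general idea, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the smoothed TF-IDF weighting, the top-k cutoff, the Jaccard comparison, the 0.6 threshold, and all function names are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms(docs, k=5):
    """Return, for each document, the set of its k highest-weighted terms.

    Assumption: terms are weighted with a smoothed TF-IDF score. The paper
    only says weights are assigned to terms, not how.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    features = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        # Sort by descending weight; break ties alphabetically so the
        # selection is deterministic.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        features.append(set(ranked[:k]))
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, k=5, threshold=0.6):
    """Return index pairs (i, j), i < j, whose feature sets are similar.

    Assumption: the threshold and the all-pairs comparison are illustrative.
    The abstract suggests clustering first to avoid unnecessary comparisons.
    """
    feats = top_terms(docs, k)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(feats[i], feats[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = [
        "search engines crawl web pages and index documents",
        "search engines crawl web pages and index the documents",
        "bananas grow in tropical climates near the equator",
    ]
    print(near_duplicate_pairs(sample))
```

In this sketch every pair of documents is compared. Under the abstract's description, a prior clustering step would restrict the comparison to documents within the same cluster.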
doi:10.1016/j.procs.2015.04.175 fatcat:nssoxjvixzdedbpgyapf343rhm