Text Document Retrieval through Clustering using Meaningful Frequent Ordered Word Patterns

Pushpalatha K.P, G. Raju
2018 International Journal of Applied Engineering Research  
The Agglomerative Hierarchical Clustering (AHC) algorithm has two major limitations: its rigid nature and its tendency to produce globular clusters. The rigidity arises from the constraint that only the closest pair of clusters is selected for merging in a single iteration. In the proposed algorithm, this rigidity is removed by a greedy approach that selects more than one object for clustering in the same iteration, using a k-Nearest Neighbours approach. Instead of k, a similarity threshold is used to find ... neighbours. There is a bias towards globular clusters when Normalised Google Distance (NGD) is used as the similarity measure. This limitation is reduced to a great extent by a modified NGD measure named Score, which takes into account the local weightage between the features of different clusters. Many algorithms for text document retrieval are based on the bag-of-words (BoW) approach, in which the sequence of the words is not given much importance. The BoW representation used for such clustering is often unsatisfactory because it ignores relationships between co-occurring terms. WordNet and association mining are used to enable meaningful document clustering that gives more importance to relationships between co-occurring terms. WordNet is used to enhance the common concepts among documents. Association mining is used to construct a feature set named Frequent Ordered Word Patterns (FOWPs) from WordNet-enriched document data sets. New documents, constructed from FOWPs, are used for mining clusters. This hybrid hierarchical clustering approach, when applied to the data sets formed by the newly constructed documents, is found to give better results in terms of F-measure than those reported in the work taken as a benchmark for comparison.
doi:10.37622/ijaer/13.7.2018.4822-4833 fatcat:kgwproi2nzabjjed6t7dgetz34
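
As a rough illustration of two ideas in the abstract, the sketch below computes the standard Normalised Google Distance from co-occurrence counts and performs one threshold-based greedy merge pass in which a cluster can absorb several neighbours in the same iteration. It is a minimal sketch, not the authors' implementation: the function names (ngd, single_link, merge_pass), the single-link cluster distance, and the threshold value are assumptions made for illustration, and the paper's modified Score measure is not reproduced because the abstract does not define it.

```python
import math


def ngd(fx, fy, fxy, n):
    """Standard Normalised Google Distance from occurrence counts.

    fx, fy: number of documents containing term x / term y
    fxy:    number of documents containing both terms
    n:      total number of documents
    """
    if fxy == 0:
        return float("inf")  # terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # both terms occur in every document
    return (max(lx, ly) - lxy) / denom


def single_link(a, b, dist):
    """Assumed cluster distance: smallest pairwise distance between members."""
    return min(dist(x, y) for x in a for y in b)


def merge_pass(clusters, dist, threshold):
    """One greedy pass: each unmerged cluster absorbs every later unmerged
    cluster whose distance to it is within the threshold, so more than one
    merge can happen in a single iteration."""
    merged, used = [], set()
    for i, seed in enumerate(clusters):
        if i in used:
            continue
        used.add(i)
        group = set(seed)
        for j in range(i + 1, len(clusters)):
            if j not in used and single_link(seed, clusters[j], dist) <= threshold:
                group |= clusters[j]
                used.add(j)
        merged.append(group)
    return merged


if __name__ == "__main__":
    # Hypothetical toy data: four items placed on a line.
    points = [0.0, 0.1, 0.15, 0.9]
    distance = lambda i, j: abs(points[i] - points[j])
    print(merge_pass([{0}, {1}, {2}, {3}], distance, threshold=0.2))
```

Repeating merge_pass until the number of clusters stops changing would give a full threshold-driven agglomeration; in practice the toy distance would be replaced by a term-based measure such as ngd.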