Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

James Z. Wang, William Taylor
2007 IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)  
Although using ontologies to assist information retrieval and text document processing has recently attracted more and more attention, existing ontology-based approaches have not shown advantages over the traditional keywords-based Latent Semantic Indexing (LSI) method. This paper proposes an algorithm to extract a concept forest (CF) from a document with the assistance of a natural language ontology, the WordNet lexical database. Using concept forests to represent the semantics of text
..., the semantic similarities of these documents are then measured as the commonalities of their concept forests. Performance studies of text document clustering based on different document similarity measurement methods show that the CF-based similarity measurement is an effective alternative to the existing keywords-based methods. In particular, this CF-based approach has obvious advantages over the existing keywords-based methods, including LSI, in processing text abstracts or in P2P environments where it is impractical to collect the entire document corpus for analysis.
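The abstract does not spell out how a concept forest is built or how "commonality" is scored, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a tiny hand-written hypernym map (`TOY_HYPERNYMS`, hypothetical data standing in for WordNet), represents a forest as a set of child-to-parent edges, and uses Jaccard overlap of those edges as a stand-in similarity. The function names `concept_forest` and `forest_similarity` are illustrative and not taken from the paper.

```python
# Hypothetical toy hypernym map standing in for WordNet (not real WordNet data).
# Each concept has at most one parent, so the edges collected below form a forest.
TOY_HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def concept_forest(words, hypernyms=TOY_HYPERNYMS):
    """Collect (child, parent) edges by walking each known word up its hypernym chain."""
    edges = set()
    for word in words:
        node = word.lower()
        while node in hypernyms:
            parent = hypernyms[node]
            edges.add((node, parent))
            node = parent
    return edges


def forest_similarity(forest_a, forest_b):
    """Jaccard overlap of forest edges; an assumed stand-in for 'commonality'."""
    if not forest_a and not forest_b:
        return 0.0
    return len(forest_a & forest_b) / len(forest_a | forest_b)


# Three tiny "documents", already reduced to keywords for illustration.
doc_a = ["dog", "cat"]
doc_b = ["wolf", "cat"]
doc_c = ["car", "truck"]

sim_ab = forest_similarity(concept_forest(doc_a), concept_forest(doc_b))
sim_ac = forest_similarity(concept_forest(doc_a), concept_forest(doc_c))
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

In this toy setup `doc_a` and `doc_b` share only one keyword ("cat"), yet their forests overlap on the shared "canine" and "mammal" branches, which is the kind of effect a concept-level comparison is meant to capture. Because each forest is computed from a single document, no corpus-wide statistics are needed, in line with the abstract's point about P2P settings.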
doi:10.1109/wi.2007.11 dblp:conf/webi/WangT07 fatcat:byhme53t4fej5gejke6jmm5cje