An improved topic relevance algorithm for focused crawling

Hong-Wei Hao, Cui-Xia Mu, Xu-Cheng Yin, Shen Li, Zhi-Bin Wang
2011 IEEE International Conference on Systems, Man, and Cybernetics
Topic relevance of pages and hyperlinks is the key issue in focused crawling. In this paper, an improved topic relevance algorithm for focused crawling is proposed. First, we implement a prototype focused crawler, a topic-specific news gathering system, to support comparative experiments on different similarity measures applied to anchor text. Second, experiments on a Chinese text corpus show that LSI (Latent Semantic Indexing) outperforms TF-IDF (term frequency-inverse document frequency) for both hyperlink topic relevance prediction and page topic relevance calculation. Third, in real crawling experiments on the prototype system, the crawler using TF-IDF performs well at the beginning of crawling, with accumulated topic relevance increasing quickly, whereas the crawler using LSI finds more related pages and can tunnel through. Fourth, combining the advantages of LSI and TF-IDF, we propose the TFIDF+LSI algorithm to guide the crawling. Last, the crawler using TFIDF+LSI performs the same crawl task and demonstrates the combined advantages of TF-IDF and LSI. The experiment suggests that the performance of the crawler using TFIDF+LSI is greatly superior to that of the crawler using either TF-IDF or LSI alone.
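The two similarity measures compared in the abstract can be illustrated with a minimal Python sketch. It scores anchor texts against a topic description using TF-IDF cosine similarity and using cosine similarity in an LSI latent space obtained by truncated SVD. The abstract does not say how TFIDF+LSI combines the two scores, so the weighted sum with `alpha` below is purely an assumption for illustration, as are the function names, the whitespace tokenizer, the smoothed IDF formula, and the sample strings.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[i, j] is the TF-IDF weight of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    mat = np.zeros((n, len(vocab)))
    for i, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF (assumption)
            mat[i, index[term]] = (count / len(doc)) * idf
    return vocab, mat


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def lsi_project(mat, k):
    """Project each document row onto the top-k right singular vectors (LSI space)."""
    _, _, vt = np.linalg.svd(mat, full_matrices=False)
    return mat @ vt[:k].T


def score_links(topic, anchors, k=2, alpha=0.5):
    """Score each anchor text against the topic.

    The weighted sum of the two similarities is a hypothetical combination
    rule, not the TFIDF+LSI rule from the paper.
    """
    _, mat = tfidf_matrix([topic] + anchors)
    latent = lsi_project(mat, min(k, min(mat.shape)))
    scores = []
    for i in range(1, len(anchors) + 1):
        s_tfidf = cosine(mat[0], mat[i])
        s_lsi = cosine(latent[0], latent[i])
        scores.append(alpha * s_tfidf + (1 - alpha) * s_lsi)
    return scores


if __name__ == "__main__":
    topic = "stock market finance news"
    anchors = [
        "stock market news today",
        "football match results",
        "market report on stock prices",
    ]
    for anchor, score in zip(anchors, score_links(topic, anchors)):
        print(f"{score:+.3f}  {anchor}")
```

In a crawler, scores of this kind would typically set the priority of each unvisited hyperlink in the crawl frontier. Setting `alpha=1.0` reduces the sketch to pure TF-IDF and `alpha=0.0` to pure LSI, which makes it easy to compare the two measures on the same anchors.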
doi:10.1109/icsmc.2011.6083759 dblp:conf/smc/HaoMYLW11 fatcat:qex6mjeitvcghltaujii3biycq