A smart itsy bitsy spider for the Web

Hsinchun Chen, Yi-Ming Chung, Marshall Ramsey, Christopher C. Yang
1998 Journal of the American Society for Information Science  
As part of the ongoing Illinois Digital Library Initiative project, this research proposes an intelligent agent approach to Web searching. In this experiment, we developed two Web personal spiders based on best first search and genetic algorithm techniques, respectively. These personal spiders can dynamically take a user's selected starting homepages and search for the most closely related homepages in the Web, based on the links and keyword indexing. A graphical, dynamic, Java-based interface was developed and is available for Web access. A system architecture for implementing such an agent-based spider is presented, followed by detailed discussions of benchmark testing and user evaluation results. In benchmark testing, although the genetic algorithm spider did not outperform the best first search spider, we found both results to be comparable and complementary. In user evaluation, the genetic algorithm spider obtained significantly higher recall value than that of the best first search spider. However, their precision values were not statistically different. The mutation process introduced in genetic algorithm allows users to find other potential relevant homepages that cannot be ex...

... are expected to worsen as the amount of online information increases. This is mainly due to the problems of information overload and vocabulary differences (Chen, 1994; Furnas, Landauer, Gomez, & Dumais, 1987). Many researchers consider that devising a scalable approach to Web search is critical to the success of Internet and Intranet services, and other current and future National Information Infrastructure (NII) applications (Chen & Schatz, 1994; Schatz & Chen, 1996). The main information retrieval mechanisms provided by the prevailing Internet WWW-based software are based on either keyword search (e.g., Lycos, Alta Vista, and Yahoo servers) or hypertext browsing (e.g., NCSA Mosaic, Netscape Navigator, and Microsoft Internet Explorer). Keyword search often results in low precision, poor recall, and slow response time because of the limitations of indexing and communication methods (bandwidth), controlled language-based interfaces (the vocabulary problem), and the inability of searchers themselves ...
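The best first search spider mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy link graph, the keyword sets, the function names, and the use of Jaccard similarity as the relevance score are all assumptions made for illustration, since the abstract says only that pages are ranked "based on the links and keyword indexing".

```python
import heapq

# Hypothetical toy data: page -> outgoing links, page -> index terms.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
KEYWORDS = {
    "a": {"web", "spider", "search"},
    "b": {"web", "spider", "agent"},
    "c": {"cooking", "recipes"},
    "d": {"web", "search", "spider"},
    "e": {"travel"},
}

def jaccard(a, b):
    """Jaccard similarity of two term sets (assumed scoring function)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_first_spider(start_pages, links, keywords, max_results=3):
    """Repeatedly expand the unvisited page most similar to the start pages."""
    # The combined terms of the starting pages act as the query profile.
    profile = set()
    for page in start_pages:
        profile |= keywords[page]

    seen = set(start_pages)
    frontier = []  # min-heap of (-score, page), so the best score pops first

    def push_links(page):
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                score = jaccard(profile, keywords.get(nxt, ()))
                heapq.heappush(frontier, (-score, nxt))

    for page in start_pages:
        push_links(page)

    results = []
    while frontier and len(results) < max_results:
        neg_score, page = heapq.heappop(frontier)
        results.append((page, -neg_score))
        push_links(page)  # follow links only from the best page so far
    return results

ranked = best_first_spider(["a"], LINKS, KEYWORDS)
```

Starting from page "a", the sketch always expands the highest-scoring candidate next, so a strongly related page reached through a good intermediate page is visited before an unrelated direct neighbor. A genetic algorithm variant would, per the abstract, add a mutation step that introduces pages outside this local neighborhood; that step is not shown here.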
doi:10.1002/(sici)1097-4571(19980515)49:7<604::aid-asi3>3.0.co;2-t fatcat:zsxvhdidjzewfike53tmgjou4a