Detecting image purpose in World Wide Web documents

Seungyup Paek, John R. Smith, Daniel P. Lopresti, Jiangying Zhou
1998 Document Recognition V  
The number of World-Wide Web (WWW) documents available to users of the Internet is growing at an incredible rate. Therefore, it is becoming increasingly important to develop systems that aid users in searching, filtering, and retrieving information from the Internet. Currently, only a few prototype systems catalog and index images in Web documents. To greatly improve the cataloging and indexing of images on the Web, we have developed a prototype rule-based system that detects the content
... s in Web documents. Content images are images that are associated with the main content of Web documents, as opposed to a multitude of other images that exist in Web documents for different purposes, such as decorative, advertisement and logo images. We present a system that uses decision tree learning for automated rule induction for the content image detection system. The system uses visual features, text-related features and the document context of images in concert for fast and effective content image detection in Web documents. We have evaluated the system by collecting more than 1200 images from 4 different Web sites and we have achieved an overall classification accuracy of 84%.
doi:10.1117/12.304628 dblp:conf/drr/PaekS98 fatcat:2ukr6xh4ivd6dpnwomd6cxcepe