Automatic annotation of data extracted from large Web sites

Luigi Arlotta, Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
2003 International Workshop on the Web and Databases  
Data extraction from web pages is performed by software modules called wrappers. Recently, several systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they produce a common wrapper that extracts the relevant data. However, because the approach is fully automatic, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach is general and can therefore be applied together with other wrapper generation systems. We have tested the prototype on several real-life web sites, obtaining encouraging results.
dblp:conf/webdb/ArlottaCMM03 fatcat:ft35urjupjdf5kxlt2qie74434