ExWrap

Bethina Schmitt, Michael Christoffel, Jürgen Schneider
2002 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02  
{schmitt, christof, schneidj}@ipd.uka.de

MOTIVATION

The WWW offers many different information retrieval services, such as search engines, news archives, product catalogs, and literature services. Meta search systems support the user across these services and provide considerable benefits and synergies. For instance, a user query can be evaluated on a larger set of documents, and by applying duplicate detection, meta search systems can both improve the quality of the results and reveal different purchase options for the same document or product. Within the UniCats project [1] we are developing a meta search system based on digital library services such as bookstores, library OPACs, and archives of research papers.

Meta search systems need a wrapping component that provides mappings between the different query and result formats of the underlying retrieval systems and the internal query and result representation. Wrappers thus overcome syntactic and semantic heterogeneity. However, generating and maintaining wrappers is laborious and time-consuming, especially when they rely on the public web interfaces of the services. The challenge is therefore to develop a mechanism for fast wrapper generation and to make the generation process as simple as possible, because not only programmers but also librarians or even end users should be able to generate wrappers in order to create useful meta search systems.

IDEA AND DESIGN OF EXWRAP

The ExWrap toolkit meets these challenges with a "Wrapping by Example" approach: a user conducts a sample search within the retrieval service for which he wants to generate a wrapper. While formulating his search terms and browsing through the results, the user marks the pieces of information that the wrapper should later extract automatically. In the demonstration, we present our ExWrap toolkit [3] and show how to generate a wrapper for a typical online bookstore like Amazon, fairly quickly and without the need for any expert knowledge.

To issue the sample query, ExWrap supports navigation through HTML pages until the user has reached the page with the initial search form. ExWrap automatically extracts all available parameters and values, so the user can easily insert his search terms. Afterwards, the user can study the results (see Figure 1). ExWrap offers three kinds of views: a DOM-tree representation of the HTML code, a text-only view, and a typical browser view (Q-S).
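The extraction of search-form parameters can be illustrated with a small sketch. It is not ExWrap's implementation, which the text does not describe; it only assumes that the parameters are the named `input` and `select` fields of an HTML form. The class name `FormParameterExtractor`, the helper `extract_form_parameters`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class FormParameterExtractor(HTMLParser):
    """Collects named input and select fields for each form on a page.

    Hypothetical helper for illustration; not part of ExWrap.
    """

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> element
        self._current = None   # form currently being parsed, if any
        self._select = None    # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {
                "action": a.get("action") or "",
                "method": (a.get("method") or "get").lower(),
                "params": {},
            }
            self.forms.append(self._current)
        elif self._current is None:
            return  # ignore fields outside of any form
        elif tag == "input" and a.get("name"):
            # Free-text parameter: remember its default value.
            self._current["params"][a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            # Enumerated parameter: collect the allowed values.
            self._select = a["name"]
            self._current["params"][self._select] = []
        elif tag == "option" and self._select:
            self._current["params"][self._select].append(a.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            self._current = None


def extract_form_parameters(html):
    parser = FormParameterExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


# Made-up search form standing in for a bookstore's query page.
SAMPLE_PAGE = """
<html><body>
<form action="/search" method="GET">
  <input type="text" name="title" value="">
  <input type="text" name="author">
  <select name="format">
    <option value="any">Any</option>
    <option value="paperback">Paperback</option>
  </select>
  <input type="submit" value="Go">
</form>
</body></html>
"""

forms = extract_form_parameters(SAMPLE_PAGE)
print(forms[0]["params"])
```

The unnamed submit button is skipped because it contributes no query parameter. A tool like the one described would presumably present the remaining parameters to the user for filling in search terms.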
The user can mark and name the interesting pieces of information, e.g. title, author, year, price, time of delivery, ISBN, or summary. To do so, the user can navigate through the different levels of result pages. In Figure 1 (T) the user has already specified three attributes on the first level (root) and two attributes on the second level (details), and is about to define an ISBN attribute on the detail level.

The result of this sample search is not a wrapper program or any Java code but a source description file specific to the queried retrieval service, which is stored in XML format. Our UniCats wrapper [2] is designed as a combination of a "generic" wrapper component and a source description file that contains the specific properties of the retrieval service. For further information about our wrappers, wrapper generation, or other parts of the UniCats architecture, please visit our project homepage at http://www.unicats.de.
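The split between a generic wrapper component and a per-service XML source description can be sketched as follows. The XML schema shown here (`source`, `level`, `attribute`, `record`, `path`) is invented for illustration and is not the UniCats format, which the text does not specify. The sample result page is made up and deliberately well-formed so that the standard library can parse it; real result pages would need a tolerant HTML parser. Only the root level is modelled, and the detail level is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical source description: which nodes are records and where the
# named attributes sit inside each record.
SOURCE_DESCRIPTION = """
<source name="example-bookstore">
  <level name="root" record="./body/ul/li">
    <attribute name="title" path="./a"/>
    <attribute name="author" path="./span[@class='author']"/>
    <attribute name="price" path="./span[@class='price']"/>
  </level>
</source>
"""

# Made-up, well-formed result page standing in for a bookstore's hit list.
RESULT_PAGE = """
<html><body><ul>
  <li><a href="/b/1">Example Book One</a>
      <span class="author">A. Author</span>
      <span class="price">10.00</span></li>
  <li><a href="/b/2">Example Book Two</a>
      <span class="author">B. Writer</span></li>
</ul></body></html>
"""


def run_generic_wrapper(description_xml, page_xml, level_name="root"):
    """Extract records from a page using only the rules in the description."""
    description = ET.fromstring(description_xml)
    level = description.find(f"./level[@name='{level_name}']")
    if level is None:
        raise ValueError(f"no level named {level_name!r} in source description")

    page = ET.fromstring(page_xml)
    records = []
    for node in page.findall(level.get("record")):
        record = {}
        for attribute in level.findall("attribute"):
            hit = node.find(attribute.get("path"))
            # Missing attributes are reported as None instead of failing.
            record[attribute.get("name")] = (
                (hit.text or "").strip() if hit is not None else None
            )
        records.append(record)
    return records


records = run_generic_wrapper(SOURCE_DESCRIPTION, RESULT_PAGE)
for record in records:
    print(record)
```

In this sketch, supporting another retrieval service means writing a different description file while `run_generic_wrapper` stays unchanged, which is the property the generic-wrapper design aims for.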
doi:10.1145/564376.564493 dblp:conf/sigir/SchmittCS02 fatcat:ljuh4yl3efgezjfbsfixd6pt5m