Robust and Noise Resistant Wrapper Induction

Tim Furche, Jinsong Guo, Sebastian Maneth, Christian Schallhart
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should not only select the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g.,
... ne generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPATH and show that even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, sometimes thousands, of days, despite many changes to the relative position of the selected nodes (including changes at the template level). This is due to the few and discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).

... semantic role of elements in the template. In HTML, these are typically expressed in id or class attributes and often used for styling and scripting. HTML5's Microdata adds further semantic attributes such as itemprop. Many templates also provide static labels, either visible or in the form of title tooltips. These criteria are designed to mimic human-created robust XPATH expressions. As an example, consider the following wrapper, extracting (span elements of) directors from IMDB movie pages:

    descendant::div[starts-with(.,"Director:")][1]/descendant::span

This XPATH query selects the first div-element with text-value of the form "Director: ...". Starting from this div-element, the query selects all descendant span-elements. There is only one such span-element, and this element contains precisely the director names of the movie. Thus, the above is an accurate wrapper for extracting director names.
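To make the semantics of this wrapper concrete, the following is a minimal sketch that mimics the query by hand with Python's standard library. It is not the authors' implementation. The page markup (`PAGE`) is a hypothetical, simplified stand-in rather than real IMDB HTML, and the helper names `string_value` and `first_director_spans` are assumptions introduced for illustration. The query is emulated in plain Python because `xml.etree.ElementTree` does not support `starts-with` in its XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a movie page (not real IMDB markup).
PAGE = (
    "<html><body>"
    "<div>Title: Some Movie</div>"
    "<div>Director: <span itemprop=\"name\">Jane Doe</span></div>"
    "<div>Writer: <span itemprop=\"name\">John Roe</span></div>"
    "</body></html>"
)


def string_value(node):
    """XPath string-value of an element: all descendant text, concatenated."""
    return "".join(node.itertext())


def first_director_spans(root):
    """Mimic descendant::div[starts-with(.,"Director:")][1]/descendant::span."""
    for div in root.iter("div"):  # iter() visits elements in document order
        if div is not root and string_value(div).startswith("Director:"):
            # [1]: stop at the first matching div, return all its spans.
            return list(div.iter("span"))
    return []


root = ET.fromstring(PAGE)
print([string_value(s) for s in first_director_spans(root)])  # ['Jane Doe']
```

On this sample page the first matching div contains a single span, so the sketch returns exactly the director name, which matches the accuracy argument above.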
However, this query is not robust against changes to the web page: (1) Imagine that more span-elements (containing non-director information) are inserted under the correct div-element. The wrapper would wrongly select all of them. (2) Imagine that more div-elements (without director information) are inserted before the div-element containing the director information. The wrapper would select the wrong div-element and return its contained span-elements, if any. Our approach does not attempt to build an accurate model of the changes made to a specific web site or class of web sites. Instead, we aim to model heuristics for building robust wrappers on any type of web site. What, then, is a robust wrapper for this example?

    descendant::div[starts-with(.,"Director:")]/descendant::span[@itemprop="name"]
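The difference between the two wrappers under change (1) can be sketched as follows. This is an illustrative emulation using only Python's standard library, not the paper's induction framework. The changed page (`CHANGED`), its extra span, and the function names `director_divs`, `fragile`, and `robust` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page after a template change: an extra, non-director span
# now sits inside the "Director:" div (scenario (1) in the text).
CHANGED = (
    "<html><body>"
    "<div>Director: <span itemprop=\"name\">Jane Doe</span>"
    "<span class=\"more\">See full crew</span></div>"
    "</body></html>"
)


def director_divs(root):
    """All descendant divs whose string-value starts with "Director:"."""
    return [d for d in root.iter("div")
            if d is not root and "".join(d.itertext()).startswith("Director:")]


def fragile(root):
    """Mimic descendant::div[starts-with(.,"Director:")][1]/descendant::span."""
    divs = director_divs(root)
    return list(divs[0].iter("span")) if divs else []


def robust(root):
    """Mimic descendant::div[starts-with(.,"Director:")]
    /descendant::span[@itemprop="name"]."""
    seen, out = set(), []
    for div in director_divs(root):
        for span in div.iter("span"):
            # Keep only spans carrying the semantic attribute; skip duplicates
            # that nested matching divs would otherwise produce.
            if span.get("itemprop") == "name" and id(span) not in seen:
                seen.add(id(span))
                out.append(span)
    return out


root = ET.fromstring(CHANGED)
print([s.text for s in fragile(root)])  # ['Jane Doe', 'See full crew']
print([s.text for s in robust(root)])   # ['Jane Doe']
```

The positional wrapper picks up the inserted span, while the wrapper anchored on the `itemprop` attribute still returns only the director name. The sketch covers change (1) only.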
doi:10.1145/2882903.2915214 dblp:conf/sigmod/FurcheGMS16 fatcat:un47r33zqfcvbi5nhyrie6t64i