Querying websites using compact skeletons

Anand Rajaraman, Jeffrey D. Ullman
Journal of Computer and System Sciences, 2003
Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the
scalability of the system. We introduce a website structure inference mechanism called compact skeletons that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.

This three-step approach is commonly used in industry. For example, Whizbang! Labs [35] calls it the "C4 technique" (where the four C's are crawl, classify, capture, and compile; we do not include crawling in our taxonomy), while Junglee's Virtual Database Management System [20] has components called extractors, wrappers, and mappers corresponding to these three steps. A simple way to tackle problem (1) is to use a library of patterns (such as regular expressions). There are several approaches to constructing such patterns: by hand, by studying several examples [19]; by machine learning techniques; and by more novel pattern extraction techniques [7]. Our work deals with problems (2) and (3).

Once we have identified the patterns of interest on the pages of a website, we can model the website as a directed graph with data elements at the nodes. We assume that the domains of the schema attributes are pairwise disjoint, so that we can unambiguously associate each data value with its corresponding attribute. There are (unlabeled) arcs in the graph corresponding both to structure within a web page (in the case where we identify multiple data elements within a web page) and to hyperlinks between pages. We call such a graph a data graph; Fig. 1 is an example. In the rest of the paper, we model websites as data graphs. Data graphs have been used extensively in the literature to model semistructured data, e.g., in [1, 6, 8-11, 26, 30].

Compact skeletons are labeled trees that function as transformations between data graphs and relations. Intuitively, a compact skeleton describes the hierarchical layout of the corresponding website: for example, the IBM site groups jobs first by division (D), and each listing includes a job id (I), a job title (T), a job category (C), and the state in which the job is located (S). The job title is hyperlinked to details about the job (J) and an address to which resumes can be sent to apply for the job (A). This hierarchy is captured by the corresponding compact skeleton, shown in Fig. 15(a). Compact skeletons are a natural extension of Junglee's Site Description Language (SDL) [19], which has been used to construct thousands of wrappers for Junglee's VDBMS [20]. We describe the relationship between SDL and compact skeletons in Section 9.
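To make the preceding intuition concrete, the following sketch encodes the IBM-jobs example as simple Python structures: a compact skeleton written as a child map, and a small fragment of a data graph. The DataNode class, the child-map encoding, and all attribute values are illustrative assumptions, not the paper's notation or implementation.

# Compact skeleton for the running example, written as a map from each
# attribute to the attributes nested one level below it on the website.
skeleton = {
    "D": ["I", "T", "C", "S"],  # a division groups listings: job id, title, category, state
    "T": ["J", "A"],            # the title links to job details and an application address
}

class DataNode:
    """A data-graph node: a data value tagged with its schema attribute, plus
    unlabeled arcs to nodes reachable via page structure or hyperlinks."""
    def __init__(self, attr, value, children=()):
        self.attr, self.value, self.children = attr, value, list(children)

# A tiny data-graph fragment: one division page with a single job listing
# (all values are invented for illustration).
title = DataNode("T", "Compiler Engineer",
                 [DataNode("J", "Design and tune the optimizer back end."),
                  DataNode("A", "resumes@example.com")])
division = DataNode("D", "Software Group",
                    [DataNode("I", "12345"), title,
                     DataNode("C", "Engineering"), DataNode("S", "NY")])

print(division.attr, "->", [c.attr for c in division.children])  # D -> ['I', 'T', 'C', 'S']

Intuitively, such a skeleton induces a relational view over the attributes D, I, T, C, S, J, and A, with values that are absent from the data graph surfacing as nulls; the paper makes this transformation precise in the sections that follow.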
The rest of this paper is organized as follows. In Section 2 we introduce compact skeletons and analyze the properties of perfect compact skeletons (PCS), which apply when the data graph has complete information. In Section 3 we relax the completeness condition and introduce partially perfect compact skeletons (PPCS), which apply when the data graph has incomplete information, corresponding to null values in relations as in Fig. 1. For a given data graph, we show that the PCS is unique but the PPCS is not; we introduce the notions of minimal and maximal PPCS, which provide upper and lower bounds on the relation associated with the data graph. We describe polynomial-time algorithms to compute the PCS and the minimal and maximal PPCS. In Section 4 we present algorithms for querying websites given a compact skeleton; a special case is to materialize the entire relation corresponding to the website. Real-life websites often contain noise (i.e., superfluous information) in addition to incomplete information. In Sections 5 and 6 we study best-fit skeletons (BFS), which apply in such cases. It turns out that computing the BFS is an NP-complete problem. We examine two simple polynomial-time heuristics, the greedy and the weighted greedy; experimental results show that these heuristics work well in practice. In Section 7 we discuss some of the practical issues that arise when applying the skeleton technique, such as websites that use form inputs. The skeletons we consider in Sections 2-7 are restricted to be labeled trees. Section 8 extends the theory to graph skeletons, in which we permit skeletons to be arbitrary labeled graphs. We show that the PCS remains unique and provide a non-deterministic polynomial-time algorithm to compute it.
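As a rough illustration of the best-fit idea (not the paper's algorithm or objective function), the sketch below scores how well a candidate skeleton explains a data graph: each attribute-to-attribute nesting arc observed in the graph either matches a parent/child edge of the skeleton or is treated as noise. It reuses the DataNode, skeleton, and division names from the earlier sketch; all of them are illustrative assumptions.

def nesting_arcs(node):
    """Yield (parent_attribute, child_attribute) pairs for every arc of the
    data graph whose endpoints both carry attribute values."""
    for child in node.children:
        if node.attr and child.attr:
            yield (node.attr, child.attr)
        yield from nesting_arcs(child)

def fit_score(root, candidate_skeleton):
    """Fraction of observed nesting arcs explained by the candidate skeleton.
    A skeleton that fits perfectly explains every arc; noisy or superfluous
    structure lowers the score. This toy measure only gives the flavor of
    measuring fit; finding a best-fit skeleton in the paper's formulation is
    NP-complete, hence the greedy and weighted greedy heuristics."""
    arcs = list(nesting_arcs(root))
    explained = sum(1 for parent, child in arcs
                    if child in candidate_skeleton.get(parent, []))
    return explained / len(arcs) if arcs else 1.0

print(fit_score(division, skeleton))  # 1.0: the toy fragment above is noise-free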
doi:10.1016/s0022-0000(03)00029-1