OLERA: Semisupervised Web-Data Extraction with Visual Support

Chia-Hui Chang, Shih-Chien Kuo
2004 IEEE Intelligent Systems  
OLERA is a semisupervised information-extraction system that produces extraction rules from semistructured Web documents without requiring detailed annotation of the training documents. It performs well for program-generated Web pages with few training pages and limited user intervention.

The World Wide Web's explosive growth and popularity have resulted in countless information sources on the Internet. However, due to the heterogeneity and lack of structure in Web information sources, information-integration systems and software agents (and sometimes humans as well) must expend a great deal of effort when manipulating various data formats.

The problem of translating the content of input documents into structured data is called information extraction (IE). An IE task is defined by its extraction target and input. Its extraction target is generally considered a relation of k-tuples, where k is the number of attributes in a record of the desired data. An attribute may have zero (missing) or multiple instantiations in a record, and the extraction task will fill either a single slot (where k equals 1) or multiple slots.

Programs that perform IE tasks are referred to as extractors or wrappers. A wrapper is generally a pattern-matching procedure that relies on a set of extraction rules. The simplest way to produce extractors is to have a human observe the input documents and write extraction rules, but this requires a certain degree of programming expertise. It's also time consuming, error prone, and not scalable. IE systems can instead generate wrappers that receive input documents and convert them into structured data.

We can categorize most IE systems (such as WIEN (Wrapper Induction Environment) [1], Softmealy [2], and Stalker [3]) as supervised machine learning, because they require "labeled training examples" to tell the IE system what constitutes a record. By comparing the preceding and succeeding strings of several extraction examples, IE systems can learn the common landmarks as extraction patterns for each attribute and the record boundary. However, the labeled training examples require users to annotate the input documents, which can be tedious even for a small corpus of training documents. IE systems that use unlabeled training examples are attractive by comparison but can accept only specific kinds of input, such as program-generated pages, under certain assumptions. (See the sidebar for a more detailed comparison of these approaches.)
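The landmark idea described above for supervised systems can be illustrated with a minimal Python sketch. It is not OLERA's algorithm, nor the actual procedure of WIEN, Softmealy, or Stalker: it assumes the simplest possible rule form, one fixed left landmark and one fixed right landmark per attribute, and the function names (`learn_rule`, `extract`) and the sample pages are hypothetical.

```python
import os
from typing import List, Tuple

# A labeled example: (page text, start offset, end offset) of one attribute value.
Example = Tuple[str, int, int]
# A rule: (left landmark, right landmark) surrounding the value.
Rule = Tuple[str, str]


def common_suffix(strings: List[str]) -> str:
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_rule(examples: List[Example]) -> Rule:
    """Compare the text before and after each labeled value.

    The left landmark is the common suffix of the preceding strings;
    the right landmark is the common prefix of the succeeding strings.
    """
    left = common_suffix([page[:start] for page, start, _ in examples])
    right = os.path.commonprefix([page[end:] for page, _, end in examples])
    if not left or not right:
        raise ValueError("examples share no common landmark")
    return left, right


def extract(page: str, rule: Rule) -> List[str]:
    """Return every substring of `page` enclosed by the rule's landmarks."""
    left, right = rule
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Made-up training pages with the title labeled by character offsets.
page1 = "<li><b>Silent Night</b> <i>$9.99</i></li>"
page2 = "<li><b>Jingle Bells</b> <i>$12.50</i></li>"
s1 = page1.index("Silent Night")
s2 = page2.index("Jingle Bells")
examples = [
    (page1, s1, s1 + len("Silent Night")),
    (page2, s2, s2 + len("Jingle Bells")),
]
title_rule = learn_rule(examples)

# Apply the learned rule to an unseen (also made-up) page.
new_page = (
    "<ul><li><b>O Holy Night</b> <i>$7.00</i></li>"
    "<li><b>Joy to the World</b> <i>$8.25</i></li></ul>"
)
titles = extract(new_page, title_rule)
```

The sketch also shows why labeling is the costly part: every training value needs exact offsets, which is the annotation burden that semisupervised and unsupervised approaches try to reduce.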
We propose a semisupervised IE system, OLERA (On-Line Extraction Rule Analysis), that lets users train extraction rules from Web pages with minimal effort. OLERA offers visual interaction by displaying discovered records in a spreadsheet-like table for schema assignment.

System framework

We introduce OLERA from the users' viewpoint; that is, we explain how users interact with OLERA to generate extraction rules for their targets of interest. Instead of labeling training pages, users enclose an information block of interest and then specify relevant information slots for each field in the record (see Figure 1).

Enclosing a data block

Given a set of training pages, an OLERA user first encloses a block that's large enough to contain one record of interest as an example. The user doesn't need to label the block's detailed subsegments to indicate the locations of titles, authors, or prices. The labeling work is delayed until OLERA generates the extraction pattern. In addition, the user needn't enclose every record of interest in the training page. The system can automatically discover other records that resemble the enclosed example and present the data in a spreadsheet for attribute designation. To illustrate, suppose we're interested in the main search result for Christmas songs. We can enclose
doi:10.1109/mis.2004.71 fatcat:7kpqyp7mjjaoto3oei3qsikc34