Ontology-based extraction and structuring of information from data-rich unstructured documents

David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle
1998 Proceedings of the seventh international conference on Information and knowledge management - CIKM '98  
We can extract and structure information from documents if we can match attributes with document data values and associate these matched attribute-value pairs as tuples in relations. In this paper we present a general approach to extracting and structuring information from unstructured documents that are data rich, that is, documents that have many recognizable constants. In our approach to this problem we start with an application ontology that describes the objects, relationships, and constraints in a domain of interest. We parse this ontology to generate recognition rules for constants and context keywords and to extract structural and constraint information. Given the generated rules and an unstructured document, we apply a recognizer to extract the constants and keywords, and we then apply a structure builder to match constant values with attributes, to associate attribute-value pairs as relations, and to populate a generated database schema with the extracted data according to the constraints of the application ontology. When applied to a list of several similar unstructured documents, the result is a populated database structured according to and filtered with respect to the application ontology. To make our approach general, we fix all the processes and change only the ontological description for a different application domain. In experiments we conducted on two different types of unstructured documents taken from the Web, our approach attained recall ratios in the 80% and 90% range and precision ratios near 98%.
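The pipeline described in the abstract (ontology, generated recognition rules, recognizer, structure builder) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `ONTOLOGY` dictionary, its attribute names, the regular expressions, the `max` cardinality field, and the sample advertisement are all hypothetical, chosen only to show the flow. Context keywords and relationship handling are omitted for brevity.

```python
import re

# Hypothetical application ontology (an assumption for illustration):
# each attribute has a pattern for its constants and a maximum number
# of values allowed per record (a simple cardinality constraint).
ONTOLOGY = {
    "Year":  {"pattern": r"\b(19\d{2}|20\d{2})\b", "max": 1},
    "Price": {"pattern": r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)", "max": 1},
    "Phone": {"pattern": r"\b(\d{3}-\d{4})\b", "max": 2},
}


def generate_rules(ontology):
    """Parse the ontology into compiled recognition rules, one per attribute."""
    return {attr: re.compile(spec["pattern"]) for attr, spec in ontology.items()}


def recognize(rules, document):
    """Apply every rule to the document; return hits ordered by position."""
    hits = []
    for attr, rule in rules.items():
        for match in rule.finditer(document):
            hits.append((match.start(), attr, match.group(1)))
    return sorted(hits)


def build_record(ontology, hits):
    """Pair constants with attributes, honoring each attribute's 'max' limit."""
    record = {}
    for _, attr, value in hits:
        values = record.setdefault(attr, [])
        if len(values) < ontology[attr]["max"]:
            values.append(value)
    return record


if __name__ == "__main__":
    ad = "1995 Honda Civic, asking $4,500 obo. Call 555-1234 or 555-9876."
    rules = generate_rules(ONTOLOGY)
    print(build_record(ONTOLOGY, recognize(rules, ad)))
```

In this sketch, moving to a different domain would mean replacing only `ONTOLOGY`; the three functions stay fixed, which mirrors the generality claim in the abstract.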
doi:10.1145/288627.288641 dblp:conf/cikm/EmbleyCSL98 fatcat:g3wdosbfovd33kluj7vir6vose