Confidence Estimation Methods for Partially Supervised Relation Extraction
Proceedings of the 2006 SIAM International Conference on Data Mining
Text documents convey valuable information about entities and relations between entities that can be exploited in structured form for data mining, retrieval, and integration. A promising direction is a family of partially supervised relation extraction systems that require little manual training. However, the output of such systems tends to be noisy, and hence it is crucial to be able to estimate the quality of the extracted information. We present Expectation-Maximization algorithms for
... cally evaluating the quality of the extraction patterns and derived relation tuples. We demonstrate the effectiveness of our method on a variety of relations.

Overview

Text documents convey valuable structured information. For example, medical literature contains information about new treatments for diseases. More specifically, information extraction systems can identify particular types of entities (such as company, location, and person names) and relationships between entities (mergers and acquisitions of companies, locations of company headquarters, and names of company executives) in natural language text for storage and retrieval in a structured database. Once created, the database can be used to answer specific questions quickly and precisely by retrieving answers instead of complete documents, for sophisticated query processing, for integration with relational databases, and for traditional data mining tasks.

A fundamental problem in information extraction is how to train an extraction system for an extraction task of interest. Traditionally, this training required substantial human effort and hence the development of information extraction systems was generally expensive and time consuming. An attractive approach to reduce the training cost, pioneered by Brin, is to start with just a handful of "seed" tuples for the relation of interest, and automatically discover extraction patterns for the task. These patterns, in turn, help discover new tuples for the relation, which could be used as new seed tuples for a next iteration of the process. In practice, however, this bootstrapping approach requires distinguishing between valid and invalid tuples proposed by the system.

We present general Expectation-Maximization (EM) algorithms for estimating pattern and tuple confidence. Our specific contributions include:

• A formalization of the pattern confidence estimation problem (Section 2).
• A general EM-based method for estimating the confidence of automatically generated patterns and the extracted relation tuples (Section 3).

* Microsoft Research
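
The seed-driven bootstrapping loop described in the overview can be illustrated with a minimal Python sketch. This is not the system evaluated in the paper: the function names (`find_patterns`, `apply_patterns`, `bootstrap`), the toy sentences, and the crude rule that an entity is a run of capitalized words are all assumptions made for illustration. A pattern here is simply the literal text between the two arguments of a known tuple, and no confidence estimation is performed.

```python
import re

# Assumption: an "entity" is one or more capitalized words. Real systems
# would use a named-entity tagger instead of this crude rule.
ENTITY = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def find_patterns(sentences, tuples):
    """Collect the literal text found between the two arguments of known tuples."""
    patterns = set()
    for s in sentences:
        for a, b in tuples:
            i, j = s.find(a), s.find(b)
            # Require both arguments, in order, with non-empty text between them.
            if i != -1 and j != -1 and i + len(a) < j:
                patterns.add(s[i + len(a):j])
    return patterns


def apply_patterns(sentences, patterns):
    """Propose a tuple wherever a pattern sits between two entity-like phrases."""
    found = set()
    for p in patterns:
        rx = re.compile(ENTITY + re.escape(p) + ENTITY)
        for s in sentences:
            for m in rx.finditer(s):
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(sentences, seeds, iterations=2):
    """Alternate pattern discovery and tuple extraction, with no filtering."""
    tuples = set(seeds)
    for _ in range(iterations):
        patterns = find_patterns(sentences, tuples)
        tuples |= apply_patterns(sentences, patterns)
    return tuples


# Toy corpus (hypothetical sentences, not data from the paper).
sentences = [
    "Microsoft is headquartered in Redmond.",
    "Exxon is headquartered in Irving.",
    "Exxon, based in Irving, reported record earnings.",
    "Boeing, based in Chicago, announced new orders.",
    "Pilots, based in Denver, went on strike.",
]
result = bootstrap(sentences, {("Microsoft", "Redmond")})
```

Starting from a single seed, the second iteration learns a new pattern from a tuple found in the first, which in turn proposes both a valid tuple and the invalid tuple ("Pilots", "Denver"). Because this sketch accepts every proposed tuple, such errors would feed later iterations, which is the situation that motivates estimating pattern and tuple confidence.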