Automated ontology instantiation from tabular web sources—The AllRight system

Dietmar Jannach, Kostyantyn Shchekotykhin, Gerhard Friedrich
2009 Journal of Web Semantics  
The process of populating an ontology-based system with high-quality and up-to-date instance information can be both time consuming and prone to error. In many domains, however, one possible solution to this problem is to automate the instantiation process for a given ontology by searching (mining) the web for the required instance information. The primary challenges facing such a system include: (a) efficiently locating web pages that most probably contain the desired instance information, (b) extracting the instance information from a page, and (c) clustering documents that describe the same instance in order to exploit data redundancy on the web and thus improve the overall quality of the harvested data. In addition, these steps should require as little seed knowledge as possible. In this paper, the AllRight ontology instantiation system is presented, which supports the full instantiation life-cycle and addresses the above-mentioned challenges through a combination of new and existing techniques. In particular, the system was designed to deal with situations where the instance information is given in tabular form. The main innovative pillars of the system are a new high-recall focused crawling technique (xCrawl), a novel table recognition algorithm, innovative methods for document clustering and instance name recognition, as well as techniques for fact extraction, instance generation and query-based fact validation. The successful evaluation of the system in different real-world application scenarios shows that the ontology instantiation process can be successfully automated using only a very limited amount of seed knowledge.
This paper combines and significantly extends the work presented in [33] [34] [35].
doi:10.1016/j.websem.2009.04.002 fatcat:pxijgeakcvad3aeazbg56ansoi