FOCIH: Form-Based Ontology Creation and Information Harvesting [chapter]

Cui Tao, David W. Embley, Stephen W. Liddle
2009 Lecture Notes in Computer Science  
Creating an ontology and populating it with data are both labor-intensive tasks requiring a high degree of expertise. Thus, scaling ontology creation and population to the size of the web in an effort to create a web of data (which some see as Web 3.0) is prohibitive. Can we find ways to streamline these tasks and lower the barrier enough to enable Web 3.0? Toward this end we offer a form-based approach to ontology creation that provides a way to create Web 3.0 ontologies without the need for specialized training. And we offer a way to semi-automatically harvest data from the current web of pages for a Web 3.0 ontology. In addition to harvesting information with respect to an ontology, the approach also annotates web pages and links facts in web pages to ontological concepts, resulting in a web of data superimposed over the web of pages. Experience with our prototype system shows that mappings between conceptual-model-based ontologies and forms are sufficient for creating the kind of ontologies needed for Web 3.0, and experiments with our prototype system show that automatic harvesting, automatic annotation, and automatic superimposition of a web of data over a web of pages work well.

Keywords: ontology generation from forms, information harvesting from the web, automatic annotation of web pages, web of data, Web 3.0.

Researchers are interested both in easing the burden of ontology creation and in automatic semantic annotation:
- With regard to easing the burden of manual ontology creation (e.g., via Protege [23] or OntoWeb [28]), researchers are developing semi-automatic ontology generation tools. Tools such as OntoLT [6], Text2Onto [9], OntoLearn [22], and KASO [35] use machine learning methods to generate an ontology from natural-language text. These tools usually require a large training corpus, and, so far, the results are not very satisfactory [24]. Tools such as OntoBuilder [12], TANGO [32], and the ones developed by Pivk et al. [24] and Benslimane et al. [5] use structured information (HTML tables and forms) as a source for learning ontologies. Structured information makes it easier to interpret new items and relations. These approaches, however, derive concepts and relationships among concepts from source data, not from users, and thus do not provide the control some users need to express the ontological world-views they desire.
- With regard to enabling automatic annotation, typical approaches (e.g., [2, 4, 8, 10, 15, 17, 21, 33]) base their work on information extraction [26]. Post-extraction alignment with ontologies, however, is their main drawback [17]. A way to overcome this drawback is through "extraction ontologies": ontologies with data recognizers that are able to directly and automatically extract and thus annotate data with respect to specified ontologies (e.g., [11, 18, 19]). Extraction ontologies, however, rely on human expertise to manually create, assemble, and tune reference sets and data recognizers. In another direction that tends to overcome both the alignment drawback and the manual-creation drawback, researchers propose structuring unstructured data for query purposes [7] or doing "best-effort" information extraction [27]. These approaches, however, yield less precise results both for the ontological structure of the data and for the annotation of the data with respect to the ontological structure.
doi:10.1007/978-3-642-04840-1_26 fatcat:zi35uea7h5hg3cqerzsohxzo6e