Learning non-taxonomic relationships from web documents for domain ontology construction
Data & Knowledge Engineering
In recent years, much effort has been put into ontology learning. However, the knowledge acquisition process is typically focused on the taxonomic aspect. The discovery of non-taxonomic relationships is often neglected, even though it is a fundamental point in structuring domain knowledge. This paper presents an automatic and unsupervised methodology that addresses the non-taxonomic learning process for constructing domain ontologies. It is able to discover domain-related verbs, extract
... mically related concepts and label relationships, using the Web as a corpus. The paper also discusses how the obtained relationships can be automatically evaluated against WordNet and presents encouraging results for several domains.

have appeared during the last decade (see a survey in ), most of them focus on the automatic acquisition of C and H and often neglect the importance of non-taxonomic interlinkage between concepts. In fact, the discovery of non-taxonomic relations is considered the least tackled problem within ontology learning. It appears to be the most intricate task because, in general, it is not known how many and what type of conceptual relationships should be modelled in a particular ontology.

In general, two tasks have to be performed for learning non-taxonomic relationships. On the one hand, one has to detect which concepts are related. On the other hand, one has to figure out how these concepts are related; thus, a name for the relation has to be found.

Considering the state of the art in non-taxonomic learning (discussed in Section 2), this paper presents a novel contribution in this area, proposing an automatic methodology for acquiring non-taxonomic relationships, framed in the context of domain ontology learning.

Like any other learning methodology, this data-driven knowledge acquisition process requires a source from which to extract relationships. In the past, this has typically been addressed by using domain texts, electronic dictionaries, semantic repositories (like WordNet), and structured and semi-structured information and data sources. Nowadays, with the enormous success of the Information Society, the Web has become an invaluable source of information for almost every possible domain of knowledge. This has motivated many researchers to start considering the Web as a valid repository for Information Retrieval and Knowledge Acquisition tasks.
However, the Web suffers from many problems that are not typically observed in the previously mentioned classical information repositories. Those sources are often structured in a meaningful organisation or selected by information engineers and, in consequence, one can assume the trustworthiness and validity of the information contained in them. In contrast, the Web presents a lack of structure, highly dynamic information, untrustworthy data sources and noise added by the visual representation, in addition to the ambiguity inherent to resources written in natural language.

Despite all these shortcomings, the Web also presents characteristics that can be interesting for knowledge acquisition. Due to its huge size and heterogeneity, it can be assumed that the Web approximates the real distribution of the information in humankind. From the learning point of view, this is a very interesting point. As will be discussed in Section 3, this is one of our motivations for using the Web as the source for knowledge acquisition.

In summary, this paper presents a novel approach for discovering non-taxonomic relationships from the Web. The method is able to discover relevant verbs for a domain, which are used as the knowledge base to learn and label non-taxonomic relationships automatically and without supervision. It uses the Web as the source of information from which to extract candidates and compute global-scale statistics about information distribution. The only restriction is that resources should be written in English, as the method relies on English language regularities. The approach presented here has been designed as an extension of previous work that covers the learning of taxonomic relationships. The final goal of our research is the construction of domain ontologies from scratch.

The rest of the paper is organised as follows. Section 2 presents an overview of previous research performed in the non-taxonomic learning area.
Section 3 describes the working environment (the Web) and the main knowledge acquisition techniques that form the basis of the present proposal. Section 4 briefly introduces the general approach for learning domain ontologies from the Web in which the present approach is framed. The non-taxonomic learning stage is extensively described in Section 5. Section 6 discusses some relevant issues about the learning, including bootstrapping techniques, dynamic adaptation of the analysed corpus and efficient analysis of web resources. Section 7 describes a novel automatic evaluation procedure (using WordNet-based relatedness measures) and the results obtained for several well-differentiated domains. The final section presents the conclusions and proposes some lines of future work.

Related work

There are several trends in learning non-taxonomic relationships from text depending on the degree of generality of the extracted relations. Some authors have developed approaches for learning specific relationships such as Part-of, Qualia, Telic and Agentive, Causation or a combination of them by using concrete linguistic patterns (e.g. X consists of Y, X is used for Y, X leads to Y). Even though those approaches may be of interest for developing or enriching general purpose semantic networks (such as WordNet), they are not able to retrieve domain-dependent relationships that are crucial for constructing domain ontologies.
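To illustrate the kind of pattern-based extraction mentioned above, the following minimal sketch matches the three example patterns (X consists of Y, X is used for Y, X leads to Y) against plain text. It is only an illustration under simplifying assumptions: the regular expressions, the `PATTERNS` table and the `extract_relations` function are hypothetical names introduced here, single words stand in for concepts, and the code does not reproduce any of the cited systems.

```python
import re

# Hypothetical, simplified lexical patterns: one word on each side of a fixed
# connective. Real systems would work on noun phrases, not single tokens.
PATTERNS = {
    "consists of": re.compile(r"\b(\w+) consists of (\w+)\b"),
    "is used for": re.compile(r"\b(\w+) is used for (\w+)\b"),
    "leads to": re.compile(r"\b(\w+) leads to (\w+)\b"),
}


def extract_relations(text):
    """Return (pattern, X, Y) triples for every match of a fixed pattern."""
    lowered = text.lower()
    triples = []
    for label, pattern in PATTERNS.items():
        for x, y in pattern.findall(lowered):
            triples.append((label, x, y))
    return triples


if __name__ == "__main__":
    sample = "Water consists of hydrogen. Smoking leads to cancer."
    for triple in extract_relations(sample):
        print(triple)
```

Because the set of connectives is fixed in advance, a sketch like this can only find the relation types it was written for, which is the limitation noted above for domain-dependent relationships.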