OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction
In 2004 we published in this journal an article describing OntoLearn, one of the first systems to automatically induce a taxonomy from documents and Web sites. Since then, OntoLearn has continued to be an active area of research in our group and has become a reference work within the community. In this paper we describe our next-generation taxonomy learning methodology, which we name OntoLearn Reloaded. Unlike many taxonomy learning approaches in the literature, our novel algorithm learns both
... orithm learns both concepts and relations entirely from scratch via the automated extraction of terms, definitions, and hypernyms. This results in a very dense, cyclic and potentially disconnected hypernym graph. The algorithm then induces a taxonomy from this graph via optimal branching and a novel weighting policy. Our experiments show that we obtain high-quality results, both when building brand-new taxonomies and when reconstructing sub-hierarchies of existing taxonomies. Computational Linguistics Volume 39, Number 3 domain and the language used to express domain meanings within text. And, in turn, this connection can be established by producing full-fledged lexicalized ontologies for the domain of interest. Manually constructing ontologies is a very demanding task, however, requiring a large amount of time and effort, even when principled solutions are used (De Nicola, Missikoff, and Navigli 2009) . A quite recent challenge, referred to as ontology learning, consists of automatically or semi-automatically creating a lexicalized ontology using textual data from corpora or the Web (Gomez-Perez and Manzano-Mancho 2003; Biemann 2005; Maedche and Staab 2009; Petasis et al. 2011) . As a result of ontology learning, the heavy requirements of manual ontology construction can be drastically reduced. In this paper we deal with the problem of learning a taxonomy (i.e., the backbone of an ontology) entirely from scratch. Very few systems in the literature address this task. OntoLearn (Navigli and Velardi 2004) was one of the earliest contributions in this area. In OntoLearn taxonomy learning was accomplished in four steps: terminology extraction, derivation of term sub-trees via string inclusion, disambiguation of domain terms using a novel Word Sense Disambiguation algorithm, and combining the subtrees into a taxonomy. The use of a static, general-purpose repository of semantic knowledge, namely, WordNet (Miller et al. 1990; Fellbaum 1998), prevented the system from learning taxonomies in technical domains, however. In this paper we present OntoLearn Reloaded, a graph-based algorithm for learning a taxonomy from the ground up. OntoLearn Reloaded preserves the initial step of our 2004 pioneering work (Navigli and Velardi 2004) , that is, automated terminology extraction from a domain corpus, but it drops the requirement for WordNet (thereby avoiding dependence on the English language). It also drops the term compositionality assumption that previously led to us having to use a Word Sense Disambiguation algorithm-namely, SSI (Navigli and Velardi 2005)-to structure the taxonomy. Instead, we now exploit textual definitions, extracted from a corpus and the Web in an iterative fashion, to automatically create a highly dense, cyclic, potentially disconnected hypernym graph. An optimal branching algorithm is then used to induce a full-fledged treelike taxonomy. Further graph-based processing augments the taxonomy with additional hypernyms, thus producing a Directed Acyclic Graph (DAG). Our system provides a considerable advancement over the state of the art in taxonomy learning: r First, excepting for the manual selection of just a few upper nodes, this is the first algorithm that has been experimentally shown to build from scratch a new taxonomy (i.e., both concepts and hypernym relations) for arbitrary domains, including very technical ones for which gold-standard taxonomies do not exist. r Second, we tackle the problem with no simplifying assumptions: We cope with issues such as term ambiguity, complexity of hypernymy patterns, and multiple hypernyms.