Taxonomy induction based on a collaboratively built knowledge repository

Simone Paolo Ponzetto, Michael Strube
Artificial Intelligence 175 (2011) 1737-1756
The category system in Wikipedia can be taken as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. The result is a large-scale taxonomy. For evaluation we propose a method which (1) manually assesses the quality of our taxonomy, and (2) automatically compares its coverage with ResearchCyc, one of the largest manually created ontologies, and with the lexical database WordNet. Additionally, we
perform an extrinsic evaluation by computing semantic similarity between words in benchmarking datasets. The results show that the taxonomy compares favorably in quality and coverage with broad-coverage manually created resources.

The main contributions of this work are as follows.

1. We propose to derive a taxonomy from the system of categories in Wikipedia. This amounts to transforming the Wikipedia categorization system into a full-fledged subsumption hierarchy such as the ones found in Cyc [50] and WordNet [28].

2. We develop a set of lightweight heuristics to automatically distinguish isa and notisa relations between the categories in Wikipedia. Our method works by capturing linguistic regularities in category labels (syntax-based methods, Section 2.3), exploiting naming conventions and connectivity in the graph (connectivity-based methods, Section 2.4), and mining large corpora for patterns expressing semantic relations (lexico-syntactic based methods, Section 2.5). The result is a large-scale taxonomy including 335,128 semantic links.

3. We perform an evaluation which (1) determines the quality of our taxonomy based on human assessment, and (2) automatically compares its coverage with ResearchCyc and WordNet, arguably two of the largest manually annotated knowledge bases. For the manual evaluation we report an F1 measure of up to 84%. For the automatic evaluation of coverage, we develop a taxonomy mapping method based on the syntactic structure of the Wikipedia category labels. This evaluation shows that there is little overlap in terms of concept relations between our taxonomy and ResearchCyc or WordNet, which indicates that Wikipedia complements those resources: compared with ResearchCyc our taxonomy provides 28.2% extra coverage, compared with WordNet 211.6%.

4. We extrinsically evaluate the resource by computing the semantic similarity of word pairs on benchmarking datasets and improve our previous results from [104] by a large margin.
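Taxonomy-based semantic similarity of the kind used in this extrinsic evaluation is commonly derived from the distance between nodes in the hierarchy. The following is a minimal sketch using a toy isa graph and a simple 1/(1 + distance) score; both the graph and the measure are illustrative stand-ins, not the authors' Wikipedia-derived taxonomy or their actual similarity measure.

```python
from collections import deque

# Toy isa taxonomy (child -> parents); illustrative only.
ISA = {
    "microsoft": ["multinational companies"],
    "google": ["multinational companies"],
    "multinational companies": ["companies"],
    "companies": ["organizations"],
    "charities": ["organizations"],
}

def shortest_path_len(a, b):
    """BFS over isa edges (treated as undirected) between two nodes."""
    if a == b:
        return 0
    adj = {}
    for child, parents in ISA.items():
        for p in parents:
            adj.setdefault(child, set()).add(p)
            adj.setdefault(p, set()).add(child)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in adj.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # nodes not connected

def path_similarity(a, b):
    """Simple 1 / (1 + path length) similarity."""
    dist = shortest_path_len(a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

print(path_similarity("microsoft", "google"))     # siblings: higher score
print(path_similarity("microsoft", "charities"))  # farther apart: lower score
```

Sibling categories such as microsoft and google end up two edges apart and score higher than pairs connected only through a distant common ancestor.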
The results obtained by using the taxonomy for computing semantic similarity are competitive with the best ones from the literature, i.e. up to a Pearson correlation coefficient r of 0.87, and lie near the estimated upper bound of performance for this task.

The remainder of this article is structured as follows: in Section 2 we present our methods for generating a subsumption hierarchy from the network of categories in Wikipedia. In Section 3 we evaluate the automatically generated taxonomy by comparing it with ResearchCyc and WordNet, as well as by computing semantic similarity between words in benchmarking datasets. We finally present related work in Section 4 and conclude with suggestions for future work in Section 5.

2. Methods

Since May 2004 Wikipedia has allowed for structured access by means of categories.[2] The categories form a graph which can be taken to represent a conceptual network with unspecified semantic relations [104, 85]. In this section we present our methods to derive isa and notisa relations from these generic links. This allows us to generate a taxonomy from the Wikipedia category graph by performing the following task: for each pair of categories (Subcat, Supercat) where Subcat[3] is categorized into Supercat, decide whether Subcat isa Supercat or not. This aims at transforming a graph with unlabeled semantic relations into a semantic network where the links between categories are augmented with isa relations.

The Wikipedia category network contains categories which are used to refer either to an entity, e.g. the Microsoft category, or to a property of a set of entities, e.g. Multinational companies. Accordingly, the relation between a category and its super-categories can be either one of subsumption (i.e. a concept-to-concept strict IS-A relation) or one of instantiation (i.e. an entity-to-concept INSTANCE-OF relation).
In this work we do not distinguish categories that are classes from categories that are entities: we therefore use a definition of isa which includes both the IS-A and INSTANCE-OF relations. This is similar to the semantics of the subsumption relation found in WordNet prior to version 2.1. Although this is not methodologically adequate [73], it represents a valid step toward generating a taxonomy from the category network. As in the case of WordNet [63], the distinction between classes and instances can be added to the generated taxonomy later [114]. The same considerations apply to the notisa relation: although it does not carry any semantics per se, i.e. it simply refers to 'what is not in an isa relation', it allows us to concentrate on generating a core subsumption hierarchy and does not rule out the generation of more specific relations, e.g. part-of, located-in, etc., at a later stage [69].

The pseudocode of our method is shown in Algorithm 1. We start with the unlabeled category graph found in Wikipedia and remove from it all nodes which refer to categories used for the administration of the Wikipedia project (lines 1-6, Section 2.1). We then collect all remaining nodes and edges and build an initial taxonomy graph which assigns a default notisa relation to all category pairs (lines 7-12). Finally, given a set of processing components (described in Sections 2.2-2.6), we generate the isa relations by performing a cascade of tests on the category pairs which, at each step, have not yet been discovered as being in an isa relation (lines 13-18). For each processing component, we collect all edges in the taxonomy graph labeled with a notisa relation and test them for an isa semantic relation. The algorithm returns the taxonomy graph (line 19); category pairs for which no isa relation could be acquired retain the default notisa relation. The order of the processing components is motivated by the size of Wikipedia.
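The cascade described above can be sketched as follows. The administrative-category filter, the processing components, and the input edges here are placeholder stand-ins, not the authors' actual heuristics or data; only the overall control flow mirrors Algorithm 1.

```python
# Sketch of the taxonomy-induction cascade (Algorithm 1): start from the
# cleaned category graph with every edge labeled notisa, then let each
# processing component in turn try to promote remaining notisa edges to isa.

ADMIN_CATEGORIES = {"Wikipedia maintenance"}  # illustrative admin nodes

def is_admin(category):
    return category in ADMIN_CATEGORIES

def build_taxonomy(category_edges, components):
    """category_edges: iterable of (subcat, supercat) pairs.
    components: ordered list of tests (subcat, supercat) -> bool,
    applied as a cascade from cheapest to most expensive."""
    # lines 1-6: remove administrative categories
    edges = [(sub, sup) for sub, sup in category_edges
             if not is_admin(sub) and not is_admin(sup)]
    # lines 7-12: initial taxonomy graph, default label notisa
    labels = {edge: "notisa" for edge in edges}
    # lines 13-18: cascade of tests over edges still labeled notisa
    for component in components:
        for edge in edges:
            if labels[edge] == "notisa" and component(*edge):
                labels[edge] = "isa"
    return labels  # line 19: unresolved pairs keep the default notisa

# Placeholder syntax-based component: label isa if the supercategory
# label equals the head (last token) of the subcategory label, e.g.
# "Multinational companies" isa "Companies".
def head_match(sub, sup):
    return sub.lower().split()[-1] == sup.lower()

edges = [("Multinational companies", "Companies"),
         ("Microsoft", "Wikipedia maintenance"),
         ("Companies", "Law")]
print(build_taxonomy(edges, [head_match]))
```

Running the sketch drops the administrative edge, labels the head-matching pair isa, and leaves the remaining pair at the default notisa, exactly the behavior lines 1-19 of the algorithm describe.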
We start with lightweight heuristics which reduce the number of categories to be processed by the subsequent modules: these generate isa relations by analyzing the syntactic structure of the category labels.

[2] Wikipedia can be downloaded at http://download.wikimedia.org. In our experiments we use the English Wikipedia database dump from March 12, 2008. This includes 2,276,274 articles, 99.1% of which are categorized into 337,741 categories.
[3] We use Sans Serif for words and queries, CAPITALS for Wikipedia pages and Small Caps for Wikipedia categories.

Lemma: The lemma is the canonical form of a word, e.g. the infinitive of inflected verbs or the singular of nouns. A lemmatizer automatically determines the lemma of a word.

Stem: The stem (or root) is the part of the word which does not change when the word is inflected or its word class changes, e.g. contain- is the stem of the words contained and container. A stemmer automatically reduces a word to its stem.

Part of speech: The part of speech (POS) of a word is its word class, i.e. noun, verb, determiner, adjective, etc. The set of available parts of speech is generally language specific and is provided by reference corpora, e.g. the Penn Treebank for English [56]. A POS tagger automatically labels words with their parts of speech.

Chunk: A chunk is a segment of a sentence that identifies a basic non-recursive phrase corresponding to one of the major parts of speech: noun phrases (NPs), verb phrases (VPs), adjective phrases (APs) or prepositional phrases (PPs) [106]. In contrast to traditional phrase structures, chunks build flat, i.e. non-hierarchical and non-overlapping, sequences. A chunker segments sequences of words into chunks and labels them as NP chunk, VP chunk, etc.

Parse: Syntactic structures of sentences are typically assumed to have a hierarchical representation in the form of a tree. A parse is the syntactic tree of a sentence.
A parser determines this structure automatically.

Head, modifier: Syntactic phrases consist of one (lexical) head and possibly of modifiers. The head of a phrase is the word which is grammatically most important in the phrase, since it determines the nature of the overall phrase [89]. For instance, the head of a verb phrase is a verb, the head of a noun phrase is a noun. Modifiers are optional elements of phrases; they can be words, phrases or clauses.

Named entity: An entity to which one or more rigid designators [46] can be used to refer, e.g. the software company created by Bill Gates in 1975 can be referred to as Microsoft or Microsoft Corporation. Following the terminology found in some ontological analysis studies, e.g. OntoClean [36], we sometimes also refer to named entities as individuals. The recognition of proper names of persons (Bill Gates), geographical entities (Redmond, Washington) and organizations (Microsoft) is an important task in computational linguistics. A named entity recognizer performs this task automatically.

Word sense: Words can have different meanings depending on their context of occurrence, e.g. star can be used to refer to an astronomical object, an actor, the state of being prominent, etc. Typically, sense inventories are obtained from semantic lexica such as WordNet. Word senses from WordNet can be denoted with a superscript indicating the sense number (ordered by frequency of occurrence in the manually sense-tagged SemCor corpus [64]) and a subscript indicating the word class, e.g. star^1_n, star^4_n, star^1_v. The task of automatically determining word senses is called word sense disambiguation [71].
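Several of these notions combine in the syntax-based labeling of category pairs: the head of a category label is identified and, after lemmatization, compared with a candidate supercategory. A minimal sketch with pre-tagged input and a deliberately naive plural lemmatizer follows; a real pipeline would use a POS tagger, chunker, and full lemmatizer as described above, so every component here is a toy stand-in.

```python
def naive_lemma(noun):
    """Toy plural-to-singular lemmatizer; a real system would use a
    full lemmatizer rather than these two suffix rules."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun

def head_of(tagged_label):
    """Head of a noun-phrase chunk, approximated as its last noun.
    tagged_label: list of (token, POS) pairs with Penn Treebank tags."""
    nouns = [tok for tok, pos in tagged_label if pos.startswith("NN")]
    return nouns[-1] if nouns else None

# "Multinational companies", pre-tagged with Penn Treebank POS labels:
# JJ = adjective (modifier), NNS = plural noun (head).
label = [("Multinational", "JJ"), ("companies", "NNS")]
print(head_of(label))               # companies
print(naive_lemma(head_of(label)))  # company
```

The lemmatized head company can then be compared against a supercategory label such as Companies, which is the intuition behind the head-matching heuristics of the syntax-based methods.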
doi:10.1016/j.artint.2011.01.003