Automatic expansion of domain-specific lexicons by term categorization

Henri Avancini, Alberto Lavelli, Fabrizio Sebastiani, Roberto Zanoli
2006 ACM Transactions on Speech and Language Processing  
We discuss an approach to the automatic expansion of domain-specific lexicons by means of term categorization, a novel task employing techniques from information retrieval and machine learning. Specifically, we view the expansion of such lexicons as a process of learning previously unknown associations between terms and domains (i.e., disciplines, or fields of activity). The process generates, for each $c_i$ in a set $C = \{c_1, \ldots, c_m\}$ of domains, a lexicon $L_1^i$, bootstrapping from an ... lexicon $L_0^i$ and a set of documents θ given as input. The method is inspired by text categorization, the discipline concerned with labeling natural language texts with labels from a predefined set of domains, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, we formulate the task of term categorization as one in which terms are (dually) represented as vectors in a space of documents, and in which terms (instead of documents) are labeled with domains. As a learning device we adopt a boosting-based method, since boosting (a) has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) naturally allows for a form of "data cleaning", thereby making the process of generating a lexicon an iteration of generate-and-test steps. We present the results of a number of experiments using a set of domain-specific lexicons called WordNetDomains (which actually consists of an extension of WordNet), and performed using the documents in the Reuters Corpus Volume 1 as "implicit" representations for our terms.

Footnote 3: "Bag" is used here in its set-theoretic meaning, as a synonym of multiset, i.e., a set in which the same element may occur several times. In text indexing, adopting a "bag of words" model means assuming that the number of times that a given word occurs in the same document is semantically significant. "Set of words" models, in which this number is assumed not significant, are thus particular instances of bag of words models.
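The dual representation described in the abstract (terms as vectors in a space of documents, rather than documents as vectors in a space of terms) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy documents, the whitespace tokenization, and the function names `doc_term_counts` and `term_vectors` are all assumptions made for illustration, and the boosting-based learner is not shown.

```python
from collections import Counter

def doc_term_counts(docs):
    """Bag-of-words view: one count vector per document, indexed by term.

    Hypothetical helper; tokenization is a plain lowercase whitespace split.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    doc_vectors = [[c[t] for t in vocab] for c in counts]
    return vocab, doc_vectors

def term_vectors(vocab, doc_vectors):
    """Dual view: transpose so each term is a vector indexed by document."""
    columns = zip(*doc_vectors)
    return {term: list(col) for term, col in zip(vocab, columns)}

# Toy corpus (an assumption, standing in for the document set).
docs = [
    "bank loan interest bank",
    "river bank water",
    "loan interest rate",
]

vocab, doc_vectors = doc_term_counts(docs)
terms = term_vectors(vocab, doc_vectors)

# "Set of words" variant: occurrence counts are reduced to presence/absence.
binary_terms = {t: [1 if n else 0 for n in v] for t, v in terms.items()}
```

In this sketch each entry of `terms` is what a term classifier would take as input: a vector with one component per document. `binary_terms` shows the "set of words" special case mentioned in the footnote, where the number of occurrences is ignored.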
doi:10.1145/1138379.1138380 dblp:journals/tslp/AvanciniLSZ06 fatcat:qdctcxjquzhevhuezidiaubfum