The Semi-automatic Construction of the Polish Cyc Lexicon

Aleksander Pohl
2010 Investigationes Linguisticae  
In this paper we discuss the problem of building the Polish lexicon for the Cyc ontology. As the ontology is very large and complex, we describe a semi-automatic translation of part of it, which might be useful for tasks lying on the border between the fields of the Semantic Web and Natural Language Processing. We concentrate on precise identification of lexemes, which is crucial for tasks such as natural language generation in massively inflected languages like Polish, and we also concentrate on ... word entries, since in Cyc 9 out of every 10 concepts are mapped to expressions containing more than one word.

Motivation

Our primary concern is to build algorithms and tools which bridge the gap between the Polish language and the Semantic Web, thus bringing the benefits of the technology to the Polish-speaking community. Even though the fields of the Semantic Web and Natural Language Processing have much in common, there are certain problems that have to be solved before the data available in the Semantic Web and the data made available by NLP techniques are fully translatable. This stems from the fact that the reference resources for the Semantic Web are ontologies, while the Princeton WordNet and its incarnations for languages other than English serve as the de facto standard for NLP. Mappings between the concepts of ontologies and WordNets do exist (e.g. there is a mapping between Cyc and Princeton WordNet 2.0), but these mappings have certain limitations, stemming from the fact that the logical structures of ontologies and WordNets are different.

The most problematic difference, in our opinion, is the huge discrepancy between the number and semantics of the types [2] of relations employed in both types of resources. In ontologies, the number of relations is not restricted a priori: it is limited only by the complexity of the domain of the ontology and by the desired level of detail. For instance, the old version of Dublin Core [3] defined 15 relations [4], while the latest defines approx. 50; the Music Ontology [5] defines approx. 120 relations, DBpedia [6] approx. 1200 and Cyc approx. 17000 relations [7]. On the other hand, most WordNets are created in accordance with the original Princeton WordNet idea of refraining from using cross-part-of-speech relations. What is more, the set of relations was primarily limited to those that were well accepted by the linguistic research community. Even though there are exceptions to these rules (e.g.
there are cross-part-of-speech relations in the Polish WordNet), and there are plans

[2] From here on, by relation we mean both a type of relation and an instance of a relation. We hope this imprecision will not introduce ambiguities, since in most cases the types of relations are discussed.
[3] http://dublincore.org/documents/dcmi-terms/
[4] In RDF/OWL-oriented ontologies the relations are always binary and are called properties.
[5]
doi:10.14746/il.2010.21.2 fatcat:tsnka3qxx5b6dhlsok77i5ouqy