WORD SENSE DISAMBIGUATION WITHIN A MULTILINGUAL FRAMEWORK [thesis]

Mona Talat Diab
2003
Word Sense Disambiguation (WSD) is the process of resolving the meaning of a word unambiguously in a given natural language context. Within the scope of this thesis, it is the process of marking text with explicit sense labels. What constitutes a sense is a subject of great debate. An appealing perspective, aims to define senses in terms of their multilingual correspondences, an idea explored by several researchers, Dyvik (1998), Ide (1999), Resnik & Yarowsky (1999), and Chugur, Gonzalo &
more » ... o (2002) but to date it has not been given any practical demonstration. This thesis is an empirical validation of these ideas of characterizing word meaning using cross-linguistic correspondences. The idea is that word meaning or word sense is quantifiable as much as it is uniquely translated in some language or set of languages. Consequently, we address the problem of WSD from a multilingual perspective; we expand the notion of context to encompass multilingual evidence. We devise a new approach to resolve word sense ambiguity in natural language, using a source of information that was never exploited on a large scale for WSD before. The core of the work presented builds on exploiting word correspondences across languages for sense distinction. In essence, it is a practical and functional implementation of a basic idea common to research interest in defining word meanings in cross-linguistic terms. We devise an algorithm, SALAAM for Sense Assignment Leveraging Alignment And Multilinguality, that empirically investigates the feasibility and the validity of utilizing translations for WSD. SALAAM is an unsupervised approach for word sense tagging of large amounts of text given a parallel corpus — texts in translation — and a sense inventory for one of the languages in the corpus. Using SALAAM, we obtain large amounts of sense annotated data in both languages of the parallel corpus, simultaneously. The quality of the tagging is rigorously evaluated for both languages of the corpora. The automatic unsupervised tagged d [...]
doi:10.13016/ny9e-yqwo fatcat:4flcqprayzb5bo3rky257yk4ve