SEMANTIC SIMILARITY MEASUREMENT FOR MALAY WORDS USING WORDNET BAHASA AND WIKIPEDIA BAHASA MELAYU: ISSUES AND PROPOSED SOLUTIONS

Tuan Norhafizah Tuan Zakaria, Mohd Juzaiddin Ab Aziz, Mohd Rosmadi Mokhtar (Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Selangor, Malaysia); Saadiyah Darus (Faculty of Social Sciences and Humanities, Universiti Kebangsaan Malaysia, 43600 Selangor, Malaysia)
2020 International Journal of Software Engineering and Computer Systems  
Measuring semantic similarity between words is an important and widely practiced task in natural language processing. Knowledge-based lexical resources such as WordNet and Wikipedia are useful for this task. WordNet Bahasa (WB) and Wikipedia Bahasa Melayu (WikiBM) are examples of such lexical resources for the Malay language. However, these resources are still under development and offer limited semantic information. This paper discusses some issues regarding semantic similarity for the Malay language, proposes a framework using WB and WikiBM, and evaluates the performance of both. An experiment was conducted using 150 translated Malay words (75 word pairs). The results showed that WB and WikiBM can be adapted to techniques from the literature. For WB, we tested its coverage at three word levels (stem, root and mix level) to find the word level most applicable to our dataset. The test indicated that the mix level (86.7%) outperformed the stem level (78.7%) and the root level (68.0%). For WikiBM, we evaluated the coverage of three main features of its articles (gloss definitions, hyperlinks and categories), which are important in some previous techniques. The results revealed that gloss definitions gave full coverage (100%) of our 75 word pairs, compared with hyperlinks and categories (88.0%).
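As a rough illustration of how a coverage figure of this kind could be computed, the sketch below counts a word pair as covered only when both of its words are found in a resource. This per-pair definition is an assumption, since the abstract does not say how coverage was counted. The function name `pair_coverage` and the toy lexicon and word pairs are hypothetical and are not taken from the paper's dataset.

```python
def pair_coverage(word_pairs, lexicon):
    """Return the percentage of word pairs whose two words both appear in lexicon.

    Assumption: a pair counts as covered only if BOTH words are present.
    """
    if not word_pairs:
        return 0.0
    covered = sum(1 for a, b in word_pairs if a in lexicon and b in lexicon)
    return 100.0 * covered / len(word_pairs)


# Toy data invented for illustration; not the paper's 75 word pairs.
toy_lexicon = {"kereta", "automobil", "pantai", "laut"}
toy_pairs = [
    ("kereta", "automobil"),  # both words present -> covered
    ("pantai", "laut"),       # both words present -> covered
    ("kereta", "hutan"),      # "hutan" missing -> not covered
]

coverage = pair_coverage(toy_pairs, toy_lexicon)
print(f"Coverage: {coverage:.1f}%")
```

The same function could be applied once per word level (stem, root, mix) or once per WikiBM feature by swapping in the set of entries available for that level or feature.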
doi:10.15282/ijsecs.6.1.2020.4.0067 fatcat:74rnf2kvonce5onhwseqnuhhm4