Corpus-Based Arabic Stemming Using N-Grams [chapter]

Abdelaziz Zitouni, Asma Damankesh, Foroogh Barakati, Maha Atari, Mohamed Watfa, Farhad Oroumchian
2010 Lecture Notes in Computer Science  
In highly inflected languages such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose an adaptation of the corpus-based stemming approach that Xu and Croft proposed for English and Spanish so that it can stem Arabic words. In the first stage, we generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
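As a rough illustration of the first stage described in the abstract, the sketch below computes a Dice distance over character 3-gram sets and groups words greedily. It is not the authors' implementation: the function names, the greedy grouping strategy, and the 0.5 distance threshold are all assumptions made for this example, and the second-stage EM refinement is not shown.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (no padding; an assumption)."""
    if len(word) < n:
        # Words shorter than n are kept as a single unit.
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def dice_distance(a, b, n=3):
    """Dice distance, defined here as 1 minus the Dice coefficient."""
    return 1.0 - dice_similarity(a, b, n)


def greedy_clusters(words, max_distance=0.5):
    """Hypothetical greedy grouping: a word joins the first cluster whose
    first member is within max_distance, otherwise it starts a new cluster."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if dice_distance(word, cluster[0]) <= max_distance:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    sample = ["كتاب", "كتابة", "مدرسة"]
    for cluster in greedy_clusters(sample):
        print(cluster)
```

In this sketch the two forms built on the root k-t-b share two of their trigrams and land in the same candidate conflation class, while the unrelated word starts its own class. A real system would tune the threshold and the clustering method against retrieval results.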
doi:10.1007/978-3-642-17187-1_27 fatcat:kpovw73hcjhltnnxstc4bwxwky