Topic Identification Of Arabic Texts Based On Statistical Techniques

L. Fodil et al.
2015 Zenodo  
One of the main factors that characterize a text is its content. Nowadays, the number of documents published online by the private and public sectors is in the order of millions. This rapid growth necessitates the use of automatic text classification. While a lot of effort has been put into many languages, minimal experimentation has been done with Arabic. Arabic is a highly inflectional and derivational language, which makes text mining a challenging task. In this paper, we propose two statistical approaches for topic identification. In the first approach, we have developed two techniques, ACM (Automatic Classification Method) and SACM (Semi-Automatic Classification Method), for keyword extraction. In the second approach, we have used Centroid Classifier Models to classify the text documents by employing several distances (Euclidean, Manhattan, Chebyshev, etc.). The evaluation tests are conducted on an Arabic textual corpus containing 5 different topics: Economics, Politics, Sport, Medicine and Religion. Results show the efficiency of the proposed approaches on topic identification.
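To make the second approach concrete, the following is a minimal sketch of a generic centroid classifier using the three distances named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the relative term-frequency weighting, the function names (`term_frequencies`, `train_centroids`, `classify`), and the English placeholder tokens are all hypothetical, and the abstract does not specify the feature representation or preprocessing the paper uses.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Map each token to its relative frequency (assumed weighting scheme)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def train_centroids(labeled_docs):
    """Average the term-frequency vectors of each topic's training documents."""
    centroids = {}
    for topic, docs in labeled_docs.items():
        summed = {}
        for tokens in docs:
            for term, weight in term_frequencies(tokens).items():
                summed[term] = summed.get(term, 0.0) + weight
        centroids[topic] = {t: s / len(docs) for t, s in summed.items()}
    return centroids


def _differences(a, b):
    """Absolute per-term differences over the union of both vocabularies."""
    return [abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b)]


def euclidean(a, b):
    return math.sqrt(sum(d * d for d in _differences(a, b)))


def manhattan(a, b):
    return sum(_differences(a, b))


def chebyshev(a, b):
    return max(_differences(a, b), default=0.0)


def classify(tokens, centroids, distance):
    """Return the topic whose centroid is nearest to the document vector."""
    vector = term_frequencies(tokens)
    return min(centroids, key=lambda topic: distance(vector, centroids[topic]))


# Toy training data with placeholder tokens (not the paper's Arabic corpus).
training = {
    "Economics": [["bank", "market", "price"], ["market", "trade", "bank"]],
    "Sport": [["match", "team", "goal"], ["team", "player", "match"]],
}
centroids = train_centroids(training)

for distance in (euclidean, manhattan, chebyshev):
    label = classify(["market", "bank", "price"], centroids, distance)
    print(distance.__name__, label)
```

On real Arabic text, the tokens would typically come from a normalization and stemming step, which this sketch leaves out because the abstract does not describe one.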
doi:10.5281/zenodo.45766 fatcat:wnkyi6jdmnenbo3v5xmvq5a5ma