A Silver Standard Arabic Corpus for Segmentation and Validation

Hussein Awdeh, Adelle Abdallah, Gilles Bernard, Mohammad Hajjar, Mazen El-Sayed
2019 International Conference on Big Data and Cyber-Security Intelligence  
Arabic natural language processing applications suffer from a shortage of both Arabic corpora and gold-standard corpora. A corpus, defined as a collection of written or spoken texts stored on a computer, is written either in a single language (a monolingual corpus) or in several languages (a multilingual corpus). Corpora are considered among the most important sources for semantic and syntactic analysis in natural language processing. Our study aims to build a new silver-standard Arabic corpus collected from a set of morphologically analyzed newspaper articles. It contains 18,167,183 words in total across six categories: Religion, Economy, Culture, Sports, Local News, and International News. It is encoded in UTF-8 and stored as XML. This silver corpus can be used as an accurate reference for validation and learning in syntactic analysis, mainly for word segmentation and part-of-speech tagging.
dblp:conf/bdcsintell/AwdehABHE19 fatcat:iwn3eicq5rgbnalddpnoioxa7u