Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments

Imad Zeroual, Abdelhak Lakhouaja
2019, Zenodo
The term corpus comes from Latin and means "body". According to corpus linguists, a corpus can be defined as a collection of machine-readable authentic texts, including transcripts of spoken data. The work of corpus builders is essentially divided into three areas: corpus compilation, data processing, and corpus annotation. Each of these tasks requires specialists, takes time, and costs money. A further task is to infer information from corpora to provide empirical evidence for ... c theories or to turn the data into products or services. Corpora are essential resources for computational linguistics and Natural Language Processing (NLP). Specifically, corpora include empirical data that enable linguists and grammarians to form objective rather than subjective statements. Further, many NLP applications are moving from rule-based systems and knowledge-based methods to data-driven approaches. The prime motivation for the research in this thesis is the limited research on Arabic corpus linguistics and the lack of available resources, standards, and efficient tools that can keep up with the prospects of Arabic NLP. Furthermore, most Arabic corpus builders have proposed corpora and tools that suit their own objectives without considering standardization and international aspects. Therefore, another purpose of this thesis is to provide an overview of the central criteria and methodology of building corpora and to give a better understanding of Arabic corpus linguistics. To widen the scope of this thesis, it was necessary to carry out several tasks. We conducted a survey covering 100 well-known and influential corpora to learn how relevant corpora have been built, what the process requires, and how long it takes to complete. The survey summarises the data sources and compilation methods used in relation to corpus characteristics such as size and the time consumed during the compilation process. Basically, there is a lack of appropriate t [...]
doi:10.5281/zenodo.4441159 (https://doi.org/10.5281/zenodo.4441159)
fatcat:nwix7lrzrbaxpgasing7mgdtwq (https://fatcat.wiki/release/nwix7lrzrbaxpgasing7mgdtwq)