A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit <a rel="external noopener" href="https://econtents.bc.unicamp.br/inpec/index.php/joss/article/download/15038/10093">the original URL</a>. The file type is <code>application/pdf</code>.
<i title="Universidade Estadual de Campinas">
<a target="_blank" rel="noopener" href="https://fatcat.wiki/container/47ejz7smfzaa3lnzkxmlbmu5mu" style="color: black;">Journal of Speech Sciences</a>
humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur'an<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.20396/joss.v2i2.15038">doi:10.20396/joss.v2i2.15038</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/nivlypokcvgo5km3hws4yl35uu">fatcat:nivlypokcvgo5km3hws4yl35uu</a> </span>
... which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive and signifies a widely used recitation style, one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur'an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in recognition of minority class instances with both taggers, as measured by the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
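To illustrate why a high majority-class baseline motivates a second metric, here is a minimal Python sketch. It assumes the usual definition of the Balanced Classification Rate as the mean of sensitivity and specificity; the abstract does not state the authors' exact formulation. The function names and the toy labels are hypothetical and are not taken from the dataset.

```python
def accuracy(gold, pred):
    """Fraction of words whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def balanced_classification_rate(gold, pred, positive="break"):
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    pairs = list(zip(gold, pred))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    true_neg = sum(g != positive and p != positive for g, p in pairs)
    n_pos = sum(g == positive for g in gold)
    n_neg = len(gold) - n_pos
    sensitivity = true_pos / n_pos if n_pos else 0.0
    specificity = true_neg / n_neg if n_neg else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical toy sequence: 17 non-breaks and 3 breaks.
gold = ["non-break"] * 17 + ["break"] * 3

# A majority-class "tagger" labels every word as a non-break.
majority = ["non-break"] * len(gold)

baseline_accuracy = accuracy(gold, majority)
baseline_bcr = balanced_classification_rate(gold, majority)
print(baseline_accuracy, baseline_bcr)
```

On this toy sequence the majority-class predictor scores high on accuracy while its BCR sits at chance level, because it never recognizes a break. That gap is the reason a metric that weights both classes equally is informative for this task.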
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20210330142253/https://econtents.bc.unicamp.br/inpec/index.php/joss/article/download/15038/10093" title="fulltext PDF download">Web Archive [PDF]</a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.20396/joss.v2i2.15038">Publisher / doi.org</a>