Automatic Extraction of Linguistic Data from Digitized Documents

Terrence Szymanski
<span title="2013-12-16">2013</span> <i title="Linguistic Society of America"> <a target="_blank" rel="noopener" href="" style="color: black;">Proceedings of the annual meeting of the Berkeley Linguistics Society</a> </i> &nbsp;
In lieu of an abstract, here is a brief excerpt: This paper presents a system for automatically extracting linguistic data from digitized linguistic documents using a combination of existing software packages and custom scripts. The system is designed to leverage existing resources in online digital libraries in order to bootstrap the creation of large, multi-lingual linguistic corpora, which can then be used to conduct data-driven experimental research into cross-linguistic or universal
... ic phenomena. The system identifies instances of foreign-language text accompanied by reference-language translations within the text of printed books that have been scanned into digital format, and extracts these to produce a parallel corpus of example sentences. While the system achieves a high precision on predicting foreign text, its accuracy overall is low, and directions for improvement and future work are identified.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="">doi:10.3765/bls.v39i1.3886</a> <a target="_blank" rel="external noopener" href="">fatcat:u6i5xs3a7jgcdfdic3zbhcu5gq</a> </span>