Automatic Extraction of Linguistic Data from Digitized Documents

Terrence Szymanski
<span title="2013-12-16">2013</span> <i title="Linguistic Society of America"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/dfd2cmq4hfepdp5qrfmibg5qx4" style="color: black;">Proceedings of the annual meeting of the Berkeley Linguistics Society</a> </i>
In lieu of an abstract, here is a brief excerpt: This paper presents a system for automatically extracting linguistic data from digitized linguistic documents using a combination of existing software packages and custom scripts. The system is designed to leverage existing resources in online digital libraries in order to bootstrap the creation of large, multi-lingual linguistic corpora, which can then be used to conduct data-driven experimental research into cross-linguistic or universal ... ic phenomena. The system identifies instances of foreign-language text accompanied by reference-language translations within the text of printed books that have been scanned into digital format, and extracts these to produce a parallel corpus of example sentences. While the system achieves a high precision on predicting foreign text, its accuracy overall is low, and directions for improvement and future work are identified.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.3765/bls.v39i1.3886">doi:10.3765/bls.v39i1.3886</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/u6i5xs3a7jgcdfdic3zbhcu5gq">fatcat:u6i5xs3a7jgcdfdic3zbhcu5gq</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20180721072306/https://journals.linguisticsociety.org/proceedings/index.php/BLS/article/download/3886/3582" title="fulltext PDF download">Web Archive [PDF]</a>