Automatic Extraction of Linguistic Data from Digitized Documents

Terrence Szymanski
Proceedings of the annual meeting of the Berkeley Linguistics Society (Linguistic Society of America), 2013-12-16
This paper presents a system for automatically extracting linguistic data from digitized linguistic documents using a combination of existing software packages and custom scripts. The system is designed to leverage existing resources in online digital libraries in order to bootstrap the creation of large, multilingual linguistic corpora, which can then be used to conduct data-driven experimental research into cross-linguistic or universal linguistic phenomena. The system identifies instances of foreign-language text accompanied by reference-language translations within the text of printed books that have been scanned into digital format, and extracts these to produce a parallel corpus of example sentences. While the system achieves high precision in identifying foreign text, its overall accuracy is low; directions for improvement and future work are identified.
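
To make the extraction step concrete, the following is a minimal sketch of one way to pair a foreign-language line with the reference-language translation that follows it. It is not the method described in the paper: the word list `ENGLISH_WORDS`, the helper names, the thresholds `low` and `high`, and the sample page are all assumptions invented for illustration, and a real system would operate on OCR output with a proper language identifier.

```python
import re

# Hypothetical toy stand-in for a reference-language (English) lexicon.
ENGLISH_WORDS = {
    "the", "a", "an", "is", "are", "was", "i", "you", "he", "she", "it",
    "we", "they", "to", "of", "in", "and", "not", "saw", "dog", "this", "that",
}

def tokenize(line):
    """Return the lowercase alphabetic tokens of a line (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", line.lower())

def english_ratio(tokens):
    """Fraction of tokens that appear in the toy English lexicon."""
    return sum(t in ENGLISH_WORDS for t in tokens) / len(tokens)

def extract_pairs(lines, low=0.2, high=0.6):
    """Pair each mostly non-English line with an immediately following
    mostly English line, treating the second as its translation."""
    pairs = []
    i = 0
    while i < len(lines) - 1:
        src_tokens = tokenize(lines[i])
        tgt_tokens = tokenize(lines[i + 1])
        if (src_tokens and tgt_tokens
                and english_ratio(src_tokens) <= low
                and english_ratio(tgt_tokens) >= high):
            # Strip surrounding whitespace and quotation marks from the gloss.
            translation = lines[i + 1].strip().strip("'\"\u2018\u2019\u201c\u201d")
            pairs.append((lines[i].strip(), translation))
            i += 2  # skip past the translation line
        else:
            i += 1
    return pairs

# Invented sample page, for illustration only.
page = [
    "The following example is in the language.",
    "Ni-li-mw-ona mbwa",
    "'I saw the dog'",
    "This is the end of the example.",
]
print(extract_pairs(page))
```

A heuristic of this kind only proposes a pair when both lines clear their thresholds, so it tends to favor precision and can miss many genuine examples, such as translations printed on the same line or separated by an interlinear gloss.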
doi:10.3765/bls.v39i1.3886 fatcat:u6i5xs3a7jgcdfdic3zbhcu5gq
