The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening

Timo Baumann, Arne Köhn, Felix Hennig
2018, Language Resources and Evaluation (Springer Nature)
Spoken corpora are important for speech research, but are expensive to create and do not necessarily reflect (read or spontaneous) speech 'in the wild'. We report on our conversion of the preexisting and freely available Spoken Wikipedia into a speech resource. The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. There are initiatives to create and sustain Spoken Wikipedia versions in many languages, and hence the available data grows over time. Thousands of spoken articles are available to users who prefer a spoken over the written version. We turn these semi-structured collections into structured and time-aligned corpora, keeping the exact correspondence with the original hypertext as well as all available metadata. Thus, we make the Spoken Wikipedia accessible for sustainable research.

We present our open-source software pipeline that downloads, extracts, normalizes and text-speech aligns the Spoken Wikipedia. Additional language versions can be exploited by adapting configuration files or, where language peculiarities require it, by extending the software. We also present and analyze the resulting corpora for German, English, and Dutch, which presently total 1005 h and grow at an estimated 87 h per year. The corpora, together with our software, are available via http://islrn.org/resources/684-927-624-257-3/.

As a prototype usage of the time-aligned corpus, we describe an experiment about the preferred modalities for interacting with information-rich read-out hypertext. We find that alignments help improve user experience and factual information access by enabling targeted interaction.

Keywords: Wikipedia · speech corpus · found data · annotation · robust text-speech alignment · spoken hypertext · eyes-free speech access

This article extends and consolidates previous research (Köhn et al., 2016; Rohde and Baumann, 2016). This work was partly supported by a PostDoc grant from the Daimler and Benz Foundation to the first author. This is a pre-print of an article published in Language Resources and Evaluation. The final authenticated version is available online at: https://doi.
doi:10.1007/s10579-017-9410-y (https://doi.org/10.1007/s10579-017-9410-y), fatcat:2u4wfkcqknfdxcwwx3wc764tu4