A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit <a rel="external noopener" href="https://edoc.sub.uni-hamburg.de/informatik/volltexte/2018/234/pdf/bauman_koehn_henning_the_spoken_wikipedia_corpus_collection.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<i title="Springer Nature">
<a target="_blank" rel="noopener" href="https://fatcat.wiki/container/qiptgj2ubngu3hrrsrkbdvpchi" style="color: black;">Language Resources and Evaluation</a>
</i>
Spoken corpora are important for speech research, but are expensive to create and do not necessarily reflect (read or spontaneous) speech 'in the wild'. We report on our conversion of the preexisting and freely available Spoken Wikipedia into a speech resource. The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. There are initiatives to create and sustain Spoken Wikipedia versions in many languages, and hence the available data grows over time. Thousands of spoken articles are available to users who prefer a spoken over the written version. We turn these semi-structured collections into structured and time-aligned corpora, keeping the exact correspondence with the original hypertext as well as all available metadata. Thus, we make the Spoken Wikipedia accessible for sustainable research. We present our open-source software pipeline that downloads, extracts, normalizes and text-speech aligns the Spoken Wikipedia. Additional language versions can be exploited by adapting configuration files or, where language peculiarities require it, by extending the software. We also present and analyze the resulting corpora for German, English, and Dutch, which presently total 1005 h and grow at an estimated 87 h per year. The corpora, together with our software, are available via http://islrn.org/resources/684-927-624-257-3/. As a prototype usage of the time-aligned corpus, we describe an experiment about the preferred modalities for interacting with information-rich read-out hypertext. We find that alignments help improve user experience and factual information access by enabling targeted interaction.
Keywords: Wikipedia · speech corpus · found data · annotation · robust text-speech alignment · spoken hypertext · eyes-free speech access
This article extends and consolidates previous research (Köhn et al., 2016; Rohde and Baumann, 2016). This work was partly supported by a PostDoc grant by the Daimler-and-Benz-foundation to the first author. This is a pre-print of an article published in Language Resources and Evaluation. The final authenticated version is available online at: https://doi.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1007/s10579-017-9410-y">doi:10.1007/s10579-017-9410-y</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/2u4wfkcqknfdxcwwx3wc764tu4">fatcat:2u4wfkcqknfdxcwwx3wc764tu4</a> </span>