MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible [article]

Marcely Zanon Boito, William N. Havard, Mahault Garnerin, Éric Le Ferrand, Laurent Besacier
2020 arXiv   pre-print
We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances).  ...  The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs).  ...  We also scored a simple baseline that uses utterance length to retrieve spoken verses (in other words, it uses only distance between spoken utterances' lengths to solve the retrieval task).  ... 
arXiv:1907.12895v3 fatcat:giqdjrajgngvxabrbr7dhbzik4

Arja Nurmi, Tanja Rütten, and Päivi Pahta (eds): CHALLENGING THE MYTH OF MONOLINGUAL CORPORA

Rachelle Vessey
2018 Applied Linguistics  
In other words, do corpus linguists focusing on Spanish, Arabic, or Chinese languages also contend that their data sets are (or should be) monolingual?  ...  Within this range of corpus types, there is a largely clear distinction between spoken corpora, on the one hand, and written corpora, on the other.  ... 
doi:10.1093/applin/amy019 fatcat:zbp2zvl3tjce5oedjcjgdb2sji

A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification

Aamer Zahoor, Nasir Ahmad
2019 International Journal of Computer Applications  
This work presents the development of a large multilingual speech corpus of Pashto, Urdu and English.  ...  The corpus comprises three categories of phonetically rich spoken data in each language, that is, short questions regarding speaker's personal information, read speech and spontaneous speech from the  ...  This paper presents the development of a standard and phonetically rich large multilingual spoken corpus for Pashto, Urdu and English to serve as baseline corpus for automatic spoken language identification  ... 
doi:10.5120/ijca2019919522 fatcat:r75q2nu5yjfplja64jpg7hpira

How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages [article]

Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier
2019 arXiv   pre-print
We translate the bilingual Mboshi-French parallel corpus (Godard et al. 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery.  ...  Table 1 shows some statistics for the produced Multilingual Mboshi parallel corpus.  ...  Bilingual Unsupervised Word Segmentation/Discovery Approach: We use the bilingual neural-based Unsupervised Word Segmentation  ...  Lastly, we extend the bilingual Mboshi-French parallel corpus, creating a multilingual corpus for the endangered language Mboshi that we make available to the community.  ... 
arXiv:1910.05154v1 fatcat:sdvmzcun5nejvkwxaz5daqjlw4

Learner Corpora : Their Potentials for the Language Learning Classroom in Indonesian Primary School Contexts

Laily Zen Evynurul, Universitas Negeri Malang, Indonesia, Kadarisman Effendi, Apriana Aulia, Putri Yaniafari Rahmati
2019 The Journal of AsiaTEFL  
Her research projects mainly include topics on multilingualism and multilingual education.  ...  verb meanings using corpus data to examine the regularities of words and word relations.  ...  LINDSEI (Louvain International Database of Spoken English Inter-language): a corpus of spoken English produced by advanced learners of English from several mother tongue backgrounds; and e.  ... 
doi:10.18823/asiatefl.2019.16.2.20.718 fatcat:a5or4h64o5bxbbe4qyboe7x5wa

Multilingual Spoken Language Corpus Development for Communication Research [chapter]

Toshiyuki Takezawa
2006 Lecture Notes in Computer Science  
In this study, we describe an experience with multilingual spoken language corpus development at our research institution, focusing in particular on speech recognition and natural language processing for  ...  Multilingual spoken language corpora are indispensable for research on areas of spoken language communication, such as speech-to-speech translation.  ...  Acknowledgments The work reported here was mainly conducted at ATR Spoken Language Communication Research Laboratories. The authors are grateful to Prof. Seiichi Yamamoto, Dr.  ... 
doi:10.1007/11939993_78 fatcat:h7ujqavdmnh4hpjnqajc6c7usm

Crossroads Corpus creation: Design and case study

Abbie Hantgan-Sonko
2017 Yearbook of the Poznan Linguistic Meeting  
The newly compiled corpus contains approximately 183,000 annotations of multilingual, spoken data, gathered by eight researchers over a ten year span using methods ranging from structured lexical elicitation  ...  A potential path for convergence or divergence that emerged during data collection and in building and searching the corpus is the crossroads in the phonetic production of word-initial velar plosives.  ...  corpus of spoken multilingual data.  ... 
doi:10.1515/yplm-2017-0009 fatcat:s7huoabd4vaolaravcj3xrnvv4

Toward Multilingual Identification of Online Registers

Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, Sampo Pyysalo
2019 Nordic Conference of Computational Linguistics  
Using CORE and Fin-CORE data, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings.  ...  We introduce the Finnish Corpus of Online REgisters (FinCORE), the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online.  ...  Schwenk and Li (2018) compared their performance in genre classification of a multilingual Reuters corpus, using word embeddings generated by Ammar et al. (2016) and combined to document representations  ... 
dblp:conf/nodalida/LaippalaKEBP19 fatcat:ecbill2hnjaqjjp6pwsazt22fi

Eliciting comparable spoken data in minor languages: first observations from the corpus Kontatti

Marta Ghilardi
2019 Suvremena Lingvistika  
In this contribution, we will deal with the issue of building a spoken corpus of conversational data that can be easily compared across languages.  ...  We will present linguistic codes embedded in Trentino and South Tyrol, where multilingualism (de jure) is the rule.  ...  multilingual corpus, where the multilingual dimension lies not in the translation of texts or words, but in the recording of plurilingual speakers, in order to cast some light in the functioning of both  ... 
doi:10.22210/suvlin.2019.088.07 fatcat:dng2hd3tqnebdlxocfrturrfbe

TEP: Tehran English-Persian Parallel Corpus [chapter]

Mohammad Taher Pilevar, Heshaam Faili, Abdol Hamid Pilevar
2011 Lecture Notes in Computer Science  
To the best of our knowledge, TEP has been the first freely released large-scale (in order of million words) English-Persian parallel corpus.  ...  In spite of their importance in many multi-lingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labors required  ...  Conclusion In this paper we described the development of TEP corpus and also mentioned some of the problems faced in parallel corpus construction from movie subtitles together with possible solutions to  ... 
doi:10.1007/978-3-642-19437-5_6 fatcat:3bhjoq3vgrc6xep57qkpbxsfpy

Reducing latency for language identification based on large-vocabulary continuous speech recognition

Takuma Okamoto, Atsuo Hiroe, Hisashi Kawai
2017 Acoustical Science and Technology  
Introduction Spoken language identification (LID) [1] has been gaining attention as an important technique for multilingual spoken language communications.  ...  In multilingual spoken language applications such as bilingual speech translation systems, e.g., VoiceTra, or multilingual spoken dialogue systems, the number of languages is two or less than 10, respectively  ... 
doi:10.1250/ast.38.38 fatcat:qha42lre7zgkzlzutph4mb5p34

The Spoken Language Translator

Manny Rayner, David Carter, Pierrette Bouillon, Vassilis Digalakis, Mats Wirén
2003 Computational Linguistics  
The speech recognition system was developed for multilingual speech and is capable of decoding a word string in any of a given set of languages.  ...  The Spoken Language Translator consists of 21 chapters contributed by different authors who worked on the building of components and/or the overall system for the Spoken Language Translator (SLT), an early  ... 
doi:10.1162/089120103321337485 fatcat:uw44bari5ra75jftck4g3mfu2q

Multilingual Speech Recognition [chapter]

E. Nöth, S. Harbeck, H. Niemann
1999 Computational Models of Speech Pattern Processing  
Only allowing transitions between the words from one language, each hypothesized word chain contains words from just one language and language identification is an implicit by-product of the speech recognizer  ...  In the second approach, the trained recognizers of the languages to be recognized, the lexicons, and the language models are combined to one multilingual recognizer.  ...  Ideally the recognized words are the same as what was actually spoken.  ... 
doi:10.1007/978-3-642-60087-6_31 fatcat:6veahlx55rewbh4jbjpf7prs6e

Overview of Spoken Language Communication Technologies

Hideki Kashioka
2012 Journal of NICT  
Toward this goal, we will intensively develop ICT for a human-machine interface, such as multilingual speech recognition, multilingual speech synthesis, and spoken dialogue technology.  ...  NICT, is to realize multi language communication technologies with spoken language regardless of who, where, when, how and in which language users speak.  ...  To realize such communications, multilingual spoken language communications need to be studied for people's smooth interactions in any languages, at any time and in any place with any expressions.  ... 
doi:10.24812/nictjournal.59.3.4_011 fatcat:dtai4rikozcrpbnbgqh7fcxadi

Towards Innovative Evaluation Methodologies for Speech Translation

Michael Paul, Hiromi Nakaiwa, Marcello Federico
2004 NTCIR Conference on Evaluation of Information Access Technologies  
reported here was supported in part by a contract with the National Institute of Information and Communications Technology entitled, "A study of speech dialogue translation technology based on a large corpus  ...  Multilingual Spoken Language Corpus The multilingual spoken language corpus, jointly developed by the C-STAR partners, is a collection of sentences that bilingual travel experts consider useful for people  ...  The Evaluation Campaign 2004 will be carried out using parts of the multilingual BTEC corpus.  ... 
dblp:conf/ntcir/PaulNF04 fatcat:l3ymbr3qzfez7cq2f6bsjrwana