LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu
Interspeech 2019
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use.  ...  The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work.  ...  We would like to express our gratitude to Guoguo Chen, Sanjeev Khudanpur, Vassil Panayotov, and Daniel Povey for releasing the LibriSpeech corpus, and to the thousands of Project Gutenberg and LibriVox  ... 
doi:10.21437/interspeech.2019-2441 dblp:conf/interspeech/ZenDCZWJCW19 arXiv:1904.02882v1 fatcat:visxkvg4pzcuzc4nlcl7yhv26m

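For readers who want to load the corpus directly, torchaudio ships a built-in LibriTTS loader. A minimal sketch follows; the choice of the train-clean-100 subset and the download directory are our assumptions, any official split works the same way:

```python
# Minimal sketch: loading LibriTTS through torchaudio's built-in dataset class.
# "train-clean-100" is one of the official subsets; "./data" is a placeholder path.
import torchaudio

dataset = torchaudio.datasets.LIBRITTS(
    root="./data",
    url="train-clean-100",
    download=True,
)

# Each item pairs 24 kHz audio with both the original and the normalized text,
# the two transcript forms the corpus provides for TTS work.
(waveform, sample_rate, original_text, normalized_text,
 speaker_id, chapter_id, utterance_id) = dataset[0]
print(sample_rate, normalized_text)
```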

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis [article]

Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian
arXiv preprint, 2021
This paper introduces RyanSpeech, a new speech corpus for research on automated text-to-speech (TTS) systems.  ...  In order to meet the need for a high-quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech, which contains textual materials from  ...  LibriSpeech [3] and LibriTTS [2] derive their text source from the LibriVox project, which is based on audiobooks [9].  ... 
arXiv:2106.08468v1 fatcat:oid4lwzdjbe4hbzwcodk464iky

Simple and Effective Unsupervised Speech Synthesis [article]

Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass
arXiv preprint, 2022
Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus.  ...  Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.  ...  Acknowledgements: We thank Tomoki Hayashi and Erica Cooper for their advice on TTS training and evaluation. This research was supported in part by the MIT-IBM Watson AI Lab.  ... 
arXiv:2204.02524v3 fatcat:l22ns5752vcmve5izkyyxw3qyi
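
The lexicon is the only resource bridging the two unlabeled modalities, so the text side reduces to a pronunciation lookup. A minimal sketch of that step; the toy lexicon and the out-of-vocabulary policy are illustrative assumptions, not the paper's actual resources:

```python
# Minimal sketch: mapping raw text to a phoneme sequence through a pronunciation
# lexicon. The two-entry lexicon below is a toy; real systems use e.g. the CMU dict.
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}

def text_to_phonemes(text: str) -> list[str]:
    phonemes = []
    for word in text.lower().split():
        # Out-of-vocabulary words are skipped here; a real pipeline would
        # back off to a trained grapheme-to-phoneme model instead.
        phonemes.extend(LEXICON.get(word, []))
    return phonemes

print(text_to_phonemes("speech synthesis"))
```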

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [article]

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
arXiv preprint, 2020
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech.  ...  Furthermore, initial experiments demonstrate that randomly sampling from the proposed model can be used as data augmentation to improve the ASR performance.  ...  Acknowledgements: The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.  ... 
arXiv:2002.03788v1 fatcat:e6qmsdw5m5eh3d4fzxwkmufbpq
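
To make the "quantized fine-grained" part concrete, here is a hedged sketch of a vector-quantization bottleneck: each per-frame latent is snapped to its nearest codebook entry, yielding the discrete sequence an autoregressive prosody prior can model. Codebook size and dimensions are placeholders, not the paper's configuration:

```python
# Sketch of a vector-quantization bottleneck over fine-grained latents.
# Sizes are illustrative only.
import torch

codebook = torch.randn(256, 64)      # 256 codes, 64-dim latent space
latents = torch.randn(100, 64)       # one latent vector per spectrogram frame

distances = torch.cdist(latents, codebook)   # pairwise Euclidean distances
indices = distances.argmin(dim=1)            # nearest code per frame
quantized = codebook[indices]                # discrete prosody representation

# An autoregressive prior would then model p(indices[t] | indices[<t])
# so that sampling produces coherent, diverse prosody trajectories.
print(indices[:10])
```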

Introducing the VoicePrivacy Initiative [article]

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco
arXiv preprint, 2020
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges.  ...  This is meant for subjective evaluation of speaker verifiability/linkability in a text-dependent manner.  ... 
arXiv:2005.01387v3 fatcat:f4fgcoxqg5ftxcdx4lymkegna4

Introducing the VoicePrivacy Initiative

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco
Interspeech 2020
The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges.  ...  For all downstream goals to be achieved, it should: (a) output a speech waveform, (b) hide speaker identity, (c) leave other speech characteristics unchanged, (d) ensure that all trial utterances from  ... 
doi:10.21437/interspeech.2020-1333 dblp:conf/interspeech/TomashenkoS00NY20 fatcat:65nqflofsnchzobt6h72mu7fla
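
As a toy illustration of requirements (a) and (b) only, here is a naive pitch-shift transform sketched with librosa. This is not one of the challenge baselines (those use x-vector and neural-vocoder pipelines), it plainly violates requirement (c), and the file names are placeholders:

```python
# Toy "anonymizer": outputs a waveform (a) and perturbs speaker identity (b),
# but distorts other speech characteristics, violating (c).
import librosa
import soundfile as sf

audio, sr = librosa.load("utterance.wav", sr=16000)   # placeholder input file
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)  # up 4 semitones
sf.write("utterance_anon.wav", shifted, sr)
```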

Injecting Text in Self-Supervised Speech Pretraining [article]

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno
arXiv preprint, 2021
In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text.  ...  The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed  ...  The TTS model for tts4pretrain was trained using the LibriTTS corpus described in Section 4.  ... 
arXiv:2108.12226v1 fatcat:mc55fw4pt5febcfyksuvm46hcq
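
A hedged sketch of the data-mixing idea behind tts4pretrain: untranscribed text is rendered into synthetic speech so that pretraining sees lexical content it would otherwise miss. The `tts_model`, `real_speech`, and `text_corpus` names are hypothetical placeholders, not the paper's code:

```python
# Sketch: augmenting a self-supervised pretraining pool with synthesized speech.
def build_pretraining_batches(real_speech, text_corpus, tts_model, batch_size=8):
    # Render untranscribed text into audio so the encoder sees both modalities.
    synthetic_speech = [tts_model.synthesize(line) for line in text_corpus]
    pool = real_speech + synthetic_speech
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]   # consumed by the contrastive objective
```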

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation [article]

Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen
arXiv preprint, 2022
CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems  ...  We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English.  ...  CVSS is directly derived from the CoVoST 2 ST corpus, which is further derived from the Common Voice speech corpus.  ... 
arXiv:2201.03713v2 fatcat:yko3iwaj5nh2ley76d5zbspuba
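
The derivation the abstract describes is essentially a map over CoVoST 2. A hedged sketch, with `covost2_examples` and `tts` as hypothetical placeholders for the corpus iterator and the TTS system:

```python
# Sketch: deriving an S2ST pair from an ST pair by synthesizing the target side.
def derive_cvss_pairs(covost2_examples, tts):
    for ex in covost2_examples:
        yield {
            "source_speech": ex["audio"],                         # Common Voice audio
            "target_speech": tts.synthesize(ex["translation"]),   # English TTS output
        }
```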

The VoicePrivacy 2020 Challenge Evaluation Plan [article]

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco
arXiv preprint, 2022
The VoicePrivacy Challenge aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges.  ...  The first one is a subset of the LibriSpeech dev-clean dataset. The second one (denoted as VCTK-dev) is obtained from the VCTK corpus.  ... 
arXiv:2205.07123v1 fatcat:wqplenaihrb45gwvgfdqgv2zny
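
One core objective metric in this evaluation plan is the equal error rate of an attacker's speaker-verification system. A self-contained sketch of its computation; the Gaussian toy scores are ours, purely for demonstration:

```python
# Equal error rate (EER) from genuine/impostor verification scores, in numpy.
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    fr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejects
    idx = int(np.argmin(np.abs(fa - fr)))   # threshold where the two rates cross
    return (fa[idx] + fr[idx]) / 2

rng = np.random.default_rng(0)
print(compute_eer(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)))  # ~0.16
```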

Hi-Fi Multi-Speaker English TTS Dataset [article]

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang
arXiv preprint, 2021
This paper introduces a new multi-speaker English dataset for training text-to-speech models. The dataset is based on LibriVox audiobooks and Project Gutenberg texts, both in the public domain.  ...  To select speech samples with high quality, we considered audio recordings with a signal bandwidth of at least 13 kHz and a signal-to-noise ratio (SNR) of at least 32 dB.  ...  LJSpeech is a single-speaker TTS dataset derived from LibriVox books. The corpus contains about 24 hours of speech sampled at 22.05 kHz.  ... 
arXiv:2104.01497v3 fatcat:ztgy67mrwbf3xkdt7tdouzk3pe
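
The selection rule the abstract states (bandwidth of at least 13 kHz, SNR of at least 32 dB) is easy to express as a filter. A hedged sketch: the 99.5% spectral-energy rolloff used for bandwidth and the injected `estimate_snr` callable are illustrative stand-ins, not the authors' estimators:

```python
# Sketch of the clip-selection criteria described in the paper.
import numpy as np

def estimate_bandwidth(audio, sr, rolloff=0.995):
    # Frequency below which 99.5% of the spectral energy lies.
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    cumulative = np.cumsum(spectrum) / np.sum(spectrum)
    return freqs[np.searchsorted(cumulative, rolloff)]

def keep_clip(audio, sr, estimate_snr):
    # estimate_snr is any SNR estimator returning decibels (e.g. WADA-SNR).
    return estimate_bandwidth(audio, sr) >= 13_000 and estimate_snr(audio) >= 32.0
```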

Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations [article]

Aarne Talman, Antti Suni, Hande Celikkanat, Sofoklis Kakouros, Jörg Tiedemann, Martti Vainio
arXiv preprint, 2019
In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text.  ...  Finally, we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text.  ...  We also gratefully acknowledge the support of the Academy of Finland through projects no. 314062 from the ICT 2023 call on Computation, Machine Learning and Artificial Intelligence, no. 1293348 from the  ... 
arXiv:1908.02262v1 fatcat:at63k5ee2ffchkcnbpe4j446gq
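
The paper's strongest baselines cast prominence prediction as token classification over contextualized representations. A hedged sketch of that setup with Hugging Face transformers; the bert-base-uncased checkpoint and a binary label space are simplifying assumptions:

```python
# Sketch: prosodic prominence as token classification over BERT features.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # non-prominent vs. prominent
)

inputs = tokenizer("and there was a sudden silence", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # one label distribution per subword
print(logits.argmax(dim=-1))            # untrained head: fine-tune on the dataset
```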

Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction [article]

Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos
arXiv preprint, 2021
We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings  ...  This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size  ...  The dataset is derived from LibriSpeech [21], with the differences that it preserves original text including punctuation, speech is split at sentence boundaries, and utterances with significant background  ... 
arXiv:2010.10203v2 fatcat:yabdublah5be5lzut4vowsxcue
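
Hash-based embeddings replace a stored vocabulary with a hash function into a fixed table, which is what keeps the on-device model small. A hedged sketch; the table size, dimensionality, and hash choice are illustrative, not the paper's configuration:

```python
# Sketch of the hashing-trick embedding: no vocabulary file, just a fixed table.
import torch
import torch.nn as nn

NUM_BUCKETS, DIM = 4096, 64
table = nn.Embedding(NUM_BUCKETS, DIM)

def hash_embed(words):
    # Python's hash() is salted per process; production code would use a
    # stable hash such as FNV-1a or MurmurHash.
    ids = torch.tensor([hash(w) % NUM_BUCKETS for w in words])
    return table(ids)   # (len(words), DIM), fed to the quasi-recurrent network

print(hash_embed(["hello", "world"]).shape)
```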

VocBench: A Neural Vocoder Benchmark for Speech Synthesis [article]

Ehab A. AlBadawy, Andrew Gibiansky, Qing He, Jilong Wu, Ming-Ching Chang, Siwei Lyu
arXiv preprint, 2021
Neural vocoders, used for converting the spectral representations of an audio signal to waveforms, are a commonly used component in speech synthesis pipelines.  ...  The task focuses on synthesizing waveforms from low-dimensional representations, such as Mel-spectrograms.  ...  We present VocBench, a framework for general-purpose benchmarking of neural vocoders on the speech synthesis task.  ... 
arXiv:2112.03099v1 fatcat:rvnn2xhsqnfvjaccnevmsmmyj4
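
The benchmark's common interface is the vocoder input itself: a Mel-spectrogram to be inverted back into a waveform. A minimal sketch of that front end with torchaudio; the parameter values are common defaults and the file name is a placeholder, not necessarily the VocBench configuration:

```python
# Sketch: computing the Mel-spectrogram a neural vocoder inverts to audio.
import torchaudio

waveform, sr = torchaudio.load("utterance.wav")   # placeholder input file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)
# A vocoder such as WaveGlow or HiFi-GAN maps `mel` back to waveform samples.
print(mel.shape)   # (channels, 80, frames)
```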

MLS: A Large-Scale Multilingual Dataset for Speech Research [article]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert
arXiv preprint, 2020
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research.  ...  We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.  ...  Acknowledgements: We would like to thank Steven Garan for help in data preparation and text normalization and Mark Chou for helping with setting up the workflow for transcription verification.  ... 
arXiv:2012.03411v1 fatcat:krcmqjo2jzatfh6ahrlykqeooi
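
Beyond the OpenSLR release the abstract points to, MLS is also mirrored on the Hugging Face hub; a hedged sketch of loading it that way (the dataset name, config, and field names are assumptions about that mirror, not part of the paper):

```python
# Sketch: loading a Multilingual LibriSpeech split via Hugging Face datasets.
from datasets import load_dataset

mls_german = load_dataset(
    "facebook/multilingual_librispeech", "german", split="test"
)
sample = mls_german[0]
print(sample["text"], sample["audio"]["sampling_rate"])
```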