14 Hits in 7.1 sec

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [article]

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
2020 arXiv   pre-print
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech.  ...  This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.  ...  ACKNOWLEDGEMENTS The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.  ... 
arXiv:2002.03788v1 fatcat:e6qmsdw5m5eh3d4fzxwkmufbpq

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.  ...  In this way, the quantized discrete tokens and the speech can be regarded as pseudo paired data to pre-train a TTS model, which is then fine-tuned on few truly paired text and speech data [201, 358, 436]  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi

Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [article]

Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
2021 arXiv   pre-print
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.  ...  This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module.  ...  Furthermore, a hierarchical, multi-level, fine-grained VAE structure is proposed in [9], modeling word-level and phoneme-level prosody features, while a similar VAE structure with the addition of a quantization  ... 
arXiv:2111.10177v1 fatcat:7wa5o5yqsbfale6juzhkok3m24

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [article]

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
2022 arXiv   pre-print
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing  ...  This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.  ...  Sun et al. (2020b; a) adopt VAE to represent the fine-grained prosody variable, naturally enabling sampling of different prosody features for each phoneme.  ... 
arXiv:2205.07211v1 fatcat:xbd5tfkxdnfnbhg2zoj6w63ycq

Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [article]

Hieu-Thi Luong, Junichi Yamagishi
2021 arXiv   pre-print
Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given  ...  In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart.  ...  "...to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior," in Proc.  ...  Audio, Speech, Language Process., vol. 29, pp. 745–755, 2021.  ... 
arXiv:2106.13479v1 fatcat:3pva7ksvirgdzijtu5x7anizs4

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [article]

Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin
2020 arXiv   pre-print
Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.  ...  Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.  ...  [19] suggest modeling style using an autoregressive prior over representations from quantized fine-grained VAE and perform evaluation by synthesizing utterances for training ASR on LibriSpeech.  ... 
arXiv:2005.07157v2 fatcat:qgp3erhkwjgh5kxg2laoqdxmr4

Deep generative models for musical audio synthesis [article]

M. Huzaifah, L. Wyse
2020 arXiv   pre-print
There are a few distinct approaches that have been developed historically including modelling the physics of sound production and propagation, assembling signal generating and processing elements to capture acoustic features, and manipulating collections of recorded audio samples.  ...  Acknowledgements This research was supported by a Singapore MOE Tier 2 grant, "Learning Generative Recurrent Neural Networks," and by an NVIDIA Corporation Academic Programs GPU grant.  ... 
arXiv:2006.06426v2 fatcat:swt7npt3gnbj5ppzcf2ef3rose

Privacy-preserving Voice Analysis via Disentangled Representations [article]

Ranya Aloufi, Hamed Haddadi, David Boyle
2020 arXiv   pre-print
Our objective is to enable primary tasks such as speech recognition and user identification, while removing sensitive attributes in the raw speech data before sharing it with a cloud service provider.  ...  To defend against this class of attacks, we design, implement, and evaluate a user-configurable, privacy-aware framework for optimizing speech-related data sharing mechanisms.  ...  In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors [22] such as speaker identity, noise, recording channels, and prosody, as well as  ... 
arXiv:2007.15064v1 fatcat:oancnsvxlja4zdz2whxef5s3tm

NAUTILUS: a Versatile Voice Cloning System

Hieu-Thi Luong, Junichi Yamagishi
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.  ...  By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the  ...  ACKNOWLEDGMENTS This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.  ... 
doi:10.1109/taslp.2020.3034994 fatcat:6kxb7ohf55fphczttutgxmlb4e

NAUTILUS: a Versatile Voice Cloning System [article]

Hieu-Thi Luong, Junichi Yamagishi
2020 arXiv   pre-print
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.  ...  By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the  ...  ACKNOWLEDGMENTS This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.  ... 
arXiv:2005.11004v2 fatcat:elj3sz6ognahpmoras2ju4nco4

A Roadmap for Big Model [article]

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han (+88 others)
2022 arXiv   pre-print
Generation, Dialogue and Protein Research.  ...  In each topic, we summarize clearly the current studies and propose some future research directions. At the end of this paper, we conclude the further development of BMs in a more general view.  ...  The most commonly used framework is an autoregressive text generation that produces output given the previously generated words word-by-word [4], and another is a non-autoregressive text generation that  ... 
arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4

Table of contents

2021 ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
...TEXT-TO-SPEECH SPECTRUM MODELING (p. 5689), Qing He, Zhiping Xiu, Thilo Koehler, Jilong Wu, Facebook Inc, United States; SPE-3.4: END-TO-END TEXT-TO-SPEECH USING LATENT  ...  Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu, Google, Israel; SPE-4.2: FCL-TACO2: TOWARDS FAST, CONTROLLABLE AND LIGHTWEIGHT TEXT-TO-SPEECH SYNTHESIS (p. 5714)  ... 
doi:10.1109/icassp39728.2021.9414617 fatcat:m5ugnnuk7nacbd6jr6gv2lsfby

Paralinguistic Privacy Protection at the Edge [article]

Ranya Aloufi, Hamed Haddadi, David Boyle
2020
As our emotional patterns and sensitive attributes like our identity, gender, and mental well-being are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using  ...  at the edge prior to offloading to the cloud.  ... 
doi:10.48550/arxiv.2011.02930 fatcat:4uhqol3tsbaezk3mwx2z576mje

Effects of errorless learning on the acquisition of velopharyngeal movement control

Andus Wing-Kuen Wong, Tara Whitehill, Estella Ma, Rich Masters
2012 Journal of the Acoustical Society of America  
The problem of estimating the directions-of-arrival (DOA) of a source in a room and its reflections using RIR data and microphone arrays is considered.  ...  In practical cases only a limited amount of information is available to compute the Herglotz kernel, typically because a finite number of sensors is used for the measurement.  ...  These speech samples were useful in the early stages of tone discrimination learning.  ... 
doi:10.1121/1.4708235 fatcat:7wzupz5u2nd6nc7ttvbpxwvunm