37 Hits in 3.7 sec

Fine-grained Noise Control for Multispeaker Speech Synthesis [article]

Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
2022 arXiv   pre-print
... unsupervised, interpretable and fine-grained noise and prosody modeling.  ...  To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results in more expressive speech synthesis.  ...  Experimental results show that our system outperforms other methods in clean speech synthesis, indicating that our unsupervised, fine-grained noise modeling method can control and remove the inherent noise  ... 
arXiv:2204.05070v1 fatcat:cmy3dsmyjfhm5plzuvezehsp6i

Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [article]

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
2020 arXiv   pre-print
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech.  ...  Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes).  ...  ACKNOWLEDGEMENTS The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.  ... 
arXiv:2002.03788v1 fatcat:e6qmsdw5m5eh3d4fzxwkmufbpq
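The entry above describes a quantized fine-grained VAE that extracts one latent per input token (e.g., phoneme) and discretizes it. A minimal numpy sketch of the quantization step; the codebook size `K` and latent dimension `d` are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook of K discrete codes; K and d are assumed, not from the paper.
K, d = 8, 4
codebook = rng.normal(size=(K, d))

def quantize(latents):
    """Snap each per-token latent to its nearest codebook entry (L2 distance)."""
    # latents: (T, d), one latent per input token (e.g. phoneme)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)       # discrete code index per token
    return codes, codebook[codes]      # quantized latents fed to the decoder

latents = rng.normal(size=(5, d))      # stand-in for fine-grained VAE outputs
codes, quantized = quantize(latents)
print(codes.shape, quantized.shape)    # (5,) (5, 4)
```

The paper additionally fits an autoregressive prior over these discrete codes so that diverse prosody sequences can be sampled at inference; that prior is omitted here.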

On Prosody Modeling for ASR+TTS based Voice Conversion [article]

Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda
2021 arXiv   pre-print
... speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech.  ...  Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.  ...  We would also like to thank Yu-Huai Peng and Hung-Shin Lee from Academia Sinica, Taiwan, for training the BNF extractor.  ... 
arXiv:2107.09477v1 fatcat:vq27g2f7kjd3hm737emd6wbhka

Multi-Speaker Neural Vocoder

Oriol Barbany, Antonio Bonafonte, Santiago Pascual
2018 IberSPEECH 2018  
Statistical Parametric Speech Synthesis (SPSS) offers more flexibility than unit-selection based speech synthesis, which was the dominant commercial technology during the 2000s decade.  ...  This paper exposes two proposals conceived to improve deep learning-based text-to-speech systems.  ...  In summary, with the combination of the two proposals, a state of the art MOS score has been achieved for a multispeaker speech synthesis system.  ... 
doi:10.21437/iberspeech.2018-7 dblp:conf/iberspeech/BarbanyBP18 fatcat:gcmvuewz2zc4dl7fvbfi7zz6k4

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Moreover, this paper also summarizes the open-source speech corpora of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  Due to the limitations of high complexity and low efficiency of traditional speech synthesis technology, the current research focus is the deep learning-based end-to-end speech synthesis technology, which  ...  How to achieve fine-grained style control of speech at the word level and phrase level will also be a focus of future TTS research.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years.  ...  ; 3) in order to achieve fine-grained voice control and transfer, we need to disentangle different variation information, such as content and prosody, timbre and noise, etc.  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi

Pitchtron: Towards audiobook generation from ordinary people's voices [article]

Sunghee Jung, Hoirin Kim
2020 arXiv   pre-print
The AXY score over GST is 2.01 and 1.14 for hard pitchtron and soft pitchtron respectively.  ...  To be specific, we explore transferring Korean dialects and emotive speech even though the training set is mostly composed of standard and neutral Korean.  ...  Lee et al. suggested using variable-length residual embedding to improve fine-grained control over local dynamics [15] .  ... 
arXiv:2005.10456v1 fatcat:mdthtk2oofbubeba6jthokqxqu

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [article]

Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King
2021 arXiv   pre-print
Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control.  ...  Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.  ...  Acknowledgements We thank our adviser Mark Gales for feedback on this work. References  ... 
arXiv:2106.08352v1 fatcat:voo4yudmpre2bbjczawwrfpsuu
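The Ctrl-P entry above contrasts opaque VAE latents with explicit, interpretable, temporally precise prosody conditioning. A minimal sketch of the general idea (the feature set and dimensions are assumptions for illustration, not the paper's exact architecture): per-token acoustic features are concatenated onto the text encoding, so each feature can be edited directly at inference time:

```python
import numpy as np

# Per-token text encodings (zeros as a stand-in for a real text encoder).
T, d_text = 4, 6
text_enc = np.zeros((T, d_text))

# Explicit per-token acoustic features: F0 (Hz), energy, duration (frames).
prosody = np.array([[120.0, 0.5, 3.0],
                    [180.0, 0.9, 5.0],
                    [150.0, 0.7, 4.0],
                    [100.0, 0.3, 6.0]])

# Conditioning by concatenation keeps every feature interpretable.
conditioned = np.concatenate([text_enc, prosody], axis=1)
print(conditioned.shape)               # (4, 9)

# Temporal control is then a direct edit, e.g. raise pitch by 20 percent:
edited = prosody.copy()
edited[:, 0] *= 1.2
```

Because each column has a physical meaning, control is disentangled by construction, which is the property the abstract claims over unsupervised latent features.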

Speech Synthesis with Mixed Emotions [article]

Kun Zhou, Berrak Sisman, Rajib Rana, B. W. Schuller, Haizhou Li
2022 arXiv   pre-print
Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type.  ...  At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.  ...  Recent attempts [80] , [81] study a way to include a hierarchical, fine-grained prosody representation into the style token-based diagram [36] .  ... 
arXiv:2208.05890v1 fatcat:ryp3rzi4xnfvppmtfxupwkimje
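The entry above controls synthesis at run time through a manually defined emotion attribute vector. A minimal sketch of one plausible reading, blending per-emotion style embeddings by normalized weights; the emotion names, embedding values, and dimension are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical per-emotion style embeddings (values are illustrative).
emotions = {
    "neutral": np.array([1.0, 0.0, 0.0]),
    "happy":   np.array([0.0, 1.0, 0.0]),
    "sad":     np.array([0.0, 0.0, 1.0]),
}

def mix(attribute):
    """Blend emotion embeddings using a manually defined attribute vector."""
    w = np.array([attribute[name] for name in emotions])
    w = w / w.sum()                    # normalize so the weights sum to 1
    basis = np.stack(list(emotions.values()))
    return w @ basis                   # weighted combination of styles

style = mix({"neutral": 0.2, "happy": 0.6, "sad": 0.2})
print(style)                           # a mostly-happy mixture
```

In the actual system the mixed representation would condition an acoustic model; here the decoder is omitted and only the mixing arithmetic is shown.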

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [article]

Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin
2020 arXiv   pre-print
Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.  ...  Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.  ...  [19] suggest modeling style using an autoregressive prior over representations from quantized fine-grained VAE and perform evaluation by synthesizing utterances for training ASR on LibriSpeech.  ... 
arXiv:2005.07157v2 fatcat:qgp3erhkwjgh5kxg2laoqdxmr4

Controllable Data Generation by Deep Learning: A Review [article]

Shiyu Wang, Yuanqi Du, Xiaojie Guo, Bo Pan, Liang Zhao
2022 arXiv   pre-print
Designing and generating new data under targeted properties has been attracting various critical applications such as molecule design, image editing and speech synthesis.  ...  This article provides a systematic review of this promising research area, commonly known as controllable deep data generation.  ...  Stanford Sentiment Treebank (SST) dataset includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences [295] .  ... 
arXiv:2207.09542v2 fatcat:ey6v72rkxjbghdw63y2v2kjcde

NAUTILUS: a Versatile Voice Cloning System

Hieu-Thi Luong, Junichi Yamagishi
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.  ...  Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.  ...  These questions were used to highlight the fine-grained differences between generation systems. Each participant in our subjective listening tests was asked to do ten sessions. V.  ... 
doi:10.1109/taslp.2020.3034994 fatcat:6kxb7ohf55fphczttutgxmlb4e

NAUTILUS: a Versatile Voice Cloning System [article]

Hieu-Thi Luong, Junichi Yamagishi
2020 arXiv   pre-print
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.  ...  Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.  ...  These questions were used to highlight the fine-grained differences between generation systems. Each participant in our subjective listening tests was asked to do ten sessions. V.  ... 
arXiv:2005.11004v2 fatcat:elj3sz6ognahpmoras2ju4nco4

A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs

Merlijn Blaauw, Jordi Bonada
2017 Applied Sciences  
We recently presented a new model for singing synthesis based on a modified version of the WaveNet architecture.  ...  In this work, we extend our proposed system to include additional components for predicting F0 and phonetic timings from a musical score with lyrics.  ...  We thank Nagoya Institute of Technology for providing the NIT-SONG070-F001 dataset (licensed under CC BY 3.0), Zya for providing the English datasets, and Voctro Labs for providing the Spanish dataset  ... 
doi:10.3390/app7121313 fatcat:fuqf42bz5ndkbj7gmzmnryp6xq

ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Hector Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee
2021 IEEE Transactions on Biometrics Behavior and Identity Science  
While fusion is shown to be particularly effective for the logical access scenario involving speech synthesis and voice conversion attacks, participants largely struggled to apply fusion successfully for  ...  Furthermore, while results for simulated data are promising, experiments with real replay data show a substantial gap, most likely due to the presence of additive noise in the latter.  ...  ACKNOWLEDGEMENTS The ASVspoof 2019 organisers thank the following for their invaluable contribution to the LA data collection effort -Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min  ... 
doi:10.1109/tbiom.2021.3059479 fatcat:2dgcayn4pzamzfx76su5jvx6ky
Showing results 1 — 15 out of 37 results