40 Hits in 4.4 sec

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling [article]

Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu
2021 arXiv   pre-print
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor.  ...  The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time.  ...  Semi-supervised and unsupervised duration modeling of Non-Attentive Tacotron, allowing the model to be trained with few to no duration annotations.  ... 
arXiv:2010.04301v4 fatcat:idae6o3gabhbfle762dow5kt5e
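
To make the duration-predictor idea above concrete, here is a minimal sketch of Gaussian upsampling, the mechanism Non-Attentive Tacotron uses in place of attention: predicted per-phoneme durations place a Gaussian window over the output frames for each phoneme. The fixed `sigma`, the tensor shapes, and the function names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of Gaussian upsampling (cf. arXiv:2010.04301); shapes and the
# fixed sigma are assumptions, not the paper's exact setup.
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """Expand phoneme encodings h [N, D] to frame-level inputs using
    Gaussian weights centered at each phoneme's duration-implied midpoint."""
    centers = torch.cumsum(durations, dim=0) - 0.5 * durations  # [N]
    total_frames = int(durations.sum().item())
    t = torch.arange(total_frames).float() + 0.5                # frame times [T]
    # weight of phoneme n at frame t, normalized over phonemes
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2)
    w = torch.softmax(logits, dim=1)                            # [T, N]
    return w @ h                                                # [T, D]

h = torch.randn(5, 8)                     # 5 phonemes, 8-dim encodings
dur = torch.tensor([3., 2., 4., 1., 5.])  # predicted frame counts
frames = gaussian_upsample(h, dur)        # [15, 8] decoder inputs
```

Scaling all predicted durations by a constant at inference time gives the utterance-wide speed control the snippet mentions; editing a single phoneme's duration gives per-phoneme control.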

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive  ...  Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  [382, 303], DeepVoice 3 [270], and TransformerTTS [192]. 2) AR + Non-Attention (Duration), such as DurIAN [418], RobuTrans [194], and Non-Attentive Tacotron [304]. 3) Non-AR + Attention, such  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi
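
The survey's two-axis taxonomy of acoustic models (autoregressive vs. non-autoregressive generation, attention-based vs. duration-based alignment) can be restated as a small lookup table. Only the model names visible in the snippet are filled in; the elided categories are left empty rather than guessed, and the label for category 1 is inferred from the 2x2 structure.

```python
# Acoustic-model taxonomy from the survey snippet above, as a dict.
# Category 1's name is elided in the snippet and inferred here; empty
# lists mark categories whose examples the snippet truncates ("...").
ACOUSTIC_MODEL_TAXONOMY = {
    ("AR", "attention"): ["DeepVoice 3", "TransformerTTS"],  # category 1 (label inferred)
    ("AR", "duration"): ["DurIAN", "RobuTrans", "Non-Attentive Tacotron"],
    ("non-AR", "attention"): [],  # category 3; examples elided in snippet
}
```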

DurIAN: Duration Informed Attention Network For Multimodal Synthesis [article]

Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
2019 arXiv   pre-print
This is different from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, and accounts for the various unavoidable artifacts in those systems.  ...  from a duration model.  ...  Linchao Bao, Haozhi Huang, and other members in the Tencent AI Lab computer vision team for providing facial modeling features and the multimodal experiment environment.  ... 
arXiv:1909.01700v2 fatcat:zh36kga3czak5fey5r3sje4p6e
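
A sketch of the duration-informed expansion that replaces attention alignment in DurIAN: phoneme-level encoder states are repeated according to frame counts supplied by a separate duration model. Names and shapes are assumptions for illustration.

```python
# Duration-informed expansion (cf. arXiv:1909.01700): explicit per-phoneme
# frame counts align text to speech instead of learned attention.
import torch

def expand_by_duration(h, durations):
    """h: [N, D] phoneme states; durations: [N] integer frame counts.
    Returns frame-level states [sum(durations), D]."""
    return torch.repeat_interleave(h, durations, dim=0)

h = torch.randn(4, 8)
dur = torch.tensor([2, 3, 1, 4])           # from a duration model
frame_states = expand_by_duration(h, dur)  # [10, 8]
```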

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis [article]

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
2018 arXiv   pre-print
When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.  ...  The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content.  ...  Weiss, Mike Schuster, Yonghui Wu, Patrick Nguyen, and the Machine Hearing, Google Brain and Google TTS teams for their helpful discussions and feedback.  ... 
arXiv:1803.09017v1 fatcat:tzoe7pe3vzcatftoy2jv6uesla
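
The global style token (GST) mechanism described above can be sketched as attention over a small bank of learned embeddings; the attention weights are the soft, interpretable "labels" the snippet mentions. This single-head dot-product version simplifies the paper's multi-head attention, and all dimensions are assumptions.

```python
# Simplified GST layer (cf. arXiv:1803.09017): a reference embedding
# attends over learned style tokens; weights act as soft style labels.
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    def __init__(self, num_tokens=10, token_dim=64, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: [B, ref_dim] from a reference encoder
        q = self.query_proj(ref_embedding)                   # [B, token_dim]
        scores = q @ torch.tanh(self.tokens).t()             # [B, num_tokens]
        weights = torch.softmax(scores, dim=-1)              # soft style "labels"
        style_embedding = weights @ torch.tanh(self.tokens)  # [B, token_dim]
        return style_embedding, weights

gst = StyleTokenLayer()
style, w = gst(torch.randn(2, 128))  # style: [2, 64], w: [2, 10]
```

At inference, feeding hand-picked weight vectors instead of a reference recording gives direct style control, which is how the tokens enable transfer without labels.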

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
and more attention.  ...  has more powerful modeling ability and a simpler pipeline.  ...  Robust acoustic model: Neural TTS models based on the autoregressive generative method and the attention mechanism have been able to generate speech that is as natural as the human voice.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control [article]

Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris
2021 arXiv   pre-print
The neural TTS model is fine-tuned to an unseen speaker's limited recordings, allowing rapping/singing synthesis with the target speaker's voice.  ...  It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data, which provides prosody control at the phoneme level.  ...  During the last few years, with the establishment of neural TTS systems such as Tacotron [7], it has become possible to investigate approaches like neural rapping and singing.  ... 
arXiv:2111.09146v1 fatcat:fonznraxrvcu7kvaxubkmpe35m

Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [article]

Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri
2022 arXiv   pre-print
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension  ...  Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech.  ...  Previously, we presented prosody modeling and control using Tacotron 2 [16] , and now we expand this work to hierarchical prosody modeling in non-autoregressive parallel TTS.  ... 
arXiv:2110.02952v2 fatcat:xhp2ekbcpvehzikhxnshhmsxli
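
One plausible reading of the utterance-wise conditioning described above, as a sketch: the five scalar prosodic features are projected and added to every encoder frame of a non-autoregressive model. The additive scheme and layer sizes are assumptions, not the paper's architecture.

```python
# Utterance-level prosody conditioning sketch (cf. arXiv:2110.02952);
# the additive projection and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class UtteranceProsodyConditioner(nn.Module):
    FEATURES = ["pitch", "pitch_range", "duration", "energy", "spectral_tilt"]

    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(len(self.FEATURES), hidden_dim)

    def forward(self, encoder_out, prosody):
        # encoder_out: [B, T, H]; prosody: [B, 5] utterance-wide scalars
        bias = self.proj(prosody).unsqueeze(1)  # [B, 1, H], broadcast over T
        return encoder_out + bias

cond = UtteranceProsodyConditioner()
out = cond(torch.randn(2, 100, 256), torch.randn(2, 5))
```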

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [article]

Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King
2021 arXiv   pre-print
Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control.  ...  Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: F_0, energy and duration.  ...  FastSpeech 2 [17] is a non-autoregressive TTS model conditioned on extracted F0 and energy features, which uses explicit phone durations.  ... 
arXiv:2106.08352v1 fatcat:voo4yudmpre2bbjczawwrfpsuu
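
A sketch of explicit per-phone prosody conditioning in the spirit of Ctrl-P and FastSpeech 2 as described in the snippet: F0 and energy values are quantized, embedded, and added to the phone encodings before duration-based expansion. The bin ranges and the bucketing scheme are assumptions.

```python
# Per-phone F0/energy conditioning sketch (cf. arXiv:2106.08352 and
# FastSpeech 2); bin counts and value ranges are assumptions.
import torch
import torch.nn as nn

class ProsodyConditioner(nn.Module):
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.f0_emb = nn.Embedding(n_bins, hidden_dim)
        self.energy_emb = nn.Embedding(n_bins, hidden_dim)
        self.n_bins = n_bins

    def bucketize(self, x, lo, hi):
        # map continuous values into n_bins quantization bins
        bins = torch.linspace(lo, hi, self.n_bins - 1)
        return torch.bucketize(x, bins)

    def forward(self, h, f0, energy):
        # h: [B, N, H] phone encodings; f0, energy: [B, N] per-phone values
        h = h + self.f0_emb(self.bucketize(f0, 50.0, 500.0))
        h = h + self.energy_emb(self.bucketize(energy, 0.0, 1.0))
        return h  # then expanded by explicit phone durations

pc = ProsodyConditioner()
out = pc(torch.randn(2, 10, 256),
         torch.rand(2, 10) * 300 + 80,  # F0 in Hz
         torch.rand(2, 10))             # normalized energy
```

Because the conditioning values are per-phone and explicit, overriding them at inference gives the temporally precise, disentangled control the abstract contrasts with VAE latents.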

Semi-Supervised Generative Modeling for Controllable Speech Synthesis [article]

Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby
2019 arXiv   pre-print
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models.  ...  TTS models.  ...  We also include a baseline of our Tacotron model augmented only by the unsupervised latent z_s, to aid comparison.  ... 
arXiv:1910.01709v1 fatcat:hme6rv53vncsrbxcvmkncl3blu
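
The semi-supervised idea in the snippet, mixing supervised and unsupervised latents, might be sketched as follows: when a label for the latent is observed it is used directly, otherwise a reparameterized sample from the approximate posterior stands in. This is a schematic of the general technique, not the paper's exact model; all names are placeholders.

```python
# Schematic semi-supervised latent handling (cf. arXiv:1910.01709):
# observed labels replace sampled latents where available.
import torch

def latent_for_batch(posterior_mu, posterior_logvar, label, observed):
    """Reparameterized sample where unlabeled, ground truth where labeled.
    posterior_mu, posterior_logvar, label: [B, Z]; observed: [B] bool mask."""
    eps = torch.randn_like(posterior_mu)
    z_sampled = posterior_mu + eps * torch.exp(0.5 * posterior_logvar)
    return torch.where(observed.unsqueeze(-1), label, z_sampled)

z = latent_for_batch(torch.zeros(4, 16), torch.zeros(4, 16),
                     torch.ones(4, 16), torch.tensor([True, False, True, False]))
```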

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis [article]

Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
2020 arXiv   pre-print
Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis.  ...  Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality.  ...  Without labels for the non-textual information, models have fallen back to unsupervised learning.  ... 
arXiv:2005.05957v3 fatcat:fyns5vml2zcgza7qdqohtfk2xa
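
A minimal sketch of one autoregressive affine flow step of the kind Flowtron stacks: each latent frame is shifted and scaled by parameters predicted from earlier frames (the real model also conditions on text and speaker, omitted here). The LSTM parameterization and the direction conventions are assumptions.

```python
# One autoregressive affine flow step, Flowtron-style (cf. arXiv:2005.05957);
# conditioning on text/speaker is omitted, and the tiny LSTM is an assumption.
import torch
import torch.nn as nn

class ARAffineFlowStep(nn.Module):
    def __init__(self, dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * dim)  # predicts log-scale and shift

    def forward(self, x):
        # x: [B, T, dim] mel frames; parameters for step t depend on frames < t
        shifted = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.rnn(shifted)
        log_s, b = self.proj(h).chunk(2, dim=-1)
        z = (x - b) * torch.exp(-log_s)    # data -> latent direction
        log_det = -log_s.sum(dim=(1, 2))   # contribution to the flow likelihood
        return z, log_det

step = ARAffineFlowStep()
z, log_det = step(torch.randn(2, 50, 80))
```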

Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [article]

Shan Yang, Yuxuan Wang, Lei Xie
2020 arXiv   pre-print
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.  ...  model.  ...  s_{t−1}, and c_t is the context vector computed from the attention function g(·), which includes a content- or non-content-based score function to measure the contribution of each memory x_i [1], [13],  ... 
arXiv:2004.13595v1 fatcat:bt4bt53lsvbbphoie4675p34iu
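
A worked example of the context-vector computation the snippet describes: a content-based score between the previous decoder state s_{t−1} and each memory x_i is softmax-normalized, and c_t is the weighted sum of the memories. The dot-product score is one common choice; the paper's exact score function may differ.

```python
# Content-based attention context vector: score, normalize, weighted sum.
import torch

def attention_context(s_prev, memory):
    # s_prev: [D] previous decoder state; memory: [N, D] encoder outputs x_i
    scores = memory @ s_prev              # content-based score per memory
    alpha = torch.softmax(scores, dim=0)  # contribution of each x_i
    return alpha @ memory                 # c_t: [D]

c_t = attention_context(torch.randn(8), torch.randn(20, 8))
```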

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data [article]

Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li
2021 arXiv   pre-print
We first develop a multi-speaker speech synthesis system with a sequence-to-sequence encoder-decoder architecture, where the encoder extracts robust linguistic representations of text, and the decoder, conditioned  ...  This paper presents a novel framework to build a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system, which is called TTS-VC transfer learning.  ...  Experimental Setup and Model Architecture: The three competing baselines in Table II include a multispeaker TTS model, a PPG-VC model, and a VAE-VC model.  ... 
arXiv:2009.14399v2 fatcat:ta32qp23rbayfhj4iwbhvrr7km

Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning [article]

Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho
2020 arXiv   pre-print
The proposed SCTTS does not require any additional well-trained model or an external speech database to extract phoneme-level duration information, and it can be trained in an end-to-end manner.  ...  This paper proposes a controllable end-to-end text-to-speech (TTS) system to control the speaking speed (speed-controllable TTS; SCTTS) of synthesized speech with a sentence-level speaking-rate value as  ...  In [9, 10], neural TTS systems that control the phoneme-level speech duration have been proposed.  ... 
arXiv:2007.15281v2 fatcat:3lirjxoxgfdavbb37oieenwp2m
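
A sketch of the sentence-level conditioning described above: a single speaking-rate scalar per utterance is projected and broadcast over the encoder output, so no phoneme-level duration extraction is needed. The additive projection is an assumption about the interface, not the paper's exact design.

```python
# Sentence-level speaking-rate conditioning sketch (cf. arXiv:2007.15281);
# the additive projection and hidden size are assumptions.
import torch
import torch.nn as nn

class SpeedConditioner(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, encoder_out, rate):
        # encoder_out: [B, T, H]; rate: [B] one speaking-rate value per
        # sentence, user-adjustable at inference time for speed control
        r = self.proj(rate.unsqueeze(-1)).unsqueeze(1)  # [B, 1, H]
        return encoder_out + r

sc = SpeedConditioner()
out = sc(torch.randn(2, 40, 256), torch.tensor([1.0, 1.3]))
```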

Speaking Speed Control of End-to-End Speech Synthesis Using Sentence-Level Conditioning

Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho
2020 Interspeech 2020  
The proposed SCTTS does not require any additional well-trained model or an external speech database to extract phoneme-level duration information, and it can be trained in an end-to-end manner.  ...  This paper proposes a controllable end-to-end text-to-speech (TTS) system to control the speaking speed (speed-controllable TTS; SCTTS) of synthesized speech with a sentence-level speaking-rate value as  ...  In [9, 10], neural TTS systems that control the phoneme-level speech duration have been proposed.  ... 
doi:10.21437/interspeech.2020-1361 dblp:conf/interspeech/BaeBJLLC20 fatcat:f3b5t57bkzg7bewt7fellleyda

Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [article]

Hieu-Thi Luong, Junichi Yamagishi
2021 arXiv   pre-print
Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given  ...  In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart.  ...  Zen, and Y. Wu, "Non-attentive tacotron: Robust and controllable neural  ..."  "...from speech synthesis to voice conversion with non-parallel training data," IEEE/ACM Trans.  ... 
arXiv:2106.13479v1 fatcat:3pva7ksvirgdzijtu5x7anizs4
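
A minimal sketch of the vector-quantized latent discussed in this last result: each continuous latent vector is replaced by its nearest codebook entry, with the straight-through estimator passing gradients through the discrete lookup. Codebook size and dimensions are assumptions.

```python
# Vector-quantization lookup sketch (cf. arXiv:2106.13479): nearest-code
# replacement with the straight-through gradient trick.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, z):
        # z: [B, T, dim] continuous latents
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook)        # [B*T, num_codes]
        idx = dists.argmin(dim=-1).view(z.shape[:-1])   # nearest code per vector
        q = self.codebook[idx]                          # quantized latents
        return z + (q - z).detach(), idx                # straight-through estimator

vq = VectorQuantizer()
quantized, codes = vq(torch.randn(2, 30, 64))
```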
Showing results 1 — 15 out of 40 results