359 Hits in 3.9 sec

Tacotron: Towards End-to-End Speech Synthesis [article]

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous
2017 arXiv   pre-print
In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters.  ...  A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.  ...  ACKNOWLEDGMENTS The authors would like to thank Heiga Zen and Ziang Xie for constructive discussions and feedback.  ... 
arXiv:1703.10135v2 fatcat:p6l6fcxy55dnphltjw2jw4n2w4

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [article]

Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma
2021 arXiv   pre-print
Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed  ...  We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.  ...  ACKNOWLEDGEMENTS The authors thank Jenelle Feather for initial work integrating a flow into Tacotron, Rif A.  ... 
arXiv:2011.03568v2 fatcat:lnms2llr75g5vdas6ash4nfnxq

Parallel Tacotron: Non-Autoregressive and Controllable TTS [article]

Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
2020 arXiv   pre-print
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in efficiency and naturalness.  ...  This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder.  ...  Text-to-speech synthesis is a one-to-many mapping problem, as there can be multiple possible speech realizations with different prosody for a text input.  ...
arXiv:2010.11439v1 fatcat:3w3cqv2tkzd3lkd6zcxvn6beni

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language [article]

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2019 arXiv   pre-print
End-to-end speech synthesis is a promising approach that directly converts raw text to speech.  ...  Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical  ...  Acknowledgements We are grateful to Prof. Zhen-Hua Ling from USTC for kindly answering our questions.  ... 
arXiv:1810.11960v2 fatcat:i7mp374z4natbb3wd6zxd7y25i

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS [article]

Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
2020 arXiv   pre-print
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality.  ...  In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.  ...  INTRODUCTION With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over the conventional TTS techniques [1] , [2] .  ... 
arXiv:2008.05284v1 fatcat:cqeky4hzu5fx3aavg26ql7eni4

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling [article]

Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu
2021 arXiv   pre-print
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor.  ...  recognition model.  ...  INTRODUCTION Autoregressive neural text-to-speech (TTS) models using an attention mechanism are known to be able to generate speech with naturalness on par with recorded human speech.  ... 
arXiv:2010.04301v4 fatcat:idae6o3gabhbfle762dow5kt5e

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet [article]

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
2019 arXiv   pre-print
An end-to-end speech synthesis task is conducted when the model is given text as the input while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source speaker  ...  We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks.  ...  Section 2 introduces the end-to-end speech synthesis model Tacotron.  ... 
arXiv:1903.12389v2 fatcat:k43cqpkwfvaebhp64tydqgynui

Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem

Yifan Liu, Jin Zheng
2019 Information  
End-to-end text-to-speech synthesis has gained considerable research interest, because compared to traditional models the end-to-end model is easier to design and more robust.  ...  Text-to-speech synthesis is a computational technique for producing synthetic, human-like speech by a computer.  ...  Moreover, the implementation of Tacotron 2 used in this work can be acquired from https://github.com/Rayhane-mamah/Tacotron-2/.  ...
doi:10.3390/info10040131 fatcat:6s4uroc4szdelnvw6bjiytrvp4

Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku
2019 Interspeech 2019  
Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system. In Proceedings of Interspeech, Graz, pages 2833-2837, September 2019.  ...  Publication IX: "Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system" The author developed, implemented, and evaluated the proposed algorithm.  ... 
doi:10.21437/interspeech.2019-1333 dblp:conf/interspeech/BollepalliJA19 fatcat:5uz43svog5erzev5nzakdnc4qe

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
2020 IEEE Signal Processing Letters  
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality.  ...  In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.  ...  INTRODUCTION With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over the conventional TTS techniques [1] , [2] .  ... 
doi:10.1109/lsp.2020.3016564 fatcat:q7rd6md5mnbrtpsyjpbaoiv5ou

Multi-reference Tacotron by Intercross Training for Style Disentangling,Transfer and Control in Speech Synthesis [article]

Yanyao Bian, Changbin Chen, Yongguo Kang, Zhenglin Pan
2019 arXiv   pre-print
Existing approaches model all speech styles into one representation, lacking the ability to control a specific speech feature independently.  ...  Experimental results show that our model is able to control and transfer desired speech styles individually.  ...  Conclusions In this paper, we introduced multi-reference encoder to Tacotron and proposed intercross training technique.  ... 
arXiv:1904.02373v1 fatcat:xbplm7k24zhv7jflrszs7bnnau

Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
2019 Interspeech 2019  
An end-to-end speech synthesis task is conducted when the model is given text as the input while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source speaker  ...  We propose using an extended model architecture of Tacotron, that is a multi-source sequence-to-sequence model with a dual attention mechanism as the shared model for both the TTS and VC tasks.  ...  The architecture of our model is based on Tacotron. Given text characters as input, the model conducts end-to-end speech synthesis.  ... 
doi:10.21437/interspeech.2019-1357 dblp:conf/interspeech/00030F0Y19 fatcat:urmwhweuzjakhisxfsgb67ijdq

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron [article]

RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
2018 arXiv   pre-print
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.  ...  Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance.  ...  Model Architecture Our model is based on Tacotron (Wang et al., 2017a) , a recently proposed state-of-the-art end-to-end speech synthesis model that predicts mel spectrograms directly from grapheme or  ... 
arXiv:1803.09047v1 fatcat:dzeoe3iwmjbdrfhdq6qv7oe4q4

MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer

Sungwoo Moon, Sunghyun Kim, Yong-Hoon Choi
2022 IEEE Access  
In this paper, we propose mel-spectrogram image transfer (MIST)-Tacotron, a Tacotron 2-based speech synthesis model that adds a reference encoder with an image style transfer module.  ...  INDEX TERMS Tacotron, mel-spectrogram, image style transfer, speech synthesis, multi-speaker text-tospeech (TTS), emotion expression.  ...  Tacotron [9] is the first end-to-end generative TTS model based on the sequence-to-sequence neural network model [18] with attention module.  ... 
doi:10.1109/access.2022.3156093 fatcat:k2vxxhn6lrcczhyf2qlle3tpni

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
End-to-end speech synthesis is a promising approach that directly converts raw text to speech.  ...  Towards end-to-end Japanese speech synthesis, we extend Tacotron to systems with self-attention to capture long-term dependencies related to pitch accents and compare their audio quality with classical  ...  Acknowledgements We are grateful to Prof. Zhen-Hua Ling from USTC for kindly answering our questions.  ...
doi:10.1109/icassp.2019.8682353 dblp:conf/icassp/YasudaWTY19 fatcat:onp3jb4jvrdz7brleqin3zjsna
Showing results 1 — 15 out of 359 results