
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [article]

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu
2018 arXiv   pre-print
The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder  ...  To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and  ...  The authors thank Jan Chorowski, Samy Bengio, Aäron van den Oord, and the WaveNet and Machine Hearing teams for their helpful discussions and advice, as well as Heiga Zen and the Google TTS team for their  ... 
arXiv:1712.05884v2 fatcat:bnu5fnrvarcxxmiiyh3dhcf2ve
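The mel-scale spectrogram that Tacotron 2 uses as its intermediate representation is built on the standard HTK mel formula. As a minimal sketch (the 0–8 kHz, 80-band layout is an illustrative assumption, not taken from the paper):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (HTK formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edges of 80 triangular mel bands spanning 0-8000 Hz: 82 points,
# equally spaced on the mel scale, mapped back to Hz.
band_edges_hz = [mel_to_hz(hz_to_mel(8000.0) * i / 81) for i in range(82)]
```

Spacing the band edges uniformly in mels rather than in Hz is what concentrates resolution at low frequencies, where it matters perceptually.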

Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders

Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
2019 Interspeech 2019  
In particular, the seq2seq AM and the WaveGlow vocoder conditioned on mel-spectrograms, with simple PyTorch implementations, can be realized with real-time factors of 0.06 and 0.10 for inference on a GPU.  ...  The proposed SG-WaveRNN can predict continuous-valued speech waveforms in half the synthesis time of vanilla WaveRNN with dual softmax for 16-bit audio prediction.  ...  In the TTS condition, mel-spectrograms were predicted by the seq2seq AM from full-context label input, and the TTS waveforms were synthesized by the neural vocoders trained in the AS condition with the  ... 
doi:10.21437/interspeech.2019-1288 dblp:conf/interspeech/OkamotoTSK19 fatcat:z7cmh74fcrh5leen4aoaoncemu
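The real-time factors quoted above follow the usual definition: wall-clock synthesis time divided by the duration of the audio produced, so values below 1 mean faster-than-real-time synthesis. A small helper (the `synthesize` callable is a hypothetical stand-in for an actual vocoder call):

```python
import time

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds

def measure_rtf(synthesize, audio_seconds: float) -> float:
    """Time a synthesis callable and return its RTF."""
    start = time.perf_counter()
    synthesize()
    return real_time_factor(time.perf_counter() - start, audio_seconds)
```

Under this definition, an RTF of 0.06 means 10 s of audio is generated in 0.6 s.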

GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram [article]

Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, Paavo Alku
2019 arXiv   pre-print
The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling but present additional challenges for vocoding (i.e., waveform generation  ...  High-quality synthesis can be achieved with neural vocoders, such as WaveNet, but such autoregressive models suffer from slow sequential inference.  ...  Quality of the WaveNet-based TTS system is robust to mismatch between natural and generated acoustic features due to WaveNet's ability to correct its behavior based on previous predictions [18].  ... 
arXiv:1904.03976v3 fatcat:a2zibokvbndu7h3zxh7j42ufqi
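GELP's "GAN-excited linear prediction" refers to the classical source-filter split: a GAN generates the excitation, which is then shaped by an all-pole LPC synthesis filter. A minimal sketch of that filter (plain Python, not the paper's implementation):

```python
def lpc_synthesis(excitation, a):
    """All-pole synthesis filter: y[n] = e[n] + sum_k a[k] * y[n-1-k].
    `a` holds the LPC coefficients; in GELP the excitation e comes
    from a GAN rather than a classical pulse/noise source."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * y[n - 1 - k]
        y.append(acc)
    return y
```

Because the filter is fixed per frame and cheap to apply, only the excitation needs a neural model, which is what sidesteps sample-level autoregression.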

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention [article]

Bajibabu Bollepalli and Lauri Juvela and Paavo Alku
2018 arXiv   pre-print
Moreover, we experiment with a WaveNet vocoder in synthesis of Lombard speech. We conducted subjective evaluations to assess the performance of the adapted TTS systems.  ...  The subjective evaluation results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in synthesis of Lombard speech.  ...  The study proposes using an adaptation method based on fine-tuning combined with sequence-to-sequence based TTS models and the WaveNet vocoder conditioned using mel-spectrograms.  ... 
arXiv:1810.12051v1 fatcat:bjt2fl7m7raqzncp2cvfmqskte

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram [article]

Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy
2019 arXiv   pre-print
Inspired by image-to-image translation, we address this problem by using a learning-based post-filter combining Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution.  ...  In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations.  ...  INTRODUCTION Text-to-speech (TTS) synthesis aims at producing intelligible and natural speech for a given text input.  ... 
arXiv:1912.01167v1 fatcat:bjcl5zcuofapxkt5f25h6gsr3q

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
of the WaveNet vocoder.  ...  In a large-scale listening test, we investigated the impacts of the presence of accentual-type labels, the use of forced or predicted alignments, and acoustic features used as local condition parameters  ...  A predicted mel-spectrogram is converted to an audio waveform with WaveNet [2].  ... 
doi:10.1109/icassp.2019.8682353 dblp:conf/icassp/YasudaWTY19 fatcat:onp3jb4jvrdz7brleqin3zjsna

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  This survey can serve both academic researchers and industry practitioners working on TTS.  ...  LPCNet generates speech waveform conditioned on BFCC (bark-frequency cepstral coefficients) features, and can be easily adapted to condition on mel-spectrograms.  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi

Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder [article]

Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu
2018 arXiv   pre-print
Significant degradation occurs, especially when predicted acoustic features have mismatched characteristics compared to natural ones.  ...  The neural vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model.  ...  The WaveNet vocoder is likewise conditioned on both mel-spectrograms and speaker codes.  ... 
arXiv:1807.11679v1 fatcat:pqsrkbf7cnerndm554fj26ynhy

Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem

Yifan Liu, Jin Zheng
2019 Information  
on predicting the individual features of the mel spectrogram.  ...  Then, we design Es-Tacotron2 by employing the Es-Network to calculate the estimated mel spectrogram residual, and setting it as an additional prediction task of Tacotron 2, to allow the model to focus more  ...  Acknowledgments: This work is supported by the funding above. The authors would like to thank the anonymous reviewers for their valuable comments and feedback.  ... 
doi:10.3390/info10040131 fatcat:6s4uroc4szdelnvw6bjiytrvp4

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech [article]

Wei Ping, Kainan Peng, Jitong Chen
2019 arXiv   pre-print
We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.  ...  In this work, we propose a new solution for parallel wave generation by WaveNet.  ...  We test our parallel waveform synthesis method by conditioning it on mel-spectrograms and hidden representation within the end-to-end model.  ... 
arXiv:1807.07281v3 fatcat:ms5aytw5sfdnlczv75xftu6s2q
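ClariNet's distillation trains the parallel student against an autoregressive teacher with a KL-divergence objective between their per-sample output distributions. For single-Gaussian outputs that KL has a closed form; a sketch of the univariate case (the regularization ClariNet adds on top is omitted):

```python
import math

def gaussian_kl(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """Closed-form KL(p || q) for univariate Gaussians
    p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)
```

The closed form avoids the high-variance Monte Carlo estimates that distilling against a mixture or softmax output would require.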

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language [article]

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2019 arXiv   pre-print
of the Wavenet vocoder.  ...  In a large-scale listening test, we investigated the impacts of the presence of accentual-type labels, the use of force or predicted alignments, and acoustic features used as local condition parameters  ...  A predicted mel-spectrogram is converted to an audio waveform with WaveNet [2] .  ... 
arXiv:1810.11960v2 fatcat:i7mp374z4natbb3wd6zxd7y25i

Representation Mixing for TTS Synthesis [article]

Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville
2018 arXiv   pre-print
Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation.  ...  Experiments and user studies on a public audiobook corpus show the efficacy of our approach.  ...  This model was also trained on LJSpeech, allowing us to directly use it as a neural inverse to the log mel spectrograms predicted by the attention-based RNN model, or as an inverse to log mel spectrograms  ... 
arXiv:1811.07240v2 fatcat:o5z3i7jfpvfd7molvk5jblradm

Non-Autoregressive Neural Text-to-Speech [article]

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
2020 arXiv   pre-print
ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively refining the attention in a layer-by-layer manner.  ...  In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.  ...  We train two 20-layer WaveNets with 256 residual channels, conditioned on the predicted mel spectrograms from ParaNet and DV3, respectively.  ... 
arXiv:1905.08459v3 fatcat:e5ohuxfx4bb7dczmfvohuinwl4
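An autoregressive WaveNet like the ones used here classifies each next sample over a discretized amplitude alphabet, classically 8-bit μ-law (the dual softmax for 16-bit audio mentioned in an earlier entry is an alternative). A sketch of μ-law companding for a sample in [-1, 1]:

```python
import math

MU = 255  # 8-bit mu-law

def mulaw_encode(x: float) -> int:
    """Compress x in [-1, 1] and quantize to a class in [0, 255]."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * MU))

def mulaw_decode(c: int) -> float:
    """Approximate inverse of mulaw_encode."""
    y = 2.0 * c / MU - 1.0
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1.0) / MU, y)
```

The logarithmic compression spends quantization levels where small amplitudes are, which is why 256 classes suffice for intelligible speech.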

Full-band LPCNet: A real-time neural vocoder for 48 kHz audio with a CPU

Keisuke Matsubara, Takuma Okamoto, Ryoichi Takashima, Tetsuya Takiguchi, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
2021 IEEE Access  
LPCNet predicts the residual signal between natural speech and the speech predicted by linear predictive coding (LPC) [56].  ...  In [15], it was reported that a full-band mel-scale spectrogram inferred by TTS causes over-smoothing of high-frequency components and deterioration of quality.  ... 
doi:10.1109/access.2021.3089565 fatcat:uavgpgactbcwhkr37ao7xl3xyq
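The residual LPCNet models is produced by the LPC analysis (inverse) filter: subtract the linear prediction from the signal and keep only what is left. A plain-Python sketch (LPCNet itself derives frame-wise coefficients from cepstral features; a fixed coefficient vector here is illustrative):

```python
def lpc_residual(x, a):
    """Analysis filter: e[n] = x[n] - sum_k a[k] * x[n-1-k].
    The neural net only has to model e, which is closer to white noise
    and therefore easier to predict than the raw waveform x."""
    e = []
    for n in range(len(x)):
        pred = sum(ak * x[n - 1 - k]
                   for k, ak in enumerate(a) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e
```

Running the matching all-pole synthesis filter on e with the same coefficients reconstructs x exactly, so no information is lost by the split.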

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [article]

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller
2018 arXiv   pre-print
Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.  ...  We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers.  ...  We compare the quality of several waveform synthesis methods, including WORLD (Morise et al., 2016) , Griffin-Lim (Griffin & Lim, 1984) , and WaveNet (Oord et al., 2016) .  ... 
arXiv:1710.07654v3 fatcat:x2kmwi4hwzcglhfu4yy3f7llwi
Showing results 1-15 of 338.