
FastPitch: Parallel Text-to-speech with Pitch Prediction [article]

Adrian Łańcucki
2021 arXiv   pre-print
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference.  ...  Uniformly increasing or decreasing pitch with FastPitch generates speech that resembles the voluntary modulation of voice.  ...  ACKNOWLEDGEMENTS The author would like to thank Dabi Ahn, Alvaro Garcia, and Grzegorz Karch for their help with the experiments and evaluation of the model, and Jan Chorowski, João Felipe Santos, Przemek  ... 
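The uniform pitch modulation mentioned in the snippet can be sketched as a simple transposition of the predicted F0 contour before it conditions the decoder. This is an illustrative sketch, not code from the FastPitch repository; the function name and semitone parameterization are assumptions.

```python
import numpy as np

def shift_pitch(f0_contour, semitones):
    """Uniformly transpose a predicted F0 contour by a number of semitones.

    Unvoiced frames (F0 == 0) are left untouched. A positive shift raises
    pitch; e.g. +12 semitones doubles the frequency.
    """
    f0 = np.asarray(f0_contour, dtype=float)
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0 > 0, f0 * factor, f0)

# Raising a 220 Hz frame by an octave yields 440 Hz; unvoiced frames stay 0.
print(shift_pitch([220.0, 0.0, 330.0], 12))
```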
arXiv:2006.06873v2 fatcat:6nklt2iqobblxk2z4ah5xqfsrq

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction [article]

Stanislav Beliaev, Boris Ginsburg
2021 arXiv   pre-print
The first network predicts grapheme durations. An input text is expanded by repeating each symbol according to the predicted duration. The second network predicts pitch value for every mel frame.  ...  We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model consists of three feed-forward convolutional networks.  ...  The third network generates mel-spectrograms from an expanded text and predicted pitch.  ... 
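The expansion step described in the snippet (repeating each symbol according to its predicted duration, the "length regulator" idea) can be sketched in a few lines. Names are illustrative, not taken from the TalkNet code.

```python
def expand_by_duration(symbols, durations):
    """Repeat each input symbol `durations[i]` times so the expanded
    sequence aligns one-to-one with mel-spectrogram frames."""
    expanded = []
    for sym, dur in zip(symbols, durations):
        expanded.extend([sym] * dur)
    return expanded

# "cat" with predicted durations [2, 3, 1] expands to 6 frame-level symbols.
print(expand_by_duration(list("cat"), [2, 3, 1]))
```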
arXiv:2104.08189v3 fatcat:uzwpb7odgbfqvnnvjg26dpgree

Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings [article]

Oktai Tatanov, Stanislav Beliaev, Boris Ginsburg
2021 arXiv   pre-print
Both versions have a small number of parameters and enable much faster speech synthesis compared to the models with similar quality.  ...  The basic Mixer-TTS contains pitch and duration predictors, with the latter being trained with an unsupervised TTS alignment framework.  ...  Speech-to-text alignment framework Most non-autoregressive TTS models with duration prediction rely on durations extracted from external sources.  ... 
arXiv:2110.03584v2 fatcat:egxf3343ozbp5bvzfftjsiodce

Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows [article]

Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro
2022 arXiv   pre-print
Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch  ...  Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting.  ...  As with works such as FastPitch (Łańcucki, 2021) and FastSpeech 2, we assume our mel decoder is further conditioned on pitch (F0) and energy information: Timed Text Representation: Φ_text is a C × T  ... 
arXiv:2203.01786v2 fatcat:7f2niskbwvehretdwfsa6fi2xq

Digital Einstein Experience: Fast Text-to-Speech for Conversational AI [article]

Joanna Rownicka, Kilian Sprenkamp, Antonio Tripiana, Volodymyr Gromoglasov, Timo P Kunz
2021 arXiv   pre-print
Our solution utilizes FastSpeech 2 for log-scaled mel-spectrogram prediction from phonemes and Parallel WaveGAN to generate the waveforms.  ...  To create a voice which fits the context well, we first design a voice character and produce recordings which correspond to the desired speech attributes. We then model the voice.  ...  We are also grateful to UneeQ for giving us the opportunity to complement one of their digital humans. References  ... 
arXiv:2107.10658v1 fatcat:ijh3ccdbpvch5i5km5f2dg43sm

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Due to the high complexity and low efficiency of traditional speech synthesis technology, current research focuses on deep learning-based end-to-end speech synthesis, which  ...  It mainly consists of three modules: text front-end, acoustic model, and vocoder.  ...  FastPitch [117] adds a pitch prediction network to FastSpeech to control pitch.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Non-autoregressive sequence-to-sequence voice conversion [article]

Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
2021 arXiv   pre-print
Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem.  ...  Furthermore, we extend variance predictors to variance converters to explicitly convert the source speaker's prosody components such as pitch and energy into the target speaker.  ...  (ASR) or text-to-speech (TTS) pretraining model [14] .  ... 
arXiv:2104.06793v1 fatcat:yf6bceaizzbvtmka6j5onajw6m

Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis [article]

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, Seong-Whan Lee
2020 arXiv   pre-print
While generative adversarial network (GAN) based neural text-to-speech (TTS) systems have shown significant improvement in neural speech synthesis, there is no TTS system that learns to synthesize speech
arXiv:2012.07267v1 fatcat:cms2ugs23jhpvnneq57yjrk2fy

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions.  ...  predict prosody features with text encoder.  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi

TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis [article]

Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
2020 arXiv   pre-print
In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for improving the quality of a non-autoregressive (non-AR) TTS system.  ...  In this method, large-scale synthetic corpora including text-waveform pairs with phoneme duration are generated by the AR TTS system and then used to train the target non-AR model.  ...  INTRODUCTION Recently proposed end-to-end text-to-speech (TTS) systems, which generate a speech signal directly from an input text, provide high-quality synthetic speech [1] [2] [3] [4] [5] .  ... 
arXiv:2010.13421v1 fatcat:lpoc5hsnfzctdhi2dm7srw2boe

Revisiting Over-Smoothness in Text to Speech [article]

Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu
2022 arXiv   pre-print
Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed.  ...  One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results.  ...  Preliminary Study Text-to-speech mapping is a one-to-many mapping since multiple speech sequences can possibly correspond to a text sequence with different pitch, duration and prosody, making the mel-spectrograms  ... 
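The one-to-many argument in the snippet can be illustrated with toy numbers: when several distinct targets correspond to the same text, the optimal point prediction under an L2 loss is their mean, which matches none of them — the over-smoothing effect the paper revisits. The values below are arbitrary toy data, not from the paper.

```python
import numpy as np

# Two plausible (toy) spectrogram slices for the same text input.
targets = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0]])

# An L2-trained regressor converges to the per-bin mean of the targets:
# a flat, "blurry" prediction that reproduces neither mode.
l2_optimal = targets.mean(axis=0)
print(l2_optimal)
```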
arXiv:2202.13066v1 fatcat:b5dudahdmje2hhgodq5jouur4a

ESPnet2-TTS: Extending the Edge of TTS Research [article]

Tomoki Hayashi and Ryuichi Yamamoto and Takenori Yoshimura and Peter Wu and Jiatong Shi and Takaaki Saeki and Yooncheol Ju and Yusuke Yasuda and Shinnosuke Takamichi and Shinji Watanabe
2021 arXiv   pre-print
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.  ...  extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance.  ...  The extended FastSpeech 2 predicts token-averaged energy and pitch sequences instead of raw pitch and energy sequences.  ... 
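The token-averaged variant mentioned in the snippet reduces frame-level pitch/energy contours to one value per token by averaging over each token's duration span. A minimal sketch, with assumed names and no reference to the actual ESPnet2 implementation:

```python
import numpy as np

def token_average(frame_values, durations):
    """Average a frame-level feature (pitch or energy) over each token's
    duration, yielding one value per token instead of one per frame."""
    out, start = [], 0
    for dur in durations:
        segment = frame_values[start:start + dur]
        out.append(float(np.mean(segment)) if dur > 0 else 0.0)
        start += dur
    return out

# 5 frames split into two tokens of durations 2 and 3.
print(token_average([1.0, 1.0, 3.0, 3.0, 3.0], [2, 3]))
```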
arXiv:2110.07840v1 fatcat:tbif7rxvzjah7pwq73eqportyi

Creation and Detection of German Voice Deepfakes [article]

Vanessa Barnekow, Dominik Binder, Niclas Kromrey, Pascal Munaretto, Andreas Schaad, Felix Schmieder
2021 arXiv   pre-print
With a focus on German language and an online teaching environment we discuss the societal implications as well as demonstrate how to use machine learning techniques to possibly detect such fakes.  ...  A user study with more than 100 participants shows how difficult it is to identify real and fake voice (on avg. only 37 percent can distinguish between real and fake voice of a professor).  ...  for text-to-speech.  ... 
arXiv:2108.01469v1 fatcat:mppnenow3bdgfeqtruqfa2b7eu

Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation

Siddique Latif, Inyoung Kim, Ioan Calapodescu, Laurent Besacier
2021 Proceedings of the 25th Conference on Computational Natural Language Learning   unpublished
While End-to-End Text-to-Speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody.  ...  For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge.  ...  Acknowledgements We thank Jennifer, our American English speaker, for her professional speech recordings made following our precise instructions.  ... 
doi:10.18653/v1/2021.conll-1.42 fatcat:ed3kkahkb5d5tbi4msg6mw7z2i

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech [article]

Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang
2022 arXiv   pre-print
Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems.  ...  Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.  ...  Additionally, FastSpeech 2 predicts pitch and energy from the encoder output, which is also supervised with pitch contours and L2-norm of signal amplitudes as labels respectively.  ... 
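The energy labels described in the snippet (the L2 norm of signal amplitudes per frame) can be sketched directly. Frame and hop lengths are illustrative parameters, not values from the paper.

```python
import numpy as np

def frame_energy(waveform, frame_len, hop_len):
    """Per-frame energy label: the L2 norm of the signal amplitudes
    falling inside each analysis frame."""
    waveform = np.asarray(waveform, dtype=float)
    n_frames = 1 + max(0, len(waveform) - frame_len) // hop_len
    return np.array([
        np.linalg.norm(waveform[i * hop_len : i * hop_len + frame_len])
        for i in range(n_frames)
    ])

# A constant signal of ones: each 4-sample frame has norm sqrt(4) = 2.
print(frame_energy(np.ones(8), frame_len=4, hop_len=4))
```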
arXiv:2205.04120v1 fatcat:mafoyi6dt5btpedl77i35wbyou