Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data [article]

Zhu Li, Yuqing Zhang, Mengxi Nie, Ming Yan, Mengnan He, Ruixiong Zhang, Caixia Gong
2021 arXiv   pre-print
However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural.  ...  Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech.  ...  To verify its effectiveness in improving the prosody of synthesized speech for unseen texts, we pre-train the duration predictor on a large-scale noisy dataset and on a relatively small clean dataset,  ... 
arXiv:2111.07549v1 fatcat:yyq53ir4xrhdlmk32d432mppjm

Hybrid Framework for Speaker-independent Emotion Conversion using I-vector PLDA and Neural Network

Susmitha Vekkot, Deepa Gupta, Mohammed Zakariah, Yousef Ajami Alotaibi
2019 IEEE Access  
Speaker and text-independent emotion conversion are challenging modeling problems in this paradigm.  ...  German (EmoDB), Telugu (IITKGP), and English (SAVEE). The proposed approach delivered superior performance compared to the baseline under both clean and noisy data conditions considered for analysis.  ...  ACKNOWLEDGMENT The authors sincerely thank all the native and foreign (German) listeners who participated in the perception tests.  ... 
doi:10.1109/access.2019.2923003 fatcat:azbzeoxx4vahlazabykyvcw3jm

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis [article]

Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu
2019 arXiv   pre-print
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training.  ...  We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural  ...  Acknowledgements The authors thank Heiga Zen, Yuxuan Wang, Samy Bengio, the Google AI Perception team, and the Google TTS and DeepMind Research teams for their helpful discussions and feedback.  ... 
arXiv:1806.04558v4 fatcat:wwxuxx42j5bvpgonydabet5gk4

Hierarchical Generative Modeling for Controllable Speech Synthesis [article]

Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
2018 arXiv   pre-print
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking  ...  In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and use it to synthesize clean speech  ...  Saurous, William Chan, RJ Skerry-Ryan, Eric Battenberg, and the Google Brain, Perception and TTS teams for their helpful feedback and discussions.  ... 
arXiv:1810.07217v2 fatcat:6xyu5omwfzdedplwuwpghlf6hq

Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System

Xin Wang, Shinji Takaki, Junichi Yamagishi
2016 Interspeech 2016  
However, these word vectors trained from text data may encode insufficient information related to speech.  ...  Besides, we also show that the enhanced vectors provide better initial values than the raw vectors for error back-propagation of the network, which results in further improvement.  ...  Shinji Takaki was supported in part by the NAVER Labs.  ... 
doi:10.21437/interspeech.2016-390 dblp:conf/interspeech/WangTY16 fatcat:wa3fmedhvzcyladz6l75pmzdhm

On Controlled DeEntanglement for Natural Language Processing [article]

SaiKrishna Rallabandi
2019 arXiv   pre-print
I conclude this writeup by a roadmap of experiments that show the applicability of this framework to scalability, flexibility and interpretability.  ...  Thus far, AI has made significant progress in low stake low risk scenarios such as playing Go and we are currently in a transition toward medium stake scenarios such as Visual Dialog.  ...  Emphatic Text to Speech I am interested in investigating approaches to incorporate automatically derivable information from speech into the model architecture for better modeling and controlling prosody  ... 
arXiv:1909.09964v1 fatcat:mi5wm7pnxrddplwluwyqlauuoe

Expressive speech synthesis: a review

D. Govind, S. R. Mahadeva Prasanna
2012 International Journal of Speech Technology  
The review provided in this paper includes a review of the various approaches for text to speech synthesis, various studies on the analysis and estimation of expressive parameters and various studies on  ...  In this approach, the ESS is achieved by modifying the parameters of the neutral speech which is synthesized from the text.  ...  Acknowledgement The work done in this paper is funded by the ongoing UK-India Education Research Initiative (UKIERI) project titled "study of source features for speech synthesis and speaker recognition  ... 
doi:10.1007/s10772-012-9180-2 fatcat:syjgawdjbbdapmdq6d6h5ulzni

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Moreover, this paper also summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  It mainly consists of three modules: text front-end, acoustic model, and vocoder.  ...  In order to avoid ignoring text information during synthesis and thus generating wrong speech, Liu et al.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Karaoker: Alignment-free singing voice synthesis with speech training data [article]

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris
2022 arXiv   pre-print
Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information.  ...  We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result.  ...  The most probable reason for this, is that linguistic information is derived from the formant information within the model rather than from the textual one.  ... 
arXiv:2204.04127v1 fatcat:p32qhfq4m5enrhhri2sea4mgqm

Voice Conversion [chapter]

Jani Nurminen, Hanna Silén, Victor Popa, Elina Helander, Moncef Gabbouj
2012 Speech Enhancement, Modeling and Recognition- Algorithms and Applications  
In the synthesis phase, the trained HMMs are used to generate speech parameters for text unseen in the training data.  ...  Speech conveys a variety of information that can be categorized, for example, into linguistic and nonlinguistic information.  ...  This book on Speech Processing consists of seven chapters written by eminent researchers from Italy, Canada, India, Tunisia, Finland and The Netherlands.  ... 
doi:10.5772/37334 fatcat:2hgxvblj4rccvasfudopppuiau

Perception in Black and White: Effects of Intonational Variables and Filtering Conditions on Sociolinguistic Judgments With Implications for ASR

Nicole R. Holliday
2021 Frontiers in Artificial Intelligence  
, with implications for automatic speech recognition systems as well as speech synthesis.  ...  These results enhance our understanding of cues listeners rely on in making social judgments about speakers, especially in ethnic identification and linguistic profiling, by highlighting perceptual differences  ...  improvements in speech synthesis.  ... 
doi:10.3389/frai.2021.642783 fatcat:g5ggrnjnozdbnkrayg3y54j7pu

Sequence-to-Sequence Emotional Voice Conversion with Strength Control

Heejin Choi, Minsoo Hahn
2021 IEEE Access  
By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength.  ...  This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability.  ...  Mel-spectrogram effectively implies various information in speech, not only linguistic but also non-linguistic, such as the speaker and the emotion.  ... 
doi:10.1109/access.2021.3065460 fatcat:wk263sv73rfbvok2q4ri5ort7y

Emotional speech synthesis with rich and granularized control [article]

Se-Yun Um, Sangshin Oh, Kyungguen Byun, Inseon Jang, Chunghyun Ahn, Hong-Goo Kang
2019 arXiv   pre-print
This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system.  ...  Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm to the conventional methods.  ...  INTRODUCTION The objective of a text-to-speech (TTS) system is to synthesize human-like speech signals such that linguistic and paralinguistic information can be conveyed clearly.  ... 
arXiv:1911.01635v2 fatcat:ht35e4eevndcnmzwawgue2ocdu

Whispered and Lombard Neural Speech Synthesis [article]

Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan
2021 arXiv   pre-print
In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data.  ...  It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user.  ...  In the pre-training process, a multi-speaker Tacotron using only linguistic features as input is trained to learn a general text-to-speech task.  ... 
arXiv:2101.05313v1 fatcat:2khnwh3h4vblbdai4smfsr7v7m

Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku
2019 Interspeech 2019  
Thus, in addition to the linguistic information embedded in the input text, human speech contains a lot of other information about, for example, the speaker identity and emotions.  ...  with the unseen speaker data.  ... 
doi:10.21437/interspeech.2019-1333 dblp:conf/interspeech/BollepalliJA19 fatcat:5uz43svog5erzev5nzakdnc4qe