1,325 Hits in 8.0 sec

Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition [article]

Xiong Cai, Dongyang Dai, Zhiyong Wu, Xiang Li, Jingbei Li, Helen Meng
2021 arXiv   pre-print
Then, we use emotion labels predicted for the TTS dataset by the trained SER model to build an auxiliary SER task that is jointly trained with the TTS model.  ...  Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with extra emotion labels.  ...  Therefore, some semi-supervised approaches have been proposed to alleviate the burden of data requirements.  ...
arXiv:2010.13350v2 fatcat:5vjmd7flafbedcy4smte6kgdmi
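The labeling step this entry describes (a SER model trained on an emotion-labeled corpus assigning pseudo-labels to an unlabeled TTS corpus) can be sketched as follows. All names here (`ser_predict`, `EMOTIONS`, the scalar "features") are illustrative placeholders, not from the paper:

```python
# Hypothetical sketch of SER-based pseudo-labeling for a TTS corpus.
# A trained SER model scores each unlabeled utterance per emotion; the
# argmax score becomes that utterance's emotion label for TTS training.

EMOTIONS = ["neutral", "happy", "sad", "angry"]

def ser_predict(utterance_features):
    """Stand-in for a trained SER model: return one score per emotion.
    Here a trivial rule on a single scalar feature, for illustration only."""
    return [1.0 / (1.0 + abs(utterance_features - i)) for i in range(len(EMOTIONS))]

def pseudo_label(tts_corpus):
    """Attach the argmax SER prediction to every unlabeled utterance."""
    labeled = []
    for features in tts_corpus:
        scores = ser_predict(features)
        label = EMOTIONS[scores.index(max(scores))]
        labeled.append((features, label))
    return labeled

corpus = [0.1, 1.2, 2.9, 3.4]
print(pseudo_label(corpus))
```

In the paper's setting the pseudo-labels then supervise an auxiliary SER task jointly trained with the TTS model; the sketch covers only the labeling pass.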

Interactive Text-to-Speech System via Joint Style Analysis [article]

Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Kohler, Christian Fuegen, Qing He
2020 arXiv   pre-print
To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset and applied transfer learning to learn the style embedding jointly.  ...  To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then used by the TTS to produce a matching response.  ...  end-to-end emotion modeling [9] .  ... 
arXiv:2002.06758v2 fatcat:gu6sszljkzagzod2kho3kftvy4

Interactive Text-to-Speech System via Joint Style Analysis

Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Köhler, Christian Fuegen, Qing He
2020 Interspeech 2020  
To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset and applied transfer learning to learn the style embedding jointly.  ...  To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then used by the TTS to produce a matching response.  ...  end-to-end emotion modeling [9] .  ... 
doi:10.21437/interspeech.2020-3069 dblp:conf/interspeech/GaoZYKFH20 fatcat:bpqckrnp3jdbhhdlfr7hv5lyqm

Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS

Alexander Sorin, Slava Shechtman, Ron Hoory
2020 Interspeech 2020  
We propose a novel semi-supervised technique that enables expressive style control and cross-speaker transfer in neural text to speech (TTS), when available training data contains a limited amount of labeled  ...  Furthermore, this technique provides control over the speech rate, pitch level, and articulation type that can be used for TTS voice transformation.  ...  This led us to conclude that the tokens trained on our data failed to capture the target expressive styles.  ... 
doi:10.21437/interspeech.2020-1854 dblp:conf/interspeech/SorinSH20 fatcat:37ruymsmjbfdlpttx4adxv7pxe

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad  ...  We further summarize resources related to TTS (e.g., datasets, open-source implementations) and discuss future research directions.  ...  Semi-Supervised Learning for Control: Some attributes used to control the speech include pitch, duration, energy, prosody, emotion, speaker, noise, etc.  ...
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi
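One common way to realize the attribute control this survey excerpt lists (pitch, duration, energy, etc.) is to embed each control signal and combine it with the text encoding before decoding. The sketch below uses simple concatenation, which is one conventional choice rather than anything the survey prescribes; all shapes and names are assumptions:

```python
import numpy as np

# Illustrative attribute conditioning: each control attribute contributes an
# embedding, and the stacked control vector is appended to every frame of
# the text encoder output before it reaches the decoder.

rng = np.random.default_rng(1)

def condition(text_encoding, attributes, dim=4):
    """Append one scaled embedding per control attribute to each encoder frame."""
    embeds = [rng.normal(size=dim) * v for v in attributes.values()]
    control = np.concatenate(embeds)  # fixed-size control vector
    tiled = np.tile(control, (len(text_encoding), 1))  # broadcast over frames
    return np.concatenate([text_encoding, tiled], axis=1)

enc = rng.normal(size=(6, 8))  # 6 frames, 8-dim encoder output
out = condition(enc, {"pitch": 0.5, "energy": 1.2})
print(out.shape)  # (6, 16)
```

Real systems often predict such attributes with dedicated variance adaptors instead of taking them as fixed scalars; the concatenation pattern is the same.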

Semi-Supervised Generative Modeling for Controllable Speech Synthesis [article]

Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby
2019 arXiv   pre-print
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models.  ...  We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 1% (30 minutes) supervision  ...  The work most similar to ours is Wu et al. (2019), which also attempts to achieve affect control using semi-supervision with a heuristic approach based on Global Style Tokens.  ...
arXiv:1910.01709v1 fatcat:hme6rv53vncsrbxcvmkncl3blu
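The semi-supervised objective this entry describes, where only about 1% of utterances carry attribute labels, can be sketched as a per-batch loss that applies generative (reconstruction + KL) terms everywhere and a supervision term only on the labeled minority. This is a generic scheme under assumed names (`recon`, `kl`, `label_nll`, `sup_weight`), not the paper's exact objective:

```python
# Illustrative semi-supervised loss: every utterance contributes ELBO-style
# terms; the rare labeled utterance additionally contributes a (weighted)
# negative log-likelihood on its attribute label.

def semi_supervised_loss(examples, sup_weight=10.0):
    """examples: list of dicts with 'recon', 'kl', and optional 'label_nll'."""
    total = 0.0
    for ex in examples:
        total += ex["recon"] + ex["kl"]       # unsupervised terms, all data
        if "label_nll" in ex:                 # labeled minority only
            total += sup_weight * ex["label_nll"]
    return total / len(examples)

batch = [
    {"recon": 1.0, "kl": 0.1},                    # unlabeled utterance
    {"recon": 0.8, "kl": 0.2, "label_nll": 0.5},  # the labeled minority
]
print(semi_supervised_loss(batch))  # (1.1 + 1.0 + 5.0) / 2 = 3.55
```

Up-weighting the scarce supervision term (here via `sup_weight`) is a common way to keep a tiny labeled fraction from being drowned out by the unsupervised loss.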

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Due to the high complexity and low efficiency of traditional speech synthesis technology, the current research focus is deep learning-based end-to-end speech synthesis technology, which  ...  Moreover, this paper also summarizes open-source speech corpora of English, Chinese, and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  Chung YA, Wang Y, Hsu WN, Zhang Y, Skerry-Ryan R (2019) Semi-supervised training for improving data efficiency in end-to-end speech synthesis.  ...
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Emotional Prosody Control for Speech Generation [article]

Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
2021 arXiv   pre-print
We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a sample of their speech.  ...  Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, average variation learned from prosody sequences in training data, or transferred from  ...  [22] proposed a framework to learn a bank of style embeddings called "Global Style Tokens" (GST) that are jointly trained within Tacotron (without any explicit supervision).  ...
arXiv:2111.04730v1 fatcat:nmksy5wirjewhpndpoa7vsscmy
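The Global Style Tokens idea referenced in this snippet can be sketched in a few lines: a reference encoding attends over a learned bank of token embeddings, and the attention-weighted sum becomes the style embedding. The actual GST model uses multi-head attention and trains the bank jointly with Tacotron; the single-head softmax below is a simplified stand-in with assumed shapes:

```python
import numpy as np

# Minimal single-head sketch of attention over a "Global Style Tokens" bank.

rng = np.random.default_rng(0)
num_tokens, dim = 10, 4
token_bank = rng.normal(size=(num_tokens, dim))  # learned style tokens

def style_embedding(reference_encoding):
    """Soft attention of a reference encoding over the token bank."""
    logits = token_bank @ reference_encoding      # (num_tokens,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over tokens
    return weights @ token_bank                   # (dim,) style embedding

ref = rng.normal(size=dim)
emb = style_embedding(ref)
print(emb.shape)  # (4,)
```

At synthesis time the token weights can also be set by hand, which is what makes the token bank usable as an explicit style control interface.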

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit [article]

Tomoki Hayashi and Ryuichi Yamamoto and Katsuki Inoue and Takenori Yoshimura and Shinji Watanabe and Tomoki Toda and Kazuya Takeda and Yu Zhang and Xu Tan
2020 arXiv   pre-print
., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models.  ...  This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet.  ...  For example, we provide the recipe based on ASR-TTS cycle consistency training [29] and semi-supervised training using ASR and TTS [28] .  ... 
arXiv:1910.10909v2 fatcat:kbw2m3edhzdetmeepricf75wxm

Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [article]

Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
2021 arXiv   pre-print
This sequence is used in parallel with the phoneme sequence to condition the decoder via a prosodic encoder and a corresponding attention module.  ...  Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set.  ...  The task of integrating prosodic control mechanisms into neural end-to-end speech synthesis has been in the limelight, as extensive research is conducted to increase controllability and expressiveness  ...
arXiv:2111.10177v1 fatcat:7wa5o5yqsbfale6juzhkok3m24
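The discretization step this entry describes (clustering phoneme-level F0 values so each phoneme carries a discrete prosody label alongside its identity) can be sketched with a tiny 1-D k-means. The specific algorithm, `k`, and the Hz values below are assumptions standing in for the paper's clustering:

```python
import numpy as np

# Toy prosodic clustering: quantize per-phoneme F0 into k discrete classes,
# yielding a prosody-ID sequence parallel to the phoneme sequence.

def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means with evenly spaced initial centers."""
    values = np.asarray(values, dtype=float)
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        # Assign each value to its nearest center, then recompute centers.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return centers, labels

f0 = [110, 115, 180, 185, 250, 255]  # hypothetical per-phoneme F0 in Hz
centers, prosody_ids = kmeans_1d(f0, k=3)
print(list(prosody_ids))  # [0, 0, 1, 1, 2, 2]
```

The resulting ID sequence is what a prosodic encoder would consume in parallel with the phoneme sequence.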

Machine Speech Chain

Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data.  ...  Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without  ...  [Table caption: semi-supervised learning method, evaluated on the test_eval92 set without any lexicon or language model in the decoding step.]  ...
doi:10.1109/taslp.2020.2977776 fatcat:ifwp3m3usnbnra4eb6l5gncere
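The closed-loop training this entry describes can be sketched as two half-cycles: for unpaired speech, an ASR hypothesis closes the loop through TTS; for unpaired text, synthetic TTS speech closes the loop through ASR. The identity-like "models" and the mismatch-count loss below are placeholders for real networks and losses:

```python
# Toy speech-chain sketch. Real systems use neural ASR/TTS models and
# sequence losses; here case conversion stands in for both directions.

def asr(speech):  # placeholder ASR: speech -> text
    return speech.upper()

def tts(text):    # placeholder TTS: text -> speech
    return text.lower()

def chain_loss(a, b):
    """Stand-in distance between two sequences (positionwise mismatch count)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

# Unpaired speech: ASR hypothesis supervises TTS reconstruction.
speech = "hello"
loss_speech = chain_loss(tts(asr(speech)), speech)

# Unpaired text: TTS output supervises ASR reconstruction.
text = "WORLD"
loss_text = chain_loss(asr(tts(text)), text)

print(loss_speech, loss_text)  # 0 0 for these lossless placeholders
```

The point of the closed loop is exactly that both losses are computable without paired speech-text data, which is how the concatenation of labeled and unlabeled data becomes trainable.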

Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Kurniawati Azizah, Mirna Adriani, Wisnu Jatmiko
2020 IEEE Access  
ACKNOWLEDGEMENT The authors thank the Tokopedia-UI AI Center team and Lab 1231 Fasilkom UI team for their helpful discussion and feedback.  ...  Semi-supervised training was proposed by [27] to make use of textual and acoustic knowledge from large non-parallel text and speech corpora when training end-to-end TTS with a small amount of parallel data  ...  END-TO-END DNN-BASED TTS: A recent promising approach beyond parametric speech synthesis (BPSS) is the end-to-end TTS system, which combines the main stages of the TTS process into a DNN framework that can be trained  ...
doi:10.1109/access.2020.3027619 fatcat:w3gjodau3jd43jsexlnvpcc3bm

Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku
2019 Interspeech 2019  
Publication VII used the CSMAPLR techniques to adapt HMMs of speech in a normal speaking style to speech in Lombard style.  ...  Variations including speaker characteristics, speaking styles, and emotions are necessary to make synthetic speech expressive, which is crucial for successful communication.  ...
doi:10.21437/interspeech.2019-1333 dblp:conf/interspeech/BollepalliJA19 fatcat:5uz43svog5erzev5nzakdnc4qe

Listening while Speaking: Speech Chain by Deep Learning [article]

Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
2017 arXiv   pre-print
The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data.  ...  Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without  ...  We add the binary prediction layer because the output from the first and second decoder layers is a real-valued vector, and we cannot use an end-of-sentence (eos) token to determine when to stop the generation  ...
arXiv:1707.04879v1 fatcat:5bp77f6l3na6lazmepofi55pn4

LSESpeak: A spoken language generator for Deaf people

Verónica López-Ludeña, Roberto Barra-Chicote, Syaheerah Lutfi, Juan Manuel Montero, Rubén San-Segundo
2013 Expert systems with applications  
The emotional TTS converter is based on Hidden Semi-Markov Models (HSMMs) permitting voice gender, type of emotion, and emotional strength to be controlled.  ...  Both translation tools use a phrase-based translation strategy where translation and target language models are trained from parallel corpora.  ...  Authors also thank Mark Hallett for the English revision of this paper and all the other members of the Speech Technology Group for the continuous and fruitful discussion on these topics.  ... 
doi:10.1016/j.eswa.2012.08.062 fatcat:qwzmz64jmbebnoc64l4n6ojza4
Showing results 1–15 of 1,325