Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data
[article]
2021
arXiv
pre-print
However, training these models typically requires a large amount of high-fidelity speech data, and for unseen texts, the prosody of synthesized speech is relatively unnatural. ...
Recent advancements in end-to-end speech synthesis have made it possible to generate highly natural speech. ...
To verify its effectiveness in improving the prosody of synthesized speech for unseen texts, we pre-train the duration predictor on a large-scale noisy dataset and on a relatively small clean dataset, ...
arXiv:2111.07549v1
fatcat:yyq53ir4xrhdlmk32d432mppjm
Hybrid Framework for Speaker-independent Emotion Conversion using I-vector PLDA and Neural Network
2019
IEEE Access
Speaker and text-independent emotion conversion are challenging modeling problems in this paradigm. ...
German (EmoDB), Telugu (IITKGP), and English (SAVEE). The proposed approach delivered superior performance compared to the baseline under both clean and noisy data conditions considered for analysis. ...
ACKNOWLEDGMENT The authors sincerely thank all the native and foreign (German) listeners who participated in the perception tests. ...
doi:10.1109/access.2019.2923003
fatcat:azbzeoxx4vahlazabykyvcw3jm
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
[article]
2019
arXiv
pre-print
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. ...
We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural ...
Acknowledgements The authors thank Heiga Zen, Yuxuan Wang, Samy Bengio, the Google AI Perception team, and the Google TTS and DeepMind Research teams for their helpful discussions and feedback. ...
arXiv:1806.04558v4
fatcat:wwxuxx42j5bvpgonydabet5gk4
Hierarchical Generative Modeling for Controllable Speech Synthesis
[article]
2018
arXiv
pre-print
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking ...
In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and use it to synthesize clean speech ...
Saurous, William Chan, RJ Skerry-Ryan, Eric Battenberg, and the Google Brain, Perception and TTS teams for their helpful feedback and discussions. ...
arXiv:1810.07217v2
fatcat:6xyu5omwfzdedplwuwpghlf6hq
Enhance the Word Vector with Prosodic Information for the Recurrent Neural Network Based TTS System
2016
Interspeech 2016
However, these word vectors trained from text data may encode insufficient information related to speech. ...
Besides, we also show that the enhanced vectors provide better initial values than the raw vectors for error back-propagation of the network, which results in further improvement. ...
Shinji Takaki was supported in part by NAVER Labs. ...
doi:10.21437/interspeech.2016-390
dblp:conf/interspeech/WangTY16
fatcat:wa3fmedhvzcyladz6l75pmzdhm
On Controlled DeEntanglement for Natural Language Processing
[article]
2019
arXiv
pre-print
I conclude this write-up with a roadmap of experiments that show the applicability of this framework to scalability, flexibility and interpretability. ...
Thus far, AI has made significant progress in low-stake, low-risk scenarios such as playing Go, and we are currently in a transition toward medium-stake scenarios such as Visual Dialog. ...
Emphatic Text to Speech I am interested in investigating approaches to incorporate automatically derivable information from speech into the model architecture for better modeling and controlling prosody ...
arXiv:1909.09964v1
fatcat:mi5wm7pnxrddplwluwyqlauuoe
Expressive speech synthesis: a review
2012
International Journal of Speech Technology
The review provided in this paper includes a review of the various approaches to text-to-speech synthesis, various studies on the analysis and estimation of expressive parameters, and various studies on ...
In this approach, the ESS is achieved by modifying the parameters of the neutral speech which is synthesized from the text. ...
Acknowledgement The work done in this paper is funded by the ongoing UK-India Education Research Initiative (UKIERI) project titled "study of source features for speech synthesis and speaker recognition ...
doi:10.1007/s10772-012-9180-2
fatcat:syjgawdjbbdapmdq6d6h5ulzni
Review of end-to-end speech synthesis technology based on deep learning
[article]
2021
arXiv
pre-print
Moreover, this paper also summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and ...
It mainly consists of three modules: text front-end, acoustic model, and vocoder. ...
In order to avoid ignoring text information during synthesis and thus generating wrong speech, Liu et al. ...
arXiv:2104.09995v1
fatcat:q5lx74ycx5hobjox4ktl3amfta
Karaoker: Alignment-free singing voice synthesis with speech training data
[article]
2022
arXiv
pre-print
Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. ...
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. ...
The most probable reason for this is that linguistic information is derived from the formant information within the model rather than from the text. ...
arXiv:2204.04127v1
fatcat:p32qhfq4m5enrhhri2sea4mgqm
Voice Conversion
[chapter]
2012
Speech Enhancement, Modeling and Recognition- Algorithms and Applications
In the synthesis phase, the trained HMMs are used to generate speech parameters for text unseen in the training data. ...
Speech conveys a variety of information that can be categorized, for example, into linguistic and nonlinguistic information. ...
This book on Speech Processing consists of seven chapters written by eminent researchers from Italy, Canada, India, Tunisia, Finland and The Netherlands. ...
doi:10.5772/37334
fatcat:2hgxvblj4rccvasfudopppuiau
Perception in Black and White: Effects of Intonational Variables and Filtering Conditions on Sociolinguistic Judgments With Implications for ASR
2021
Frontiers in Artificial Intelligence
, with implications for automatic speech recognition systems as well as speech synthesis. ...
These results enhance our understanding of cues listeners rely on in making social judgments about speakers, especially in ethnic identification and linguistic profiling, by highlighting perceptual differences ...
improvements in speech synthesis. ...
doi:10.3389/frai.2021.642783
fatcat:g5ggrnjnozdbnkrayg3y54j7pu
Sequence-to-Sequence Emotional Voice Conversion with Strength Control
2021
IEEE Access
By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. ...
This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. ...
Mel-spectrogram effectively implies various information in speech, not only linguistic but also non-linguistic, such as the speaker and the emotion. ...
doi:10.1109/access.2021.3065460
fatcat:wk263sv73rfbvok2q4ri5ort7y
Emotional speech synthesis with rich and granularized control
[article]
2019
arXiv
pre-print
This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. ...
Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm to the conventional methods. ...
INTRODUCTION The objective of a text-to-speech (TTS) system is to synthesize human-like speech signals such that linguistic and paralinguistic information can be conveyed clearly. ...
arXiv:1911.01635v2
fatcat:ht35e4eevndcnmzwawgue2ocdu
Whispered and Lombard Neural Speech Synthesis
[article]
2021
arXiv
pre-print
In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. ...
It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. ...
In the pre-training process, a multi-speaker Tacotron using only linguistic features as input is trained to learn a general text-to-speech task. ...
arXiv:2101.05313v1
fatcat:2khnwh3h4vblbdai4smfsr7v7m
Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System
2019
Interspeech 2019
Thus, in addition to the linguistic information embedded in the input text, human speech contains a lot of other information about, for example, the speaker identity and emotions. ...
with the unseen speaker data. ...
doi:10.21437/interspeech.2019-1333
dblp:conf/interspeech/BollepalliJA19
fatcat:5uz43svog5erzev5nzakdnc4qe