Dynamic Prosody Generation for Speech Synthesis Using Linguistics-Driven Acoustic Embedding Selection

Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba
Interspeech 2020
Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human levels. What is still lacking for human-like communication, however, is the dynamic variation and adaptability of human speech in more complex scenarios. This work addresses the problem of achieving more dynamic and natural intonation in TTS systems, particularly for stylistic speech such as the newscaster speaking style. We propose a novel way of exploiting linguistic information in VAE systems to drive dynamic prosody generation, and we analyze the contribution of both semantic and syntactic features. Our results show that the approach improves prosody and naturalness for complex utterances as well as in Long Form Reading (LFR). Samples are available at: https://www.amazon.science/blog/more-natural-prosody-for-synthesized-speech

The objective of this work is to exploit sentence-level prosody variations available in the training dataset while synthesizing speech for a test sentence. The proposed approach executes the following steps:
(i) generate suitable vector representations containing linguistic information for all sentences in the train and test sets;
(ii) measure the similarity of the test sentence with each sentence in the train set, using cosine similarity between the vector representations, as done in [15], to evaluate linguistic similarity (LS);
(iii) choose the acoustic embedding of the train sentence with the highest similarity to the test sentence;
(iv) synthesize speech via VAE-based inference using this acoustic embedding.
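The selection step described above (steps ii and iii) can be sketched as a cosine-similarity lookup over the training set. The function name, array shapes, and variable names below are illustrative assumptions, not the authors' implementation; the linguistic vectors and acoustic embeddings are taken as given.

```python
import numpy as np

def select_acoustic_embedding(test_vec, train_vecs, train_acoustic_embs):
    """Pick the acoustic embedding of the training sentence whose
    linguistic vector is most cosine-similar to the test sentence.

    test_vec:            (d,)   linguistic vector of the test sentence
    train_vecs:          (n, d) linguistic vectors of the train sentences
    train_acoustic_embs: (n, k) VAE acoustic embeddings, row-aligned
                                with train_vecs
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    t = test_vec / np.linalg.norm(test_vec)
    T = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = T @ t                  # (n,) linguistic similarity (LS) scores
    best = int(np.argmax(sims))   # index of the most similar train sentence
    return train_acoustic_embs[best], float(sims[best])
```

At inference time, the returned embedding would then condition the VAE decoder in place of a sampled or averaged prior, which is what yields the sentence-level prosody variation.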
doi:10.21437/interspeech.2020-1411 dblp:conf/interspeech/TyagiNRDL20 fatcat:5hnx77bkgnay5ed5ffijqu47vi