MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis
[article] 2022, arXiv pre-print
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and synthesizing expressive speech has therefore attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a
arXiv:2201.06460v1