Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
[article]
2021
arXiv
pre-print
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we
arXiv:2106.08352v1
fatcat:voo4yudmpre2bbjczawwrfpsuu