Prosody modeling with soft templates

Greg Kochanski, Chilin Shih
2003 Speech Communication, vol. 39, pp. 311-352
This paper describes a novel prosody generation model. We intend it to support a broad range of linguistic theories and multiple languages, since the model imposes no restrictions on accent categories or shapes. This capability is crucial to the next generation of text-to-speech systems, which will need to synthesize intonation variations for different speech acts, emotions, and styles of speech. The system supports mark-up tags that are mathematically defined and generate f0 deterministically. Underlying the tags is an articulatory model of accent interaction which balances physiological and communication constraints. We specify the model by way of an algorithm for calculating the pitch, and by way of examples. The model allows localized, linguistically reasonable tags, and is suitable for a data-driven fitting process.
... system by adding mark-up tags to the text. With marked text, the TTS system does not need to deduce as much, so it need not be designed conservatively. The mark-up system is most useful if it is flexible enough to support any intonation event that a user or a future dialogue system might want to express. A pertinent question is then how to design a pitch generation system that will support linguistic models that are not yet defined.
In this paper, we introduce a prosody tagging and generation system, Soft TEMplate Mark-up Language (Stem-ML). This system combines mark-up tags and pitch generation in one, so that future users and dialogue systems can control intonation events without having to write a pitch generation component for the TTS system. We define a set of tags that serve the dual function of marking the text and generating pitch. The user can use these tags to describe linguistic events, and the tags automatically provide pitch generation support. It is thus most important that the model we define be able to represent any possible prosody. A second goal is to mark prosody in a way that is compatible with standard linguistic assumptions: that accents are localized and associated with stress groups, words, or syllables. A final goal is for the model to make use of information that is predictable from text, such as word accents, tones, and prosodic boundaries; this will allow us to minimize the number of tags that need to be added to the text.
Ultimately, we see this model becoming an "assembly language" in which tags and their parameter settings are produced by automated tools. From a research point of view, it is important to have a model that bridges the gap from linguistic theories to the objective reality of a glottal oscillator with a time-varying frequency. The model needs to be general enough to provide a quantitative representation of many different theories of intonation, so that it can be used to compare them.
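The idea of tags that generate f0 deterministically while balancing physiological and communication constraints can be illustrated with a small numerical sketch. The code below is not the algorithm given in the paper. It assumes, purely for illustration, that each tag is a hypothetical triple (start frame, template in Hz, strength), that physiological effort is the summed squared first and second differences of the pitch curve, and that communication error is the strength-weighted squared deviation from each template. The function name soft_template_pitch and the smooth parameter are likewise invented for this example.

```python
import numpy as np


def soft_template_pitch(n_frames, tags, smooth=1.0):
    """Return a pitch curve (Hz) that trades smoothness against tag templates.

    Illustrative assumption, not the paper's definition:
      tags   -- list of (start_frame, template_hz, strength) triples
      smooth -- weight of the effort (smoothness) penalty
    At least one tag must have a positive strength, otherwise the
    linear system below is singular.
    """
    # Finite-difference operators: rows of d1 take p[i+1] - p[i],
    # rows of d2 take the second difference.
    d1 = np.diff(np.eye(n_frames), n=1, axis=0)
    d2 = np.diff(np.eye(n_frames), n=2, axis=0)
    effort = smooth * (d1.T @ d1 + d2.T @ d2)

    # Accumulate per-frame weights and weighted targets from all tags.
    weight = np.zeros(n_frames)
    target = np.zeros(n_frames)
    for start, template, strength in tags:
        template = np.asarray(template, dtype=float)
        span = slice(start, start + len(template))
        weight[span] += strength
        target[span] += strength * template

    # Minimising  p^T effort p + sum_k strength_k * |p - template_k|^2
    # gives the linear system (effort + diag(weight)) p = target.
    return np.linalg.solve(effort + np.diag(weight), target)


def template_error(pitch, tags):
    """Unweighted squared deviation of the curve from every template."""
    total = 0.0
    for start, template, _ in tags:
        template = np.asarray(template, dtype=float)
        total += float(np.sum((pitch[start:start + len(template)] - template) ** 2))
    return total


def two_accents(strength):
    """A high accent followed by a low accent, both with the same strength."""
    return [(5, [220.0] * 10, strength), (20, [180.0] * 10, strength)]


strong = soft_template_pitch(40, two_accents(100.0))
weak = soft_template_pitch(40, two_accents(0.01))
print(template_error(strong, two_accents(1.0)), template_error(weak, two_accents(1.0)))
```

In this sketch a large strength pulls the curve close to the template, while a small strength lets the smoothness term dominate, so neighbouring accents blend into each other. The same tags always yield the same curve, which is the sense in which such a formulation would be deterministic.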
doi:10.1016/s0167-6393(02)00047-x fatcat:pow5ucslgzblndbung45swwvcy