Aerodynamic Modeling of Coarticulation for Concatenative Speech Synthesis

Kevin Mcgowan
unpublished
Decades of speech perception research show that listeners perceive and use a wide array of fine-grained phonetic details, including the coarticulatory influences that nearby sounds have on each other. For example, the vowel in "can" is nasalized by the following nasal consonant, whereas the vowel in "cat" is not. We believe details like this provide the listener with a rich network of informative cues and are key to understanding our ...hing ability to disambiguate meaningful speech sounds from a seemingly infinite range of noisy inputs. Unfortunately, these cues, whether subtle or overt, are generally missing or contradictory in text-to-speech (TTS) synthesis output. We present a method of improving concatenative speech synthesis by explicitly modeling coarticulation. The Festival speech synthesis system (Taylor et al. 1998) was modified to use airflow data during unit selection. The outputs of the modified and unmodified systems were compared in a listening experiment. Results indicate not only that listeners are sensitive to the sub-categorical phonetic differences but also that, in general, they prefer speech synthesized from a hybrid acoustic/articulatory model to standard acoustic-only speech synthesis.

Background: Coarticulation
fatcat:cec6zd34fjbbvbnuov4dbdwnqi