FSM and k-nearest-neighbor for corpus based video-realistic audio-visual synthesis

Christian Weiss
Interspeech 2005
In this paper we introduce a corpus-based 2D video-realistic audio-visual synthesis system. The system combines a concatenative Text-to-Speech (TTS) system with a concatenative Text-to-Visual (TTV) system into a lip-movement-synchronized Text-to-Audio-Visual-Speech (TTAVS) system. For the concatenative TTS we use a finite state machine (FSM) approach to select non-uniform, variable-size audio segments. Analogously, a k-nearest-neighbor algorithm is applied to select the visual segments, where we perform image filtering prior to selection to extract the features used in the Euclidean distance measure, minimizing distortions when concatenating the visual segments. Only the start frame and end frame of potential video-frame sequences enter the Euclidean metric. The visual counterparts of the selected audio segments are chosen based on a visemic transcription derived from the phonemic transcription of the input text. Because speech and video come from independent source databases, we synchronize the generated signals linearly. The result is lip-movement-synchronized audio-visual speech. The system adapts easily to new speakers, whether using a different speech source or a different video source.
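The abstract does not detail how the FSM selects variable-size audio segments, so the following is a minimal Python sketch of one plausible reading: states are positions in the target phone string, each corpus segment matching a span of the target is an arc advancing the state, and the path with the fewest concatenation joins wins. All names here (`fsm_select_segments`, `corpus_segments`) and the join-count cost are assumptions, not the paper's method.

```python
def fsm_select_segments(target_phones, corpus_segments):
    """Sketch of FSM-style non-uniform unit selection.

    States are positions 0..n in the target phone string; a corpus
    segment (a tuple of phones) matching the target at position i is an
    arc from state i to state i + len(segment). Per state we keep the
    path with the fewest joins -- a crude stand-in for the paper's
    unspecified selection cost.
    """
    n = len(target_phones)
    best = [None] * (n + 1)        # best[i] = (joins, path) reaching state i
    best[0] = (0, [])
    for i in range(n):
        if best[i] is None:
            continue
        joins, path = best[i]
        for seg in corpus_segments:
            span = len(seg)
            if tuple(target_phones[i:i + span]) == seg:
                cand = (joins + 1, path + [seg])
                if best[i + span] is None or cand < best[i + span]:
                    best[i + span] = cand
    return best[n][1] if best[n] else None

# Example: a tiny corpus of variable-size segments covering "h eh l ow".
corpus = [("h", "eh"), ("l",), ("l", "ow"), ("ow",)]
print(fsm_select_segments(["h", "eh", "l", "ow"], corpus))
# -> [('h', 'eh'), ('l', 'ow')]  (two segments rather than three)
```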
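For the visual side, the abstract restricts the Euclidean metric to boundary frames: the end frame of the previously chosen segment and the start frame of each candidate sequence. Below is a minimal sketch assuming pre-filtered grayscale frames as NumPy arrays; `boundary_features`, the candidate layout, and the choice of k are hypothetical.

```python
import numpy as np

def boundary_features(frame):
    """Hypothetical feature extractor: the paper filters each image before
    selection; here we simply flatten a pre-filtered grayscale frame."""
    return frame.astype(np.float64).ravel()

def knn_select_segment(prev_end_frame, candidates, k=5):
    """k-nearest-neighbor selection of the next video segment.

    Only boundary frames enter the Euclidean distance, matching the
    paper's restriction to start and end frames of candidate sequences.
    """
    target = boundary_features(prev_end_frame)
    dists = np.array([
        np.linalg.norm(boundary_features(c["start_frame"]) - target)
        for c in candidates
    ])
    nearest = np.argsort(dists)[:k]   # k candidates with least distortion
    # Take the single best neighbor; a fuller system could re-rank the k
    # candidates with additional costs such as viseme-context match.
    return candidates[int(nearest[0])], float(dists[nearest[0]])
```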
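Because audio and video are generated from independent source databases, the two signals are synchronized linearly. One plausible form of such a linear alignment, resampling the concatenated frame sequence to the synthesized audio's duration, is sketched below; the frame-rate handling is an assumption.

```python
import numpy as np

def synchronize_linear(video_frames, video_fps, audio_duration_s):
    """Linearly time-warp the concatenated video to the audio duration.

    Output frame k samples the source sequence at a linearly mapped
    position, stretching or compressing the lip movement uniformly.
    """
    n_out = max(1, int(round(audio_duration_s * video_fps)))
    src_idx = np.linspace(0, len(video_frames) - 1, n_out)
    return [video_frames[int(round(i))] for i in src_idx]
```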
doi:10.21437/interspeech.2005-789