A Practical Model for Live Speech-Driven Lip-Sync
IEEE Computer Graphics and Applications
The signal-processing and speech-understanding communities have proposed several approaches to generating speech animation from live acoustic speech input. For example, based on a real-time recognized phoneme sequence, researchers use simple linear smoothing functions to produce the corresponding speech animation.1,2 Other approaches train statistical models (such as neural networks) to encode the mapping between acoustic speech features and facial movements.3,4 These approaches have demonstrated their real-time runtime efficiency on an off-the-shelf computer, but their performance is highly speaker-dependent because of the individual-specific nature of the chosen acoustic speech features. Furthermore, the visual realism of these approaches is insufficient, so they're less suitable for graphics and animation applications.

Live speech-driven lip-sync involves several challenges. First, it poses additional technical difficulties compared with prerecorded speech, where expensive global optimization techniques can help find the most plausible speech motion corresponding to novel spoken or typed input. It's extremely difficult, if not impossible, to directly apply such global optimization techniques to live speech-driven lip-sync because the forthcoming (not yet available) speech content can't be exploited during synthesis. Second, live speech-driven lip-sync algorithms must be highly efficient to ensure real-time speed on an off-the-shelf computer, whereas offline speech animation synthesis algorithms don't need to meet such tight time constraints. Third, compared with forced phoneme alignment for prerecorded speech, live speech phoneme recognition has low accuracy even in state-of-the-art systems (such as the Julius system [http://julius.sourceforge.jp] and the HTK toolkit [http://htk.eng.cam.ac.uk]).

To quantify the difference in phoneme recognition accuracy between the prerecorded and live speech cases, we randomly selected 10 prerecorded sentences and extracted their phoneme sequences using the Julius system, first to perform forced phoneme alignment on the clips (called offline phoneme alignment) and then as a real-time phoneme recognition engine. By treating the same prerecorded speech clip as simulated live speech, the system generated phoneme output sequentially while the speech was being fed into it.
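Given the two phoneme sequences from such an experiment, one common way to score recognition accuracy is a Levenshtein-style alignment between the recognized sequence and the ground-truth sequence. The article doesn't spell out its exact metric, so this edit-distance formulation, along with the example phoneme strings, is only an illustrative assumption:

```python
# Sketch: phoneme recognition accuracy as 1 - (edit distance / reference
# length), computed with a standard Levenshtein dynamic program. The
# example sequences below are made up for illustration.

def phoneme_accuracy(ground_truth, recognized):
    """Accuracy of a recognized phoneme sequence against a reference."""
    n, m = len(ground_truth), len(recognized)
    # dp[i][j] = edit distance between prefixes of length i and j.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ground_truth[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return max(0.0, 1.0 - dp[n][m] / n)

# Hypothetical offline alignment (ground truth) vs. live recognition.
truth = ["sil", "HH", "AH", "L", "OW", "sil"]
live = ["sil", "HH", "AE", "L", "OW"]
acc = phoneme_accuracy(truth, live)  # one substitution + one deletion
```

With one substituted and one dropped phoneme out of six, this example scores about 0.67, in the same range as the live-recognition accuracies reported next.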
Then, taking the offline phoneme alignment results as the ground truth, we computed the accuracy of live speech phoneme recognition in our experiment. As Figure 1 illustrates, the live speech phoneme recognition accuracy of the same Julius system varies from 45 to 80 percent. Further empirical analysis didn't reveal any patterns among the incorrectly recognized phonemes (that is, phonemes that are often recognized incorrectly in live speech), implying that to produce satisfactory live speech-driven animation results, any phoneme-based algorithm must take the relatively low phoneme recognition accuracy for live speech into design consideration. Moreover, such an algorithm should be able to perform certain self-corrections at runtime, because some phonemes could be incorrectly recognized and input to the algorithm in a less predictable manner.

A simple, efficient, yet practical phoneme-based approach to generating realistic speech animation in real time from live speech input starts by decomposing lower-face movements and ends by applying motion blending. Experiments and comparisons demonstrate the realism of the synthesized speech animation.
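As a rough illustration of the final motion-blending step: several candidate lower-face motion frames can be combined with normalized weights into one output frame. The decomposition into lower-face components and the actual weight computation are the article's contribution and aren't reproduced here; this generic linear blend, with made-up data, only sketches the blending operation itself:

```python
# Generic motion-blending sketch: blend candidate motion frames (each a
# vector of, say, lip-landmark displacements) by a weighted average.
# The candidate frames and weights below are illustrative assumptions.

def blend_motions(frames, weights):
    """Weighted average of same-length motion vectors."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so weights sum to 1
    dim = len(frames[0])
    return [sum(norm[k] * frames[k][i] for k in range(len(frames)))
            for i in range(dim)]

# Two hypothetical candidate frames for one time step, blended 1:3.
candidates = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
blended = blend_motions(candidates, [1.0, 3.0])
```

Because the weights are normalized per frame, the blended output stays within the convex hull of the candidates, which helps avoid overshooting lip shapes when candidates disagree.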