Creating Song from Lip and Tongue Videos with a Convolutional Vocoder

Jianyu Zhang, Pierre Roussel, Bruce Denby
2021 IEEE Access  
A convolutional neural network and a deep autoencoder are used to predict Line Spectral Frequencies, F0, and a voiced/unvoiced flag in singing data, using as input only ultrasound images of the tongue and visual images of the lips. A novel convolutional vocoder that transforms the learned parameters into an audio signal is also presented. Spectral Distortion of the predicted Line Spectral Frequencies is reduced compared to that in an earlier study using handcrafted features and multilayer perceptrons on the same data set, while the predicted F0 and voiced/unvoiced flag are found to be highly correlated with their ground truth values. The convolutional vocoder is compared to standard vocoders. Results can be of interest in the study of singing articulation as well as for silent speech interface research. Sample predicted audio files are available online. Source code:
INDEX TERMS Multimodal speech recognition, convolutional neural networks, ultrasound, line spectral frequencies, silent speech interfaces, vocoder, rare singing.
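The abstract reports Spectral Distortion as its figure of merit but does not define it. As a rough illustration only, the sketch below uses one common textbook definition: the root-mean-square difference, in dB, between the log power spectra of two all-pole (LPC) envelopes. The function names, the FFT size, and the use of LPC coefficients rather than Line Spectral Frequencies as input are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def lpc_log_spectrum_db(a, n_fft=512):
    """Log power spectrum in dB of the all-pole filter 1/A(z).

    `a` holds the LPC polynomial coefficients [1, a1, ..., ap].
    (Hypothetical helper; assumes A(z) has no zeros on the unit circle.)
    """
    A = np.fft.rfft(a, n_fft)            # A(e^{jw}) sampled on n_fft/2 + 1 bins
    return -20.0 * np.log10(np.abs(A))   # 10*log10(1/|A|^2)

def spectral_distortion_db(a_ref, a_pred, n_fft=512):
    """RMS log-spectral difference (dB) between two LPC envelopes."""
    diff = lpc_log_spectrum_db(a_ref, n_fft) - lpc_log_spectrum_db(a_pred, n_fft)
    return float(np.sqrt(np.mean(diff ** 2)))

if __name__ == "__main__":
    reference = [1.0, -0.9]   # toy first-order envelope
    predicted = [1.0, -0.5]   # a deliberately mismatched prediction
    print(spectral_distortion_db(reference, predicted))
```

In an evaluation like the one described, such a measure would typically be computed per frame and averaged over the test set; whether the paper restricts the frequency band or averages differently is not stated in the abstract.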
doi:10.1109/access.2021.3050843 fatcat:i4xx6m5d2nhk3pwzeoi6p5omcq