Lost in segmentation: Three approaches for speech/non-speech detection in consumer-produced videos

Benjamin Elizalde, Gerald Friedland
2013 IEEE International Conference on Multimedia and Expo (ICME)
Traditional speech/non-speech segmentation systems have been designed for specific acoustic conditions, such as broadcast news or meetings. However, little research has been done on consumer-produced audio. This type of media is constantly growing and has complex characteristics such as low-quality recordings, environmental noise, and overlapping sounds. This paper evaluates three approaches to speech/non-speech detection on consumer-produced audio. The approaches are state-of-the-art speech/non-speech detectors: one based on Gaussian Mixture Models (GMM), another on Support Vector Machines (SVM), and the last on Neural Networks (NN). Using the TRECVID MED 2012 database, we designed combinations of training and testing sets to aid the understanding of what speech/non-speech detection on consumer-produced media entails and how traditional approaches to this detection perform in this domain. The results revealed that the cross-domain state-of-the-art GMM and SVM systems underperformed a one-layer NN algorithm, which had 20% higher accuracy and processed audio 5 times faster.
doi:10.1109/icme.2013.6607486 dblp:conf/icmcs/ElizaldeF13 fatcat:wiretpir7re77jxwkhrj4yejna