Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Julia Bernd, Gerald Friedland, Kurt Keutzer
2015 Proceedings of the 5th ACM on International Conference on Multimedia Retrieval - ICMR '15  
This paper presents advances in analyzing audio content information to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection using the entire audio track from each video causes a computational bottleneck. Here, we address this problem by developing a sparse audio frame-sampling method that improves event-detection speed and accuracy. We achieved a 10 percentage-point improvement in event-classification accuracy, with a 200x reduction in the number of training input examples as compared to using the entire track. This reduction in input feature volume led to a 16x reduction in the size of the DNN architecture and a 300x reduction in training time. We applied our method using the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher MAP scores compared to a baseline i-vector system at a significantly reduced computational cost. The speed improvement is relevant for processing videos on a large scale, and could enable more effective deployment in mobile systems.
doi:10.1145/2671188.2749396 dblp:conf/mir/AshrafEIMBFK15 fatcat:tjfd4zwz5zb5dk3m24lmlvmmq4
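
The abstract does not say how the sparse audio frames are chosen, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes each video's audio is already a list of per-frame feature vectors and keeps one randomly chosen frame from each consecutive window. The function name `sparse_sample_frames`, the `keep_every` parameter, and the dummy feature values are all hypothetical; the default window of 200 is chosen only to mirror the 200x reduction in training examples reported above.

```python
import random


def sparse_sample_frames(frames, keep_every=200, seed=0):
    """Keep one randomly chosen frame from each consecutive window.

    `frames` is a list of per-frame feature vectors (assumed layout).
    `keep_every` is a hypothetical window size; it is not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = []
    for start in range(0, len(frames), keep_every):
        window = frames[start:start + keep_every]
        sampled.append(rng.choice(window))
    return sampled


# Dummy data: 60,000 frames with 3 placeholder features each.
frames = [[float(i), 0.0, 0.0] for i in range(60000)]
subset = sparse_sample_frames(frames)
print(len(frames), len(subset))  # 60000 300
```

Drawing one frame per window, rather than sampling uniformly over the whole track, keeps the retained frames spread across the full duration of the audio; whether the paper uses this or a different selection rule is not stated in the abstract.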