Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

W. H. Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J. Nock, John R. Smith
2003 EURASIP Journal on Advances in Signal Processing  
In this paper we present a learning-based approach to semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon; novel concepts are then expressed in terms of concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely audio, visual, and text. Concept representations are modeled using Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and Support Vector Machines (SVMs). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: the proposed fusion scheme achieves more than a 10% relative improvement over the best unimodal concept detector.

1. Audio here refers to the non-speech content of the sound-track.
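The late-fusion idea described above can be sketched in miniature: each unimodal detector (audio, visual, text) emits a per-shot confidence score for a concept, and a combiner is trained on those scores. The paper's combiners are SVMs and Bayesian networks; the sketch below substitutes a simple grid-searched weighted sum as the combiner, and all scores and labels are hypothetical toy data, not from the paper.

```python
# Illustrative late-fusion sketch (not the paper's implementation).
# Each tuple holds (audio_score, visual_score, text_score, true_label)
# for one shot on a toy validation set.
import itertools

scores = [
    (0.9, 0.2, 0.7, 1),
    (0.1, 0.3, 0.2, 0),
    (0.6, 0.8, 0.4, 1),
    (0.2, 0.1, 0.5, 0),
    (0.8, 0.7, 0.9, 1),
    (0.3, 0.4, 0.1, 0),
]

def accuracy(weights, threshold=0.5):
    """Fuse the three unimodal scores with a weighted sum and
    measure detection accuracy against the toy labels."""
    correct = 0
    for a, v, t, label in scores:
        fused = weights[0] * a + weights[1] * v + weights[2] * t
        correct += int((fused > threshold) == bool(label))
    return correct / len(scores)

# Grid search over weight triples that sum to 1 (a stand-in for
# training an SVM/Bayesian-network combiner on detector outputs).
grid = [w / 10 for w in range(11)]
best = max(
    (ws for ws in itertools.product(grid, repeat=3)
     if abs(sum(ws) - 1.0) < 1e-9),
    key=accuracy,
)
print("best weights:", best, "fused accuracy:", accuracy(best))
```

In practice the combiner would be fit on held-out validation scores, exactly so that a strong modality for one concept (e.g. audio for "music") can dominate while weaker modalities still contribute.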
doi:10.1155/s1110865703211173