187,701 Hits in 2.7 sec

A Model for Removing Skew in Network Multimedia Communication for Guaranteed QoS in Packet Network by Max-Packet Generation

Shyamalendu Kandar, C. T. Bhunia
2012 International Journal of Modeling and Optimization  
A model for generating the Max-Packet is also described in the paper. Index Terms: combined packet, max-packet, only video packet, skew.  ...  Multimedia data are human-sensible. These types of data are delay-intolerant but error-tolerant to some extent.  ...  If T_a is the timestamp for the audio data and T_v is the timestamp for the video data, the model acts as follows: if |T_a − T_v| ≤ 200 ms, the audio and video data are synchronous (see the sketch below).  ... 
doi:10.7763/ijmo.2012.v2.210 fatcat:nzw3tuxazjd7xp3ysuys36lhcq
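
A minimal sketch of the synchrony check above, in Python; the 200 ms threshold comes from the snippet, while the function name and millisecond units are illustrative assumptions:

```python
# Audio-video synchrony test per the skew model described above.
# Only the 200 ms bound comes from the paper; names and units are assumed.
SKEW_THRESHOLD_MS = 200.0

def is_synchronous(t_audio_ms: float, t_video_ms: float) -> bool:
    """True when the audio and video timestamps are within the skew bound."""
    return abs(t_audio_ms - t_video_ms) <= SKEW_THRESHOLD_MS
```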

AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings [article]

Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, Vinay P. Namboodiri
2020 arXiv   pre-print
The cross-modal decoder enforces the constraint that the class-label text features can be reconstructed from the audio and video embeddings of data points (see the sketch below).  ...  In this paper, we propose a novel approach for generalized zero-shot learning in a multi-modal setting, where novel classes of audio/video occur during testing that are not seen during training.  ...  When only audio or only video data is present at test time, our model significantly outperforms the audio-only and video-only baseline models.  ... 
arXiv:2005.13402v3 fatcat:me3aoglrbrdndpnofz2o4syrgu
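
A rough sketch of the cross-modal reconstruction constraint described above, assuming PyTorch; the dimensions, layer sizes, and MSE objective are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossModalDecoder(nn.Module):
    """Maps a modality embedding back to the class-label text-feature space."""
    def __init__(self, emb_dim: int = 512, text_dim: int = 300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, text_dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)

def reconstruction_loss(decoder, audio_emb, video_emb, label_text):
    # Both modalities' embeddings must reconstruct the same label text features.
    mse = nn.MSELoss()
    return mse(decoder(audio_emb), label_text) + mse(decoder(video_emb), label_text)
```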

Car crash detection using ensemble deep learning and multimodal data from dashboard cameras

Jae Gyeong Choi, Chan Woo Kong, Gyeongho Kim, Sunghoon Lim
2021 Expert Systems with Applications  
While most existing car crash detection systems depend on single-modal data (i.e., video data or audio data only), the proposed car crash detection system uses an ensemble deep learning model based on  ...  The ensemble combines three models (see the sketch below): (1) a CNN-and-GRU-based classifier using video data only; (2) a GRU-based classifier using audio features of the audio data only; and (3) a CNN-based classifier using spectrogram images of the audio data only.  ... 
doi:10.1016/j.eswa.2021.115400 fatcat:cqqspd6tuncm7f6rjjen3ymjvi
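
A toy sketch of fusing the three single-modal classifiers listed above; the snippet does not give the paper's fusion rule, so simple probability averaging is assumed:

```python
import numpy as np

def ensemble_predict(p_video: np.ndarray,
                     p_audio: np.ndarray,
                     p_spectrogram: np.ndarray) -> np.ndarray:
    """Average per-class probabilities from the video (CNN+GRU),
    audio-feature (GRU), and spectrogram (CNN) classifiers, then
    return the predicted class index per sample."""
    fused = (p_video + p_audio + p_spectrogram) / 3.0
    return fused.argmax(axis=-1)
```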

Emotion Recognition in Audio and Video Using Deep Neural Networks [article]

Mandeep Singh, Yuan Fang
2020 arXiv   pre-print
Among the different architectures explored, we find that a (CNN+RNN) + 3DCNN multi-model architecture, which processes audio spectrograms and corresponding video frames, gives an emotion prediction accuracy of 54.0% among  ...  Humans are able to comprehend information from multiple domains, e.g., speech, text, and vision.  ...  We would like to thank the CS231N teaching staff for guiding us through the project.  ... 
arXiv:2006.08129v1 fatcat:j5mm6vyjejdebpmmjru5eryr7y

Audio-Video based Classification using SVM and AANN

K. Subashini, S. Palanivel, V. Ramalingam
2012 International Journal of Computer Applications  
This paper presents a method to classify audio-video data into one of five classes: advertisement, cartoon, news, movie, and songs.  ...  Experimental results of audio classification and video classification are combined using a weighted sum rule for audio-video based classification (see the sketch below).  ...  The modeling technique for audio and video classification is described in Section 5.  ... 
doi:10.5120/6269-8425 fatcat:t7g2qozgrzfmzclyamhftf52ly
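
A minimal sketch of the weighted sum rule named above; the weight value and score layout are assumptions, as the snippet does not give the paper's tuned weight:

```python
def weighted_sum_fusion(audio_scores, video_scores, w_audio=0.5):
    """Fuse per-class confidence scores from the audio and video
    classifiers with a convex weight and return the winning class index."""
    fused = [w_audio * a + (1.0 - w_audio) * v
             for a, v in zip(audio_scores, video_scores)]
    return max(range(len(fused)), key=fused.__getitem__)
```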

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition [article]

Takaki Makino, Brendan Shillingford
2019 arXiv   pre-print
To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training content  ...  The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly  ...  Table 5: Results on noisy versions of YTDEV18 for audio (A), visual (V), and audio-visual (A+V) models trained on uncorrupted training data. All values are WER (%) ± 95% CI.  ... 
arXiv:1911.04890v1 fatcat:m4fxjpwvsnbadi4uw2qylueazu

Page 18 of SMPTE Motion Imaging Journal Vol. 107, Issue 9 [page]

1998 SMPTE Motion Imaging Journal  
As with the Video Essence, there may be no separate physical audio transport: the Audio Essence may often be carried as a multiplexed signal with the Video Essence but, for the purposes of modeling the  ...  In common with the above discussion on Video Essence, for the purpose of this report, Audio Files have been classified as Data Essence; thus Audio Essence is classified as Audio Streams only.  ... 

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction [article]

Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala
2020 arXiv   pre-print
Despite a strong relation between auditory and visual cues for guiding gaze during perception, video saliency models consider only visual cues and neglect the auditory information that is ubiquitous  ...  Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient visible sound source is the natural cause of the superiority of our audio-visual model, and (3) richer feature  ...  In a nutshell, our main contributions include: constructing a database for audio-visual saliency prediction, and providing video categorical annotation for the data to enhance model analysis with respect  ... 
arXiv:1905.10693v2 fatcat:5tby44imzrcnvflhdfj4rrasie

Audio-Video Based Segmentation and Classification using AANN

K. Subashini, S. Palanivel, V. Ramalingam
2012 International Journal of Computer Applications Technology and Research  
This paper presents a method to classify audio-video data into one of seven classes, including advertisement, cartoon, news, movie, and songs.  ...  Automatic audio-video classification is very useful for audio-video indexing and content-based audio-video retrieval. Mel-frequency cepstral coefficients are used to characterize the audio data (see the sketch below).  ...  Audio and video data for segmentation: the experiments are conducted using television broadcast audio-video data collected from various channels (both Tamil and English) as the evaluation database.  ... 
doi:10.7753/ijcat0102.1003 fatcat:uusmr533abdgbbpxarzww4tla4
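
A minimal sketch of MFCC extraction as used above to characterize the audio data, assuming the librosa library; the file name and coefficient count are illustrative, not the paper's settings:

```python
import librosa

# Load an audio clip at its native sample rate (file name is hypothetical).
y, sr = librosa.load("broadcast_clip.wav", sr=None)

# Extract 13 Mel-frequency cepstral coefficients per frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, n_frames)
```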

Sight to Sound: An End-to-End Approach for Visual Piano Transcription

A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
2020 ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
We also show that combining audio and video data can improve the transcription obtained from each modality alone.  ...  In contrast, visual information (e.g. a video of an instrument being played) does not have such ambiguities. In this work, we address the problem of transcribing piano music from visual data alone.  ...  Acknowledgements: This work is supported by the EPSRC programme grant Seebibyte EP/M013774/1: Visual Search for the Era of Big Data. We thank Ruth Fong for help with smoothing the output.  ... 
doi:10.1109/icassp40776.2020.9053115 dblp:conf/icassp/KoepkeWMZ20 fatcat:tv3eplrte5cidchflowkscbguu

Audio keyword generation for sports video analysis

Min Xu, Ling-Yu Duan, Liang-Tien Chia, Chang-sheng Xu
2004 Proceedings of the 12th annual ACM international conference on Multimedia - MULTIMEDIA '04  
Moreover, our system introduces an adaptation mechanism that tunes the initial HMM models (obtained from the available training data) to improve performance using a small amount of data from a new sports game video (see the sketch below).  ...  In our previous work, we designed a hierarchical Support Vector Machine (SVM) classifier for audio keyword identification.  ...  Our developed system provides a flexible and efficient tool for sports audio analysis.  ... 
doi:10.1145/1027527.1027702 dblp:conf/mm/XuDCX04 fatcat:xdznflw6qzeu3cfve4rgu43wie
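
A hypothetical sketch of one simple way to adapt initial HMM observation statistics toward a small amount of new-game data; the interpolation rule and weight are assumptions, since the snippet does not specify the paper's adaptation scheme:

```python
import numpy as np

def adapt_means(init_means: np.ndarray,
                new_data_means: np.ndarray,
                alpha: float = 0.3) -> np.ndarray:
    """Interpolate each HMM state's Gaussian mean between the initial
    model and the estimate from the new game's (small) data sample."""
    return (1.0 - alpha) * init_means + alpha * new_data_means
```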

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Yong-Hyeok Lee, Dong-Won Jang, Jae-Bin Kim, Rae-Hong Park, Hyung-Min Park
2020 Applied Sciences  
utilizes both an audio context vector using a video query and a video context vector using an audio query (see the sketch below).  ...  Because the audio has richer information than the lip-related video, it is hard for AVSR to train attentions with balanced modalities.  ...  To consider a video query for the audio context in addition to an audio query for the video context of the AV align model [13], and to apply them to the transformer model, our DCM attention model has two multi-head  ... 
doi:10.3390/app10207263 fatcat:yyrwdli7pre73ldys7znzkhmh4
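
An illustrative PyTorch sketch of the dual cross-modality attention described above: the audio context is attended with a video query and the video context with an audio query. Dimensions, head counts, and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4
audio = torch.randn(50, 1, embed_dim)  # (audio_steps, batch, dim)
video = torch.randn(25, 1, embed_dim)  # (video_steps, batch, dim)

# Audio context vector attended by a video query, and vice versa.
audio_attn = nn.MultiheadAttention(embed_dim, num_heads)
video_attn = nn.MultiheadAttention(embed_dim, num_heads)

audio_context, _ = audio_attn(query=video, key=audio, value=audio)
video_context, _ = video_attn(query=audio, key=video, value=video)
print(audio_context.shape, video_context.shape)  # (25, 1, 256) (50, 1, 256)
```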

Audio-Video Sensor Fusion with Probabilistic Graphical Models [chapter]

Matthew J. Beal, Hagai Attias, Nebojsa Jojic
2002 Lecture Notes in Computer Science  
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables.  ...  It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies.  ...  Fig. 2: Graphical model for the audio data. Fig. 3: Graphical model for the video data.  ... 
doi:10.1007/3-540-47969-4_49 fatcat:med2dpodn5fxnjtqzsvceyqq7q

Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text [article]

Zhe Wang, Kingsley Kuan, Mathieu Ravaut, Gaurav Manek, Sibo Song, Yuan Fang, Seokhwan Kim, Nancy Chen, Luis Fernando D'Haro, Luu Anh Tuan, Hongyuan Zhu, Zeng Zeng (+4 others)
2017 arXiv   pre-print
We present a classification framework for the joint use of text, visual, and audio features, and conduct an extensive set of experiments to quantify the benefit that this additional modality brings.  ...  In this Kaggle competition, we placed in the top 3% out of 650 participants using the released video and audio features.  ...  We propose a multi-model classification framework jointly modeling visual, audio, and text data, making this a truly multi-modal approach.  ... 
arXiv:1706.05461v3 fatcat:utlwwc72qneb5hihwxytvig6fy

A self-calibrating algorithm for speaker tracking based on audio-visual statistical models

Matthew J. Beal, Nebojsa Jojic, Hagai Attias
2002 IEEE International Conference on Acoustics Speech and Signal Processing  
The algorithm uses a parametrized statistical model which combines simple models of video and audio. Using unobserved variables, the model describes the process that generates the observed data.  ...  We present a self-calibrating algorithm for audio-visual tracking using two microphones and a camera.  ...  In Fig. 3, we compare the results of tracking using an audio-only model, the full audio-visual model, and a video-only model on multimodal data containing a moving and talking person with a strong  ... 
doi:10.1109/icassp.2002.5745023 dblp:conf/icassp/BealJA02 fatcat:bj3tg5uu4ra6ncaejs5ri4r4v4
Showing results 1 — 15 out of 187,701 results