
Robust Audio-Visual Instance Discrimination [article]

Pedro Morgado, Ishan Misra, Nuno Vasconcelos
2021 arXiv   pre-print
We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance  ...  First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other.  ...  Discussion and future work: We identified and tackled two significant sources of noisy training signals in audio-visual instance discrimination, namely instances with weak audio-visual correspondence (or  ... 
arXiv:2103.15916v1 fatcat:rxp52yp5tfhe3gydlhxdb4kvdi
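
As context for this result: audio-visual instance discrimination trains two encoders so that the audio and video of the same clip score higher than all other pairings. Below is a minimal sketch of the cross-modal InfoNCE objective this line of work builds on; the function name, shapes, and temperature are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim); row i of both comes from clip i."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: retrieve the matching audio for each video and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
```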

Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining [article]

Hanyu Xuan, Yihong Xu, Shuo Chen, Zhiliang Wu, Jian Yang, Yan Yan, Xavier Alameda-Pineda
2022 arXiv   pre-print
As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm.  ...  The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization, which can be used as self-annotated supervision.  ...  Hard sample mining is important in the instance discrimination field but is especially essential in the audio-visual field.  ... 
arXiv:2204.12366v1 fatcat:ixrdqdv4zjdknfpcyllv3yoqyq

Active Contrastive Set Mining for Robust Audio-Visual Instance Discrimination

Hanyu Xuan, Yihong Xu, Shuo Chen, Zhiliang Wu, Jian Yang, Yan Yan, Xavier Alameda-Pineda
2022 Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence   unpublished
As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm.  ...  The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization, which can be used as self-annotated supervision.  ...  Hard sample mining is important in the instance discrimination field but is especially essential in the audio-visual field.  ... 
doi:10.24963/ijcai.2022/503 fatcat:bn576f5jwfderdwudte42tskva
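
Since both versions of this paper stress hard sample mining, here is a hedged sketch of one common mining rule: keep the k negatives most similar to the anchor. The top-k rule and all names below are assumptions for illustration, not the Active Contrastive Set Mining algorithm itself.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(anchor: torch.Tensor,
                        candidates: torch.Tensor,
                        k: int = 16) -> torch.Tensor:
    """anchor: (dim,); candidates: (n, dim) known non-matching samples.
    Returns the k candidates most similar to the anchor (hardest negatives)."""
    sims = F.normalize(candidates, dim=1) @ F.normalize(anchor, dim=0)
    return candidates[sims.topk(k).indices]

hard = mine_hard_negatives(torch.randn(128), torch.randn(1000, 128))
```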

Benchmarking methods for audio-visual recognition using tiny training sets

Xavier Alameda-Pineda, Jordi Sanchez-Riera, Radu Horaud
2013 2013 IEEE International Conference on Acoustics, Speech and Signal Processing  
The problem of choosing a classifier for audio-visual command recognition is addressed. Because such commands are culture- and user-dependent, methods need to learn new commands from a few examples.  ...  We seek the best trade-off between speed, robustness, and the size of the training set.  ...  10-15 instances per class. Audio-visual discriminative classification approaches can be grouped depending on the way the audio-visual command is represented.  ... 
doi:10.1109/icassp.2013.6638341 dblp:conf/icassp/Alameda-PinedaSH13 fatcat:cx54ril6gfcnzge6ncl4vrglau

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection [article]

Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang
2022 arXiv   pre-print
audio-visual learning.  ...  Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised  ...  networks, which reduces the modality noise and benefits robust audio-visual representation.  ... 
arXiv:2207.05500v1 fatcat:msv2qehs6raatikql2ffwdobxu

Autonomous Audio-Supported Learning of Visual Classifiers for Traffic Monitoring

Horst Bischof, Martin Godec, Christian Leistner, Bernhard Rinner, Andreas Starzacher
2010 IEEE Intelligent Systems  
This audio sensor source acts as a teacher for the self-learning of the primary visual classifier and helps to resolve ambiguities typically present in single-sensor settings.  ...  Our system consists of a robust on-line boosting classifier that allows for continuous learning under concept drift.  ...  For instance, Levin et al. [6] trained a car detector using co-training [5] and Christoudias et al. [7] proposed an audio-visual co-training system for human gesture recognition.  ... 
doi:10.1109/mis.2010.28 fatcat:usiyq77x3bfpllmrfdy4hqa2eq

Audio-visual atoms for generic video concept classification

Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, Alexander C. Loui
2010 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
Audio atoms are extracted around energy onsets. Visual and audio atoms form AVAs, based on which discriminative audio-visual codebooks are constructed for concept detection.  ...  We extract a novel local representation, Audio-Visual Atom (AVA), which is defined as a region track associated with regional visual features and audio onset features.  ...  Based on the AVA representation, we construct discriminative audio-visual codebooks using the Multiple Instance Learning (MIL) technique [Maron and Lozano-Pérez 1998] to capture the representative joint  ... 
doi:10.1145/1823746.1823748 fatcat:ayq6ce4wsvhg5i3o25uqc2ckoy
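
Both AVA papers in this list build discriminative codebooks with Multiple Instance Learning, where a video is a bag of atoms and a bag is positive if at least one instance is. A minimal max-pooling MIL scorer illustrates that assumption; the linear scorer and dimensions are stand-ins, not the papers' codebook construction.

```python
import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    """Score a bag by its highest-scoring instance (the classic MIL assumption)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)   # per-instance score

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        """bag: (num_instances, feat_dim) -> scalar bag logit."""
        return self.scorer(bag).max()

model = MaxPoolMIL(64)
bag_logit = model(torch.randn(12, 64))         # one video = one bag of atoms
```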

Short-term audio-visual atoms for generic video concept classification

Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, Alexander Loui
2009 Proceedings of the seventeen ACM international conference on Multimedia - MM '09  
Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning. Codebook-based features are generated for semantic concept detection.  ...  We investigate the challenging issue of joint audio-visual analysis of generic videos, targeting semantic concept detection.  ...  Based on the S-AVA representation, we construct discriminative audio-visual codebooks using Multiple Instance Learning (MIL) [25] to capture the representative joint audiovisual patterns that are salient  ... 
doi:10.1145/1631272.1631277 dblp:conf/mm/JiangCCEL09 fatcat:6kyzmoyy2jafvgisxzv4rpjfbu

Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [article]

Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
2022 arXiv   pre-print
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system.  ...  To this end, we propose the Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the input noisy audio speech with the help of audio-visual correspondence.  ...  On the other hand, the Discriminative and the proposed V-CAFE show robust performance against acoustic noise by using audio and visual modalities simultaneously in speech recognition.  ... 
arXiv:2207.06020v1 fatcat:na3vsi5ndzeonpq3icz3ehtehm
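
The snippet above describes enhancing noisy audio features with visual context. One generic way to realize that idea is cross-attention with audio queries over visual keys, sketched below; the module name, dimensions, and head count are assumptions, not the V-CAFE architecture.

```python
import torch
import torch.nn as nn

class VisualGuidedAudioEnhancer(nn.Module):
    """Audio queries attend to visual context; a residual keeps the input."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        """audio: (B, Ta, dim) noisy features; visual: (B, Tv, dim)."""
        ctx, _ = self.attn(query=audio, key=visual, value=visual)
        return audio + ctx                     # visually enhanced audio features

enhanced = VisualGuidedAudioEnhancer()(torch.randn(2, 50, 256),
                                       torch.randn(2, 25, 256))
```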

Unsupervised Discriminative Learning of Sounds for Audio Event Classification [article]

Sascha Hornauer, Ke Li, Stella X. Yu, Shabnam Ghaffarzadegan, Liu Ren
2021 arXiv   pre-print
Furthermore, we show that our discriminative audio learning can be used to transfer knowledge across audio datasets and optionally include ImageNet pre-training.  ...  Recent progress in network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.  ...  By using Non-Parametric Instance-level Discrimination (NPID) [4] to train ESResNet on audio datasets, we learn features beneficial for downstream audio classification tasks, illustrated in Fig. 1.  ... 
arXiv:2105.09279v1 fatcat:65mht72j6rdcdfvivkm6bfbfly
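
The snippet cites Non-Parametric Instance-level Discrimination (NPID), which treats every training sample as its own class and scores features against a momentum-updated memory bank. A compact sketch of that mechanism follows; the bank size, momentum, and temperature are illustrative values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

num_items, dim, tau, momentum = 1000, 128, 0.07, 0.5
memory = F.normalize(torch.randn(num_items, dim), dim=1)   # one slot per sample

def npid_step(features: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """features: (batch, dim) encoder outputs; indices: their dataset ids."""
    f = F.normalize(features, dim=1)
    logits = f @ memory.t() / tau              # score against every instance
    loss = F.cross_entropy(logits, indices)    # target = the sample's own slot
    with torch.no_grad():                      # momentum refresh of the bank
        memory[indices] = F.normalize(
            momentum * memory[indices] + (1 - momentum) * f, dim=1)
    return loss

loss = npid_step(torch.randn(8, dim), torch.randint(0, num_items, (8,)))
```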

An Adversarial Framework for Generating Unseen Images by Activation Maximization

Yang Zhang, Wang Zhou, Gaoyuan Zhang, David D. Cox, Shiyu Chang
2022 AAAI Conference on Artificial Intelligence  
PROBEGAN consists of a class-conditional generator, a seen-class discriminator, and an all-class unconditional discriminator.  ...  Most of these methods would require the image set to contain some images of the target class to be visualized.  ...  These results suggest that regular classifiers tend to overemphasize features that are visually imperceptible, whereas robust classifiers would only focus on the visually salient ones.  ... 
dblp:conf/aaai/ZhangZZCC22 fatcat:ze3wpoiyu5hjbptlfiosarsgqi

See, Hear, Explore: Curiosity via Audio-Visual Association [article]

Victoria Dean, Shubham Tulsiani, Abhinav Gupta
2021 arXiv   pre-print
We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning  ...  For videos and code, see https://vdean.github.io/audio-curiosity.html.  ...  We can then leverage the misalignment likelihood as an indicator of novelty since the discriminator would be uncertain in such instances.  ... 
arXiv:2007.03669v2 fatcat:rhurxli72jezletxnqbbodak4q

Latent-Based Adversarial Neural Networks for Facial Affect Estimations

Decky Aspandi, Adria Mallol-Ragolta, Bjorn Schuller, Xavier Binefa
2020 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)  
Specifically, our models operate by aggregating several modalities into our discriminator, which is further conditioned on the extracted latent features by the generator.  ...  Specifically, we extract the visual latent features of the generator, which are then used to condition the discriminator on its estimations.  ...  The main role of the AEG is to produce cleaned images from noisy images to fool the discriminator, while simultaneously extracting robust latent features.  ... 
doi:10.1109/fg47880.2020.00053 fatcat:v5h4sm46jzelvmilcrjksnw2cu

Sound Localization by Self-Supervised Time Delay Estimation [article]

Ziyang Chen, David F. Fouhey, Andrew Owens
2022 arXiv   pre-print
We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation  ...  We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking.  ...  Instance discrimination. We also consider models that can be trained solely with mono audio using instance discrimination [81].  ... 
arXiv:2204.12489v1 fatcat:qakhyqoxobhedpw334bil3phba
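
For comparison with the learned time-delay estimator above, the classical non-learned baseline is generalized cross-correlation with phase transform (GCC-PHAT). The sketch below is the textbook method, not the paper's self-supervised model.

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Delay (seconds) of `sig` relative to `ref` via phase-transform GCC."""
    n = len(sig) + len(ref)
    spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    spec /= np.abs(spec) + 1e-12               # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(spec, n=n)
    max_lag = len(ref) - 1
    cc = np.concatenate((cc[-max_lag:], cc[:len(sig)]))  # reorder to signed lags
    return (np.argmax(np.abs(cc)) - max_lag) / fs

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)               # 1 s of noise at 16 kHz
sig = np.roll(ref, 80)                         # simulate a 5 ms delay
print(gcc_phat(sig, ref, fs=16000))            # ~0.005
```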

Animating Face using Disentangled Audio Representations [article]

Gaurav Mittal, Baoyuan Wang
2019 arXiv   pre-print
To make talking head generation robust to such variations, we propose an explicit audio representation learning framework that disentangles audio sequences into various factors such as phonetic content  ...  All previous methods for audio-driven talking head generation assume the input audio to be clean with a neutral tone.  ...  Figure 4: Visual comparison showing the ease of using our disentangled audio representation with existing talking head approaches to improve robustness to speech variations.  ... 
arXiv:1910.00726v1 fatcat:ofmxiseyhnaxneam4mnz55atvu