19,512 Hits in 5.2 sec

Learning Representations from Audio-Visual Spatial Alignment [article]

Pedro Morgado, Yi Li, Nuno Vasconcelos
2020 arXiv   pre-print
To learn from these spatial cues, we tasked a network to perform contrastive audio-visual spatial alignment of 360 video and spatial audio.  ...  We introduce a novel self-supervised pretext task for learning representations from audio-visual content.  ...  However, deep learning systems are trained from data. Thus, even self-supervised models reflect the biases in the collection process.  ... 
arXiv:2011.01819v1 fatcat:mjof6zfkrffgnprsll3y5mg75a
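The contrastive audio-visual alignment objective this abstract describes can be sketched as a standard InfoNCE loss over paired video and audio embeddings. This is an illustrative reconstruction, not the authors' implementation; the function name, shapes, and temperature are assumptions:

```python
import numpy as np

def audio_visual_nce(video_feats, audio_feats, temperature=0.1):
    """InfoNCE over N paired clips: row i of each matrix is the matching
    video/audio pair; every mismatched row acts as a negative.

    video_feats, audio_feats: (N, D) L2-normalized embeddings.
    """
    # Cosine-similarity logits between every video/audio pair.
    logits = video_feats @ audio_feats.T / temperature      # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    # Log-softmax over audio candidates; positives sit on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls each clip's audio and video embeddings together while pushing apart embeddings from different clips.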

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment [article]

Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen
2022 arXiv   pre-print
In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC  ...  In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content.  ...  Audio-visual spatial alignment learning Audio-visual spatial alignment is a difficult task, and is implemented by dividing it into two stages. The first stage is Fig. 4.  ... 
arXiv:2206.00970v1 fatcat:mrgj4sy3a5audmohp5o5rmivky

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound

Karren Yang, Bryan Russell, Justin Salamon
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.  ...  not leverage spatial audio cues.  ...  Overall, the analysis suggests that our spatial alignment model learns a representation that maps spatial audio cues to the positions of sound sources in the visual stream.  ... 
doi:10.1109/cvpr42600.2020.00995 dblp:conf/cvpr/YangRS20 fatcat:dutpxdtjgbferc45ui6nyu2ycy
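The spatial-correspondence pretext task in this work can be illustrated by how training examples are generated: randomly swap the left and right audio channels and ask the network to detect the swap. A minimal data-generation sketch, with all names assumed:

```python
import numpy as np

def make_flip_example(stereo_audio, rng):
    """One example for a left/right pretext task: with probability 0.5
    swap the two audio channels, and label whether the swap happened.
    A classifier fed this audio plus the video must learn spatial
    sound cues to recover the label.

    stereo_audio: (2, T) array whose rows are the left/right channels.
    """
    flipped = bool(rng.random() < 0.5)
    audio = stereo_audio[::-1].copy() if flipped else stereo_audio.copy()
    return audio, int(flipped)
```

Because the label is derived from the data itself, no human annotation is needed.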

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound [article]

Karren Yang, Bryan Russell, Justin Salamon
2020 arXiv   pre-print
Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.  ...  not leverage spatial audio cues.  ...  Overall, the analysis suggests that our spatial alignment model learns a representation that maps spatial audio cues to the positions of sound sources in the visual stream.  ... 
arXiv:2006.06175v2 fatcat:nz6y75x5rrgjtpdw42gklpeppm

Audio-Visual Model Distillation Using Acoustic Images

Andres F. Perez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino
2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality.  ...  Previous models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval.  ...  For instance, in [2, 3] they learn aligned audio-visual representations, using an audio-visual correspondence task.  ... 
doi:10.1109/wacv45572.2020.9093307 dblp:conf/wacv/PerezSMM20 fatcat:cf3cdewcwndbrddttjysuo3jhe
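The model-distillation setup this entry describes can be sketched with a Hinton-style soft-target loss, where a student network trained on one modality matches the temperature-softened predictions of a teacher trained on acoustic images. This is a generic distillation sketch, not the paper's exact objective:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: cross-entropy between the student's and
    the teacher's temperature-softened class distributions, scaled by
    T^2 so gradients stay comparable across temperatures."""
    soft_targets = softmax(teacher_logits / T)
    log_student = np.log(softmax(student_logits / T) + 1e-12)
    return -np.mean((soft_targets * log_student).sum(axis=1)) * T * T
```

The loss is minimized when the student reproduces the teacher's distribution exactly, so the student inherits the teacher's inter-class structure rather than just its hard labels.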

Audio-Visual Model Distillation Using Acoustic Images [article]

Andrés F. Pérez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino
2020 arXiv   pre-print
In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality.  ...  Previous models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval.  ...  For instance, in [2, 3] they learn aligned audio-visual representations, using an audio-visual correspondence task.  ... 
arXiv:1904.07933v2 fatcat:wdxa3pcc75cfxdmzgtqm4szkpi

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [article]

Tanzila Rahman, Mengyu Yang, Leonid Sigal
2021 arXiv   pre-print
From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak-supervision to allow granular cross-modal  ...  In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as  ...  Audio-visual representation learning has, in comparison, received much less attention. Most prior works [51] assume a single sound source per video and rely on audio-visual alignment objectives.  ... 
arXiv:2110.13412v1 fatcat:oejb6j7hebaiflohlib76r2pae

VisualEchoes: Spatial Image Representation Learning through Echolocation [article]

Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
2020 arXiv   pre-print
Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.  ...  Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.  ...  visual representation (no audio) at test time.  ... 
arXiv:2005.01616v2 fatcat:jkeld22crbg5fo3eu7bamdssym

The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning [article]

Haider Al-Tahan, Yalda Mohsenzadeh
2021 arXiv   pre-print
In this paper, we present a contrastive framework to learn audiovisual representations from unlabeled videos.  ...  However, there are still major questions on how we could integrate principles learned from both domains to attain effective audiovisual representations.  ...  Recently, contrastive self-supervised learning has been at the forefront of learning abstract representations from unlabeled visual or auditory data (He et al., 2020; Chen et al., 2020a; Al-Tahan & Mohsenzadeh  ... 
arXiv:2110.07082v1 fatcat:emnzw4ryafd35busxynex5yege

Contrastive Learning of Global-Local Video Representations [article]

Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
2021 arXiv   pre-print
We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals.  ...  However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for  ...  Audio-visual video representation learning. Learning video representations from the natural audiovisual correspondence has been studied extensively.  ... 
arXiv:2104.05418v2 fatcat:xn6c22ne2va47lnblxn2klnhsa

Contrastive Learning of Global and Local Video Representations

Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
2021 Neural Information Processing Systems  
We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals.  ...  However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for  ...  Audio-visual video representation learning. Learning video representations from the natural audiovisual correspondence has been studied extensively.  ... 
dblp:conf/nips/MaZMS21 fatcat:ipt7bzw2evh6dif75no7l36nla

Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models [article]

Khazar Khorrami, Okko Räsänen
2021 arXiv   pre-print
This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken  ...  in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task utilizing a cross-modal attention layer.  ...  These attention scores are then used to produce corresponding spatially weighted representations for audio (left) and time-weighted representations for image (right), followed by average pooling to get  ... 
arXiv:2108.02562v1 fatcat:ijwgjor7nrgh5puxjo3l2xjmrm

Learning in Audio-visual Context: A Review, Analysis, and New Perspective [article]

Yake Wei, Di Hu, Yapeng Tian, Xuelong Li
2022 arXiv   pre-print
To mimic human perception ability, audio-visual learning, aimed at developing computational approaches to learn from both audio and visual modalities, has been a flourishing field in recent years.  ...  future direction of the audio-visual learning area.  ...  Audio-visual Representation Learning How to effectively extract representations from heterogeneous audio-visual modalities without human annotations is an important topic.  ... 
arXiv:2208.09579v1 fatcat:xrjedf2ezbhbzbkysw2z2jsm7e

Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning [article]

Shaobo Min, Qi Dai, Hongtao Xie, Chuang Gan, Yongdong Zhang, Jingdong Wang
2021 arXiv   pre-print
Existing methods focus on distinguishing different video clips by visual and audio representations.  ...  The CMAC approach aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal, and do a similar alignment for  ...  In this work, we focus on visual-audio unsupervised learning, which aims at jointly learning visual and audio representations in an unsupervised manner.  ... 
arXiv:2106.06939v1 fatcat:3ikfqcp5zvhzbaq6jw2ueyrcgm
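The attention-alignment idea in this abstract can be sketched as a consistency loss between a visual-only attention map and an audio-guided attention map over the same spatial grid. A minimal illustration using one minus cosine similarity of the flattened maps (an assumed formulation, not necessarily the paper's exact loss):

```python
import numpy as np

def attention_consistency(vis_attn, av_attn, eps=1e-8):
    """Score the disagreement between two H x W attention maps:
    0 when the maps point at the same regions, approaching 2 when
    they are maximally opposed (1 for non-overlapping maps)."""
    v, a = vis_attn.ravel(), av_attn.ravel()
    cos = float(v @ a) / (np.linalg.norm(v) * np.linalg.norm(a) + eps)
    return 1.0 - cos
```

Training to minimize such a term encourages the visual pathway to attend to the regions that the acoustic signal highlights, e.g. the on-screen sound source.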

Visual Representations of Physiological Signals for Fake Video Detection [article]

Kalin Stefanov, Bhawna Paliwal, Abhinav Dhall
2022 arXiv   pre-print
This paper presents a multimodal learning-based method for detection of real and fake videos. The method combines information from three modalities - audio, video, and physiology.  ...  Both strategies for combining the two modalities rely on a novel method for generation of visual representations of physiological signals.  ...  The proposed visual representations are either used to augment the original face crops or the relationship between the face crops and the proposed visual representations is learned from data through a  ... 
arXiv:2207.08380v1 fatcat:2iqcewzorjepdg5hi5onwrnice
Showing results 1 — 15 out of 19,512 results