2,009 Hits in 7.2 sec

Self-Supervised Generation of Spatial Audio for 360 Video [article]

Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang
2018 arXiv   pre-print
Spatial audio is an important component of immersive 360 video viewing, but spatial audio microphones are still rare in current 360 video production.  ...  During training, ground-truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network.  ...  Inspired by these approaches, we propose a self-supervised technique for audio spatialization.  ... 
arXiv:1809.02587v1 fatcat:2at6tsjuujaede254ueasdw3fu

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound [article]

Karren Yang, Bryan Russell, Justin Salamon
2020 arXiv   pre-print
We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.  ...  Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.  ...  Acknowledgements The authors would like to thank Andrew Owens for helpful discussions in the early stages of the project.  ... 
arXiv:2006.06175v2 fatcat:nz6y75x5rrgjtpdw42gklpeppm

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound

Karren Yang, Bryan Russell, Justin Salamon
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.  ...  Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs.  ...  To address this constraint, we introduce a generalization of our audio-visual correspondence task that learns strong spatial audio cues from 360-degree videos with real spatial audio in a self-supervised  ... 
doi:10.1109/cvpr42600.2020.00995 dblp:conf/cvpr/YangRS20 fatcat:dutpxdtjgbferc45ui6nyu2ycy

Learning Representations from Audio-Visual Spatial Alignment [article]

Pedro Morgado, Yi Li, Nuno Vasconcelos
2020 arXiv   pre-print
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.  ...  To learn from these spatial cues, we tasked a network to perform contrastive audio-visual spatial alignment of 360 video and spatial audio.  ...  Thus, even self-supervised models reflect the biases in the collection process. To mitigate collection biases, we searched for 360 videos using queries translated into multiple languages.  ... 
arXiv:2011.01819v1 fatcat:mjof6zfkrffgnprsll3y5mg75a

Self-supervised Audio Spatialization with Correspondence Classifier [article]

Yu-Ding Lu, Hsin-Ying Lee, Hung-Yu Tseng, Ming-Hsuan Yang
2019 arXiv   pre-print
In this work, we propose a self-supervised audio spatialization network that can generate spatial audio given the corresponding video and monaural audio.  ...  We collect a large-scale video dataset with spatial audio to validate the proposed method. Experimental results demonstrate the effectiveness of the proposed model on the audio spatialization task.  ...  In this work, we propose an audio spatialization network (ASN), a self-supervised framework for audio spatialization.  ... 
arXiv:1905.05375v1 fatcat:ampzayxugrf5zjwcd6ueaen45q

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling [article]

Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa
2020 arXiv   pre-print
To solve this problem, this paper presents a self-supervised training method using 360 images and multichannel audio signals.  ...  Most conventional self-supervised learning methods use monaural audio signals and images, and cannot distinguish sound-source objects with similar appearances due to the poor spatial information in audio signals  ...  Yu Hoshina for their support in the experiment in Miraikan. This study was partially supported by JSPS KAKENHI No. 18H06490 for funding.  ... 
arXiv:2007.13976v1 fatcat:k4sho4ggnbafbfc3wyhngmg76a

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Yan-Bo Lin, Yu-Chiang Frank Wang
2021 Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)
use of our model for audio spatialization.  ...  of video data with ground truth binaural audio data during training.  ...  Acknowledgements This work is supported in part by the Ministry of Science and Technology of Taiwan under grant MOST 109-2634-F-002-037.  ... 
doi:10.1609/aaai.v35i3.16302 fatcat:4ocbuk7qzbfuvf7y6mhy7agodi

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [article]

Yan-Bo Lin, Yu-Chiang Frank Wang
2021 arXiv   pre-print
use of our model for audio spatialization.  ...  of video data with ground truth binaural audio data during training.  ...  Acknowledgements This work is supported in part by the Ministry of Science and Technology of Taiwan under grant MOST 109-2634-F-002-037.  ... 
arXiv:2105.00708v1 fatcat:f7xkdl3ilzhizidnfbf54u4br4

2.5D Visual Sound

Ruohan Gao, Kristen Grauman
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation.  ...  Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene.  ...  Miller, Jacob Donley, Pablo Hoffmann, Vladimir Tourbabin, Vamsi Ithapu, Varun Nair, Abesh Thakur, Jaime Morales, Chetan Gupta from Facebook, Xinying Hao, Dongguang You, and the UT Austin vision group for  ... 
doi:10.1109/cvpr.2019.00041 dblp:conf/cvpr/GaoG19 fatcat:h6ip5qkyubcgbkdp3qxogr3zsi

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment [article]

Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen
2022 arXiv   pre-print
In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC)  ...  Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial  ...  [13] use object detection and depth maps from 360° video as supervision for an audio network using four pairs of binaural microphones.  ... 
arXiv:2206.00970v1 fatcat:mrgj4sy3a5audmohp5o5rmivky

2.5D Visual Sound [article]

Ruohan Gao, Kristen Grauman
2019 arXiv   pre-print
In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation.  ...  Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene.  ...  Miller, Jacob Donley, Pablo Hoffmann, Vladimir Tourbabin, Vamsi Ithapu, Varun Nair, Abesh Thakur, Jaime Morales, Chetan Gupta from Facebook, Xinying Hao, Dongguang You, and the UT Austin vision group for  ... 
arXiv:1812.04204v4 fatcat:m5nvtzv3c5hzvahykx62c3wsgi

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications [article]

Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon
2019 arXiv   pre-print
automatic camera view panning in 360-degree videos.  ...  To fix this issue, we extend our network to the supervised and semi-supervised network settings via a simple modification due to the general architecture of our two-stream network.  ...  We use the AUTOCAM [48] method to generate a path of the sound source in 360 videos.  ... 
arXiv:1911.09649v1 fatcat:gcan5noupzdkhkooidx73n3lqu

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [article]

Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
2020 arXiv   pre-print
Recent research has explored the usage of visual information as guidance to generate binaural or ambisonic audio from mono ones with stereo supervision.  ...  However, this fully supervised paradigm suffers from an inherent drawback: the recording of stereophonic audio usually requires delicate devices that are expensive for wide accessibility.  ...  This work is supported by SenseTime Group Limited, the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14208619  ... 
arXiv:2007.09902v1 fatcat:ucqdbcdwynhgjg6eqdlwyivip4

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos [article]

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
2021 arXiv   pre-print
360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond pre-determined normal fields of view and display distinctive spatial relations on a sphere.  ...  However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or spherical spatial properties in surroundings.  ...  [8] , panoramic saliency detection [9, 10] , and self-supervised spatial audio generation [11] .  ... 
arXiv:2110.05122v1 fatcat:adni6hekonabpb76ypbizfi6cu

Self-Supervised Moving Vehicle Detection from Audio-Visual Cues [article]

Jannik Zürn, Wolfram Burgard
2022 arXiv   pre-print
To tackle this problem, we propose a self-supervised approach that leverages audio-visual cues to detect moving vehicles in videos.  ...  Robust detection of moving vehicles is a critical task for any autonomously operating outdoor robot or self-driving vehicle.  ...  Self-Supervised Audio-Visual Sound Source Localization The advancement of Deep Learning enabled a multitude of self-supervised approaches for localizing sounds in recent years.  ... 
arXiv:2201.12771v2 fatcat:t7l7mnzi7rem5ksadrwka2vf3m
Showing results 1 — 15 out of 2,009 results