517 Hits in 6.6 sec

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [article]

Haytham M. Fayek, Anurag Kumar
2020 arXiv   pre-print
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.  ...  Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models, and state-of-the-art fusion and multi-modal models  ...  Large-scale sound event detection has been possible primarily through weakly supervised learning [Kumar and Raj, 2016a] and the release of large-scale weakly labeled sound events datasets, such as AudioSet  ... 
arXiv:2006.01595v1 fatcat:caryqox5cfcdlb4tg5lpkvc3ny

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

Haytham M. Fayek, Anurag Kumar
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.  ...  Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models, and state-of-the-art fusion and multi-modal models  ...  Large-scale sound event detection has been possible primarily through weakly supervised learning [Kumar and Raj, 2016a] and the release of large-scale weakly labeled sound events datasets, such as AudioSet  ... 
doi:10.24963/ijcai.2020/78 dblp:conf/ijcai/FayekK20 fatcat:vbc4g6xejvcsfb4bko7xtkxuba

Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

Wim Boes, Hugo Van hamme
2019 Proceedings of the 27th ACM International Conference on Multimedia - MM '19  
We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection  ...  The employed multi-label data contains clip-level annotation indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information.  ...  ACKNOWLEDGMENTS This work is supported by a PhD Fellowship of Research Foundation Flanders (FWO-Vlaanderen).  ... 
doi:10.1145/3343031.3350873 dblp:conf/mm/Boesh19 fatcat:oi66fxokzjhgzjmpiy33daxte4

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [article]

Yapeng Tian, Dingzeyu Li, Chenliang Xu
2020 arXiv   pre-print
Furthermore, we discover and mitigate modality bias and noisy label issues with an individual-guided learning mechanism and label smoothing technique, respectively.  ...  Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.  ...  The weak labels are easier to annotate and can be gathered in a large scale from web videos.  ... 
arXiv:2007.10558v1 fatcat:kcexne6cpbe2tfyimttmercbka

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events [article]

Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard
2018 arXiv   pre-print
We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos.  ...  Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events.  ...  We use the recently introduced dataset for DCASE challenge on large-scale weakly supervised sound event detection for smart cars [36] .  ... 
arXiv:1804.07345v2 fatcat:bxwvb5z2lbff3jft5hxr34ea3q

Audio-Visual Event Localization in Unconstrained Videos [chapter]

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
2018 Lecture Notes in Computer Science  
Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal  ...  In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.  ...  We gratefully acknowledge the gift donations of Markable, Inc., Tencent and the support of NVIDIA Corporation with the donation of the GPUs used for this research.  ... 
doi:10.1007/978-3-030-01216-8_16 fatcat:t4dbgoypsnaixmrtrrtrwcv2sy

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [article]

Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou
2020 arXiv   pre-print
Experimental results in both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the location of sounding objects of  ...  In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.  ...  While in our work, we alternatively use audiovisual correspondence and pseudo labels from clustering to boost audiovisual learning and learn object representations.  ... 
arXiv:2010.05466v1 fatcat:nyuc5qnrrjgnfcqpforcw3liri

Dual-modality seq2seq network for audio-visual event localization [article]

Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang
2020 arXiv   pre-print
fully supervised or weakly supervised settings.  ...  Empirical results confirm that our proposed method performs favorably against recent deep learning approaches in both settings.  ...  To better learn the visual and audio embedding features, our CNNs are learned from large-scale datasets (ImageNet [15] and AudioSet [16]), which have been shown highly useful for vision and audition tasks  ... 
arXiv:1902.07473v2 fatcat:3xrodudh2vfy3ne5tzaxof7pwa

Content-based analysis for accessing audiovisual archives: Alternatives for concept-based indexing and search

Tinne Tuytelaars
2012 2012 13th International Workshop on Image Analysis for Multimedia Interactive Services  
This includes i) the use of knowledge modeling to bridge the semantic gap; ii) on-the-fly learning of new, user-defined concepts; and iii) weakly supervised methods that learn from associated text data.  ...  Huge amounts of audiovisual material have been digitized recently, resulting in a great source of information relevant both from a cultural and historical point of view.  ...  CONCLUSIONS In this paper we have discussed automatic tools for content-based analysis of audiovisual content, with the purpose of opening up large scale multimedia archives.  ... 
doi:10.1109/wiamis.2012.6226770 dblp:conf/wiamis/Tuytelaars12 fatcat:eprudqnimrg4rjtyqauvf6mspy

Limitations of weak labels for embedding and tagging [article]

Nicolas Turpault
2020 arXiv   pre-print
Many datasets and approaches in ambient sound analysis use weakly labeled data. Weak labels are employed because annotating every data sample with a strong label is too expensive. Yet, their impact on the  ...  to weakly labeled data.  ...  Most current approaches rely on training "big" classifiers in an end-to-end fashion on large-scale labeled data.  ... 
arXiv:2002.01687v4 fatcat:posq5tqx2nc4hnhotskkugy5bu

Limitations of Weak Labels for Embedding and Tagging

Nicolas Turpault, Romain Serizel, Emmanuel Vincent
2020 ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Most current approaches rely on training "big" classifiers in an end-to-end fashion on large-scale labeled data.  ...  The use of weakly labeled data to train a sound event detection system, which outputs labels together with their time localization in a segment of audio, has been studied in recent years [15] [16] [17  ... 
doi:10.1109/icassp40776.2020.9053160 dblp:conf/icassp/TurpaultSV20 fatcat:w2dstraorzhlvoio7pg34jriqm

Cross modal video representations for weakly supervised active speaker localization [article]

Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
2021 arXiv   pre-print
This is however a challenging problem due to the vast variety and contextual variability in the media content, and the lack of labeled data.  ...  Avoiding the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, we present a weakly supervised system for the task of localizing active speakers in  ...  Supervised modeling of such videos requires large amounts of (labeled) data.  ... 
arXiv:2003.04358v2 fatcat:bpdufkl34zf53mui3atxo76k74

Learning in Audio-visual Context: A Review, Analysis, and New Perspective [article]

Yake Wei, Di Hu, Yapeng Tian, Xuelong Li
2022 arXiv   pre-print
Overall, this survey reviews and offers an outlook on the current audio-visual learning field from different aspects. We hope it can provide researchers with a better understanding of this area.  ...  future direction of the audio-visual learning area.  ...  These novel model architectures bring the ability of audio-visual speech recognition to a new peak, while the dependence of deep learning methods on large-scale available data makes the cost of labelling  ... 
arXiv:2208.09579v1 fatcat:xrjedf2ezbhbzbkysw2z2jsm7e

A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging [article]

Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, Samarjit Das
2018 arXiv   pre-print
It is trained with the audio tracks of a large collection of weakly labeled YouTube video excerpts; the video branch uses pretrained state-of-the-art image recognition networks and word embeddings to extract  ...  The lack of strong labels has severely limited the ability of state-of-the-art fully supervised audio tagging systems to scale to larger datasets.  ...  Compared with strongly labeled datasets, weakly labeled datasets are much less expensive to collect at scale and can cover a wider range of sound event types.  ... 
arXiv:1712.09680v2 fatcat:mkwhw4ths5a3hcaedot2tqmjiq

Visually-aware Acoustic Event Detection using Heterogeneous Graphs [article]

Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
2022 arXiv   pre-print
Our model can easily be adapted to different scales of events through relevant hyperparameters. Experiments on AudioSet, a large benchmark, show that our model achieves state-of-the-art performance.  ...  We use heterogeneous graph approaches to address the task of visually-aware acoustic event classification; graphs serve as a compact, efficient and scalable way to represent data.  ...  Dataset We use a large scale weakly labelled dataset AudioSet [30] , which contains audio segments from YouTube videos.  ... 
arXiv:2207.07935v1 fatcat:3nxickdx7vghncq2bf75jotwf4
Showing results 1 — 15 out of 517 results