2021 Index IEEE Transactions on Multimedia Vol. 23

2021 IEEE transactions on multimedia  
The Author Index contains the primary entry for each item, listed under the first author's name.  ...  -that appeared in this periodical during 2021, and items from previous years that were commented upon or corrected in 2021.  ...  ., +, TMM 2021 624-635 C-GCN: Correlation Based Graph Convolutional Network for Audio-Video Emotion Recognition.  ... 
doi:10.1109/tmm.2022.3141947 fatcat:lil2nf3vd5ehbfgtslulu7y3lq

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques [article]

Grzegorz Chrupała
2021 arXiv   pre-print
The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas.  ...  This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years.  ...  instructional cooking videos and explore the various degrees of supervision applied to sampling audio-visual fragments.  ... 
arXiv:2104.13225v3 fatcat:edodewkhljbqtpcrm2knd2zw7i

Visually Grounded Models of Spoken Language: A Survey of Datasets, Architectures and Evaluation Techniques

Grzegorz Chrupała
2022 The Journal of Artificial Intelligence Research  
The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas.  ...  This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years.  ...  Then the network (which is trained on the Places dataset in the regular fashion) computes pairwise similarities between each audio candidate and each visual candidate in the embedding space.  ... 
doi:10.1613/jair.1.12967 fatcat:zib2mr5wkjdmteyrgac6gxekli

Table of Contents

2021 IEEE transactions on multimedia  
Fu Image/Video/Graphics Analysis and Synthesis Attentive Composite Residual Network for Robust Rain Removal from Single Images . . . . . . Y. Que, S. Li, and H. J.  ...  Zhou Multimedia Search and Retrieval Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval . . . . . . . .W. Wang, J. Gao, X. Yang, and C.  ... 
doi:10.1109/tmm.2021.3132246 fatcat:el7u2udtybddrpbl5gxkvfricy

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Wenhao Chai, Gaoang Wang
2022 Applied Sciences  
Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors.  ...  Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field.  ...  [70] use video and audio signals to extract features and design a contrastive loss, and then fuse the video and audio features with text features for a contrastive loss.  ... 
doi:10.3390/app12136588 fatcat:bokdxwkcwbgjlpblfrwbj4mtxm

Foley Music: Learning to Generate Music from Videos [chapter]

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
2020 Lecture Notes in Computer Science  
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.  ...  In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.  ...  This work is supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM, NSF STC award CCF-1231216), and IBM Research.  ... 
doi:10.1007/978-3-030-58621-8_44 fatcat:7rcvic77mjbkxmrmx4r6vgvw3i

Foley Music: Learning to Generate Music from Videos [article]

Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
2020 arXiv   pre-print
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.  ...  In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.  ...  This work is supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM, NSF STC award CCF-1231216), and IBM Research.  ... 
arXiv:2007.10984v1 fatcat:a5ktcxsufnftvdtqb7j4rmnc44

Self-supervised Audiovisual Representation Learning for Remote Sensing Data [article]

Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu
2021 arXiv   pre-print
In order to contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks.  ...  By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation.  ...  training a student network for the audio (visual) modality [7], [8].  ... 
arXiv:2108.00688v1 fatcat:bhvcwavkibhxfmezayic5yryfe

Creating A Multi-track Classical Music Performance Dataset for Multi-modal Music Analysis: Challenges, Insights, and Applications

Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, Gaurav Sharma
2018 IEEE transactions on multimedia  
For each piece, we provide the musical score in MIDI format, the audio recordings of the individual tracks, the audio and video recording of the assembled mixture, and ground-truth annotation files including  ...  We introduce a dataset for facilitating audio-visual analysis of music performances.  ...  Ming-Lun Lee and Yunn-Shan Ma for participating in our various attempts at crossplayer synchronization, Andrea Cogliati for recording the conducting videos, and Prof.  ... 
doi:10.1109/tmm.2018.2856090 fatcat:lh5idozlsncfxdimd6x455w65m

Multimodal Image Synthesis and Editing: A Survey [article]

Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, Eric Xing
2022 arXiv   pre-print
Instead of providing explicit guidance for network training, multimodal guidance offers intuitive and flexible means for image synthesis and editing.  ...  As information exists in various modalities in real world, effective interaction and fusion among multimodal information plays a key role for the creation and perception of multimodal data in computer  ...  ACKNOWLEDGMENTS This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry  ... 
arXiv:2112.13592v3 fatcat:46twjhz3hbe6rpm33k6ilnisga

Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [article]

Alexander Schindler
2020 arXiv   pre-print
Additionally, new visual features are introduced capturing rhythmic visual patterns. In all of these experiments the audio-based results serve as benchmark for the visual and audio-visual approaches.  ...  Evaluations range from low-level visual features to high-level concepts retrieved by means of Deep Convolutional Neural Networks.  ...  An audio-visual approach to segmentations of music videos was proposed in [82], including an evaluation of audio-visual correlations with an intended application in audio retrieval from video.  ... 
arXiv:2002.00251v1 fatcat:6cz6rivc3fbg7fahdsnokxfrk4

Deep Cross-Modal Audio-Visual Generation [article]

Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
2017 arXiv   pre-print
We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation.  ...  Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances.  ...  [9] learn an audio-visual bimodal compositional model using sparse coding.  ... 
arXiv:1704.08292v1 fatcat:u4pnfius5ja4pasp7lmlru7o3u

Applications of video-content analysis and retrieval

N. Dimitrova, Hong-Jiang Zhang, B. Shahraray, I. Sezan, T. Huang, A. Zakhor
2002 IEEE Multimedia  
Content magnets attracting story segments in the content-based personal video recorder Video Scout application.  ...  Therefore, the core research in content-based video retrieval is developing technologies to automatically parse video, audio, and text to identify meaningful composition structure and to extract and represent  ...  We can easily retrieve such a compact presentation over low-bandwidth communications networks. When more bandwidth is available, the presentation can include audio and video information.  ... 
doi:10.1109/mmul.2002.1022858 fatcat:bur6qinxrvfvveeybffaptsod4

VLP: A Survey on Vision-Language Pre-training [article]

Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu
2022 arXiv   pre-print
This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training.  ...  Researchers have explored this problem and made significant progress.  ...  Research on the energy-efficient Spiking Neural Networks [155, 156, 157] in the brain-inspired field may also provide insights into the exploration of novel VLP architectures.  ... 
arXiv:2202.09061v4 fatcat:wgpc6bxlsjddjltwgge6q366f4

Deep Cross-Modal Audio-Visual Generation

Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu
2017 Proceedings of the on Thematic Workshops of ACM Multimedia 2017 - Thematic Workshops '17  
We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation.  ...  Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances.  ...  ACKNOWLEDGEMENT We would like to thank Bochen Li and Yichi Zhang, Department of ECE, University of Rochester, for helpful suggestions and help with the URMP dataset.  ... 
doi:10.1145/3126686.3126723 dblp:conf/mm/ChenSDX17 fatcat:avrabhiokfed3kzm2ue6hgb7kq