Query by Video: Cross-modal Music Retrieval

Bochen Li, Aparna Kumar
2019 Zenodo  
To retrieve music for an input video, the trained model ranks tracks in the music database by cross-modal distances to the query video.  ...  We also present cross-modal music retrieval experiments on Spotify music using user-generated videos from Instagram and Youtube as queries, and subjective evaluations show that the proposed model can retrieve  ...  "Query by Video: Cross-modal Music Retrieval", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.  ... 
doi:10.5281/zenodo.3527881 fatcat:cwwcc6objbca7puhyt5rbxln6u

Deep Music Retrieval for Fine-Grained Videos by Exploiting Cross-Modal-Encoded Voice-Overs [article]

Tingtian Li, Zixun Sun, Haoruo Zhang, Jin Li, Ziming Wu, Hui Zhan, Yipeng Yu, Hengcan Shi
2021 arXiv   pre-print
However, existing video-music retrieval methods only based on the visual modality cannot show promising performance regarding videos with fine-grained virtual contents.  ...  Recently, the witness of the rapidly growing popularity of short videos on different Internet platforms has intensified the need for a background music (BGM) retrieval system.  ...  A proper video-music retrieval system can retrieve reasonable BGM for ordinary users to enhance the video phenomena and lower the video production barrier.  ... 
arXiv:2104.10557v1 fatcat:whamnued5bfdrax7eh3ucrrrea

Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval [article]

Yi Yu, Suhua Tang, Francisco Raposo, Lei Chen
2017 arXiv   pre-print
Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.  ...  Particularly, our end-to-end deep architecture contains two properties: simultaneously implementing feature learning and cross-modal correlation learning, and learning joint representation by considering  ...  Co-occurring changes in audio and video content of music videos can be detected, where the correlations can be used in cross-modal audio-visual music retrieval.  ... 
arXiv:1711.08976v2 fatcat:m5uk6lbadrcanpb3prxfv7lueu

Using weakly aligned score–audio pairs to train deep chroma models for cross-modal music retrieval

Frank Zalkow, Meinard Müller
2020 Zenodo  
We then apply this model to a cross-modal retrieval task, where we aim at finding relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a  ...  query.  ...  Acknowledgments: Frank Zalkow and Meinard Müller are supported by the German Research Foundation (DFG-MU 2686/11-1, MU 2686/12-1).  ... 
doi:10.5281/zenodo.4245399 fatcat:rtla4xknznb7fbgl4hgsf4itqa

Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint [article]

Sungeun Hong, Woobin Im, Hyun S. Yang
2017 arXiv   pre-print
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.  ...  This paper introduces a new content-based, cross-modal retrieval method for video and music that is implemented through deep neural networks.  ...  cross-modal item to the query in the embedding space of our VM-NET model.  ... 
arXiv:1704.06761v2 fatcat:rmno6ua55bhhnbz3kxzlt7ktge

Cross-modal Embeddings for Video and Audio Retrieval [chapter]

Didac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto
2019 Lecture Notes in Computer Science  
These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio.  ...  The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.  ...  The present work is focused on using the information present in each modality to create a joint embedding space to perform cross-modal retrieval.  ... 
doi:10.1007/978-3-030-11018-5_62 fatcat:4noobvtqhrdlljrul6cd3m2o3y

Cross-modal Embeddings for Video and Audio Retrieval [article]

Didac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto
2018 arXiv   pre-print
These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio.  ...  The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.  ...  Amanda Duarte was funded by the mobility grant of the Severo Ochoa Program at Barcelona Supercomputing Center (BSC-CNS).  ... 
arXiv:1801.02200v1 fatcat:24brpnqsyjf4povlekxcelbtoi

Cross-Modal Music-Video Recommendation: A Study of Design Choices

Laure Pretet, Gael Richard, Geoffroy Peeters
2021 2021 International Joint Conference on Neural Networks (IJCNN)  
In this work, we build upon a recent video-music retrieval system (the VM-NET), which originally relies on an audio representation obtained by a set of statistics computed over handcrafted features.  ...  More precisely, we jointly learn audio and video embeddings by using their co-occurrence in music-video clips.  ...  Use of audio embeddings for cross-modal recommendation: Music-video embeddings can also be used for cross-modal recommendation or retrieval tasks.  ... 
doi:10.1109/ijcnn52387.2021.9533662 fatcat:m7blu353drephjgylms2oc6vci

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding [article]

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
2021 arXiv   pre-print
Existing pre-training are task-specific by adopting either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks or more complex multitask learning  ...  with two unimodal encoders, limiting early cross-modal fusion.  ...  It has open domain video clips, and each training clip has 20 captioning sentences labeled by humans. There are 200K clip-text pairs from 10K videos in 20 categories, including sports, music, etc.  ... 
arXiv:2105.09996v3 fatcat:3rlayivh3fawxjjpmy7qcxikxi

The bad and the good singer: query tuning analysis for audio to audio Query by Humming

Filippo Morelli, Emilia Gómez, Justin Salamon
2013 Zenodo  
A tuning quality evaluation algorithm meant to be used as a user advising subsystem for Query-by-Humming applications is designed and presented.  ...  A preexisting tuning quality expert annotation is expanded by collecting new evaluations in a specific experiment.  ...  The task of interpreting the user vocal query and retrieving a corresponding target song is called Query-by-Humming (QbH) or Query-by-Singing/Humming (QbSH) [3].  ... 
doi:10.5281/zenodo.3754284 fatcat:xtrfaw4qn5chnlwerjlri2n7ua

It's Time for Artistic Correspondence in Music and Video [article]

Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
2022 arXiv   pre-print
each modality.  ...  For instance, we can condition music retrieval based on visually defined attributes.  ...  Cross-modal self-supervision. The music and visual tracks in videos have a strong correspondence. The music that plays on top of the video is artistically related to the content of the video.  ... 
arXiv:2206.07148v1 fatcat:5kiflxr34vayxbcmcnjlg5dniu

Learning Video-Text Aligned Representations for Video Captioning

Yaya Shi, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha
2022 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
Then, we employ an alignment unit with the input of the video and retrieved sentences to conduct the video-text alignment.  ...  The representations of two modal inputs are aligned in a shared semantic space. The obtained video-text aligned representations are used to generate semantically correct captions.  ...  Moreover, our work employ the cross-modal retrieval model and focus on cross-modal task which is more complex and challenging.  ... 
doi:10.1145/3546828 fatcat:3whtdi2aajh25ossrv2kz4vdzm

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection [article]

Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, Xiaohu Qie
2022 arXiv   pre-print
Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era.  ...  As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and tackles moment retrieval as  ...  Acknowledgements This research is supported in part by Key-Area Research and Development Program of Guangdong Province, China with Grant 2019B010155002 and financial support from ARC Lab, Tencent PCG.  ... 
arXiv:2203.12745v2 fatcat:hyaxgsmncjdmrebqrzqcukpmym

Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation [article]

Jing Yi, Yaochen Zhu, Jiayi Xie, Zhenzhong Chen
2021 arXiv   pre-print
alignment of two corresponding embeddings of a matched video-music pair is achieved by cross-generation.  ...  In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation.  ...  Figure 5: Qualitative assessment of CMVAE for micro-video background music recommendation by visualizing some examples of query videos in the test set and the retrieved music clips ranked by their matching  ... 
arXiv:2107.07268v1 fatcat:5w7v7ywuqjhu3cklth4hlujsx4

Suggesting Sounds for Images from Video Collections [chapter]

Matthias Solèr, Jean-Charles Bazin, Oliver Wang, Andreas Krause, Alexander Sorkine-Hornung
2016 Lecture Notes in Computer Science  
In this paper we aim to retrieve sounds corresponding to a query image.  ...  To solve this challenging task, our approach exploits the correlation between the audio and visual modalities in video collections.  ...  The parameters of the filters are set by cross-validation.  ... 
doi:10.1007/978-3-319-48881-3_59 fatcat:wwlv3vvldrgvjctvmw2hsq5bia