Cross-modal Embeddings for Video and Audio Retrieval [chapter]

Didac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto
2019 Lecture Notes in Computer Science  
In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
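To illustrate the idea described in the abstract, the following is a minimal sketch (not the authors' code) of projecting precomputed audio and visual features into a shared embedding space and evaluating retrieval with Recall@K. The feature dimensions (1024-d visual, 128-d audio, as released with YouTube-8M), the projection networks, the embedding size, and the in-batch contrastive loss are all illustrative assumptions; the paper may use a different architecture and training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    """Maps a modality-specific feature vector into the shared embedding space (assumed 2-layer MLP)."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings for cosine similarity

video_proj = Projection(in_dim=1024)  # visual features (assumed 1024-d, as in YouTube-8M)
audio_proj = Projection(in_dim=128)   # audio features (assumed 128-d, as in YouTube-8M)
optim = torch.optim.Adam(
    list(video_proj.parameters()) + list(audio_proj.parameters()), lr=1e-4
)

def training_step(video_feats, audio_feats, temperature=0.07):
    """One step: align paired audio/visual embeddings using an in-batch contrastive
    loss (a stand-in choice; the paper's exact loss is not given in the abstract)."""
    v = video_proj(video_feats)
    a = audio_proj(audio_feats)
    logits = v @ a.T / temperature                      # pairwise cosine similarities
    targets = torch.arange(v.size(0))                   # matching pairs lie on the diagonal
    loss = F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

@torch.no_grad()
def recall_at_k(query_emb, gallery_emb, k=10):
    """Fraction of queries whose true match (same index) appears among the top-k retrieved items."""
    sims = query_emb @ gallery_emb.T
    topk = sims.topk(k, dim=1).indices
    targets = torch.arange(query_emb.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# Example with random stand-in features for N paired videos:
N = 512
video_feats = torch.randn(N, 1024)
audio_feats = torch.randn(N, 128)
training_step(video_feats, audio_feats)
print("audio->video Recall@10:", recall_at_k(audio_proj(audio_feats), video_proj(video_feats)))
```

After training, retrieving audio for a silent video (or images for a query audio) reduces to a nearest-neighbour search in the shared space, which is what Recall@K measures here.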
doi:10.1007/978-3-030-11018-5_62