Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos
[article]
2020
arXiv
pre-print
In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images, spoken narratives, and textual narratives. The proposed methodology starts from a baseline system that produces an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchen and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training
arXiv:2006.00785v1
fatcat:yee3epxdozhgdalg3tg3d7zxwq