Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text
[article]
2020
arXiv pre-print
We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI), we embed the corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs.
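The abstract's pipeline, embedding per-track text with LSI and ranking tracks by latent-space similarity to pick positive and negative examples, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the toy track descriptions, the number of LSI components, and the anchor/positive/negative selection rule are all assumptions made here for demonstration.

```python
# Hypothetical sketch of LSI-based triplet selection from track text.
# All data and parameter choices below are illustrative, not from the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy textual metadata, one document per audio track (assumed format).
docs = [
    "upbeat electronic dance track with heavy bass",
    "energetic edm song with synth bass and drums",
    "slow acoustic folk ballad with guitar",
]

# LSI = TF-IDF term weighting followed by truncated SVD into a latent topic space.
tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Track relatedness: pairwise cosine similarity in the latent space.
sim = cosine_similarity(lsi)

# For an anchor track, take the most related other track as the positive
# and the least related as the negative; a triplet loss would then pull the
# anchor's audio embedding toward the positive and away from the negative.
anchor = 0
others = [i for i in range(len(docs)) if i != anchor]
positive = max(others, key=lambda i: sim[anchor, i])
negative = min(others, key=lambda i: sim[anchor, i])
print("triplet:", (anchor, positive, negative))
```

In an online-triplet-selection setting, the same similarity matrix would be consulted per training batch to mine informative pairs rather than fixing triplets in advance.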
arXiv:2003.12265v1
fatcat:s7hw33hhk5ho3jsvpgqhi4hfvm