One Perceptron to Rule Them All: Language, Vision, Audio and Speech
Proceedings of the 2020 International Conference on Multimedia Retrieval
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representation. This tutorial will firstly review the basic neural architectures to encode and decode vision, text and audio, to later review the
doi:10.1145/3372278.3390740 dblp:conf/mir/Giro-i-Nieto20 fatcat:fal5cxrkojgybasiy35xlt6cfy