Representations of language in a model of visually grounded speech signal

Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of the speech signal, and show that it learns to extract both form- and meaning-based linguistic knowledge from the input. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become richer as we go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.
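To make the described architecture concrete, here is a minimal numpy sketch of the two ingredients the abstract names: a recurrent highway layer that consumes acoustic frames one time step at a time, and a joint semantic space in which an utterance embedding is scored against an image embedding by cosine similarity. All names, dimensions, and the random "image embedding" are illustrative assumptions, not the paper's actual configuration (the full model stacks several such layers and is trained with a ranking objective).

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID = 13, 32  # hypothetical acoustic-frame and hidden sizes

def init_layer():
    # Random parameters for one recurrent highway layer (illustrative only).
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in [
        ("Wh", (D_HID, D_IN)), ("Rh", (D_HID, D_HID)), ("bh", (D_HID,)),
        ("Wt", (D_HID, D_IN)), ("Rt", (D_HID, D_HID)), ("bt", (D_HID,))]}

def rhn_step(x, s, p, depth=2):
    # One time step of a recurrent highway layer: `depth` highway
    # micro-steps refine the state s; the input x feeds only the first.
    for l in range(depth):
        inp = x if l == 0 else np.zeros_like(x)
        h = np.tanh(p["Wh"] @ inp + p["Rh"] @ s + p["bh"])               # candidate state
        t = 1.0 / (1.0 + np.exp(-(p["Wt"] @ inp + p["Rt"] @ s + p["bt"])))  # transform gate
        s = t * h + (1.0 - t) * s                                        # highway carry
    return s

def encode_utterance(frames, layer):
    # Run the layer over the frame sequence; use the final state,
    # L2-normalised, as the utterance embedding in the joint space.
    s = np.zeros(D_HID)
    for x in frames:
        s = rhn_step(x, s, layer)
    return s / np.linalg.norm(s)

layer = init_layer()
utt = encode_utterance(rng.normal(size=(20, D_IN)), layer)
img = rng.normal(size=D_HID)
img /= np.linalg.norm(img)  # stand-in image embedding
score = float(utt @ img)    # cosine similarity in the joint space
```

In the paper the utterance and image encoders are trained jointly so that matching pairs score higher than mismatched ones; here the score over random parameters merely shows the data flow.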
doi:10.18653/v1/p17-1057 dblp:conf/acl/ChrupalaGA17