1,239 Hits in 4.7 sec

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech [article]

David Harwath, Galen Chuang, James Glass
2018 arXiv   pre-print
Using spoken captions collected in English and Hindi, we show that the same model architecture can be successfully applied to both languages.  ...  Further, we demonstrate that training a multilingual model simultaneously on both languages offers improved performance over the monolingual models.  ...  We demonstrated that multilingual variants of these models can outperform their monolingual counterparts for speech/image association, and also provided evidence that a shared visual context can dramatically  ... 
arXiv:1804.03052v1 fatcat:3hulp36a7nd5rlbf7l4zskmn7y

Multimodal Attention for Neural Machine Translation [article]

Ozan Caglayan, Loïc Barrault, Fethi Bougares
2016 arXiv   pre-print
We train several variants of our proposed attention mechanism on the Multi30k multilingual image captioning dataset.  ...  Recently, the effectiveness of attention has also been explored in the context of image captioning.  ...  from a source language model using the IAPR-TC12 multilingual image captioning dataset.  ... 
arXiv:1609.03976v1 fatcat:2j35yvu6yncqlmhpitz2lpl5sm

Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [article]

Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vulić, Iryna Gurevych
2021 arXiv   pre-print
To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient  ...  and objects in an image.  ...  For multilingual experiments, we use the standard Multi30k dataset [14, 13, 5] , which extends Flickr30k with 5 German captions and one French and Czech caption per image.  ... 
arXiv:2103.11920v1 fatcat:fzlzr23ki5dena4skps7tf3vnq

Language Learning Using Speech to Image Retrieval

Danny Merkx, Stefan L. Frank, Mirjam Ernestus
2019 Interspeech 2019  
Using a combination of a multi-layer GRU, importance sampling, cyclic learning rates, ensembling and vectorial self-attention, our results show a remarkable increase in image-caption retrieval performance  ...  Furthermore, we investigate which layers in the model learn to recognise words in the input.  ...  In the current study we present an image-caption retrieval model that extends our previous work to spoken input. In [12, 13] , the authors adapted text-based caption-image retrieval (e.g.  ... 
doi:10.21437/interspeech.2019-3067 dblp:conf/interspeech/MerkxFE19 fatcat:bgyjqvufkbf2hfs4wof2lvxktu

Evaluation of multilingual websites using localization matrix

Sarika Katkade, Jayashree Katti, Chandrakant Dhutadmal
2017 Proceedings of the Second International Conference on Research in Intelligent and Computing in Engineering  
Organizations want to maintain a global image, which requires localizing their websites for local users.  ...  People can communicate via multilingual websites, which require browsing separate screens for each language; this has become one of the most important areas opened up by the era of electronic business.  ...  Dynamically generated localization: an international web application treats web text structure and image captions as localizable elements, just as it treats actual web data elements such as financial values.  ... 
doi:10.15439/2017r122 dblp:conf/rice/KatkadeKD17 fatcat:k3uji2uzwvdoxmdbltwup7qvwa

Multimodal Machine Translation through Visuals and Speech

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann
2019 Zenodo  
These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language.  ...  Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.  ...  The benchmark is structurally similar to the multilingual image caption datasets commonly used by contemporary image-guided translation systems.  ... 
doi:10.5281/zenodo.3690791 fatcat:otdy5i33fzfsnnbb3xgb6zph6q

Multimodal machine translation through visuals and speech

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann
2020 Machine Translation  
These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language.  ...  Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.  ...  The benchmark is structurally similar to the multilingual image caption datasets commonly used by contemporary image-guided translation systems.  ... 
doi:10.1007/s10590-020-09250-0 fatcat:jod3ghcsnnbipotcqp6sme4lna

"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks [article]

Mohammad Sadegh Rasooli, Chris Callison-Burch, Derry Tanti Wijaya
2021 arXiv   pre-print
Our captioning results on Arabic are slightly better than those of the supervised model.  ...  Moreover, we tailor our wikily supervised translation models to unsupervised image captioning and cross-lingual dependency parser transfer.  ...  To facilitate multi-task learning with image captioning, our model has an image encoder that is used in cases of image captioning (more details in §4.1).  ... 
arXiv:2104.08384v2 fatcat:vswaqg27mve4fpepwuxqzougru

Multimodal Machine Translation with Reinforcement Learning [article]

Xin Qian, Ziyi Zhong, Jieli Zhou
2018 arXiv   pre-print
We evaluate our proposed algorithm on the Multi30K multilingual English-German image description dataset and the Flickr30K image entity dataset.  ...  Our model takes two channels of inputs, image and text, uses translation evaluation metrics as training rewards, and achieves better results than supervised learning MLE baseline models.  ...  In other words, the task can be decomposed into a translation task and an image captioning task.  ... 
arXiv:1805.02356v1 fatcat:ythbzdbamvashmmg7k4a55sl2m

Textual Supervision for Visually Grounded Spoken Language Understanding [article]

Bertrand Higy, Desmond Elliott, Grzegorz Chrupała
2020 arXiv   pre-print
Recent work showed that these models can be improved if transcriptions are available at training time.  ...  With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions, but more data is needed to obtain similar results.  ...  We can see in Figure 3 (left) that, as the amount of transcribed data decreases, the score of the text-image and pipeline models progressively goes toward 0% of R@10.  ... 
arXiv:2010.02806v2 fatcat:npcuflmooveozey4lktptsk2ua

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques [article]

Grzegorz Chrupała
2021 arXiv   pre-print
Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring  ...  This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years.  ...  They find that a multilingual model trained on both languages outperforms monolingual models, and also show the feasibility of semantic cross-lingual speech-to-speech retrieval using a multilingual model  ... 
arXiv:2104.13225v3 fatcat:edodewkhljbqtpcrm2knd2zw7i

A Rapid Review of Image Captioning

Adriyendi Adriyendi
2021 Journal of Information Technology and Computer Science  
Furthermore, we also build applications with 3 application models. We also provide research opinions on trends and future research that can be developed with image caption generation.  ...  Image captioning can be further developed on computer vision versus human vision.  ...  Novel object captioning can produce object descriptions in text for images that are not in the dataset. Steps: (1) Separate lexical classifications and language models.  ... 
doi:10.25126/jitecs.202162316 fatcat:jebopkpe65gr3puusjzr4yzegy

From Show to Tell: A Survey on Deep Learning-based Image Captioning [article]

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara
2021 arXiv   pre-print
For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences.  ...  However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet.  ...  We also want to thank the authors who provided us with the captions and model weights for some of the surveyed approaches.  ... 
arXiv:2107.06912v3 fatcat:ezhutcovnvh4reiweedfmxjlve

CLIPScore: A Reference-free Evaluation Metric for Image Captioning [article]

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
2021 arXiv   pre-print
In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation  ...  of image captioning without the need for references.  ...  But, references can be expensive to collect, and comparing reference captions (e.g., "Two dogs are running towards each other across the sand." vs. "Two dogs are running towards each other on a beach.")  ... 
arXiv:2104.08718v2 fatcat:m5e3ze5jvfdodaejsoapztgmsa

Translation in the language classroom: Multilingualism, diversity, collaboration

Gioia Panzarella, Caterina Sinibaldi
2018 EuroAmerican Journal of Applied Linguistics and Languages  
The aim of this article is to discuss the ways in which translation can be used to foster multilingual competence and intercultural awareness in the foreign language classroom.  ...  The translation activities that have been selected place emphasis on collaboration and are designed to challenge cultural stereotypes, as well as monolingual and monocultural assumptions.  ...  to the broader linguistic competences that students can acquire through multilingual practices.  ... 
doi:10.21283/2376905x.9.140 fatcat:hnrh5dojbnc23hvjracyzglrdi
Showing results 1 — 15 out of 1,239 results