Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding [article]

Dexin Wang, Deyi Xiong
In this paper, we propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation.  ...  Visual context provides grounding information for multimodal machine translation (MMT).  ...  We would like to thank the anonymous reviewers for their insightful comments. The corresponding author is Deyi Xiong (  ... 
doi:10.48550/arxiv.2101.05208 fatcat:g4xsjrqr55b6jc7rj5ol4py37m

Image-to-Image Translation: Methods and Applications [article]

Yingxue Pang, Jianxin Lin, Tao Qin, Zhibo Chen
2021 arXiv   pre-print
Image-to-image translation (I2I) aims to transfer images from a source domain to a target domain while preserving the content representations.  ...  2018 unpaired No instance-level UI2I; segmentation mask; cyclic loss; INIT[103] 2019 unpaired Yes instance-level UI2I; object+global; cyclic loss; DUNIT 2020 unpaired Yes instance-level  ...  [143] argue that I2I can barely perform shape changes, remove objects or ignore irrelevant texture because of the strict pixel-level constraint of cycle-consistent loss.  ... 
arXiv:2101.08629v2 fatcat:i6pywjwnvnhp3i7cmgza2slnle

Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking

Rajarshi Biswas, Michael Barz, Daniel Sonntag
2020 Künstliche Intelligenz  
We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object specific salient regions of the input image.  ...  Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods.  ...  This enables users to easily change the focus for the generation process, e.g., if the model wrongly puts emphasis on an irrelevant object.  ... 
doi:10.1007/s13218-020-00679-2 fatcat:rt57cxqtmrhajnfauhzxlx4etm

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities.  ...  Multimodal Machine Translation (MMT): Multimodal Machine Translation is a two-fold task of translation and description generation.  ...  Multimodal Machine Translation (MMT) MMT is a task wherein visual data acts as a supplement for fostering the primary task of translating descriptions from one language to another.  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

Attention, please! A survey of Neural Attention Models in Deep Learning [article]

Alana de Santana Correia, Esther Luna Colombini
2021 arXiv   pre-print
Finally, we list possible trends and opportunities for further research, hoping that this review will provide a succinct overview of the main attentional models in the area and guide researchers in developing  ...  For the last six years, this property has been widely explored in deep neural networks.  ...  Doubly Attentive Transformer [107] proposes a multimodal machine-translation method, incorporating visual information.  ... 
arXiv:2103.16775v1 fatcat:lwkw42lrircorkymykpgdmlbwq

A Review on Explainability in Multimodal Deep Neural Nets

Gargi Joshi, Rahee Walambe, Ketan Kotecha
2021 IEEE Access  
Visual common sense reasoning uses other object detection techniques for better text to image grounding and assigns attributes to object grounding with fewer parameters.  ...  , and that has a little marginal cost per run. METEOR [226] is an automatic machine translation metric used for unigram matching between the machine-produced translation and human-produced reference  ... 
doi:10.1109/access.2021.3070212 fatcat:5wtxr4nf7rbshk5zx7lzbtcram

Pre-Trained Models: Past, Present and Future [article]

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang (+12 others)
2021 arXiv   pre-print
It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch.  ...  Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data.  ...  CoVE adopts machine translation as its pre-training objective.  ... 
arXiv:2106.07139v3 fatcat:kn6gk2bg4jecndvlhhvq32x724

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, Abdellatif Mtibaa
2021 The Visual Computer  
In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the  ...  This involves the development of models capable of processing and analyzing the multimodal information uniformly.  ...  For instance, the Mask R-CNN model offers the possibility of locating instances of objects with class labels and segmenting them with semantic masks.  ... 
doi:10.1007/s00371-021-02166-7 pmid:34131356 pmcid:PMC8192112 fatcat:jojwyc6slnevzk7eaiutlmlgfe

Refract Journal, Volume 2: "Translation"

Refract Journal Editorial Board
2019 Refract An Open Access Visual Studies Journal  
To help illustrate the way that context surrounding an event is masked, I want to think about violence in terms of two interactive forms: blatant violence is when the context surrounding violence is clear  ...  But beyond this, they also produce a variety of new visual culture exemplars for others to model themselves after, creating urgent and necessary interventions into visual culture that otherwise demeans  ...  Is this a sign of the irrelevance of the designator that my graduate degree assigns me?  ... 
doi:10.5070/r72145897 fatcat:pswqtamgoretticqtesag426em

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [article]

Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang
2022 arXiv   pre-print
We believe that this review will be of help for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.  ...  Then we focus on VLP methods and comprehensively review key components of the model structures and training methods.  ...  has been explored on a broad range of language tasks, including machine translation (Ive et al., 2019; , semantic parsing (Shi et al., 2019a; Kojima et al., 2020), and language grounding (Bordes et  ... 
arXiv:2203.01922v1 fatcat:vnjfetgkpzedpfhklufooqet7y

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video.  ...  Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks.  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [article]

Aditya Mogadala and Marimuthu Kalimuthu and Dietrich Klakow
2020 arXiv   pre-print
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video.  ...  This success can be partly attributed to the advancements made in the sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP).  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
arXiv:1907.09358v2 fatcat:4fyf6kscy5dfbewll3zs7yzsuq

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [article]

Rowan Zellers and Jiasen Lu and Ximing Lu and Youngjae Yu and Yanpeng Zhao and Mohammadreza Salehi and Aditya Kusupati and Jack Hessel and Ali Farhadi and Yejin Choi
2022 arXiv   pre-print
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.  ...  Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding.  ...  Last, but not least, thanks to the YouTubers whose work and creativity helps machines to learn about the multimodal world.  ... 
arXiv:2201.02639v4 fatcat:deywuxyj45eqvacjwwns7kmbh4

Core Challenges in Embodied Vision-Language Planning [article]

Jonathan Francis, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, Jean Oh
2022 arXiv   pre-print
Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing  ...  Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment  ...  Acknowledgements The authors thank Alessandro Oltramari, Yonatan Bisk, Eric Nyberg, and Louis-Philippe Morency for insightful discussions; we thank Mayank Mali for support throughout the editing process  ... 
arXiv:2106.13948v3 fatcat:tk32nr4jtjekboh33zutellvnm

A Roadmap for Big Model [article]

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han (+88 others)
2022 arXiv   pre-print
, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research.  ...  With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.  ...  the object semantic class for each masked visual part.  ... 
arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4