ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [article]

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
2019 arXiv   pre-print
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.  ...  We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question  ...  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the  ... 
arXiv:1908.02265v1 fatcat:6qlqpknrcnf5lmhe27t7jht5ca

Are we pretraining it right? Digging deeper into visio-linguistic pretraining [article]

Amanpreet Singh, Vedanuj Goswami, Devi Parikh
2020 arXiv   pre-print
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks.  ...  This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet.  ...  Acknowledgments: We would like to thank Marcus Rohrbach for helpful discussions and feedback.
arXiv:2004.08744v1 fatcat:tatrebjtkjhdhdy3mi3ylylxaq

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [article]

Tanzila Rahman, Mengyu Yang, Leonid Sigal
2021 arXiv   pre-print
The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks.  ...  In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of  ...  Acknowledgments: This work was funded in part by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC Canada Research Chair (CRC) and an NSERC Discovery and Discovery Accelerator Supplement Grants  ... 
arXiv:2110.13412v1 fatcat:oejb6j7hebaiflohlib76r2pae

12-in-1: Multi-Task Vision and Language Representation Learning [article]

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
2020 arXiv   pre-print
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills  ...  In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime.  ...  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the  ... 
arXiv:1912.02315v2 fatcat:bjlhdvftabdfdpskqwzd5yzia4

Recent, rapid advancement in visual question answering architecture: a review [article]

Venkat Kodali, Daniel Berleant
2022 arXiv   pre-print
Understanding visual question answering is going to be crucial for numerous human activities. However, it presents major challenges at the heart of the artificial intelligence endeavor.  ...  The following are some of the most influential relevant articles of the year. 1) ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks: Lu et al.  ...  [43] introduced VilBERT "for learning task-agnostic joint representations of image content and natural language." 2) LXMERT: Learning Cross-Modality Encoder Representations from Transformers: Tan and  ... 
arXiv:2203.01322v3 fatcat:pexwnnw5rvfohly647coyth2za

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [article]

Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy
2020 arXiv   pre-print
We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task.  ...  Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image.  ...  Acknowledgements We would like to thank Volkan Cirik, Licheng Yu, Jiasen Lu for their help with GroundNet, MattNet and ViLBERT respectively, Keze Wang for his help with technical issues, and AWS AI data  ... 
arXiv:2005.01655v1 fatcat:4p4geps4kbcoxal2giroftmvza

A Multimodal Framework for the Detection of Hateful Memes [article]

Phillip Lippe, Nithin Holla, Shantanu Chandra, Santhosh Rajamanickam, Georgios Antoniou, Ekaterina Shutova, Helen Yannakoudakis
2020 arXiv   pre-print
The detection of multimodal hate speech is an intrinsically difficult and open problem: memes convey a message using both images and text and, hence, require multimodal reasoning and joint visual and language  ...  In this work, we seek to advance this line of research and develop a multimodal framework for the detection of hateful memes.  ...  Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23. [24] B. Miller and D.  ... 
arXiv:2012.12871v2 fatcat:rttlifokijczthrcgduaerswey

UNITER: UNiversal Image-TExt Representation Learning [article]

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
2020 arXiv   pre-print
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding.  ...  We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA).  ...  Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019) 2, 3, 11, 22, 24, 25 30.  ... 
arXiv:1909.11740v3 fatcat:zdlyfiquxngzrnpvl4epubj3p4

Hateful Memes Challenge: An Enhanced Multimodal Framework [article]

Aijing Gao, Bingjun Wang, Jiaqi Yin, Yating Tian
2021 arXiv   pre-print
In this paper, we enhance the hateful detection framework, including utilizing Detectron for feature extraction, exploring different setups of VisualBERT and UNITER models with different loss functions  ...  , researching the association between the hateful memes and the sensitive text features, and finally building ensemble method to boost model performance.  ...  Vilbert: them as the input features can be a promising direction for Pretraining task-agnostic visiolinguistic representations for future improvement of this task.  ... 
arXiv:2112.11244v1 fatcat:lga6pjxoe5dbzdv33mge5wlyea

MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets [article]

Shraman Pramanick, Shivam Sharma, Dimitar Dimitrov, Md Shad Akhtar, Preslav Nakov, Tanmoy Chakraborty
2021 arXiv   pre-print
To solve these tasks, we propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal deep neural network that uses global and local perspectives to detect harmful  ...  We focus on two tasks: (i) detecting harmful memes, and (ii) identifying the social entities they target.  ...  ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.  ... 
arXiv:2109.05184v2 fatcat:ntmq4pv6kjdhvebjyohuikqppe

Learning to Scale Multilingual Representations for Vision-Language Tasks [article]

Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer
2020 arXiv   pre-print
SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few.  ...  The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.  ...  .: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265 (2019) 30. Makiewicz, A., Ratajczak, W.: Principal components analysis (pca).  ... 
arXiv:2004.04312v2 fatcat:ua5grxstzrbepdz4uscznccrfq

Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models

Nathan Cornille, Katrien Laenen, Marie-Francine Moens
2022 Frontiers in Artificial Intelligence  
Furthermore, we create a human-labeled ground truth causality dataset for objects in a scene to empirically verify whether and how well confounders are found.  ...  Finally, we summarize the current limitations of AutoDeconfounding to solve the issue of spurious correlations and provide directions for the design of novel AutoDeconfounding methods that are aimed at  ...  Deep Modular co-attention networks for visual question answering. arXiv:1906.10770 [cs]  ...  visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265 [cs]  ... 
doi:10.3389/frai.2022.736791 pmid:35402901 pmcid:PMC8993511 fatcat:stbufg6l55h2rhvq7qghfazdlq

Incorporating Concreteness in Multi-Modal Language Models with Curriculum Learning

Erhan Sezerer, Selma Tekir
2021 Applied Sciences  
Over the last few years, there has been an increase in the studies that consider experiential (visual) information by building multi-modal language models and representations.  ...  To show the performance of the proposed model, downstream tasks and ablation studies are performed.  ...  Acknowledgments: The Titan V used for the experiments in this work is donated by the NVIDIA Corporation. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/app11178241 doaj:2e3333c60acf45a0851c1b4a145a5350 fatcat:7ntyo3q5vnbbfddkpfzloojl6i

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [article]

Renrui Zhang, Longtian Qiu, Wei Zhang, Ziyao Zeng
2021 arXiv   pre-print
Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.  ...  In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts.  ...  Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 2019. 3 [29] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi.  ... 
arXiv:2112.02399v1 fatcat:pk7gjz5ewnfdvayab7ljiwpfli

Learning to Learn Words from Visual Scenes [article]

Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick
2020 arXiv   pre-print
We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition.  ...  Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at  ...  Acknowledgements: We thank Alireza Zareian, Bobby Wu, Spencer Whitehead, Parita Pooj and Boyuan Chen for helpful discussion. Funding for this research was provided by DARPA GAILA HR00111990058.  ... 
arXiv:1911.11237v3 fatcat:hveh5cjwzjdgzmseg32uefslju