ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
[article] · 2019 · arXiv pre-print
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. ...
We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question ...
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ...
arXiv:1908.02265v1
fatcat:6qlqpknrcnf5lmhe27t7jht5ca
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
[article] · 2020 · arXiv pre-print
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. ...
This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet. ...
Acknowledgments: We would like to thank Marcus Rohrbach for helpful discussions and feedback. ...
arXiv:2004.08744v1
fatcat:tatrebjtkjhdhdy3mi3ylylxaq
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
[article] · 2021 · arXiv pre-print
The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. ...
In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of ...
Acknowledgments: This work was funded in part by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC Canada Research Chair (CRC) and an NSERC Discovery and Discovery Accelerator Supplement Grants ...
arXiv:2110.13412v1
fatcat:oejb6j7hebaiflohlib76r2pae
12-in-1: Multi-Task Vision and Language Representation Learning
[article] · 2020 · arXiv pre-print
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills ...
In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. ...
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ...
arXiv:1912.02315v2
fatcat:bjlhdvftabdfdpskqwzd5yzia4
Recent, rapid advancement in visual question answering architecture: a review
[article] · 2022 · arXiv pre-print
Understanding visual question answering is going to be crucial for numerous human activities. However, it presents major challenges at the heart of the artificial intelligence endeavor. ...
The following are some of the most influential relevant articles of the year.
1) ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks: Lu et al. [43] introduced ViLBERT "for learning task-agnostic joint representations of image content and natural language." 2) LXMERT: Learning Cross-Modality Encoder Representations from Transformers: Tan and ...
arXiv:2203.01322v3
fatcat:pexwnnw5rvfohly647coyth2za
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions
[article] · 2020 · arXiv pre-print
We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. ...
Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. ...
Acknowledgements We would like to thank Volkan Cirik, Licheng Yu, Jiasen Lu for their help with GroundNet, MattNet and ViLBERT respectively, Keze Wang for his help with technical issues, and AWS AI data ...
arXiv:2005.01655v1
fatcat:4p4geps4kbcoxal2giroftmvza
A Multimodal Framework for the Detection of Hateful Memes
[article] · 2020 · arXiv pre-print
The detection of multimodal hate speech is an intrinsically difficult and open problem: memes convey a message using both images and text and, hence, require multimodal reasoning and joint visual and language ...
In this work, we seek to advance this line of research and develop a multimodal framework for the detection of hateful memes. ...
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23. [24] B. Miller and D. ...
arXiv:2012.12871v2
fatcat:rttlifokijczthrcgduaerswey
UNITER: UNiversal Image-TExt Representation Learning
[article] · 2020 · arXiv pre-print
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. ...
We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). ...
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019). ...
arXiv:1909.11740v3
fatcat:zdlyfiquxngzrnpvl4epubj3p4
Hateful Memes Challenge: An Enhanced Multimodal Framework
[article] · 2021 · arXiv pre-print
In this paper, we enhance the hateful detection framework, including utilizing Detectron for feature extraction, exploring different setups of VisualBERT and UNITER models with different loss functions, researching the association between the hateful memes and the sensitive text features, and finally building an ensemble method to boost model performance. ...
... them as the input features can be a promising direction for future improvement of this task. ViLBERT: Pretraining task-agnostic visiolinguistic representations for ...
arXiv:2112.11244v1
fatcat:lga6pjxoe5dbzdv33mge5wlyea
MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets
[article] · 2021 · arXiv pre-print
To solve these tasks, we propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal deep neural network that uses global and local perspectives to detect harmful ...
We focus on two tasks: (i)detecting harmful memes, and (ii)identifying the social entities they target. ...
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. ...
arXiv:2109.05184v2
fatcat:ntmq4pv6kjdhvebjyohuikqppe
Learning to Scale Multilingual Representations for Vision-Language Tasks
[article] · 2020 · arXiv pre-print
SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. ...
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. ...
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265 (2019). Maćkiewicz, A., Ratajczak, W.: Principal components analysis (PCA). ...
arXiv:2004.04312v2
fatcat:ua5grxstzrbepdz4uscznccrfq
Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models
2022 · Frontiers in Artificial Intelligence
Furthermore, we create a human-labeled ground truth causality dataset for objects in a scene to empirically verify whether and how well confounders are found. ...
Finally, we summarize the current limitations of AutoDeconfounding to solve the issue of spurious correlations and provide directions for the design of novel AutoDeconfounding methods that are aimed at ...
... visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265 [cs]. Deep Modular co-attention networks for visual question answering. arXiv:1906.10770 [cs] ...
doi:10.3389/frai.2022.736791
pmid:35402901
pmcid:PMC8993511
fatcat:stbufg6l55h2rhvq7qghfazdlq
Incorporating Concreteness in Multi-Modal Language Models with Curriculum Learning
2021 · Applied Sciences
Over the last few years, there has been an increase in the studies that consider experiential (visual) information by building multi-modal language models and representations. ...
To show the performance of the proposed model, downstream tasks and ablation studies are performed. ...
Acknowledgments: The Titan V used for the experiments in this work is donated by the NVIDIA Corporation.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/app11178241
doaj:2e3333c60acf45a0851c1b4a145a5350
fatcat:7ntyo3q5vnbbfddkpfzloojl6i
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
[article] · 2021 · arXiv pre-print
Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. ...
In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. ...
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 2019. [29] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. ...
arXiv:2112.02399v1
fatcat:pk7gjz5ewnfdvayab7ljiwpfli
Learning to Learn Words from Visual Scenes
[article] · 2020 · arXiv pre-print
We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. ...
Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at https://expert.cs.columbia.edu/ ...
Acknowledgements: We thank Alireza Zareian, Bobby Wu, Spencer Whitehead, Parita Pooj and Boyuan Chen for helpful discussion. Funding for this research was provided by DARPA GAILA HR00111990058. ...
arXiv:1911.11237v3
fatcat:hveh5cjwzjdgzmseg32uefslju
Showing results 1 — 15 out of 31 results