
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation

Humair Raj Khan, Deepak Gupta, Asif Ekbal
2021 Findings of the Association for Computational Linguistics: EMNLP 2021
Knowledge distillation for visual question answering.  ... 
doi:10.18653/v1/2021.findings-emnlp.151 fatcat:k7r4as6crbcdreetop3h5gbg6u

Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation [article]

Jae Won Cho, Dong-Jin Kim, Jinsoo Choi, Yunjae Jung, In So Kweon
2021 arXiv   pre-print
In this work, we address the issue of missing modalities that arises in the Visual Question Answer-Difference prediction task and find a novel method to solve the task at hand.  ...  distill knowledge to a target network (student) that only takes the image/question pair as its input.  ...  Problem Definition: The Visual Question Answering (VQA) [4] task requires generating a correct answer â for a given visual question (x, q).  ... 
arXiv:2104.05965v1 fatcat:yrhar6ewvvh2xdtu77wsds4j3m
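
A minimal sketch of the teacher-student setup this snippet describes: a frozen teacher that also saw the answer modality provides targets, and a student that takes only the image/question pair is trained to match them. The module names, feature dimensions, and plain MSE objective below are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Student(nn.Module):
        def __init__(self, img_dim=2048, q_dim=768, hidden=512, n_out=1):
            super().__init__()
            self.fuse = nn.Linear(img_dim + q_dim, hidden)
            self.head = nn.Linear(hidden, n_out)

        def forward(self, img, q):
            return self.head(torch.relu(self.fuse(torch.cat([img, q], dim=-1))))

    def distill_step(student, teacher_out, img, q):
        # match a frozen teacher's prediction; the teacher also saw the answers
        return F.mse_loss(student(img, q), teacher_out.detach())

    student = Student()
    img, q = torch.randn(4, 2048), torch.randn(4, 768)
    teacher_out = torch.randn(4, 1)  # stand-in for the teacher's output
    distill_step(student, teacher_out, img, q).backward()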

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [article]

Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan
2022 arXiv   pre-print
First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints.  ...  In this work, we seek to answer these questions through two key contributions.  ...  Acknowledgement: Thanks to Liunian Harold Li for his help in the implementation of CLIP-ViL and feedback on the idea.  ... 
arXiv:2201.05729v2 fatcat:5j2i65vw5nepfmhgyennrrvzhq

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [article]

Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-fu Chang, Lu Yuan
2022 arXiv   pre-print
Second, to better capture nuanced impacts on VL task performance, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data constraints and conditions of domain shift.  ...  text-only question answering and low-level visual question answering tasks (i.e.  ... 
arXiv:2204.10496v2 fatcat:t2lgj4cpxfg6nfdbnnayoud2mq

Joint Answering and Explanation for Visual Commonsense Reasoning [article]

Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, Mohan Kankanhalli
2022 arXiv   pre-print
It is composed of two indispensable processes: question answering over a given image and rationale inference for answer explanation.  ...  Visual Commonsense Reasoning (VCR), deemed a challenging extension of Visual Question Answering (VQA), endeavors to pursue a more high-level visual comprehension.  ... 
arXiv:2202.12626v1 fatcat:zglxtlf4kndlxl63lijvmve7oy

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation [article]

Humair Raj Khan, Deepak Gupta, Asif Ekbal
2021 arXiv   pre-print
Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task.  ...  Unlike the existing knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple  ...  Acknowledgement Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY  ... 
arXiv:2109.04653v1 fatcat:6oac4nogpbbctpoavh64j6iqhy
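
The snippet notes that, unlike last-layer-only distillation, the student imitates the teacher at multiple layers. A generic sketch of such a multi-layer distillation loss, assuming MSE feature matching and linear projections to bridge the width gap (layer choices and loss weights are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def multi_layer_kd_loss(student_feats, teacher_feats, projections):
        # one (student, teacher, projection) triple per matched layer;
        # each projection maps the student width to the teacher width
        losses = [F.mse_loss(p(s), t.detach())
                  for s, t, p in zip(student_feats, teacher_feats, projections)]
        return sum(losses) / len(losses)

    projs = nn.ModuleList([nn.Linear(256, 768) for _ in range(3)])
    s_feats = [torch.randn(4, 256) for _ in range(3)]
    t_feats = [torch.randn(4, 768) for _ in range(3)]
    loss = multi_layer_kd_loss(s_feats, t_feats, projs)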

Revisiting EmbodiedQA: A Simple Baseline and Beyond

Yu Wu, Lu Jiang, Yi Yang
2020 IEEE Transactions on Image Processing  
In Embodied Question Answering (EmbodiedQA), an agent interacts with an environment to gather the necessary information for answering user questions.  ...  a chance to adapt the trained model to a new environment before it actually answers users' questions.  ...  It originates from two widely studied tasks: the visual navigation task and the visual question answering task.  ... 
doi:10.1109/tip.2020.2967584 pmid:31995489 fatcat:5swi3w4nzrfwtoxhgw2jebdkpe

Compact Trilinear Interaction for Visual Question Answering [article]

Tuong Do, Thanh-Toan Do, Huy Tran, Erman Tjiputra, Quang D. Tran
2019 arXiv   pre-print
In Visual Question Answering (VQA), answers have a great correlation with question meaning and visual contents.  ...  Moreover, knowledge distillation is applied for the first time to free-form open-ended VQA.  ...  Introduction: The aim of VQA is to find a correct answer for a given question that is consistent with the visual content of a given image [25, 3, 10].  ... 
arXiv:1909.11874v1 fatcat:4frajelllbfizny5s5nb7syemm
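
The paper's compact trilinear interaction jointly scores image, question, and answer; a naive rank-R factorisation conveys the idea (the paper's method uses a more compact decomposition, and all dimensions here are assumed):

    import torch
    import torch.nn as nn

    class TrilinearScore(nn.Module):
        # project each modality into a shared rank space, multiply
        # elementwise, and sum to one scalar joint score
        def __init__(self, d_img, d_q, d_a, rank=32):
            super().__init__()
            self.Wi = nn.Linear(d_img, rank, bias=False)
            self.Wq = nn.Linear(d_q, rank, bias=False)
            self.Wa = nn.Linear(d_a, rank, bias=False)

        def forward(self, img, q, a):
            return (self.Wi(img) * self.Wq(q) * self.Wa(a)).sum(dim=-1)

    score = TrilinearScore(2048, 768, 300)(
        torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 300))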

Single-Modal Entropy based Active Learning for Visual Question Answering [article]

Dong-Jin Kim, Jae Won Cho, Jinsoo Choi, Yunjae Jung, In So Kweon
2021 arXiv   pre-print
Constructing a large-scale labeled dataset in the real world, especially for high-level tasks (e.g., Visual Question Answering), can be expensive and time-consuming.  ...  In this work, we address Active Learning in the multi-modal setting of Visual Question Answering (VQA).  ...  This work was supported by the Institute for Information & Communications Technology Promotion (2017-0-01772) grant funded by the Korea government.  ... 
arXiv:2110.10906v2 fatcat:qpxptbj2pfb2jp7lphzjqzjq2q
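
The core selection criterion in entropy-based active learning fits in a few lines: rank the unlabeled pool by predictive entropy and label the most uncertain samples. The paper's single-modal scoring is more involved; this sketch shows only the generic entropy criterion:

    import torch

    def select_by_entropy(logits, k):
        # logits: (N, C) model predictions on the unlabeled pool
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        return torch.topk(entropy, k).indices  # k most uncertain samples

    picked = select_by_entropy(torch.randn(1000, 3129), k=64)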

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [article]

Fawaz Sammani, Tanmoy Mukherjee, Nikos Deligiannis
2022 arXiv   pre-print
We first conduct pre-training on large-scale image-caption data for a general understanding of images, and then formulate the answer as a text prediction task along with the explanation.  ...  We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.  ...  Particularly, we choose NLE for visual question answering, visual entailment and visual commonsense reasoning (VQA-X [40], e-SNLI-VE [22] and VCR [64]) as vision-language tasks, and NLE for activity  ... 
arXiv:2203.05081v1 fatcat:hh7zkumhlng35lgecvbgk6jbku
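
Formulating the answer as a text prediction task along with the explanation amounts to training on a single target string per example; the template below is hypothetical, not the paper's actual prompt format:

    def build_target(question: str, answer: str, explanation: str) -> str:
        # one autoregressive target: the model emits the answer and its
        # explanation in a single decoding pass
        return f"question: {question} answer: {answer} because {explanation}"

    print(build_target("what sport is this?", "tennis",
                       "the man is swinging a racket on a court"))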

Optimized Transformer Models for FAQ Answering [chapter]

Sonam Damani, Kedhar Nath Narahari, Ankush Chatterjee, Manish Gupta, Puneet Agrawal
2020 Lecture Notes in Computer Science  
Given a set of FAQ pages S for an enterprise, and a user query, we need to find the best matching question-answer pairs from S.  ...  state-of-the-art for FAQ answering.  ...  This visualization helps us understand which pairs of words in the (query, question, answer) have high self-attention weights.  ... 
doi:10.1007/978-3-030-47426-3_19 fatcat:uotywipvwffsne3jx74depc3si
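
A minimal sketch of the retrieval step the snippet describes, matching a user query against question-answer pairs from the FAQ pages S; the encoder name and cosine-similarity ranking are assumptions, not the paper's fine-tuned transformer:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

    def rank_faq(query, faq_pairs):
        # faq_pairs: list of (question, answer) strings from the FAQ pages
        texts = [f"{q} {a}" for q, a in faq_pairs]
        q_emb = model.encode(query, convert_to_tensor=True)
        f_emb = model.encode(texts, convert_to_tensor=True)
        scores = util.cos_sim(q_emb, f_emb)[0]
        return sorted(zip(faq_pairs, scores.tolist()), key=lambda p: -p[1])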

Spatial Knowledge Distillation to aid Visual Reasoning [article]

Somak Aditya, Rudra Saha, Yezhou Yang, Chitta Baral
2018 arXiv   pre-print
neural networks for the task of Visual Question Answering.  ...  A representative task is Visual Question Answering where large diagnostic datasets have been proposed to test a system's capability of answering questions about images.  ...  We also acknowledge NVIDIA for the donation of GPUs.  ... 
arXiv:1812.03631v2 fatcat:6x5nsnj725hdpiajcjcqr3asra

Introspective Distillation for Robust Question Answering [article]

Yulei Niu, Hanwang Zhang
2021 arXiv   pre-print
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.  ...  In this paper, we present a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA.  ...  Acknowledgement We thank anonymous ACs and reviewers for their valuable discussion and insightful suggestions. This work was supported in part by NTU-Alibaba JRI and MOE AcRF Tier 2 grant.  ... 
arXiv:2111.01026v1 fatcat:ozmz7wvzmjdh3nhqz4g2crs5pa
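
One simplified reading of blending an in-distribution (ID) teacher with an out-of-distribution (OOD) teacher into a single distillation target; the weighting rule below is an illustrative assumption, not the paper's exact introspection formula:

    import torch
    import torch.nn.functional as F

    def blended_target(p_id, p_ood, gt):
        # p_id / p_ood: (N, C) teacher probabilities; gt: (N,) label indices
        c_id = p_id.gather(1, gt[:, None]).squeeze(1)
        c_ood = p_ood.gather(1, gt[:, None]).squeeze(1)
        w_id = (c_ood / (c_id + c_ood + 1e-12))[:, None]  # trust the ID teacher less when it leans on bias
        return w_id * p_id + (1.0 - w_id) * p_ood

    def kd_loss(student_logits, target):
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        target, reduction="batchmean")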

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [article]

Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
2022 arXiv   pre-print
Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.  ...  To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal  ...  multimodal tasks such as image captioning and open-ended visual question answering (VQA).  ... 
arXiv:2203.06386v2 fatcat:oi6r6xjmofeold7dnku2steab4
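
Distilling a dual-stream VLP model's text knowledge into a generative PLM can be sketched as aligning the PLM's pooled states with the frozen VLP text embeddings; the projection and cosine objective here are assumptions, not the paper's full VLKD recipe:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def vlkd_align_loss(plm_hidden, vlp_text_emb, proj):
        # plm_hidden: (N, d_plm) pooled PLM states for the captions;
        # vlp_text_emb: (N, d_vlp) frozen text embeddings from the VLP model
        z = F.normalize(proj(plm_hidden), dim=-1)
        t = F.normalize(vlp_text_emb.detach(), dim=-1)
        return (1.0 - (z * t).sum(dim=-1)).mean()  # cosine-alignment loss

    proj = nn.Linear(1024, 512)
    loss = vlkd_align_loss(torch.randn(4, 1024), torch.randn(4, 512), proj)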

Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks [article]

Guohao Li, Hang Su, Wenwu Zhu
2017 arXiv   pre-print
Extensive experiments demonstrate that our model not only achieves state-of-the-art performance on the visual question answering task, but can also answer open-domain questions effectively by leveraging  ...  Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language.  ...  model for answering open-domain visual questions.  ... 
arXiv:1712.00733v1 fatcat:tqfh5otaqrgl7jq6l6bjmenvm4
Showing results 1–15 out of 40,604.