
VQA-LOL: Visual Question Answering under the Lens of Logic [article]

Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
2020 arXiv   pre-print
In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions.  ...  When put under this Lens of Logic, state-of-the-art VQA models have difficulty in correctly answering these logically composed questions.  ... 
arXiv:2002.08325v2 fatcat:dft3d4x7cjccdk4luspbcuf7ga
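
The logical composition described in this entry is mechanical enough to sketch: closed (yes/no) questions about the same image are combined with NOT/AND/OR, and the composed ground-truth answer follows from propositional logic. The templates and function names below are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of building logically composed yes/no VQA questions
# from closed questions about the same image, in the spirit of VQA-LOL.
# Templates here are assumptions, not the paper's data-generation code.

def negate(q: str, a: bool) -> tuple[str, bool]:
    """NOT: wrap a closed question in a negation template."""
    return f"Is it false that {q.rstrip('?').lower()}?", not a

def conjoin(q1: str, a1: bool, q2: str, a2: bool) -> tuple[str, bool]:
    """AND: the composed answer is yes only if both answers are yes."""
    return f"{q1.rstrip('?')} and {q2.rstrip('?').lower()}?", a1 and a2

def disjoin(q1: str, a1: bool, q2: str, a2: bool) -> tuple[str, bool]:
    """OR: the composed answer is yes if either answer is yes."""
    return f"{q1.rstrip('?')} or {q2.rstrip('?').lower()}?", a1 or a2

q1, a1 = "Is the man wearing glasses?", True
q2, a2 = "Is there a dog in the picture?", False

print(conjoin(q1, a1, q2, a2))           # composed answer: False
print(disjoin(q1, a1, *negate(q2, a2)))  # composed answer: True
```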

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models [article]

Linjie Li, Zhe Gan, Jingjing Liu
2021 arXiv   pre-print
To investigate, we conduct a host of thorough evaluations on existing pre-trained models over 4 different types of V+L specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift.  ...  It consists of two datasets: VQA-LOL Compose (logical combinations of multiple closed binary questions about the same image in VQA v2) and VQA-LOL Supplement (logical combinations of additional questions  ... 
arXiv:2012.08673v2 fatcat:orl3dt3r3fg3xjac2rt4xwqxxu
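
The four robustness categories listed in the snippet suggest a simple evaluation harness: score the same pretrained model on each robustness split and report the drop against its standard i.i.d. accuracy. The sketch below assumes hypothetical `model.predict` and `load_split` interfaces; nothing here comes from the paper's code.

```python
# Hedged sketch of a robustness evaluation over the four V+L categories
# named above. `model.predict` and `load_split` are hypothetical
# stand-ins, not APIs from the paper.

SPLITS = [
    "linguistic_variation",
    "logical_reasoning",            # e.g. VQA-LOL Compose / Supplement
    "visual_content_manipulation",
    "answer_distribution_shift",
]

def accuracy(model, examples):
    correct = sum(model.predict(ex["image"], ex["question"]) == ex["answer"]
                  for ex in examples)
    return correct / len(examples)

def robustness_report(model, load_split, iid_examples):
    baseline = accuracy(model, iid_examples)
    for name in SPLITS:
        acc = accuracy(model, load_split(name))
        print(f"{name}: {acc:.3f} ({acc - baseline:+.3f} vs. i.i.d. {baseline:.3f})")
```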

WeaQA: Weak Supervision via Captions for Visual Question Answering [article]

Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral
2021 arXiv   pre-print
Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets.  ...  Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.  ...  Acknowledgements The authors acknowledge support from the DARPA SAIL-ON program W911NF2020006, ONR award N00014-20-1-2332, and NSF grant 1816039, and thank the anonymous reviewers for their insightful discussion  ... 
arXiv:2012.02356v2 fatcat:yoqklfrx2vhctm7u24elycwwsi
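
The central idea in this snippet, replacing human I-Q-A annotation with supervision derived from captions, can be illustrated with a toy generator: parse a caption and emit synthetic question-answer pairs. The regex heuristic below is an assumption for illustration; the paper's actual pipeline is considerably more sophisticated.

```python
# Toy illustration of weak supervision via captions: derive synthetic
# (question, answer) pairs from a caption so no human-annotated I-Q-A
# triplets are needed. This heuristic is an assumption for illustration,
# not the paper's generation pipeline.

import re

def qa_from_caption(caption: str) -> list[tuple[str, str]]:
    """Handle captions of the form 'A <subject> <verb>ing ...'."""
    m = re.match(r"[Aa]n? (\w+) (\w+ing)\b(.*?)\.?$", caption)
    if not m:
        return []
    subj, verb, rest = m.groups()
    return [
        (f"What is the {subj} doing?", f"{verb}{rest}"),
        (f"Is there a {subj} in the image?", "yes"),
        (f"Who is {verb}{rest}?", subj),
    ]

print(qa_from_caption("A man riding a horse on the beach."))
```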

MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [article]

Tejas Gokhale and Pratyay Banerjee and Chitta Baral and Yezhou Yang
2020 arXiv   pre-print
While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting.  ...  Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in input (question-image pair) on the output (answer).  ...  Acknowledgements The authors acknowledge support from the NSF Robust Intelligence Program project #1816039, the DARPA KAIROS program (LESTAT project), the DARPA SAIL-ON program, and ONR award N00014-20  ... 
arXiv:2009.08566v2 fatcat:hpbd4nm5pzh3zc6gmxudcnylaa
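
The snippet mentions a consistency-constrained training objective over input (question-image) pairs and their mutated counterparts. Below is a hedged PyTorch sketch of one way such an objective can look; the shapes, the KL consistency term, and the weighting are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a consistency-constrained objective in the spirit of
# MUTANT: supervise both the original sample and its mutant, and keep
# the two predictive distributions close when the mutation does not
# change the answer. Not the paper's exact loss.

import torch
import torch.nn.functional as F

def consistency_loss(model, img, qst, ans, img_m, qst_m, ans_m, alpha=0.5):
    logits = model(img, qst)          # [B, num_answers]
    logits_m = model(img_m, qst_m)    # [B, num_answers]
    task = F.cross_entropy(logits, ans) + F.cross_entropy(logits_m, ans_m)
    # Only tie the distributions together when the answer is unchanged;
    # when the mutation flips the answer, the task terms separate them.
    same = (ans == ans_m).float()                       # [B]
    kl = F.kl_div(F.log_softmax(logits_m, dim=-1),
                  F.softmax(logits, dim=-1).detach(),
                  reduction="none").sum(-1)             # [B]
    return task + alpha * (same * kl).mean()
```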

Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering [article]

Man Luo, Yankai Zeng, Pratyay Banerjee, Chitta Baral
2021 arXiv   pre-print
Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images.  ...  The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on given knowledge.  ...  Acknowledgements The authors acknowledge support from the NSF grant 1816039, DARPA grant W911NF2020006, DARPA grant FA875019C0003, and ONR award N00014-20-1-2332; and thank the reviewers for their feedback  ... 
arXiv:2109.04014v1 fatcat:rnm2ghrosbd4xkctt4jnozfndu
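
The retriever/reader split described in the snippet reduces to a standard retrieve-then-read pipeline. The sketch below assumes a dense bi-encoder retriever and treats the encoder and reader as injected callables; these interfaces are hypothetical, not the paper's API.

```python
# Skeleton of a retrieve-then-read pipeline: a visual retriever scores
# external knowledge passages against a joint vision+language query,
# and a reader answers from the top-k passages. `encode` and `reader`
# are hypothetical callables, not components released with the paper.

import numpy as np

def retrieve(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 5):
    """Dense retrieval by inner-product similarity (bi-encoder style)."""
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def answer(question, image_feats, passages, encode, reader, k=5):
    query_vec = encode(question, image_feats)                 # joint V+L query
    passage_vecs = np.stack([encode(p, None) for p in passages])
    top, _ = retrieve(query_vec, passage_vecs, k)
    return reader(question, image_feats, [passages[i] for i in top])
```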

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [article]

JianJian Cao and Xiameng Qin and Sanyuan Zhao and Jianbing Shen
2021 arXiv   pre-print
Answering semantically complicated questions according to an image is challenging in the Visual Question Answering (VQA) task.  ...  Firstly, it not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information.  ...  Yang, “Vqa-lol: Visual question answering under the lens of logic,” in European Conference  ...  [51] X. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P.  ... 
arXiv:2112.07270v1 fatcat:oco2bjv4rrfpjfylwcmxa2pfky
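
The bilateral matching described in the snippet, a graph for the image and a graph for the question with attention flowing both ways, can be written generically as cross-attention between the two node sets. The formulation below is a common pattern offered for illustration, not the paper's exact module.

```python
# Generic cross-modality graph matching attention: attend from image-graph
# nodes to question-graph nodes and vice versa, then fuse each node with
# its cross-modal context. Illustrative formulation, not the paper's module.

import torch
import torch.nn.functional as F

def cross_graph_attention(img_nodes, qst_nodes):
    """img_nodes: [N, d] visual-graph nodes; qst_nodes: [M, d] question-graph nodes."""
    d = img_nodes.size(-1)
    sim = img_nodes @ qst_nodes.t() / d ** 0.5          # [N, M] matching scores
    img_ctx = F.softmax(sim, dim=1) @ qst_nodes         # question-aware context per image node
    qst_ctx = F.softmax(sim.t(), dim=1) @ img_nodes     # image-aware context per question node
    return (torch.cat([img_nodes, img_ctx], dim=-1),
            torch.cat([qst_nodes, qst_ctx], dim=-1))

img, qst = torch.randn(36, 512), torch.randn(14, 512)
fused_img, fused_qst = cross_graph_attention(img, qst)
print(fused_img.shape, fused_qst.shape)   # [36, 1024], [14, 1024]
```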

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning [article]

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
2020 arXiv   pre-print
Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions.  ...  Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects.  ...  ZF, TG, YY thank the organizers and the participants of the Telluride Neuromorphic Cognition Workshop, especially the Machine Common Sense (MCS) group.  ... 
arXiv:2003.05162v3 fatcat:xgri7zaajjejhmujw5crlxmnti

Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

Jihyung Kil, Cheng Zhang, Dong Xuan, Wei-Lun Chao
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
Vqa-lol: Visual question answering under the lens of logic. In Proceedings of  ...  Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015.  ...  Self-supervised vqa: Answering visual questions using images and captions. arXiv  ...  Making the v in vqa matter: Elevating the role of image understanding in visual question answering.  ... 
doi:10.18653/v1/2021.emnlp-main.512 fatcat:ip333delvzhgbgicuibho7wiju

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)   unpublished
Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions.  ...  Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.  ...  2017. Video question answering via gradually refined attention over appearance and motion.  ...  ZF, TG, YY thank the organizers and the participants of the Telluride Neuromorphic Cognition Workshop, especially the Machine Common Sense (MCS) group.  ... 
doi:10.18653/v1/2020.emnlp-main.61 fatcat:lrtywfat25ejbmct72jmlkxane