Towards Knowledge-Augmented Visual Question Answering

Maryam Ziaeefard, Freddy Lecue
2020 Proceedings of the 28th International Conference on Computational Linguistics
Visual Question Answering (VQA) remains algorithmically challenging even though it is effortless for humans. Humans combine visual observations with general and commonsense knowledge to answer a question about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our model is a graph-based approach that combines scene graphs with concept graphs and learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to assign higher importance to the key knowledge instances that are most relevant to each question. We exploit ConceptNet as the source of general knowledge and evaluate the performance of our model on the challenging OK-VQA dataset. Our code will be available at https://github.com/ZiaMaryam/KVQA
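The question-adaptive weighting described above rests on the standard Graph Attention Network mechanism: each node attends over its neighbours, and attention coefficients are normalised with a softmax so that more relevant neighbours contribute more to the aggregated representation. The sketch below shows one GAT attention head in plain numpy; it is an illustration of the general mechanism, not the authors' implementation, and the graph, feature sizes, and parameter names are hypothetical.

```python
import numpy as np

def gat_head(H, adj, W, a, slope=0.2):
    """One GAT attention head.

    H: (N, F) node features, e.g. scene-graph objects + ConceptNet entities.
    adj: (N, N) binary adjacency (self-loops included).
    W: (F, Fp) shared linear projection; a: (2*Fp,) attention vector.
    Returns the (N, N) attention weights and (N, Fp) aggregated features.
    """
    Z = H @ W                                    # project features to (N, Fp)
    N = Z.shape[0]
    # Attention logit e_ij = LeakyReLU(a^T [z_i || z_j]) for every pair.
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            e[i, j] = np.concatenate([Z[i], Z[j]]) @ a
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-neighbours
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # softmax over neighbours
    return att, att @ Z                          # weighted aggregation

# Toy graph: 4 nodes on a chain, 3-dim input features, 2-dim output.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
att, out = gat_head(H, adj, W, a)
print(att.round(3))   # each row sums to 1; zeros where adj is 0
```

A full model would stack several such heads (concatenating or averaging their outputs) and, as in the paper, condition the attention on the question representation so that knowledge instances relevant to the question receive higher weight.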
doi:10.18653/v1/2020.coling-main.169