Generative Attention Model with Adversarial Self-learning for Visual Question Answering

Ilija Ilievski, Jiashi Feng
Proceedings of the Thematic Workshops of ACM Multimedia 2017 (Thematic Workshops '17), 2017
Visual question answering (VQA) is arguably one of the most challenging multimodal understanding problems, as it requires reasoning and a deep understanding of the image, the question, and their semantic relationship. Existing VQA methods rely heavily on attention mechanisms to semantically relate the question words with the image contents when answering questions. However, most attention models are simplified to a linear transformation over the multimodal representation, which we argue is insufficient for capturing the complex nature of multimodal data. In this paper we propose a novel generative attention model obtained by adversarial self-learning. The proposed adversarial attention produces more diverse visual attention maps and generalizes better to new questions. Experiments show that the proposed adversarial attention yields a state-of-the-art VQA model on the two VQA benchmark datasets, VQA v1.0 and v2.0.
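To make concrete what the abstract means by attention "simplified to a linear transformation over the multimodal representation," here is a minimal NumPy sketch of that baseline scheme. All shapes, names, and the elementwise fusion choice are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Illustrative dimensions (assumed): 196 image regions, 512-d features.
rng = np.random.default_rng(0)
num_regions, d = 196, 512
V = rng.standard_normal((num_regions, d))  # per-region visual features
q = rng.standard_normal(d)                 # question embedding

# Fuse the two modalities (elementwise product is one common choice),
# then apply a single linear map to get one attention score per region.
W = rng.standard_normal((d, 1)) * 0.01
scores = (V * q) @ W                       # shape: (num_regions, 1)

# Softmax over regions gives the visual attention map.
attn = np.exp(scores - scores.max())
attn /= attn.sum()

# Attention-weighted sum of region features: the attended visual vector.
attended = (attn * V).sum(axis=0)          # shape: (d,)
```

The paper's critique is that this single linear scoring step is too weak for complex image-question interactions; its proposed alternative instead learns the attention map generatively, trained adversarially.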
doi:10.1145/3126686.3126695 dblp:conf/mm/IlievskiF17