doi:10.1145/3126686.3126695 · dblp:conf/mm/IlievskiF17

Visual question answering (VQA) is arguably one of the most challenging multimodal understanding problems, as it requires reasoning and a deep understanding of the image, the question, and their semantic relationship. Existing VQA methods rely heavily on attention mechanisms to semantically relate the question words with the image contents for answering the related questions. However, most of the attention models are simplified to a linear transformation over the multimodal representation, which …
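To make the "linear transformation over the multimodal representation" concrete, below is a minimal sketch of this common style of VQA attention, not the paper's own model: image region features and a question embedding are projected into a joint space, scored by a single linear layer, and combined by a softmax-weighted sum. It assumes PyTorch, and the feature dimensions (36 regions, 2048-d image features, 1024-d question vector) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Simplified VQA attention: a linear scoring layer over the fused representation."""
    def __init__(self, img_dim: int, q_dim: int, hidden: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # project image region features
        self.q_proj = nn.Linear(q_dim, hidden)      # project question embedding
        self.score = nn.Linear(hidden, 1)           # linear transformation producing attention logits

    def forward(self, img_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, regions, img_dim); q_feat: (batch, q_dim)
        joint = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_feat).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (batch, regions)
        # Attention-weighted sum of region features -> attended image vector
        return torch.bmm(attn.unsqueeze(1), img_feats).squeeze(1)

# Usage with random tensors standing in for real image/question features (assumed shapes)
img = torch.randn(2, 36, 2048)
q = torch.randn(2, 1024)
attended = LinearAttention(2048, 1024)(img, q)
print(attended.shape)  # torch.Size([2, 2048])
```

The attended image vector would then typically be fused with the question representation and fed to an answer classifier; the abstract's point is that the scoring step above is only a single linear map, which constrains how richly the two modalities can interact.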