Recent progress on fine-grained visual recognition and visual question answering has featured bilinear pooling, which effectively models the second-order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual
arXiv:2003.14080v1 fatcat:ii7jfpw7jjgchjk3qz6a55v7zu