X-Linear Attention Networks for Image Captioning [article]

Yingwei Pan and Ting Yao and Yehao Li and Tao Mei
2020 arXiv pre-print
Recent progress on fine-grained visual recognition and visual question answering has featured bilinear pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual ... or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both spatial and channel-wise bilinear attention distributions to capture the 2nd order interactions between the input single-modal or multi-modal features. Higher order feature interactions are readily modeled by stacking multiple X-Linear attention blocks, and even infinity order interactions by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr score to date, 132.0. When Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted up to 132.8.
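The abstract names the ingredients of the block (bilinear pooling, a spatial attention distribution, a channel-wise attention distribution, and an optional ELU) but gives no equations. The following is a minimal NumPy sketch of one way those pieces could fit together, not the authors' implementation. The function names, the parameter dictionary, the projection sizes `d_b` and `d_c`, the random initialization, and the choice of an element-wise product of two projections as the low-rank bilinear pooling are all assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def elu(x):
    # Exponential Linear Unit; the abstract links ELU to infinity order interactions.
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def init_params(d_q, d_k, d_v, d_b, d_c, seed=0):
    # Hypothetical parameter set: random projections, for illustration only.
    rng = np.random.default_rng(seed)
    w = lambda rows, cols: rng.standard_normal((rows, cols)) / np.sqrt(rows)
    return {
        "Wk": w(d_k, d_b), "Wqk": w(d_q, d_b),  # query-key bilinear pooling
        "Wv": w(d_v, d_b), "Wqv": w(d_q, d_b),  # query-value bilinear pooling
        "Wbk": w(d_b, d_c),                     # embedding of the pooled query-key features
        "ws": w(d_c, 1),                        # spatial attention logits
        "Wc": w(d_c, d_b),                      # channel-wise attention logits
    }

def x_linear_attention(q, K, V, p, act=relu):
    """q: (d_q,) query; K: (n, d_k) keys; V: (n, d_v) values, one row per region."""
    # Low-rank bilinear pooling: element-wise product of two projected inputs.
    Bk = act(K @ p["Wk"]) * act(q @ p["Wqk"])      # (n, d_b)
    Bv = act(V @ p["Wv"]) * act(q @ p["Wqv"])      # (n, d_b)
    Bk_emb = relu(Bk @ p["Wbk"])                   # (n, d_c)
    # Spatial attention: one softmax weight per region.
    beta_s = softmax((Bk_emb @ p["ws"])[:, 0])     # (n,)
    # Channel-wise attention: one sigmoid gate per channel, from the region average.
    beta_c = sigmoid(Bk_emb.mean(axis=0) @ p["Wc"])  # (d_b,)
    # Attended feature: spatially weighted sum of pooled values, gated per channel.
    attended = beta_c * (beta_s @ Bv)              # (d_b,)
    return attended, beta_s, beta_c
```

Under these assumptions, calling `x_linear_attention(q, K, V, init_params(8, 6, 6, 16, 4))` with a query of size 8 and five 6-dimensional regions returns a 16-dimensional attended feature plus the two attention distributions. Passing `act=elu` swaps in the ELU variant mentioned in the abstract, and feeding the output back in as the next query is one possible reading of "stacking" blocks.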
arXiv:2003.14080v1 fatcat:ii7jfpw7jjgchjk3qz6a55v7zu