12,716 Hits in 5.2 sec

X-Linear Attention Networks for Image Captioning [article]

Yingwei Pan and Ting Yao and Yehao Li and Tao Mei
2020 arXiv   pre-print
Furthermore, we present X-Linear Attention Networks (dubbed as X-LAN) that novelly integrates X-Linear attention block(s) into image encoder and sentence decoder of image captioning model to leverage higher  ...  Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning.  ...  More remarkably, we obtain new state-of-the-art performances on this captioning dataset with X-LAN. Figure 3. Overview of our X-Linear Attention Networks (X-LAN) for image captioning.  ... 
arXiv:2003.14080v1 fatcat:ii7jfpw7jjgchjk3qz6a55v7zu

X-Linear Attention Networks for Image Captioning

Yingwei Pan, Ting Yao, Yehao Li, Tao Mei
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
Furthermore, we present X-Linear Attention Networks (dubbed as X-LAN) that novelly integrates X-Linear attention block(s) into image encoder and sentence decoder of image captioning model to leverage higher  ...  Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning.  ...  Overview of our X-Linear Attention Networks (X-LAN) for image captioning. Faster R-CNN is firstly utilized to detect a set of image regions.  ... 
doi:10.1109/cvpr42600.2020.01098 dblp:conf/cvpr/PanYLM20 fatcat:nf5bki4675g5rn72nzyiraid6e
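The X-Linear entries above describe attention that exploits higher-order feature interactions rather than a plain dot product. As a rough illustration only (not the paper's actual block), the sketch below scores each key by a weighted sum over the element-wise product of query and key; the weight vector `w` and the function name are stand-ins for the learned projections in the real model.

```python
import math

def bilinear_attention(query, keys, values, w):
    """Second-order attention sketch: scores come from w . (query * key),
    so pairwise feature interactions drive the softmax weights.
    query: list[float]; keys, values: list[list[float]]; w: list[float]."""
    d = len(query)
    scores = [sum(wi * qi * ki for wi, qi, ki in zip(w, query, k)) / math.sqrt(d)
              for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]       # attention distribution over keys
    # weighted mix of value vectors
    return [sum(a * v[j] for a, v in zip(alphas, values))
            for j in range(len(values[0]))]
```

With `w` all ones this reduces to ordinary scaled dot-product attention; a non-uniform `w` is what lets the score emphasize particular feature-pair interactions.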

AMC: Attention guided Multi-modal Correlation Learning for Image Search [article]

Kan Chen, Trung Bui, Fang Chen, Zhaowen Wang, Ram Nevatia
2017 arXiv   pre-print
Conditioned on query's intent, intra-attention networks (i.e., visual intra-attention network and language intra-attention network) attend on informative parts within each modality; a multi-modal inter-attention  ...  In this paper, we leverage visual and textual modalities for image search by learning their correlation with input query.  ...  To learn the query's embedding q_m and query-guided multi-modal representation x_q for image x, we propose a multi-modal inter-attention network (MTN) to attend on informative modalities.  ... 
arXiv:1704.00763v1 fatcat:uvlsm4nh6ne3hdzhulcsgwrrfm

An Analysis of the Use of Feed-Forward Sub-Modules for Transformer-Based Image Captioning Tasks

Raymond Ian Osolo, Zhan Yang, Jun Long
2021 Applied Sciences  
In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the effects of the computational complexities that accompany the use of the self-attention  ...  We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and obtain competitive results on the MSCOCO dataset.  ...  networks to analyze their effects and show that fully connected layers can be worthy replacements for self-attention in image captioning.  ... 
doi:10.3390/app112411635 fatcat:ivmaovr6wjchtl6jusuh2an2ba
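The entry above replaces self-attention layers with feed-forward ones. A minimal sketch of that substitution, assuming plain two-layer per-token networks with hand-supplied weight matrices (all names here are illustrative, not the paper's code):

```python
def feed_forward_encode(tokens, W1, b1, W2, b2):
    """Encode each token independently with a two-layer feed-forward
    network (ReLU in between) instead of a self-attention layer.
    tokens: list of feature vectors; W1: hidden x dim, W2: dim x hidden."""
    def relu(x):
        return x if x > 0 else 0.0
    out = []
    for t in tokens:
        # hidden layer: affine map followed by ReLU
        h = [relu(sum(w * x for w, x in zip(row, t)) + b)
             for row, b in zip(W1, b1)]
        # output layer: affine map back to the token dimension
        o = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(W2, b2)]
        out.append(o)
    return out
```

Unlike self-attention, tokens never exchange information here, which is exactly the trade-off the paper's analysis probes: lower computational cost versus no cross-token mixing.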

CPTR: Full Transformer Network for Image Captioning [article]

Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
2021 arXiv   pre-print
In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose CaPtion TransformeR (CPTR) which takes the sequentialized raw images as the input  ...  Besides, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder thanks to the full Transformer architecture.  ...  Currently, most captioning algorithms follow an encoder-decoder architecture in which a decoder network is used to predict words according to the feature extracted by the encoder network via attention  ... 
arXiv:2101.10804v3 fatcat:e3jbdxop7zdkxliuvikyu2ltoq
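CPTR's snippet describes taking "sequentialized raw images" as input. A toy version of that sequentialization step, assuming a single-channel image stored as nested lists and a patch size that divides both dimensions (the helper name is mine, not CPTR's):

```python
def image_to_patch_sequence(image, patch):
    """Flatten an H x W image into a row-major sequence of patch*patch
    vectors -- the token sequence a full-Transformer captioner consumes.
    Assumes H and W are divisible by patch."""
    h, w = len(image), len(image[0])
    seq = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            # one flattened patch becomes one token
            seq.append([image[r + i][c + j]
                        for i in range(patch) for j in range(patch)])
    return seq
```

In the real model each flattened patch would then pass through a learned linear embedding and gain a positional encoding before entering the encoder.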

AutoCaption: Image Captioning with Neural Architecture Search [article]

Xinxin Zhu and Weining Wang and Longteng Guo and Jing Liu
2021 arXiv   pre-print
Image captioning transforms complex visual information into abstract natural language for representation, which can help computers understand the world quickly.  ...  Neural Architecture Search (NAS) has shown its important role in a variety of image recognition tasks. Besides, RNN plays an essential role in the image captioning task.  ...  Image captioning also urgently needs an automatic network design method to design a more effective network for image understanding and text generation.  ... 
arXiv:2012.09742v3 fatcat:nfyh3mf5wjeitjkbzjbkex4fp4

Variational Autoencoder-Based Multiple Image Captioning Using a Caption Attention Map

Boeun Kim, Saim Shin, Hyedong Jung
2019 Applied Sciences  
Because an image feature plays an important role when generating captions, a method to extract a Caption Attention Map (CAM) of the image is proposed, and CAMs are projected to a latent distribution.  ...  Image captioning is a promising research topic that is applicable to services that search for desired content in a large amount of video data and a situation explanation service for visually impaired people  ...  In this paper, a method to generate multiple captions for a single image using the variational autoencoder (VAE) [9] structure and image attention information is proposed.  ... 
doi:10.3390/app9132699 fatcat:5eqct5befzgmnfunvfzu7umkh4

Multi-Gate Attention Network for Image Captioning

Weitao Jiang, Xiying Li, Haifeng Hu, Qiang Lu, Bohong Liu
2021 IEEE Access  
INDEX TERMS Image captioning, self-attention, transformer, multi-gate attention.  ...  Furthermore, most current image captioning methods apply the original transformer designed for natural language processing task, to refine image features directly.  ...  Finally, we describe how to construct the Multi-Gate Attention Network (MGAN) for image captioning. A. FRAMEWORK Most existing image captioning methods adopt the encoder-decoder paradigm.  ... 
doi:10.1109/access.2021.3067607 fatcat:ogqwtb4lqrcslpk6kmtfcemjui

Image-to-Text Transduction with Spatial Self-Attention

Sebastian Springenberg, Egor Lakomkin, Cornelius Weber, Stefan Wermter
2018 The European Symposium on Artificial Neural Networks  
Self-attention combines image features of regions based on their similarity before they are made accessible to the decoder through inter-attention.  ...  In this paper we show that the concepts of self- and inter-attention can effectively be applied in an image-to-text task.  ...  Conclusion In this paper we have shown that a network relying primarily on attention operations can efficiently be applied to image captioning.  ... 
dblp:conf/esann/SpringenbergLWW18 fatcat:6cekry5ywzd6lir6b6bsp3oxmy
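The entry above uses self-attention to combine region features by similarity before the decoder sees them. A generic sketch of that mechanism, with queries, keys, and values all equal to the input and no learned projections (a simplification, not this paper's exact layer):

```python
import math

def self_attention(regions):
    """Scaled dot-product self-attention over region feature vectors:
    each output vector is a similarity-weighted mix of all regions."""
    d = len(regions[0])
    out = []
    for q in regions:
        # similarity of this region to every region, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in regions]
        m = max(scores)                  # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        out.append([sum(wi * r[j] for wi, r in zip(w, regions))
                    for j in range(d)])
    return out
```

Because every region attends over every other, similar regions pull each other's features together, which is the "combines image features of regions based on their similarity" behavior the snippet describes.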

Attention Beam: An Image Captioning Approach [article]

Anubhav Shrimal, Tanmoy Chakraborty
2020 arXiv   pre-print
In recent times, encoder-decoder based architectures have achieved state-of-the-art results for image captioning.  ...  The aim of image captioning is to generate textual description of a given image.  ...  for Attention Beam Image Captioning system.  ... 
arXiv:2011.01753v2 fatcat:3uppmit4gfbd3p3hkzxaxuwc7i

A sequential guiding network with attention for image captioning [article]

Daouda Sow and Zengchang Qin and Mouhamed Niasse and Tao Wan
2019 arXiv   pre-print
In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as image encoder and a recurrent neural network (RNN) as decoder.  ...  The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner by using image/descriptions  ...  CONCLUSION In this paper, we have extended the encoder-decoder framework for image captioning by inserting a guiding network.  ... 
arXiv:1811.00228v3 fatcat:735nrfjbo5c7pjzl5nxgnp75ei

Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [article]

Chi Wang, Yulin Shen, Luping Ji
2021 arXiv   pre-print
Aiming to further promote image captioning by transformers, this paper proposes an improved Geometry Attention Transformer (GAT) model.  ...  Besides, this model includes the two work modules: 1) a geometry gate-controlled self-attention refiner, for explicitly incorporating relative spatial information into image region representations in encoding  ...  COCO is one of the largest datasets for image captioning, consisting of 123,287 images with five captions labeled for each.  ... 
arXiv:2110.00335v1 fatcat:emucqxpc3rdfpeu3xoewpwuj2i

Multi-view pedestrian captioning with an attention topic CNN model

Quan Liu, Yingying Chen, Jinqiao Wang, Sijiong Zhang
2018 Computers in industry (Print)  
This feature vector is taken as input to a hierarchical recurrent neural network to generate multi-view captions for pedestrian images.  ...  Therefore, in this paper, we propose a novel approach to generate multi-view captions for pedestrian images with a topic attention mechanism on global and local semantic regions.  ...  Thus, we use Long Short-Term Memory (LSTM) [7] networks for caption generation.  ... 
doi:10.1016/j.compind.2018.01.015 fatcat:thfb3bi5gzcqhjahxdkuri47wy

Pre-training for Video Captioning Challenge 2020 Summary [article]

Yingwei Pan and Jun Xu and Yehao Li and Ting Yao and Tao Mei
2020 arXiv   pre-print
The Pre-training for Video Captioning Challenge 2020 Summary: results and challenge participants' technical reports.  ...  XlanV Model for Video Captioning The overall paradigm of our model, which leverages the X-Linear Attention network [1] as the backbone framework, is shown in Fig. 1 .  ...  CONCLUSION In this paper, we introduce a structure of X-linear Attention network for video captioning, which fully integrates video features by adaptively fusing multi-modality video features.  ... 
arXiv:2008.00947v1 fatcat:v73v4n5j5bhgbj5tneugrb32qa

A Modularized Architecture of Multi-Branch Convolutional Neural Network for Image Captioning

Shan He, Yuanyao Lu
2019 Electronics  
Image captioning is a comprehensive task in computer vision (CV) and natural language processing (NLP).  ...  In order to get better image captioning extraction, we propose a highly modularized multi-branch CNN, which could increase accuracy while maintaining the number of hyper-parameters unchanged.  ...  designed specifically for image captioning.  ... 
doi:10.3390/electronics8121417 fatcat:5avr5izj25d3ddw456lbyfd6em