X-Linear Attention Networks for Image Captioning
[article]
2020
arXiv
pre-print
Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which novelly integrate X-Linear attention block(s) into the image encoder and sentence decoder of the image captioning model to leverage higher ...
Nevertheless, there has been no evidence in support of building such interactions concurrently with the attention mechanism for image captioning. ...
More remarkably, we obtain new state-of-the-art performances on this captioning dataset with X-LAN. Figure 3. Overview of our X-Linear Attention Networks (X-LAN) for image captioning. ...
arXiv:2003.14080v1
fatcat:ii7jfpw7jjgchjk3qz6a55v7zu
X-Linear Attention Networks for Image Captioning
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which novelly integrate X-Linear attention block(s) into the image encoder and sentence decoder of the image captioning model to leverage higher ...
Nevertheless, there has been no evidence in support of building such interactions concurrently with the attention mechanism for image captioning. ...
Overview of our X-Linear Attention Networks (X-LAN) for image captioning. Faster R-CNN is first utilized to detect a set of image regions. ...
doi:10.1109/cvpr42600.2020.01098
dblp:conf/cvpr/PanYLM20
fatcat:nf5bki4675g5rn72nzyiraid6e
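A minimal PyTorch sketch of the bilinear-pooling idea behind an X-Linear attention block, as described in the entry above: a second-order (element-wise product) interaction between the embedded query and the region keys drives spatial attention over detected image regions. Layer names, shapes, and the single-head structure are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F
from torch import nn

class XLinearAttention(nn.Module):
    """Query/key bilinear pooling followed by spatial attention over regions."""
    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim)   # embeds the query (e.g. decoder state)
        self.wk = nn.Linear(dim, dim)   # embeds the region keys
        self.wv = nn.Linear(dim, dim)   # embeds the region values
        self.score = nn.Linear(dim, 1)  # spatial attention logits

    def forward(self, query, regions):
        # query: (B, dim); regions: (B, N, dim), e.g. Faster R-CNN features
        # 2nd-order interaction: element-wise product of embedded query and keys
        joint = F.elu(self.wq(query)).unsqueeze(1) * F.elu(self.wk(regions))
        weights = torch.softmax(self.score(joint), dim=1)   # (B, N, 1)
        return (weights * self.wv(regions)).sum(dim=1)      # attended context (B, dim)

att = XLinearAttention(512)
ctx = att(torch.randn(2, 512), torch.randn(2, 36, 512))     # -> (2, 512)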
AMC: Attention guided Multi-modal Correlation Learning for Image Search
[article]
2017
arXiv
pre-print
Conditioned on the query's intent, intra-attention networks (i.e., a visual intra-attention network and a language intra-attention network) attend to informative parts within each modality; a multi-modal inter-attention ...
In this paper, we leverage visual and textual modalities for image search by learning their correlation with the input query. ...
To learn the query's embedding q_m and the query-guided multi-modal representation x_q for image x, we propose a multi-modal inter-attention network (MTN) to attend to informative modalities. ...
arXiv:1704.00763v1
fatcat:uvlsm4nh6ne3hdzhulcsgwrrfm
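A hedged sketch of the multi-modal inter-attention idea in the entry above: softmax weights conditioned on the query embedding q_m select among per-modality representations to form the query-guided representation x_q. The bilinear scoring function and all shapes are assumptions for illustration.

import torch
from torch import nn

class InterAttention(nn.Module):
    """Attend over modality embeddings (e.g. visual and textual) given a query."""
    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim)

    def forward(self, q_m, modalities):
        # q_m: (B, dim) query embedding; modalities: (B, M, dim), one row per modality
        scores = torch.einsum('bd,bmd->bm', self.wq(q_m), modalities)
        alpha = torch.softmax(scores, dim=1)                 # weight per modality
        x_q = torch.einsum('bm,bmd->bd', alpha, modalities)  # query-guided representation
        return x_q, alpha

mtn = InterAttention(256)
x_q, alpha = mtn(torch.randn(4, 256), torch.randn(4, 2, 256))  # visual + textual rows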
An Analysis of the Use of Feed-Forward Sub-Modules for Transformer-Based Image Captioning Tasks
2021
Applied Sciences
In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the computational complexity that accompanies the use of the self-attention ...
We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and obtain competitive results on the MSCOCO dataset. ...
networks to analyze their effects and show that fully connected layers can be worthy replacements for self-attention in image captioning. ...
doi:10.3390/app112411635
fatcat:ivmaovr6wjchtl6jusuh2an2ba
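The entry's premise, feed-forward (linear) transforms standing in for a self-attention sub-layer while the residual connection and LayerNorm are retained, can be sketched as below; the hidden width and activation are assumptions.

import torch
from torch import nn

class FeedForwardEncoderLayer(nn.Module):
    """Transformer-style layer whose self-attention sub-layer is replaced by a
    position-wise feed-forward transform."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (B, N, dim) image features or embeddings
        return self.norm(x + self.ff(x))     # O(N) in sequence length, vs O(N^2)

layer = FeedForwardEncoderLayer(512)
out = layer(torch.randn(2, 36, 512))         # same shape out as in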
CPTR: Full Transformer Network for Image Captioning
[article]
2021
arXiv
pre-print
In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose the CaPtion TransformeR (CPTR), which takes sequentialized raw images as the input ...
In addition, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder, made possible by the full Transformer architecture. ...
Currently, most captioning algorithms follow an encoder-decoder architecture in which a decoder network predicts words according to the features extracted by the encoder network via attention ...
arXiv:2101.10804v3
fatcat:e3jbdxop7zdkxliuvikyu2ltoq
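The "sequentialized raw images" input that CPTR takes can be illustrated with a standard ViT-style patch embedding: a strided convolution splits the image into patches that are flattened into a token sequence. The 16x16 patch size and embedding width are assumptions, not necessarily CPTR's settings.

import torch
from torch import nn

patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # one embedding per 16x16 patch
img = torch.randn(1, 3, 224, 224)
tokens = patchify(img).flatten(2).transpose(1, 2)        # (1, 196, 768): 14x14 patch tokens
# `tokens` is the input sequence a full-Transformer encoder would consume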
AutoCaption: Image Captioning with Neural Architecture Search
[article]
2021
arXiv
pre-print
Image captioning transforms complex visual information into abstract natural language, which can help computers understand the world quickly. ...
Neural Architecture Search (NAS) has played an important role in a variety of image recognition tasks. Moreover, RNNs play an essential role in the image captioning task. ...
Image captioning also urgently needs an automatic network design method to produce more effective networks for image understanding and text generation. ...
arXiv:2012.09742v3
fatcat:nfyh3mf5wjeitjkbzjbkex4fp4
Variational Autoencoder-Based Multiple Image Captioning Using a Caption Attention Map
2019
Applied Sciences
Because an image feature plays an important role when generating captions, a method to extract a Caption Attention Map (CAM) of the image is proposed, and CAMs are projected to a latent distribution. ...
Image captioning is a promising research topic applicable to services that search for desired content in large amounts of video data, as well as to situation-explanation services for visually impaired people ...
In this paper, a method to generate multiple captions for a single image using the variational autoencoder (VAE) [9] structure and image attention information is proposed. ...
doi:10.3390/app9132699
fatcat:5eqct5befzgmnfunvfzu7umkh4
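The multiple-caption mechanism above rests on drawing several latent samples from the VAE posterior; a minimal sketch of the reparameterization step is below. The encoder/decoder interfaces are assumptions, not the paper's code.

import torch

def sample_latents(mu, logvar, n_samples: int):
    # mu, logvar: (B, z_dim) posterior parameters from the CAM/caption encoder
    std = (0.5 * logvar).exp()
    eps = torch.randn(n_samples, *mu.shape)
    return mu + eps * std                    # (n_samples, B, z_dim)

z = sample_latents(torch.zeros(1, 64), torch.zeros(1, 64), n_samples=5)
# each z[i] would condition the decoder to produce a distinct caption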
Multi-Gate Attention Network for Image Captioning
2021
IEEE Access
Index terms: image captioning, self-attention, transformer, multi-gate attention. ...
Furthermore, most current image captioning methods apply the original transformer, designed for natural language processing tasks, to refine image features directly. ...
Finally, we describe how to construct the Multi-Gate Attention Network (MGAN) for image captioning.
A. FRAMEWORK: Most existing image captioning methods adopt the encoder-decoder paradigm. ...
doi:10.1109/access.2021.3067607
fatcat:ogqwtb4lqrcslpk6kmtfcemjui
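One generic way to realize gated attention in the spirit of the entry above: a learned sigmoid gate blends the self-attention context with the original features. This shows the general idea only; MGAN's specific gates are defined in the paper.

import torch
from torch import nn

class GatedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                    # x: (B, N, dim)
        ctx, _ = self.attn(x, x, x)          # standard self-attention
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
        return g * ctx + (1 - g) * x         # gate decides how much context passes

m = GatedAttention(512)
y = m(torch.randn(2, 36, 512))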
Image-to-Text Transduction with Spatial Self-Attention
2018
The European Symposium on Artificial Neural Networks
Self-attention combines image features of regions based on their similarity before they are made accessible to the decoder through inter-attention. ...
In this paper we show that the concepts of self- and inter-attention can effectively be applied in an image-to-text task. ...
Conclusion: In this paper we have shown that a network relying primarily on attention operations can efficiently be applied to image captioning. ...
dblp:conf/esann/SpringenbergLWW18
fatcat:6cekry5ywzd6lir6b6bsp3oxmy
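The self-attention operation the snippet describes, combining region features by pairwise similarity before the decoder attends to them via inter-attention, is the standard scaled dot-product form:

import torch

def self_attention(x):                        # x: (B, N, d) region features
    d = x.size(-1)
    scores = x @ x.transpose(1, 2) / d ** 0.5 # (B, N, N) pairwise similarity
    return torch.softmax(scores, dim=-1) @ x  # each region becomes a mix of similar regions

out = self_attention(torch.randn(2, 36, 256))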
Attention Beam: An Image Captioning Approach
[article]
2020
arXiv
pre-print
In recent times, encoder-decoder based architectures have achieved state-of-the-art results for image captioning. ...
The aim of image captioning is to generate textual description of a given image. ...
for the Attention Beam image captioning system. ...
arXiv:2011.01753v2
fatcat:3uppmit4gfbd3p3hkzxaxuwc7i
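Beam search, the decoding strategy named in the title above, keeps the k best partial captions at each step instead of a single greedy choice. A compact sketch follows; `step` is a hypothetical callable returning next-token log-probabilities for a prefix.

import torch

def beam_search(step, bos: int, eos: int, beam: int = 3, max_len: int = 20):
    beams = [([bos], 0.0)]                        # (token prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))   # finished beams carry over unchanged
                continue
            logp = step(seq)                      # (vocab,) log-probabilities
            top = torch.topk(logp, beam)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((seq + [tok.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

# toy usage with a dummy uniform "model"
caption = beam_search(lambda seq: torch.log_softmax(torch.randn(100), dim=0), bos=1, eos=2)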
A sequential guiding network with attention for image captioning
[article]
2019
arXiv
pre-print
In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as image encoder and a recurrent neural network (RNN) as decoder. ...
The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner by using image/descriptions ...
CONCLUSION: In this paper, we have extended the encoder-decoder framework for image captioning by inserting a guiding network. ...
arXiv:1811.00228v3
fatcat:735nrfjbo5c7pjzl5nxgnp75ei
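A rough sketch of the guiding-LSTM extension described above: a second LSTM consumes the attention context, and its hidden state is fed to the decoder LSTM at every step. The wiring here is one plausible interpretation, not the paper's exact design.

import torch
from torch import nn

class GuidedDecoder(nn.Module):
    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.guide = nn.LSTMCell(dim, dim)        # guiding LSTM
        self.decoder = nn.LSTMCell(2 * dim, dim)  # decoder sees word + guidance
        self.out = nn.Linear(dim, vocab)

    def forward(self, word_emb, ctx, g_state, d_state):
        # word_emb, ctx: (B, dim); g_state, d_state: (h, c) tuples of (B, dim)
        g_state = self.guide(ctx, g_state)        # guidance from the attention context
        d_in = torch.cat([word_emb, g_state[0]], dim=-1)
        d_state = self.decoder(d_in, d_state)
        return self.out(d_state[0]), g_state, d_state

dec = GuidedDecoder(256, vocab=1000)
z = torch.zeros(2, 256)
logits, g, d = dec(torch.randn(2, 256), torch.randn(2, 256), (z, z), (z, z))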
Geometry Attention Transformer with Position-aware LSTMs for Image Captioning
[article]
2021
arXiv
pre-print
Aiming to further promote image captioning by transformers, this paper proposes an improved Geometry Attention Transformer (GAT) model. ...
In addition, this model includes two working modules: 1) a geometry gate-controlled self-attention refiner, which explicitly incorporates relative spatial information into image region representations during encoding ...
COCO is one of the largest datasets for image captioning, consisting of 123,287 images with five captions labeled for each. ...
arXiv:2110.00335v1
fatcat:emucqxpc3rdfpeu3xoewpwuj2i
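One simple way to inject relative spatial information into self-attention, in the spirit of the geometry refiner above, is an additive bias on the attention logits derived from pairwise box geometry. The bias source and the omission of the paper's gate are simplifications.

import torch

def geometry_attention(x, geom_bias):
    # x: (B, N, d) region features; geom_bias: (B, N, N) from relative box geometry
    d = x.size(-1)
    scores = x @ x.transpose(1, 2) / d ** 0.5 + geom_bias
    return torch.softmax(scores, dim=-1) @ x

out = geometry_attention(torch.randn(2, 36, 256), torch.randn(2, 36, 36))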
Multi-view pedestrian captioning with an attention topic CNN model
2018
Computers in Industry (print)
This feature vector is taken as input to a hierarchical recurrent neural network to generate multi-view captions for pedestrian images. ...
Therefore, in this paper, we propose a novel approach to generate multi-view captions for pedestrian images with a topic attention mechanism on global and local semantic regions. ...
Thus, we use Long Short-Term Memory (LSTM) [7] networks for caption generation. ...
doi:10.1016/j.compind.2018.01.015
fatcat:thfb3bi5gzcqhjahxdkuri47wy
Pre-training for Video Captioning Challenge 2020 Summary
[article]
2020
arXiv
pre-print
The Pre-training for Video Captioning Challenge 2020 Summary: results and challenge participants' technical reports. ...
XlanV Model for Video Captioning: The overall paradigm of our model, which leverages the X-Linear Attention network [1] as the backbone framework, is shown in Fig. 1. ...
CONCLUSION: In this paper, we introduce an X-Linear Attention network structure for video captioning that adaptively fuses multi-modality video features. ...
arXiv:2008.00947v1
fatcat:v73v4n5j5bhgbj5tneugrb32qa
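The "adaptively fusing multi-modality video features" step can be sketched as learned softmax weights over per-modality feature vectors; this is a generic simplification, not the report's exact fusion.

import torch
from torch import nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)                # one relevance logit per modality

    def forward(self, feats):                         # feats: (B, M, dim), M modalities
        w = torch.softmax(self.score(feats), dim=1)   # (B, M, 1)
        return (w * feats).sum(dim=1)                 # fused feature (B, dim)

fuse = AdaptiveFusion(512)
fused = fuse(torch.randn(2, 3, 512))                  # e.g. appearance, motion, audio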
A Modularized Architecture of Multi-Branch Convolutional Neural Network for Image Captioning
2019
Electronics
Image captioning is a comprehensive task in computer vision (CV) and natural language processing (NLP). ...
To obtain better image captions, we propose a highly modularized multi-branch CNN, which can increase accuracy while keeping the number of hyper-parameters unchanged. ...
designed specifically for image captioning. ...
doi:10.3390/electronics8121417
fatcat:5avr5izj25d3ddw456lbyfd6em
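A minimal sketch of a highly modularized multi-branch block in the ResNeXt style: parallel branches with identical topology realized as one grouped convolution, so the branch count (cardinality) can change without adding hyper-parameters. Channel and branch counts are assumptions.

import torch
from torch import nn

class MultiBranchBlock(nn.Module):
    def __init__(self, channels: int, branches: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=branches),  # parallel branches
            nn.ReLU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))   # residual aggregation of all branches

blk = MultiBranchBlock(64)
y = blk(torch.randn(1, 64, 32, 32))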
Showing results 1–15 of 12,716