End-to-End Transformer Based Model for Image Captioning
[article]
2022
arXiv
pre-print
In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training. ...
CNN-LSTM based architectures have played an important role in image captioning, but because they are limited in training efficiency and expressive ability, researchers began to explore CNN-Transformer based models ...
it difficult to train an image captioning model end-to-end from image pixels to descriptions, and also limits potential applications in real-world scenarios (Jiang et al. 2020). ...
arXiv:2203.15350v1
fatcat:fkozxwintzc2zejmabq2s6qcui
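To make the "one stage, end-to-end from pixels" idea concrete, the sketch below shows a minimal pure-Transformer captioner in PyTorch: a ViT-style patch embedding feeds a Transformer encoder-decoder whose decoder predicts caption tokens, so gradients from the caption loss reach the pixel-level embedding. This is an illustrative sketch under assumed sizes (patch size 16, d_model 512, a 10k-token vocabulary), not the architecture of the paper above.

```python
import torch
import torch.nn as nn

class TransformerCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, patch_size=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # ViT-style patch embedding: a strided convolution turns pixels into patch tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W); caption_tokens: (B, T), shifted-right targets for teacher forcing.
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, d_model)
        tgt = self.token_embed(caption_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(caption_tokens.size(1))
        out = self.transformer(patches + self.pos_embed, tgt, tgt_mask=causal_mask)
        return self.lm_head(out)                                        # (B, T, vocab) logits

# One forward pass on random data; in training, the caption loss back-propagates
# all the way through the patch embedding, i.e. end-to-end from pixels.
logits = TransformerCaptioner()(torch.randn(2, 3, 224, 224),
                                torch.zeros(2, 12, dtype=torch.long))
```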
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches
[article]
2022
arXiv
pre-print
For this purpose, we design an end-to-end detector and feature extractor-free captioning model by utilizing the patch-based shifted window technique. ...
We propose Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP) with faster inference speed and less computation. ...
To achieve an end-to-end captioning framework, the ViTCAP model [6] uses the Vision Transformer (ViT) [5], which encodes image patches as grid representations. ...
arXiv:2207.00113v1
fatcat:bmo5gcuqdncsbn3gomxsmishzy
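The detector- and feature-extractor-free design above rests on splitting patch tokens into local windows so that attention (or MLP mixing) runs per window rather than over detected regions. The snippet below is a minimal sketch of plain window partitioning only; the shifted windows, masking, and MLP mixing used in SwinMLP-TranCAP are omitted, and the token-map and window sizes are assumptions.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) patch-token map into (num_windows*B, window_size*window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

tokens = torch.randn(2, 14, 14, 96)    # 14x14 patch tokens with 96 channels (illustrative)
windows = window_partition(tokens, 7)  # -> (2*4, 49, 96): attention/MLP mixing runs per window
```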
End-to-end Image Captioning Exploits Multimodal Distributional Similarity
[article]
2018
arXiv
pre-print
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar ...
To validate our hypothesis, we focus on the 'image' side of image captioning, and vary the input image representation but keep the RNN text generation component of a CNN-RNN model constant. ...
The authors also thank the anonymous reviewers for their valuable feedback on an earlier draft of the paper. ...
arXiv:1809.04144v1
fatcat:3odhf3xtcfeq7obx57b766rwhm
Synthesizing spoken descriptions of images
2021
IEEE/ACM Transactions on Audio Speech and Language Processing
are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model that is able to synthesize spoken descriptions of images bypassing both text and phonemes is proposed. ...
However, current text-based image captioning methods cannot be applied to approximately half of the world's languages due to these languages' lack of a written form. ...
END-TO-END IMAGE-TO-SPEECH. The proposed end-to-end model, referred to as the Show and Speak (SAS) model, is based on an encoder-decoder framework. ...
doi:10.1109/taslp.2021.3120644
fatcat:iyfneb6murdafa4og63zasml2y
Image Captioning In the Transformer Age
[article]
2022
arXiv
pre-print
This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is a natural fit, having proven its huge potential in both vision ...
However, since the CNN and RNN do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, and the visual encoder learns nothing from the caption supervision ...
To solve this inherent defect, we discuss the feasibility of building a Transformer-based homogeneous architecture that facilitates end-to-end training. ...
arXiv:2204.07374v1
fatcat:ftsoam2ei5da5fkygq4pztzxda
A Frustratingly Simple Approach for End-to-End Image Captioning
[article]
2022
arXiv
pre-print
To alleviate such defects, we propose a frustratingly simple but highly effective end-to-end image captioning framework, Visual Conditioned GPT (VC-GPT), by connecting the pre-trained visual encoder (CLIP-ViT ...
Before training the captioning models, an extra object detector is first used to recognize the objects in the image. ...
Based on such design, we propose our Visual Conditioned GPT (VC-GPT) framework for end-to-end image captioning. ...
arXiv:2201.12723v3
fatcat:ix4bz4aigzc2pd3uamxjswjgia
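The "connect a pre-trained visual encoder to GPT" recipe can be pictured as projecting patch features into the language model's embedding space and letting the decoder condition on them, for example as a visual prefix. The sketch below shows that generic pattern with dummy stand-in modules; it is not the authors' VC-GPT design, and all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisualPrefixCaptioner(nn.Module):
    """Project visual tokens into the text decoder's embedding space and prepend them."""
    def __init__(self, vision_encoder, text_decoder, vis_dim=768, txt_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT: images -> (B, N, vis_dim)
        self.text_decoder = text_decoder              # a causal LM over embeddings -> logits
        self.projector = nn.Linear(vis_dim, txt_dim)  # bridges the two embedding spaces

    def forward(self, images, token_embeddings):
        visual_tokens = self.projector(self.vision_encoder(images))          # (B, N, txt_dim)
        decoder_input = torch.cat([visual_tokens, token_embeddings], dim=1)  # [visual prefix; caption]
        return self.text_decoder(decoder_input)

# Dummy stand-ins just to show the shapes; a real setup would plug in e.g. CLIP-ViT and GPT-2.
class DummyEncoder(nn.Module):
    def forward(self, images):                        # (B, 3, H, W) -> (B, 49, 768)
        return torch.randn(images.size(0), 49, 768)

class DummyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(768, 8, batch_first=True)  # causal masking omitted
        self.head = nn.Linear(768, 10000)
    def forward(self, x):
        return self.head(self.block(x))               # (B, N+T, vocab) logits

logits = VisualPrefixCaptioner(DummyEncoder(), DummyDecoder())(
    torch.randn(2, 3, 224, 224), torch.randn(2, 12, 768))
```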
Injecting Semantic Concepts into End-to-End Image Captioning
[article]
2022
arXiv
pre-print
In this paper, we are concerned with a better-performing detector-free image captioning model, and propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations ...
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning. ...
propose to inject semantic concepts into end-to-end captioning by learning from open-form captions. ...
arXiv:2112.05230v2
fatcat:6ztitkrb7zgnxmrjkktnblipgq
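As a rough picture of injecting predicted concepts into the captioning pipeline, the sketch below classifies concepts from pooled grid features, embeds the top-scoring ones, and concatenates them with the grid tokens as the decoder's memory. It illustrates the general pattern only, not the paper's Concept Token Network; the concept vocabulary size, the top-k choice, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

d_model, num_concepts, top_k = 512, 2000, 10
concept_head = nn.Linear(d_model, num_concepts)      # multi-label concept classifier
concept_embed = nn.Embedding(num_concepts, d_model)  # embeddings for predicted concepts

grid_feats = torch.randn(2, 196, d_model)                         # ViT grid representations
concept_logits = concept_head(grid_feats.mean(dim=1))             # (B, num_concepts)
top_ids = concept_logits.topk(top_k, dim=-1).indices              # (B, top_k) predicted concepts
memory = torch.cat([grid_feats, concept_embed(top_ids)], dim=1)   # decoder memory with concept tokens
```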
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
[article]
2021
arXiv
pre-print
In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn ...
We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture for enhancing visual learning. ...
The CNN backbone for visual representation learning and the Transformer for cross-modal semantic fusion are combined into a single model, which is end-to-end trainable. ...
arXiv:2106.01804v2
fatcat:echgyssdmrh7hmz2d4pwlslslq
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
[article]
2022
arXiv
pre-print
Moreover, its monolithic design consisting only of Transformers enables end-to-end training of the model. ...
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential to describe the content of images; they are usually extracted ...
Grid- and Region-based Image captioning Transformer. This section describes the architecture of GRIT (Grid- and Region-based Image captioning Transformer). ...
arXiv:2207.09666v1
fatcat:l6l6gkgeyrh5fk33nymf2nnpiq
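A minimal way to picture "dual visual features" is a caption decoder that cross-attends to both dense grid tokens and object-level region tokens, here simply concatenated into one memory sequence. GRIT's actual cross-attention fusion layers differ, and the feature sources and sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_model = 512
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
caption_decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

grid_feats = torch.randn(2, 196, d_model)    # e.g. 14x14 patch tokens from a Transformer backbone
region_feats = torch.randn(2, 100, d_model)  # e.g. 100 object queries from a DETR-style head
memory = torch.cat([grid_feats, region_feats], dim=1)   # joint visual memory (B, 296, d_model)

caption_embeds = torch.randn(2, 12, d_model)            # embedded caption tokens (teacher forcing)
logits = nn.Linear(d_model, 10000)(caption_decoder(caption_embeds, memory))
```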
Areas of Attention for Image Captioning
[article]
2017
arXiv
pre-print
We propose "Areas of Attention", a novel attention-based model for automatic image captioning. ...
In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions. ...
We thank NVIDIA for donating GPUs used in this research. This work was partially supported by the grants ERC Allegro, ANR-16-CE23-0006, and ANR-11-LABX-0025-01. ...
arXiv:1612.01033v2
fatcat:q2m32iyx3zayplz2monjhp5ha4
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering
[article]
2016
arXiv
pre-print
end-to-end system. ...
Specifically, we first equip CNN-based visual encoder with a differentiable layer to enable spatially invariant transformation of visual signals. ...
loop between the CNN-based encoder and the LSTM-based decoder to form an end-to-end formulation for the image captioning task. ...
arXiv:1612.04949v1
fatcat:l72kpcj4tbb25j4kihp7hcqw5a
An Empirical Study of Training End-to-End Vision-and-Language Transformers
[article]
2022
arXiv
pre-print
In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. ...
best fully transformer-based model by 1.6%. ...
To close the performance gap, we present METER (Multimodal End-to-end TransformER), through which we thoroughly investigate how to design and pretrain a fully transformer-based VLP model in an end-to-end ...
arXiv:2111.02387v3
fatcat:uvimqu4vizdo5at5gspwjztdhi
Dual-Level Decoupled Transformer for Video Captioning
[article]
2022
arXiv
pre-print
" paradigm, releasing the potential of using dedicated model(e.g. image-text pre-training) to connect the pre-training and downstream tasks, and makes the entire model end-to-end trainable. ...
For the former, "couple" means learning spatio-temporal representation in a single model(3DCNN), resulting the problems named disconnection in task/pre-train domain and hard for end-to-end training. ...
To tackle the above drawbacks, we propose D 2 , a dual-level decoupled pure transformer pipeline for end-to-end video captioning. ...
arXiv:2205.03039v1
fatcat:omrzfavtlngotbf27d43nwe4k4
Let's Talk! Striking Up Conversations via Conversational Visual Question Generation
[article]
2022
arXiv
pre-print
The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideal conversation starters. ...
This paper introduces a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question. ...
Stage 2: Response-Provoking Question Generation from the Story. We utilize a Transformer-based end-to-end model (Lopez et al. 2020) as our question generation model. ...
arXiv:2205.09327v1
fatcat:ok4qjcrr4bautb62hhiyqy6vwe
The role of image representations in vision to language tasks
2018
Natural Language Engineering
Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based ...
In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. ...
End-to-end systems also require a lot of parallel corpora (images with captions) for training, making it hard to adapt to different languages, styles or domains. ...
doi:10.1017/s1351324918000116
fatcat:3hiuawskdbfmjfnwz2mctluffy
Showing results 1 — 15 out of 38,545 results