38,545 Hits in 5.3 sec

End-to-End Transformer Based Model for Image Captioning [article]

Yiyu Wang, Jungang Xu, Yingfei Sun
2022 arXiv   pre-print
In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training.  ...  CNN-LSTM based architectures have played an important role in image captioning, but, limited by their training efficiency and expressive ability, researchers began to explore CNN-Transformer based models  ...  it difficult to train an image captioning model end-to-end from image pixels to descriptions, and also limits potential applications in real-world scenarios (Jiang et al. 2020).  ...
arXiv:2203.15350v1 fatcat:fkozxwintzc2zejmabq2s6qcui
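The one-stage, pixels-to-description pipeline this snippet describes begins by tokenizing the image into flat patches rather than running a CNN or object detector. A minimal NumPy sketch of that ViT-style patch-embedding step (patch size and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size*patch_size*C),
    the token sequence a Transformer encoder would consume.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # -> (ph, pw, p, p, C)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14x14 patches of 16x16x3 pixels
```

Each row of `tokens` becomes one input token; a real model would follow this with a learned linear projection and positional embeddings before the Transformer encoder.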

Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches [article]

Mengya Xu and Mobarakol Islam and Hongliang Ren
2022 arXiv   pre-print
For this purpose, we design an end-to-end, detector- and feature-extractor-free captioning model by utilizing the patch-based shifted window technique.  ...  We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP), with faster inference speed and less computation.  ...  To achieve an end-to-end captioning framework, the ViTCAP model [6] uses the Vision Transformer (ViT) [5], which encodes image patches as grid representations.  ...
arXiv:2207.00113v1 fatcat:bmo5gcuqdncsbn3gomxsmishzy
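The patch-based shifted-window technique this snippet refers to alternates between a plain window partition and a cyclically shifted one, so information mixes across window borders between layers. A toy NumPy sketch (window size and feature-map shape are illustrative, not from the paper):

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def shifted_windows(x, win):
    """Cyclically shift by win//2 before partitioning, as in shifted-window
    schemes, so the next layer's windows straddle the previous borders."""
    shifted = np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))
    return window_partition(shifted, win)

fmap = np.arange(8 * 8).reshape(8, 8, 1).astype(float)
print(window_partition(fmap, 4).shape)  # (4, 4, 4, 1): four 4x4 windows
print(shifted_windows(fmap, 4).shape)   # (4, 4, 4, 1)
```

In a SwinMLP-TranCAP-style layer, the MLP or attention mixing is applied within each window, and the cyclic shift is rolled back afterwards.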

End-to-end Image Captioning Exploits Multimodal Distributional Similarity [article]

Pranava Madhyastha, Josiah Wang, Lucia Specia
2018 arXiv   pre-print
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space by mapping a test image to similar  ...  To validate our hypothesis, we focus on the 'image' side of image captioning, and vary the input image representation but keep the RNN text generation component of a CNN-RNN model constant.  ...  The authors also thank the anonymous reviewers for their valuable feedback on an earlier draft of the paper.  ... 
arXiv:1809.04144v1 fatcat:3odhf3xtcfeq7obx57b766rwhm

Synthesizing spoken descriptions of images

Xinsheng Wang, Justin Van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model that is able to synthesize spoken descriptions of images, bypassing both text and phonemes, is proposed.  ...  However, current text-based image captioning methods cannot be applied to approximately half of the world's languages due to these languages' lack of a written form.  ...  END-TO-END IMAGE-TO-SPEECH The proposed end-to-end model, referred to as the Show and Speak (SAS) model, is based on an encoder-decoder framework.  ...
doi:10.1109/taslp.2021.3120644 fatcat:iyfneb6murdafa4og63zasml2y
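The encoder-decoder framework underlying SAS can be illustrated with a toy greedy decoding loop. Nothing below is the actual SAS architecture: the mean-pooling encoder, embedding tables, and sizes are hypothetical stand-ins for the real speech-generating decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image_feats):
    """Toy encoder: mean-pool per-region image features into one context vector."""
    return image_feats.mean(axis=0)

def decode_greedy(context, emb, W_out, bos=0, eos=1, max_len=10):
    """Toy autoregressive decoder: combine the context with the previous
    token's embedding and pick the arg-max output token at each step."""
    tokens = [bos]
    for _ in range(max_len):
        h = np.tanh(context + emb[tokens[-1]])
        nxt = int(np.argmax(W_out @ h))
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

V, D = 12, 8                       # toy vocabulary and hidden sizes
emb = rng.standard_normal((V, D))
W_out = rng.standard_normal((V, D))
caption = decode_greedy(encode(rng.standard_normal((5, D))), emb, W_out)
print(caption)  # token ids, starting with BOS (0)
```

An image-to-speech model replaces the discrete output tokens with acoustic frames, but the encode-then-autoregressively-decode loop is the same.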

Image Captioning In the Transformer Age [article]

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
2022 arXiv   pre-print
This drawback inspires researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is the perfect candidate, having proven its huge potential in both vision  ...  However, since CNN and RNN do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, as the visual encoder will not learn anything from the caption supervision  ...  To solve this natural defect, we discussed the feasibility of building a Transformer-based homogeneous architecture for facilitating end-to-end training.  ...
arXiv:2204.07374v1 fatcat:ftsoam2ei5da5fkygq4pztzxda

A Frustratingly Simple Approach for End-to-End Image Captioning [article]

Ziyang Luo, Yadong Xi, Rongsheng Zhang, Jing Ma
2022 arXiv   pre-print
To alleviate such defects, we propose a frustratingly simple but highly effective end-to-end image captioning framework, Visual Conditioned GPT (VC-GPT), by connecting the pre-trained visual encoder (CLIP-ViT  ...  Before training the captioning models, an extra object detector is first utilized to recognize the objects in the image.  ...  Based on such a design, we propose our Visual Conditioned GPT (VC-GPT) framework for end-to-end image captioning.  ...
arXiv:2201.12723v3 fatcat:ix4bz4aigzc2pd3uamxjswjgia

Injecting Semantic Concepts into End-to-End Image Captioning [article]

Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, Zicheng Liu
2022 arXiv   pre-print
In this paper, we are concerned with a better-performing detector-free image captioning model, and propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations  ...  For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.  ...  propose to inject semantic concepts into end-to-end captioning by learning from open-form captions.  ...
arXiv:2112.05230v2 fatcat:6ztitkrb7zgnxmrjkktnblipgq

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [article]

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang
2021 arXiv   pre-print
In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where we build a unified Transformer framework to jointly learn  ...  We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture for enhancing visual learning.  ...  The CNN backbone for visual representation learning and the Transformer for cross-modal semantic fusion are combined into a single model, which is end-to-end trainable.  ...
arXiv:2106.01804v2 fatcat:echgyssdmrh7hmz2d4pwlslslq

GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features [article]

Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani
2022 arXiv   pre-print
Moreover, its monolithic design consisting only of Transformers enables end-to-end training of the model.  ...  Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential to describe the content of images; they are usually extracted  ...  Grid- and Region-based Image Captioning Transformer This section describes the architecture of GRIT (Grid- and Region-based Image captioning Transformer).  ...
arXiv:2207.09666v1 fatcat:l6l6gkgeyrh5fk33nymf2nnpiq
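The "dual visual features" idea in this snippet — giving the caption generator both grid features (whole-image) and region features (object-level) — reduces, at its simplest, to merging the two token sets into one input sequence. A hedged NumPy sketch with illustrative shapes (not GRIT's actual fusion mechanism, which uses dedicated Transformer blocks):

```python
import numpy as np

def fuse_dual_features(grid_feats, region_feats):
    """Concatenate grid tokens (whole-image context) with region tokens
    (object-level detail) into one sequence for a captioning decoder."""
    return np.concatenate([grid_feats, region_feats], axis=0)

grid = np.random.rand(49, 256)     # e.g. a 7x7 feature grid, flattened
regions = np.random.rand(10, 256)  # e.g. 10 detected object regions
tokens = fuse_dual_features(grid, regions)
print(tokens.shape)  # (59, 256)
```

The decoder can then attend over both granularities at once, which is what lets region-level detail coexist with global context.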

Areas of Attention for Image Captioning [article]

Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek
2017 arXiv   pre-print
We propose "Areas of Attention", a novel attention-based model for automatic image captioning.  ...  In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions.  ...  We thank NVIDIA for donating GPUs used in this research. This work was partially supported by the grants ERC Allegro, ANR-16-CE23-0006, and ANR-11-LABX-0025-01.  ... 
arXiv:1612.01033v2 fatcat:q2m32iyx3zayplz2monjhp5ha4
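The direct word-region association this snippet describes amounts to scoring each image region against the current word state and pooling regions by those scores. A minimal softmax-attention sketch (all names, shapes, and the one-hot toy data are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(word_state, region_feats, W):
    """Score each region against the current word state, then return the
    attention-weighted sum of region features and the weights themselves."""
    scores = region_feats @ (W @ word_state)  # one scalar score per region
    weights = softmax(scores)
    return weights @ region_feats, weights

D = 4
regions = np.eye(D)                  # 4 toy one-hot region features
word = np.array([0., 0., 0., 5.])    # state strongly matching region 3
ctx, w = attend(word, regions, np.eye(D))
print(w.argmax())  # 3: attention focuses on the matching region
```

Tying the weights to the word state (rather than only the RNN hidden state) is what gives each generated word its own region association.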

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering [article]

Hao Liu, Yang Yang, Fumin Shen, Lixin Duan, Heng Tao Shen
2016 arXiv   pre-print
end-to-end system.  ...  Specifically, we first equip the CNN-based visual encoder with a differentiable layer to enable spatially invariant transformation of visual signals.  ...  loop between the CNN-based encoder and the LSTM-based decoder to form an end-to-end formulation for the image captioning task.  ...
arXiv:1612.04949v1 fatcat:l72kpcj4tbb25j4kihp7hcqw5a

An Empirical Study of Training End-to-End Vision-and-Language Transformers [article]

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng
2022 arXiv   pre-print
In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner.  ...  best fully transformer-based model by 1.6%.  ...  To close the performance gap, we present METER (Multimodal End-to-end TransformER), through which we thoroughly investigate how to design and pretrain a fully transformer-based VLP model in an end-to-end  ... 
arXiv:2111.02387v3 fatcat:uvimqu4vizdo5at5gspwjztdhi

Dual-Level Decoupled Transformer for Video Captioning [article]

Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang
2022 arXiv   pre-print
" paradigm, releasing the potential of using a dedicated model (e.g. image-text pre-training) to connect the pre-training and downstream tasks, and making the entire model end-to-end trainable.  ...  For the former, "couple" means learning spatio-temporal representation in a single model (3DCNN), resulting in problems of disconnection between the task and pre-training domains and difficulty with end-to-end training.  ...  To tackle the above drawbacks, we propose D^2, a dual-level decoupled pure transformer pipeline for end-to-end video captioning.  ...
arXiv:2205.03039v1 fatcat:omrzfavtlngotbf27d43nwe4k4

Let's Talk! Striking Up Conversations via Conversational Visual Question Generation [article]

Shih-Han Chan, Tsai-Lun Yang, Yun-Wei Chu, Chi-Yang Hsu, Ting-Hao Huang, Yu-Shian Chiu, Lun-Wei Ku
2022 arXiv   pre-print
The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideal conversation starters.  ...  This paper introduces a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question.  ...  Stage 2: Response-Provoking Question Generation from the Story We utilize a Transformer-based end-to-end model (Lopez et al. 2020) as our question generation model.  ...
arXiv:2205.09327v1 fatcat:ok4qjcrr4bautb62hhiyqy6vwe

The role of image representations in vision to language tasks

2018 Natural Language Engineering  
Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural-network-based  ...  In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations.  ...  End-to-end systems also require a lot of parallel corpora (images with captions) for training, making them hard to adapt to different languages, styles or domains.  ...
doi:10.1017/s1351324918000116 fatcat:3hiuawskdbfmjfnwz2mctluffy
Showing results 1 — 15 out of 38,545 results