An Analysis of the Use of Feed-Forward Sub-Modules for Transformer-Based Image Captioning Tasks
In the quest to make deep learning systems more capable, a number of increasingly complex, computationally expensive, and memory-intensive algorithms have been proposed. This shift overlooks the ability of many simpler systems, or of modules within them, to adequately address current and future problems. It has also left some deep learning research inaccessible to researchers who do not possess top-of-the-line hardware. The use of simple feed-forward networks has not been explicitly explored in the current transformer-based vision-language field. In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the computational cost that accompanies the self-attention mechanism and limits its application to long-sequence tasks. We demonstrate that a decoder does not require masking for conditional short-sequence generation, where the task depends not only on the previously generated sequence but also on another input such as image features. We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and obtain competitive results on the MSCOCO dataset. Our best feed-forward model achieves, on average, over 90% of the scores of the current state-of-the-art pre-trained Oscar model on the conventional image captioning metrics. We also demonstrate that the proposed models train in less time and use less memory at larger batch sizes and longer sequence lengths.
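The scaling argument behind the memory and training-time claim can be illustrated with a rough operation count. The following is a minimal back-of-the-envelope sketch, not the paper's implementation: it compares the token-mixing cost of single-head self-attention with that of a single per-position linear transform. The function names, the hidden size `d_model = 512`, and the sequence lengths are illustrative assumptions, and projections, softmax, and biases are deliberately ignored.

```python
# Illustrative operation counts only; names and sizes are assumptions,
# not values taken from the paper.

def attention_mixing_cost(seq_len: int, d_model: int) -> int:
    """Multiply-accumulates for the token-mixing part of single-head self-attention.

    Counts Q @ K^T (seq_len x seq_len x d_model) and the weighted sum over V
    (same size). The Q/K/V/output projections, softmax, and biases are ignored.
    """
    return 2 * seq_len * seq_len * d_model


def linear_transform_cost(seq_len: int, d_model: int) -> int:
    """Multiply-accumulates for one d_model x d_model linear layer applied per position."""
    return seq_len * d_model * d_model


def attention_map_entries(seq_len: int) -> int:
    """Entries in the seq_len x seq_len attention map stored per head."""
    return seq_len * seq_len


if __name__ == "__main__":
    d_model = 512  # hypothetical hidden size
    for seq_len in (32, 128, 512, 2048):
        print(
            seq_len,
            attention_mixing_cost(seq_len, d_model),
            linear_transform_cost(seq_len, d_model),
            attention_map_entries(seq_len),
        )
```

Under these assumptions, doubling the sequence length doubles the cost of the linear transform but quadruples both the attention mixing cost and the size of the attention map, so the gap widens as sequences grow. At short sequence lengths the quadratic term is small, which is why the comparison matters most for long sequences and large batches.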