
Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

Khushboo Khurana, Umesh Deshpande
2021 IEEE Access  
The captions generated by video captioning can be further utilized for video retrieval, summarization, question-answering, etc.  ...  INDEX TERMS Video question answering, video captioning, video description generation, natural language processing, deep learning, computer vision, LSTM, CNN, attention model, memory network.  ...  Augmented attention mechanism is employed in [3] that models the temporal dynamics and semantic attributes of the video.  ... 
doi:10.1109/access.2021.3058248 fatcat:bnjmbffxgreb5jkjuxethaqnde

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
More recently, this has enhanced research interests in the intersection of the Vision and Language arena with its numerous applications and fast-paced growth.  ...  We also address task-specific trends, along with their evaluation strategies and upcoming challenges.  ...  Image Paragraph Captioning, as pursued by [12], generates detailed paragraphs describing the images at a finer level.  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u

A Roadmap for Big Model [article]

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han (+88 others)
2022 arXiv   pre-print
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.  ...  In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies  ...  For the model inputs, knowledge augmentation aims to enhance the inputs with abundant related knowledge [162, 412].  ... 
arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications [article]

Chao Zhang, Zichao Yang, Xiaodong He, Li Deng
2020 arXiv   pre-print
Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering.  ...  Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task.  ...  ACKNOWLEDGEMENT The authors are grateful to the editor and anonymous reviewers for their valuable suggestions that helped to make this paper better.  ... 
arXiv:1911.03977v3 fatcat:ojazuw3qzvfqrdweul6qdpxuo4

Adversarial Text-to-Image Synthesis: A Review [article]

Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel
2021 arXiv   pre-print
It is a flexible and intuitive way for conditional image generation with significant progress in the last years regarding visual realism, diversity, and semantic alignment.  ...  With the advent of generative adversarial networks, synthesizing images from textual descriptions has recently become an active research area.  ...  ., attention mechanisms, cycle consistency, dynamic memory, Siamese architectures).  ... 
arXiv:2101.09983v1 fatcat:as5i4mk4kndrzpcshlewkbgge4

Recent Advances in Neural Text Generation: A Task-Agnostic Survey [article]

Chen Tang, Frank Guerin, Yucheng Li, Chenghua Lin
2022 arXiv   pre-print
Finally we discuss the future directions for the development of neural text generation including neural pipelines and exploiting background knowledge.  ...  These advances have been achieved by numerous developments, which we group under the following four headings: data construction, neural frameworks, training and inference strategies, and evaluation metrics  ...  Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.  ... 
arXiv:2203.03047v1 fatcat:iupgvcw2hbge5ioy6quiotnra4

Cross Modal Retrieval with Querybank Normalisation [article]

Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie
2022 arXiv   pre-print
Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval.  ...  We showcase QB-Norm across a range of cross modal retrieval models and benchmarks where it consistently enhances strong baselines beyond the state of the art.  ...  The authors thank Bruno Korbar for his assistance. S.A. would like to acknowledge Z. Novak and S. Carlson in supporting his contribution.  ... 
arXiv:2112.12777v3 fatcat:iu5tnhg62ncebbtykfxfrq22aq

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video.  ...  Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks.  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i

Video Description: Datasets & Evaluation Metrics

Muhammad Rafiq, Ghazala Rafiq, Gyu Sang Choi
2021 IEEE Access  
Finally, we concluded with the need for essential enhancements and encouraging research directions on the topic.  ...  INDEX TERMS Datasets, evaluation metrics, sequence to sequence, video description, video captioning, vision to language, vision to text.  ...  BERT [95]; language modeling based on the transformer got attention for both performance enhancement due to parallelization (transformer mechanism employment) and pre-training approach.  ... 
doi:10.1109/access.2021.3108565 fatcat:tlqiaopvrbefpjeo4cvcbqdxoq

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [article]

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2020 arXiv   pre-print
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video.  ...  The largest of the growths in these fields has been made possible with deep learning, a sub-area of machine learning, which uses the principles of artificial neural networks.  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
arXiv:1907.09358v2 fatcat:4fyf6kscy5dfbewll3zs7yzsuq

From Show to Tell: A Survey on Deep Learning-based Image Captioning [article]

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara
2021 arXiv   pre-print
For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences.  ...  This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics.  ...  We also want to thank the authors who provided us with the captions and model weights for some of the surveyed approaches.  ... 
arXiv:2107.06912v3 fatcat:ezhutcovnvh4reiweedfmxjlve

Deep Learning Based Text Classification: A Comprehensive Review [article]

Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, Jianfeng Gao
2021 arXiv   pre-print
We also provide a summary of more than 40 popular datasets widely used for text classification.  ...  In this paper, we provide a comprehensive review of more than 150 deep learning based models for text classification developed in recent years, and discuss their technical contributions, similarities,  ...  ACKNOWLEDGMENTS The authors would like to thank Richard Socher, Kristina Toutanova, and Brooke Cowan for reviewing this work, and providing very insightful comments.  ... 
arXiv:2004.03705v3 fatcat:al5hstylsbhfpldvokuvlpomam

Neural Language Generation: Formulation, Methods, and Evaluation [article]

Cristina Garbacea, Qiaozhu Mei
2020 arXiv   pre-print
Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts.  ...  Recent advances in neural network-based generative modeling have reignited the hopes in having computer systems capable of seamlessly conversing with humans and able to understand natural language.  ...  Image / Video Captioning Image captioning is designed to generate captions in the form of textual descriptions for an image.  ... 
arXiv:2007.15780v1 fatcat:oixtreazxvbgvclicpxiqzbxrm

Explainable Deep Learning Methods in Medical Diagnosis: A Survey [article]

Cristiano Patrício, João C. Neves, Luís F. Teixeira
2022 arXiv   pre-print
Moreover, this work reviews the existing medical imaging datasets and the existing metrics for evaluating the quality of the explanations.  ...  Finally, the major challenges in applying XAI to medical imaging are also discussed.  ...  In adversarial training, examples of the training set are augmented with adversarial perturbations at each training loop.  ... 
arXiv:2205.04766v1 fatcat:sqgaaat6qrag5gtoh7mo7anapy

Deep Image Synthesis from Intuitive User Input: A Review and Perspectives [article]

Yuan Xue, Yuan-Chen Guo, Han Zhang, Tao Xu, Song-Hai Zhang, Xiaolei Huang
2021 arXiv   pre-print
While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial  ...  This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics.  ...  [152] introduces a gating mechanism where a writing gate writes selected important textual features from the given sentence into a dynamic memory, and a response gate adaptively reads from the memory  ... 
arXiv:2107.04240v2 fatcat:ticrsi27nzhozmw7dp7wwja2ni