Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey
2021
IEEE Access
The captions generated by video captioning can be further utilized for video retrieval, summarization, question-answering, etc. ...
INDEX TERMS Video question answering, video captioning, video description generation, natural language processing, deep learning, computer vision, LSTM, CNN, attention model, memory network. ...
Augmented attention mechanism is employed in [3] that models the temporal dynamics and semantic attributes of the video. ...
doi:10.1109/access.2021.3058248
fatcat:bnjmbffxgreb5jkjuxethaqnde
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
[article]
2020
arXiv
pre-print
More recently, this has enhanced research interests in the intersection of the Vision and Language arena with its numerous applications and fast-paced growth. ...
We also address task-specific trends, along with their evaluation strategies and upcoming challenges. ...
Image Paragraph Captioning, as pursued by [12], generates detailed paragraphs describing the images at a finer level. ...
arXiv:2010.09522v2
fatcat:l4npstkoqndhzn6hznr7eeys4u
A Roadmap for Big Model
[article]
2022
arXiv
pre-print
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. ...
In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies ...
For the model inputs, knowledge augmentation aims to enhance the inputs with abundant related knowledge [162, 412]. ...
arXiv:2203.14101v4
fatcat:rdikzudoezak5b36cf6hhne5u4
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
[article]
2020
arXiv
pre-print
Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. ...
Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. ...
ACKNOWLEDGEMENT The authors are grateful to the editor and anonymous reviewers for their valuable suggestions that helped to make this paper better. ...
arXiv:1911.03977v3
fatcat:ojazuw3qzvfqrdweul6qdpxuo4
Adversarial Text-to-Image Synthesis: A Review
[article]
2021
arXiv
pre-print
It is a flexible and intuitive way for conditional image generation with significant progress in the last years regarding visual realism, diversity, and semantic alignment. ...
With the advent of generative adversarial networks, synthesizing images from textual descriptions has recently become an active research area. ...
... attention mechanisms, cycle consistency, dynamic memory, Siamese architectures). ...
arXiv:2101.09983v1
fatcat:as5i4mk4kndrzpcshlewkbgge4
Recent Advances in Neural Text Generation: A Task-Agnostic Survey
[article]
2022
arXiv
pre-print
Finally we discuss the future directions for the development of neural text generation including neural pipelines and exploiting back-ground knowledge. ...
These advances have been achieved by numerous developments, which we group under the following four headings: data construction, neural frameworks, training and inference strategies, and evaluation metrics ...
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. ...
arXiv:2203.03047v1
fatcat:iupgvcw2hbge5ioy6quiotnra4
Cross Modal Retrieval with Querybank Normalisation
[article]
2022
arXiv
pre-print
Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. ...
We showcase QB-Norm across a range of cross modal retrieval models and benchmarks where it consistently enhances strong baselines beyond the state of the art. ...
The authors thank Bruno Korbar for his assistance. S.A. would like to acknowledge Z. Novak and S. Carlson in supporting his contribution. ...
arXiv:2112.12777v3
fatcat:iu5tnhg62ncebbtykfxfrq22aq
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
2021
The Journal of Artificial Intelligence Research
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. ...
Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. ...
We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript. ...
doi:10.1613/jair.1.11688
fatcat:kvfdrg3bwrh35fns4z67adqp6i
Video Description: Datasets & Evaluation Metrics
2021
IEEE Access
Finally, we concluded with the need for essential enhancements and encouraging research directions on the topic. ...
INDEX TERMS Datasets, evaluation metrics, sequence to sequence, video description, video captioning, vision to language, vision to text. ...
BERT [95]; language modeling based on the transformer got attention for both performance enhancement due to parallelization (transformer mechanism employment) and pre-training approach. ...
doi:10.1109/access.2021.3108565
fatcat:tlqiaopvrbefpjeo4cvcbqdxoq
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
[article]
2020
arXiv
pre-print
Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. ...
The largest of the growths in these fields has been made possible with deep learning, a sub-area of machine learning, which uses the principles of artificial neural networks. ...
We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript. ...
arXiv:1907.09358v2
fatcat:4fyf6kscy5dfbewll3zs7yzsuq
From Show to Tell: A Survey on Deep Learning-based Image Captioning
[article]
2021
arXiv
pre-print
For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. ...
This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. ...
We also want to thank the authors who provided us with the captions and model weights for some of the surveyed approaches. ...
arXiv:2107.06912v3
fatcat:ezhutcovnvh4reiweedfmxjlve
Deep Learning Based Text Classification: A Comprehensive Review
[article]
2021
arXiv
pre-print
We also provide a summary of more than 40 popular datasets widely used for text classification. ...
In this paper, we provide a comprehensive review of more than 150 deep learning based models for text classification developed in recent years, and discuss their technical contributions, similarities, ...
ACKNOWLEDGMENTS The authors would like to thank Richard Socher, Kristina Toutanova, and Brooke Cowan for reviewing this work, and providing very insightful comments. ...
arXiv:2004.03705v3
fatcat:al5hstylsbhfpldvokuvlpomam
Neural Language Generation: Formulation, Methods, and Evaluation
[article]
2020
arXiv
pre-print
Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts. ...
Recent advances in neural network-based generative modeling have reignited the hopes in having computer systems capable of seamlessly conversing with humans and able to understand natural language. ...
Image / Video Captioning Image captioning is designed to generate captions in the form of textual descriptions for an image. ...
arXiv:2007.15780v1
fatcat:oixtreazxvbgvclicpxiqzbxrm
Explainable Deep Learning Methods in Medical Diagnosis: A Survey
[article]
2022
arXiv
pre-print
Moreover, this work reviews the existing medical imaging datasets and the existing metrics for evaluating the quality of the explanations. ...
Finally, the major challenges in applying XAI to medical imaging are also discussed. ...
In adversarial training, examples of the training set are augmented with adversarial perturbations at each training loop. ...
arXiv:2205.04766v1
fatcat:sqgaaat6qrag5gtoh7mo7anapy
Deep Image Synthesis from Intuitive User Input: A Review and Perspectives
[article]
2021
arXiv
pre-print
While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial ...
This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. ...
[152] introduces a gating mechanism where a writing gate writes selected important textual features from the given sentence into a dynamic memory, and a response gate adaptively reads from the memory ...
arXiv:2107.04240v2
fatcat:ticrsi27nzhozmw7dp7wwja2ni