Showing results 1–15 of 81

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [article]

Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C.H. Hoi
2020 arXiv   pre-print
To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues.  ...  We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.  ...  In addition, we adapt our models to the video QA benchmark TGIF-QA (Jang et al., 2017). (See Table 1 for a summary of the two datasets.)  ... 
arXiv:2010.10095v1 fatcat:jiipwofx3fcvhh2vomtjrhlxrm
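
The snippet above describes text-conditioned attention applied along both the spatial and the temporal axes of video features. As a rough illustration of that idea, here is a minimal PyTorch sketch; the tensor shapes, the two-pass pooling order, and the additive fusion are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of text-conditioned bi-directional spatio-temporal pooling,
# loosely inspired by the BiST abstract. Shapes and ordering are assumptions.
import torch


def attend(query, keys):
    """Dot-product attention: pool `keys` (B, N, D) with `query` (B, D)."""
    scores = torch.einsum("bd,bnd->bn", query, keys) / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return torch.einsum("bn,bnd->bd", weights, keys)


def bist_like_pooling(video, text_query):
    """video: (B, T, S, D) spatio-temporal features; text_query: (B, D)."""
    B, T, S, D = video.shape
    # Direction 1: spatial -> temporal. Pool spatial positions per frame, then frames.
    q = text_query.unsqueeze(1).expand(B, T, D).reshape(B * T, D)
    per_frame = attend(q, video.reshape(B * T, S, D)).reshape(B, T, D)
    s2t = attend(text_query, per_frame)
    # Direction 2: temporal -> spatial. Pool frames per spatial position, then positions.
    q2 = text_query.unsqueeze(1).expand(B, S, D).reshape(B * S, D)
    per_pos = attend(q2, video.permute(0, 2, 1, 3).reshape(B * S, T, D)).reshape(B, S, D)
    t2s = attend(text_query, per_pos)
    return s2t + t2s  # fuse both reasoning directions


video = torch.randn(2, 8, 49, 256)   # 8 frames, 7x7 spatial grid, 256-d features
text = torch.randn(2, 256)
print(bist_like_pooling(video, text).shape)  # torch.Size([2, 256])
```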

Recent Advances in Video Question Answering: A Review of Datasets and Methods [article]

Devshree Patel, Ratnam Parikh, Yesha Shastri
2021 arXiv   pre-print
VQA helps to retrieve temporal and spatial information from the video scenes and interpret it. In this survey, we review a number of methods and datasets for the task of VQA.  ...  Video Question Answering (VQA) is a recently emerging and challenging task in the field of Computer Vision.  ...  With contextual attention, video features can be extracted from both the spatial and temporal dimensions.  ... 
arXiv:2101.05954v1 fatcat:afio7akl7zf6rm2yn2a2xp2anq

Video Question Answering: Datasets, Algorithms and Challenges [article]

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, Tat-Seng Chua
2022 arXiv   pre-print
We then point out the research trend of studying beyond factoid QA to inference QA, towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration.  ...  Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.  ...  Multi-choice QA is typically set up to study beyond factoid QA to causal and temporal QA, as it dispenses with the need for language generation and evaluation.  ... 
arXiv:2203.01225v1 fatcat:dn4sz5pomnfb7igvmxofangzsa

Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

Khushboo Khurana, Umesh Deshpande
2021 IEEE Access  
It requires collaboration between the research communities of computer vision and natural language processing.  ...  INDEX TERMS Video question answering, video captioning, video description generation, natural language processing, deep learning, computer vision, LSTM, CNN, attention model, memory network.  ...  This can be addressed by vision-language pre-training, an emerging research area that can significantly enhance the performance of video captioning and video-QA techniques.  ... 
doi:10.1109/access.2021.3058248 fatcat:bnjmbffxgreb5jkjuxethaqnde

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [article]

Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala
2021 arXiv   pre-print
We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos.  ...  When developing computer vision models that can reason about compositional spatio-temporal events, we need benchmarks that can analyze progress and uncover shortcomings.  ...  We also thank Michael Bernstein, Li Fei-Fei, Lyne Tchapmi, Edwin Pan, and Mustafa Omer Gul for their valuable insights.  ... 
arXiv:2103.16002v1 fatcat:vkcqfxgssvb5bjwp7zvqetbpti

Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems

Xiliang Zhang, Jin Liu, Yue Li, Zhongdai Wu, Y. Ken Wang
2022 Computers Materials & Continua  
Performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural language in context to generate relevant answers to questions.  ...  Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks.  ...  graph representation with context as the basis for predicting answers.  ... 
doi:10.32604/cmc.2022.027097 fatcat:ikg444w7hncsbezfz4ynwg5quy
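
The abstract names a triple cyclic fusion with self-adaptive balancing, but the snippet does not spell out the wiring. One plausible reading, sketched below, is that each modality attends to the next one in a cycle; the shared attention module, the mean-pooled summaries, and the learned softmax balance are all assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of "cyclic" fusion over three modalities with a learned
# self-adaptive balance. Weight sharing across the three attention steps is a
# simplification, not a claim about the published model.
import torch
import torch.nn as nn


class CyclicFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.balance = nn.Linear(dim, 3)  # self-adaptive weights over modalities

    def forward(self, appearance, motion, text):
        # Cycle: appearance attends to motion, motion to text, text to appearance.
        a, _ = self.attn(appearance, motion, motion)
        m, _ = self.attn(motion, text, text)
        t, _ = self.attn(text, appearance, appearance)
        fused = torch.stack([a.mean(1), m.mean(1), t.mean(1)], dim=1)  # (B, 3, D)
        w = torch.softmax(self.balance(fused.mean(1)), dim=-1)         # (B, 3)
        return torch.einsum("bk,bkd->bd", w, fused)


f = CyclicFusion(128)
out = f(torch.randn(2, 16, 128), torch.randn(2, 16, 128), torch.randn(2, 12, 128))
print(out.shape)  # torch.Size([2, 128])
```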

MERLOT: Multimodal Neural Script Knowledge Models [article]

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
2021 arXiv   pre-print
As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.  ...  By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what  ... 
arXiv:2106.02636v3 fatcat:mrj2t3yuanbdzhsujshtky4enq
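
The frame-level objective described above (matching images to temporally corresponding words) is commonly realized as a symmetric contrastive loss. A minimal sketch, assuming an InfoNCE formulation with a fixed temperature rather than MERLOT's exact loss:

```python
# A minimal contrastive frame-transcript matching loss in the spirit of the
# frame-level objective described in the abstract; the projection and the
# temperature value are assumptions, not MERLOT's actual code.
import torch
import torch.nn.functional as F


def frame_word_contrastive_loss(frame_emb, text_emb, temperature=0.05):
    """frame_emb, text_emb: (N, D) for N temporally aligned frame/segment pairs."""
    f = F.normalize(frame_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = f @ t.T / temperature          # (N, N) similarity matrix
    labels = torch.arange(len(f))           # matching pairs lie on the diagonal
    # Symmetric InfoNCE: frames retrieve their caption segment and vice versa.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


loss = frame_word_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
print(loss.item())
```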

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends [article]

Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
2020 arXiv   pre-print
More recently, this has heightened research interest in the intersection of vision and language, with its numerous applications and fast-paced growth.  ...  In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities.  ...  Another extension to standard GANs modifies discriminator networks to verify generated video sequences against correct captions instead of real/fake, with spatio-temporal convolutions for synthesizing  ... 
arXiv:2010.09522v2 fatcat:l4npstkoqndhzn6hznr7eeys4u
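
The last snippet describes a GAN variant whose discriminator checks generated videos against captions rather than judging real/fake, using spatio-temporal convolutions. A hedged sketch of such a caption-conditioned 3D-convolutional discriminator follows; the layer sizes and late-fusion design are illustrative assumptions, not the surveyed model.

```python
# Sketch of a caption-conditioned video discriminator built on 3D
# (spatio-temporal) convolutions, per the survey snippet above.
import torch
import torch.nn as nn


class CaptionVideoDiscriminator(nn.Module):
    def __init__(self, text_dim=256):
        super().__init__()
        self.video_net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=4, stride=2, padding=1),  # halves (T, H, W)
            nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),                               # global pooling
        )
        self.score = nn.Linear(64 + text_dim, 1)

    def forward(self, video, caption_emb):
        """video: (B, 3, T, H, W); caption_emb: (B, text_dim). Scores the match."""
        v = self.video_net(video).flatten(1)               # (B, 64)
        return self.score(torch.cat([v, caption_emb], 1))  # higher = better match


d = CaptionVideoDiscriminator()
print(d(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 256)).shape)  # torch.Size([2, 1])
```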

The Multi-Modal Video Reasoning and Analyzing Competition [article]

Haoran Peng, He Huang, Li Xu, Tianjiao Li, Jun Liu, Hossein Rahmani, Qiuhong Ke, Zhicheng Guo, Cong Wu, Rongchang Li, Mang Ye, Jiahao Wang (+6 others)
2021 arXiv   pre-print
This competition is composed of four different tracks, namely, video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification, which are  ...  In this paper, we introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop in conjunction with ICCV 2021.  ...  However, videos enable us to reason about more comprehensive spatio-temporal relationships.  ... 
arXiv:2108.08344v1 fatcat:u2grncl6ofgjnc7h5dcf72i4yy

Deep Learning for Omnidirectional Vision: A Survey and New Perspectives [article]

Hao Ai, Zidong Cao, Jinjing Zhu, Haotian Bai, Yucheng Chen, Lin Wang
2022 arXiv   pre-print
This paper presents a systematic and comprehensive review and analysis of the recent progress in DL methods for omnidirectional vision.  ...  imaging, the convolution methods on the ODI, and datasets to highlight the differences and difficulties compared with the 2D planar image data; (ii) A structural and hierarchical taxonomy of the DL methods for  ...  They built a deep ranking model for spatial summarization to select NFOV shots from each frame in the ODV and generated a spatio-temporal highlight video by extending the same model to the temporal domain  ... 
arXiv:2205.10468v2 fatcat:73fks33oafa6zgxliccydvdbeq
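
The deep ranking model mentioned in the snippet scores candidate normal-field-of-view (NFOV) crops so that highlight-worthy ones outrank the rest. A toy sketch of that ranking setup with a margin loss; the scorer architecture and margin value are assumptions, not the cited model.

```python
# Illustrative ranking setup: score candidate NFOV crops from an
# omnidirectional frame and train so highlight crops outrank others.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
ranking_loss = nn.MarginRankingLoss(margin=1.0)

positive = torch.randn(8, 512)   # features of crops labeled as highlights
negative = torch.randn(8, 512)   # features of non-highlight crops
target = torch.ones(8, 1)        # "positive should score higher than negative"
loss = ranking_loss(scorer(positive), scorer(negative), target)
print(loss.item())
```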

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
This has created significant interest in the integration of vision and language.  ...  This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing.  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i

A Review on Methods and Applications in Multimodal Deep Learning [article]

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
2022 arXiv   pre-print
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.  ...  Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning.  ... 
arXiv:2202.09195v1 fatcat:wwxrmrwmerfabbenleylwmmj7y

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [article]

Aditya Mogadala and Marimuthu Kalimuthu and Dietrich Klakow
2020 arXiv   pre-print
This has created significant interest in the integration of vision and language. The tasks are designed such that they perfectly embrace the ideas of deep learning.  ...  This success can be partly attributed to the advancements made in the sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP).  ...  We extend our special thanks to Matthew Kuhn and Stephanie Lund for painstakingly proofing the whole manuscript.  ... 
arXiv:1907.09358v2 fatcat:4fyf6kscy5dfbewll3zs7yzsuq

2021 Index IEEE Transactions on Image Processing Vol. 30

2021 IEEE Transactions on Image Processing  
The Author Index contains the primary entry for each item, listed under the first author's name.  ...  TIP 2021 4409-4422. A Global-Local Self-Adaptive Network for Drone-View Object Detection. Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.  ...  TIP 2021 2549-2561. AGRNet: Adaptive Graph Representation Learning and Reasoning for Face Parsing.  ... 
doi:10.1109/tip.2022.3142569 fatcat:z26yhwuecbgrnb2czhwjlf73qu

CLEVRER: CoLlision Events for Video REpresentation and Reasoning [article]

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum
2020 arXiv   pre-print
To this end, we introduce the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning  ...  The ability to reason about temporal and causal events from videos lies at the core of human intelligence.  ...  Second, symbolic representation provides a powerful common ground for vision, language, dynamics and causality.  ... 
arXiv:1910.01442v2 fatcat:5t3rceq24rffjkdi4nmongam3q
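
The last snippet's point, that symbolic representations give vision, language, dynamics, and causality a common ground, can be made concrete with a toy example: once collision events are parsed into structured records, a causal-temporal question reduces to a small program. The schema below is invented for illustration, not CLEVRER's actual format.

```python
# Toy illustration of the symbolic-representation idea behind CLEVRER-style
# reasoning: answer a temporal question by executing a query over event records.
from dataclasses import dataclass


@dataclass
class Collision:
    frame: int
    objects: tuple  # pair of object descriptors, e.g. ("red cube", "blue sphere")


events = [
    Collision(frame=42, objects=("red cube", "blue sphere")),
    Collision(frame=97, objects=("blue sphere", "green cylinder")),
]

# "What does the blue sphere collide with after hitting the red cube?"
first = min(e.frame for e in events if "red cube" in e.objects)
later = [e for e in events if e.frame > first and "blue sphere" in e.objects]
answer = [o for e in later for o in e.objects if o != "blue sphere"]
print(answer)  # ['green cylinder']
```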