1,303 Hits in 8.9 sec

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [article]

Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
2021 arXiv   pre-print
for every frame, and an inter-frame aggregation module capturing temporal cues.  ...  To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations  ...  Spatio-temporal graphs with knowledge distillation are explored for video captioning in (Pan et al. 2020).  ... 
arXiv:2007.03848v2 fatcat:bqvz6lk3szfv7frgcq4fvfz2ji

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [article]

Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu
2022 arXiv   pre-print
cross-modal semantic interaction.  ...  We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context, and knowledge-enhanced  ...  ACKNOWLEDGMENTS This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205), and the National Natural Science Foundation of China (No. 62032020, 61960206008  ... 
arXiv:2207.00782v1 fatcat:a57laj75xfa43gg4hjvxdh4c4i

Video Question Answering: Datasets, Algorithms and Challenges [article]

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, Tat-Seng Chua
2022 arXiv   pre-print
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.  ...  We then point out the research trend of studying beyond factoid QA to inference QA towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration.  ...  ., 2017] try to directly apply element-wise multiplication to fuse video and question information for answer prediction, and additionally demonstrate the advantage of a simple temporal attention.  ... 
arXiv:2203.01225v1 fatcat:dn4sz5pomnfb7igvmxofangzsa
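The snippet above mentions fusing video and question features by element-wise multiplication together with a simple temporal attention. A minimal NumPy sketch of that idea follows; the function name and array shapes are illustrative assumptions, not code from any of the surveyed papers:

```python
import numpy as np

def fuse_and_attend(video_feats, question_feat):
    """Element-wise fusion of per-frame video features with a question
    embedding, followed by simple temporal attention pooling.

    video_feats: (T, D) array, one feature vector per frame
    question_feat: (D,) question embedding
    """
    fused = video_feats * question_feat              # element-wise multiplication per frame
    scores = fused.sum(axis=1)                       # scalar relevance score per frame
    scores -= scores.max()                           # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # temporal attention weights, sum to 1
    return weights @ fused                           # attention-pooled feature, shape (D,)
```

When all frames are identical, the attention weights are uniform and the output reduces to the fused feature of a single frame.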

Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang, Yahong Han
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
The dominant video question answering methods are based on fine-grained representation or model-specific attention mechanism.  ...  We propose a deep heterogeneous graph alignment network over the video shots and question words.  ...  and Video Question Answering (VideoQA), where VideoQA extends VQA to video domain and raises higher demands on spatio-temporal understanding and reasoning.  ... 
doi:10.1609/aaai.v34i07.6767 fatcat:kbte5ijo4fh53ngz7bby2uefa4

Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering [article]

Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
2021 arXiv   pre-print
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.  ...  for the correct answer.  ...  This is challenging due to the complexity of the video's spatio-temporal structure and the cross-domain compatibility gap between the linguistic query and visual objects.  ... 
arXiv:2106.13432v2 fatcat:b4upxdws3ra6dftan7cy7igacm

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux
2022 Proceedings of the AAAI Conference on Artificial Intelligence  
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.  ...  Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions  ...  Spatio-temporal scene graphs are combined with a knowledge distillation objective for video captioning in (Pan et al. 2020) .  ... 
doi:10.1609/aaai.v36i1.19922 fatcat:vvdnagxrc5dv3ghoiibdclypsm

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering [article]

Anoop Cherian and Chiori Hori and Tim K. Marks and Jonathan Le Roux
2022 arXiv   pre-print
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.  ...  Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions  ...  Spatio-temporal scene graphs are combined with a knowledge distillation objective for video captioning in (Pan et al. 2020) .  ... 
arXiv:2202.09277v2 fatcat:avwo43pepvfbneqauppa3dchmi

Object-Centric Representation Learning for Video Question Answering [article]

Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
2021 arXiv   pre-print
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.  ...  To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition  ...  In recent years video question answering (Video QA) has become a major playground for spatio-temporal representation and visual-linguistic integration.  ... 
arXiv:2104.05166v3 fatcat:3mwbzgp7l5hsve5h45lgahaedm

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [article]

Yang Liu, Guanbin Li, Liang Lin
2022 arXiv   pre-print
Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust causality-aware visual-linguistic question answering  ...  To discover the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a novel Spatial-Temporal Transformer (STT) that builds the multi-modal co-occurrence  ...  CASSG [104]: A cross-attentional spatio-temporal semantic graph network that explores fine-grained interactions between different modalities.  ... 
arXiv:2207.12647v2 fatcat:rkwil7hyx5dytfcsiwunapg5qq
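The Spatial-Temporal Transformer described in the snippet above models co-occurrence between linguistic and visual tokens; its basic building block is scaled dot-product cross-attention. A sketch with illustrative shapes, not the authors' actual implementation:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: e.g. question tokens (queries)
    attending over visual tokens (keys/values).

    queries: (Nq, D), keys: (Nk, D), values: (Nk, Dv)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)         # (Nq, Nk) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values                         # (Nq, Dv) attended features
```

With a single key, the softmax is trivially 1 and every query receives that key's value unchanged.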

Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation

Gayoung Jung, Jonghun Lee, Incheol Kim
2021 Sensors  
To effectively utilize the spatio-temporal context, low-level visual context reasoning is performed using a spatio-temporal context graph and a graph neural network as well as high-level semantic context  ...  This study proposes a novel deep neural network model called VSGG-Net for video scene graph generation.  ...  In various fields, such as visual question answering, semantic image retrieval, and image generation, scene graphs have proved to be a useful tool for deeper and better visual scene understanding [1]  ... 
doi:10.3390/s21093164 pmid:34063299 pmcid:PMC8124611 fatcat:vtl2wizi5jfjdbqgxngtelwhwy

stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition

Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li, Jiebo Luo, Luc Van Gool
2019 IEEE transactions on circuits and systems for video technology (Print)  
In this paper, we present a novel attentive semantic recurrent neural network (RNN), namely stagNet, for understanding group activities and individual actions in videos, by combining the spatio-temporal attention mechanism and semantic graph modeling.  ...  Besides, the structural semantic output is beneficial for many other tasks, such as dense video captioning [84], sports video captioning [78], and visual question answering [85], as it provides mid-level  ... 
doi:10.1109/tcsvt.2019.2894161 fatcat:wcjvyo3wgfbsfcew4x62sw6cfi

HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding [article]

Mengze Li and Tianbao Wang and Haoyu Zhang and Shengyu Zhang and Zhou Zhao and Wenqiao Zhang and Jiaxu Miao and Shiliang Pu and Fei Wu
2022 arXiv   pre-print
Such a design can build the fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement.  ...  This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby  ...  in the first stage, and then, selects the target one from the candidates based on a spatio-temporal graph network in the second stage.  ... 
arXiv:2208.05818v1 fatcat:cq7mh2dl5bdbfhwsj74buty2k4

Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering [article]

Jungin Park, Jiyoung Lee, Kwanghoon Sohn
2021 arXiv   pre-print
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video by leveraging adequate graph interactions of heterogeneous cross-modal graphs.  ...  As a result, our method can learn the question-conditioned visual representations attributed to appearance and motion that show powerful capability for video question answering.  ...  We employ the cross-entropy loss for open-ended questions.  ... 
arXiv:2104.14085v1 fatcat:qguxecsdajbnpo5gds2i2eraii
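The snippet above notes that open-ended questions are trained with a cross-entropy loss, i.e. answering is cast as classification over an answer vocabulary. A minimal, numerically stable version, for illustration only:

```python
import numpy as np

def answer_cross_entropy(logits, target_idx):
    """Cross-entropy loss over an answer vocabulary: open-ended VideoQA
    is commonly treated as classification over frequent answers."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax over the vocabulary
    return -log_probs[target_idx]            # negative log-likelihood of the target answer
```

Uniform logits over V candidate answers give a loss of log(V), the usual sanity check before training starts.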

Visual Relationship Forecasting in Videos [article]

Li Mi, Yangjun Ou, Zhenzhong Chen
2021 arXiv   pre-print
In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by spatio-temporal Graph Convolution Network and Transformer  ...  To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video.  ...  and has been proposed for long-term temporal modeling. • GRU [45] is a simple variant of the LSTM. • Spatio-Temporal Graph Convolutional Network (ST-GCN) [46] is a variant of the Graph Convolutional Network  ... 
arXiv:2107.01181v1 fatcat:ep2hjklh5zdxpaahmmakn5m3fy
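ST-GCN, mentioned among the baselines above, stacks graph convolutions over a spatio-temporal graph. Its spatial step is the standard normalized graph convolution; a sketch assuming dense-matrix inputs (not the paper's implementation):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step, H' = ReLU(D^-1/2 (A + I) D^-1/2 H W),
    the spatial building block of ST-GCN-style models.

    adj: (N, N) adjacency matrix, feats: (N, D_in), weight: (D_in, D_out)
    """
    a_hat = adj + np.eye(adj.shape[0])                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))            # D^-1/2 diagonal entries
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(a_norm @ feats @ weight, 0.0)          # aggregate, project, ReLU
```

With no edges and an identity weight, the layer degenerates to a plain ReLU over the node features, which makes it easy to unit-test.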

Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

Khushboo Khurana, Umesh Deshpande
2021 IEEE Access  
INDEX TERMS Video question answering, video captioning, video description generation, natural language processing, deep learning, computer vision, LSTM, CNN, attention model, memory network.  ...  The captions generated by video captioning can be further utilized for video retrieval, summarization, question answering, etc.  ...  The Sequential Video Attention model accumulates video attention for each question, while the Temporal Question Attention model accumulates question attention for each video frame, and a unified attention  ... 
doi:10.1109/access.2021.3058248 fatcat:bnjmbffxgreb5jkjuxethaqnde
Showing results 1 — 15 out of 1,303 results