Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
[article]
2021
arXiv
pre-print
for every frame, and an inter-frame aggregation module capturing temporal cues. ...
To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations ...
Spatio-temporal graphs using knowledge distillation are explored for video captioning in (Pan et al. 2020). ...
arXiv:2007.03848v2
fatcat:bqvz6lk3szfv7frgcq4fvfz2ji
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review
[article]
2022
arXiv
pre-print
cross-modal semantic interaction. ...
We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context, and knowledge-enhanced ...
ACKNOWLEDGMENTS This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205), and the National Natural Science Foundation of China (No. 62032020, 61960206008 ...
arXiv:2207.00782v1
fatcat:a57laj75xfa43gg4hjvxdh4c4i
Video Question Answering: Datasets, Algorithms and Challenges
[article]
2022
arXiv
pre-print
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos. ...
We then point out the research trend of studying beyond factoid QA to inference QA towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration. ...
., 2017] try to directly apply element-wise multiplication to fuse video and question information for answer prediction, and additionally demonstrate the advantage of a simple temporal attention. ...
arXiv:2203.01225v1
fatcat:dn4sz5pomnfb7igvmxofangzsa
Reasoning with Heterogeneous Graph Alignment for Video Question Answering
2020
Proceedings of the AAAI Conference on Artificial Intelligence
The dominant video question answering methods are based on fine-grained representation or model-specific attention mechanism. ...
We propose a deep heterogeneous graph alignment network over the video shots and question words. ...
and Video Question Answering (VideoQA), where VideoQA extends VQA to video domain and raises higher demands on spatio-temporal understanding and reasoning. ...
doi:10.1609/aaai.v34i07.6767
fatcat:kbte5ijo4fh53ngz7bby2uefa4
Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
[article]
2021
arXiv
pre-print
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. ...
for the correct answer. ...
This is challenging because of the complexity of the video spatio-temporal structure and the cross-domain compatibility gap between linguistic query and visual objects. ...
arXiv:2106.13432v2
fatcat:b4upxdws3ra6dftan7cy7igacm
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
2022
Proceedings of the AAAI Conference on Artificial Intelligence
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. ...
Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions ...
Spatio-temporal scene graphs are combined with a knowledge distillation objective for video captioning in (Pan et al. 2020). ...
doi:10.1609/aaai.v36i1.19922
fatcat:vvdnagxrc5dv3ghoiibdclypsm
(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
[article]
2022
arXiv
pre-print
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. ...
Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions ...
Spatio-temporal scene graphs are combined with a knowledge distillation objective for video captioning in (Pan et al. 2020). ...
arXiv:2202.09277v2
fatcat:avwo43pepvfbneqauppa3dchmi
Object-Centric Representation Learning for Video Question Answering
[article]
2021
arXiv
pre-print
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. ...
To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition ...
In recent years video question answering (Video QA) has become a major playground for spatio-temporal representation and visual-linguistic integration. ...
arXiv:2104.05166v3
fatcat:3mwbzgp7l5hsve5h45lgahaedm
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
[article]
2022
arXiv
pre-print
Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust causality-aware visual-linguistic question answering ...
To discover the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a novel Spatial-Temporal Transformer (STT) that builds the multi-modal co-occurrence ...
CASSG [104] : A cross-attentional spatio-temporal semantic graph network that explores fine-grained interactions between different modalities. ...
arXiv:2207.12647v2
fatcat:rkwil7hyx5dytfcsiwunapg5qq
Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation
2021
Sensors
To effectively utilize the spatio-temporal context, low-level visual context reasoning is performed using a spatio-temporal context graph and a graph neural network as well as high-level semantic context ...
This study proposes a novel deep neural network model called VSGG-Net for video scene graph generation. ...
In various fields, such as visual question answering, semantic image retrieval, and image generation, scene graphs have proved to be a useful tool for deeper and better visual scene understanding [1] ...
doi:10.3390/s21093164
pmid:34063299
pmcid:PMC8124611
fatcat:vtl2wizi5jfjdbqgxngtelwhwy
stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition
2019
IEEE transactions on circuits and systems for video technology (Print)
In the paper, we present a novel attentive semantic recurrent neural network (RNN), namely stagNet, for understanding group activities and individual actions in videos, by combining the spatio-temporal ...
attention mechanism and semantic graph modeling. ...
Besides, the structural semantic output is beneficial for lots of other tasks like dense video captioning [84], sports video captioning [78] and visual question answering [85] as it provides mid-level ...
doi:10.1109/tcsvt.2019.2894161
fatcat:wcjvyo3wgfbsfcew4x62sw6cfi
HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding
[article]
2022
arXiv
pre-print
Such a design can build the fine-grained cross-modal correspondence for more accurate subsequent VOG. (2) Hierarchical Spatio-temporal Modeling Improvement. ...
This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby ...
in the first stage, and then, selects the target one from the candidates based on a spatio-temporal graph network in the second stage. ...
arXiv:2208.05818v1
fatcat:cq7mh2dl5bdbfhwsj74buty2k4
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
[article]
2021
arXiv
pre-print
This paper presents a novel method, termed Bridge to Answer, to infer correct answers for questions about a given video by leveraging adequate graph interactions of heterogeneous cross-modal graphs. ...
As a result, our method can learn the question-conditioned visual representations attributed to appearance and motion that show powerful capability for video question answering. ...
We employ the cross-entropy loss for open-ended questions. ...
arXiv:2104.14085v1
fatcat:qguxecsdajbnpo5gds2i2eraii
Visual Relationship Forecasting in Videos
[article]
2021
arXiv
pre-print
In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by spatio-temporal Graph Convolution Network and Transformer ...
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video. ...
and has been proposed for long-term temporal modeling. • GRU [45] is a simple variant of LSTM. • ST-GCN: Spatio-Temporal Graph Convolutional Network (ST-GCN) [46] is a variant of Graph Convolutional Network ...
arXiv:2107.01181v1
fatcat:ep2hjklh5zdxpaahmmakn5m3fy
Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey
2021
IEEE Access
INDEX TERMS Video question answering, video captioning, video description generation, natural language processing, deep learning, computer vision, LSTM, CNN, attention model, memory network. ...
The captions generated by video captioning can be further utilized for video retrieval, summarization, question-answering, etc. ...
The Sequential Video Attention model accumulates the video attention for each question, the Temporal Question Attention model accumulates the question-attentions for each video frame and a unified attention ...
doi:10.1109/access.2021.3058248
fatcat:bnjmbffxgreb5jkjuxethaqnde