Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs
[article]
2019
arXiv
pre-print
Finally, we benchmark existing scene graph models on the new task of spatio-temporal scene graph prediction. ...
Inspired by evidence that the prototypical unit of an event is an action-object interaction, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. ...
This article solely reflects the opinions and conclusions of its authors and not Panasonic or any entity associated with Panasonic. ...
arXiv:1912.06992v1
fatcat:6iap73ap2zbi7bxdkrtvkn66wi
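The decomposition this abstract describes, an action rendered as a sequence of per-frame scene graphs, can be made concrete with a small data-structure sketch. The Python below is illustrative only; the class and field names are assumptions, not the dataset's actual schema.

from dataclasses import dataclass, field

@dataclass
class Relationship:
    # One labeled edge in a frame's scene graph: subject --predicate--> object.
    subject: str    # e.g. "person"
    predicate: str  # e.g. "holding"
    obj: str        # e.g. "cup"

@dataclass
class FrameGraph:
    # Scene graph annotated on a single sampled frame.
    timestamp: float
    relationships: list = field(default_factory=list)

@dataclass
class ActionInstance:
    # An action decomposed into spatio-temporal scene graphs:
    # one FrameGraph per sampled frame across the action's interval.
    label: str
    frames: list = field(default_factory=list)

# Toy example: "drinking from a cup" as evolving person-object edges.
action = ActionInstance(
    label="drinking from a cup",
    frames=[
        FrameGraph(1.2, [Relationship("person", "holding", "cup")]),
        FrameGraph(2.0, [Relationship("person", "drinking_from", "cup")]),
    ],
)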
Revisiting spatio-temporal layouts for compositional action recognition
[article]
2021
arXiv
pre-print
On the Something-Else and Action Genome datasets, we demonstrate (i) how to extend multi-head attention for spatio-temporal layout-based action recognition, (ii) how to improve the performance of appearance-based ...
The main focus of this paper is compositional/few-shot action recognition, where we advocate the usage of multi-head attention (proven to be effective for spatial reasoning) over spatio-temporal layouts ...
a subset of scene graph generation. ...
arXiv:2111.01936v1
fatcat:q3l3m7nj7jadflecmeulmwefcy
Compositional Video Synthesis with Action Graphs
[article]
2021
arXiv
pre-print
Videos of actions are complex signals containing rich compositional structure in space and time. ...
To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task. ...
This work was completed in partial fulfillment for the Ph.D. degree of Amir Bar. ...
arXiv:2006.15327v4
fatcat:zcwuyip2djbozlv2dvwpiap5di
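The Action Graph this entry introduces represents a video's actions as graph nodes scheduled over time. A minimal, hypothetical encoding follows; the node attributes and ordering edges are assumptions for illustration, not the paper's exact definition.

# Nodes: action instances with objects and scheduled intervals (seconds).
action_nodes = {
    "a1": {"action": "pick up", "object": "cup", "start": 0.0, "end": 1.5},
    "a2": {"action": "put down", "object": "cup", "start": 1.5, "end": 3.0},
}
# Directed edges impose temporal ordering: a1 must finish before a2 starts.
ordering_edges = [("a1", "a2")]

# A synthesis model would consume this graph over time; here we only
# verify that the schedule is consistent with the ordering constraints.
for u, v in ordering_edges:
    assert action_nodes[u]["end"] <= action_nodes[v]["start"]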
Generating Videos of Zero-Shot Compositions of Actions and Objects
[article]
2020
arXiv
pre-print
In particular, we introduce the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for action-object compositions that are unseen during training ...
In this paper we develop methods for generating such videos -- making progress toward addressing the important, open problem of video generation in complex scenes. ...
These crops will correspond to the nodes of the spatio-temporal graph. ...
arXiv:1912.02401v4
fatcat:56xoucduffdq5clzn3jco3z6ga
SAFCAR: Structured Attention Fusion for Compositional Action Recognition
[article]
2020
arXiv
pre-print
We present a general framework for compositional action recognition -- i.e. action recognition where the labels are composed out of simpler components such as subjects, atomic-actions and objects. ...
The main challenge in compositional action recognition is that there is a combinatorially large set of possible actions that can be composed using basic components. ...
The annotated spatio-temporal scene graphs are provided to the SGFB model as additional supervision during training. ...
arXiv:2012.02109v2
fatcat:wvqdumgwqbeivi7w3g24q7fbue
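The combinatorial explosion this snippet points to is easy to quantify: with S subjects, A atomic-actions, and O objects there are S x A x O composable labels, and a compositional split trains on only a few of them. A hypothetical sketch (the component vocabularies below are invented):

from itertools import product

subjects = ["person"]
atomic_actions = ["pick up", "put down", "push", "pull"]
objects_ = ["cup", "book", "box", "phone"]

# Every composable label: |subjects| * |atomic_actions| * |objects_| = 16 here.
all_labels = {f"{s} {a} {o}" for s, a, o in product(subjects, atomic_actions, objects_)}

# A compositional split sees only some action-object pairings at training time;
# the rest must be recognized zero-shot from the components.
seen = {"person pick up cup", "person push box"}
unseen = all_labels - seen
print(len(all_labels), len(unseen))  # 16 14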
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
[article]
2021
arXiv
pre-print
We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos. ...
Visual events are a composition of temporal actions involving actors spatially interacting with objects. ...
Action genome: Actions as compositions of spatiotemporal scene graphs. ...
arXiv:2103.16002v1
fatcat:vkcqfxgssvb5bjwp7zvqetbpti
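AGQA's scale comes from generating questions programmatically over scene graph annotations rather than writing them by hand. The mechanics can be hinted at with a toy template expansion; the templates and triples below are invented for illustration, not AGQA's actual program set.

# Hypothetical template-based QA generation from relationship triples.
triples = [("person", "holding", "cup"), ("person", "sitting_on", "chair")]

qa_pairs = []
for subj, pred, obj in triples:
    rel = pred.replace("_", " ")
    # Two toy templates per triple; the real benchmark composes far richer programs.
    qa_pairs.append((f"What is the {subj} {rel}?", obj))
    qa_pairs.append((f"Is the {subj} {rel} a {obj}?", "yes"))

for question, answer in qa_pairs:
    print(question, "->", answer)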
Target Adaptive Context Aggregation for Video Scene Graph Generation
[article]
2021
arXiv
pre-print
This paper deals with a challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. ...
We perform experiments on two VidSGG benchmarks: ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves state-of-the-art performance. ...
Introduction: Video understanding tasks, such as action recognition [29, 1, 38, 39], temporal action localization [47, 20, 31], spatio-temporal action detection [18, 5], have received lots of research ...
arXiv:2108.08121v1
fatcat:uco3x7widjdvtjogmlxizfc5fi
DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video
[article]
2020
arXiv
pre-print
Moreover, a temporal sub-graph captures the activities within the video through time. ...
These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned on the language query. ...
As activities are usually the result of the composition of several actions or interactions between a subject and objects [24] , our algorithm incorporates both spatial and temporal dependencies. ...
arXiv:2010.06260v1
fatcat:yqkwzl5o7rbvjfan5e34hbhhfq
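The two sub-graphs the DORi snippets mention, spatial edges among detections within a frame and temporal edges linking frames through time, can be sketched with networkx. The node and edge conventions here are assumptions chosen for illustration, not the paper's exact construction.

import networkx as nx

G = nx.DiGraph()
frames = {0: ["person", "cup"], 1: ["person", "cup", "table"]}

for t, objs in frames.items():
    # Spatial sub-graph: connect co-occurring detections within frame t.
    for a in objs:
        for b in objs:
            if a != b:
                G.add_edge((t, a), (t, b), kind="spatial")
    # Temporal sub-graph: link each entity to itself in the next frame.
    for a in objs:
        if t + 1 in frames and a in frames[t + 1]:
            G.add_edge((t, a), (t + 1, a), kind="temporal")

print(G.number_of_nodes(), G.number_of_edges())  # 5 10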
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
[article]
2021
arXiv
pre-print
The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. ...
A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations ...
As illustrated in Figure 1, at each dialogue turn, a DVD question tests dialogue systems to perform different types of reasoning on videos, such as action recognition and spatio-temporal reasoning. ...
arXiv:2101.00151v2
fatcat:j4pv54mx3bhd7eyfs5eyzyoyju
Action of multiple intra-QTL genes concerted around a co-localized transcription factor underpins a large effect QTL
2015
Scientific Reports
Although precision genome engineering is continually evolving, inhibitory costs and intractable philosophies weigh down transgenic product development. Conventional breeding is temporally demanding. ...
This novel report on extensive molecular characterization of a QTL contributed by a susceptible variety that improves stress tolerance, as well as the identification of cis-interacting genes belonging ...
Such variability among transgenic events is common and mostly related to position effects, but also may occur because of differences in spatio-temporal metabolic fluxes whereby the effect of a single gene ...
doi:10.1038/srep15183
pmid:26507552
pmcid:PMC4623671
fatcat:r42lqxktwvgfhdyyjt3vdsnvda
Video Question Answering: Datasets, Algorithms and Challenges
[article]
2022
arXiv
pre-print
We then point out the research trend of studying beyond factoid QA to inference QA towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration. ...
actions and activities as well as reasoning about their spatial, temporal, and causal relationships (Xiao et al., 2021). ...
of the video with a dynamics predictor, and runs the program on the dynamic scene to obtain an answer. ...
arXiv:2203.01225v1
fatcat:dn4sz5pomnfb7igvmxofangzsa
Learning Canonical Representations for Scene Graph to Image Generation
[article]
2020
arXiv
pre-print
Previous approaches showed that scenes with few entities can be controlled using scene graphs, but this approach struggles as the complexity of the graph (the number of objects and edges) increases. ...
Finally, we show improved performance of the model on three different benchmarks: Visual Genome, COCO, and CLEVR. ...
This work was completed in partial fulfillment for the Ph.D. degree of the first author.
arXiv:1912.07414v5
fatcat:q2dmubbvqvbf3pcvrdoyhftcnm
Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey
2021
IEEE Access
Additionally, the graph-based methods, although less explored, give very promising results. ...
This involves understanding the semantics of a video and then generating human-like descriptions of the video. ...
The similarity between candidate and reference scene graphs is computed by considering the semantic relations in the scene graph as a conjunction of logical propositions. ...
doi:10.1109/access.2021.3058248
fatcat:bnjmbffxgreb5jkjuxethaqnde
Video Analysis for Understanding Human Actions and Interactions
[article]
2021
Equipped with a proposal-free architecture, we tackle temporal moment localization by introducing a spatial-temporal graph. We found that one of the limitations of the exist [...] ...
We begin by considering the challenging problem of human action anticipation. In this task, we seek to predict a person's action as early as possible before it is completed. ...
of our spatio-temporal graph approach with existing methods for different tIoU α levels. ...
doi:10.25911/g7kb-br27
fatcat:qul7pgxp4rfurept4etvgntpie
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
2021
The Journal of Artificial Intelligence Research
This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. ...
Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. ...
Acknowledgments: This work was supported by the German Research Foundation (DFG) as part of Project-ID 232722074, SFB 1102. ...
doi:10.1613/jair.1.11688
fatcat:kvfdrg3bwrh35fns4z67adqp6i
Showing results 1 — 15 out of 202 results