202 Hits in 4.7 sec

Action Genome: Actions as Composition of Spatio-temporal Scene Graphs [article]

Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles
2019 arXiv   pre-print
Finally, we benchmark existing scene graph models on the new task of spatio-temporal scene graph prediction.  ...  Inspired by evidence that the prototypical unit of an event is an action-object interaction, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs.  ...  This article solely reflects the opinions and conclusions of its authors and not Panasonic or any entity associated with Panasonic.  ... 
arXiv:1912.06992v1 fatcat:6iap73ap2zbi7bxdkrtvkn66wi

Revisiting spatio-temporal layouts for compositional action recognition [article]

Gorjan Radevski, Marie-Francine Moens, Tinne Tuytelaars
2021 arXiv   pre-print
On the Something-Else and Action Genome datasets, we demonstrate (i) how to extend multi-head attention for spatio-temporal layout-based action recognition, (ii) how to improve the performance of appearance-based  ...  The main focus of this paper is compositional/few-shot action recognition, where we advocate the usage of multi-head attention (proven to be effective for spatial reasoning) over spatio-temporal layouts  ...  a subset of scene graph generation.  ... 
arXiv:2111.01936v1 fatcat:q3l3m7nj7jadflecmeulmwefcy

Compositional Video Synthesis with Action Graphs [article]

Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson
2021 arXiv   pre-print
Videos of actions are complex signals containing rich compositional structure in space and time.  ...  To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task.  ...  This work was completed in partial fulfillment for the Ph.D. degree of Amir Bar.  ... 
arXiv:2006.15327v4 fatcat:zcwuyip2djbozlv2dvwpiap5di

Generating Videos of Zero-Shot Compositions of Actions and Objects [article]

Megha Nawhal, Mengyao Zhai, Andreas Lehrmann, Leonid Sigal, Greg Mori
2020 arXiv   pre-print
In particular, we introduce the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for action-object compositions that are unseen during training  ...  In this paper we develop methods for generating such videos -- making progress toward addressing the important, open problem of video generation in complex scenes.  ...  These crops will correspond to the nodes of the spatio-temporal graph.  ... 
arXiv:1912.02401v4 fatcat:56xoucduffdq5clzn3jco3z6ga

SAFCAR: Structured Attention Fusion for Compositional Action Recognition [article]

Tae Soo Kim, Gregory D. Hager
2020 arXiv   pre-print
We present a general framework for compositional action recognition -- i.e. action recognition where the labels are composed out of simpler components such as subjects, atomic-actions and objects.  ...  The main challenge in compositional action recognition is that there is a combinatorially large set of possible actions that can be composed using basic components.  ...  The annotated spatio-temporal scene graphs are provided to the SGFB model as an additional supervision during training.  ... 
arXiv:2012.02109v2 fatcat:wvqdumgwqbeivi7w3g24q7fbue

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning [article]

Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala
2021 arXiv   pre-print
We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question answer pairs for 9.6K videos.  ...  Visual events are a composition of temporal actions involving actors spatially interacting with objects.  ...  Action genome: Actions as compositions of spatiotemporal scene graphs.  ... 
arXiv:2103.16002v1 fatcat:vkcqfxgssvb5bjwp7zvqetbpti

Target Adaptive Context Aggregation for Video Scene Graph Generation [article]

Yao Teng, Limin Wang, Zhifeng Li, Gangshan Wu
2021 arXiv   pre-print
This paper deals with a challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks.  ...  We perform experiments on two VidSGG benchmarks: ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves state-of-the-art performance.  ...  Video understanding tasks, such as action recognition [29, 1, 38, 39], temporal action localization [47, 20, 31], and spatio-temporal action detection [18, 5], have received lots of research  ... 
arXiv:2108.08121v1 fatcat:uco3x7widjdvtjogmlxizfc5fi

DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [article]

Cristian Rodriguez-Opazo and Edison Marrese-Taylor and Basura Fernando and Hongdong Li and Stephen Gould
2020 arXiv   pre-print
Moreover, a temporal sub-graph captures the activities within the video through time.  ...  These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned on the language query.  ...  As activities are usually the result of the composition of several actions or interactions between a subject and objects [24], our algorithm incorporates both spatial and temporal dependencies.  ... 
arXiv:2010.06260v1 fatcat:yqkwzl5o7rbvjfan5e34hbhhfq

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue [article]

Hung Le and Chinnadhurai Sankar and Seungwhan Moon and Ahmad Beirami and Alborz Geramifard and Satwik Kottur
2021 arXiv   pre-print
The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video.  ...  A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations  ...  As illustrated in Figure 1, at each dialogue turn, a DVD question tests dialogue systems' ability to perform different types of reasoning on videos, such as action recognition and spatio-temporal reasoning.  ... 
arXiv:2101.00151v2 fatcat:j4pv54mx3bhd7eyfs5eyzyoyju

Action of multiple intra-QTL genes concerted around a co-localized transcription factor underpins a large effect QTL

Shalabh Dixit, Akshaya Kumar Biswal, Aye Min, Amelia Henry, Rowena H. Oane, Manish L. Raorane, Toshisangba Longkumer, Isaiah M. Pabuayon, Sumanth K. Mutte, Adithi R. Vardarajan, Berta Miro, Ganesan Govindan (+10 others)
2015 Scientific Reports  
Although precision genome engineering is continually evolving, inhibitory costs and intractable philosophies weigh down transgenic product development. Conventional breeding is temporally demanding.  ...  This novel report on extensive molecular characterization of a QTL contributed by a susceptible variety that improves stress tolerance, as well as the identification of cis-interacting genes belonging  ...  Such variability among transgenic events is common and mostly related to position effects, but also may occur because of differences in spatio-temporal metabolic fluxes whereby the effect of a single gene  ... 
doi:10.1038/srep15183 pmid:26507552 pmcid:PMC4623671 fatcat:r42lqxktwvgfhdyyjt3vdsnvda

Video Question Answering: Datasets, Algorithms and Challenges [article]

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, Tat-Seng Chua
2022 arXiv   pre-print
We then point out the research trend of studying beyond factoid QA to inference QA towards the cognition of video contents. Finally, we conclude with some promising directions for future exploration.  ...  , actions and activities, as well as reasoning about their spatial, temporal and causal relationships (Xiao et al., 2021).  ...  of the video with a dynamics predictor, and runs the program on the dynamic scene to obtain an answer.  ... 
arXiv:2203.01225v1 fatcat:dn4sz5pomnfb7igvmxofangzsa

Learning Canonical Representations for Scene Graph to Image Generation [article]

Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
2020 arXiv   pre-print
Previous approaches showed that scenes with few entities can be controlled using scene graphs, but this approach struggles as the complexity of the graph (the number of objects and edges) increases.  ...  Finally, we show improved performance of the model on three different benchmarks: Visual Genome, COCO, and CLEVR.  ...  This work was completed in partial fulfillment for the Ph.D. degree of the first author. References  ... 
arXiv:1912.07414v5 fatcat:q2dmubbvqvbf3pcvrdoyhftcnm

Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

Khushboo Khurana, Umesh Deshpande
2021 IEEE Access  
Additionally, the graph-based methods, although less explored, give very promising results.  ...  This involves understanding the semantics of a video and then generating human-like descriptions of the video.  ...  The similarity between candidate and reference scene graphs is computed by considering the semantic relations in the scene graph as a conjunction of logical propositions.  ... 
doi:10.1109/access.2021.3058248 fatcat:bnjmbffxgreb5jkjuxethaqnde

Video Analysis for Understanding Human Actions and Interactions [article]

Cristian Rodriguez Opazo, The Australian National University
Equipped with a proposal-free architecture, we tackle temporal moment localization by introducing a spatial-temporal graph. We found that one of the limitations of the exist [...]  ...  We begin by considering the challenging problem of human action anticipation. In this task, we seek to predict a person's action as early as possible before it is completed.  ...  of our spatio-temporal graph approach with existing methods for different tIoU α levels.  ... 
doi:10.25911/g7kb-br27 fatcat:qul7pgxp4rfurept4etvgntpie

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Aditya Mogadala, Marimuthu Kalimuthu, Dietrich Klakow
2021 The Journal of Artificial Intelligence Research  
This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing.  ...  Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks.  ...  Acknowledgments This work was supported by the German Research Foundation (DFG) as part of Project-ID 232722074 (SFB 1102).  ... 
doi:10.1613/jair.1.11688 fatcat:kvfdrg3bwrh35fns4z67adqp6i
Showing results 1 — 15 out of 202 results