
Temporal Localization and Spatial Segmentation of Joint Attention in Multiple First-Person Videos

Yifei Huang, Minjie Cai, Hiroshi Kera, Ryo Yonetani, Keita Higuchi, Yoichi Sato
2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
Technically, we propose a hierarchical conditional random field-based model that can 1) localize events of joint attention temporally and 2) segment objects of joint attention spatially.  ...  As a key tool to discover such objects of joint attention, we rely on a collection of wearable eye-tracking cameras that provide a first-person video of interaction scenes and points-of-gaze data of interacting  ...  We thank Binhua Zuo, Zhenqiang Li, Dailin Li, Ya Wang and Jiehui Wang for helping to collect and annotate our joint attention dataset.  ...
doi:10.1109/iccvw.2017.273 dblp:conf/iccvw/HuangCKYHS17 fatcat:2z5ty3hgq5ccblzpndmbyw2xke
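The temporal-localization half of this idea can be sketched very simply: frames where every wearer's gaze point falls close together are joint-attention candidates, and sufficiently long runs of candidates become events. This is a generic illustration, not the paper's CRF model; the function name, `radius`, and `min_frames` are hypothetical, and gaze points are assumed to be already registered into a shared coordinate frame.

```python
import math

def localize_joint_attention(gaze_tracks, radius=0.1, min_frames=3):
    """Toy temporal localization of joint attention.

    gaze_tracks: list of per-person gaze sequences, each a list of
    (x, y) points assumed registered into a common frame. A frame is a
    joint-attention candidate when every pair of gaze points lies within
    `radius`; runs of at least `min_frames` candidates become events,
    returned as inclusive (start, end) frame-index pairs.
    """
    n_frames = len(gaze_tracks[0])
    hits = []
    for t in range(n_frames):
        pts = [track[t] for track in gaze_tracks]
        close = all(
            math.dist(pts[i], pts[j]) <= radius
            for i in range(len(pts))
            for j in range(i + 1, len(pts))
        )
        hits.append(close)

    events, start = [], None
    for t, h in enumerate(hits + [False]):  # sentinel flushes the last run
        if h and start is None:
            start = t
        elif not h and start is not None:
            if t - start >= min_frames:
                events.append((start, t - 1))
            start = None
    return events
```

A real system would additionally segment the attended object spatially around the gaze cluster, which the paper handles jointly in its hierarchical model.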

Discovering Objects of Joint Attention via First-Person Sensing

Hiroshi Kera, Ryo Yonetani, Keita Higuchi, Yoichi Sato
2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
We also introduce a new dataset of multiple pairs of first-person videos and points-of-gaze data.  ...  The goal of this work is to discover objects of joint attention, i.e., objects being viewed by multiple people using head-mounted cameras and eye trackers.  ...  Conclusions: In this work, we introduced a novel task of discovering objects of joint attention in multiple first-person videos.  ...
doi:10.1109/cvprw.2016.52 dblp:conf/cvpr/KeraYHS16 fatcat:w2nmemtlanadreac5wpuylhwu4

STAR: Sparse Transformer-based Action Recognition [article]

Feng Shi, Chonghan Lee, Liang Qiu, Yizhou Zhao, Tianyi Shen, Shivran Muralidhar, Tian Han, Song-Chun Zhu, Vijaykrishnan Narayanan
2021 arXiv   pre-print
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data.  ...  Our model can also process variable-length video clips grouped as a single batch.  ...  matrix multiplications to capture spatial correlations between human skeleton joints. • We propose a segmented linear self-attention module that effectively captures temporal correlations of dynamic joint  ...
arXiv:2107.07089v1 fatcat:g2ko62ahbvfftay3ocghmw7rmy
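The core of segmented temporal attention is easy to state: instead of letting every frame attend to every other frame (quadratic in sequence length T), attention is computed only within non-overlapping segments of fixed length, which is linear in T. The single-head, projection-free sketch below illustrates that cost structure only; it is not the STAR implementation, and `seg_len` is a hypothetical parameter.

```python
import numpy as np

def segmented_self_attention(x, seg_len=4):
    """Toy segmented self-attention (single head, no learned projections).

    x: (T, D) sequence of per-frame joint features. Full self-attention
    costs O(T^2); restricting attention to non-overlapping segments of
    `seg_len` frames costs O(T * seg_len), i.e., linear in T.
    """
    t, d = x.shape
    out = np.empty_like(x, dtype=float)
    for s in range(0, t, seg_len):
        seg = x[s:s + seg_len].astype(float)         # (L, D) one segment
        scores = seg @ seg.T / np.sqrt(d)            # within-segment scores
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)            # row-wise softmax
        out[s:s + seg_len] = w @ seg                 # attention-weighted mix
    return out
```

Each output frame is a convex combination of frames from its own segment, so no information crosses segment boundaries at this stage; a full model would stack such layers or add cross-segment mixing.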

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [article]

Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
2021 arXiv   pre-print
We introduce the task of spatially localizing narrated interactions in videos.  ...  We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2  ...  with descriptions and temporal segments but did not address spatial grounding of activities.  ... 
arXiv:2110.10596v2 fatcat:jxitovdgezbwhehrvd7x3r2jyy

Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions [article]

Ke Ning, Linchao Zhu, Ming Cai, Yi Yang, Di Xie, Fei Wu
2018 arXiv   pre-print
We propose a novel attentive sequence-to-sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions.  ...  The RNN parses natural language descriptions in two directions, and the attentive model attends to every meaningful word or phrase for each frame, thereby yielding a more detailed understanding of the video  ...  In the first example, our model successfully localized "a person sitting at a desk eating some food" and "person drinking from a coffee cup".  ...
arXiv:1808.08803v1 fatcat:rfxl44c5e5e7be3t4b3wh3d6ra

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Jing Yuan
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences.  ...  Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.  ...  ., 2017] only ground the person tube in multiple videos and [Zhou et al., 2018; Chen et al., 2019b] further retrieve the spatio-temporal tubes of diverse objects from trimmed videos by weakly-supervised  ...
doi:10.24963/ijcai.2020/149 dblp:conf/ijcai/ZhangZLHY20 fatcat:4yux7bufpzeqpeexjwfux4tubq

TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [article]

Dongxu Li, Chenchen Xu, Xin Yu, Kaihao Zhang, Ben Swift, Hanna Suominen, Hongdong Li
2020 arXiv   pre-print
Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments and an intra-scale attention to resolve semantic ambiguity by using non-local  ...  To this end, we first present a novel sign video segment representation which takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation.  ...  Different from their approaches, we develop a segment representation for sign videos, and aim to learn both spatial and temporal semantics of sign gestures.  ... 
arXiv:2010.05468v1 fatcat:qoj67klu2va6jk4v6klg37bhwa
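The "multiple temporal granularities" idea in segment representations amounts to enumerating overlapping candidate windows at several sizes, so a later attention stage can weigh competing segmentation hypotheses instead of committing to one hard cut. The enumerator below is a generic sketch under that reading, not TSPNet's representation; the window sizes and stride are hypothetical.

```python
def multi_granularity_segments(n_frames, window_sizes=(8, 12, 16), stride=2):
    """Enumerate candidate temporal segments at several granularities.

    Slides windows of each size in `window_sizes` over the frame index
    range with the given stride, returning inclusive (start, end) pairs.
    Overlapping hypotheses at different scales alleviate the need for an
    accurate up-front video segmentation.
    """
    segs = []
    for w in window_sizes:
        for s in range(0, n_frames - w + 1, stride):
            segs.append((s, s + w - 1))
    return segs
```

Inter-scale attention would then compare features of windows at different sizes covering the same frames, while intra-scale attention relates windows of the same size across time.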

Polar Relative Positional Encoding for Video-Language Segmentation

Ke Ning, Lingxi Xie, Fei Wu, Qi Tian
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in video frames.  ...  In this paper, we tackle a challenging task named video-language segmentation.  ...
doi:10.24963/ijcai.2020/132 dblp:conf/ijcai/NingXW020 fatcat:mktrb7kgbzcqbgmywwgtrm23my

Deep Learning for Video Captioning: A Review

Shaoxiang Chen, Ting Yao, Yu-Gang Jiang
2019 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence  
In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, followed by a summary of standard datasets  ...  As a connection between the two worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video.  ...  In the absence of spatial annotation, Shen et al. first adopted multiple instance learning to detect semantic concepts in video frames, and then selected spatial region sequences using submodular maximization  ...
doi:10.24963/ijcai.2019/877 dblp:conf/ijcai/ChenYJ19 fatcat:3xxssrzqjjd5jbvtgkkp5lw7xa

A Survey of Human Action Recognition and Posture Prediction

Nan Ma, Zhixuan Wu, Yiu-ming Cheung, Yuchen Guo, Yue Gao, Jiahong Li, Beijyan Jiang
2022 Tsinghua Science and Technology  
Human action recognition and posture prediction aim to recognize and predict, respectively, the actions and postures of persons in videos.  ...  Both are active research topics in the computer vision community and have attracted considerable attention from academia and industry.  ...  This work was supported by the National Natural Science Foundation of China (Nos. 61871038  ...
doi:10.26599/tst.2021.9010068 fatcat:lygnvsm3unddnngyd7s3wkchjy

Weakly Supervised Video Moment Retrieval From Text Queries

Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions.  ...  However, acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable.  ...  This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc.  ...
doi:10.1109/cvpr.2019.01186 dblp:conf/cvpr/MithunPR19 fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm
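At inference time, a joint visual-semantic embedding reduces moment retrieval to ranking candidate segments by similarity to the query in the shared space; the weak supervision only affects how the embedding is trained (video-level sentence pairs, no temporal boundaries). The ranking step can be sketched as follows, assuming the embedded features are already given; the function name is hypothetical and this is not the paper's full pipeline.

```python
import numpy as np

def rank_segments(segment_feats, text_feat):
    """Rank candidate temporal segments against a text query.

    segment_feats: (N, D) embedded features of N candidate segments.
    text_feat: (D,) embedded sentence feature in the same joint space.
    Scores each segment by cosine similarity and returns segment indices
    ordered from best to worst match.
    """
    v = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    q = text_feat / np.linalg.norm(text_feat)
    sims = v @ q                       # cosine similarity per segment
    return [int(i) for i in np.argsort(-sims)]
```

The top-ranked segment (or a merged run of top segments) is then reported as the retrieved moment.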

Video-Based Convolutional Attention for Person Re-Identification [article]

Marco Zamprogno, Marco Passon, Niki Martinel, Giuseppe Serra, Giuseppe Lancioni, Christian Micheloni, Carlo Tasso, Gian Luca Foresti
2019 arXiv   pre-print
We introduce an attention mechanism to capture the relevant information both at the frame level (spatial information) and at the video level (temporal information given by the importance of a specific frame within  ...  In this paper we consider the problem of video-based person re-identification, which is the task of associating videos of the same person captured by different and non-overlapping cameras.  ...  FESR 2014-2020 fund, project "Design of a Digital Assistant based on machine learning and natural language", and by the "PREscriptive Situational awareness for cooperative autoorganizing aerial sensor NETworks  ...
arXiv:1910.04856v1 fatcat:s7imn4i7qncxzfsfpjmjrsxovi
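The video-level (temporal) half of such an attention mechanism boils down to scoring each frame's importance, normalizing the scores with a softmax, and pooling the frame features with those weights, so informative frames dominate the final re-identification descriptor. The sketch below illustrates only that pooling step; `score_vec` stands in for whatever learned scoring module a real model would use and is purely hypothetical.

```python
import numpy as np

def temporal_attention_pool(frame_feats, score_vec):
    """Collapse (T, D) per-frame features into one video descriptor.

    Each frame gets an importance score (dot product with a hypothetical
    learned vector `score_vec`), scores are softmax-normalized, and the
    frames are averaged with those weights. Returns the pooled (D,)
    descriptor and the (T,) attention weights.
    """
    scores = frame_feats @ score_vec
    scores -= scores.max()             # numerical stability for softmax
    w = np.exp(scores)
    w /= w.sum()
    return w @ frame_feats, w
```

Frame-level (spatial) attention would analogously reweight regions inside each frame before this temporal pooling.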

Multiple Image Objects Detection, Tracking, and Classification using Human Articulated Visual Perception Capability [chapter]

HeungKyu Lee
2008 Brain, Vision and AI  
By using this concept, both temporal attention and spatial attention can be considered because temporal attention provides the predictable motion model, and spatial attention provides the detailed local  ...  This mechanism provides an efficient method for more complex analysis using data association in spatially attentive window and predicted temporal location.  ...  This book provides only a small example of this research activity, but it covers a great deal of what has been done in the field recently.  ... 
doi:10.5772/6040 fatcat:xmuljcpyzbbzvomxqsqfoc2jju

Looking deeper into Time for Activities of Daily Living Recognition

Srijan Das, Monique Thonnat, Francois Bremond
2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
The temporal structure is represented globally by different temporal granularities and locally by temporal segments.  ...  We also propose a two-level pose driven attention mechanism to take into account the relative importance of the segments and granularities.  ...  Acknowledgement We are grateful to INRIA Sophia Antipolis -Mediterranean "NEF" computation cluster for providing resources and support.  ... 
doi:10.1109/wacv45572.2020.9093575 dblp:conf/wacv/DasTB20 fatcat:vknggzstnnbwvhf2vo6k244pki

Recent Advances in Video Question Answering: A Review of Datasets and Methods [article]

Devshree Patel, Ratnam Parikh, Yesha Shastri
2021 arXiv   pre-print
VQA helps to retrieve temporal and spatial information from video scenes and interpret it. In this survey, we review a number of methods and datasets for the task of VQA.  ...  Video Question Answering (VQA) is a recently emerging and challenging task in the field of computer vision.  ...  Spatio-Temporal Methods: Joint reasoning over the spatial and temporal structure of a video is required to accurately tackle the problem of VQA.  ...
arXiv:2101.05954v1 fatcat:afio7akl7zf6rm2yn2a2xp2anq
Showing results 1 — 15 out of 11,706 results