Temporal Localization and Spatial Segmentation of Joint Attention in Multiple First-Person Videos
2017
2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
Technically, we propose a hierarchical conditional random field-based model that can 1) localize events of joint attention temporally and 2) segment objects of joint attention spatially. ...
As a key tool to discover such objects of joint attention, we rely on a collection of wearable eye-tracking cameras that provide a first-person video of interaction scenes and points-of-gaze data of interacting ...
We thank Binhua Zuo, Zhenqiang Li, Dailin Li, Ya Wang and Jiehui Wang for helping to collect and annotate our joint attention dataset. ...
doi:10.1109/iccvw.2017.273
dblp:conf/iccvw/HuangCKYHS17
fatcat:2z5ty3hgq5ccblzpndmbyw2xke
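As a hedged illustration of the inputs this entry works with (not the authors' hierarchical CRF), a minimal thresholding baseline over two per-person gaze tracks could look like the sketch below; the function name, pixel radius, and synthetic data are all assumptions.

```python
# Toy baseline for temporal localization of joint attention: mark frames
# where two wearers' points-of-gaze, mapped into a common view, agree.
# This is NOT the paper's hierarchical CRF, only an input/output sketch.
import numpy as np

def joint_attention_frames(gaze_a, gaze_b, radius=40.0):
    """gaze_a, gaze_b: (T, 2) per-person points-of-gaze (pixels) in a
    common reference view; returns a boolean mask over frames."""
    dist = np.linalg.norm(gaze_a - gaze_b, axis=1)
    return dist < radius

# Two synthetic 100-frame gaze tracks that converge in the second half.
t = np.linspace(0, 1, 100)[:, None]
gaze_a = np.hstack([200 * t, 100 * t])
gaze_b = np.hstack([200 * t + 80 * (1 - t), 100 * t])
mask = joint_attention_frames(gaze_a, gaze_b)
print("joint-attention frames:", np.flatnonzero(mask)[:5], "...")
```

A real model would reason jointly over all pairs of wearers and over object segments rather than thresholding a single pair of gaze points.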
Discovering Objects of Joint Attention via First-Person Sensing
2016
2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
We also introduce a new dataset of multiple pairs of first-person videos and points-of-gaze data. ...
The goal of this work is to discover objects of joint attention, i.e., objects being viewed by multiple people using head-mounted cameras and eye trackers. ...
Conclusions In this work, we introduced a novel task of discovering objects of joint attention in multiple first-person videos. ...
doi:10.1109/cvprw.2016.52
dblp:conf/cvpr/KeraYHS16
fatcat:w2nmemtlanadreac5wpuylhwu4
STAR: Sparse Transformer-based Action Recognition
[article]
2021
arXiv
pre-print
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data. ...
Our model can also process video clips of variable length grouped into a single batch. ...
... matrix multiplications to capture spatial correlations between human skeleton joints. We propose a segmented linear self-attention module that effectively captures temporal correlations of dynamic joint ...
arXiv:2107.07089v1
fatcat:g2ko62ahbvfftay3ocghmw7rmy
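The "segmented linear attention" this entry mentions can be sketched as linear attention (a positive feature map replacing softmax) applied independently within fixed-length temporal segments. This is a minimal sketch, assuming a Katharopoulos-style elu feature map and a segment length of 16; it is not the STAR authors' implementation.

```python
# Linear attention inside fixed-length temporal segments: O(T * D^2)
# instead of O(T^2). Segment length and feature map are assumptions.
import torch
import torch.nn.functional as F

def segmented_linear_attention(q, k, v, seg_len=16):
    """q, k, v: (B, T, D). Runs linear attention within each segment."""
    B, T, D = q.shape
    pad = (-T) % seg_len
    q, k, v = (F.pad(x, (0, 0, 0, pad)) for x in (q, k, v))
    S = (T + pad) // seg_len
    q, k, v = (x.view(B, S, seg_len, D) for x in (q, k, v))
    q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature map
    kv = torch.einsum("bstd,bste->bsde", k, v)   # per-segment K^T V
    z = 1.0 / (torch.einsum("bstd,bsd->bst", q, k.sum(dim=2)) + 1e-6)
    out = torch.einsum("bstd,bsde,bst->bste", q, kv, z)
    return out.reshape(B, S * seg_len, D)[:, :T]

x = torch.randn(2, 50, 64)
print(segmented_linear_attention(x, x, x).shape)  # torch.Size([2, 50, 64])
```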
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
[article]
2021
arXiv
pre-print
We introduce the task of spatially localizing narrated interactions in videos. ...
We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 ...
... with descriptions and temporal segments but did not address spatial grounding of activities. ...
arXiv:2110.10596v2
fatcat:jxitovdgezbwhehrvd7x3r2jyy
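One plausible reading of the self-supervised objective (hedged; the paper's actual loss may differ) is a MIL-style InfoNCE in which the best-matching region carries each clip-narration score, so the argmax region localizes the narration.

```python
# MIL-style InfoNCE over (region, narration) pairs; diagonal pairs are
# positives. Shapes and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def mil_nce_grounding(regions, narrations, tau=0.07):
    """regions: (B, R, D) candidate-region features per clip;
    narrations: (B, D) narration embeddings. Returns a scalar loss."""
    r = F.normalize(regions, dim=-1)
    n = F.normalize(narrations, dim=-1)
    sims = torch.einsum("brd,cd->bcr", r, n).max(dim=2).values  # (B, B)
    targets = torch.arange(sims.size(0))
    return F.cross_entropy(sims / tau, targets)

loss = mil_nce_grounding(torch.randn(4, 10, 256), torch.randn(4, 256))
print(float(loss))
```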
Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions
[article]
2018
arXiv
pre-print
We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions. ...
The RNN parses natural language descriptions in two directions, and the attention model aligns every meaningful word or phrase with each frame, thereby resulting in a more detailed understanding of video ...
In the first example, our model successfully localized "a person sitting at a desk eating some food" and "person drinking from a coffee cup". ...
arXiv:1808.08803v1
fatcat:rfxl44c5e5e7be3t4b3wh3d6ra
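A minimal sketch of the word-to-frame attention described here: a bidirectional GRU encodes the query words and a learned projection scores every word against every frame. Dimensions and module names are illustrative assumptions, not the ASST code.

```python
# Word-over-frame attention: bidirectional GRU over word embeddings,
# dot-product scores against projected frame features.
import torch
import torch.nn as nn

class WordFrameAttention(nn.Module):
    def __init__(self, word_dim=300, frame_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden, bidirectional=True,
                          batch_first=True)
        self.proj = nn.Linear(frame_dim, 2 * hidden)

    def forward(self, words, frames):
        """words: (B, W, word_dim); frames: (B, T, frame_dim).
        Returns (B, W, T): each word's attention over frames."""
        h, _ = self.rnn(words)                        # (B, W, 2*hidden)
        scores = h @ self.proj(frames).transpose(1, 2)
        return scores.softmax(dim=-1)

attn = WordFrameAttention()
a = attn(torch.randn(2, 8, 300), torch.randn(2, 40, 512))
print(a.shape)  # torch.Size([2, 8, 40])
```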
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
2020
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. ...
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence. ...
... , 2017] only ground the person tube in multiple videos and [Zhou et al., 2018; Chen et al., 2019b] further retrieve the spatio-temporal tubes of diverse objects from trimmed videos by weakly-supervised ...
doi:10.24963/ijcai.2020/149
dblp:conf/ijcai/ZhangZLHY20
fatcat:4yux7bufpzeqpeexjwfux4tubq
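As a hypothetical illustration of the output this entry targets, the sketch below assembles a spatio-temporal tube by scoring candidate boxes against a sentence embedding and linking the top box per frame; the real multi-branch relation network is considerably richer.

```python
# Toy tube assembly: per-frame box features scored against one sentence
# embedding; the argmax box per frame forms the tube. All shapes assumed.
import numpy as np

def ground_tube(obj_feats, sent_feat):
    """obj_feats: (T, N, D) features of N candidate boxes per frame;
    sent_feat: (D,) sentence embedding. Returns per-frame box indices."""
    scores = obj_feats @ sent_feat            # (T, N) similarity scores
    return scores.argmax(axis=1)

tube = ground_tube(np.random.randn(30, 5, 128), np.random.randn(128))
print(tube.shape, tube[:6])  # (30,) and the first few selected indices
```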
TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation
[article]
2020
arXiv
pre-print
Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments and an intra-scale attention to resolve semantic ambiguity by using non-local ...
To this end, we first present a novel sign video segment representation which takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation. ...
Different from their approaches, we develop a segment representation for sign videos, and aim to learn both spatial and temporal semantics of sign gestures. ...
arXiv:2010.05468v1
fatcat:qoj67klu2va6jk4v6klg37bhwa
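The multi-granularity segment representation could be sketched as average-pooled sliding windows of several temporal lengths over per-frame features; the window lengths and stride below are assumptions, not TSPNet's published values.

```python
# Multi-granularity segments: sliding windows of several lengths,
# each average-pooled into one segment feature.
import torch

def multi_granularity_segments(feats, windows=(8, 12, 16), stride=2):
    """feats: (T, D) per-frame features. Returns a list of (S_w, D)
    segment-feature tensors, one per window length."""
    out = []
    for w in windows:
        segs = feats.unfold(0, w, stride)     # (S_w, D, w)
        out.append(segs.mean(dim=-1))         # (S_w, D)
    return out

segs = multi_granularity_segments(torch.randn(64, 256))
print([s.shape for s in segs])
```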
Polar Relative Positional Encoding for Video-Language Segmentation
2020
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in video frames. ...
In this paper, we tackle a challenging task named video-language segmentation. ...
doi:10.24963/ijcai.2020/132
dblp:conf/ijcai/NingXW020
fatcat:mktrb7kgbzcqbgmywwgtrm23my
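The geometric core of polar relative positional encoding can be illustrated by expressing each pixel's offset from a reference point as a normalized (radius, angle) pair; the paper's actual encoding and its integration into attention are richer than this sketch.

```python
# Per-pixel polar coordinates relative to a reference point; the
# normalization choices here are assumptions for illustration only.
import numpy as np

def polar_relative_positions(h, w, ref_y, ref_x):
    """Returns (h, w, 2): normalized radius and angle of every pixel
    relative to the reference point (ref_y, ref_x)."""
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - ref_y, xs - ref_x
    radius = np.hypot(dy, dx) / np.hypot(h, w)   # normalize by diagonal
    angle = np.arctan2(dy, dx) / np.pi           # in [-1, 1]
    return np.stack([radius, angle], axis=-1)

print(polar_relative_positions(4, 4, 1.5, 1.5)[0, 0])  # corner pixel
```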
Deep Learning for Video Captioning: A Review
2019
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, followed by a summary of standard datasets ...
As a connection between the two worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. ...
In the absence of spatial annotation, Shen et al. first adopted multiple instance learning to detect semantic concepts in video frames, and then selected spatial region sequences using submodular maximization ...
doi:10.24963/ijcai.2019/877
dblp:conf/ijcai/ChenYJ19
fatcat:3xxssrzqjjd5jbvtgkkp5lw7xa
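The snippet's mention of submodular maximization refers to selecting spatial region sequences; as a hedged illustration, the standard greedy algorithm on a toy coverage objective is sketched below (not the cited paper's exact objective).

```python
# Greedy submodular maximization: pick the item with the largest
# marginal gain until the budget is spent (the classic (1 - 1/e) scheme).
def greedy_submodular(candidates, gain, budget):
    """candidates: iterable of items; gain(item, selected) returns the
    marginal gain of adding item; picks `budget` items greedily."""
    selected = []
    pool = list(candidates)
    for _ in range(budget):
        best = max(pool, key=lambda c: gain(c, selected))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy coverage: each candidate region "covers" a set of concept ids.
covers = {"r1": {1, 2}, "r2": {2, 3, 4}, "r3": {4, 5}}
gain = lambda c, sel: len(covers[c] - set().union(*(covers[s] for s in sel)))
print(greedy_submodular(covers, gain, 2))  # ['r2', 'r1']
```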
A Survey of Human Action Recognition and Posture Prediction
2022
Tsinghua Science and Technology
Human action recognition and posture prediction aim to recognize and predict, respectively, the actions and postures of persons in videos. ...
They are both active research topics in the computer vision community and have attracted considerable attention from academia and industry. ...
This work was supported by the National Natural Science Foundation of China (Nos. 61871038 ...
doi:10.26599/tst.2021.9010068
fatcat:lygnvsm3unddnngyd7s3wkchjy
Weakly Supervised Video Moment Retrieval From Text Queries
2019
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose a joint visual-semantic embedding-based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. ...
However, acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. ...
This work was partially supported by NSF grant 1544969 and ONR contract N00014-15-C5113 through a sub-contract from Mayachitra Inc. ...
doi:10.1109/cvpr.2019.01186
dblp:conf/cvpr/MithunPR19
fatcat:fv7y4dhnxrhvjdrf4c2w7mm5nm
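A minimal sketch of the weak-supervision recipe described here: pool frame features with text-guided attention and train a joint embedding with a hinge ranking loss over video-level pairs; the attention weights then localize the moment as a by-product. Dimensions and the margin are assumptions.

```python
# Text-guided attention pooling plus a hinge ranking loss; only
# video-level (video, sentence) pairs are needed.
import torch
import torch.nn.functional as F

def text_guided_pool(frames, text):
    """frames: (B, T, D), text: (B, D). Returns the attended video
    embedding (B, D) and the (B, T) weights that localize the moment."""
    w = torch.softmax((frames @ text.unsqueeze(-1)).squeeze(-1), dim=1)
    return (w.unsqueeze(-1) * frames).sum(dim=1), w

frames, text = torch.randn(4, 30, 256), torch.randn(4, 256)
vid, attn = text_guided_pool(frames, text)
sim = F.cosine_similarity(vid.unsqueeze(1), text.unsqueeze(0), dim=-1)
hinge = F.relu(0.2 + sim - sim.diag().unsqueeze(1))  # margin vs. negatives
hinge = hinge - torch.diag_embed(hinge.diag())       # drop self-pairs
print(vid.shape, attn.shape, float(hinge.mean()))
```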
Video-Based Convolutional Attention for Person Re-Identification
[article]
2019
arXiv
pre-print
We introduce an attention mechanism to capture the relevant information both at the frame level (spatial information) and at the video level (temporal information given by the importance of a specific frame within ...
In this paper we consider the problem of video-based person re-identification, which is the task of associating videos of the same person captured by different and non-overlapping cameras. ...
FESR 2014-2020 fund, project "Design of a Digital Assistant based on machine learning and natural language", and by the "PREscriptive Situational awareness for cooperative auto-organizing aerial sensor NETworks ...
arXiv:1910.04856v1
fatcat:s7imn4i7qncxzfsfpjmjrsxovi
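The two attention levels this entry lists (a spatial map per frame, a temporal weight per frame) can be sketched as the small pooling module below; shapes and layer choices are assumptions, not the paper's exact network.

```python
# Spatial attention pools each frame's feature map; temporal attention
# weights the resulting frame vectors into one clip descriptor.
import torch
import torch.nn as nn

class ClipPooling(nn.Module):
    def __init__(self, c=2048):
        super().__init__()
        self.spatial = nn.Conv2d(c, 1, kernel_size=1)   # frame-level map
        self.temporal = nn.Linear(c, 1)                 # video-level weight

    def forward(self, x):                    # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        s = self.spatial(x.flatten(0, 1)).flatten(1).softmax(-1)   # (B*T, H*W)
        f = (x.flatten(0, 1).flatten(2) * s.unsqueeze(1)).sum(-1)  # (B*T, C)
        f = f.view(B, T, C)
        w = self.temporal(f).softmax(dim=1)                        # (B, T, 1)
        return (w * f).sum(dim=1)                                  # (B, C)

print(ClipPooling(64)(torch.randn(2, 8, 64, 4, 4)).shape)
```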
Multiple Image Objects Detection, Tracking, and Classification using Human Articulated Visual Perception Capability
[chapter]
2008
Brain, Vision and AI
By using this concept, both temporal attention and spatial attention can be considered: temporal attention provides a predictive motion model, and spatial attention provides the detailed local ...
This mechanism provides an efficient method for more complex analysis using data association in spatially attentive window and predicted temporal location. ...
This book provides only a small example of this research activity, but it covers a great deal of what has been done in the field recently. ...
doi:10.5772/6040
fatcat:xmuljcpyzbbzvomxqsqfoc2jju
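A toy version of the mechanism described: a constant-velocity motion model supplies the temporal prediction, and detection is gated to a spatial window around it; the threshold and data below are illustrative.

```python
# Predict-then-gate: constant-velocity prediction (temporal attention)
# restricts data association to a window (spatial attention).
import numpy as np

def predict_and_gate(track, detections, window=30.0):
    """track: (K, 2) past object centers; detections: (N, 2) candidate
    centers. Returns detections inside the predicted window, and the
    prediction itself."""
    velocity = track[-1] - track[-2]          # constant-velocity model
    predicted = track[-1] + velocity
    keep = np.linalg.norm(detections - predicted, axis=1) < window
    return detections[keep], predicted

dets, pred = predict_and_gate(np.array([[0., 0.], [10., 5.]]),
                              np.random.rand(8, 2) * 40)
print(pred, len(dets))
```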
Looking deeper into Time for Activities of Daily Living Recognition
2020
2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
The temporal structure is represented globally by different temporal granularities and locally by temporal segments. ...
We also propose a two-level pose driven attention mechanism to take into account the relative importance of the segments and granularities. ...
Acknowledgement We are grateful to INRIA Sophia Antipolis -Mediterranean "NEF" computation cluster for providing resources and support. ...
doi:10.1109/wacv45572.2020.9093575
dblp:conf/wacv/DasTB20
fatcat:vknggzstnnbwvhf2vo6k244pki
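A sketch, under assumed shapes, of a two-level pose-driven attention: pose features first weight the temporal segments within each granularity, then weight the granularities themselves. This is not the authors' exact design, only an illustration of the two levels the entry names.

```python
# Level 1: pose attends over segments per granularity.
# Level 2: pose attends over the pooled granularities.
import torch
import torch.nn as nn

class TwoLevelPoseAttention(nn.Module):
    def __init__(self, d=512, pose_dim=128):
        super().__init__()
        self.seg = nn.Bilinear(pose_dim, d, 1)   # pose vs. segment score
        self.gran = nn.Bilinear(pose_dim, d, 1)  # pose vs. granularity score

    def forward(self, feats, pose):
        """feats: (B, G, S, D) segment features per granularity;
        pose: (B, pose_dim). Returns one (B, D) clip descriptor."""
        B, G, S, D = feats.shape
        p = pose[:, None, None, :].expand(B, G, S, -1)
        a = self.seg(p.reshape(-1, p.shape[-1]), feats.reshape(-1, D))
        a = a.view(B, G, S, 1).softmax(dim=2)
        per_gran = (a * feats).sum(dim=2)                      # (B, G, D)
        q = pose[:, None, :].expand(B, G, -1)
        b = self.gran(q.reshape(-1, q.shape[-1]), per_gran.reshape(-1, D))
        b = b.view(B, G, 1).softmax(dim=1)
        return (b * per_gran).sum(dim=1)                       # (B, D)

m = TwoLevelPoseAttention()
print(m(torch.randn(2, 3, 6, 512), torch.randn(2, 128)).shape)
```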
Recent Advances in Video Question Answering: A Review of Datasets and Methods
[article]
2021
arXiv
pre-print
VQA helps to retrieve temporal and spatial information from the video scenes and interpret it. In this survey, we review a number of methods and datasets for the task of VQA. ...
Video Question Answering (VQA) is a recently emerging, challenging task in the field of computer vision. ...
Spatio-Temporal Methods Joint reasoning of spatial and temporal structures of a video is required to accurately tackle the problem of VQA. ...
arXiv:2101.05954v1
fatcat:afio7akl7zf6rm2yn2a2xp2anq