Weakly-supervised Action Localization via Hierarchical Mining [article]

Jia-Chang Feng, Fa-Ting Hong, Jia-Run Du, Zhongang Qi, Ying Shan, Xiaohu Qie, Wei-Shi Zheng, Jianping Wu
2022 arXiv   pre-print
In this work, we propose a hierarchical mining strategy under video-level and snippet-level manners, i.e., hierarchical supervision and hierarchical consistency mining, to maximize the usage of the given  ...  Thus, the crucial issue of existing weakly-supervised action localization methods is the limited supervision from the weak annotations for precise predictions.  ...  In contrast, we mine the hierarchical supervision and consistency for better action localization without bells and whistles.  ... 
arXiv:2206.11011v1 fatcat:2qbjft2y3bfzjlflq223mvpqwy

Weakly Supervised Video Summarization by Hierarchical Reinforcement Learning [article]

Yiyan Chen, Li Tao, Xueting Wang, Toshihiko Yamasaki
2020 arXiv   pre-print
To solve these problems, we propose a weakly supervised hierarchical reinforcement learning framework, which decomposes the whole task into several subtasks to enhance the summarization quality.  ...  With the guide of the subgoal, the worker predicts the importance scores for video frames in the subtask by policy gradient according to both global reward and innovative defined sub-rewards to overcome  ...  [21] used long short-term memory (LSTM) to predict the importance score with a Determinantal Point Process (DPP) module.  ... 
arXiv:2001.05864v2 fatcat:mdkkrc75zjex3jnvk7blnqemqe

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [article]

Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong
2021 arXiv   pre-print
In this work, we revisit hierarchical models in video prediction.  ...  Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time.  ...  Hierarchical long-term video prediction without supervision. In ICML. 2018.  ... 
arXiv:2104.06697v1 fatcat:thkaq2a53fhyzedof52lzw7xtm

Unsupervised Hierarchical Concept Learning [article]

Sumegh Roychowdhury, Sumedh A. Sontakke, Nikaash Puri, Mausoom Sarkar, Milan Aggarwal, Pinkesh Badjatiya, Balaji Krishnamurthy, Laurent Itti
2020 arXiv   pre-print
However, recent work to discover such concepts without access to any environment does not discover relationships (or a hierarchy) between these discovered concepts.  ...  Organizing these discovered concepts hierarchically at different levels of abstraction is useful in discovering patterns, building ontologies, and generating tutorials from demonstration data.  ...  We extend the architecture in Shankar et al. (2019) to simultaneously discover concepts along with their hierarchical organization without any supervision.  ... 
arXiv:2010.02556v1 fatcat:4nhaaajxh5efpngtlrzrvoy2ru

Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos [article]

Reza Ghoddoosian, Saif Sayed, Vassilis Athitsos
2021 arXiv   pre-print
Further, we present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences.  ...  This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos, where only the ordered sequence of video-level actions is available during training.  ...  Without loss of generality, we used the output of our two-stream hierarchical network f_c, so that p(τ) = 1 for the predicted task τ = argmax(f_c) and p(τ) = 0 otherwise.  ...
arXiv:2110.05697v1 fatcat:nrxass56fvapfovylsyeatd72e

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization [article]

Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui
2021 arXiv   pre-print
To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into consideration the semantic dependencies across videos.  ...  Extensive experiments are conducted to verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation.  ...  dealing with long videos.  ... 
arXiv:2112.13478v1 fatcat:evdrt2a2mff4dgfy27eiyrhcci

Hierarchical Self-supervised Representation Learning for Movie Understanding [article]

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo
2022 arXiv   pre-print
In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level  ...  Most self-supervised video representation learning approaches focus on action recognition.  ...  In [54], the authors propose a long-term temporal model called Object Transformer (OT).  ...
arXiv:2204.03101v1 fatcat:kl2xwoczfzedvd5tx452ecg2le

ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation [article]

Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, Yang Wu
2021 arXiv   pre-print
to form short-term tracklets. (3) Training the appearance encoder using short-term tracklets as reliable pseudo labels. (4) Merging short-term tracklets to long-term tracklets utilizing adopted appearance  ...  To tackle this issue, we propose a self-supervised refining MOTS (i.e., ReMOTS) framework.  ...  Merging Short-term Tracklets With better appearance features and more robust spatiotemporal information of short-term tracklets, we are able to merge short-term tracklets into long-term ones.  ... 
arXiv:2007.03200v3 fatcat:ijgzbckgjzbn5a47hk3f4djzeu

A Perceptual Prediction Framework for Self Supervised Event Segmentation [article]

Sathyanarayanan N. Aakur, Sudeep Sarkar
2019 arXiv   pre-print
We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into individual, stable segments that share the same  ...  In this paper, we tackle the problem of self-supervised temporal segmentation of long videos, which alleviates the need for any supervision.  ...  Related Work Fully supervised approaches treat event segmentation as a supervised learning problem and assign the semantics to the video in terms of labels and try to segment the video into its semantically  ...
arXiv:1811.04869v3 fatcat:suoguja4dzbvrjsg3m5lkcdtnq

Action-Agnostic Human Pose Forecasting [article]

Hsu-kuang Chiu, Ehsan Adeli, Borui Wang, De-An Huang, Juan Carlos Niebles
2018 arXiv   pre-print
For instance, previous work either focused only on short-term or long-term predictions, while sacrificing one or the other.  ...  In this paper, we propose a new action-agnostic method for short- and long-term human pose forecasting.  ...  predict future sequences without the demand for the supervising signal from action labels.  ... 
arXiv:1810.09676v1 fatcat:pms3wo6iyvbsrh2vkcdqjfdgza

A Perceptual Prediction Framework for Self Supervised Event Segmentation

Sathyanarayanan N. Aakur, Sudeep Sarkar
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visually complex videos into constituent events.  ...  Temporal segmentation of long videos is an important problem, that has largely been tackled through supervised learning, often requiring large amounts of annotated training data.  ...  Related Work Fully supervised approaches treat event segmentation as a supervised learning problem and assign the semantics to the video in terms of labels and try to segment the video into its semantically  ... 
doi:10.1109/cvpr.2019.00129 dblp:conf/cvpr/AakurS19 fatcat:z7p22thdqnf5xffzrhxvxvbc3q

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning [article]

Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas
2022 arXiv   pre-print
(Hi-TRS), to explicitly capture spatial, short-term, and long-term temporal dependencies at frame, clip, and video levels, respectively.  ...  Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder  ...  Video Transformer (V-TRS). The V-TRS model summarizes the long-term abstracted video level information.  ... 
arXiv:2207.09644v2 fatcat:ffyjgbxzevf3pkjqp25ulrs3my

A Neurally-Inspired Hierarchical Prediction Network for Spatiotemporal Sequence Learning and Prediction [article]

Jielin Qiu, Ge Huang, Tai Sing Lee
2021 arXiv   pre-print
This facilitates the learning of relationships among movement patterns, yielding state-of-the-art performance in long range video sequence predictions in the benchmark datasets.  ...  in the visual cortical hierarchy for predicting future video frames.  ...  This model learns an LSTM (long short-term memory) model at each level to predict the errors made in an earlier level of the hierarchical visual system.  ...
arXiv:1901.09002v2 fatcat:3yz3fenx2zaoje4gpi5ii7x4yi

Reconstructive Sequence-Graph Network for Video Summarization

Bin Zhao, Haopeng Li, Xiaoqiang Lu, Xuelong Li
2021 IEEE Transactions on Pattern Analysis and Machine Intelligence  
Long Short-Term Memory (LSTM), and the shot-level dependencies are captured by the Graph Convolutional Network (GCN).  ...  Motivated by this point, we propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically, where the frame-level dependencies are encoded by  ...  This point has also been proved by the results of our unsupervised variant RSGN uns , since it performs even better than those supervised baselines without the video reconstructor.  ... 
doi:10.1109/tpami.2021.3072117 pmid:33835915 fatcat:x5ql4fld5barnndnv2uizejeau

Associating Objects with Transformers for Video Object Segmentation [article]

Zongxin Yang, Yunchao Wei, Yi Yang
2021 arXiv   pre-print
For sufficiently modeling multi-object association, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation.  ...  This paper investigates how to realize better and more efficient embedding learning to tackle the semi-supervised video object segmentation under challenging multi-object scenarios.  ...  Related Work Semi-supervised Video Object Segmentation.  ... 
arXiv:2106.02638v3 fatcat:mwhmxpp2u5dmplxyquvkin4y7i