A Survey on Deep Learning Technique for Video Segmentation [article]

Wenguan Wang, Tianfei Zhou, Fatih Porikli, David Crandall, Luc Van Gool
2021 arXiv   pre-print
Video segmentation, i.e., partitioning video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, from enhancing visual effects in movie, to understanding  ...  In this survey, we comprehensively review two basic lines of research - generic object segmentation (of unknown categories) in videos and video semantic segmentation - by introducing their respective task  ...  Later, in [166], an outside memory is utilized to build a stronger Siamese track-segmenter. • Un-/Weakly-Supervised Learning based Methods.  ... 
arXiv:2107.01153v3 fatcat:nry4yjhq7zhtzbfh53wf7ie3um

A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community  ...  More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised  ...  QSPN also devises an auxiliary captioning task which regenerates the query sentence from the retrieved video segment.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu

Weakly-Supervised Video Object Grounding via Causal Intervention [article]

Wei Wang, Junyu Gao, Changsheng Xu
2021 arXiv   pre-print
We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning.  ...  With this in mind, we design a unified causal framework to learn the deconfounded object-relevant association for more accurate and robust video object grounding.  ...  Weakly-Supervised Video Object Grounding There are several different settings in existing works in terms of video object grounding, including localizing the queried objects described in the sentence to  ... 
arXiv:2112.00475v1 fatcat:swz3seosebg2bbscrbs2gvti7e

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [article]

Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu
2020 arXiv   pre-print
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.  ...  In this work we present an effective weakly-supervised model, named as Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage.  ...  Weakly-Supervised Video Grounding Weakly-supervised video grounding is to predict the most semantically matched temporal proposal without temporal segment annotation.  ... 
arXiv:2003.07048v1 fatcat:fcepouqkvves7op25a4vpwidmm

Space-Time Memory Network for Sounding Object Localization in Videos [article]

Sizhe Li, Yapeng Tian, Chenliang Xu
2021 arXiv   pre-print
To this end, we propose a space-time memory network for sounding object localization in videos.  ...  Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects.  ...  Acknowledgements: We would like to thank the anonymous reviewers for the constructive comments. This work was supported in part by NSF 1741472 and 1909912.  ... 
arXiv:2111.05526v1 fatcat:7bly6ftblzhrtou3h5hpgcuk3i

LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval [article]

Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer
2020 arXiv   pre-print
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to the given natural language query without access to temporal annotations during training.  ...  Prior strongly- and weakly-supervised approaches often leverage co-attention mechanisms to learn visual-semantic representations for localization.  ...  However, our Recall@5 accuracy is still inferior to those obtained by strongly-supervised models.  ... 
arXiv:1909.13784v2 fatcat:btgosisk6bb4pklnwgpkojk53m

Audio-Visual Event Localization in the Wild

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu
2019 Computer Vision and Pattern Recognition  
segment query).  ...  Fully and Weakly-Supervised Event Localization The goal of event localization is to predict the event label for each video segment, which contains both audio and visual tracks, for an input video sequence  ... 
dblp:conf/cvpr/Tian0LDX19 fatcat:nhfmf63jgbglbebqhxb6lzs5zq

State of the Art: A Summary of Semantic Image and Video Retrieval Techniques

S. Suguna, C. Ranjith Kumar, D. Sheela Jeyarani
2015 Indian Journal of Science and Technology  
efficient semantic video retrieval.  ...  Due to these reasons semantic video retrieval became a challenging issue in various industries.  ...  The main problem in this method is memory cost for the allocation of voting maps during the object localization.  ... 
doi:10.17485/ijst/2015/v8i35/77061 fatcat:2htopyojqjd7bkjt6mx66cf24i

Weakly Supervised Video Salient Object Detection via Point Supervision [article]

Shuyong Gao, Haozhe Xing, Wei Zhang, Yan Wang, Qianyu Guo, Wenqiang Zhang
2022 arXiv   pre-print
Video salient object detection models trained on pixel-wise dense annotation have achieved excellent performance, yet obtaining pixel-by-pixel annotated datasets is laborious.  ...  for dense prediction), has not been explored.  ...  There has been some research on point annotations in weakly supervised segmentation [2, 43] and instance segmentation [3, 29, 30, 37]. Bearman et al.  ... 
arXiv:2207.07269v1 fatcat:6qb3t6nkm5aktonkb7wot4d72q

MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection [article]

Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, Wei-Shi Zheng
2020 arXiv   pre-print
We address the weakly supervised video highlight detection problem for learning to detect segments that are more attractive in training videos given their video event label but without expensive supervision  ...  In this work, we propose casting weakly supervised video highlight detection modeling for a given specific event as a multiple instance ranking network (MINI-Net) learning.  ...  Weakly supervised person re-identification.  ... 
arXiv:2007.09833v2 fatcat:i66exkby2fbejmir2jftgfcsqm

Self-supervised Learning for Semi-supervised Temporal Language Grounding [article]

Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
2021 arXiv   pre-print
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.  ...  Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance  ...  Weakly supervised alignment network for weakly-supervised video moment re- dense event captioning in videos.  ... 
arXiv:2109.11475v2 fatcat:2qmfaum4off4dmxzbvgpgj2hty

Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey [article]

Gaoang Wang, Mingli Song, Jenq-Neng Hwang
2022 arXiv   pre-print
Unlike other computer vision tasks, such as image classification, object detection, re-identification, and segmentation, embedding methods in MOT have large variations, and they have never been systematically  ...  Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories.  ...  As a result, these are actually weakly supervised approaches.  ... 
arXiv:2205.10766v1 fatcat:p7s7lnnlsnadrhsdcmwlg7msfy

Detector-Free Weakly Supervised Group Activity Recognition [article]

Dongkeun Kim, Jinsung Lee, Minsu Cho, Suha Kwak
2022 arXiv   pre-print
Motivated by this, we propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detector.  ...  Existing models for this task are often impractical in that they demand ground-truth bounding box labels of actors even in testing or rely on off-the-shelf object detectors.  ...  Third, object detection is costly to itself and imposes additional overheads in both computation and memory.  ... 
arXiv:2204.02139v1 fatcat:64gp6rrlfrdo3nptrpess65vya

Temporal Context Aggregation for Video Retrieval with Contrastive Learning [article]

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue
2020 arXiv   pre-print
In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features  ...  To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity  ...  In terms of the sequence models, the Long Short-Term Memory (LSTM) [20] and Gated Recurrent Unit (GRU) [8] are commonly used for video re-localization and copy detection [13, 22] .  ... 
arXiv:2008.01334v2 fatcat:yyohxhiq45cipewfj3qyr43oaa

Fine-Grained Instance-Level Sketch-Based Video Retrieval [article]

Peng Xu, Kun Liu, Tao Xiang, Timothy M. Hospedales, Zhanyu Ma, Jun Guo, Yi-Zhe Song
2020 arXiv   pre-print
We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strong and weakly supervised settings.  ...  We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.  ...  Training with Strong Supervision Recall that our sketch queries can contain multiple pages corresponding to different segments/sub-clips within the video clip, and that the detailed correspondence is annotated  ... 
arXiv:2002.09461v1 fatcat:5ryrizjjbnabhcjmtxptxes6ae