Video Object Segmentation with Language Referring Expressions [article]

Anna Khoreva, Anna Rohrbach, Bernt Schiele
2019 arXiv   pre-print
In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions.  ...  To evaluate our method we augment the popular video object segmentation benchmarks, DAVIS'16 and DAVIS'17 with language descriptions of target objects.  ...  A Referring expressions for video object segmentation As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS 16 [38]  ... 
arXiv:1803.08006v3 fatcat:qzv4vpl4ojap3lriyexugtycby

Video Object Linguistic Grounding

Alba Herrera-Palacio, Carles Ventura, Xavier Giro-i-Nieto
2019 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications - MULEA '19  
Figure 1: Example of the semi-supervised video object segmentation problem using language referring expressions from [3]. ABSTRACT: The goal of this work is segmenting on a video sequence the objects which  ...  over the video frames, making the segmentation of the objects temporally consistent along the sequence.  ...  EXPERIMENTAL RESULTS: Here we present our video object segmentation results on the DAVIS17 dataset [5] with language referring expressions [3].  ... 
doi:10.1145/3347450.3357662 fatcat:eoe5b3jf7jbbpkyvr724fsqt2y

SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation [article]

Ioannis Kazakos, Carles Ventura, Miriam Bellver, Carina Silberer, Xavier Giro-i-Nieto
2021 arXiv   pre-print
dataset with synthetic referring expressions for video object segmentation.  ...  Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation.  ...  We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used in this work.  ... 
arXiv:2106.04403v2 fatcat:huta6ela6zfe5k4r7jo7obpggy

RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation [article]

Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto
2020 arXiv   pre-print
The task of video object segmentation with referring expressions (language-guided VOS) is to, given a linguistic phrase and a video, generate binary masks for the object to which the phrase refers.  ...  We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state of the art results for language-guided  ...  Experiments We report results with our model on two different tasks: language-guided image segmentation and language-guided video object segmentation.  ... 
arXiv:2010.00263v1 fatcat:iz2c2wrcrjfhbdfz4jsfdab34i

Language as Queries for Referring Video Object Segmentation [article]

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo
2022 arXiv   pre-print
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.  ...  Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only.  ...  The model takes a video clip with the corresponding language expression as input and outputs the segmentation mask of the referred object in each frame.  ... 
arXiv:2201.00487v2 fatcat:uhk7jvi7uzbktd3ps6ty76qdbq

Localizing Moments in Video with Natural Language [article]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2017 arXiv   pre-print
A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding  ...  Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring  ...  Datasets for natural language object retrieval include referring expressions which can uniquely localize a specific location in an image.  ... 
arXiv:1708.01641v1 fatcat:sgrv3qlhhfaujh6szkoxgwgmqa

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation [article]

Chen Liang, Yu Wu, Yawei Luo, Yi Yang
2022 arXiv   pre-print
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.  ...  /referring expressions.  ...  Related Work Referring Image Segmentation Referring expression segmentation aims at precisely localizing the entity referred by a natural language expression with a pixel-level segmentation mask.  ... 
arXiv:2103.10702v3 fatcat:nmkubjdazvfrtpzx6ldtmzveia

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [article]

Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang
2021 arXiv   pre-print
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.  ...  First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.  ...  Introduction Referring video object segmentation (RVOS) targets at segmenting video objects referred by given language expressions.  ... 
arXiv:2106.01061v1 fatcat:6jdazlbzsrbn7mzv4cp76pzlme

YouRefIt: Embodied Reference Understanding with Language and Gesture [article]

Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Song-Chun Zhu, Tao Gao, Yixin Zhu, Siyuan Huang
2021 arXiv   pre-print
Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to.  ...  We study the understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment.  ...  Videos are segmented into short clips, with each clip containing exactly one reference instance. For each clip, we annotate the reference target (object) with a bounding box.  ... 
arXiv:2109.03413v2 fatcat:32j2f7ea3vbwlfr2wi5z2d6vna

Local-Global Context Aware Transformer for Language-Guided Video Segmentation [article]

Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, Yi Yang
2022 arXiv   pre-print
Further, our Locater based solution achieved the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge.  ...  We explore the task of language-guided video segmentation (LVS).  ...  INTRODUCTION: Language-guided video segmentation (LVS) [1], also known as language-queried video actor segmentation [2], aims to segment a specific object/actor in a video referred by a linguistic phrase  ... 
arXiv:2203.09773v1 fatcat:6u5mrlvg7rbithmv3xsdwfgqvi

Decoupled Spatial Temporal Graphs for Generic Visual Grounding [article]

Qianyu Feng, Yunchao Wei, Mingming Cheng, Yi Yang
2021 arXiv   pre-print
We further elaborate a new video dataset, GVG, that consists of challenging referring cases with far-ranging videos.  ...  This work, on the other hand, investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression, which is more challenging yet practical  ...  Visual grounding task [46, 24, 41] is first put forward to refer to objects in the image with expressions in natural language.  ... 
arXiv:2103.10191v1 fatcat:yeuulvtpvzax3itpbkbac3rz64

Regia: a metadata editor for audiovisual documents

Claudio Gennaro
2007 Multimedia tools and applications  
Regia allows the user to manually edit textual metadata and to hierarchically organize the segmentation of the audiovisual content.  ...  An important feature of this metadata editor is that it is not hard-wired with a particular metadata attributes set; for this purpose the XML schema of the metadata model is used by the editor as configuration  ...  By double-clicking on a keyframe associated with a segment (or a segment on the timeline) it is possible to follow the corresponding sub-Expression.  ... 
doi:10.1007/s11042-007-0129-4 fatcat:6dzhycszajak7mvtekjryo3ore

The Study of Subtitle Translation Based on Multi-Hierarchy Semantic Segmentation and Extraction in Digital Video

Wang Xuemei
2017 Humanities and Social Sciences  
of video object, on account of the consideration of each video object synchronization as well as temporal-spatial constraints related issues.  ...  This paper established a reasonable and effective multi-hierarchy semantic information descriptive model based on video segmentation and extraction technology to realize the mapping of video semantic information  ...  This thesis is the part achievements of 985 key construction disciplines of School of Foreign Languages of Xi 'an Jiaotong University.  ... 
doi:10.11648/j.hss.20170502.17 fatcat:oqmeelzcn5cebkdd5s2oh6uaua

Weak Supervision and Referring Attention for Temporal-Textual Association Learning [article]

Zhiyuan Fang, Shu Kong, Zhe Wang, Charless Fowlkes, Yezhou Yang
2020 arXiv   pre-print
The principle in our designed mechanism is to fully exploit 1) the weak supervision by considering informative and discriminative cues from intra-video segments anchored with the textual query, 2) multiple  ...  The weak supervision is simply a textual expression (e.g., short phrases or sentences) at video level, indicating this video contains relevant frames.  ...  [31] or object retrieval using language [16] .  ... 
arXiv:2006.11747v2 fatcat:bpqa6chthfgjhatmsgqq5t2dym

Cross-Modal Progressive Comprehension for Referring Segmentation [article]

Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li
2021 arXiv   pre-print
Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression.  ...  Combining CMPC-I or CMPC-V with TGFE can form our image or video version referring segmentation frameworks and our frameworks achieve new state-of-the-art performances on four referring image segmentation  ...  Given a natural language expression and an image/video as inputs, the goal of referring segmentation is to segment the entities referred by the subject of the input expression.  ... 
arXiv:2105.07175v1 fatcat:z34rf37pnzgtbgcbcranimaqvy