Showing results 1-15 of 20,388 hits (4.2 sec)

Identity-Aware Multi-Sentence Video Description [article]

Jae Sung Park, Trevor Darrell, Anna Rohrbach
2020 arXiv   pre-print
We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips.  ...  We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities.  ...  We then present our second task, Identity-Aware Video Description, which aims to generate multi-sentence video descriptions with local person IDs.  ... 
arXiv:2008.09791v1 fatcat:k5sl2e7t4bbenpv7j7v6jufssq

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [article]

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
2022 arXiv   pre-print
Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality  ...  We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks.  ...  Thanks to the large-scale pre-training on the proposed HD-VILA-100M dataset, the multi-modal encoders are able to provide vision-aware text embeddings and text-aware vision embeddings, which benefit downstream  ... 
arXiv:2111.10337v2 fatcat:4cxblk4v3bhizldnifbmjqxk74
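
The "vision-aware text embedding and text-aware vision embedding" this snippet mentions is the kind of output a cross-modal attention block produces: each stream attends to the other modality. A minimal sketch of that general mechanism, assuming a shared 768-d feature space and standard multi-head attention; this illustrates the idea, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One cross-attention block: the query stream attends to the context
    stream, making its output "aware" of the other modality."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)  # residual + norm

# Toy shapes: 20 text tokens, 50 video patches, shared 768-d space.
text = torch.randn(2, 20, 768)
video = torch.randn(2, 50, 768)
block = CrossModalAttention()
vision_aware_text = block(text, video)   # (2, 20, 768)
text_aware_vision = block(video, text)   # (2, 50, 768)
```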

The Eighth Dialog System Technology Challenge [article]

Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras (+7 others)
2019 arXiv   pre-print
In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware  ...  At the end, the questioner summarized the events in the video as a video description. This downstream task incentivized the questioner to collect useful answers for the video description.  ...  Audio visual scene-aware dialog track (Section 4) is another follow-up track of DSTC7 which aims to generate dialog responses using multi-modal information given in an input video.  ... 
arXiv:1911.06394v1 fatcat:lefow54nkngendt3hfjbmnc2ou

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [article]

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao
2020 arXiv   pre-print
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).  ...  STVG has two challenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) We deal with multi-form  ...  We discard video-triplet pairs that are too hard to describe precisely. And a video-triplet pair may correspond to multiple sentences.  ... 
arXiv:2001.06891v3 fatcat:df3uigkdrzbxdnfcze3ydpicpi
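
For readers new to the task: the output STVG asks for is a spatio-temporal object tube, i.e. a temporal segment plus one bounding box per frame inside it. A minimal data-structure sketch (field names are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class ObjectTube:
    """A spatio-temporal tube: a temporal segment of an untrimmed video
    plus one box for every frame of that segment."""
    start_frame: int
    end_frame: int    # inclusive
    boxes: List[Box]  # len == end_frame - start_frame + 1
```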

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers [article]

Carmelo Scribano, Davide Sapienza, Giorgia Franchini, Micaela Verucchi, Marko Bertogna
2021 arXiv   pre-print
The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, (ii) a convolutional backbone along with a Transformer model to embed the visual  ...  three identical sentences within a sentence-triplet.  ...  In particular, at sentence-level, the most frequent sentence occurs 53 times, while at sentence-triplet-level 15 sentence-triplets have two identical sentences among the three, and in one case there are  ... 
arXiv:2106.10153v1 fatcat:3dovt6ygkrcjffsq3ixcxaruau
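
The building blocks listed in the snippet map onto off-the-shelf components. A minimal sketch, assuming bert-base-uncased for the text branch, a ResNet-50 backbone for frame features, and a small Transformer encoder for temporal aggregation; the layer sizes and pooling choices are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

backbone = resnet50(weights="DEFAULT")
backbone.fc = nn.Identity()          # keep the 2048-d pooled frame features

frame_proj = nn.Linear(2048, 768)    # project frames into BERT's space
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)

def embed_text(description: str) -> torch.Tensor:
    tokens = tokenizer(description, return_tensors="pt")
    return bert(**tokens).pooler_output               # (1, 768)

def embed_video(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, 224, 224) crops of one tracked vehicle
    feats = frame_proj(backbone(frames)).unsqueeze(0)  # (1, T, 768)
    return temporal(feats).mean(dim=1)                 # (1, 768)
```

Retrieval then reduces to ranking gallery tracks by cosine similarity between the two embeddings.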

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [article]

Jianwei Yang, Yonatan Bisk, Jianfeng Gao
2021 arXiv   pre-print
Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning.  ...  The first is the token-aware contrastive loss, which is computed by taking into account the syntactic classes of words.  ...  We ensure that the output feature dimension of the video encoder is identical to that of the language encoder.  ... 
arXiv:2108.09980v1 fatcat:d3lha34hs5abtifgwmt3pskhmm
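
A hedged sketch of what a token-aware contrastive term can look like: token-to-video alignment scores enter an InfoNCE-style loss only where a syntactic-class mask (e.g. nouns and verbs) is true. The masking rule and temperature below are assumptions in the spirit of the abstract, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def token_aware_contrastive(text_tok, video_emb, content_mask, tau=0.07):
    # text_tok: (B, L, D) token embeddings; video_emb: (B, D) pooled videos;
    # content_mask: (B, L) bool, True for tokens whose syntactic class counts.
    text_tok = F.normalize(text_tok, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    # Similarity of every token to every video in the batch: (B, L, B).
    sim = torch.einsum("bld,kd->blk", text_tok, video_emb) / tau
    target = (torch.arange(sim.size(0), device=sim.device)
              .view(-1, 1).expand(-1, sim.size(1)))          # (B, L)
    per_token = F.cross_entropy(sim.permute(0, 2, 1), target,
                                reduction="none")            # (B, L)
    # Average only over the selected syntactic classes.
    return (per_token * content_mask).sum() / content_mask.sum().clamp(min=1)

loss = token_aware_contrastive(
    torch.randn(4, 12, 256), torch.randn(4, 256), torch.rand(4, 12) > 0.5)
```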

Video Storytelling [article]

Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli
2018 arXiv   pre-print
While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation.  ...  In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos.  ...  Recent works [1] , [2] aim to provide more comprehensive and fine-grained image descriptions by generating multi-sentence paragraphs.  ... 
arXiv:1807.09418v1 fatcat:7fgnnfm33ngspgqlnpijppq4hu

A Survey on Temporal Sentence Grounding in Videos [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
2021 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community  ...  To the best of our knowledge, this is the first systematic survey on temporal sentence grounding.  ...  Temporal Sentence Grounding in Videos (TSGV) is the task of matching a descriptive sentence with the one segment (or moment) of an untrimmed video that shares its semantics.  ... 
arXiv:2109.08039v2 fatcat:6ja4csssjzflhj426eggaf77tu
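
TSGV predictions are conventionally scored by temporal IoU between the predicted and ground-truth segments; the R@n,IoU=m metrics common in this literature build on it. A minimal implementation of that standard overlap measure:

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection-over-union of two temporal segments (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Segments (5, 15) and (10, 20) overlap for 5 s over a 15 s span: IoU = 1/3.
assert abs(temporal_iou((5.0, 15.0), (10.0, 20.0)) - 1 / 3) < 1e-9
```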

Multi-Task Video Captioning with Video and Entailment Generation

Ramakanth Pasunuru, Mohit Bansal
2017 Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
We improve video captioning by sharing knowledge with two related directed-generation tasks: a temporally-directed unsupervised video prediction task to learn richer context-aware video encoder representations  ...  We also show mutual multi-task improvements on the entailment generation task.  ...  On the other hand, the many-to-one multi-task (with entailment generation) seems to be stronger at generating a caption which is a logically-implied entailment of a ground-truth caption, e.g., "a cat is  ... 
doi:10.18653/v1/p17-1117 dblp:conf/acl/PasunuruB17 fatcat:52ykqnferfbkhbhhyko3g5tkxe
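
One generic way to realize this kind of parameter sharing is to alternate mini-batches across the three generation tasks through a shared encoder-decoder. A sketch under that assumption; the round-robin schedule and the task-conditioned model interface are illustrative, not necessarily the authors' training regime.

```python
import itertools

def train_multi_task(model, task_iters, optimizer, steps=1000):
    # task_iters: dict mapping a task name ("caption", "video_pred",
    # "entailment") to an endless iterator of (inputs, targets) batches.
    schedule = itertools.cycle(task_iters)    # round-robin over task names
    for _ in range(steps):
        task = next(schedule)
        inputs, targets = next(task_iters[task])
        loss = model(task, inputs, targets)   # hypothetical: returns the
        optimizer.zero_grad()                 # task-conditioned loss
        loss.backward()
        optimizer.step()
```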

Multi-Task Video Captioning with Video and Entailment Generation [article]

Ramakanth Pasunuru, Mohit Bansal
2017 arXiv   pre-print
We improve video captioning by sharing knowledge with two related directed-generation tasks: a temporally-directed unsupervised video prediction task to learn richer context-aware video encoder representations  ...  We also show mutual multi-task improvements on the entailment generation task.  ...  On the other hand, the many-to-one multi-task (with entailment generation) seems to be stronger at generating a caption which is a logically-implied entailment of a ground-truth caption, e.g., "a cat is  ... 
arXiv:1704.07489v2 fatcat:ila3iyjwmrcgpbamyh34xwdjee

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus.  ...  The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description.  ...  The dataset was initially conceived to contain multi-lingual descriptions; however, we only consider captions in the English language.  ... 
doi:10.1109/cvpr.2017.339 dblp:conf/cvpr/BaraldiGC17 fatcat:xf2sb2mdfrgkvfcxbnsyhow5nm
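
The title's boundary-aware encoding is commonly realized as a recurrent encoder whose state is (softly) reset when a learned boundary detector fires, so each detected shot is summarized independently. A sketch under that assumption, with a GRU cell standing in for the paper's recurrent unit:

```python
import torch
import torch.nn as nn

class BoundaryAwareGRU(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hid_dim)
        self.boundary = nn.Linear(in_dim + hid_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, in_dim) features of one video
        h = frames.new_zeros(self.cell.hidden_size)
        outputs = []
        for x in frames:
            gate = torch.sigmoid(self.boundary(torch.cat([x, h])))
            h = h * (1 - gate).squeeze()  # soft reset at detected boundaries
            h = self.cell(x.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            outputs.append(h)
        return torch.stack(outputs)       # (T, hid_dim)
```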

Cross-Modal Progressive Comprehension for Referring Segmentation [article]

Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li
2021 arXiv   pre-print
In this way, multi-level features can communicate with each other and be mutually refined based on the textual context.  ...  image and video segmentation models.  ...  A2D Sentences is extended from the Actor-Action Dataset [49] by providing textual descriptions for each video.  ... 
arXiv:2105.07175v1 fatcat:z34rf37pnzgtbgcbcranimaqvy
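
A rough sketch of what "multi-level features refined by the textual context" can look like: each pyramid level is gated by the sentence embedding, then the levels exchange information through a shared mixing layer. All module choices here are illustrative assumptions, not the paper's CMPC design.

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.modulate = nn.Linear(dim, dim)
        self.exchange = nn.Linear(num_levels * dim, num_levels * dim)

    def forward(self, levels, text):
        # levels: list of (B, dim) pooled features, one per pyramid level;
        # text: (B, dim) sentence embedding.
        gated = [lvl * torch.sigmoid(self.modulate(text)) for lvl in levels]
        mixed = self.exchange(torch.cat(gated, dim=-1))
        return list(mixed.chunk(len(levels), dim=-1))

fusion = TextGuidedFusion(dim=256, num_levels=3)
refined = fusion([torch.randn(2, 256) for _ in range(3)], torch.randn(2, 256))
```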

Multi-Attention Network for Compressed Video Referring Object Segmentation [article]

Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li
2022 arXiv   pre-print
a content-aware dynamic kernel and to predict final segmentation masks.  ...  To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, namely on the original video data flow.  ...  Extending [9], J-HMDB Sentences has 928 corresponding sentence descriptions for 928 videos. Refer-YouTube-VOS.  ... 
arXiv:2207.12622v1 fatcat:z7ubyseaubeh7fh3tclioo7r4a
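
A generic sketch of mask prediction with a content-aware dynamic kernel, the mechanism the snippet names: a small controller head predicts per-object 1x1 convolution weights, which are then applied to the feature map to produce mask logits. The controller design and kernel shape are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskHead(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.feat_dim = feat_dim
        self.controller = nn.Linear(feat_dim, feat_dim + 1)  # weights + bias

    def forward(self, feat_map, query):
        # feat_map: (1, C, H, W); query: (C,) content vector for one object.
        params = self.controller(query)
        weight = params[: self.feat_dim].view(1, self.feat_dim, 1, 1)
        bias = params[self.feat_dim:]
        return F.conv2d(feat_map, weight, bias)  # (1, 1, H, W) mask logits

head = DynamicMaskHead()
mask = head(torch.randn(1, 256, 64, 64), torch.randn(256))
```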

Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [article]

Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu, Xiuqiang He
2020 arXiv   pre-print
Video moment retrieval aims to localize the target moment in a video according to the given sentence. The weakly-supervised setting only provides video-level sentence annotations during training.  ...  Concretely, we first devise a language-aware filter to generate an enhanced video stream and a suppressed video stream.  ...  They regard matched video-sentence pairs as positive samples and unmatched video-sentence pairs as negative samples.  ... 
arXiv:2008.08257v1 fatcat:kvuetkpc6nee7pj72jxry36l5i
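
The positive/negative pairing described in the last snippet is typically turned into a margin-based ranking objective over in-batch pairs; a minimal sketch, where cosine-similarity scoring and the margin value are assumptions:

```python
import torch
import torch.nn.functional as F

def video_sentence_ranking_loss(video_emb, sent_emb, margin=0.2):
    # video_emb, sent_emb: (B, D); row i of each forms a matched pair.
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(sent_emb, dim=-1)
    sim = v @ s.t()                    # (B, B) pairwise similarities
    pos = sim.diag().unsqueeze(1)      # matched-pair scores
    # Hinge on every unmatched pair, in both retrieval directions.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_v2s = F.relu(margin + sim - pos)[off_diag].mean()
    loss_s2v = F.relu(margin + sim.t() - pos)[off_diag].mean()
    return loss_v2s + loss_s2v
```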

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).  ...  STVG has two challenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) We deal with multi-form  ...  We discard video-triplet pairs that are too hard to describe precisely. And a video-triplet pair may correspond to multiple sentences.  ... 
doi:10.1109/cvpr42600.2020.01068 dblp:conf/cvpr/ZhangZZWLG20 fatcat:umcnf6qsajcezfx6k2sa2a6t5e