
Guest Editorial Introduction to the Special Section on Video and Language

Tao Mei, Jason J. Corso, Gunhee Kim, Jiebo Luo, Chunhua Shen, Hanwang Zhang
2022 IEEE Transactions on Circuits and Systems for Video Technology (Print)  
Visual Captioning In [A2], Li et al. present a novel adaptive spatial attention mechanism for video captioning.  ...  This approach not only boosts sentence generation with adaptively learnt spatial locations but also reduces the time and memory consumption caused by temporal redundancy across frames.  ...  Prior to joining JD.com in 2018, he was a Senior Research Manager with Microsoft Research Asia.  ... 
doi:10.1109/tcsvt.2021.3137430 fatcat:ksel3hruujgwfpwalj4u5ebebu
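
A minimal sketch, assuming a standard additive-attention formulation rather than the exact mechanism of [A2], of how adaptive spatial attention over per-frame region features can condition on the caption decoder's state (module and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Additive attention over region features, conditioned on the decoder state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.w_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) region features of one frame
        # hidden:  (B, hidden_dim)  decoder state at the current word step
        scores = self.w_score(torch.tanh(
            self.w_feat(regions) + self.w_hidden(hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)            # spatial attention weights
        context = (alpha * regions).sum(dim=1)      # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)

# usage: 36 region features of 2048-d, a 512-d decoder state
attn = SpatialAttention(feat_dim=2048, hidden_dim=512)
context, alpha = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
```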

Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding

Shanshan Qi, Luxi Yang, Chunguo Li, Yongming Huang
2021 IEEE Access  
Moreover, this task needs to detect the location clues precisely from both the spatial and temporal dimensions, but the relationship between spatial-temporal semantic information and the query sentence is still  ...  Temporal sentence grounding aims to ground a query sentence into a specific segment of the video.  ...  The balance hyper-parameter ϕ is set to 0.001, and µ is set to 0.25 empirically. The number of nearest cluster centers ω is set to 7 for ActivityNet Captions and 10 for TACoS.  ... 
doi:10.1109/access.2021.3095229 fatcat:jow3mzxavfaohemaxgk4d2buey
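
The snippet lists a nearest-cluster-centers hyper-parameter ω (7 for ActivityNet Captions, 10 for TACoS). A small illustrative sketch, not the paper's actual module, of how such a hyper-parameter might be used to softly assign a feature to its ω closest codebook centers:

```python
import numpy as np

def nearest_center_assignment(feature, centers, omega=7):
    """feature: (D,) query feature; centers: (K, D) learned cluster centers.
    Returns the indices and normalized weights of the omega nearest centers."""
    dists = np.linalg.norm(centers - feature, axis=1)   # distance to every center
    idx = np.argsort(dists)[:omega]                     # keep the omega closest
    weights = np.exp(-dists[idx])
    weights /= weights.sum()                            # soft assignment weights
    return idx, weights

centers = np.random.randn(64, 512)                      # hypothetical codebook
idx, w = nearest_center_assignment(np.random.randn(512), centers, omega=7)
```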

OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [article]

Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo
2020 arXiv   pre-print
locations.  ...  To demonstrate the effectiveness, we conduct experiments on the new dataset and compare it with state-of-the-art video captioning methods.  ...  [13] proposed dense video captioning, which bridges two separate tasks: temporal action localization and video captioning.  ... 
arXiv:2003.03715v5 fatcat:g5trretzdjauplie7estebze2a
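
This entry and the next both describe dense video captioning as bridging two tasks: temporal action localization and caption generation. A schematic of that two-stage pattern, purely illustrative and not the OVC-Net pipeline itself (`proposer` and `captioner` are hypothetical callables):

```python
def dense_video_captioning(video_features, proposer, captioner, score_threshold=0.5):
    """video_features: per-frame features; proposer yields (start, end, score) spans."""
    events = []
    for start, end, score in proposer(video_features):    # stage 1: temporal localization
        if score < score_threshold:
            continue
        caption = captioner(video_features[start:end])     # stage 2: caption the segment
        events.append({"start": start, "end": end, "caption": caption})
    return events
```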

Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning

Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo
2020 IEEE Access  
[16] propose dense video captioning, which bridges two separate tasks: temporal action localization and video captioning.  ...  We perform experiments on the new dataset and compare with state-of-the-art methods for video captioning.  ... 
doi:10.1109/access.2020.3021857 fatcat:bhgzumf2yjdj3bpdxvu4sttqyy

End-to-End Dense Video Captioning with Parallel Decoding [article]

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo
2021 arXiv   pre-print
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.  ...  In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), formulating dense caption generation as a set prediction task.  ...  Thus, we add a modulation factor γ to rectify the influence of caption length. µ is the balance factor. Set prediction loss.  ... 
arXiv:2108.07781v2 fatcat:qdgnb5iuy5eahefughmepovxuq
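
The snippet mentions a modulation factor γ that rectifies the influence of caption length and a balance factor µ inside a set prediction loss. A hedged sketch of how such factors could enter a combined loss; the exact composition used by PDVC is not reproduced here, and these formulas are assumptions:

```python
import torch
import torch.nn.functional as F

def caption_loss_with_length_modulation(logits, targets, lengths, gamma=2.0):
    """logits: (B, T, V) word logits; targets: (B, T) word ids; lengths: (B,)."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    # normalize the summed loss by length**gamma so long captions do not dominate
    return (ce.sum(dim=1) / lengths.float() ** gamma).mean()

def set_prediction_loss(loc_loss, cls_loss, cap_loss, mu=0.5):
    # mu balances the localization/classification terms against the caption term
    return mu * (loc_loss + cls_loss) + (1.0 - mu) * cap_loss
```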

Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description

Kai Shen, Lingfei Wu, Fangli Xu, Siliang Tang, Jun Xiao, Yueting Zhuang
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
The task of Grounded Video Description (GVD) is to generate sentences whose objects can be grounded with the bounding boxes in the video frames.  ...  To address these issues, we cast the GVD task as a spatial-temporal Graph-to-Sequence learning problem, where we model video frames as a spatial-temporal sequence graph in order to better capture implicit  ...  values for the normalized spatial location and 1 value for the normalized frame index.  ... 
doi:10.24963/ijcai.2020/131 dblp:conf/ijcai/ShenWXT0Z20 fatcat:irwpzeaj3ncihhykuch3unmgxm
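
The snippet indicates that each graph node carries values for the normalized spatial location plus one value for the normalized frame index. A minimal sketch of building such a node feature from a detected box; the exact layout (four box coordinates plus a frame index) is an assumption:

```python
import numpy as np

def node_feature(appearance, box, frame_idx, frame_size, num_frames):
    """appearance: (D,) region feature; box: (x1, y1, x2, y2) in pixels."""
    w, h = frame_size
    x1, y1, x2, y2 = box
    spatial = np.array([x1 / w, y1 / h, x2 / w, y2 / h])   # normalized spatial location
    temporal = np.array([frame_idx / num_frames])          # normalized frame index
    return np.concatenate([appearance, spatial, temporal])

feat = node_feature(np.random.randn(1024), (40, 60, 200, 300),
                    frame_idx=12, frame_size=(640, 360), num_frames=64)
```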

Joint Event Detection and Description in Continuous Video Streams [article]

Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko
2018 arXiv   pre-print
Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events.  ...  Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions.  ...  loss for word prediction.  ... 
arXiv:1802.10250v3 fatcat:vypwgfxc75e5vlot4sldiiwtca
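
A rough sketch of the pattern the snippet describes, i.e. encoding the stream with 3D convolutions, pooling features over candidate spans, and scoring variable-length event proposals; the layer sizes and scoring head here are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TinyProposalEncoder(nn.Module):
    def __init__(self, in_ch=3, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # collapse space, keep the temporal axis
        )
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, clip, spans):
        # clip: (B, C, T, H, W) frames; spans: list of (start, end) frame indices
        feats = self.encoder(clip).squeeze(-1).squeeze(-1)         # (B, feat_dim, T)
        scores = [torch.sigmoid(self.scorer(feats[:, :, s:e].mean(dim=2)))
                  for s, e in spans]                               # pool each variable-length span
        return torch.stack(scores, dim=1)                          # (B, num_spans, 1)

model = TinyProposalEncoder()
out = model(torch.randn(1, 3, 32, 56, 56), spans=[(0, 8), (4, 20), (10, 32)])
```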

Learning to Generate Grounded Visual Captions without Localization Supervision [article]

Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira
2020 arXiv   pre-print
When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions  ...  We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference, for both image and video captioning tasks.  ...  [4], video captioning [40, 9], captioning and drawing [15] as well as domain adaptation [13].  ... 
arXiv:1906.00283v3 fatcat:24igmbdcdbfv5i3d5cgxzr4gke
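
One common way to quantify how well a caption is grounded, which this line of work evaluates, is to check whether the most-attended region at an object word overlaps its annotated box. A sketch of that check; the IoU threshold and protocol details are assumptions, not the paper's exact metric:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(attended_boxes, gt_boxes, threshold=0.5):
    """Boxes are (x1, y1, x2, y2) tuples, one per generated object word."""
    hits = sum(iou(a, g) > threshold for a, g in zip(attended_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```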

Move Forward and Tell: A Progressive Generator of Video Descriptions [chapter]

Yilei Xiong, Bo Dai, Dahua Lin
2018 Lecture Notes in Computer Science  
On the ActivityNet Captions dataset, our method demonstrated the capability of generating high-quality paragraph descriptions for videos.  ...  We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips.  ...  for image captioning often lead to the loss of temporal information.  ... 
doi:10.1007/978-3-030-01252-6_29 fatcat:twbxfif36rgtrcxa2m3wwbo2uu

Semantic context driven language descriptions of videos using deep neural network

Dinesh Naik, C. D. Jaidhar
2022 Journal of Big Data  
mechanism working as the decoder with a hybrid loss function.  ...  Visual feature vectors extracted from the video frames using a 2D-CNN model capture spatial features.  ...  increased focus on the required captions appropriate for the image locations.  ... 
doi:10.1186/s40537-022-00569-4 fatcat:6v4lbr6eiva2xnsnxwwrhfxdda
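
The snippet describes extracting spatial features from video frames with a 2D-CNN encoder. A small sketch of that step; using ResNet-50 from torchvision as the backbone is an assumption, not necessarily the model used in the paper:

```python
import torch
import torchvision.models as models

# pretrained 2D CNN with its classification head removed
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

with torch.no_grad():
    frames = torch.randn(16, 3, 224, 224)         # 16 sampled video frames
    feats = feature_extractor(frames).flatten(1)  # (16, 2048) per-frame spatial features
```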

Coupled Recurrent Network (CRN) [article]

Lin Sun, Kui Jia, Yuejia Shen, Silvio Savarese, Dit Yan Yeung, and Bertram E. Shi
2019 arXiv   pre-print
Different from RNNs, which accumulate the training loss at each time step or only at the last time step, we propose an effective and efficient training strategy for CRN.  ...  For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos.  ...  Each verb is associated with over 1,000 videos, resulting in a large balanced dataset for learning a basis of dynamical events from videos.  ... 
arXiv:1812.10071v2 fatcat:3dtgqzloezh3lblddavymi2oja
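
The snippet contrasts CRN's training strategy with the two common RNN choices of accumulating the loss at every time step or only at the last step. A minimal illustration of those two baselines (CRN's own strategy is not reproduced here):

```python
import torch.nn.functional as F

def per_step_loss(outputs, targets):
    """outputs: (B, T, C) per-step logits; targets: (B, T) labels. Loss at every step."""
    return F.cross_entropy(outputs.reshape(-1, outputs.size(-1)), targets.reshape(-1))

def last_step_loss(outputs, targets):
    """Supervise only the final time step."""
    return F.cross_entropy(outputs[:, -1, :], targets[:, -1])
```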

Revitalize Region Feature for Democratizing Video-Language Pre-training [article]

Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, Mike Zheng Shou
2022 arXiv   pre-print
Despite the impressive results, VLP research has become extremely expensive, requiring massive data and long training times, which prevents further exploration.  ...  Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language  ...  These backbones are designed with 2D [20] and 3D [19] CNNs to capture spatial and temporal information in videos.  ... 
arXiv:2203.07720v2 fatcat:d2dp2lz4fnfvpasiqzkpmmv5bi

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction [article]

Mohit Shridhar, David Hsu
2017 arXiv   pre-print
Human language is one of the most natural interfaces for humans to interact with robots.  ...  A core issue for the system is semantic and spatial grounding, which is to infer objects and their spatial relationships from images and natural language expressions.  ...  We thank members of the Adaptive Computing Lab at NUS for thoughtful discussions. We also thank the reviewers for their insightful feedback.  ... 
arXiv:1707.05720v1 fatcat:vv72jqpm5nfcxpexpnfnya6coi

Move Forward and Tell: A Progressive Generator of Video Descriptions [article]

Yilei Xiong, Bo Dai, Dahua Lin
2018 arXiv   pre-print
On the ActivityNet Captions dataset, our method demonstrated the capability of generating high-quality paragraph descriptions for videos.  ...  We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips.  ...  for image captioning often lead to the loss of temporal information.  ... 
arXiv:1807.10018v1 fatcat:xhay72cq4fhzjntpsgoveomvfy

Describing like humans: on diversity in image captioning [article]

Qingzhong Wang, Antoni B. Chan
2019 arXiv   pre-print
We also show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions  ...  Therefore, only evaluating accuracy is not sufficient for measuring the performance of captioning models; the diversity of the generated captions should also be considered.  ...  Recently, encoder-decoder models, e.g., neural image captioning (NIC) [29], spatial attention [34] and adaptive attention [19], trained end-to-end have obtained much better results than the early  ... 
arXiv:1903.12020v3 fatcat:qmgxfny3mbf2jomdgecnly4a3y
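
The snippet describes balancing the cross-entropy loss against a CIDEr reward during reinforcement learning to trade off accuracy and diversity. A hedged sketch of one standard way to mix those two terms (the self-critical baseline and mixing weight λ are assumptions, not necessarily the paper's formulation):

```python
import torch  # all inputs below are torch tensors

def mixed_caption_loss(ce_loss, sample_logprobs, sample_reward, greedy_reward, lam=0.7):
    """ce_loss: scalar cross-entropy; sample_logprobs: (B,) log-probs of sampled
    captions; *_reward: (B,) CIDEr scores of sampled vs. greedily decoded captions."""
    advantage = sample_reward - greedy_reward                  # self-critical baseline
    rl_loss = -(advantage.detach() * sample_logprobs).mean()   # REINFORCE-style term
    return (1.0 - lam) * ce_loss + lam * rl_loss               # larger lam favors the reward
```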
Showing results 1–15 of 3,061