7,611 Hits in 2.0 sec

Multi-perspective Attention Network for Fast Temporal Moment Localization

JungKyoo Shin, Jinyoung Moon
2021 IEEE Access  
Inspired by the way humans understand an image from multiple perspectives and different contexts, we devise a novel multi-perspective attention mechanism consisting of perspective attention and multi-perspective  ...  Furthermore, multi-perspective modal interactions model the complex relationship between a video and sentence query, and obtain the modal-interacted memory, consisting of a visual feature that selectively  ...  As future work, we plan to employ the multi-perspective approach to other tasks requiring modal interactions between video and text, such as VQA and video captioning.  ... 
doi:10.1109/access.2021.3106698 fatcat:5xtxboezxrahveif6k2cwdtbee
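
A minimal sketch of the perspective-attention idea described above, assuming video clip features and a sentence-query embedding as inputs; the module names, dimensions, and fusion scheme are illustrative assumptions, not the authors' implementation:

```python
# Illustrative only: several attention "perspectives" over the same video clip
# features, conditioned on a sentence-query embedding, fused into one memory vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveAttention(nn.Module):
    def __init__(self, dim, n_perspectives=4):
        super().__init__()
        # one projection of the video features per perspective
        self.video_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_perspectives)])
        self.query_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(n_perspectives * dim, dim)

    def forward(self, video_feats, query_vec):
        # video_feats: (batch, n_clips, dim); query_vec: (batch, dim)
        q = self.query_proj(query_vec).unsqueeze(1)            # (batch, 1, dim)
        pooled = []
        for proj in self.video_projs:
            k = proj(video_feats)                              # (batch, n_clips, dim)
            scores = (k * q).sum(-1) / k.size(-1) ** 0.5       # (batch, n_clips)
            attn = F.softmax(scores, dim=-1).unsqueeze(-1)     # (batch, n_clips, 1)
            pooled.append((attn * video_feats).sum(dim=1))     # (batch, dim)
        return self.fuse(torch.cat(pooled, dim=-1))            # fused memory, (batch, dim)

memory = PerspectiveAttention(dim=512)(torch.randn(2, 64, 512), torch.randn(2, 512))
```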

Lifelogging caption generation via fourth-person vision in a human–robot symbiotic environment

Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume
2020 ROBOMECH Journal  
To validate our approach in this scenario, we collect perspective-aware lifelog videos and corresponding caption annotations.  ...  Subsequently, we propose a multi-perspective image captioning model composed of an image-wise salient region encoder, an attention module that adaptively fuses the salient regions, and a caption decoder  ...  To the best of our knowledge, this is the first work that focuses on multi-perspective images for improving caption generation.  ... 
doi:10.1186/s40648-020-00181-2 fatcat:fw5ndj6d7fg6dpnx55ugupck64
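
The snippet describes a salient-region encoder, an attention module that adaptively fuses the regions, and a caption decoder. Below is a hedged sketch of just the adaptive-fusion step, with hypothetical names and shapes (not the authors' code); its pooled output would be fed to a caption decoder:

```python
# Hypothetical sketch of adaptive fusion: per-image salient-region features are
# weighted by a learned attention score and pooled into a single vector.
import torch
import torch.nn as nn

class SalientRegionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, region_feats):
        # region_feats: (batch, n_regions, dim), one row per salient region
        weights = torch.softmax(self.score(region_feats), dim=1)  # (batch, n_regions, 1)
        return (weights * region_feats).sum(dim=1)                # (batch, dim)

fused = SalientRegionFusion(256)(torch.randn(4, 10, 256))  # would feed a caption decoder
```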

Disjoint Multi-task Learning between Heterogeneous Human-centric Tasks [article]

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, Youngjin Yoon, In So Kweon
2018 arXiv   pre-print
In this paper, we leverage existing single-task datasets for human action classification and captioning data for efficient human behavior learning.  ...  Since each dataset carries its own heterogeneous annotations, traditional multi-task learning is not effective in this scenario.  ...  When a test video is given, the trained multi-task network is used to predict its class and to extract its caption embedding, as depicted in Figure 2.  ... 
arXiv:1802.04962v1 fatcat:flqul2wjrbax7gndviagr5atjy
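
A rough sketch of the disjoint setting described in the snippet: a shared encoder with an action-classification head and a caption-embedding head, where each batch contributes only the loss for the annotation its source dataset provides. All shapes and losses below are illustrative assumptions:

```python
# Illustrative sketch (assumed structure, not the paper's code): a shared encoder
# with two heads; because the action and caption datasets are disjoint, each batch
# back-propagates only the loss for the annotation it actually has.
import torch
import torch.nn as nn

class DisjointMultiTaskNet(nn.Module):
    def __init__(self, feat_dim, n_classes, embed_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.cls_head = nn.Linear(512, n_classes)   # action classification head
        self.cap_head = nn.Linear(512, embed_dim)   # caption-embedding head

    def forward(self, video_feats):
        h = self.encoder(video_feats)
        return self.cls_head(h), self.cap_head(h)

net = DisjointMultiTaskNet(feat_dim=2048, n_classes=101, embed_dim=300)
logits, cap_emb = net(torch.randn(8, 2048))

# batch drawn from the action dataset -> classification loss only
cls_loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 101, (8,)))
# batch drawn from the captioning dataset -> embedding loss only
# (here against stand-in precomputed sentence embeddings)
cap_loss = nn.MSELoss()(cap_emb, torch.randn(8, 300))
```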

Video Captioning with Guidance of Multimodal Latent Topics

Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
We formulate the topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task.  ...  The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore, makes the video captioning task even more challenging.  ...  Evaluation of M&M TGM In this subsection, we introduce two baselines to show the effectiveness of topic guidance and our multi-task learning for video captioning.  ... 
doi:10.1145/3123266.3123420 dblp:conf/mm/ChenCJH17 fatcat:st3ogxnthbczhnr7kygbgf7psu
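
The multi-task formulation in the snippet, caption generation plus a parallel topic-prediction task, can be sketched as a weighted sum of two losses over a shared video representation. The weight, vocabulary size, and topic count below are made-up placeholders, not values from the paper:

```python
# Sketch of the multi-task idea: a shared multimodal video representation feeds
# both a captioning branch and a parallel topic-prediction branch.
import torch
import torch.nn as nn

video_repr = torch.randn(8, 512)                    # shared multimodal video feature
caption_logits = nn.Linear(512, 10000)(video_repr)  # stand-in for one caption decoder step
topic_logits = nn.Linear(512, 50)(video_repr)       # parallel topic-prediction head

caption_loss = nn.CrossEntropyLoss()(caption_logits, torch.randint(0, 10000, (8,)))
topic_loss = nn.CrossEntropyLoss()(topic_logits, torch.randint(0, 50, (8,)))

lambda_topic = 0.5                                  # hypothetical trade-off weight
total_loss = caption_loss + lambda_topic * topic_loss
```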

Video Captioning with Guidance of Multimodal Latent Topics [article]

Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
2017 arXiv   pre-print
We formulate the topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task.  ...  The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore, makes the video captioning task even more challenging.  ...  We utilize the multimodal topic mining approach to construct video topics automatically and take a teacher-student learning perspective to predict the latent topics purely from video multimodal contents  ... 
arXiv:1708.09667v2 fatcat:pf5ybcxzhnfufmasqctsfe3xtu

Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition  
video captioning architecture.  ...  A valid question is how to temporally localize and then describe events, which is known as "dense video captioning."  ...  Video Captioning.  ... 
doi:10.1109/cvpr.2018.00782 dblp:conf/cvpr/LiYPCM18 fatcat:jqxzsfs5b5harautf6gbcpruwa

Jointly Localizing and Describing Events for Dense Video Captioning [article]

Yehao Li and Ting Yao and Yingwei Pan and Hongyang Chao and Tao Mei
2018 arXiv   pre-print
video captioning architecture.  ...  A valid question is how to temporally localize and then describe events, which is known as "dense video captioning."  ...  Video Captioning.  ... 
arXiv:1804.08274v1 fatcat:nasjrj6vw5ds3mbif43rlgjkaa

Survey: Transformer based Video-Language Pre-training [article]

Ludan Ruan, Qin Jin
2021 arXiv   pre-print
Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances.  ...  We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks and commonly used video datasets.  ...  The pre-trained weights of VICTOR are further transferred to downstream tasks of multi-level video classification, content-based video recommendation, multi-modal video captioning, and cross-modal retrieval  ... 
arXiv:2109.09920v1 fatcat:ixysz5k4vrbktmf6cqftttls7m
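
As a toy illustration of the Single-Stream structure the survey refers to, video and text tokens can be concatenated into one sequence for a single transformer encoder, whereas a Multi-Stream model keeps per-modality encoders and exchanges information via cross-attention. The dimensions below are arbitrary:

```python
# Toy single-stream structure: video-clip tokens and text tokens are concatenated
# into one sequence and processed by a single transformer encoder.
import torch
import torch.nn as nn

video_tokens = torch.randn(2, 16, 768)   # e.g. 16 clip embeddings per video
text_tokens = torch.randn(2, 24, 768)    # e.g. 24 word-piece embeddings

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
joint = encoder(torch.cat([video_tokens, text_tokens], dim=1))  # (2, 40, 768)
```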

Generating Video Descriptions with Topic Guidance [article]

Shizhe Chen, Jia Chen, Qin Jin
2017 arXiv   pre-print
Generating video descriptions in natural language (a.k.a. video captioning) is a more challenging task than image captioning as videos are intrinsically more complicated than images in two aspects.  ...  As for topic prediction on test videos, we treat the topic mining model as the teacher to train the student, the topic prediction model, by utilizing the full multi-modalities in the video, especially the speech  ...  The multi-modal nature of video is also emphasized in video captioning. For the motion modality in videos, Yao et al.  ... 
arXiv:1708.09666v2 fatcat:u7ejhmf6q5duxk5q6y6lzimtxe
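
A hedged sketch of the teacher-student idea in the snippet: topic distributions mined by the teacher from the full multimodal content (including speech) supervise a student that must predict topics from visual features alone at test time. The distillation loss, feature sizes, and topic count are assumptions for illustration:

```python
# Teacher-student topic prediction, sketched: the teacher's mined topic
# distribution supervises a student that sees only visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_topics = 50
teacher_topics = F.softmax(torch.randn(8, n_topics), dim=-1)   # stand-in for mined topics

student = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, n_topics))
student_log_probs = F.log_softmax(student(torch.randn(8, 2048)), dim=-1)

# KL divergence between the teacher distribution and the student prediction
distill_loss = F.kl_div(student_log_probs, teacher_topics, reduction="batchmean")
```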

Multiple view perspectives

Raja S. Kushalnagar, Anna C. Cavender, Jehan-François Pâris
2010 Proceedings of the 12th international ACM SIGACCESS conference on Computers and accessibility - ASSETS '10  
Multiple View Perspectives (MVP) enables deaf and hard of hearing students to view and record multiple video views of a classroom presentation using a stand-alone solution.  ...  We show that deaf and hard of hearing students prefer multiple, focused videos over a single, high-quality video and that a compacted layout of only the most important views is preferred.  ...  We also thank Anshul Verma for volunteering to give the Bubble Sort lecture used in the evaluations, and the real-time captioner and sign language interpreter for their work in making the lecture accessible  ... 
doi:10.1145/1878803.1878827 dblp:conf/assets/KushalnagarCP10 fatcat:hlx4e47lrree3gp43apiaqvep4

Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding [article]

Hao Zhou, Chongyang Zhang, Yan Luo, Yanjun Chen, Chuanping Hu
2021 arXiv   pre-print
Moreover, we put forward new multi-label metrics to diversify the performance evaluation.  ...  Extensive experiments show that our approach is more effective and robust than state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets.  ...  ActivityNet Captions. This dataset [2] contains 19,209 videos and was originally proposed by [16] for the dense video captioning task.  ... 
arXiv:2103.16848v2 fatcat:ylhgid4mlrhrbbyrszmpqgzmce
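
For background only: temporal-grounding benchmarks such as Charades-STA are typically scored with a temporal-IoU recall like the one sketched below; the paper's new multi-label metrics generalize this single-label evaluation and are not reproduced here:

```python
# Standard temporal-IoU recall for moment localization (background sketch only).
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

print(recall_at_iou([(2.0, 7.5), (10.0, 15.0)], [(3.0, 8.0), (30.0, 35.0)]))  # 0.5
```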

Page 31 of SMPTE Motion Imaging Journal Vol. 107, Issue 9 [page]

1998 SMPTE Motion Imaging Journal  
perspective and a video / audio stream perspective and it provides connectivity to both domains.  ...  Content items symbol — for example, a typical video plus stereo audio plus Closed Caption programme in a single FC-AV Container; the formation of a multi-programme package from the Content items of several  ... 

Multi-modal information fusion for news story segmentation in broadcast video

Bailan Feng, Peng Ding, Jiansong Chen, Jinfeng Bai, Su Xu, Bo Xu
2012 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
In this paper, we propose a novel news story segmentation scheme which can segment broadcast video into story units with a multi-modal information fusion (MMIF) strategy.  ...  With the fast development of high-speed networks and digital video recording technologies, broadcast video has been playing an increasingly important role in our daily life.  ...  Visual cues: topic caption. The topic caption is one kind of artificial text in news video, which generally appears at the bottom of the frames.  ... 
doi:10.1109/icassp.2012.6288156 dblp:conf/icassp/FengDCBXX12 fatcat:g5g7fohlhre7fetvclum5ybsje
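
A simplified, purely illustrative take on late fusion for story segmentation (not the paper's MMIF strategy): per-shot boundary scores from visual, audio, and caption-text cues are combined with hypothetical weights and thresholded into story boundaries:

```python
# Weighted late fusion of per-shot boundary scores; weights and threshold are made up.
def fuse_boundary_scores(visual, audio, caption, weights=(0.5, 0.2, 0.3), threshold=0.6):
    boundaries = []
    for i, scores in enumerate(zip(visual, audio, caption)):
        fused = sum(w * s for w, s in zip(weights, scores))
        if fused >= threshold:
            boundaries.append(i)   # shot index flagged as a story boundary
    return boundaries

# toy per-shot scores in [0, 1] for three shots
print(fuse_boundary_scores([0.9, 0.1, 0.8], [0.7, 0.2, 0.3], [0.8, 0.1, 0.9]))  # [0, 2]
```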

SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning [article]

Chiranjib Sur
2020 arXiv   pre-print
Also, the multi-head attention transformer works on the principle of combining all possible contents for attention, which is good for natural language classification but has limitations for video captioning  ...  Video captioning rests on two fundamental concepts: feature detection and feature composition.  ...  [3] defined a masked transformer approach in which they provided the scope of multi-headed attention for the video captioning application.  ... 
arXiv:2006.14262v1 fatcat:h4gktmcwxvfmxm24rxseowfoq4
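
For reference, the standard multi-head attention that the snippet contrasts with, scaled dot-product attention over all content positions, can be written in a few lines; this is the baseline mechanism, not the SACT composition transformer itself:

```python
# Standard multi-head self-attention over all frame positions (reference point only).
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
frame_feats = torch.randn(2, 32, 512)                       # 32 frame features per video
out, weights = attn(frame_feats, frame_feats, frame_feats)  # attends over every frame
```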

Introduction to the special issue: Egocentric Vision and Lifelogging

Mariella Dimiccoli, Cathal Gurrin, David Crandall, Xavier Giró-i-Nieto, Petia Radeva
2018 Journal of Visual Communication and Image Representation  
In "Making a long story short: A Multi-Importance fast-forwarding egocentric videos with the emphasis on relevant objects," Silva et al. propose a fast-forward method for egocentric video that emphasizes  ...  A different approach for 3 sequence captioning is proposed in "DeepDiary: Lifelogging Image Captioning 40 and Summarization," by Fan et al.  ... 
doi:10.1016/j.jvcir.2018.06.010 fatcat:rhrhgajxhfa6np4a4ufmjg4o5a
Showing results 1 — 15 out of 7,611 results