
Learning Semantic-Aware Dynamics for Video Prediction [article]

Xinzhu Bei, Yanchao Yang, Stefano Soatto
2021 arXiv   pre-print
The result is a predictive model that explicitly represents objects and learns their class-specific motion, which we evaluate on video prediction benchmarks.  ...  We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video.  ...  Our video prediction architecture with learned semantic-aware dynamics.  ... 
arXiv:2104.09762v1 fatcat:rzbewbus4zftpn6cu4e6asfoi4

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation [article]

Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, Bolei Zhou
2022 arXiv   pre-print
Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering.  ...  Animating high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment.  ...  We then introduce the Semantic-Aware Dynamic Ray Sampling module, which facilitates fine-grained appearance and dynamics modeling for each portrait part with semantic information (Sec. 3.2).  ... 
arXiv:2201.07786v1 fatcat:77kaocrzqrbjjcy4sqqq3osruy

Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction [article]

Hailong Ning, Bin Zhao, Zhanxuan Hu, Lang He, Ercheng Pei
2022 arXiv   pre-print
Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task, which explores the audio modality to better predict the dynamic saliency map by assisting vision  ...  The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive the dynamic scene, which is significant and imperative in many vision tasks.  ...  In view of practical applications, this paper aims to investigate the saliency prediction for the dynamic video.  ... 
arXiv:2109.08371v3 fatcat:4uqa4l25mjaztppresigt6jd6a

Visual-aware Attention Dual-stream Decoder for Video Captioning [article]

Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, Luo Zhong
2021 arXiv   pre-print
The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically.  ...  Video captioning is a challenging task that captures different visual parts and describes them in sentences, for it requires visual and linguistic coherence.  ...  The visual-aware attention mechanism is used to select the fused visual feature dynamically.  ... 
arXiv:2110.08578v1 fatcat:gsa6o75oqrgo3b3c2gxdzgt5ti

Explanation-Guided Fairness Testing through Genetic Algorithm [article]

Ming Fan, Wenying Wei, Wuxia Jin, Zijiang Yang, Ting Liu
2022 arXiv   pre-print
A plethora of research has proposed diverse methods for individual fairness testing.  ...  Moreover, ExpGA only requires prediction probabilities of the tested model, resulting in a better generalization capability to various models.  ...  CONCLUSION This work proposes ExpGA, an explanation-guided method through the GA for software fairness testing.  ... 
arXiv:2205.08335v1 fatcat:kwcxbsoif5ct3cq4m4i77rwee4

Personalized Cinemagraphs using Semantic Understanding and Collaborative Learning [article]

Tae-Hyun Oh, Kyungdon Joo, Neel Joshi, Baoyuan Wang, In So Kweon, Sing Bing Kang
2017 arXiv   pre-print
Creating a high-quality, aesthetically pleasing cinemagraph requires isolating objects in a semantically meaningful way and then selecting good start times and looping periods for those objects to minimize  ...  To achieve this, we present a new technique that uses object recognition and semantic segmentation as part of an optimization method to automatically create cinemagraphs from videos that are both visually  ...  The best performance of Joint shows that learning the user feature in a context-aware manner can improve the quality of preference prediction for cinemagraphs.  ... 
arXiv:1708.02970v1 fatcat:4btr42ilk5ekbopyhgczsgcsea

High-Quality Video Generation from Static Structural Annotations

Lu Sheng, Junting Pan, Jiaming Guo, Jing Shao, Chen Change Loy
2020 International Journal of Computer Vision  
The second image-to-video (I2V) generation task applies the synthesized starting frame and the associated structural annotation map to animate the scene dynamics for the generation of a photorealistic  ...  Integrating structural annotations into the flow prediction also improves the structural awareness in the I2V generation process.  ...  For video generation with dynamic scene modeling, there is a line of work trained to predict raw pixels in future frames by learning from historical motion patterns (Mathieu et al. 2015;  ... 
doi:10.1007/s11263-020-01334-x fatcat:yedge4qmcbd2jpyz6bo3n5fbqe

Cross-Modal Graph with Meta Concepts for Video Captioning [article]

Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
2021 arXiv   pre-print
Specifically, to cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions, where the associated visual regions and textual words are  ...  We further build meta concept graphs dynamically with the learned cross-modal meta concepts.  ...  Concept Prediction Learning semantic concepts from visual input has been validated to be useful in the captioning task [22] - [24] , where they mainly use a multi-label classification to predict the  ... 
arXiv:2108.06458v2 fatcat:ud6awh36iba67gacokl25uccim

Music Gesture for Visual Sound Separation [article]

Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
2020 arXiv   pre-print
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio  ...  Recent deep learning approaches have achieved impressive performance on visual sound separation tasks.  ...  Once the visual semantic feature and keypoints are extracted from the raw video, we adopt a context-aware Graph CNN (CT-GCN) to fuse the semantic context of instruments and human body dynamics.  ... 
arXiv:2004.09476v1 fatcat:jl3ujfazkfgcncdfqqfaieebl4

Dual-Level Decoupled Transformer for Video Captioning [article]

Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang
2022 arXiv   pre-print
(ii) for sentence generation, we propose Syntax-Aware Decoder to dynamically measure the contribution of visual semantic and syntax-related words.  ...  Video captioning aims to understand the spatio-temporal semantic concept of the video and generate descriptive sentences.  ...  Li [22] adopts two layers of spatio-temporal dynamic attention for video subtitles.  ... 
arXiv:2205.03039v1 fatcat:omrzfavtlngotbf27d43nwe4k4

Probabilistic Future Prediction for Video Scene Understanding [article]

Anthony Hu, Fergal Cotter, Nikhil Mohan, Corina Gurau, Alex Kendall
2020 arXiv   pre-print
We present a novel deep learning architecture for probabilistic future prediction from video.  ...  Our model learns a representation from RGB video with a spatio-temporal convolutional module.  ...  We also thank Przemyslaw Mazur, Nikolay Nikolov and Roberto Cipolla for the many insightful research discussions.  ... 
arXiv:2003.06409v2 fatcat:mf56dimeh5hgjpijm2yyeibhzu

Future-Supervised Retrieval of Unseen Queries for Live Video

Spencer Cappallo, Cees G.M. Snoek
2017 Proceedings of the 2017 ACM on Multimedia Conference - MM '17  
We introduce the use of future frame representations as a supervision signal for learning temporally aware semantic representations on unlabeled video data.  ...  We investigate retrieval of previously unseen queries for live video content. Drawing from existing whole-video techniques, we focus on adapting image-trained semantic models to the video domain.  ...  We enrich per-frame semantics with temporal awareness by using future representations for supervision.  ... 
doi:10.1145/3123266.3123437 dblp:conf/mm/CappalloS17 fatcat:t3wgpjthpfhnnbq6eeu56flsyu

Position-aware Location Regression Network for Temporal Video Grounding [article]

Sunoh Kim, Kimin Yun, Jin Young Choi
2022 arXiv   pre-print
The key to successful grounding for video surveillance is to understand a semantic phrase corresponding to important actors and objects.  ...  To understand comprehensive contexts with only one semantic phrase, we propose Position-aware Location Regression Network (PLRN) which exploits position-aware features of a query and a video.  ...  Also, a reinforcement learning (RL)-based approach [11, 28] is introduced for temporal video grounding, where the RL agent adjusts the predicted grounding boundary according to the learned policy.  ... 
arXiv:2204.05499v1 fatcat:wo73va53pnekrox5lf7d4u53ee

Toward Cost-Effective Mobile Video Streaming through Environment-Aware Watching State Prediction

Xuanyu Wang, Weizhan Zhang, Xiang Gao, Jingyi Wang, Haipeng Du, Qinghua Zheng
2019 Sensors  
Mobile video applications are becoming increasingly prevalent and enriching the way people learn and are entertained.  ...  First, the watching state is predicted by machine learning based on user behavior and the physical environment during a given time window.  ...  It provides a cost-effective data download strategy through environment-aware watching state prediction and provides a generalized strategy that can be used for many other video delivery technologies,  ... 
doi:10.3390/s19173654 fatcat:s4oemgoiiraqthtvgmq2gsg7t4

2021 Index IEEE Transactions on Multimedia Vol. 23

2021 IEEE transactions on multimedia  
The Author Index contains the primary entry for each item, listed under the first author's name.  ...  Yang, H., +, TMM 2021 572-583.  Dynamics: Dynamic Motion Estimation and Evolution Video Prediction Network.  ...  Gu, L., +, TMM 2021 939-954: Dynamic Motion Estimation and Evolution Video Prediction Network.  ... 
doi:10.1109/tmm.2022.3141947 fatcat:lil2nf3vd5ehbfgtslulu7y3lq