11 Hits in 4.2 sec

TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection [article]

Kyle Min, Jason J. Corso
2019 arXiv   pre-print
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection.  ...  decodes the encoded features spatially while aggregating all the temporal information.  ...  We thank Ryan Szeto for his valuable feedback and comments. We also thank Stephan Lemmer, Mohamed El Banani, and Luowei Zhou for their discussions.  ... 
arXiv:1908.05786v1 fatcat:4n7ywgyrbvg7fnhf4cauvolkd4

Temporal-Spatial Feature Pyramid for Video Saliency Detection [article]

Qinyao Chang, Shiping Zhu
2021 arXiv   pre-print
In order to fully combine multi-level features and make it serve the video saliency model, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection, which combines  ...  The encoder extracts multi-scale temporal-spatial features from the input continuous video frames, and then constructs temporal-spatial feature pyramid through temporal-spatial convolution and top-down  ...  Most of the existing video saliency detection models employ the encoder-decoder structure, and rely on the temporal recurrence to predict video saliency.  ... 
arXiv:2105.04213v2 fatcat:hvo2v46n4jczjkd5crywaixmxu

Learning Pixel-Level Distinctions for Video Highlight Detection [article]

Fanyue Wei, Biao Wang, Tiezheng Ge, Yuning Jiang, Wen Li, Lixin Duan
2022 arXiv   pre-print
We design an encoder-decoder network to estimate the pixel-level distinction, in which we leverage the 3D convolutional neural networks to exploit the temporal context information, and further take advantage  ...  of the visual saliency to model the spatial distinction.  ...  Acknowledgements This work is supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, the National Natural Science Foundation of China (Grant No. 62176047), Beijing Natural  ... 
arXiv:2204.04615v1 fatcat:h35saplykjgn7cvts6nodm4lvy

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction [article]

Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi
2021 arXiv   pre-print
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture.  ...  The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies  ...  TASED-Net [16] uses S3D as an encoder to extract spatial features while jointly aggregating all the temporal information in order to produce a single full-resolution prediction map.  ... 
arXiv:2012.06170v3 fatcat:tumxsk7ofrbqrl56msgrhuap7y

Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction [article]

Giovanni Bellitto, Federica Proietto Salanitri, Simone Palazzo, Francesco Rundo, Daniela Giordano, Concetto Spampinato
2021 arXiv   pre-print
In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using  ...  The results of our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction.  ...  TASED-Net [44] is a 3D fully-convolutional network, based on a standard encoder-decoder architecture, for video saliency detection without any additional feature processing steps.  ... 
arXiv:2010.01220v4 fatcat:woawbhame5bs5kmeh6qrhshzqi

A Gated Fusion Network for Dynamic Saliency Prediction [article]

Aysun Kocak, Erkut Erdem, Aykut Erdem
2021 arXiv   pre-print
Predicting saliency in videos is a challenging problem due to complex modeling of interactions between spatial and temporal information, especially when ever-changing, dynamic nature of videos is considered  ...  In this paper, we introduce Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via gated fusion mechanism.  ...  [48] suggested a different model called TASED-Net, which utilizes a 3D fullyconvolutional encoder-decoder network architecture where the encoded features are spatially upsampled while aggregating the  ... 
arXiv:2102.07682v1 fatcat:btyo6jtzpvbixj3bbd4hfnioh4

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network [article]

Jin Chen, Huihui Song, Kaihua Zhang, Bo Liu, Qingshan Liu
2020 arXiv   pre-print
Due to a variety of motions across different frames, it is highly challenging to learn an effective spatiotemporal representation for accurate video saliency prediction (VSP).  ...  Finally, the enhanced features are decoded to generate the predicted saliency map. The proposed model is trained end-to-end without any intricate post processing.  ...  TASED-Net [33] designs a 3D fully-convolutional network structure and decodes the encoded features spatially while aggregating all the temporal information for VSP.  ... 
arXiv:2001.00292v1 fatcat:mu5m34vx2bbyzfgmvs6uqeux2e

Spatio-Temporal Self-Attention Network for Video Saliency Prediction [article]

Ziqiang Wang, Zhi Liu, Gongyang Li, Tianhong Zhang, Lihua Xu, Jijun Wang
2021 arXiv   pre-print
To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed  ...  3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction that is explored in this paper.  ...  Differently, TASED-Net [21] proposed a 3D fullyconvolutional encoder-decoder network for VSP and achieved promising performance.  ... 
arXiv:2108.10696v1 fatcat:llbizroitvhy5cqzlxnmhzi3ry

Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception [article]

Guotao Wang, Chenglizhao Chen, Deng-Ping Fan, Aimin Hao, Hong Qin
2022 arXiv   pre-print
Moreover, we distill knowledge from these regions to obtain complete new spatial-temporal-audio (STA) fixation prediction (FP) networks, enabling broad applications in cases where video tags are not available  ...  Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly  ...  Corso, “Tased-net: Temporally-aggregating spatial supervised object detection,” in ICCV, 2019. encoder-decoder network for video saliency detection,” in ICCV, [45] P. Jiang, L.  ... 
arXiv:2112.13697v2 fatcat:d7nr4muxgfgulnlypygwahpcc4

Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences

Marco Buzzelli
2020 Applied Sciences  
for co-saliency, and saliency shift for video saliency estimation.  ...  We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences  ...  [65] presented TASED-Net: a Temporally-Aggregating Spatial Encoder-Decoder neural architecture based on the S3D [132] model (and, consequently, on the Inception model [133] ), that produces an estimation  ... 
doi:10.3390/app10155143 fatcat:3akkibaelnfnpav2x7gfdllhg4

Video Understanding with Minimal Human Supervision [article]

Byungsu Min, University, My
For example, collecting an autonomous driving dataset requires a lot of human annotators to draw the bounding boxes of all the pedestrians and vehicles for every frame of a video.  ...  However, collecting annotations for video datasets compared to image datasets with similar scales is more challenging, time-consuming, and expensive, since a video can have an arbitrarily large number  ...  Tased-net: Temporally-aggregating spatial encoder- decoder network for video saliency detection.  ... 
doi:10.7302/3832 fatcat:hxmhohokozc7vm7zuffmmkr6te