TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection
[article]
2019
arXiv
pre-print
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. ...
decodes the encoded features spatially while aggregating all the temporal information. ...
We thank Ryan Szeto for his valuable feedback and comments. We also thank Stephan Lemmer, Mohamed El Banani, and Luowei Zhou for their discussions. ...
arXiv:1908.05786v1
fatcat:4n7ywgyrbvg7fnhf4cauvolkd4
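The aggregation scheme this entry describes can be sketched in a few lines of PyTorch: encode a clip with 3D convolutions, upsample only the spatial axes on the way back up, and collapse whatever temporal extent remains into a single map. All layer sizes below are illustrative placeholders, not the paper's S3D-based configuration, and mean pooling stands in for TASED-Net's actual temporal-aggregation operations.

```python
import torch
import torch.nn as nn

class TinyTASED(nn.Module):
    """Toy sketch of the TASED-Net idea: many frames in, one saliency map out."""
    def __init__(self):
        super().__init__()
        # Encoder: downsample space and time together.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: upsample spatially only (stride 1 along time) ...
        self.up1 = nn.ConvTranspose3d(64, 32, kernel_size=(1, 4, 4),
                                      stride=(1, 2, 2), padding=(0, 1, 1))
        self.up2 = nn.ConvTranspose3d(32, 16, kernel_size=(1, 4, 4),
                                      stride=(1, 2, 2), padding=(0, 1, 1))
        # ... then squeeze the remaining temporal extent into one map.
        self.head = nn.Conv3d(16, 1, kernel_size=1)

    def forward(self, clip):                # clip: (B, 3, T, H, W)
        f = self.encoder(clip)              # (B, 64, T/2, H/4, W/4)
        f = torch.relu(self.up1(f))         # (B, 32, T/2, H/2, W/2)
        f = torch.relu(self.up2(f))         # (B, 16, T/2, H,   W)
        f = f.mean(dim=2, keepdim=True)     # temporal aggregation -> T = 1
        return torch.sigmoid(self.head(f)).squeeze(2)  # (B, 1, H, W)
```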
Temporal-Spatial Feature Pyramid for Video Saliency Detection
[article]
2021
arXiv
pre-print
In order to fully combine multi-level features and make them serve the video saliency model, we propose a 3D fully convolutional encoder-decoder architecture for video saliency detection, which combines ...
The encoder extracts multi-scale temporal-spatial features from the input continuous video frames, and then constructs temporal-spatial feature pyramid through temporal-spatial convolution and top-down ...
Most of the existing video saliency detection models employ the encoder-decoder structure, and rely on the temporal recurrence to predict video saliency. ...
arXiv:2105.04213v2
fatcat:hvo2v46n4jczjkd5crywaixmxu
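A hedged sketch of the top-down pyramid this abstract describes, assuming standard FPN-style lateral connections extended to 3D; the channel counts and fusion-by-addition choice are placeholders, not the paper's design:

```python
import torch.nn as nn
import torch.nn.functional as F

class TS_FPN(nn.Module):
    """Top-down temporal-spatial feature pyramid over 3D encoder features."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        # Lateral 1x1x1 convolutions project each level to a common width.
        self.laterals = nn.ModuleList(
            nn.Conv3d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.Conv3d(out_channels, out_channels, 3, padding=1)

    def forward(self, feats):
        # feats: encoder features ordered fine -> coarse,
        # each of shape (B, C_i, T_i, H_i, W_i).
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        merged = laterals[-1]
        for lat in reversed(laterals[:-1]):          # top-down pathway
            merged = F.interpolate(merged, size=lat.shape[2:],
                                   mode='trilinear', align_corners=False)
            merged = merged + lat                    # lateral fusion
        return self.smooth(merged)
```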
Learning Pixel-Level Distinctions for Video Highlight Detection
[article]
2022
arXiv
pre-print
We design an encoder-decoder network to estimate the pixel-level distinction, in which we leverage the 3D convolutional neural networks to exploit the temporal context information, and further take advantage ...
of the visual saliency to model the spatial distinction. ...
Acknowledgements This work is supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, the National Natural Science Foundation of China (Grant No. 62176047), Beijing Natural ...
arXiv:2204.04615v1
fatcat:h35saplykjgn7cvts6nodm4lvy
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
[article]
2021
arXiv
pre-print
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. ...
The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies ...
TASED-Net [16] uses S3D as an encoder to extract spatial features while jointly aggregating all the temporal information in order to produce a single full-resolution prediction map. ...
arXiv:2012.06170v3
fatcat:tumxsk7ofrbqrl56msgrhuap7y
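The decoder step this entry summarizes (trilinear upsampling plus 3D convolution, fusing a feature from an earlier hierarchy) can be sketched as follows; the channel counts and concatenation-based fusion are assumptions, not ViNet's exact layout:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodeBlock(nn.Module):
    """One decoding stage: trilinear upsample, merge a skip feature, refine."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch + skip_ch, out_ch, 3, padding=1)

    def forward(self, x, skip):
        # Upsample the coarse feature to the skip feature's (T, H, W).
        x = F.interpolate(x, size=skip.shape[2:],
                          mode='trilinear', align_corners=False)
        # Combine the two hierarchies, then refine with a 3D convolution.
        return torch.relu(self.conv(torch.cat([x, skip], dim=1)))
```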
Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction
[article]
2021
arXiv
pre-print
In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using ...
The results of our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction. ...
TASED-Net [44] is a 3D fully-convolutional network, based on a standard encoder-decoder architecture, for video saliency detection without any additional feature processing steps. ...
arXiv:2010.01220v4
fatcat:woawbhame5bs5kmeh6qrhshzqi
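The hierarchical supervision described here can be sketched as one loss term per intermediate conspicuity map, each compared against a resized ground truth. The BCE loss, the equal weighting, and the `hierarchical_loss` helper are illustrative assumptions, not the paper's actual training objective:

```python
import torch.nn.functional as F

def hierarchical_loss(conspicuity_maps, gt):
    # conspicuity_maps: list of (B, 1, H_i, W_i) intermediate logits
    # gt: (B, 1, H, W) ground-truth saliency map with values in [0, 1]
    total = 0.0
    for pred in conspicuity_maps:
        # Resize the ground truth to each intermediate map's resolution.
        target = F.interpolate(gt, size=pred.shape[2:], mode='bilinear',
                               align_corners=False)
        total = total + F.binary_cross_entropy_with_logits(pred, target)
    return total / len(conspicuity_maps)
```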
A Gated Fusion Network for Dynamic Saliency Prediction
[article]
2021
arXiv
pre-print
Predicting saliency in videos is a challenging problem due to the complex modeling of interactions between spatial and temporal information, especially when the ever-changing, dynamic nature of videos is considered ...
In this paper, we introduce Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via gated fusion mechanism. ...
[48] suggested a different model called TASED-Net, which utilizes a 3D fully-convolutional encoder-decoder network architecture where the encoded features are spatially upsampled while aggregating the ...
arXiv:2102.07682v1
fatcat:btyo6jtzpvbixj3bbd4hfnioh4
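A minimal sketch of a gated fusion step in the spirit of this entry, assuming two 2D feature streams and a learned sigmoid gate that decides, per location, how much of each stream to pass on; GFSalNet's actual mechanism is more involved:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend two feature streams with a learned per-location gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, spatial, temporal):
        # Gate in [0, 1]: 1 keeps the spatial stream, 0 keeps the temporal one.
        g = torch.sigmoid(self.gate(torch.cat([spatial, temporal], dim=1)))
        return g * spatial + (1.0 - g) * temporal
```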
Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network
[article]
2020
arXiv
pre-print
Due to a variety of motions across different frames, it is highly challenging to learn an effective spatiotemporal representation for accurate video saliency prediction (VSP). ...
Finally, the enhanced features are decoded to generate the predicted saliency map. The proposed model is trained end-to-end without any intricate post-processing. ...
TASED-Net [33] designs a 3D fully-convolutional network structure and decodes the encoded features spatially while aggregating all the temporal information for VSP. ...
arXiv:2001.00292v1
fatcat:mu5m34vx2bbyzfgmvs6uqeux2e
Spatio-Temporal Self-Attention Network for Video Saliency Prediction
[article]
2021
arXiv
pre-print
To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed ...
3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction that is explored in this paper. ...
In contrast, TASED-Net [21] proposed a 3D fully-convolutional encoder-decoder network for VSP and achieved promising performance. ...
arXiv:2108.10696v1
fatcat:llbizroitvhy5cqzlxnmhzi3ry
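The core operation behind spatio-temporal self-attention can be sketched by flattening a 3D feature map into T*H*W tokens and applying standard multi-head attention, as below; STSANet's STSA modules are more elaborate, and this naive version is quadratic in the number of space-time positions:

```python
import torch.nn as nn

class STSelfAttention(nn.Module):
    """Self-attention across all space-time positions of a 3D feature map."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, t, h, w)
```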
Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception
[article]
2022
arXiv
pre-print
Moreover, we distill knowledge from these regions to obtain complete new spatial-temporal-audio (STA) fixation prediction (FP) networks, enabling broad applications in cases where video tags are not available ...
Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly ...
Corso, "Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection," in ICCV, 2019. ...
arXiv:2112.13697v2
fatcat:d7nr4muxgfgulnlypygwahpcc4
Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences
2020
Applied Sciences
for co-saliency, and saliency shift for video saliency estimation. ...
We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences ...
[65] presented TASED-Net: a Temporally-Aggregating Spatial Encoder-Decoder neural architecture based on the S3D [132] model (and, consequently, on the Inception model [133]) that produces an estimation ...
doi:10.3390/app10155143
fatcat:3akkibaelnfnpav2x7gfdllhg4
Video Understanding with Minimal Human Supervision
[article]
2022
For example, collecting an autonomous driving dataset requires a lot of human annotators to draw the bounding boxes of all the pedestrians and vehicles for every frame of a video. ...
However, collecting annotations for video datasets compared to image datasets with similar scales is more challenging, time-consuming, and expensive, since a video can have an arbitrarily large number ...
Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection. ...
doi:10.7302/3832
fatcat:hxmhohokozc7vm7zuffmmkr6te