One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out
[article]
2022
arXiv
pre-print
The proposed CiCo strategy is free of inter-frame alignment, and can be easily embedded into existing FiFo based VIS approaches. ...
Specifically, we stack FPN features of all frames in a short video clip to build a spatio-temporal feature cube, and replace the 2D conv layers in the prediction heads and the mask branch with 3D conv ...
A spatio-temporal embedding scheme is proposed in STEm-Seg [1], while the bottom-up paradigm leads to much lower performance. ...
arXiv:2203.06421v1
fatcat:oxy2pte7azb4batrizsrshn2qq
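The CiCo snippet above describes stacking per-frame FPN features into a spatio-temporal cube and replacing the 2D convolutions of the prediction heads with 3D ones. A minimal PyTorch sketch of that idea follows; the batch size, channel count, clip length and spatial size are illustrative values, not the paper's settings.

```python
import torch
import torch.nn as nn

# Per-frame FPN features for a short clip: T tensors of shape (B, C, H, W).
# B=1, C=256, H=W=32, T=4 are illustrative values, not the paper's settings.
frames = [torch.randn(1, 256, 32, 32) for _ in range(4)]

# Stack along a new temporal dimension -> (B, C, T, H, W) feature cube.
cube = torch.stack(frames, dim=2)

# A FiFo head would apply nn.Conv2d frame by frame; the clip-level variant
# replaces it with a 3D convolution over the whole cube.
head_3d = nn.Conv3d(256, 256, kernel_size=3, padding=1)

clip_features = head_3d(cube)   # (1, 256, 4, 32, 32): predictions share temporal context
print(clip_features.shape)
```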
OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association
[article]
2021
arXiv
pre-print
We present a generic neural network architecture that uses Composite Fields to detect and construct a spatio-temporal pose which is a single, connected graph whose nodes are the semantic keypoints (e.g ...
In this work, we present a general framework that jointly detects and forms spatio-temporal keypoint associations in a single stage, making this the first real-time pose detection and tracking algorithm ...
We also thank our lab members and reviewers for their valuable comments. ...
arXiv:2103.02440v2
fatcat:utj3lczi7rbqri6ntt2um2quc4
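The pose described in the snippet above is a single connected graph whose nodes are semantic keypoints, linked both within a frame and across frames. A tiny Python sketch of such a data structure, with field names chosen for illustration rather than taken from OpenPifPaf:

```python
from dataclasses import dataclass, field

@dataclass
class Keypoint:
    frame: int      # temporal index of the frame the keypoint belongs to
    name: str       # semantic label, e.g. "nose" or "left_ankle"
    x: float
    y: float
    score: float

@dataclass
class SpatioTemporalPose:
    """A single connected graph: nodes are keypoints, edges link keypoints
    within a frame (skeleton) and across frames (tracking associations)."""
    nodes: list = field(default_factory=list)   # list of Keypoint
    edges: list = field(default_factory=list)   # list of (node_index, node_index)

    def add_association(self, i, j):
        self.edges.append((i, j))
```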
Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer
[article]
2022
arXiv
pre-print
State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations ...
To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. ...
Acknowledgements This work was partially supported by VR starting grant (2016-05543), the Wallenberg AI, Autonomous Systems and Software Program (WASP), by the Swedish Research Council through a grant ...
arXiv:2203.13253v1
fatcat:4vvv55v5knaj7i6vcljbiiwr3e
Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective
[article]
2021
arXiv
pre-print
Although some works have summarized the different approaches, it still remains challenging for researchers to gain an in-depth view of how these approaches work. ...
By systematically summarizing the differences and connections between these approaches, we further analyze the solutions for challenging cases, such as the lack of data, the inherent ambiguity between ...
Associative Embedding [42], which is used in the image-based bottom-up keypoint association strategy, is also extended in [124] to build the spatio-temporal embedding. ...
arXiv:2104.11536v1
fatcat:tdag2jq2vjdrjekwukm5nu7l6a
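Associative Embedding, mentioned in the last snippet, is commonly formulated as a grouping loss with a pull term toward each instance's mean tag and a push term between the means of different instances. The PyTorch sketch below follows that common formulation; it is not the exact loss of [42] or of its spatio-temporal extension in [124].

```python
import torch

def associative_embedding_loss(tags, sigma=1.0):
    """tags: list of 1-D tensors, one per instance, holding the predicted
    tag (embedding) value of each of that instance's keypoints."""
    means = torch.stack([t.mean() for t in tags])   # one reference tag per instance

    # Pull: keypoint tags of the same instance should match the instance mean.
    pull = sum(((t - m) ** 2).mean() for t, m in zip(tags, means)) / len(tags)

    # Push: means of different instances should repel each other.
    diff = means.unsqueeze(0) - means.unsqueeze(1)
    push = torch.exp(-diff ** 2 / (2 * sigma ** 2))
    push = (push.sum() - len(tags)) / max(len(tags) * (len(tags) - 1), 1)

    return pull + push
```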
Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For Autonomous Driving
[article]
2022
arXiv
pre-print
Finally, these feature maps are used by a single-stage decoder to generate the bounding box of each pedestrian and the score map. ...
Although a camera is commonly used for this purpose, its quality degrades severely in low-light night time driving scenarios. ...
Finally, a single-stage detection decoder [39] uses the multimodal features to output bounding boxes and scores for pedestrians. ...
arXiv:2105.12713v3
fatcat:2x3qtaupo5euvio2wrjv4dvppu
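A single-stage decoder of the kind the snippets describe can be sketched as two small convolutional branches over the fused multimodal feature map, one emitting the pedestrian score map and one regressing box parameters per location. The PyTorch sketch below is a generic anchor-free head, not the actual decoder [39] used in the paper; channel counts are placeholders.

```python
import torch
import torch.nn as nn

class SingleStageDecoder(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        self.score_head = nn.Sequential(              # per-location pedestrian score map
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )
        self.box_head = nn.Sequential(                # per-location box (cx, cy, w, h) regression
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 4, 1),
        )

    def forward(self, fused_features):                # (B, C, H, W) fused multimodal feature map
        return self.score_head(fused_features), self.box_head(fused_features)

scores, boxes = SingleStageDecoder()(torch.randn(1, 256, 80, 100))
print(scores.shape, boxes.shape)                      # (1, 1, 80, 100) (1, 4, 80, 100)
```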
Spatio-Temporal Laplacian Pyramid Coding for Action Recognition
2014
IEEE Transactions on Cybernetics
and over spatio-temporal neighborhoods. ...
In contrast to sparse representations based on detected local interest points, STLPC regards a video sequence as a whole with spatio-temporal features directly extracted from it, which prevents the loss ...
[37], feature extraction using an architecture with two stages, namely a filter bank and a feature pooling technique, performs better than that with a single stage. ...
doi:10.1109/tcyb.2013.2273174
pmid:23912503
fatcat:jjqjkgwdcnhsdm4adfwgppcbpy
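A spatio-temporal Laplacian pyramid treats the video as a 3-D volume and keeps band-pass residuals between successively smoothed and downsampled copies. The NumPy/SciPy sketch below illustrates that construction; the number of levels and the Gaussian sigma are assumptions, not the settings used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def spatio_temporal_laplacian_pyramid(video, levels=3, sigma=1.0):
    """video: float array of shape (T, H, W). Returns band-pass levels plus a low-pass residue."""
    pyramid = []
    current = video.astype(np.float64)
    for _ in range(levels):
        blurred = gaussian_filter(current, sigma=sigma)   # 3-D smoothing over (t, y, x)
        pyramid.append(current - blurred)                 # band-pass residual at this scale
        current = zoom(blurred, 0.5, order=1)             # downsample time and space by 2
    pyramid.append(current)                               # low-pass residue
    return pyramid

levels = spatio_temporal_laplacian_pyramid(np.random.rand(16, 64, 64))
print([l.shape for l in levels])
```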
Less than Few: Self-Shot Video Instance Segmentation
[article]
2022
arXiv
pre-print
We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples. ...
This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. ...
Stage 1: Spatio-temporal transformer encoder. ...
arXiv:2204.08874v1
fatcat:ezkbn7phi5cnnbinvcyvfx5vm4
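The unsupervised retrieval step implied by the snippet reduces to nearest-neighbour search in the learned embedding space. A minimal cosine-similarity sketch in PyTorch; the embedding dimension and function names are placeholders, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, k=5):
    """query_emb: (D,), gallery_embs: (N, D). Returns indices of the k most similar clips."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), gallery_embs, dim=1)   # (N,)
    return sims.topk(k).indices

gallery = F.normalize(torch.randn(100, 256), dim=1)   # embeddings of unlabeled clips
query = F.normalize(torch.randn(256), dim=0)          # embedding of the query video
print(retrieve(query, gallery))
```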
Detection of Parked Vehicles Using Spatiotemporal Maps
2011
IEEE transactions on intelligent transportation systems (Print)
This paper presents a video-based approach to detect the presence of parked vehicles in street lanes. ...
The technique extracts information from low-level feature points (Harris corners) in order to create spatio-temporal maps that describe what is happening in the scene. ...
In [25], we also proposed the idea of spatio-temporal maps for counting people. However, the information embedded on those maps was completely different. ...
doi:10.1109/tits.2011.2156791
fatcat:yernruc2m5aynesl6yfrl4u3hm
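One simple way to build a spatio-temporal map of the kind described is to summarise Harris-corner activity along a fixed image line for each frame and stack the resulting rows over time. The OpenCV/NumPy sketch below assumes that scanline construction and uses illustrative thresholds; the paper's maps may be built differently.

```python
import cv2
import numpy as np

def spatio_temporal_map(frames, lane_row=240):
    """frames: iterable of grayscale uint8 images. Returns a (T, W) map whose
    rows record Harris-corner activity along one image line (lane_row) over time."""
    rows = []
    for frame in frames:
        corners = cv2.cornerHarris(np.float32(frame), 2, 3, 0.04)
        # Keep only strong responses, then sample the lane line for this frame.
        strong = (corners > 0.01 * corners.max()).astype(np.uint8)
        rows.append(strong[lane_row, :])
    return np.stack(rows, axis=0)   # time runs downwards, space runs across
```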
Spatio-Temporal Self-Attention Network for Fire Detection and Segmentation in Video Surveillance
2021
IEEE Access
As a whole, our pipeline has two stages: In the first stage, we extract region proposals using spatio-temporal features, and in the second stage, we classify whether each region proposal is flame or ...
improvement for small fires at a very early stage. ...
The spatial stream uses static features from a single frame, such as color and texture. Our proposed approach uses self-attention on spatio-temporal features that are discriminative of fire, enabling ...
doi:10.1109/access.2021.3132787
fatcat:nqqsy3i6v5fbfdwsmogvz7uogq
Panorama View With Spatiotemporal Occlusion Compensation for 3D Video Coding
2015
IEEE Transactions on Image Processing
In this paper we present Spatio-Temporal Occlusion compensation with Panorama view (STOP), a novel 3D video coding technique based on the creation of a panorama view and occlusion coding in terms of spatio-temporal ...
The panorama picture represents most of the visual information acquired from multiple views using a single virtual view, characterized by a larger field of view. ...
'Kendo' and Nokia Research for 'Undo Dancer' and 'GT Fly' 3D video sequence. ...
doi:10.1109/tip.2014.2374533
pmid:25438310
fatcat:agujr5fmtbcrxlanfistgq3vyu
Multi-View Video-Based 3D Hand Pose Estimation
[article]
2021
arXiv
pre-print
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, ...
Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices. ...
After extracting spatial embeddings from each frame using an encoder, our model uses a pair of temporal and angular learners to learn effective spatio-temporal and spatio-angular representations ...
arXiv:2109.11747v1
fatcat:w7idunkz6fbijlnrqjqim74sve
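The encoder-plus-recurrent-learner pattern in the snippet can be sketched as a per-frame CNN that produces a visual embedding, followed by an LSTM over the embedding sequence. The PyTorch sketch below uses placeholder layer sizes and a single temporal learner; it is not the MuViHandNet configuration.

```python
import torch
import torch.nn as nn

class FrameEncoderWithTemporalLearner(nn.Module):
    def __init__(self, emb_dim=128, hidden_dim=256, num_joints=21):
        super().__init__()
        # Per-frame image encoder producing a visual embedding of the hand.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        # Temporal learner over the sequence of frame embeddings.
        self.temporal = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)   # 3-D joint coordinates

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        emb = self.encoder(clip.flatten(0, 1))  # (B*T, emb_dim)
        emb = emb.view(b, t, -1)
        feats, _ = self.temporal(emb)           # (B, T, hidden_dim)
        return self.head(feats)                 # per-frame 3-D pose estimates

model = FrameEncoderWithTemporalLearner()
print(model(torch.randn(2, 8, 3, 64, 64)).shape)   # (2, 8, 63)
```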
TubeFormer-DeepLab: Video Mask Transformer
[article]
2022
arXiv
pre-print
The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. ...
State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task. ...
We see the global memory attention is spatio-temporally well separated for individual thing or stuff tubes. ...
arXiv:2205.15361v1
fatcat:l7g2pu7i2zcb7hgpqsjg4muiya
The Role of Dynamics in Extracting Information Sparsely Encoded in High Dimensional Data Streams
[chapter]
2010
Dynamics of Information Systems
The goal of this chapter is to show how the use of simple dynamical systems concepts can lead to tractable, computationally efficient algorithms for extracting information sparsely encoded in multimodal ...
is by nature dynamic and changes as it propagates through a network where the nodes themselves are dynamical systems. ...
Alon Zaslaver, Caltech, for providing the diauxic shift experimental data used in Figures 1.1(c) , 1.5(c) and 1.7. ...
doi:10.1007/978-1-4419-5689-7_1
fatcat:776v36m3kjdulfih4mfuxc3usm
Audio-visual Multi-channel Integration and Recognition of Overlapped Speech
[article]
2021
arXiv
pre-print
Consistent performance improvements are also obtained using the proposed audio-visual multi-channel recognition system when using occluded video input with the face region randomly covered up to 60%. ...
A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter & sum, and mask-based MVDR neural channel integration approaches are developed. ...
The authors would like to thank Yiwen Shao and Yiming Wang for the deep discussion about the LF-MMI implementation details. ...
arXiv:2011.07755v2
fatcat:cc4frdheindjvjko7ur2f6shly
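Mask-based MVDR beamforming, named in the snippet, estimates speech and noise spatial covariance matrices from predicted time-frequency masks and derives per-frequency beamforming weights. The NumPy sketch below follows the standard formulation with a fixed reference channel; it is not this paper's implementation.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_mic=0, eps=1e-8):
    """stft: (C, F, T) complex multi-channel spectrogram; masks: (F, T) in [0, 1].
    Returns the beamformed single-channel spectrogram of shape (F, T)."""
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                           # (C, T)
        # Mask-weighted spatial covariance matrices for speech and noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / (speech_mask[f].sum() + eps)
        phi_n = (noise_mask[f] * X) @ X.conj().T / (noise_mask[f].sum() + eps)
        phi_n += eps * np.eye(C)                                    # diagonal loading for stability
        # MVDR weights: w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s), u = reference selector.
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_mic] / (np.trace(num) + eps)
        out[f] = w.conj() @ X                                       # y = w^H x per frame
    return out
```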
Robust single-view geometry and motion reconstruction
2009
ACM Transactions on Graphics
Our method makes use of a smooth template that provides a crude approximation of the scanned object and serves as a geometric and topological prior for reconstruction. ...
Abstract We present a framework and algorithms for robust geometry and motion reconstruction of complex deforming shapes. ...
A hierarchical graph representation is pre-computed from a dense uniform sampling of graph nodes by successively merging nodes in a bottom-up fashion. ...
doi:10.1145/1618452.1618521
fatcat:zxu5j7ymqvav7gwwbf5vwmezwi
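The bottom-up construction of the hierarchical graph mentioned in the snippet can be illustrated by greedily collapsing the closest pair of nodes into their midpoint until a target node count remains. The NumPy sketch below uses a naive O(n^2) pairwise search purely for illustration; the paper's actual construction may differ.

```python
import numpy as np

def merge_bottom_up(nodes, target_count):
    """nodes: (N, 3) array of node positions sampled densely on the template.
    Greedily merges the closest pair into its midpoint until target_count nodes remain."""
    nodes = [np.asarray(n, dtype=float) for n in nodes]
    while len(nodes) > target_count:
        best = None
        for i in range(len(nodes)):                 # naive closest-pair search
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(nodes[i] - nodes[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (nodes[i] + nodes[j]) / 2.0
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return np.stack(nodes)

coarse = merge_bottom_up(np.random.rand(50, 3), target_count=10)
print(coarse.shape)   # (10, 3)
```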
Showing results 1–15 of 464 results