
One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out [article]

Minghan Li, Lei Zhang
2022 arXiv   pre-print
The proposed CiCo strategy is free of inter-frame alignment, and can be easily embedded into existing FiFo-based VIS approaches.  ...  Specifically, we stack FPN features of all frames in a short video clip to build a spatio-temporal feature cube, and replace the 2D conv layers in the prediction heads and the mask branch with 3D conv  ...  A spatio-temporal embedding scheme is proposed in STEm-Seg [1], while the bottom-up paradigm leads to much lower performance.  ... 
arXiv:2203.06421v1 fatcat:oxy2pte7azb4batrizsrshn2qq
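
The CiCo entry above describes a concrete mechanism: per-frame FPN features of a short clip are stacked into a spatio-temporal feature cube, and the 2D conv layers of the prediction heads are replaced by 3D convs. Below is a minimal, hedged sketch of that idea in PyTorch; the clip length, channel count, and layer sizes are assumptions chosen for illustration, not the authors' configuration.

    import torch
    import torch.nn as nn

    T, C, H, W = 4, 256, 48, 80                                  # assumed clip length and FPN feature size
    frame_feats = [torch.randn(1, C, H, W) for _ in range(T)]    # per-frame FPN features (one level)

    # Stack along a new temporal axis to form an (N, C, T, H, W) spatio-temporal cube.
    cube = torch.stack(frame_feats, dim=2)

    # Frame-in frame-out: a 2D head applied to each frame independently.
    head_2d = nn.Conv2d(C, C, kernel_size=3, padding=1)
    fifo_out = torch.stack([head_2d(f) for f in frame_feats], dim=2)

    # Clip-in clip-out: the same head replaced by a 3D conv acting on the whole cube at once.
    head_3d = nn.Conv3d(C, C, kernel_size=3, padding=1)
    cico_out = head_3d(cube)

    print(fifo_out.shape, cico_out.shape)                        # both torch.Size([1, 256, 4, 48, 80])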

OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association [article]

Sven Kreiss, Lorenzo Bertoni, Alexandre Alahi
2021 arXiv   pre-print
We present a generic neural network architecture that uses Composite Fields to detect and construct a spatio-temporal pose which is a single, connected graph whose nodes are the semantic keypoints (e.g  ...  In this work, we present a general framework that jointly detects and forms spatio-temporal keypoint associations in a single stage, making this the first real-time pose detection and tracking algorithm  ...  We also thank our lab members and reviewers for their valuable comments.  ... 
arXiv:2103.02440v2 fatcat:utj3lczi7rbqri6ntt2um2quc4

Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer [article]

Omkar Thawakar, Sanath Narayan, Jiale Cao, Hisham Cholakkal, Rao Muhammad Anwer, Muhammad Haris Khan, Salman Khan, Michael Felsberg, Fahad Shahbaz Khan
2022 arXiv   pre-print
State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations  ...  To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder.  ...  Acknowledgements This work was partially supported by VR starting grant (2016-05543), the Wallenberg AI, Autonomous Systems and Software Program (WASP), by the Swedish Research Council through a grant  ... 
arXiv:2203.13253v1 fatcat:4vvv55v5knaj7i6vcljbiiwr3e
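
The MS-STS entry above centers on a multi-scale spatio-temporal "split" attention module in the encoder. One plausible reading of split attention, sketched below purely as an assumption (the paper's module additionally handles multiple scales), is to factorize attention into an intra-frame spatial step followed by an inter-frame temporal step:

    import torch
    import torch.nn as nn

    T, HW, D = 4, 100, 64                      # assumed: frames, spatial tokens per frame, embed dim
    x = torch.randn(T, HW, D)                  # clip tokens arranged as (frames, tokens, dim)

    spatial_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
    temporal_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    # 1) Spatial split: each frame attends over its own tokens (frames act as the batch).
    x, _ = spatial_attn(x, x, x)

    # 2) Temporal split: each spatial location attends across the T frames.
    x = x.transpose(0, 1)                      # (HW, T, D)
    x, _ = temporal_attn(x, x, x)
    x = x.transpose(0, 1)                      # back to (T, HW, D)
    print(x.shape)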

Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective [article]

Wu Liu, Qian Bao, Yu Sun, Tao Mei
2021 arXiv   pre-print
Although there have been some works to summarize the different approaches, it still remains challenging for researchers to have an in-depth view of how these approaches work.  ...  By systematically summarizing the differences and connections between these approaches, we further analyze the solutions for challenging cases, such as the lack of data, the inherent ambiguity between  ...  Associative Embedding [42], which is used in the image-based bottom-up keypoint association strategy, is also extended in [124] to build the spatio-temporal embedding.  ... 
arXiv:2104.11536v1 fatcat:tdag2jq2vjdrjekwukm5nu7l6a

Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For Autonomous Driving [article]

Kinjal Dasgupta, Arindam Das, Sudip Das, Ujjwal Bhattacharya, Senthil Yogamani
2022 arXiv   pre-print
Finally, these feature maps are used by a single-stage decoder to generate the bounding box of each pedestrian and the score map.  ...  Although a camera is commonly used for this purpose, its quality degrades severely in low-light night-time driving scenarios.  ...  Finally, a single-stage detection decoder [39] uses the multimodal features to output bounding boxes and scores for pedestrians.  ... 
arXiv:2105.12713v3 fatcat:2x3qtaupo5euvio2wrjv4dvppu

Spatio-Temporal Laplacian Pyramid Coding for Action Recognition

Ling Shao, Xiantong Zhen, Dacheng Tao, Xuelong Li
2014 IEEE Transactions on Cybernetics  
and over spatio-temporal neighborhoods.  ...  In contrast to sparse representations based on detected local interest points, STLPC regards a video sequence as a whole with spatio-temporal features directly extracted from it, which prevents the loss  ...  [37], feature extraction using an architecture with two stages, namely a filter bank and a feature pooling technique, performs better than that with a single stage.  ... 
doi:10.1109/tcyb.2013.2273174 pmid:23912503 fatcat:jjqjkgwdcnhsdm4adfwgppcbpy
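
The STLPC entry above treats the video volume as a whole and extracts band-pass spatio-temporal structure via a Laplacian pyramid. As a rough illustration (not the paper's exact filtering), one pyramid level over a (T, H, W) volume can be built by blurring/downsampling, upsampling back, and keeping the residual:

    import torch
    import torch.nn.functional as F

    video = torch.randn(1, 1, 16, 64, 64)            # (N, C, T, H, W) toy grayscale video volume

    coarse = F.avg_pool3d(video, kernel_size=2)      # stand-in for Gaussian blur + downsample
    upsampled = F.interpolate(coarse, size=video.shape[2:],
                              mode='trilinear', align_corners=False)
    laplacian_level = video - upsampled              # band-pass (Laplacian) level at this scale

    print(coarse.shape, laplacian_level.shape)       # (1,1,8,32,32) and (1,1,16,64,64)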

Less than Few: Self-Shot Video Instance Segmentation [article]

Pengwan Yang, Yuki M. Asano, Pascal Mettes, Cees G. M. Snoek
2022 arXiv   pre-print
We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well-suited for unsupervised retrieval of relevant samples.  ...  This is especially true as the level of details in spatio-temporal video understanding and with it, the complexity of annotations continues to increase.  ...  Stage 1: Spatio-temporal transformer encoder.  ... 
arXiv:2204.08874v1 fatcat:ezkbn7phi5cnnbinvcyvfx5vm4

Detection of Parked Vehicles Using Spatiotemporal Maps

Antonio Albiol, Laura Sanchis, Alberto Albiol, José M. Mossi
2011 IEEE Transactions on Intelligent Transportation Systems (Print)  
This paper presents a video-based approach to detect the presence of parked vehicles in street lanes.  ...  The technique extracts information from low-level feature points (Harris corners) in order to create spatio-temporal maps that describe what is happening in the scene.  ...  In [25], we also proposed the idea of spatio-temporal maps for counting people. However, the information embedded on those maps was completely different.  ... 
doi:10.1109/tits.2011.2156791 fatcat:yernruc2m5aynesl6yfrl4u3hm
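
The parked-vehicle entry above builds spatio-temporal maps from low-level feature points. A minimal sketch of that kind of map, with made-up corner positions instead of real Harris detections, is shown below: per-frame point activity along a lane is histogrammed into one row, and rows are stacked over time so a parked vehicle appears as a persistent band.

    import numpy as np

    num_frames, lane_length_px, bins = 300, 400, 80
    st_map = np.zeros((num_frames, bins))

    for t in range(num_frames):
        # Hypothetical x-positions (pixels along the lane) of corner points detected at frame t.
        corner_x = np.random.randint(0, lane_length_px, size=np.random.randint(0, 30))
        hist, _ = np.histogram(corner_x, bins=bins, range=(0, lane_length_px))
        st_map[t] = hist                              # one row of the spatio-temporal map per frame

    print(st_map.shape)                               # (300, 80): time vs. position along the lane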

Spatio-Temporal Self-Attention Network for Fire Detection and Segmentation in Video Surveillance

Mohammad Shahid, John Jethro Virtusio, Yu-Hsien Wu, Yung-Yao Chen, M. Tanveer, Khan Muhammad, Kai-Lung Hua
2021 IEEE Access  
As a whole, our pipeline has two stages: In the first stage, we take out region proposals using Spatial-Temporal features, and in the second stage, we classify whether each region proposal is flame or  ...  improvement for small fires at a very early stage.  ...  The spatio stream uses static features from a single frame, such as color and texture. • Our proposed approach uses self-attention on Spatio-Temporal features that are discriminative of fire, enabling  ... 
doi:10.1109/access.2021.3132787 fatcat:nqqsy3i6v5fbfdwsmogvz7uogq

Panorama View With Spatiotemporal Occlusion Compensation for 3D Video Coding

Muhammad Shahid Farid, Maurizio Lucenteforte, Marco Grangetto
2015 IEEE Transactions on Image Processing  
In this paper we present Spatio-Temporal Occlusion compensation with Panorama view (STOP), a novel 3D video coding technique based on the creation of a panorama view and occlusion coding in terms of spatio-temporal  ...  The panorama picture represents most of the visual information acquired from multiple views using a single virtual view, characterized by a larger field of view.  ...  'Kendo' and Nokia Research for 'Undo Dancer' and 'GT Fly' 3D video sequence.  ... 
doi:10.1109/tip.2014.2374533 pmid:25438310 fatcat:agujr5fmtbcrxlanfistgq3vyu

Multi-View Video-Based 3D Hand Pose Estimation [article]

Leyla Khaleghi, Alireza Sepas Moghaddam, Joshua Marshall, Ali Etemad
2021 arXiv   pre-print
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information,  ...  Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices.  ...  After extracting spatial embeddings from each frame using an encoder, our model uses a pair of temporal and angular learners to learn effective spatio-temporal and spatio-angular representations  ... 
arXiv:2109.11747v1 fatcat:w7idunkz6fbijlnrqjqim74sve

TubeFormer-DeepLab: Video Mask Transformer [article]

Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen
2022 arXiv   pre-print
The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks.  ...  State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task.  ...  We see the global memory attention is spatio-temporally well separated for individual thing or stuff tubes.  ... 
arXiv:2205.15361v1 fatcat:l7g2pu7i2zcb7hgpqsjg4muiya

The Role of Dynamics in Extracting Information Sparsely Encoded in High Dimensional Data Streams [chapter]

Mario Sznaier, Octavia Camps, Necmiye Ozay, Tao Ding, Gilead Tadmor, Dana Brooks
2010 Dynamics of Information Systems  
The goal of this chapter is to show how the use of simple dynamical systems concepts can lead to tractable, computationally efficient algorithms for extracting information sparsely encoded in multimodal  ...  is by nature dynamic and changes as it propagates through a network where the nodes themselves are dynamical systems.  ...  Alon Zaslaver, Caltech, for providing the diauxic shift experimental data used in Figures 1.1(c), 1.5(c) and 1.7.  ... 
doi:10.1007/978-1-4419-5689-7_1 fatcat:776v36m3kjdulfih4mfuxc3usm

Audio-visual Multi-channel Integration and Recognition of Overlapped Speech [article]

Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu
2021 arXiv   pre-print
Consistent performance improvements are also obtained using the proposed audio-visual multi-channel recognition system when using occluded video input with the face region randomly covered up to 60%.  ...  A series of audio-visual multi-channel speech separation front-end components based on TF masking, Filter&Sum and mask-based MVDR neural channel integration approaches are developed.  ...  The authors would like to thank Yiwen Shao and Yiming Wang for the deep discussion about the LF-MMI implementation details.  ... 
arXiv:2011.07755v2 fatcat:cc4frdheindjvjko7ur2f6shly
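
Among the channel-integration front-ends named in the entry above is Filter&Sum. As a hedged, classical illustration of that idea (in the paper the filters are predicted by a neural network; here they are random placeholders), each microphone channel is filtered with its own FIR filter and the results are summed into one enhanced channel:

    import numpy as np

    num_mics, num_samples, filter_len = 4, 16000, 64
    channels = np.random.randn(num_mics, num_samples)          # multi-channel waveform (assumed sizes)
    filters = np.random.randn(num_mics, filter_len) * 0.01     # one FIR filter per channel (placeholder)

    # Filter each channel with its own FIR filter, then sum across channels.
    filtered = [np.convolve(channels[m], filters[m], mode='same') for m in range(num_mics)]
    beamformed = np.sum(filtered, axis=0)                      # filter-and-sum output, one channel

    print(beamformed.shape)                                    # (16000,)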

Robust single-view geometry and motion reconstruction

Hao Li, Bart Adams, Leonidas J. Guibas, Mark Pauly
2009 ACM Transactions on Graphics  
Our method makes use of a smooth template that provides a crude approximation of the scanned object and serves as a geometric and topological prior for reconstruction.  ...  We present a framework and algorithms for robust geometry and motion reconstruction of complex deforming shapes.  ...  A hierarchical graph representation is pre-computed from a dense uniform sampling of graph nodes by successively merging nodes in a bottom-up fashion.  ... 
doi:10.1145/1618452.1618521 fatcat:zxu5j7ymqvav7gwwbf5vwmezwi
Showing results 1–15 out of 464