Rethinking Motion Representation: Residual Frames with 3D ConvNets

Li Tao, Xueting Wang, Toshihiko Yamasaki
2021 IEEE Transactions on Image Processing  
We deeply analyze the effectiveness of this modality compared to normal RGB video clips, and find that better motion features can be extracted using residual frames with 3D ConvNets.  ...  In this paper, we propose a cheap but effective way to extract motion features from videos utilizing residual frames as the input data in 3D ConvNets.  ...  From another point of view, our proposed method focuses more on the movement itself and utilizes a 3D ConvNet with higher motion representation ability by using residual frames as input.  ... 
doi:10.1109/tip.2021.3124156 pmid:34735344 fatcat:ahmd6sjffjaqrkirbqzmwfbkp4

Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition [article]

Li Tao, Xueting Wang, Toshihiko Yamasaki
2020 arXiv   pre-print
Further analysis indicates that better motion features can be extracted using residual frames with 3D ConvNets, and our residual-frame-input path is a good supplement for existing RGB-frame-input models  ...  In this paper, we propose a fast but effective way to extract motion features from videos utilizing residual frames as the input data in 3D ConvNets.  ...  From another point of view, our proposed method focuses more on the movement itself and utilizes a 3D ConvNet with higher motion representation ability by using residual frames as input.  ... 
arXiv:2001.05661v1 fatcat:ftetbjnvbjfbfay6g7wxhf5bfu

Rethinking Pose in 3D: Multi-stage Refinement and Recovery for Markerless Motion Capture [article]

Denis Tome, Matteo Toso, Lourdes Agapito, Chris Russell
2018 arXiv   pre-print
This novelty allows us to use provisional 3D models of human pose to rethink where the joints should be located in the image and to recover from past mistakes.  ...  We propose a CNN-based approach for multi-camera markerless motion capture of the human body.  ...  We have demonstrated the clear benefits and robustness of our approach by noticeably improving over existing multi-view markerless motion capture system.  ... 
arXiv:1808.01525v1 fatcat:wbf6dbtwvffuxd7gwvblnddif4

Rethinking Pose in 3D: Multi-stage Refinement and Recovery for Markerless Motion Capture

Denis Tome, Matteo Toso, Lourdes Agapito, Chris Russell
2018 2018 International Conference on 3D Vision (3DV)  
This novelty allows us to use provisional 3D models of human pose to rethink where the joints should be located in the image and to recover from past mistakes.  ...  We propose a CNN-based approach for multi-camera markerless motion capture of the human body.  ...  We have demonstrated the clear benefits and robustness of our approach by noticeably improving over existing multi-view markerless motion capture system.  ... 
doi:10.1109/3dv.2018.00061 dblp:conf/3dim/TomeTAR18 fatcat:frcm5gq2zjam7auxsgdgs4o26q

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [article]

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han
2022 arXiv   pre-print
BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes.  ...  It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.  ...  tasks (such as 3D object tracking and motion forecasting).  ... 
arXiv:2205.13542v2 fatcat:qtunylgozjcvrdrjzdk23xjpve

Self-Supervised Representation Learning from Flow Equivariance [article]

Yuwen Xiong, Mengye Ren, Wenyuan Zeng, Raquel Urtasun
2021 arXiv   pre-print
ego motion.  ...  Instead of learning view-invariant representation from simple images, humans learn representations in a complex world with changing scenes by observing object movement, deformation, pose variation, and  ...  and motion segmentation.  ... 
arXiv:2101.06553v2 fatcat:nkp477lmwndifpturbxukaasni

VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living [article]

Srijan Das, Rui Dai, Di Yang, Francois Bremond
2021 arXiv   pre-print
Because the recent 3D ConvNets are too rigid to capture the subtle visual patterns across an action, this research direction is dominated by methods combining RGB and 3D Poses.  ...  VPN++, with or without 3D Poses, outperforms the representative baselines on 4 public datasets. Code is available at https://github.com/srijandas07/vpnplusplus.  ...  Both the visual feature and spatial attention vectors are obtained by  ...  Finally, VPN is plugged into the 3D ConvNet for an end-to-end training with a regularized loss L which is a convex combination of  ... 
arXiv:2105.08141v1 fatcat:vtopa4qekbdelez6cyw4xk24t4

Universal-to-Specific Framework for Complex Action Recognition [article]

Peisen Zhao, Lingxi Xie, Ya Zhang, Qi Tian
2020 arXiv   pre-print
The universal network first learns universal feature representations.  ...  Inspired by a common flowchart based on the human decision-making process that first narrows down the probable classes and then applies a "rethinking" process for finer-level recognition, we propose an  ...  Early studies [1] - [4] started with classifying simple motion states such as "jumping" and "running", which can be easily achieved by directly extracting features from key frames [5] .  ... 
arXiv:2007.06149v1 fatcat:c2eyj7ony5hw7iljwfwccfarpa

Recent Advances of Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective

Wu Liu, Tao Mei
2022 ACM Computing Surveys  
In this paper, we provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem. Firstly, we comprehensively summarize the 2D and 3D representations of human body.  ...  Estimation of the human pose from a monocular camera has been an emerging research topic in the computer vision community with many applications.  ...  The specific pose of each frame could be better determined by exploring the motion dynamics in a motion sequence.  ... 
doi:10.1145/3524497 fatcat:4pbvntngrnfp7lqhcpjmy7p2fq

2021 Index IEEE Transactions on Image Processing Vol. 30

2021 IEEE Transactions on Image Processing  
Chen, C., +, TIP 2021 3995-4007 Rethinking Motion Representation: Residual Frames With 3D ConvNets.  ...  Motion Representation: Residual Frames With 3D ConvNets; TIP 2021 9231-9244 Tao, S., see Dong, W., TIP 2021 1030-1043 Tao, X., see Xu, M., TIP 2021 2087-2102 Tao, X., see Wang, J., TIP 2021 4225-4237  ... 
doi:10.1109/tip.2022.3142569 fatcat:z26yhwuecbgrnb2czhwjlf73qu

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition [article]

Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen
2020 arXiv   pre-print
Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion.  ...  Experiments are conducted on both word-level and sentence-level benchmarks with different characteristics.  ...  We would like to thank Chenhao Wang and Mingshuang Luo's extensive help with data processing.  ... 
arXiv:2003.03206v2 fatcat:7gmyhyka55dq3gwa6cgaybjs6i

Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search [article]

Yifan Jiang, Xinyu Gong, Junru Wu, Humphrey Shi, Zhicheng Yan, Zhangyang Wang
2021 arXiv   pre-print
Efficient video architecture is the key to deploying video recognition systems on devices with limited computing resources.  ...  This paper bypasses existing 2D architectures, and directly searched for 3D architectures in a fine-grained space, where block type, filter number, expansion ratio and attention block are jointly searched  ...  Two-stream 3-d convnet fusion for action recognition in videos with arbitrary size and length.  ... 
arXiv:2112.04710v1 fatcat:p3dxpblcx5f5xlohmas5odwu7e

Deep Point-wise Prediction for Action Temporal Proposal [article]

Luxuan Li, Tao Kong, Fuchun Sun, Huaping Liu
2019 arXiv   pre-print
The whole system is end-to-end trained with joint loss of temporal action proposal classification and location prediction.  ...  Previous works usually utilize (a) sliding window paradigms, or (b) per-frame action scoring and grouping to enumerate the possible temporal locations.  ...  Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo- 3d residual networks.  ... 
arXiv:1909.07725v1 fatcat:wycy5jadcve7ld37yugegllcku

Surround-View Cameras based Holistic Visual Perception for Automated Driving [article]

Varun Ravi Kumar
2022 arXiv   pre-print
Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions.  ...  We concentrate on the following issues in order to address them: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks  ...  As a result, there is a mismatch between the target and predicted frames for pixels with motion.  ... 
arXiv:2206.05542v1 fatcat:cdpn6afpvvf7hnsvry7cqbjq3u

Survey on Semantic Segmentation using Deep Learning Techniques

Fahad Lateef, Yassine Ruichek
2019 Neurocomputing  
The spatial feature maps of the region in single frame fed into LSTM, infers a relation with spatial features of equivalent regions in frames before that frame.  ...  appearance, depth and motion.  ... 
doi:10.1016/j.neucom.2019.02.003 fatcat:aelsfl7unvdw5j2rtyqhtgqrsm