1,610 Hits in 4.2 sec

DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition [article]

Nuno C. Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, Stan Sclaroff
2019 arXiv   pre-print
We introduce a novel Distillation Multiple Choice Learning framework for multimodal data, where different modality networks learn in a cooperative setting from scratch, strengthening one another.  ...  We evaluate this approach on three video action recognition benchmark datasets. We obtain state-of-the-art results in comparison to other approaches that work with missing modalities at test time.  ...  competitive to or state-of-the-art results compared to the privileged information literature, and significantly higher accuracy compared to independently trained modality networks for human action recognition  ... 
arXiv:1912.10982v1 fatcat:aplmcrqnufai7mjrf4rgqzcw2u

Modality Distillation with Multiple Stream Networks for Action Recognition [article]

Nuno Garcia, Pietro Morerio, Vittorio Murino
2018 arXiv   pre-print
This paper presents a new approach for multimodal video action recognition, developed within the unified frameworks of distillation and privileged information, named generalized distillation.  ...  We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, the NTU RGB+D.  ...  multimodal dataset for video action recognition, the NTU RGB+D [9].  ... 
arXiv:1806.07110v2 fatcat:pdnky2bningcnpipoxiql5q5tq

Learning from Temporal Gradient for Semi-supervised Action Recognition [article]

Junfei Xiao, Longlong Jing, Lin Zhang, Ju He, Qi She, Zongwei Zhou, Alan Yuille, Yingwei Li
2022 arXiv   pre-print
Semi-supervised video action recognition tends to enable deep neural networks to achieve remarkable performance even with very limited labeled data.  ...  The performance of semi-supervised action recognition is significantly improved without additional computation or parameters during inference.  ...  This document contains the supplementary materials for "Learning from Temporal Gradient for Semi-supervised Action Recognition".  ... 
arXiv:2111.13241v3 fatcat:on2wn5j5nffptbsxmwiilyvfjq

Learning with privileged information via adversarial discriminative modality distillation [article]

Nuno C. Garcia, Pietro Morerio, Vittorio Murino
2018 arXiv   pre-print
We report state-of-the-art results on object classification on the NYUD dataset and video action recognition on the largest multimodal dataset available for this task, the NTU RGB+D, as well as on the  ...  We propose a new approach to train a hallucination network that learns to distill depth information via adversarial learning, resulting in a clean approach without several losses to balance or hyperparameters  ...  ACKNOWLEDGMENTS The authors would like to thank Riccardo Volpi for useful discussion on adversarial training and GANs.  ... 
arXiv:1810.08437v1 fatcat:emrj23ga3ngprlp2zmtxvms3qy

Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [article]

Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha, Francois Bremond
2021 arXiv   pre-print
The datasets for the task generally have multiple modalities like video, audio, language and bio-signals. In this paper, we propose a flexible model for the task which exploits all available data.  ...  Cross-attention using transformers has become popular in recent times and is utilised for fusion of different modalities.  ...  This is one reason that multimodal learning is very popular in this domain. First Impressions v2 is a multimodal dataset for personality recognition and is used in this work.  ... 
arXiv:2112.12180v1 fatcat:ixsmn33l3rbtnaotz5bhksc2pq

MyoTac: Real-Time Recognition of Tactical Sign Language Based on Lightweight Deep Neural Network

Huiyong Li, Yifan Zhang, Qian Cao, Hugo Landaluce
2022 Wireless Communications and Mobile Computing  
When dealing with new users, MyoTac achieves an average accuracy of 92.67% and the average recognition time is 2.81 ms.  ...  In this paper, we present MyoTac, a user-independent real-time tactical sign language classification system that makes the network lightweight through knowledge distillation, so as to balance between high  ...  For the distillation intensity value T for distilling out the soft target, the optimal value was not obtained.  ... 
doi:10.1155/2022/2774430 fatcat:qxytnpoqmnakdpgccjoqxr33ky
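The distillation referred to in this entry hinges on a temperature T that softens the teacher's outputs into "soft targets" for the lightweight student. As a minimal, framework-free sketch of that mechanism (the function names and toy logits are illustrative, not MyoTac's actual code):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: larger T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened targets and the
    student's softened predictions, scaled by T^2 (Hinton-style KD)."""
    soft_targets = softmax(teacher_logits, T)
    student_probs = softmax(student_logits, T)
    return -sum(p * math.log(q)
                for p, q in zip(soft_targets, student_probs)) * T * T
```

At high T the teacher's near-zero classes receive non-trivial probability mass, which is what carries extra information to the student; this is why the choice of T matters and why the abstract notes that an optimal value was not found.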

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [article]

Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny (+1 others)
2021 arXiv   pre-print
To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities.  ...  Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities  ...  Acknowledgments: We thank IBM for the donation to MIT of the Satori GPU cluster. This work is supported by IARPA via DOI/IBC contract number D17PC00341.  ... 
arXiv:2104.12671v3 fatcat:3sgcrya54ndrto3xabpwszw3ra
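The instance-level contrastive learning this entry builds on is typically an InfoNCE-style loss that pulls paired embeddings from two modalities together and pushes unpaired ones apart. A minimal sketch in pure Python with toy 2-D embeddings (the function names are illustrative, not the paper's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor_emb, candidate_embs, pos_index, tau=0.07):
    """InfoNCE: negative log-probability of the matching candidate
    under a temperature-scaled softmax over cosine similarities."""
    sims = [cosine(anchor_emb, c) / tau for c in candidate_embs]
    m = max(sims)  # log-sum-exp stabilisation
    log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
    return -(sims[pos_index] - log_denom)
```

The clustering step the authors add operates on top of such embeddings, grouping semantically similar instances across modalities instead of treating every other instance as a negative.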

A Review on Methods and Applications in Multimodal Deep Learning [article]

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
2022 arXiv   pre-print
Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning.  ...  The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities.  ...  for the multiple-choice format.  ... 
arXiv:2202.09195v1 fatcat:wwxrmrwmerfabbenleylwmmj7y

Bootstrapped Representation Learning for Skeleton-Based Action Recognition [article]

Olivier Moliner, Sangxia Huang, Kalle Åström
2022 arXiv   pre-print
In this work, we study self-supervised representation learning for 3D skeleton-based action recognition.  ...  We also introduce a multi-viewpoint sampling method that leverages multiple viewing angles of the same action captured by different cameras.  ...  The choice of data augmentation is thus critical for learning representations that are semantically relevant for action recognition. We use the following augmentation strategies: Shear.  ... 
arXiv:2202.02232v2 fatcat:ieolhx425fgczisnakhgvlmbva

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [article]

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
2021 arXiv   pre-print
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures.  ...  We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification  ...  Acknowledgements: We would like to thank Min-Hsuan Tsai, Jean-Baptise Alayrac, Andrew Audibert, Yeqing Li, Vidush Mukund, and the TensorFlow team for their help with codes, infrastructure, and insightful  ... 
arXiv:2104.11178v3 fatcat:whpa4ulskfcttgiylo4q24vnpe

A Review of Deep Learning-based Human Activity Recognition on Benchmark Video Datasets

Vijeta Sharma, Manjari Gupta, Anil Kumar Pandey, Deepti Mishra, Ajai Kumar
2022 Applied Artificial Intelligence  
We propose a new taxonomy for categorizing the literature as CNN and RNN-based approaches.  ...  This paper aims to present a comparative review of vision-based human activity recognition with the main focus on deep learning techniques on various benchmark video datasets comprehensively.  ...  (Zhu et al. 2016) also examined both handcrafted and learning-based approaches for action recognition.  ... 
doi:10.1080/08839514.2022.2093705 fatcat:6on4g3sp3vaktnyyrk72k4mqta

Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition [article]

Didan Deng, Liang Wu, Bertram E. Shi
2021 arXiv   pre-print
We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation.  ...  From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors.  ...  For multimodal training, we trained the models for 15 epochs and decreased the learning rate by a factor of 10 after every 4 epochs. Metrics Emotion metrics.  ... 
arXiv:2108.04228v2 fatcat:zlqous3inbdnnlvmvzhsvwtqsy
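The training recipe quoted in this entry (decrease the learning rate by a factor of 10 after every 4 epochs) is a standard step schedule. A one-line sketch, assuming the hypothetical helper name `step_lr`:

```python
def step_lr(base_lr, epoch, step=4, factor=10.0):
    """Step decay: divide the learning rate by `factor`
    once every `step` completed epochs."""
    return base_lr / (factor ** (epoch // step))

# With base_lr=1e-3 over 15 epochs: 1e-3 (epochs 0-3), 1e-4 (4-7),
# 1e-5 (8-11), 1e-6 (12-14).
```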

Multi-Modal Pre-Training for Automated Speech Recognition [article]

David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister
2021 arXiv   pre-print
Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance.  ...  In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which  ...  Multimodal Pre-Training While many methods for learning multimodal representations focus on self-supervised learning with a contrastive objective, our proposed method, AV-BERT, differs in that it uses  ... 
arXiv:2110.09890v1 fatcat:pv3preijmzbffiwd73ceqyxja4

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [article]

Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
2021 arXiv   pre-print
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.  ...  To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context  ...  There is a rich literature of seminal works in action recognition innovating temporal sampling [32, 64, 76], multiple streams [18], spatio-temporal modelling [22, 42, 62] or modelling actions as  ... 
arXiv:2111.01024v1 fatcat:2mui57jljzabnowww62ghovbca

Feature-Supervised Action Modality Transfer [article]

Fida Mohammad Thoker, Cees G. M. Snoek
2021 arXiv   pre-print
This paper strives for action recognition and detection in video modalities like RGB, depth maps or 3D-skeleton sequences when only limited modality-specific labeled examples are available.  ...  They have become the de facto pre-training choice when recognizing or detecting new actions from RGB datasets that have limited amounts of labeled examples available.  ...  Modalities for Action Recognition Modern action recognition, e.g., [2], [3], [20]-[22] relies on deep (2D or 3D) CNN architectures that learn to classify human actions from video data.  ... 
arXiv:2108.03329v1 fatcat:mgyviin3fnbwfbjuzj32j2mqui
Showing results 1 — 15 out of 1,610 results