
Multi-modal Egocentric Activity Recognition using Audio-Visual Features [article]

Mehmet Ali Arabacı, Fatih Özkan, Elif Surer, Peter Jančovič, Alptekin Temizel
2019 arXiv   pre-print
In this work, we propose a new framework for the egocentric activity recognition problem based on combining audio-visual features with multi-kernel learning (MKL) and multi-kernel boosting (MKBoost).  ...  The proposed framework was evaluated on a number of egocentric datasets. The results showed that using multi-modal features with MKL outperforms existing methods.  ...  CONCLUSION In this work, we proposed a new framework for the egocentric activity recognition problem based on audio-visual features combined with multi-kernel learning classification.  ... 
arXiv:1807.00612v2 fatcat:6bdk35purrfgnlraheavbbuodi
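
As a rough illustration of the multi-kernel fusion idea in the entry above, the sketch below combines per-modality RBF kernels with fixed weights and feeds the resulting precomputed kernel to an SVM. Feature names, dimensions, and the fixed weights are assumptions; the paper itself learns the kernel combination via MKL/MKBoost rather than fixing it.

```python
# Minimal sketch of fixed-weight multi-kernel audio-visual fusion for activity
# classification. Illustrative only: the cited work learns the kernel weights.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical pre-extracted per-clip features (names/dimensions are assumptions).
audio_train, audio_test = rng.normal(size=(80, 64)), rng.normal(size=(20, 64))
visual_train, visual_test = rng.normal(size=(80, 128)), rng.normal(size=(20, 128))
y_train = rng.integers(0, 4, size=80)

def combined_kernel(a1, v1, a2, v2, w_audio=0.4, w_visual=0.6):
    """Weighted sum of per-modality RBF kernels (a convex combination)."""
    return w_audio * rbf_kernel(a1, a2) + w_visual * rbf_kernel(v1, v2)

K_train = combined_kernel(audio_train, visual_train, audio_train, visual_train)
K_test = combined_kernel(audio_test, visual_test, audio_train, visual_train)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test)[:5])
```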

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
2019 2019 IEEE/CVF International Conference on Computer Vision (ICCV)  
Second, we present the first audio-visual fusion attempt in egocentric action recognition.  ...  In this work, we explore audio as a prime modality to provide complementary information to visual modalities (appearance and motion) in egocentric action recognition.  ... 
doi:10.1109/iccv.2019.00559 dblp:conf/iccv/KazakosNZD19 fatcat:65kqse2knvesnbp7emjfr27sze

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition [article]

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
2019 arXiv   pre-print
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets  ...  We demonstrate the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects.  ...  Second, we present the first audio-visual fusion attempt in egocentric action recognition.  ... 
arXiv:1908.08498v1 fatcat:n2em7rlljnagbeoglorp65y4by
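
A minimal sketch of the temporal-binding idea from the two EPIC-Fusion entries above: each modality is sampled at its own offset inside a shared temporal window before mid-level fusion. The feature dimensions, window size, and single-layer fusion head are assumptions for illustration; the paper builds this on per-modality TSN-style backbones.

```python
# Sketch: sample each modality at an independent offset within a shared
# "temporal binding window", then fuse mid-level features for classification.
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    def __init__(self, dims=(256, 256, 128), n_classes=10, window=5):
        super().__init__()
        self.window = window
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims), 512), nn.ReLU(), nn.Linear(512, n_classes)
        )

    def forward(self, rgb, flow, audio):
        # Each input: (batch, time, feat). Pick a window start, then an
        # independent offset per modality inside that window.
        B, T, _ = rgb.shape
        start = torch.randint(0, T - self.window + 1, (1,)).item()
        feats = []
        for seq in (rgb, flow, audio):
            offset = start + torch.randint(0, self.window, (1,)).item()
            feats.append(seq[:, offset])
        return self.classifier(torch.cat(feats, dim=-1))

model = TemporalBindingFusion()
logits = model(torch.randn(4, 16, 256), torch.randn(4, 16, 256), torch.randn(4, 16, 128))
print(logits.shape)  # torch.Size([4, 10])
```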

OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [article]

Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, Bernard Ghanem
2022 arXiv   pre-print
However, current TAL methods only use visual signals, neglecting the audio modality that exists in most videos and that shows meaningful action information in egocentric videos.  ...  leverage audio-visual information and context for egocentric TAL.  ...  The visual modality tokens are used as Q, and the audio modality tokens are used as K and V.  ... 
arXiv:2202.04947v2 fatcat:i4e25herxrhhtf7ks3j25ypcwy
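
The OWL snippet above names a cross-modal attention pattern: visual tokens act as queries (Q) while audio tokens provide keys and values (K, V). The sketch below shows that pattern with a standard multi-head attention layer; the dimensions and the single layer are assumptions, and the paper embeds this inside a larger audiovisual model for temporal action localization.

```python
# Cross-modal attention: Q from the visual stream, K and V from the audio stream.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

visual_tokens = torch.randn(2, 32, d_model)  # (batch, visual snippets, dim)
audio_tokens = torch.randn(2, 48, d_model)   # (batch, audio snippets, dim)

fused, attn_weights = cross_attn(query=visual_tokens,
                                 key=audio_tokens,
                                 value=audio_tokens)
print(fused.shape)         # torch.Size([2, 32, 256]) -- one fused token per visual query
print(attn_weights.shape)  # torch.Size([2, 32, 48])
```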

EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset

Curtis Northcutt, Shengxin Zha, Steven Lovegrove, Richard Newcombe
2020 IEEE Transactions on Pattern Analysis and Machine Intelligence  
EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives.  ...  Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective  ...  Multi-modal AI datasets include AVA-ActiveSpeaker, an audio-visual dataset for speaker detection (Roth et al., 2019) , VGG lip reading dataset, an audio-visual dataset for speech recognition and separation  ... 
doi:10.1109/tpami.2020.3025105 pmid:32946385 fatcat:7nsepwlm6ng5bfuv6zxyiduuku

Domain Generalization through Audio-Visual Relative Norm Alignment in First Person Action Recognition [article]

Mirco Planamente, Chiara Plizzari, Emanuele Alberti, Barbara Caputo
2021 arXiv   pre-print
In this work, we introduce the first domain generalization approach for egocentric activity recognition, by proposing a new audio-visual loss, called Relative Norm Alignment loss.  ...  It re-balances the contributions from the two modalities during training, over different domains, by aligning their feature norm representations.  ...  The work was partially supported by the ERC project N. 637076 RoboExNovo and the research herein was carried out using the IIT HPC infrastructure.  ... 
arXiv:2110.10101v1 fatcat:ug6mjvwrvjekpppygvoceyzcb4
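
One plausible reading of the Relative Norm Alignment abstract above is a penalty on the mismatch between the mean L2 feature norms of the two modalities ("aligning their feature norm representations"). The sketch below implements that reading; the exact formulation and weighting used in the paper may differ.

```python
# Illustrative relative norm alignment term: penalize unbalanced feature
# magnitudes between the audio and visual streams.
import torch

def relative_norm_alignment(audio_feats, visual_feats):
    """audio_feats, visual_feats: (batch, dim) embeddings from the two streams."""
    audio_norm = audio_feats.norm(p=2, dim=1).mean()
    visual_norm = visual_feats.norm(p=2, dim=1).mean()
    # Deviation of the norm ratio from 1 indicates one modality dominating.
    return (audio_norm / visual_norm - 1.0) ** 2

loss = relative_norm_alignment(torch.randn(8, 256), torch.randn(8, 256) * 3.0)
print(loss.item())  # large when one modality's features dominate in magnitude
```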

Seeing and Hearing Egocentric Actions: How Much Can We Learn? [article]

Alejandro Cartas and Jordi Luque and Petia Radeva and Carlos Segura and Mariella Dimiccoli
2019 arXiv   pre-print
In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information.  ...  In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose.  ...  Attention mechanisms for action recognition using audio as a modality branch are proposed in [30, 31].  ... 
arXiv:1910.06693v1 fatcat:tmofzojtuvh23jdjwx4hvtgkbi

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, Mariella Dimiccoli
2019 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)  
In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information.  ...  In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose.  ...  Attention mechanisms for action recognition using audio as a modality branch are proposed in [30, 31].  ... 
doi:10.1109/iccvw.2019.00548 dblp:conf/iccvw/CartasLRSD19 fatcat:5nriiccjqzeezkfqmovohehgee

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [article]

Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
2021 arXiv   pre-print
In egocentric videos, actions occur in quick succession.  ...  Our ablations showcase the advantage of utilising temporal context as well as incorporating the audio modality and a language model to rescore predictions.  ...  We use w learnt absolute positional encodings, shared between the audio and visual features, to model corresponding inputs from the two modalities.  ... 
arXiv:2111.01024v1 fatcat:2mui57jljzabnowww62ghovbca
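
The last snippet above mentions w learnt absolute positional encodings shared between the audio and visual features. A minimal sketch of that idea follows, with an assumed window size and dimensionality: tokens from the two modalities at the same position in the temporal context receive the same learnt encoding.

```python
# One learnt positional embedding table, reused for both modality streams.
import torch
import torch.nn as nn

w, d_model = 9, 256                      # w actions of temporal context (assumed)
pos_embed = nn.Embedding(w, d_model)     # one learnt encoding per position

positions = torch.arange(w)              # 0 .. w-1
visual_tokens = torch.randn(4, w, d_model) + pos_embed(positions)
audio_tokens = torch.randn(4, w, d_model) + pos_embed(positions)  # same table reused

# The position-tagged audio and visual tokens can then be concatenated along
# the sequence axis and fed to a transformer encoder.
sequence = torch.cat([visual_tokens, audio_tokens], dim=1)
print(sequence.shape)  # torch.Size([4, 18, 256])
```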

Multimodal Deep Learning for Group Activity Recognition in Smart Office Environments

George Albert Florea, Radu-Casian Mihailescu
2020 Future Internet  
In this paper we investigate the problem of group activity recognition in office environments using a multimodal deep learning approach, by fusing audio and visual data from video.  ...  First, we extract a joint audiovisual feature representation for activity recognition, and second, we account for the temporal dependencies in the video in order to complete the classification task.  ...  Feature-level fusion is applied in [14] by combining audio-visual features with multi-kernel learning and multi-kernel boosting.  ... 
doi:10.3390/fi12080133 fatcat:hckhaqzlzjbkfhskvgvhbgodr4
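
The abstract above describes a two-step recipe: build a joint audio-visual feature per time step, then model temporal dependencies before classifying the activity. Below is a minimal sketch under assumed dimensions, using a GRU for the temporal step; the paper evaluates several fusion and temporal variants, so treat this only as an illustration of the structure.

```python
# Joint audio-visual feature per time step, followed by temporal modelling.
import torch
import torch.nn as nn

class JointAVTemporalClassifier(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256, n_classes=5):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + audio_dim, hidden)      # joint AV feature
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)   # temporal context
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, visual_seq, audio_seq):
        # visual_seq: (batch, time, visual_dim), audio_seq: (batch, time, audio_dim)
        joint = torch.relu(self.fuse(torch.cat([visual_seq, audio_seq], dim=-1)))
        _, last_hidden = self.temporal(joint)
        return self.head(last_hidden[-1])          # classify from final GRU state

model = JointAVTemporalClassifier()
print(model(torch.randn(2, 30, 512), torch.randn(2, 30, 128)).shape)  # (2, 5)
```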

Online Cross-Modal Adaptation for Audio–Visual Person Identification With Wearable Cameras

Alessio Brutti, Andrea Cavallaro
2016 IEEE Transactions on Human-Machine Systems  
We propose an audio-visual target identification approach for egocentric data with cross-modal model adaptation.  ...  Importantly, unlike traditional audio-visual integration methods, the proposed approach is also useful for temporal intervals during which only one modality is available or when different modalities are  ...  In summary, our main contributions are the following. 1) We address the multi-modal target identification task for egocentric applications with wearable audio-visual devices.  ... 
doi:10.1109/thms.2016.2620110 fatcat:goygftl3dfdcbmpvk2vuzbu5mi
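
The entry above stresses that fusion should stay usable in intervals where only one modality is available. A minimal score-level sketch of that behaviour follows; the weighting and score convention are assumptions, and the paper's online cross-modal model adaptation is not shown.

```python
# Score-level audio-visual identification that falls back to whichever
# modality is available in the current temporal interval.
import numpy as np

def fuse_identity_scores(audio_scores=None, visual_scores=None, w_audio=0.5):
    """Each argument: per-identity match scores (higher = better), or None if
    that modality is unavailable in the current interval."""
    if audio_scores is None and visual_scores is None:
        raise ValueError("at least one modality is required")
    if audio_scores is None:
        return np.asarray(visual_scores)
    if visual_scores is None:
        return np.asarray(audio_scores)
    return w_audio * np.asarray(audio_scores) + (1 - w_audio) * np.asarray(visual_scores)

# Both modalities present vs. an audio-only interval (e.g. target out of view).
print(np.argmax(fuse_identity_scores([0.2, 0.7, 0.1], [0.6, 0.3, 0.1])))  # fused
print(np.argmax(fuse_identity_scores([0.2, 0.7, 0.1], None)))             # audio only
```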

E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition [article]

Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, Barbara Caputo
2022 arXiv   pre-print
In this paper, we show that event data is a very valuable modality for egocentric action recognition.  ...  These characteristics make them a perfect fit for several real-world applications such as egocentric action recognition on wearable devices, where fast camera motion and limited power challenge traditional  ...  videos accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and multi-view videos.  ... 
arXiv:2112.03596v3 fatcat:xw5b3c7p4zfs7dfm7r5jkk6vb4

Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals

Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, Li Fei-Fei
2017 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
This study can lead to new applications such as a visual calorie counter.  ...  We use heart rate signals as privileged self-supervision to derive energy expenditure in a training stage. A multitask objective is used to jointly optimize the two tasks.  ...  Here, we focus on activity recognition using egocentric video and wearable sensors.  ... 
doi:10.1109/cvpr.2017.721 dblp:conf/cvpr/NakamuraYAF17 fatcat:zlqt4iaj2nc2jivjda2nafmkkm
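
The abstract above describes a multitask objective in which heart-rate-derived energy expenditure serves as privileged, training-only supervision alongside activity recognition. Below is a minimal sketch under assumed dimensions and loss weighting; the backbone, heads, and weighting in the paper differ.

```python
# Shared backbone with two heads: activity classification (cross-entropy) and
# energy-expenditure regression (MSE), where the energy target comes from
# heart rate and is only used at training time.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim=256, n_activities=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        self.activity_head = nn.Linear(feat_dim, n_activities)
        self.energy_head = nn.Linear(feat_dim, 1)   # energy expenditure regression

    def forward(self, x):
        h = self.backbone(x)
        return self.activity_head(h), self.energy_head(h).squeeze(-1)

model = MultiTaskHead()
features = torch.randn(16, 128)                       # egocentric clip features
activity_labels = torch.randint(0, 10, (16,))
energy_targets = torch.rand(16) * 10                  # derived from heart rate

activity_logits, energy_pred = model(features)
loss = nn.functional.cross_entropy(activity_logits, activity_labels) \
     + 0.5 * nn.functional.mse_loss(energy_pred, energy_targets)
print(loss.item())
```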

EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments [article]

Jacob Donley, Vladimir Tourbabin, Jung-Suk Lee, Mark Broyles, Hao Jiang, Jie Shen, Maja Pantic, Vamsi Krishna Ithapu, Ravish Mehra
2021 arXiv   pre-print
The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech  ...  In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an  ...  , direction of arrival estimators, multi-channel beamforming algorithms, single-channel audio and audio-visual speech enhancement algorithms, automatic speech recognition algorithms, and more.  ... 
arXiv:2107.04174v2 fatcat:owdguaovsnd67n57vm6l253jn4