Fusion and Orthogonal Projection for Improved Face-Voice Association [article]

Muhammad Saad Saeed, Muhammad Haris Khan, Shah Nawaz, Muhammad Haroon Yousaf, Alessio Del Bue
2021 arXiv pre-print
In this work, we hypothesize that an enriched feature representation coupled with an effective yet efficient supervision is necessary for realizing a discriminative joint embedding space for improved face-voice ... We study the problem of learning the association between faces and voices, which has lately been gaining interest in the computer vision community. ... towards realizing a discriminative joint embedding space for improved face-voice association. ...
arXiv:2112.10483v1 fatcat:au6vln3mczejbluzv6tiktz7ne
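
A minimal sketch of the kind of supervision the snippet describes: same-identity face-voice embedding pairs are pulled together while different-identity pairs are pushed toward orthogonality. This is an illustrative reading in PyTorch, not the authors' exact loss; all names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(face_emb, voice_emb, labels):
    """Toy face-voice supervision: align same-identity pairs (cosine -> 1)
    and push different-identity pairs toward orthogonality (cosine -> 0).
    face_emb, voice_emb: (N, D) embeddings; labels: (N,) identity ids."""
    f = F.normalize(face_emb, dim=1)
    v = F.normalize(voice_emb, dim=1)
    cos = f @ v.t()                                   # (N, N) cross-modal cosines
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = (1.0 - cos[same]).mean()                    # attract matching identities
    neg = cos[~same].abs().mean()                     # orthogonalize the rest
    return pos + neg
```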

Cross-modal Speaker Verification and Recognition: A Multilingual Perspective [article]

Muhammad Saad Saeed, Shah Nawaz, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Muhammad Haroon Yousaf, Alessio Del Bue
2021 arXiv pre-print
Recent years have seen a surge of interest in finding associations between faces and voices in cross-modal biometric applications, alongside speaker recognition. ... Inspired by this, we introduce a challenging task: establishing the association between faces and voices across multiple languages spoken by the same set of persons. ... embedding of face and voice to study face-voice association across multiple languages using the proposed dataset. ...
arXiv:2004.13780v2 fatcat:mm7bzzp5bfgpzos5ibsdapobq4
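
At test time, cross-modal verification of the sort studied here reduces to scoring one face embedding against one voice embedding and thresholding; the threshold is typically chosen at the equal-error-rate point on a development set. A hedged sketch, with the embeddings and threshold as assumptions:

```python
import numpy as np

def verify(face_emb: np.ndarray, voice_emb: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the face-voice pair as the same identity if the cosine
    similarity of the trained joint embeddings clears the threshold."""
    f = face_emb / np.linalg.norm(face_emb)
    v = voice_emb / np.linalg.norm(voice_emb)
    return float(f @ v) >= threshold
```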

Learnable PINs: Cross-modal Embeddings for Person Identity [chapter]

Arsha Nagrani, Samuel Albanie, Andrew Zisserman
2018 Lecture Notes in Computer Science  
We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. ... retrieval for identities unseen and unheard during training over a number of scenarios, and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically ... Fig. 1. Learning a joint embedding between faces and voices. ...
doi:10.1007/978-3-030-01261-8_5 fatcat:jvuptz3i2rezvi2rcjpoitoi4u
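
Once such a joint embedding exists, voice-to-face retrieval is a nearest-neighbor search in the shared space. A minimal sketch under that assumption (array shapes and names are hypothetical):

```python
import numpy as np

def retrieve_faces(query_voice: np.ndarray, face_gallery: np.ndarray, k: int = 5):
    """Rank an (M, D) gallery of face embeddings against a (D,) voice
    embedding by Euclidean distance in the joint space; return top-k indices."""
    dists = np.linalg.norm(face_gallery - query_voice, axis=1)
    return np.argsort(dists)[:k]
```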

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [article]

Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng
2022 arXiv pre-print
Previous works employ cross-modal instance-discrimination tasks to establish the correlation between voice and face. ... We present an approach to learning voice-face representations from talking-face videos, without any identity labels. ... Acknowledgments: This work is supported by the major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" project 2020AAA0104803. ...
arXiv:2204.14057v3 fatcat:o4fimnk2bvbatagqfsaslcqil4
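
Prototype contrast typically swaps instance-level negatives for cluster centroids. A rough NumPy/scikit-learn sketch of that idea, not the paper's exact formulation; the row pairing and all hyperparameters are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_contrast_loss(voice_emb, face_emb, n_prototypes=8, temp=0.1):
    """Cluster the face embeddings into prototypes, then train each voice
    embedding to pick out the prototype of its paired face (rows of the
    two (N, D) arrays are assumed to come from the same person/video)."""
    km = KMeans(n_clusters=n_prototypes, n_init=10).fit(face_emb)
    protos = km.cluster_centers_                       # (K, D) face prototypes
    targets = km.labels_                               # prototype id per paired face
    logits = voice_emb @ protos.T / temp               # (N, K) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()
```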

Machine Learning Based Robust Access for Multimodal Biometric Recognition

2020 International Journal of Recent Technology and Engineering
For each biometric we applied separate feature-extraction techniques, then combined those features efficiently to obtain a robust representation. ... Computational models for unimodal biometric scans are by now well established, but research into multimodal scans and their models has been gaining momentum recently. ... Joint sparse representation combines the different biometrics for efficient fusion; this machine-learning algorithm operates on three different biometrics. ...
doi:10.35940/ijrte.f2374.018520 fatcat:oav7tky7ujedzkf275lrvdxkxe
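
The simplest version of the feature combination the snippet gestures at is L2-normalized concatenation, so that no single biometric dominates the fused vector. A toy sketch; the per-modality extractors and variable names are hypothetical stand-ins:

```python
import numpy as np

def fuse_features(*modal_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: L2-normalize each modality's feature vector,
    then concatenate into one fused representation."""
    normed = [f / (np.linalg.norm(f) + 1e-9) for f in modal_feats]
    return np.concatenate(normed)

# e.g. fused = fuse_features(face_feat, voice_feat, fingerprint_feat)
```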

Multi-modal Multi-channel Target Speech Separation [article]

Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu
2020 arXiv pre-print
Also, under this framework, we investigate fusion methods for multi-modal joint modeling. ... and lip movements. ... Firstly, the joint training of the video and audio streams may not produce lip embeddings that are discriminative enough. ...
arXiv:2003.07032v1 fatcat:xghervhtvvckhjnxdqovj2k5ha
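
One common fusion method for this kind of separation network is frame-wise concatenation of acoustic features with visual (lip) embeddings upsampled to the audio frame rate. A hedged PyTorch sketch; the shapes and names are invented for illustration:

```python
import torch

def fuse_audio_visual(audio_feat: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
    """audio_feat: (B, T_a, D_a) frame-level acoustic features;
    lip_emb: (B, T_v, D_v) lip embeddings at the (slower) video rate.
    Upsample the lips to the audio frame rate, then concatenate per frame."""
    lips = torch.nn.functional.interpolate(
        lip_emb.transpose(1, 2), size=audio_feat.size(1), mode="nearest"
    ).transpose(1, 2)                               # (B, T_a, D_v)
    return torch.cat([audio_feat, lips], dim=-1)    # (B, T_a, D_a + D_v)
```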

Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception

Ruofei Ouyang, Bryan Kian Hsiang Low
2019 Autonomous Robots  
doi:10.1007/s10514-018-09826-z fatcat:67yqhwmgozccxni56rxmuapjgm

An End-to-End Text-Independent Speaker Identification System on Short Utterances

Ruifang Ji, Xinyuan Cai, Xu Bo
2018 Interspeech 2018  
For example, the GRU-learned feature reduces the equal error rate by 27.53% relative, and the speaker-identity subspace loss brings a further 7.22% relative reduction compared to the softmax loss. ... Experimental results demonstrate the effectiveness of our proposed system and its superiority over previous methods. ... Its efficiency and discriminative ability are shown in the experiments. Model and Approach: Our proposed architecture is presented in Fig. 1. ...
doi:10.21437/interspeech.2018-1058 dblp:conf/interspeech/JiCB18 fatcat:3welcoupuzegnaaszeyjxc3bvi
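
The quoted relative reductions compose multiplicatively if the 7.22% is taken against the GRU result (the snippet is ambiguous about the reference point). A quick check with a purely hypothetical 10% baseline EER:

```python
baseline_eer = 10.0                           # hypothetical softmax baseline, in %
after_gru = baseline_eer * (1 - 0.2753)       # 27.53% relative reduction -> 7.247%
after_subspace = after_gru * (1 - 0.0722)     # further 7.22% relative -> 6.724%
print(f"{after_gru:.3f}% -> {after_subspace:.3f}%")
```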

Deep Audio-visual Learning: A Survey

Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, Ran He
2021 International Journal of Automation and Computing  
We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual ... Abstract: Audio-visual learning, aimed at exploiting the relationship between the audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. ... [63] proposed a new joint embedding model that mapped the two modalities into a joint embedding space and then directly calculated the Euclidean distance between them for voice-face matching; Nagrani et al. ...
doi:10.1007/s11633-021-1293-0 fatcat:an5lfyf4m5fh7mlngmdcbx7joy
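
The joint-embedding approach the snippet attributes to [63] decides voice-face matching by Euclidean distance. A toy version of the standard forced-choice matching task under that reading; the embeddings are assumed inputs:

```python
import numpy as np

def match_voice_to_face(voice, face_a, face_b) -> int:
    """Given one voice embedding and two candidate face embeddings in the
    same joint space, return 0 or 1 for the Euclidean-closer face."""
    d = [np.linalg.norm(voice - face_a), np.linalg.norm(voice - face_b)]
    return int(np.argmin(d))
```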

Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation

Chih-Chun Yang, Wan-Cyuan Fan, Cheng-Fu Yang, Yu-Chiang Frank Wang
2022 Proceedings of the AAAI Conference on Artificial Intelligence
(ASR/VSR) but also for manipulating data within/across modalities.  ...  solution that is able to jointly tackle the aforementioned audio-visual learning tasks.  ...  Acknowledgements This work is supported in part by the Ministry of Science and Technology of Taiwan under grants MOST 110-2221-E-002-121 and 110-2634-F-002-052.  ... 
doi:10.1609/aaai.v36i3.20210 fatcat:7nw3ixacvfdqrott6gh7fclx5m

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement [article]

Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha
2022 arXiv pre-print
Therefore, to bridge the multi-modal associations in audio-visual tasks, a unified framework is proposed that achieves target speaker detection and speech enhancement with joint learning of ... More recent studies have shown that establishing a cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. ... This also demonstrates that the idea of joint training for these two tasks is beneficial for establishing more reliable cross-task associations. ...
arXiv:2203.02216v2 fatcat:4dowhemn5bburltfwcjjgeohti
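
Joint learning of the two tasks usually comes down to optimizing a weighted sum of per-task losses over shared audio-visual features. A minimal sketch; the weighting is a hypothetical hyperparameter, not the paper's value:

```python
def joint_loss(detection_loss, enhancement_loss, lam: float = 0.5):
    """Single objective for jointly training active-speaker detection and
    speech enhancement; lam trades the two tasks off against each other."""
    return detection_loss + lam * enhancement_loss
```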

Deep Multimodal Emotion Recognition on Human Speech: A Review

Panagiotis Koromilas, Theodoros Giannakopoulos
2021 Applied Sciences  
In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results on the reported methodologies.  ...  Finally, we conclude this work with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness.  ...  for face movement recognition.  ... 
doi:10.3390/app11177962 fatcat:cezjfmjmvbgapo3tdz5j3iecp4

Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey [article]

Hareesh Mandalapu, Aravinda Reddy P N, Raghavendra Ramachandra, K Sreenivasa Rao, Pabitra Mitra, S R Mahadeva Prasanna, Christoph Busch
2021 arXiv pre-print
Among the classically used biometrics, voice and face attributes are the most promising for prevalent day-to-day applications because they are easy to obtain through restrained and user-friendly ... The pervasiveness of low-cost audio and face capture sensors in smartphones, laptops, and tablets has made the advantage of voice and face biometrics more pronounced when compared to other biometrics. ... The features used are DCT-mod2 for the face and MFCCs for the voice. ...
arXiv:2101.09725v1 fatcat:huejyfaeojhzddlckqt5nfivlq
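
The voice features named in the snippet, MFCCs, are straightforward to reproduce; a sketch using librosa (the file path and coefficient count are arbitrary choices):

```python
import librosa

# Load speech (librosa resamples to 22.05 kHz by default) and compute
# 13 Mel-frequency cepstral coefficients per frame.
y, sr = librosa.load("speech.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```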

Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey

Hareesh Mandalapu, Aravinda Reddy P N, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, S. R. Mahadeva Prasanna, Christoph Busch
2021 IEEE Access  
Among the classically used biometrics, voice and face attributes are the most promising for prevalent day-to-day applications because they are easy to obtain through restrained and user-friendly ... The pervasiveness of low-cost audio and face capture sensors in smartphones, laptops, and tablets has made the advantage of voice and face biometrics more pronounced when compared to other biometrics. ... The features used are DCT-mod2 for the face and MFCCs for the voice. ...
doi:10.1109/access.2021.3063031 fatcat:q6emam55frhlzp53t7lxb4qz3e

Large-scale multilingual audio visual dubbing [article]

Yi Yang, Brendan Shillingford, Yannis Assael, Miaosen Wang, Wendi Liu, Yutian Chen, Yu Zhang, Eren Sezener, Luis C. Cobo, Misha Denil, Yusuf Aytar, Nando de Freitas
2020 arXiv pre-print
We describe a system for large-scale audiovisual translation and dubbing, which translates videos from one language to another.  ...  The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech using the original speaker's voice.  ...  We would like to thank Martin Aguinis, Paige Bailey, and Laurence Moroney for their permission to use their Tensorflow tutorial videos to demonstrate our end-to-end video dubbing approach, and for allowing  ... 
arXiv:2011.03530v1 fatcat:xzgcifgigvakhetruzsseal36i
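
The snippet lays out the pipeline explicitly: transcribe, translate, synthesize in the original speaker's voice, then re-render the video. A skeletal sketch of that flow; every function below is a hypothetical placeholder for a trained model, not an API from the paper:

```python
def transcribe(audio, lang):         # ASR stage
    raise NotImplementedError

def translate(text, src, tgt):       # machine-translation stage
    raise NotImplementedError

def synthesize(text, voice_ref):     # TTS conditioned on the original voice
    raise NotImplementedError

def resync_lips(video, speech):      # re-render mouth motion to the new audio
    raise NotImplementedError

def dub_video(video, audio, src_lang: str, tgt_lang: str):
    """End-to-end audiovisual dubbing, following the stages described above."""
    text = transcribe(audio, lang=src_lang)
    translated = translate(text, src=src_lang, tgt=tgt_lang)
    speech = synthesize(translated, voice_ref=audio)
    return resync_lips(video, speech)
```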
Showing results 1 — 15 out of 15,746 results