An Empirical Analysis of Deep Audio-Visual Models for Speech Recognition [article]

Devesh Walawalkar, Yihui He, Rohit Pillai
2018 arXiv   pre-print
In this project, we worked on speech recognition, specifically predicting individual words based on both the video frames and audio.  ...  Empowered by convolutional neural networks, the recent speech recognition and lip reading models are comparable to human level performance.  ...  We re-implemented and made derivations of the state-of-the-art model presented in [28].  ... 
arXiv:1812.09336v1 fatcat:grljy67llre2lo3fpqx5tdheiu

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu
2015 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)  
This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance  ...  In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. It is found that VAD is useful in both audio and visual modalities, for better lipreading and AVSR.  ...  As one of the methods to compensate for the degradation, Audio-Visual Speech Recognition (AVSR), namely bimodal or multi-modal speech recognition, has been studied for a couple of decades.  ... 
doi:10.1109/apsipa.2015.7415335 dblp:conf/apsipa/TamuraNKOITH15 fatcat:jux4kcgmnjhl7k5hv43wjs5tcu

Investigation of DNN-Based Audio-Visual Speech Recognition

Satoshi TAMURA, Hiroshi NINOMIYA, Norihide KITAOKA, Shin OSUGA, Yurie IRIBE, Kazuya TAKEDA, Satoru HAYAMIZU
2016 IEICE transactions on information and systems  
Audio-Visual Speech Recognition (AVSR) is one of techniques to enhance robustness of speech recognizer in noisy or real environments.  ...  There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach an emission probability on each Hidden Markov Model (HMM) state is computed  ...  .), for his support. A part of this work was supported by JSPS KAKENHI Grant Number 25730109.  ... 
doi:10.1587/transinf.2016slp0019 fatcat:7uvp7jui7jdmjdw33dlocwjom4

Audio-visual speech recognition using deep learning

Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata
2014 Applied intelligence (Boston)  
Audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise.  ...  This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is utilized for acquiring noise-robust audio features.  ...  Acknowledgments This work has been supported by JST PRESTO "Information Environment and Humans" and MEXT Grant-in-Aid for Scientific Research on Innovative Areas "Constructive Developmental Science" (24119003  ... 
doi:10.1007/s10489-014-0629-7 fatcat:jirfvfejibdarkcbnp72flc4by

Deep Learning for Visual Speech Analysis: A Survey [article]

Changchong Sheng, Gangyao Kuang, Liang Bai, Chenping Hou, Yulan Guo, Xin Xu, Matti Pietikäinen, Li Liu
2022 arXiv   pre-print
To push forward future research on visual speech, this paper aims to present a comprehensive review of recent progress in deep learning methods on visual speech analysis.  ...  As a powerful AI strategy, deep learning techniques have extensively promoted the development of visual speech learning.  ...  One of the most common tasks is audio-visual speech recognition (AVSR), a speech recognition technology that uses visual and audio information.  ... 
arXiv:2205.10839v1 fatcat:l5m4ohtcvnevrliaiwawg3phjq

A Systematic Study and Empirical Analysis of Lip Reading Models using Traditional and Deep Learning Algorithms

R Sangeetha, D. Malathi
2022 Journal of advanced applied scientific research  
In recent years there has been a lot of interest in Deep Neural Networks (DNN) and breakthrough results in various domains including Image Classification, Speech Recognition and Natural Language Processing  ...  Modelling of the framework has been playing a major role in advance yield of sequential framework.  ...  The only source of communication is conversation. The research in lip reading paved a way for Audio-Visual Automatic Speech Recognition (AV-ASR) systems using deep learning.  ... 
doi:10.46947/joaasr412022231 fatcat:33677stv6jehxleiqa57cuk6tu

Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Fei Ma, Wei Zhang, Yang Li, Shao-Lun Huang, Lin Zhang
2020 Applied Sciences  
To address this issue, we propose a deep learning approach in order to efficiently utilize common information for audio-visual emotion recognition by correlation analysis.  ...  Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired in the expression of emotions.  ...  Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition.  ... 
doi:10.3390/app10207239 fatcat:iqfdpdejwfhdvcjtu47hskgtt4

Deep Multimodal Learning for Audio-Visual Speech Recognition [article]

Youssef Mroueh, Etienne Marcheret, Vaibhava Goel
2015 arXiv   pre-print
In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR).  ...  While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83% demonstrating  ...  In Audio-Visual Automatic Speech Recognition (AV-ASR), both audio recordings and videos of the person talking are available at training time.  ... 
arXiv:1501.05396v1 fatcat:anbp47zv5vcvfkwpymfvkrzwx4

Audio Visual Speech Recognition using Deep Recurrent Neural Networks [article]

Abhinav Thanda, Shankar M Venkatesan
2016 arXiv   pre-print
In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a  ...  Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for visual modality helps the model to converge properly during training.  ...  Introduction Audio-visual automatic speech recognition (AV-ASR) is a case of multi-modal analysis in which two modalities (audio and visual) complement each other to recognize speech.  ... 
arXiv:1611.02879v1 fatcat:niwn4c7v6jabvaggkug7axd64q

Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning

Thomas Hueber, Eric Tatulli, Laurent Girin, Jean-Luc Schwartz
2020 Neural Computation  
We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and  ...  Interestingly the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.  ...  Acknowledgments This work has been supported by the European Research Council under the European Community Seventh Framework Programme (FP7/2007-2013 grant agreement 339152, Speech Unit(e)s).  ... 
doi:10.1162/neco_a_01264 pmid:31951798 fatcat:klxrrmm6pbbwnef44tegzzdsl4

Transfer Learning from Audio-Visual Grounding to Speech Recognition

Wei-Ning Hsu, David Harwath, James Glass
2019 Interspeech 2019  
Moreover, while most previous studies include training data for speech recognition for feature extractor training, our grounding models are not trained on any of those data, indicating more universal applicability  ...  To study the properties of features distilled from different layers, we use them as input separately to train multiple speech recognition models.  ...  In summary, our contributions are three-fold: (1) To the best of our knowledge, this is the first work connecting audio-visual grounding with speech recognition. (2) Our empirical study verifies that the  ... 
doi:10.21437/interspeech.2019-1227 dblp:conf/interspeech/HsuHG19 fatcat:vtf6iei6xfh5rela5wa77r6brq

A deep representation for invariance and music classification

Chiyuan Zhang, Georgios Evangelopoulos, Stephen Voinea, Lorenzo Rosasco, Tomaso Poggio
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.  ...  Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations.  ...  RELATED WORK Deep learning and convolutional networks (CNNs) have been recently applied for learning mid- and high-level audio representations, motivated by successes in improving image and speech recognition  ... 
doi:10.1109/icassp.2014.6854954 dblp:conf/icassp/ZhangEVRP14 fatcat:cbxy36hr3vbchmk4j7utqo5wzi

A Review on Methods and Applications in Multimodal Deep Learning [article]

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
2022 arXiv   pre-print
Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided.  ...  The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities.  ...  In this method, for better use of visual, textual and audio features for video emotion detection, bidirectional GRU is cascaded with an attention mechanism.  ... 
arXiv:2202.09195v1 fatcat:wwxrmrwmerfabbenleylwmmj7y

Continuous Multimodal Emotion Recognition Approach for AVEC 2017 [article]

Narotam Singh (Indian Institute of Technology Ropar)
2017 arXiv   pre-print
This paper reports the analysis of audio and visual features in predicting the continuous emotion dimensions under the seventh Audio/Visual Emotion Challenge (AVEC 2017), which was done as part of a B.Tech  ...  For visual features we used the HOG (Histogram of Gradients) features, Fisher encodings of SIFT (Scale-Invariant Feature Transform) features based on Gaussian mixture model (GMM) and some pretrained Convolutional  ...  Thus a final Fisher vector is obtained for each frame of each video. 3) Deep Visual Features: An output of a particular layer of pretrained models: VGG-Face [19] and ResNet-50-dag [8] are used as  ... 
arXiv:1709.05861v2 fatcat:2v7jtjjlfzempp6s4tbh4yrr7a

Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features [article]

Arulkumar Subramaniam, Vismay Patel, Ashish Mishra, Prashanth Balasubramanian, Anurag Mittal
2016 arXiv   pre-print
We propose a novel approach for First Impressions Recognition in terms of the Big Five personality-traits from short videos.  ...  We empirically show that the trained models perform exceptionally well, even after training from small sub-portions of inputs.  ...  Conclusions and Future Works In this work, we proposed two deep neural network based models that use audio and visual features for the task of First Impressions Recognition.  ... 
arXiv:1610.10048v1 fatcat:n7voeg47qraq5oz43m64mqgawq