An Empirical Analysis of Deep Audio-Visual Models for Speech Recognition
[article]
2018
arXiv
pre-print
In this project, we worked on speech recognition, specifically predicting individual words based on both the video frames and audio. ...
Empowered by convolutional neural networks, the recent speech recognition and lip reading models are comparable to human level performance. ...
We re-implemented and derived variations of the state-of-the-art model presented in [28]. ...
arXiv:1812.09336v1
fatcat:grljy67llre2lo3fpqx5tdheiu
Audio-visual speech recognition using deep bottleneck features and high-performance lipreading
2015
2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance ...
In addition, we extracted speech segments using visual features, resulting in 77.80% lipreading accuracy. VAD is found to be useful in both the audio and visual modalities, improving both lipreading and AVSR. ...
As one method of compensating for this degradation, Audio-Visual Speech Recognition (AVSR), also known as bimodal or multi-modal speech recognition, has been studied for a couple of decades. ...
doi:10.1109/apsipa.2015.7415335
dblp:conf/apsipa/TamuraNKOITH15
fatcat:jux4kcgmnjhl7k5hv43wjs5tcu
Investigation of DNN-Based Audio-Visual Speech Recognition
2016
IEICE transactions on information and systems
Audio-Visual Speech Recognition (AVSR) is one technique for enhancing the robustness of speech recognizers in noisy or real environments. ...
There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach an emission probability on each Hidden Markov Model (HMM) state is computed ...
... A part of this work was supported by JSPS KAKENHI Grant Number 25730109. ...
doi:10.1587/transinf.2016slp0019
fatcat:7uvp7jui7jdmjdw33dlocwjom4
Audio-visual speech recognition using deep learning
2014
Applied intelligence (Boston)
An audio-visual speech recognition (AVSR) system is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. ...
This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is utilized for acquiring noise-robust audio features. ...
Acknowledgments This work has been supported by JST PRESTO "Information Environment and Humans" and MEXT Grant-in-Aid for Scientific Research on Innovative Areas "Constructive Developmental Science" (24119003 ...
doi:10.1007/s10489-014-0629-7
fatcat:jirfvfejibdarkcbnp72flc4by
Deep Learning for Visual Speech Analysis: A Survey
[article]
2022
arXiv
pre-print
To push forward future research on visual speech, this paper aims to present a comprehensive review of recent progress in deep learning methods on visual speech analysis. ...
As a powerful AI strategy, deep learning techniques have extensively promoted the development of visual speech learning. ...
One of the most common tasks is audio-visual speech recognition (AVSR), a speech recognition technology that uses visual and audio information. ...
arXiv:2205.10839v1
fatcat:l5m4ohtcvnevrliaiwawg3phjq
A Systematic Study and Empirical Analysis of Lip Reading Models using Traditional and Deep Learning Algorithms
2022
Journal of advanced applied scientific research
In recent years there has been a lot of interest in Deep Neural Networks (DNN), with breakthrough results in various domains including image classification, speech recognition, and natural language processing. ...
Modelling of the framework has played a major role in advancing the yield of sequential frameworks. ...
The only source of communication is conversation. The research in lip reading paved the way for Audio-Visual Automatic Speech Recognition (AV-ASR) systems using deep learning. ...
doi:10.46947/joaasr412022231
fatcat:33677stv6jehxleiqa57cuk6tu
Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
2020
Applied Sciences
To address this issue, we propose a deep learning approach in order to efficiently utilize common information for audio-visual emotion recognition by correlation analysis. ...
Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired in the expression of emotions. ...
Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition. ...
doi:10.3390/app10207239
fatcat:iqfdpdejwfhdvcjtu47hskgtt4
Deep Multimodal Learning for Audio-Visual Speech Recognition
[article]
2015
arXiv
pre-print
In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). ...
While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83% demonstrating ...
In Audio-Visual Automatic Speech Recognition (AV-ASR), both audio recordings and videos of the person talking are available at training time. ...
arXiv:1501.05396v1
fatcat:anbp47zv5vcvfkwpymfvkrzwx4
Audio Visual Speech Recognition using Deep Recurrent Neural Networks
[article]
2016
arXiv
pre-print
In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using a deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a ...
Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for visual modality helps the model to converge properly during training. ...
Introduction Audio-visual automatic speech recognition (AV-ASR) is a case of multi-modal analysis in which two modalities (audio and visual) complement each other to recognize speech. ...
arXiv:1611.02879v1
fatcat:niwn4c7v6jabvaggkug7axd64q
Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning
2020
Neural Computation
We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and ...
Interestingly, the visual gain vanishes more slowly, and the gain is maximal for a delay of 75 ms between image and predicted sound. ...
Acknowledgments This work has been supported by the European Research Council under the European Community Seventh Framework Programme (FP7/2007-2013 grant agreement 339152, Speech Unit(e)s). ...
doi:10.1162/neco_a_01264
pmid:31951798
fatcat:klxrrmm6pbbwnef44tegzzdsl4
Transfer Learning from Audio-Visual Grounding to Speech Recognition
2019
Interspeech 2019
Moreover, while most previous studies include speech recognition training data when training the feature extractor, our grounding models are not trained on any of those data, indicating more universal applicability ...
To study the properties of features distilled from different layers, we use them as input separately to train multiple speech recognition models. ...
In summary, our contributions are three-fold: (1) To the best of our knowledge, this is the first work connecting audio-visual grounding with speech recognition. (2) Our empirical study verifies that the ...
doi:10.21437/interspeech.2019-1227
dblp:conf/interspeech/HsuHG19
fatcat:vtf6iei6xfh5rela5wa77r6brq
A deep representation for invariance and music classification
2014
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification. ...
Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. ...
RELATED WORK Deep learning and convolutional networks (CNNs) have recently been applied to learning mid- and high-level audio representations, motivated by successes in improving image and speech recognition ...
doi:10.1109/icassp.2014.6854954
dblp:conf/icassp/ZhangEVRP14
fatcat:cbxy36hr3vbchmk4j7utqo5wzi
A Review on Methods and Applications in Multimodal Deep Learning
[article]
2022
arXiv
pre-print
A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications during the last five years (2017 to 2021) have been provided. ...
The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. ...
In this method, for better use of visual, textual and audio features for video emotion detection, bidirectional GRU is cascaded with an attention mechanism. ...
arXiv:2202.09195v1
fatcat:wwxrmrwmerfabbenleylwmmj7y
Continuous Multimodal Emotion Recognition Approach for AVEC 2017
[article]
2017
arXiv
pre-print
This paper reports the analysis of audio and visual features in predicting the continuous emotion dimensions under the seventh Audio/Visual Emotion Challenge (AVEC 2017), which was done as part of a B.Tech ...
For visual features we used the HOG (Histogram of Gradients) features, Fisher encodings of SIFT (Scale-Invariant Feature Transform) features based on Gaussian mixture model (GMM) and some pretrained Convolutional ...
Thus a final Fisher vector is obtained for each frame of each video.
3) Deep Visual Features: Outputs of a particular layer of the pretrained models VGG-Face [19] and ResNet-50-dag [8] are used as ...
arXiv:1709.05861v2
fatcat:2v7jtjjlfzempp6s4tbh4yrr7a
Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features
[article]
2016
arXiv
pre-print
We propose a novel approach for First Impressions Recognition in terms of the Big Five personality-traits from short videos. ...
We empirically show that the trained models perform exceptionally well, even when trained on small sub-portions of the inputs. ...
Conclusions and Future Work In this work, we proposed two deep neural network based models that use audio and visual features for the task of First Impressions Recognition. ...
arXiv:1610.10048v1
fatcat:n7voeg47qraq5oz43m64mqgawq
Showing results 1 — 15 out of 13,264 results