116,899 Hits in 4.3 sec

Conditional End-to-End Audio Transforms

Albert Haque, Michelle Guo, Prateek Verma
2018 Interspeech 2018  
We present an end-to-end method for transforming audio from one style to another.  ...  For the case of speech, by conditioning on speaker identities, we can train a single model to transform words spoken by multiple people into multiple target voices.  ...  We would like to thank Malcolm Slaney and Dan Jurafsky for helpful feedback. Additionally, we thank members of the Stanford AI Lab for participating in subjective experiments.  ... 
doi:10.21437/interspeech.2018-38 dblp:conf/interspeech/HaqueGV18 fatcat:y7ny2wqx7jgspgj6j64zsyb4lq

Conditional End-to-End Audio Transforms [article]

Albert Haque, Michelle Guo, Prateek Verma
2018 arXiv   pre-print
We present an end-to-end method for transforming audio from one style to another.  ...  For the case of speech, by conditioning on speaker identities, we can train a single model to transform words spoken by multiple people into multiple target voices.  ...  We would like to thank Malcolm Slaney and Dan Jurafsky for feedback. Additionally, we thank members of the Stanford AI Lab for participating in subjective experiments.  ... 
arXiv:1804.00047v2 fatcat:7vc4e244fjepfcwkmuwamq2iku

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition [article]

Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
2022 arXiv   pre-print
In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features.  ...  Recently, image transformer networks arXiv:2010.11929 demonstrated the ability to extract rich visual features for the image classification task.  ...  We are aware of the sensitive nature of the audio-visual speech recognition research and other AI technologies used in this work.  ... 
arXiv:2201.10439v1 fatcat:ribgjwzwhjc3tfatytvcc7dh3e

Fusing information streams in end-to-end audio-visual speech recognition [article]

Wentao Yu, Steffen Zeiler, Dorothea Kolossa
2021 arXiv   pre-print
While audio-visual speech recognition can significantly improve the recognition rate of end-to-end models in such poor conditions, it is not obvious how to best utilize any available information on acoustic  ...  On average, the new system achieves a relative word error rate reduction of 43% compared to the audio-only setup and 31% compared to the audiovisual end-to-end baseline.  ...  CONCLUSION In noisy conditions, large-vocabulary end-to-end speech recognition remains a difficult task.  ... 
arXiv:2104.09482v1 fatcat:brigsmdknnholizjtgieujqp6q

End-to-end Audio-visual Speech Recognition with Conformers [article]

Pingchuan Ma, Stavros Petridis, Maja Pantic
2021 arXiv   pre-print
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.  ...  We show that end-to-end training, instead of using pre-computed visual features which is common in the literature, the use of a conformer, instead of a recurrent network, and the use of a transformer-based  ...  We would like to thank Dr. Jie Shen for his help with face tracking. The work of Pingchuan Ma has been partially supported by Honda and "AWS Cloud Credits for Research".  ... 
arXiv:2102.06657v1 fatcat:pr5od7w73vfr5gs7wyjl4tvhvq

SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation [article]

Arya D. McCarthy and Liezl Puzon and Juan Pino
2020 arXiv   pre-print
This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice.  ...  Finally, we show that we can combine our approach with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model on an English  ...  Fig. 1. Conditional autoencoding of two speakers' audio (top). The latent speaker representation z_i can transform ("skin") new audio from unseen speakers (bottom).  ...
arXiv:2002.12231v1 fatcat:epuqipqds5ckloljxaekxmmixu

Audio-Visual Speech Recognition is Worth 32×32×8 Voxels [article]

Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
2021 arXiv   pre-print
In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.  ...  On a lip-reading task, the transformer-based front-end shows superior performance compared to a strong convolutional baseline.  ...  Our paper aims to test and explore the viability of using a fully transformer-based architecture, where both the video and audio front-ends are transformer networks.  ... 
arXiv:2109.09536v1 fatcat:hhy76zxdyrdkpmdgasmreh5u7a

AVMSN: An audio-visual two stream crowd counting framework under low-quality conditions

Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu
2021 IEEE Access  
Vision-end branch in the feature extraction module to calculate the weighted-visual feature.  ...  Besides, the audio is transformed from the temporal domain into a spectrogram, and the audio feature is learned by the audio-VGG network.  ...  consists of the Vision-end and the Audio-end branches to extract the features.  ...
doi:10.1109/access.2021.3074797 fatcat:kdj6dpkln5dkvbghws7meqhah4

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition [article]

Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin
2022 arXiv   pre-print
In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to recognize parallel  ...  Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled data in multimodality is rather cost-demanding, especially for audio-visual speech recognition (AVSR  ...  [Architecture figure labels: MoCo v2, wav2vec 2.0, fusion module, decoders, classification head, visual/audio back-ends, initialize/freeze annotations.]  ...
arXiv:2203.07996v2 fatcat:top2qntzqfhtpfmgcrjwled43m

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing [article]

Zhaofeng Shi
2021 arXiv   pre-print
Meanwhile, researchers have dedicated significant effort to multimodal tasks such as audio-visual multimodal processing.  ...  This review focuses on text-to-speech (TTS), music generation and some tasks that combine visual and acoustic information.  ...  In detail, a 1-D condition is used to condition the chords of the generation, while a 2-D condition is the previously generated bar, used to condition the current bar.  ...
arXiv:2108.00443v1 fatcat:5xkj7lf7pfgpppvfqwynoqkqjm

Neural Speech Synthesis with Transformer Network

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu
2019 Proceedings of the AAAI Conference on Artificial Intelligence
Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results.  ...  Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training  ...  With the end-to-end neural network, quality of synthesized audios is greatly improved and even comparable with human recordings on some datasets.  ...
doi:10.1609/aaai.v33i01.33016706 fatcat:325z3grkm5bh7ozwmbbzjjizdq
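
The excerpt above describes a two-stage pipeline: a Transformer acoustic model maps phoneme sequences to mel spectrograms, and a WaveNet vocoder maps spectrograms to waveforms. The minimal sketch below illustrates only that interface; the class and function names are hypothetical placeholders and the array shapes are illustrative assumptions, not the authors' configuration.

    # Hypothetical interface sketch for the two-stage TTS pipeline described above.
    from typing import Sequence
    import numpy as np

    class AcousticModel:  # placeholder for the Transformer TTS network
        def phonemes_to_mel(self, phoneme_ids: Sequence[int]) -> np.ndarray:
            # Returns a (frames, n_mels) mel spectrogram; stubbed here.
            return np.zeros((100, 80), dtype=np.float32)

    class Vocoder:  # placeholder for the WaveNet vocoder
        def mel_to_waveform(self, mel: np.ndarray) -> np.ndarray:
            # Returns a 1-D waveform; a hop of 256 samples per frame is an assumption.
            return np.zeros(mel.shape[0] * 256, dtype=np.float32)

    def synthesize(phoneme_ids: Sequence[int]) -> np.ndarray:
        mel = AcousticModel().phonemes_to_mel(phoneme_ids)
        return Vocoder().mel_to_waveform(mel)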

A Hybrid Approach to Audio-to-Score Alignment [article]

Ruchit Agrawal, Simon Dixon
2020 arXiv   pre-print
Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece.  ...  Experiments on music data from different acoustic conditions demonstrate that this method generates robust alignments whilst being adaptable at the same time.  ...  We conduct experiments using both the Short-Time Fourier Transform (STFT) and the Constant-Q Transform (CQT) of the raw audio.  ...
arXiv:2007.14333v1 fatcat:mpshh6kqjfe2pciqg2grg4zxmi
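
For reference, the two time-frequency representations named in the excerpt above can be computed from raw audio with librosa as sketched below; the file name and parameter values (FFT size, hop length, number of CQT bins) are illustrative assumptions, not the settings used in the paper.

    # Sketch: STFT and CQT features of a raw audio file (illustrative parameters only).
    import numpy as np
    import librosa

    y, sr = librosa.load("performance.wav", sr=22050)  # hypothetical input file

    # Short-Time Fourier Transform: complex spectrogram -> magnitude in dB.
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    stft_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    # Constant-Q Transform: log-frequency representation suited to music.
    cqt = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)
    cqt_db = librosa.amplitude_to_db(np.abs(cqt), ref=np.max)

    print(stft_db.shape, cqt_db.shape)  # (freq_bins, frames) for each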

Semi-Supervised Training of Transformer and Causal Dilated Convolution Network with Applications to Topic Classification

Jinxiang Zeng, Du Zhang, Zhiyi Li, Xiaolin Li
2021 Applied Sciences  
In order to identify audio topics reliably and stably, we extract different features and compare different loss functions to find the best model.  ...  Aiming at the audio event recognition problem in speech recognition, a decision fusion method based on the Transformer and Causal Dilated Convolutional Network (TCDCN) framework is proposed.  ...  The joint probability of an audio waveform can be decomposed into a product of conditional probability distributions.  ...
doi:10.3390/app11125712 fatcat:stdxsde37vbx3it2sabr44h2ny
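
The decomposition mentioned in the last excerpt above is presumably the standard autoregressive factorization used by neural waveform models; since the excerpt is truncated, the following is an assumption about the identity being referred to:

    % Autoregressive factorization of a waveform x = (x_1, ..., x_T)
    p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})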

Neural Speech Synthesis with Transformer Network [article]

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou
2019 arXiv   pre-print
Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results.  ...  Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training  ...  With the end-to-end neural network, quality of synthesized audios is greatly improved and even comparable with human recordings on some datasets.  ... 
arXiv:1809.08895v3 fatcat:2lrukvnil5ds7b5nxj7y46ucle

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [article]

Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar
2020 arXiv   pre-print
In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.  ...  Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.  ...  CONCLUSIONS In this paper, we presented the Transformer Transducer model, embedding Transformer based self-attention for audio and label encoding within the RNN-T architecture, resulting in an end-to-end  ... 
arXiv:2002.02562v2 fatcat:n7zabn3rubav7ce4nb5y6puez4
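
To make the last excerpt concrete, the sketch below shows one common way an RNN-T joint network combines independently encoded audio and label sequences into per-(frame, token) logits. It is written in PyTorch; the layer sizes, vocabulary size, and tanh combination are illustrative assumptions rather than the paper's actual configuration.

    # Sketch of an RNN-T-style joint network (illustrative sizes only).
    import torch
    import torch.nn as nn

    class JointNetwork(nn.Module):
        def __init__(self, audio_dim=512, label_dim=512, joint_dim=640, vocab=4096):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, joint_dim)
            self.label_proj = nn.Linear(label_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab)

        def forward(self, audio_enc, label_enc):
            # audio_enc: (B, T, audio_dim) from the audio encoder
            # label_enc: (B, U, label_dim) from the label encoder
            a = self.audio_proj(audio_enc).unsqueeze(2)  # (B, T, 1, joint_dim)
            l = self.label_proj(label_enc).unsqueeze(1)  # (B, 1, U, joint_dim)
            joint = torch.tanh(a + l)                    # (B, T, U, joint_dim)
            return self.out(joint)                       # logits fed to the RNN-T loss

    # Example: a 100-frame audio encoding combined with a 20-token label encoding.
    logits = JointNetwork()(torch.randn(1, 100, 512), torch.randn(1, 20, 512))
    print(logits.shape)  # torch.Size([1, 100, 20, 4096])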
Showing results 1 — 15 out of 116,899 results