Speech Pre-training with Acoustic Piece [article]

Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei
2022 arXiv   pre-print
With the acoustic piece as the training signal, we can implicitly bridge the input audio and natural language, which benefits audio-to-text tasks, such as automatic speech recognition (ASR).  ...  Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre-train a Transformer encoder to learn deep representations from audio data, with objectives predicting either elements from latent  ...  Based on that, we extract the patterns called "acoustic piece" with the sentence piece method, and take it as the training signal for speech pre-training.  ... 
arXiv:2204.03240v1 fatcat:thqntbcjcfegpgvmi543zmsza4

Transfer learning emotion manifestation across music and speech

Eduardo Coutinho, Jun Deng, Björn Schuller
2014 2014 International Joint Conference on Neural Networks (IJCNN)  
Overall, results indicate a good cross-domain generalization performance, particularly for the model trained on speech and tested on music without pre-encoding of the input features.  ...  First, we compare the use of Recurrent Neural Networks (RNN) with standard hidden units (Simple Recurrent Network -SRN) and Long-Short Term Memory (LSTM) blocks for intra-domain acoustic emotion recognition  ...  Each DAE was trained to reproduce the feature space of acoustic descriptors of the full set of music pieces (Music to Speech) or speech samples (Speech to Music).  ... 
doi:10.1109/ijcnn.2014.6889814 dblp:conf/ijcnn/CoutinhoDS14 fatcat:2ftnwdj7svh3pjj42sf7m6rzku

Shared acoustic codes underlie emotional communication in music and speech—Evidence from deep transfer learning

Eduardo Coutinho, Björn Schuller, Yudong Zhang
2017 PLoS ONE  
., models trained and tested on the same modality, either music or speech) and cross-domain experiments (i.e., models trained in one modality and tested on the other).  ...  In a meta-analysis that reviews 104 studies of vocal expression and 41 studies of music performance and compared the acoustic characteristics of speech and music associated with particular emotions [8]  ...  Speech DAE pre-training: Semaine database.  ... 
doi:10.1371/journal.pone.0179289 pmid:28658285 pmcid:PMC5489171 fatcat:jytpt7nehnch7pml3en33sxttu

A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech [article]

Pu Wang, Bagher BabaAli, Hugo Van hamme
2021 arXiv   pre-print
The acoustic model is pre-trained in two stages: initialization with a corpus of normal speech and finetuning on a mixture of dysarthric and normal speech.  ...  This paper investigates the efficiency of pre-training strategies for SLU tasks on dysarthric speech.  ...  The acoustic model is pre-trained with ASR targets in two stages.  ... 
arXiv:2106.08313v1 fatcat:6rvmm7swvvcc7ij6osh7dtnbfi

Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition

Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong
2020 ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.  ...  In particular, the encoder pre-training solution achieved a 10% and a 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy  ...  The word piece units are generated by running byte pair encoding [31] on the acoustic training texts.  ... 
doi:10.1109/icassp40776.2020.9054663 dblp:conf/icassp/HuZLLG20 fatcat:76spwtthpfffrm3zh2qxg6ddky

Unsupervised pre-training for sequence to sequence speech recognition [article]

Zhiyun Fan, Shiyu Zhou, Bo Xu
2020 arXiv   pre-print
In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks with its context.  ...  Our pre-training method is divided into two stages, named acoustic pre-training and linguistic pre-training.  ...  Two pre-training stages are used to extract acoustic and linguistic information with speech and transcripts respectively.  ... 
arXiv:1910.12418v2 fatcat:btxci4ozonbyblohq6lymnwzo4

End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [article]

Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury
2020 arXiv   pre-print
These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features.  ...  Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task  ...  features (filterbank), self-supervised pre-trained acoustic features (wav2vec), pre-trained model initialization, and multi-task training.  ... 
arXiv:2011.08238v1 fatcat:kmtdxomgfnhlxne4rmv3bxmizu

AIPNet: Generative Adversarial Pre-training of Accent-invariant Networks for End-to-end Speech Recognition [article]

Yi-Chen Chen, Zhaojun Yang, Ching-Feng Yeh, Mahaveer Jain, Michael L. Seltzer
2019 arXiv   pre-print
We pre-train AIPNet to disentangle accent-invariant and accent-specific characteristics from acoustic features through adversarial training on accented data for which transcriptions are not necessarily  ...  For this purpose, we propose a novel pre-training framework AIPNet based on generative adversarial nets (GAN) for accent-invariant representation learning: Accent Invariant Pre-training Networks.  ...  In the pre-training stage, AIPNet is built through adversarial training to disentangle accent-invariant and accent-specific characteristics from acoustic features.  ... 
arXiv:1911.11935v1 fatcat:do2blvazhfdntoshwwwo4dran4

PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition [article]

Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang
2022 arXiv   pre-print
ESPnet1 pre-trained model.  ...  To boost the performance of PMT, we propose multi-modeling unit training (MMUT) architecture fusion with PMT (PM-MMUT).  ...  Finally, our model achieves the best performance without external language model, which achieves about 10% relative WER reduction on all the tests comparing with the official ESPnet1 pre-trained model.  ... 
arXiv:2112.06721v3 fatcat:eszmrtk2qbg2hkkuiqeirrtxye

Deep Learning based NLP Techniques In Text to Speech Synthesis for Communication Recognition

Eriss Eisa Babikir Adam
2020 Journal of Soft Computing Paradigm  
The main objective of this research article is that implements deep learning techniques into speech synthesis and compares the performance in terms of aperiodic distortion with prior model of algorithms  ...  The computer system is developing the model for speech synthesis of various aspects for natural language processing. The speech synthesis explores by articulatory, formant and concatenate synthesis.  ...  ACKNOWLEDGEMENT We would like to thank Mr. Karunakaran, Senior instructor, Bahrain Training Institute, Bahrain for his data collection for this research article.  ... 
doi:10.36548/jscp.2020.4.002 fatcat:bcmjtjaykvcodp5jrgffp2plzu

On Addressing Practical Challenges for RNN-Transducer [article]

Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong
2021 arXiv   pre-print
The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data.  ...  Evaluated with Microsoft production data, the splicing data adaptation method improves the baseline and adaptation with the text to speech method by 58.03% and 15.25% relative word error rate reduction  ...  Secondly, the speech data generated with the proposed method is "real" speech at each segment, hence it has the potential to cover all the speakers and acoustic environments in the existing training data  ... 
arXiv:2105.00858v3 fatcat:td4zbglyyvhf3dfizyyvdkfxce

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [article]

Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong
2020 arXiv   pre-print
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.  ...  In particular, the encoder pre-training solution achieved a 10% and a 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy  ...  The word piece units are generated by running byte pair encoding [31] on the acoustic training texts.  ... 
arXiv:2005.00572v1 fatcat:hddikwhlbfgnlcgiklp5rpapf4

End-to-End Neural Systems for Automatic Children Speech Recognition: An Empirical Study [article]

Prashanth Gurunath Shivakumar, Shrikanth Narayanan
2021 arXiv   pre-print
Children speech recognition is more challenging due to the larger intra-inter speaker variability in terms of acoustic and linguistic characteristics compared to adult speech.  ...  Insights are provided on the aspects of training data requirements, adaptation on children data, and the effect of children age, utterance lengths, different architectures and loss functions for end-to-end  ...  We initialize the acoustic model with the pre-trained adult model trained on LIBRISPEECH.  ... 
arXiv:2102.09918v1 fatcat:ikg5tmf45bcv7b3hequyfprpqa

Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [article]

Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Zhengqi Wen
2020 arXiv   pre-print
Then the generated phoneme-text paired data is used to train the P2T network. This network can be pre-trained with large amounts of external unpaired text data.  ...  The A2P network can learn acoustic pattern scenarios using large-scale monolingual paired data.  ...  The MER/CER/WER (%) of decoupled transformer with different pre-training data.  ... 
arXiv:2010.14798v1 fatcat:7qsud65tsfhbrjrnxp37hovqfa

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation [article]

Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang
2021 arXiv   pre-print
Pre-training of MAM with arbitrary acoustic signals also has an average improvement with +1.6 BLEU for those languages.  ...  This technique termed Masked Acoustic Modeling (MAM), not only provides an alternative solution to improving E2E-ST, but also can perform pre-training on any acoustic signals (including non-speech ones  ...  This allows us to perform pre-training with MAM with three different settings, pre-training with source language speech, with multilingual speech, and arbitrary audios.  ... 
arXiv:2010.11445v2 fatcat:56veneg6averbggnt3tlbvlt34