
Recent Developments on ESPnet Toolkit Boosted by Conformer [article]

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe (+3 others)
2020 arXiv   pre-print
In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer  ...  Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks.  ...  Both Transformer and Conformer models are implemented based on the ESPnet toolkit. Results marked with (*) are obtained with ESPnet2. Table 2.  ... 
arXiv:2010.13956v2 fatcat:kfxu5x7qrzacrdfmirxddzn3la

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding [article]

Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian (+1 others)
2022 arXiv   pre-print
This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit.  ...  The code is available online.  ...  The multi-channel ST and SLU datasets, which are another contribution of this work, are released on HuggingFace.  ...  To accelerate research in SSE, the ESPnet-SE toolkit [9] was developed and currently supports multiple state-of-the-art enhancement approaches and various corpora.  ... 
arXiv:2207.09514v1 fatcat:waiuul7ypfexbjeq54llxbydhq

PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit [article]

Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, Dianhai Yu, Yanjun Ma (+1 others)
2022 arXiv   pre-print
PaddleSpeech is an open-source all-in-one speech toolkit.  ...  It aims at facilitating the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure.  ...  This work was supported by the National Key Research and Development Project of China (2020AAA0103503).  ... 
arXiv:2205.12007v1 fatcat:zfubo5stfvczhaaeepxod4u6hy

WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit [article]

Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu
2022 arXiv   pre-print
Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming  ...  In this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by  ...  The result of the model trained on WenetSpeech by shard mode is comparable to ESPnet, which further illustrates the effectiveness of UIO.  ... 
arXiv:2203.15455v2 fatcat:4qtv44edlfafxftkah6pvz6onu
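The two-pass idea behind U2/U2++ can be illustrated at a very high level: a first pass produces n-best hypotheses with CTC scores, and a second pass rescores them with an attention decoder. The sketch below is a toy illustration under that assumption, not WeNet's actual API; the function name, parameters, and plain-float "log-probabilities" are all illustrative.

```python
def rescore(hyps, ctc_scores, attn_scores, ctc_weight=0.5):
    """Second-pass rescoring, heavily simplified: pick the hypothesis
    with the best weighted sum of the first-pass CTC score and the
    second-pass attention-decoder score (both stand-in floats for
    log-probabilities)."""
    best, best_score = None, float("-inf")
    for hyp, c, a in zip(hyps, ctc_scores, attn_scores):
        s = ctc_weight * c + (1.0 - ctc_weight) * a
        if s > best_score:
            best, best_score = hyp, s
    return best

# The attention pass can promote a hypothesis with a weaker CTC score.
chosen = rescore(["a", "b"], [-1.0, -2.0], [-3.0, -0.5])
```

In the real toolkit the two passes share one encoder, which is what makes the second pass cheap; the sketch only shows the scoring combination.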

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [article]

Jing Pan, Tao Lei, Kwangyoun Kim, Kyu Han, Shinji Watanabe
2021 arXiv   pre-print
Specifically, SRU++ can surpass Conformer on long-form speech input by a large margin, based on our analysis.  ...  On the popular LibriSpeech benchmark, our SRU++ model achieves 2.0% / 4.7% WER on test-clean / test-other, showing competitive performance compared with the state-of-the-art Conformer encoder under the  ...  [12], the corresponding trained models along with the language models are also available in the ESPnet toolkit.  ... 
arXiv:2110.05571v1 fatcat:bsqrxudj5zftpgdxgsmdsclvkq

SpliceOut: A Simple and Efficient Audio Augmentation Method [article]

Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh
2021 arXiv   pre-print
SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages using varying amounts of training data, as well as on speech translation, sound and music classification, thus establishing itself as a broadly applicable audio augmentation method.  ...  Our base model is the large variant of the Conformer model [53], Conformer(L), which is a state-of-the-art network for ASR and is implemented using the ESPnet toolkit [54].  ... 
arXiv:2110.00046v2 fatcat:oxpxaw7iorbexi3zwv63nihg4i
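The mechanism is simple enough to sketch. Assuming a spectrogram represented as a list of frame vectors (the function and parameter names below are illustrative, not the authors' code), SpliceOut removes randomly chosen time blocks outright and splices the remainder together, rather than masking them to zero as SpecAugment does:

```python
import random

def splice_out(spectrogram, num_blocks=2, max_width=3, rng=None):
    """SpliceOut-style augmentation (simplified sketch): delete
    random time blocks and splice the remaining frames together,
    so the output is shorter than the input."""
    rng = rng or random.Random()
    frames = list(spectrogram)
    for _ in range(num_blocks):
        if len(frames) <= max_width:
            break
        width = rng.randint(1, max_width)
        start = rng.randrange(0, len(frames) - width + 1)
        del frames[start:start + width]  # remove, don't zero-mask
    return frames

# Each "frame" here is just a tiny feature vector.
spec = [[float(t)] for t in range(10)]
out = splice_out(spec, num_blocks=2, max_width=3, rng=random.Random(0))
```

The surviving frames keep their original order; only the total duration shrinks, which is the property that distinguishes it from time masking.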

Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [article]

Huahuan Zheng, Wenjie Peng, Zhijian Ou, Jinsong Zhang
2021 arXiv   pre-print
Specifically, we investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.  ...  Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice.  ...  Compared with hybrid and E2E systems, the recently developed CTC-CRF [8, 9] framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.  ... 
arXiv:2107.03007v2 fatcat:ukggt5xugzbczoygl77qjd6v7i

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [article]

Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe
2021 arXiv   pre-print
All of our code will be publicly available at  ...  However, the recognition inference needs to wait for the completion of a full speech utterance, which limits their applications in low-latency scenarios.  ...  All experiments are conducted using the open-source E2E speech processing toolkit ESPnet [31, 34, 35].  ... 
arXiv:2107.09428v1 fatcat:r6y2eeiezrdohos7h6fysllzii

DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition [article]

Jiamin Xie, John H.L. Hansen
2022 arXiv   pre-print
By analyzing our best-performing model, we visualize both local receptive fields and global attention maps learned by the Deformer and show increased feature associations on the utterance level.  ...  Finally, replacing only half of the layers in the encoder, the Deformer improves +5.6% relative WER without an LM and +6.4% relative WER with an LM over the Conformer baseline on the WSJ eval92 set.  ...  Acknowledgements The authors would like to thank Wei Xia and Szu-Jui Chen for their meaningful discussion and suggestions on the work.  ... 
arXiv:2207.01732v2 fatcat:h6dwnb5mijfjpn7szzwhe36i7q

Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [article]

Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong Xiao, Jingbo Zhu
2021 arXiv   pre-print
Experimental results on the LibriSpeech En-Fr and MuST-C En-De ST tasks show that our method achieves state-of-the-art BLEU scores of 18.3 and 25.2.  ...  Also, we develop an adaptor module to alleviate the representation inconsistency between the pre-trained ASR encoder and MT encoder, and develop a multi-teacher knowledge distillation method to preserve  ...  Acknowledgement This work was supported in part by the National Science Foundation of China (Nos. 61876035 and 61732005), the National Key R&D Program of China (No. 2019QY1801), and the Ministry of Science  ... 
arXiv:2105.05752v2 fatcat:eiq6eadyzncq5njocuk2hancje

Better Intermediates Improve CTC Inference [article]

Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida
2022 arXiv   pre-print
We then propose two new conditioning methods based on the new formulation: (1) Searched intermediate conditioning that refines intermediate predictions with beam search, (2) Multi-pass conditioning that  ...  Encoding of audio sequence: Let us consider a CTC-based system based on N-layer Conformer encoders.  ...  Experiment To verify the effectiveness of the proposed method, we conducted experiments using ESPnet [27, 6] with almost the same hyperparameters.  ... 
arXiv:2204.00176v1 fatcat:h6uqmupntrcdbpacdcf7ssulri
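For readers unfamiliar with the CTC side of this setup: the intermediate predictions being refined are frame-level label paths, which are reduced to label sequences by the standard CTC collapse rule (merge repeated labels, then drop blanks) before any conditioning or search. A minimal sketch of that rule, with names of my own choosing:

```python
def ctc_collapse(path, blank=0):
    """Standard CTC collapse: merge consecutive repeated labels,
    then remove blank symbols. A repeated label separated by a
    blank survives as two distinct outputs."""
    out = []
    prev = None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Blank (0) between the two 1s keeps them separate.
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # → [1, 1, 2]
```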

InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR [article]

Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida
2022 arXiv   pre-print
In experiments using augmentations simulating deletion, insertion, and substitution errors, we confirmed that the trained model acquires robustness to each error, boosting the speech recognition performance  ...  The proposed method exploits the conditioning framework of self-conditioned CTC to train robust models by conditioning with "noisy" intermediate predictions.  ...  In recent years, there has been a growing demand for technologies that allow ASR systems to operate on devices such as smartphones and tablets.  ... 
arXiv:2204.00174v1 fatcat:g6foioplpnhw5pxw4wxiqv3xpi
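A rough sketch of the corruption step described above: take an intermediate prediction (a token sequence) and simulate deletion, insertion, and substitution errors before feeding it back as conditioning. Function and parameter names are mine; this is not the paper's implementation:

```python
import random

def simulate_errors(tokens, vocab, p_del=0.1, p_ins=0.1, p_sub=0.1, rng=None):
    """InterAug-style sketch: corrupt an intermediate prediction with
    deletion, insertion, and substitution errors so that downstream
    conditioning layers learn to recover from noisy inputs."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() < p_ins:
            out.append(rng.choice(vocab))   # insertion before the token
        r = rng.random()
        if r < p_del:
            continue                        # deletion: drop the token
        if r < p_del + p_sub:
            out.append(rng.choice(vocab))   # substitution
        else:
            out.append(tok)                 # token survives intact
    return out

noisy = simulate_errors(["a", "b", "c"], ["x", "y"], rng=random.Random(2))
```

With all probabilities set to zero the sequence passes through unchanged, which makes the augmentation easy to anneal or disable.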

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [article]

He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang
2022 arXiv   pre-print
Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation.  ...  Experiments show A^3T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.  ...  Recent developments on ESPnet toolkit boosted by Conformer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7038–7042. IEEE, 2021.  ... 
arXiv:2203.09690v2 fatcat:h44bnzrjerge7b33srgsb6txii

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks [article]

Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe
2021 arXiv   pre-print
The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German  ...  One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task.  ...  Recent developments on ESPnet toolkit boosted by Conformer. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.  ... 
arXiv:2105.00573v1 fatcat:abymra264rfzlki2jcynsg7fli

PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription [article]

Chen Zhang, Jiaxing Yu, LuChin Chang, Xu Tan, Jiawei Chen, Tao Qin, Kejun Zhang
2021 arXiv   pre-print
Experiments on DSing30 and the Dali corpus show that the ALT system equipped with our PDAugment outperforms previous state-of-the-art systems by 5.9% and 18.1% WER respectively, demonstrating the effectiveness  ...  ALT has not been well developed, mainly due to the dearth of paired singing voice and lyrics datasets for model training.  ...  We train the language model for 25 epochs on 2 GeForce RTX 3090 GPUs with the Adam optimizer. Our code for the basic model architecture is implemented based on the ESPnet toolkit (Watanabe et al. 2018).  ... 
arXiv:2109.07940v2 fatcat:d3wqu7yidvaqtja2xphdm5sbum