Relative Positional Encoding for Speech Recognition and Direct Translation

Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stüker, Jan Niehues, Alex Waibel
2020 Interspeech 2020  
In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network.  ...  However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs.  ...  In this work, we bring the advantages of relative position encoding to the Deep Transformer [8] for both speech recognition (ASR) and direct speech translation (ST).  ... 
doi:10.21437/interspeech.2020-2526 dblp:conf/interspeech/PhamHNNSSNW20 fatcat:kud4mx2nrnb4vbmlqxr7vxyacu
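
The key addition described above, biasing each attention score with a learned embedding of the relative distance between input states, is easy to sketch. Below is a minimal NumPy sketch in the general style of Shaw et al.'s relative attention; the clipping distance, weight shapes, and function names are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of self-attention with a relative position term,
# in the general style of Shaw et al. as adapted to speech inputs.
# All names, shapes, and the clipping distance are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, Wq, Wk, Wv, rel_emb, max_dist):
    """x: (T, d) input states; rel_emb: (2*max_dist+1, d) learned
    embeddings of the clipped relative distance j - i."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T                         # content-content term
    # content-position term: score each query against the embedding
    # of its (clipped) relative distance to every key position
    idx = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                  -max_dist, max_dist) + max_dist        # (T, T)
    logits += np.einsum('id,ijd->ij', q, rel_emb[idx])
    return softmax(logits / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
T, d, max_dist = 6, 8, 3
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
rel_emb = rng.normal(size=(2 * max_dist + 1, d)) * 0.1
print(relative_self_attention(x, Wq, Wk, Wv, rel_emb, max_dist).shape)  # (6, 8)
```

Clipping the distance keeps the embedding table small while still distinguishing nearby frames, which is what matters for locally correlated acoustic input.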

Incorporating Relative Position Information in Transformer-Based Sign Language Recognition and Translation

Neena Aloysius, Geetha M, Prema Nedungadi
2021 IEEE Access  
The study proposes Gated Recurrent Unit (GRU)-Relative Sign Transformer (RST) for jointly learning Continuous Sign Language Recognition (CSLR) and translation.  ...  In this approach, GRU acts as the relative position encoder and RST is the Transformer model with relative position incorporated in the Multi-Head Attention (MHA).  ...  This calls for a new sign translation dataset similar to that for NMTs in Speech and NLP.  ... 
doi:10.1109/access.2021.3122921 fatcat:kpz6eezlfnebfp3ogkcu5ihfri
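
As a rough illustration of the idea of a recurrent position encoder: a GRU pass over the input naturally produces order-aware states, which can be added back to the frame embeddings before multi-head attention. The PyTorch sketch below only illustrates this concept; the dimensions and the way the authors' RST actually injects relative positions into MHA are assumptions, not their implementation.

```python
# Rough PyTorch illustration of a recurrent position encoder: a GRU
# pass yields order-aware states that are added to the frame
# embeddings before multi-head attention. Dimensions are assumptions;
# this is not the authors' RST implementation.
import torch
import torch.nn as nn

d_model, heads, T = 64, 4, 50
frames = torch.randn(1, T, d_model)            # (batch, time, dim)

gru = nn.GRU(d_model, d_model, batch_first=True)
mha = nn.MultiheadAttention(d_model, heads, batch_first=True)

pos_states, _ = gru(frames)                    # order-aware states
x = frames + pos_states                        # inject position info
out, _ = mha(x, x, x)
print(out.shape)                               # torch.Size([1, 50, 64])
```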

Visualization of Uncertainty in Lattices to Support Decision-Making [article]

Christopher Collins, Sheelagh Carpendale, Gerald Penn
2007 EUROVIS 2007: Eurographics / IEEE VGTC Symposium on Visualization  
Applications such as machine translation and automated speech recognition typically present users with a best-guess about the appropriate output, with apparent complete confidence.  ...  Lattices compactly represent multiple possible outputs and are usually hidden from users.  ...  Also, since value, size, position, and transparency are ordered (values can be visually sorted), we used these to encode uncertainty to allow for comparison of the relative scores between nodes.  ... 
doi:10.2312/vissym/eurovis07/051-058 fatcat:osjskhnxsnhj7mc4thxvbqv6dq

LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition

Pengbin Fu, Daxing Liu, Huirong Yang
2022 Information  
Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths.  ...  To this end, we propose a local attention Transformer model for speech recognition that exploits the high correlation among speech frames.  ...  This mechanism is extremely suitable for certain tasks, such as machine translation, where the input and output words are not in the same order; however, in speech recognition, the output text sequences  ... 
doi:10.3390/info13050250 fatcat:rgrac3t6wfachmzhs54uoye66m
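
The local attention mechanism mentioned above can be pictured as a banded mask over the attention matrix: each frame may only attend to a window of neighbouring frames, matching the locally monotonic structure of speech. A minimal sketch, with an assumed window size:

```python
# Sketch of a local (banded) attention mask: frame i may only attend
# to frames j with |i - j| <= window. The window size is an assumed
# hyperparameter, not the value used in the paper.
import numpy as np

def local_attention_mask(T, window):
    """Boolean (T, T) mask; True where attention is allowed."""
    i = np.arange(T)
    return np.abs(i[:, None] - i[None, :]) <= window

print(local_attention_mask(8, window=2).astype(int))
# attention logits are set to -inf where the mask is False,
# before the softmax over keys
```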

Neural Machine Translation using Recurrent Neural Network

2020 International Journal of Engineering and Advanced Technology  
In this era of globalization, it is quite likely that we will come across people or communities who do not share the same language for communication as we do.  ...  in order to facilitate machine translation.  ...  We are also grateful to our fellow students and other staff for dedicating their time and support to the successful completion of the research.  ... 
doi:10.35940/ijeat.d7637.049420 fatcat:rgi7ro62vvgapod73qsmfit4uu

Cross Attention with Monotonic Alignment for Speech Transformer

Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma
2020 Interspeech 2020  
However, automatic speech recognition (ASR) is characterized by a monotonic alignment between text output and speech input.  ...  Techniques like Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T) and Recurrent Neural Aligner (RNA) build on top of this monotonic alignment and use local encoded speech representations  ...  Monotonic alignment regularization: Alignment positions between the output and input should be strictly monotonic in the input sequence for speech recognition.  ... 
doi:10.21437/interspeech.2020-1198 dblp:conf/interspeech/ZhaoNLJCM20b fatcat:boyv4vubknalhc3sxaztxooppe
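
One simple way to express the "strictly monotonic" constraint quoted above as a differentiable penalty is to track the expected attended input position per output step and punish backward moves. Whether this matches the paper's exact regularizer is an assumption; the sketch below just illustrates the principle.

```python
# Illustrative monotonicity penalty on cross-attention: compute the
# expected attended input position for each output step and penalize
# backward moves. The paper's exact regularizer may differ; this is
# an assumed formulation of the same principle.
import numpy as np

def monotonicity_penalty(attn):
    """attn: (U, T) cross-attention weights; each row sums to 1."""
    expected_pos = attn @ np.arange(attn.shape[1])   # (U,)
    step = np.diff(expected_pos)                     # forward moves > 0
    return np.clip(-step, 0.0, None).sum()           # punish regressions

attn = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.3, 0.5, 0.1, 0.1]])              # last row drifts back
print(monotonicity_penalty(attn))                    # 0.3
```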

Cascade or Direct Speech Translation? A Case Study

Thierry Etchegoyhen, Haritz Arzelus, Harritxu Gete, Aitor Alvarez, Iván G. Torre, Juan Manuel Martín-Doñas, Ander González-Docasal, Edson Benites Fernandez
2022 Applied Sciences  
Speech translation has been traditionally tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text  ...  We describe and analyse in detail the mintzai-ST corpus, prepared from the sessions of the Basque Parliament, and evaluate the strengths and limitations of cascade and direct speech translation models  ...  alternatives, namely: cascade models, based on state-of-the-art components for speech recognition and machine translation, and end-to-end neural speech translation models.  ... 
doi:10.3390/app12031097 fatcat:wfn7wfe7izb6ncepopuj3n4c5q
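
As a toy contrast between the two architectures under comparison: a cascade chains separately built ASR and MT components, whereas a direct model maps audio straight to target text. All components in the sketch below are placeholder stubs, not real systems.

```python
# Toy contrast between cascade and direct speech translation: a
# cascade chains separately built ASR and MT components; a direct
# model maps audio straight to target text. All components here are
# placeholder stubs, not real systems.

def cascade_translate(audio, asr, mt):
    transcript = asr(audio)     # speech -> source-language text
    return mt(transcript)       # source text -> target-language text

def direct_translate(audio, st_model):
    return st_model(audio)      # speech -> target text, end to end

asr = lambda a: "kaixo mundua"              # stub Basque transcript
mt = lambda t: "hello world"                # stub Basque->English MT
print(cascade_translate(b"<audio bytes>", asr, mt))   # hello world
```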

UWSpeech: Speech to Speech Translation for Unwritten Languages [article]

Chen Zhang, Xu Tan, Yi Ren, Tao Qin, Kejun Zhang, Tie-Yan Liu
2020 arXiv   pre-print
In this paper, we develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter, and then translates source-language  ...  Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms the direct translation and VQ-VAE baselines by about 16 and 10 BLEU points respectively, which demonstrate  ...  model that has a shared speech encoder and two decoders: one is for phone-level automatic speech recognition on auxiliary written languages (e.g., German, French, and Chinese in this paper), and the other  ... 
arXiv:2006.07926v2 fatcat:5q4flanbzzdwfjlvjyi5vqcrxu
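
The discretization step at the heart of this setup, mapping continuous speech features to discrete tokens via nearest codebook entries as in the VQ-VAE baseline mentioned, can be sketched as follows; the codebook size and feature dimensions are illustrative assumptions.

```python
# Sketch of the discretization step: map continuous speech features
# to discrete tokens by nearest codebook entry, as in the VQ-VAE
# baseline mentioned above. Codebook size and dimensions are
# illustrative assumptions.
import numpy as np

def quantize(features, codebook):
    """features: (T, d); codebook: (K, d) -> token ids of shape (T,)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # K = 16 "phone-like" units
frames = rng.normal(size=(10, 4))     # 10 speech frames
print(quantize(frames, codebook))     # discrete tokens for the MT step
```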

Transformer with Bidirectional Decoder for Speech Recognition

Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin
2020 Interspeech 2020  
Attention-based models have made tremendous progress on end-to-end automatic speech recognition (ASR) recently.  ...  In this work, we introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously.  ...  targets is helpful for speech recognition.  ... 
doi:10.21437/interspeech.2020-2677 dblp:conf/interspeech/ChenZSOY20 fatcat:75iwsbclhvci7n2k2iw6qayami

FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task [article]

Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal
2021 arXiv   pre-print
In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which  ...  In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task.  ...  We build the multilingual model to perform speech translation and speech recognition tasks for all evaluation directions.  ... 
arXiv:2107.06959v2 fatcat:ubwxhxiiivgcfoktasexnv4umm

Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [article]

Danni Liu, Gerasimos Spanakis, Jan Niehues
2020 arXiv   pre-print
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation.  ...  On How2 English-Portuguese speech translation, we reduce latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.  ...  Conclusion In this paper, we explored approaches for latency reduction in sequence-to-sequence speech recognition and translation.  ... 
arXiv:2005.11185v2 fatcat:cvryrozpnnc3foiy4e2axdjnr4
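
A hedged sketch of one partial-hypothesis selection rule in this spirit: commit only the prefix on which two consecutive partial hypotheses agree, so output already shown to the user never needs to be retracted. Whether this is the paper's exact rule is an assumption; it illustrates the latency/quality trade-off the abstract reports.

```python
# Hedged sketch of one low-latency selection rule in this spirit:
# commit only the token prefix on which two consecutive partial
# hypotheses agree, so emitted output is never retracted. Whether
# this matches the paper's exact rule is an assumption.

def stable_prefix(prev_hyp, curr_hyp):
    """Longest common token prefix of consecutive partial hypotheses."""
    out = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        out.append(a)
    return out

print(stable_prefix("the cat sat on".split(), "the cat sat in the".split()))
# ['the', 'cat', 'sat'] -> safe to emit now; the rest stays pending
```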

Cross-modality translations improve recognition by reducing false alarms

Noah D. Forrin, Colin M. MacLeod
2017 Memory  
Acknowledgement We thank Tyler Good, Madison Stange, and Deanna Priori for their assistance in collecting the data.  ...  Correspondence may be directed to nforrin@gmail.com or to cmacleod@uwaterloo.ca Disclosure statement No potential conflict of interest was reported by the authors.  ...  For example, Dodson and Schacter (2001) found that a speech distinctiveness heuristic reduced FAs to lures on a recognition test, but did not increase hits to studied items.  ... 
doi:10.1080/09658211.2017.1321129 pmid:28462620 fatcat:qohabatz45e5dav2cnnna6ooc4

Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition [article]

Julian Salazar, Katrin Kirchhoff, Zhiheng Huang
2019 arXiv   pre-print
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition.  ...  We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition.  ...  temporal and spectral translation in ASR [8], or image translation in handwriting recognition [35]; they also serve as a form of dimensionality reduction (Section 2.4).  ... 
arXiv:1901.10055v2 fatcat:vjmxuek45vb3nccyqm6mg4khhy
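
The recipe described above, a decoder-free, fully self-attentional encoder trained with CTC, can be sketched in a few lines of PyTorch. Layer counts, sizes, and lengths here are illustrative stand-ins for the paper's actual configuration.

```python
# Minimal PyTorch sketch of the SAN-CTC recipe: a self-attentional
# encoder with a linear output layer, trained directly with CTC and
# no decoder. Layer sizes and lengths are illustrative stand-ins.
import torch
import torch.nn as nn

vocab, d_model, T, batch = 30, 64, 100, 2
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(d_model, vocab)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(batch, T, d_model)             # acoustic features
log_probs = to_vocab(encoder(feats)).log_softmax(-1).transpose(0, 1)
targets = torch.randint(1, vocab, (batch, 12))     # non-blank labels
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), T),
           target_lengths=torch.full((batch,), 12))
print(loss.item())
```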

Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding [article]

Pan Zhou, Ruchao Fan, Wei Chen, Jia Jia
2019 arXiv   pre-print
Our proposed methods achieve 7% relative improvement for short utterances and 30% absolute gains for long utterances on a 10,000-hour ASR task.  ...  To address these problems, we propose to use parallel schedule sampling (PSS) and relative positional embedding (RPE) to help the transformer generalize to unseen data.  ...  Thus RPE helps to decrease TD and ID. This also indicates that local and relative position information is more suitable for speech recognition.  ... 
arXiv:1911.00203v1 fatcat:kphr4sswp5dafnrdwi5jexxa2u
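
The parallel scheduled sampling idea can be sketched as follows: take the model's predictions from one teacher-forced parallel pass, then randomly swap some ground-truth decoder inputs for those predictions before a second parallel training pass. The mixing probability and the toy tensors below are stand-ins, not the authors' code.

```python
# Sketch of the parallel schedule sampling (PSS) idea: take model
# predictions from one teacher-forced parallel pass, then randomly
# replace some ground-truth decoder inputs with them for a second
# parallel pass. The mixing probability is an assumed hyperparameter.
import torch

def mix_tokens(gold, predicted, p):
    """Replace each gold token with the model's prediction w.p. p."""
    take_pred = torch.rand(gold.shape) < p
    return torch.where(take_pred, predicted, gold)

torch.manual_seed(0)
gold = torch.tensor([[5, 9, 2, 7, 3]])
predicted = torch.tensor([[5, 8, 2, 6, 3]])   # from the first pass
print(mix_tokens(gold, predicted, p=0.3))     # mixed decoder inputs
```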