Relative Positional Encoding for Speech Recognition and Direct Translation
2020
Interspeech 2020
In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. ...
However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. ...
In this work, we bring the advantages of relative position encoding to the Deep Transformer [8] for both speech recognition (ASR) and direct speech translation (ST). ...
doi:10.21437/interspeech.2020-2526
dblp:conf/interspeech/PhamHNNSSNW20
fatcat:kud4mx2nrnb4vbmlqxr7vxyacu
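The mechanism this entry describes, adding a learned embedding of the clipped relative distance between input states to the attention scores, can be sketched as follows. This is a minimal single-head illustration in the spirit of Shaw et al. (2018); the shapes, names, and clipping scheme are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def rel_self_attention(x, w_q, w_k, w_v, rel_emb, max_dist):
    """Single-head self-attention with an additive relative-position term.
    x: (T, d) input states; rel_emb: (2*max_dist + 1, d) learned embeddings
    of clipped relative distances j - i between key j and query i."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    idx = torch.arange(T)
    # Relative distance of each key from each query, clipped and shifted >= 0.
    dist = (idx[None, :] - idx[:, None]).clamp(-max_dist, max_dist) + max_dist
    r = rel_emb[dist]                                   # (T, T, d)
    scores = (q @ k.t() + torch.einsum('id,ijd->ij', q, r)) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

The clipping window is one place where a scheme tailored for text transfers imperfectly to acoustic inputs, which run to thousands of frames rather than tens of tokens.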
Relative Positional Encoding for Speech Recognition and Direct Translation
[article]
2020
arXiv
pre-print
In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. ...
However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. ...
In this work, we bring the advantages of relative position encoding to the Deep Transformer [8] for both speech recognition (ASR) and direct speech translation (ST). ...
arXiv:2005.09940v1
fatcat:ukuyawblzvg2rcox42y7oynvuu
Incorporating Relative Position Information in Transformer-Based Sign Language Recognition and Translation
2021
IEEE Access
The study proposes Gated Recurrent Unit (GRU)-Relative Sign Transformer (RST) for jointly learning Continuous Sign Language Recognition (CSLR) and translation. ...
In this approach, GRU acts as the relative position encoder and RST is the Transformer model with relative position incorporated in the Multi-Head Attention (MHA). ...
This calls for a new sign translation dataset similar to those used for NMT in speech and NLP.
doi:10.1109/access.2021.3122921
fatcat:kpz6eezlfnebfp3ogkcu5ihfri
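One plausible reading of "GRU acts as the relative position encoder" is a recurrence over the input features whose hidden states are injected into multi-head attention; the recurrence itself supplies the ordering signal. The wiring below is a guess for illustration only, with all shapes and the injection point assumed.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

x = torch.randn(8, 100, 256)   # (batch, frames, feature) sign features
pos, _ = gru(x)                # recurrent states encode position implicitly
out, _ = mha(query=x, key=x + pos, value=x)
```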
Visualization of Uncertainty in Lattices to Support Decision-Making
[article]
2007
EUROVIS 2005: Eurographics / IEEE VGTC Symposium on Visualization
Applications such as machine translation and automated speech recognition typically present users with a best-guess about the appropriate output, with apparent complete confidence. ...
Lattices compactly represent multiple possible outputs and are usually hidden from users. ...
Also, since value, size, position, and transparency are ordered (values can be visually sorted), we used these to encode uncertainty to allow for comparison of the relative scores between nodes. ...
doi:10.2312/vissym/eurovis07/051-058
fatcat:osjskhnxsnhj7mc4thxvbqv6dq
LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
2022
Information
Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. ...
To this end, we propose a local attention Transformer model for speech recognition that exploits the high correlation among neighboring speech frames. ...
This mechanism is extremely suitable for certain tasks, such as machine translation, where the input and output words are not in the same order; however, in speech recognition, the output text sequences ...
doi:10.3390/info13050250
fatcat:rgrac3t6wfachmzhs54uoye66m
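The local attention idea, restricting each frame to a neighborhood of highly correlated frames, reduces to a banded mask on the attention scores. A minimal sketch, where the window size is an assumed hyperparameter:

```python
import torch

def local_attention_mask(T, window):
    """True where attention is allowed: each frame may attend only to
    frames within `window` positions on either side of itself."""
    idx = torch.arange(T)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Typical use: scores.masked_fill(~local_attention_mask(T, w), float('-inf'))
```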
Neural Machine Translation using Recurrent Neural Network
2020
International Journal of Engineering and Advanced Technology
In this era of globalization, it is quite likely that one will come across people or communities who do not share a common language for communication. ...
in order to facilitate machine translation. ...
We are also grateful to our fellow students and other staff for dedicating their time and support for successful completion of the research. ...
doi:10.35940/ijeat.d7637.049420
fatcat:rgi7ro62vvgapod73qsmfit4uu
Cross Attention with Monotonic Alignment for Speech Transformer
2020
Interspeech 2020
However, automatic speech recognition (ASR) is characterized by a monotonic alignment between text output and speech input. ...
Techniques like Connectionist Temporal Classification (CTC), RNN Transducer (RNN-T) and Recurrent Neural Aligner (RNA) build on top of this monotonic alignment and use local encoded speech representations ...
Monotonic alignment regularization: alignment positions between the output and input should be strictly monotonic in the input sequence for speech recognition. ...
doi:10.21437/interspeech.2020-1198
dblp:conf/interspeech/ZhaoNLJCM20b
fatcat:boyv4vubknalhc3sxaztxooppe
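One simple way to encourage the strictly monotonic alignments this entry describes is to penalize any backward movement of the expected attention position across output steps. This is an illustrative regularizer consistent with the snippet, not necessarily the paper's exact formulation:

```python
import torch

def monotonic_alignment_penalty(attn):
    """attn: (U, T) cross-attention weights over T input frames for U
    output tokens. Sums the backward jumps of the expected alignment
    position as the output index advances."""
    pos = torch.arange(attn.size(1), dtype=attn.dtype)
    expected = attn @ pos                          # (U,) expected positions
    backward = (expected[:-1] - expected[1:]).clamp(min=0)
    return backward.sum()
```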
Cascade or Direct Speech Translation? A Case Study
2022
Applied Sciences
Speech translation has been traditionally tackled under a cascade approach, chaining speech recognition and machine translation components to translate from an audio source in a given language into text ...
We describe and analyse in detail the mintzai-ST corpus, prepared from the sessions of the Basque Parliament, and evaluate the strengths and limitations of cascade and direct speech translation models ...
alternatives, namely: cascade models, based on state-of-the-art components for speech recognition and machine translation, and end-to-end neural speech translation models. ...
doi:10.3390/app12031097
fatcat:wfn7wfe7izb6ncepopuj3n4c5q
UWSpeech: Speech to Speech Translation for Unwritten Languages
[article]
2020
arXiv
pre-print
In this paper, we develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter, and then translates source-language ...
Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms the direct translation and VQ-VAE baselines by about 16 and 10 BLEU points respectively, which demonstrates ...
model that has a shared speech encoder and two decoders: one is for phone-level automatic speech recognition on auxiliary written languages (e.g., German, French, and Chinese in this paper), and the other ...
arXiv:2006.07926v2
fatcat:5q4flanbzzdwfjlvjyi5vqcrxu
Transformer with Bidirectional Decoder for Speech Recognition
2020
Interspeech 2020
Attention-based models have made tremendous progress on end-to-end automatic speech recognition (ASR) recently. ...
In this work, we introduce a bidirectional speech transformer to utilize the different directional contexts simultaneously. ...
targets is helpful for speech recognition. ...
doi:10.21437/interspeech.2020-2677
dblp:conf/interspeech/ChenZSOY20
fatcat:75iwsbclhvci7n2k2iw6qayami
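A common way to realize a bidirectional decoder is to train on both left-to-right and right-to-left target orders and combine the losses, so the model sees both directional contexts. A sketch under an assumed `model.decode(src, tgt)` interface returning per-token logits, not the paper's API:

```python
def bidirectional_decoder_loss(model, src, tgt, criterion):
    """Sum of losses from decoding the target in both directions;
    `tgt` is a (batch, U) token tensor."""
    loss_l2r = criterion(model.decode(src, tgt), tgt)
    loss_r2l = criterion(model.decode(src, tgt.flip(1)), tgt.flip(1))
    return loss_l2r + loss_r2l
```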
FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task
[article]
2021
arXiv
pre-print
In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which ...
In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. ...
We build the multilingual model to perform speech translation and speech recognition tasks for all evaluation directions. ...
arXiv:2107.06959v2
fatcat:ubwxhxiiivgcfoktasexnv4umm
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
[article]
2020
arXiv
pre-print
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. ...
On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system. ...
Conclusion: in this paper, we explored approaches for latency reduction in sequence-to-sequence speech recognition and translation. ...
arXiv:2005.11185v2
fatcat:cvryrozpnnc3foiy4e2axdjnr4
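Partial hypothesis selection can be as simple as emitting only the prefix on which all live beam hypotheses agree, since that prefix cannot be retracted by later search steps. This is an illustrative heuristic, not necessarily the paper's exact selection rule:

```python
def stable_prefix(hypotheses):
    """hypotheses: list of token lists from the current beam. Returns
    the longest common prefix, which is safe to display immediately."""
    if not hypotheses:
        return []
    prefix = list(hypotheses[0])
    for hyp in hypotheses[1:]:
        n = 0
        while n < min(len(prefix), len(hyp)) and prefix[n] == hyp[n]:
            n += 1
        prefix = prefix[:n]
    return prefix
```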
Cross-modality translations improve recognition by reducing false alarms
2017
Memory
Acknowledgement We thank Tyler Good, Madison Stange, and Deanna Priori for their assistance in collecting the data. ...
Correspondence may be directed to nforrin@gmail.com or to cmacleod@uwaterloo.ca
Disclosure statement No potential conflict of interest was reported by the authors. ...
For example, Dodson and Schacter (2001) found that a speech distinctiveness heuristic reduced FAs to lures on a recognition test, but did not increase hits to studied items. ...
doi:10.1080/09658211.2017.1321129
pmid:28462620
fatcat:qohabatz45e5dav2cnnna6ooc4
Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition
[article]
2019
arXiv
pre-print
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. ...
We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. ...
temporal and spectral translation in ASR [8], or image translation in handwriting recognition [35]; they also serve as a form of dimensionality reduction (Section 2.4). ...
arXiv:1901.10055v2
fatcat:vjmxuek45vb3nccyqm6mg4khhy
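Training a fully self-attentional encoder with CTC, in the spirit of SAN-CTC, fits in a few lines of PyTorch. All layer sizes, lengths, and the vocabulary size below are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=6)
proj = nn.Linear(256, 32)          # 31 output labels + CTC blank (assumed)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(200, 8, 256)   # (frames, batch, features)
logp = proj(encoder(feats)).log_softmax(-1)
targets = torch.randint(1, 32, (8, 50))
loss = ctc(logp, targets,
           input_lengths=torch.full((8,), 200),
           target_lengths=torch.full((8,), 50))
```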
Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding
[article]
2019
arXiv
pre-print
Our proposed methods achieve 7% relative improvement for short utterances and 30% absolute gains for long utterances on a 10,000-hour ASR task. ...
To address these problems, we propose to use parallel schedule sampling (PSS) and relative positional embedding (RPE) to help transformer generalize to unseen data. ...
Thus RPE helps to decrease TD and ID. This also indicates that local and relative position information is more suitable for speech recognition. ...
arXiv:1911.00203v1
fatcat:kphr4sswp5dafnrdwi5jexxa2u
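Parallel schedule sampling exploits the Transformer decoder's parallelism: one teacher-forced pass produces predictions, and a second pass trains on decoder inputs where ground-truth tokens are randomly replaced by those predictions. The `model.decode` interface and batch-first shapes below are assumptions, not the paper's code:

```python
import torch

def parallel_schedule_sampling(model, src, tgt_in, sample_prob):
    """tgt_in: (batch, U) shifted-right ground-truth decoder inputs."""
    with torch.no_grad():
        pred = model.decode(src, tgt_in).argmax(-1)   # pass 1: teacher forcing
    # Shift predictions right so position t sees the prediction for t-1.
    pred_in = torch.cat([tgt_in[:, :1], pred[:, :-1]], dim=1)
    mask = torch.rand(tgt_in.shape, device=tgt_in.device) < sample_prob
    mixed_in = torch.where(mask, pred_in, tgt_in)
    return model.decode(src, mixed_in)                # pass 2: train on the mix
```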
Showing results 1 — 15 out of 27,380 results