329 Hits in 3.4 sec

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [article]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
2021 arXiv   pre-print
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound  ...  To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target  ...  RELATED WORK We discuss recent studies on self-supervised speech representation learning by grouping them by training objective.  ... 
arXiv:2106.07447v1 fatcat:y2x227ubtzbmzduuphvlptoghy
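To make the two-step recipe in this abstract concrete (an offline clustering step producing pseudo-labels, then BERT-style masked prediction of those labels), here is a minimal toy sketch. The features, cluster count, mask pattern, and "transformer outputs" are random stand-ins invented for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 12, 4, 3                      # frames, feature dim, number of "hidden units"
feats = rng.normal(size=(T, D))         # stand-in for MFCC / hidden-state features

# Step 1 (offline clustering): a few k-means iterations assign each frame a
# pseudo-label; these cluster ids play the role of HuBERT's hidden units.
centroids = feats[rng.choice(T, K, replace=False)].copy()
for _ in range(5):
    labels = np.argmin(((feats[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for k in range(K):
        if (labels == k).any():
            centroids[k] = feats[labels == k].mean(axis=0)

# Step 2 (masked prediction): score the model's unit predictions only on
# masked frames, as in BERT.
mask = np.zeros(T, dtype=bool)
mask[::3] = True
logits = rng.normal(size=(T, K))        # stand-in for transformer outputs
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[mask, labels[mask]].mean()
```

Restricting the loss to masked positions is what forces the model to infer units from surrounding context rather than copy its input.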

Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset [article]

Aly Moustafa, Salah A. Aly
2021 arXiv   pre-print
The end-to-end Wav2Vec2.0 paradigm learns contextualized speech representations by randomly masking a set of feature vectors and then applying a transformer neural network.  ...  In this paper, we develop a deep-learning model for Arabic speaker identification using the Wav2Vec2.0 and HuBERT audio representation learning tools.  ...  ACKNOWLEDGEMENT This research is partially funded by a grant from the Academy of Scientific Research and Technology (ASRT), 2020-2021, research grant number 6547.  ... 
arXiv:2111.06331v1 fatcat:y2xayleywvagfogradf6l5mep4

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [article]

Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee
2021 arXiv   pre-print
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech  ...  Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.  ...  We note that the number of prediction heads can be 1 to L, where L is the number of hidden layers in the self-supervised speech model to be distilled.  ... 
arXiv:2110.01900v2 fatcat:rm5gy4ie6ndnfprkq5nh27xjxa
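The excerpt's idea of up to L prediction heads distilling hidden layers directly can be sketched as follows. This is an illustrative toy, not the paper's code: each small head regresses one teacher layer from the same shared student representation, and the per-head weighting between the L1 and cosine terms is invented here:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, L = 6, 8, 3                                      # frames, width, number of heads
teacher = [rng.normal(size=(T, D)) for _ in range(L)]  # frozen teacher hidden layers
student = rng.normal(size=(T, D))                      # shared student representation
heads = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(L)]

def cos(a, b):
    # Per-frame cosine similarity between predicted and teacher representations.
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

# Multi-task distillation: every head maps the single student representation
# toward a different teacher layer; losses are summed across heads.
loss = 0.0
for W, target in zip(heads, teacher):
    pred = student @ W
    loss += np.abs(pred - target).mean() - cos(pred, target).mean()
```

Because all heads share one student trunk, the student is pushed to pack information from several teacher layers into a single compact representation.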

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [article]

Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe
2021 arXiv   pre-print
Self-supervised pretraining on speech data has made substantial progress.  ...  High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance.  ...  Specifically, it used the Bridges system [50], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).  ... 
arXiv:2110.04590v1 fatcat:p4peb5urpzaxja62cgalfjnyuy

Self-Supervised Learning for speech recognition with Intermediate layer supervision [article]

Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang
2021 arXiv   pre-print
To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL  ...  Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.  ...  During pre-training, we select a set of layers K as supervised layers and compute the masked prediction loss on the output hidden states h l , where l ∈ K.  ... 
arXiv:2112.08778v1 fatcat:alzeddobl5hxthe63nc5xjxwpi
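The last excerpt describes computing the masked prediction loss on the hidden states h_l of a selected layer set K in addition to the top layer. A minimal sketch of that loss composition, with random toy logits and an arbitrary layer set standing in for the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
T, C, n_layers = 8, 5, 6
labels = rng.integers(0, C, size=T)               # pseudo-labels of masked frames
hidden_logits = rng.normal(size=(n_layers, T, C)) # per-layer unit logits (toy stand-in)
K = {1, 3}                                        # layers selected for extra supervision

def masked_ce(logits, labels):
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Usual masked-prediction loss on the top layer, plus the same loss on each
# selected intermediate layer, pushing content information into lower layers.
loss = masked_ce(hidden_logits[-1], labels)
loss += sum(masked_ce(hidden_logits[l], labels) for l in K)
```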

W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [article]

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu
2021 arXiv   pre-print
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning.  ...  the latter trains the model to learn contextualized speech representations by solving a masked prediction task consuming the discretized tokens.  ...  CONCLUSION AND FUTURE WORK We proposed w2v-BERT for self-supervised speech representation learning. w2v-BERT is composed of a contrastive module for discretizing continuous speech and a masked prediction  ... 
arXiv:2108.06209v2 fatcat:scwfvpaaanb7lj3w4ywjxmm5h4
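The conclusion excerpt names the two modules: a contrastive module that discretizes continuous speech, and a masked-prediction module trained on the resulting tokens. A toy sketch of that composition, with a nearest-neighbour codebook standing in for the quantizer and random logits for both modules (shapes and losses are illustrative, not the paper's exact objective):

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, V = 10, 4, 6
frames = rng.normal(size=(T, D))
codebook = rng.normal(size=(V, D))

# Contrastive module (abridged): its quantizer snaps each frame to the
# nearest codebook entry, producing discrete token ids.
tokens = np.argmin(((frames[:, None] - codebook[None]) ** 2).sum(-1), axis=1)

# Toy InfoNCE term: each frame should score its own quantized target above
# the other codebook entries acting as distractors.
sims = frames @ codebook.T
logp_c = sims - np.log(np.exp(sims).sum(-1, keepdims=True))
contrastive = -logp_c[np.arange(T), tokens].mean()

# MLM module: cross-entropy on masked positions against the token ids.
mask = np.zeros(T, dtype=bool)
mask[1::2] = True
logits = rng.normal(size=(T, V))
logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
mlm_loss = -logp[mask, tokens[mask]].mean()

loss = contrastive + mlm_loss   # the two objectives are optimized jointly
```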

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion [article]

Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper
2021 arXiv   pre-print
The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion.  ...  To learn soft units, we predict a distribution over discrete speech units.  ...  HuBERT consists of two steps: acoustic unit discovery followed by masked prediction.  ... 
arXiv:2111.02392v1 fatcat:aisilct3sza2necqmaxrjg5ale
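The contrast between discrete and soft units in this abstract can be shown in a few lines. In this toy sketch (codebook and frame are random stand-ins), a hard unit is the nearest centroid, while a soft unit is the expected embedding under a distribution over units, which keeps gradations that a hard assignment throws away:

```python
import numpy as np

rng = np.random.default_rng(4)
K, D = 5, 4
codebook = rng.normal(size=(K, D))   # discrete speech units (e.g. k-means centroids)
frame = rng.normal(size=D)           # one frame's representation

dists = ((codebook - frame) ** 2).sum(-1)
hard = codebook[np.argmin(dists)]    # discrete unit: nearest centroid only

p = np.exp(-dists) / np.exp(-dists).sum()   # distribution over units
soft = p @ codebook                          # soft unit: probability-weighted mix
```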

SUPERB: Speech processing Universal PERformance Benchmark [article]

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng (+8 others)
2021 arXiv   pre-print
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.  ...  Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).  ...  It is becoming a new principle to solve problems by pretraining a shared model with self-supervision tasks on a large amount of unlabeled data to encode general-purpose knowledge.  ... 
arXiv:2105.01051v4 fatcat:sjhwizrsdngovf4reyzttwqkba
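The "lightweight prediction heads on top of the frozen shared model" setup can be sketched as below. Assumptions for illustration: the downstream head consumes a learnable softmax-weighted sum of the frozen upstream's layer outputs (a common SUPERB-style recipe), and all tensors here are random toys:

```python
import numpy as np

rng = np.random.default_rng(5)
n_layers, T, D, n_classes = 4, 6, 8, 3
hidden = rng.normal(size=(n_layers, T, D))   # frozen upstream's per-layer outputs

# Only these parameters are trained per task: layer weights + a small head.
w = np.zeros(n_layers)
head = rng.normal(size=(D, n_classes)) / np.sqrt(D)

alpha = np.exp(w) / np.exp(w).sum()                    # softmax layer weights
pooled = (alpha[:, None, None] * hidden).sum(axis=0)   # weighted sum of layers
logits = pooled @ head                                 # task-specialized prediction
```

Keeping the upstream frozen makes the benchmark compare representations rather than fine-tuning budgets.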

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [article]

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou (+6 others)
2021 arXiv   pre-print
WavLM extends the HuBERT framework to denoising masked speech modeling, where the model takes simulated noisy speech as input and, on masked regions, predicts pseudo-labels derived from the original speech.  ...  Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.  ...  Motivated by the masked language model loss in NLP, DiscreteBERT and HuBERT predict discrete targets of masked regions.  ... 
arXiv:2110.13900v3 fatcat:yii5iys4bzfcnc4ddy4f3h7r2m
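The denoising twist on masked prediction described above can be sketched as follows. In this toy, the model input is simulated noisy speech (here, a clean signal with an interfering signal mixed in), while the masked-region targets are pseudo-labels derived from the clean speech; the signals, labels, and "model outputs" are all random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(6)
T, K = 10, 4
clean_labels = rng.integers(0, K, size=T)   # pseudo-labels computed from *clean* speech
clean = rng.normal(size=T)
noise = rng.normal(size=T)                  # e.g. an overlapped second utterance
noisy = clean + 0.5 * noise                 # simulated noisy model input

mask = np.zeros(T, dtype=bool)
mask[2:6] = True
logits = rng.normal(size=(T, K))            # stand-in for model outputs on the noisy input
logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))

# Denoising masked speech modeling: on masked regions of the noisy input,
# predict the clean speech's pseudo-labels, so the model must also denoise.
loss = -logp[mask, clean_labels[mask]].mean()
```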

Characterizing the adversarial vulnerability of speech self-supervised learning [article]

Haibin Wu, Bo Zheng, Xu Li, Xixin Wu, Hung-yi Lee, Helen Meng
2021 arXiv   pre-print
As the paradigm of a self-supervised upstream model followed by downstream tasks attracts more attention in the speech community, characterizing the adversarial robustness of such a paradigm is  ...  speech tasks with minimal modification of architectures and a small amount of data, has fueled the research on speech representation learning.  ...  It first masks the hidden speech representations extracted by a multi-layer convolutional network from an utterance, followed by transformer layers that build contextualized representations given the hidden  ... 
arXiv:2111.04330v1 fatcat:lfkdm4xoy5gwporqxe5putsrvq

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [article]

Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li (+1 others)
2021 arXiv   pre-print
for self-supervised speech/text representation learning.  ...  Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-training natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training  ...  Within the same period, self-supervised speech representation learning has also been investigated and shown promising results, benefiting from richly learned representations (Chung and Glass, 2018; Chuang  ... 
arXiv:2110.07205v1 fatcat:z2datuiax5gs7jhpb5crxhiwsu

Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning [article]

Yi-Chen Chen, Shu-wen Yang, Cheng-Kuang Lee, Simon See, Hung-yi Lee
2021 arXiv   pre-print
Speech representation learning plays a vital role in speech processing. Among these methods, self-supervised learning (SSL) has become an important research direction.  ...  We analyze the generalizability of supervised MTL finetuning to examine if the speech representation learned by MTL finetuning can generalize to unseen new tasks.  ...  HuBERT utilizes an offline clustering algorithm on hidden representations to provide aligned target labels for a BERT-like [30] prediction.  ... 
arXiv:2110.09930v1 fatcat:6zpldsek6zarpmpvohgef3c3he

Vector-based navigation using grid-like representations in artificial agents

Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil, Greg Wayne, Hubert Soyer (+14 others)
2018 Nature  
Navigation, however, remains a substantial challenge for artificial agents, with deep neural networks trained by reinforcement learning (RL) [3-5] failing to rival the proficiency of mammalian spatial  ...  As before, the grid network was trained using supervised learning but, to better approximate the information available to navigating mammals, it now received velocity signals perturbed with random  ...  Network architecture in the supervised learning experiment. The recurrent layer of the grid cell network is an LSTM with 128 hidden units.  ... 
doi:10.1038/s41586-018-0102-6 pmid:29743670 fatcat:y32i4g5hhnbgpo42othotkwwei

Generative Spoken Language Modeling from Raw Audio [article]

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux
2021 arXiv   pre-print
Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach  ...  evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.  ...  In Neural Information Processing Systems Workshop on Self-Supervised Learning for Speech and Audio Processing Workshop, pages 6533-6537.  ... 
arXiv:2102.01192v2 fatcat:vuucz32wxjcqrc42s3wo7d5tk4

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021 [article]

Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky
2021 arXiv   pre-print
Phoneme-discriminative representations are achieved by executing a second round of clustering on the outputs of the final layer of the autoregressive model.  ...  We present a system for the Zero Resource Speech Challenge 2021, which combines Contrastive Predictive Coding (CPC) with deep clustering.  ...  Contrastive Predictive Coding The speech representation model is based on CPC, a self-supervised representation learning method proposed in [10].  ... 
arXiv:2107.05899v1 fatcat:3fnj3b2ppvbnrntacwgndsgjay
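The CPC objective the excerpt refers to can be sketched in miniature: a context vector predicts a latent a few steps ahead, and an InfoNCE loss makes the true future latent score above negatives from other timesteps. The latents, context, and prediction head here are random toys, not the challenge system:

```python
import numpy as np

rng = np.random.default_rng(7)
T, D = 9, 4
z = rng.normal(size=(T, D))                 # latent features per timestep
c = rng.normal(size=D)                      # context vector at time t (toy)
Wk = rng.normal(size=(D, D)) / np.sqrt(D)   # prediction head for offset k
t, k = 3, 2

# InfoNCE: the prediction for step t+k should pick out z[t+k] among
# negatives drawn from the other timesteps.
pred = c @ Wk
scores = z @ pred                           # similarity to every timestep's latent
logp = scores - np.log(np.exp(scores).sum())
nce_loss = -logp[t + k]
```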
Showing results 1 — 15 out of 329 results