Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet [article]

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
2019 arXiv   pre-print
We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. ... We propose using an extended model architecture of Tacotron, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. ... This research was carried out while the first author was at NII, Japan in 2018 under the NII International Internship Program. ...
arXiv:1903.12389v2 fatcat:k43cqpkwfvaebhp64tydqgynui
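
The shared-model idea here is a decoder that attends to two sources at once: a text encoder (for TTS) and a speech encoder (for VC). Below is a minimal PyTorch sketch of such a dual-attention decoder step; it is an illustrative reconstruction, not the authors' code, and the layer sizes and the concatenation-based fusion of the two context vectors are assumptions.

```python
# Minimal sketch of a dual-attention decoder step for joint TTS/VC.
import torch
import torch.nn as nn

class DualSourceDecoderStep(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        # One attention module per source: text memory and speech memory.
        self.text_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.speech_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # fuse the two context vectors

    def forward(self, query, text_memory, speech_memory):
        # query: (B, 1, dim) decoder state; memories: (B, T, dim).
        text_ctx, _ = self.text_attn(query, text_memory, text_memory)
        speech_ctx, _ = self.speech_attn(query, speech_memory, speech_memory)
        # For a TTS batch the speech memory can be zeroed out, and for a VC
        # batch the text memory, so one decoder serves both tasks.
        return self.proj(torch.cat([text_ctx, speech_ctx], dim=-1))

step = DualSourceDecoderStep()
out = step(torch.randn(2, 1, 256),    # decoder query
           torch.randn(2, 20, 256),   # encoded text
           torch.randn(2, 80, 256))   # encoded source speech
print(out.shape)  # torch.Size([2, 1, 256])
```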

Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
2019 Interspeech 2019  
We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. ... We propose using an extended model architecture of Tacotron, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. ... This research was carried out while the first author was at NII, Japan in 2018 under the NII International Internship Program. ...
doi:10.21437/interspeech.2019-1357 dblp:conf/interspeech/00030F0Y19 fatcat:urmwhweuzjakhisxfsgb67ijdq

Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data [article]

Roee Levy Leshem, Raja Giryes
2020 arXiv   pre-print
Training multi-speaker voice conversion systems requires substantial resources, in both training time and corpus size. ... Using mid-size public datasets, our method outperforms the baseline in the VCC 2018 SPOKE non-parallel voice conversion task and achieves competitive results compared to multi-speaker networks trained ... The purpose of voice conversion (VC) is to convert the speech of a source speaker so that it sounds like a given target speaker. ...
arXiv:1904.03522v4 fatcat:efumvvpw6jbb7ehp2qfdatgxzy

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad ... We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive ... [52] or jointly trained with a voice conversion task [440]. ...
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi
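
The survey's decomposition of a neural TTS system into text analysis, an acoustic model, and a vocoder amounts to a three-stage pipeline. The sketch below is a hedged stand-in for that pipeline: every function body is a placeholder for a real model, and the function names, 80-dimensional mel frames, and 256-sample hop are illustrative assumptions.

```python
# Hedged sketch of the neural TTS pipeline: text analysis -> acoustic
# model -> vocoder. All bodies are placeholders for real models.
def text_analysis(text: str) -> list[str]:
    # Real systems do text normalization and grapheme-to-phoneme
    # conversion; splitting into characters is a stand-in.
    return list(text.lower())

def acoustic_model(symbols: list[str]) -> list[list[float]]:
    # A real acoustic model (e.g. Tacotron 2) maps symbols to a
    # mel-spectrogram; this stub emits one dummy 80-dim frame per symbol.
    return [[0.0] * 80 for _ in symbols]

def vocoder(mel: list[list[float]]) -> list[float]:
    # A real vocoder (e.g. WaveNet) maps mel frames to waveform samples;
    # a fixed hop of 256 samples per frame is assumed here.
    return [0.0] * (256 * len(mel))

wav = vocoder(acoustic_model(text_analysis("Hello world")))
print(len(wav), "samples")
```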

A Review of Deep Learning Based Speech Synthesis

Yishuang Ning, Sheng He, Zhiyong Wu, Chunxiao Xing, Liang-Jie Zhang
2019 Applied Sciences  
For speech synthesis, deep learning based techniques can leverage large-scale <text, speech> pairs to learn effective feature representations that bridge the gap between text and speech, thus better ... Speech synthesis, also known as text-to-speech (TTS), has attracted increasing attention. ... There are also some works that combine Tacotron and WaveNet for speech synthesis, such as Deep Voice 2 [72]. ...
doi:10.3390/app9194050 fatcat:gfhpemdjxvgatlocsyrud255r4

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Moreover, this paper summarizes the open-source speech corpora of English, Chinese, and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and ... Due to the high complexity and low efficiency of traditional speech synthesis technology, the current research focus is deep learning-based end-to-end speech synthesis, which ... [81] used a voice conversion (VC) model to convert the voice data of other speakers into the voice of the target speaker for data augmentation, then trained the TTS model with the expanded speech data ...
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta
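
The data-augmentation recipe cited in this snippet (reference [81]) converts other speakers' recordings into the target speaker's voice and trains TTS on the enlarged corpus. A rough sketch follows; the VC model, its `convert` method, and the (text, waveform) corpus layout are hypothetical stand-ins, not a specific system's API.

```python
# Sketch of VC-based data augmentation for TTS training. The VC model,
# its convert() method, and the corpus layout are hypothetical.
def augment_corpus(target_corpus, other_corpora, vc_model):
    augmented = list(target_corpus)  # (text, waveform) pairs
    for corpus in other_corpora:
        for text, wav in corpus:
            # Convert the other speaker's waveform into the target voice.
            augmented.append((text, vc_model.convert(wav)))
    return augmented

class IdentityVC:
    # Dummy stand-in: a real VC model would change the speaker identity.
    def convert(self, wav):
        return wav

data = augment_corpus([("hi", [0.0])], [[("yo", [0.1])]], IdentityVC())
print(len(data))  # 2 -- the TTS model would then be trained on `data`
```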

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [article]

Yusuke Yasuda, Xin Wang, Junichi Yamagishi
2020 arXiv   pre-print
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or from simple linguistic features such as phonemes. ... On the model architecture side, we adopt the modified Tacotron systems that we previously proposed, together with variants using an encoder from Tacotron or Tacotron 2. ... We use WaveNet [18] to synthesize waveforms from the mel-spectrogram. WaveNet is trained with the same mel-spectrogram used for training the Japanese Tacotron. ...
arXiv:2005.10390v2 fatcat:bnny4jsnqvf4nl4gpeias5lbjm
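
The WaveNet vocoder mentioned above generates waveform samples autoregressively from dilated causal convolutions conditioned on the mel-spectrogram. A single gated residual layer of that kind can be sketched as follows; the channel sizes and kernel width are illustrative, not the paper's configuration.

```python
# One mel-conditioned, gated residual layer in the style of WaveNet:
# a dilated causal convolution over past samples plus a 1x1 projection
# of local mel conditioning. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualLayer(nn.Module):
    def __init__(self, channels=64, mel_dim=80, dilation=2):
        super().__init__()
        self.pad = dilation  # left-pad (kernel_size - 1) * dilation: causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                              dilation=dilation)
        self.cond = nn.Conv1d(mel_dim, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, mel):
        # x: (B, C, T) sample features; mel: (B, 80, T), upsampled to T.
        h = self.conv(F.pad(x, (self.pad, 0))) + self.cond(mel)
        a, b = h.chunk(2, dim=1)
        out = torch.tanh(a) * torch.sigmoid(b)  # gated activation unit
        return x + self.res(out)                # residual connection

layer = GatedResidualLayer()
y = layer(torch.randn(1, 64, 100), torch.randn(1, 80, 100))
print(y.shape)  # torch.Size([1, 64, 100])
```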

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li
2020 IEEE/ACM Transactions on Audio Speech and Language Processing  
Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding.  ...  We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.  ...  Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence  ... 
doi:10.1109/taslp.2020.3038524 fatcat:duw2edjapzc3pb24hcr5vgjb5y
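
The components this overview enumerates (speech analysis, spectral conversion, prosody conversion, vocoding) form the classic VC processing chain. The stub pipeline below only fixes the data flow; each stage is a placeholder for a real model (e.g. WORLD analysis, a learned spectral mapping, a neural vocoder), and all names are hypothetical.

```python
# Stub of the classic voice conversion chain: analysis -> spectral and
# prosody conversion -> vocoding. Every stage is a placeholder.
def analyze(wav):
    # Decompose speech into spectral envelope, F0, and aperiodicity.
    return {"spectrum": wav, "f0": wav, "aperiodicity": wav}

def convert_spectrum(spec):
    return spec  # a learned source-to-target spectral mapping goes here

def convert_prosody(f0):
    return f0    # e.g. a log-F0 mean/variance transform to the target

def vocode(feats):
    return feats["spectrum"]  # waveform reconstruction from features

def voice_conversion(wav):
    feats = analyze(wav)
    feats["spectrum"] = convert_spectrum(feats["spectrum"])
    feats["f0"] = convert_prosody(feats["f0"])
    return vocode(feats)

print(voice_conversion([0.0, 0.1, -0.1]))
```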

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning [article]

Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li
2020 arXiv   pre-print
Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding.  ...  We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.  ...  Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence  ... 
arXiv:2008.03648v2 fatcat:nehs6o22pzdirffvedqtby4sd4

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations [article]

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
2019 arXiv   pre-print
This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. ... The model parameters are estimated by two-stage training, including a pretraining stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. ... Voice cloning is a task that learns the voice of unseen speakers from a few speech samples for text-to-speech synthesis [41]–[43]. ...
arXiv:1906.10508v3 fatcat:osntc3a7kfezjf3aym525vtm7a
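
The disentanglement idea in this abstract, separate encodings for linguistic content and speaker identity recombined by a decoder, can be sketched as below. The GRU-based layers and the way the speaker embedding is broadcast over time are assumptions, not the paper's exact model.

```python
# Rough sketch of disentangled VC: a content encoder over the source
# utterance, a speaker encoder summarizing a reference utterance, and a
# decoder combining the two. Layer choices are assumptions.
import torch
import torch.nn as nn

class DisentangledVC(nn.Module):
    def __init__(self, mel_dim=80, hid=128):
        super().__init__()
        self.content_enc = nn.GRU(mel_dim, hid, batch_first=True)
        self.speaker_enc = nn.GRU(mel_dim, hid, batch_first=True)
        self.decoder = nn.GRU(2 * hid, hid, batch_first=True)
        self.out = nn.Linear(hid, mel_dim)

    def forward(self, src_mel, ref_mel):
        content, _ = self.content_enc(src_mel)   # (B, T, hid) per frame
        _, spk = self.speaker_enc(ref_mel)       # (1, B, hid) utterance summary
        spk = spk.transpose(0, 1).expand(-1, content.size(1), -1)
        dec, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(dec)  # converted mel-spectrogram

model = DisentangledVC()
mel = model(torch.randn(2, 50, 80), torch.randn(2, 120, 80))
print(mel.shape)  # torch.Size([2, 50, 80])
```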

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
2020 Interspeech 2020  
For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed.  ...  This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion.  ...  The WaveNet vocoder [18] is adopted for recovering the waveforms of converted voice. For training model parameters, an external multi-speaker dataset is first adopted for pre-training.  ... 
doi:10.21437/interspeech.2020-0036 dblp:conf/interspeech/ZhangL020 fatcat:5qwtspoamfdjlnt3ghcorhagba
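
One common way to realize the adversarial part of recognition-synthesis VC is a speaker classifier on the recognizer's bottleneck features whose gradients are reversed, pushing those features to become speaker-independent. The sketch below shows that general technique only; it is not claimed to match this paper's exact losses.

```python
# Gradient-reversal sketch: a speaker classifier on bottleneck features,
# trained adversarially so the features carry no speaker identity. This
# illustrates the general technique, not this paper's exact method.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients flowing back into the features

bottleneck = torch.randn(4, 128, requires_grad=True)  # recognizer features
speaker_clf = nn.Linear(128, 10)  # 10 training speakers, illustrative
logits = speaker_clf(GradReverse.apply(bottleneck))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
loss.backward()  # bottleneck now receives reversed (adversarial) gradients
print(bottleneck.grad.shape)  # torch.Size([4, 128])
```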

Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams

Guanlong Zhao, Shaojin Ding, Ricardo Gutierrez-Osuna
2019 Interspeech 2019  
We present a framework for FAC that eliminates the need for conventional vocoders (e.g., STRAIGHT, WORLD) and therefore the need to use the native speaker's excitation. ... Our approach uses an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then trains a speech synthesizer to map PPGs from the non-native ... Acknowledgements: This work was supported by NSF awards 1619212 and 1623750. ...
doi:10.21437/interspeech.2019-1778 dblp:conf/interspeech/ZhaoDG19 fatcat:avdfb5brwnhslmd4ertixipnle
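
The PPG pipeline described above has two halves: a recognizer's acoustic model that turns frames into per-frame phone posteriors, and a synthesizer trained on the non-native speaker that maps PPGs back to that speaker's speech. Both halves are stubbed below; the phone inventory size, MFCC input, and 256-sample hop are illustrative assumptions.

```python
# Stub of the PPG-based accent conversion pipeline.
import numpy as np

N_PHONES = 40  # illustrative phone inventory size

def acoustic_model(frames):
    # Stand-in for a recognizer trained on native speech: returns a
    # speaker-independent posterior over phones for each frame.
    logits = np.random.randn(len(frames), N_PHONES)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # rows sum to 1: these are PPGs

def synthesizer(ppgs):
    # Stand-in for a synthesizer trained on the non-native speaker's voice.
    return np.zeros(len(ppgs) * 256)  # hypothetical 256-sample hop

native_frames = np.random.randn(100, 13)  # e.g. MFCC frames
converted = synthesizer(acoustic_model(native_frames))
print(converted.shape)  # (25600,)
```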

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning [article]

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
2020 arXiv   pre-print
For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed.  ...  This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion.  ...  The WaveNet vocoder [18] is adopted for recovering the waveforms of converted voice. For training model parameters, an external multi-speaker dataset is first adopted for pre-training.  ... 
arXiv:2008.02371v1 fatcat:iaive77nsfepvetgfztdbvojea

Towards Universal Text-to-Speech

Jingzhou Yang, Lei He
2020 Interspeech 2020  
This paper studies a multilingual sequence-to-sequence text-to-speech framework aimed at universal modeling, able to synthesize speech for any speaker in any language using a single model. ... A data balance training strategy is successfully applied and effectively improves the voice quality of the low-resource languages. ... The conventional text-to-speech (TTS) system employs different models to generate voices in different languages. ...
doi:10.21437/interspeech.2020-1590 dblp:conf/interspeech/YangH20 fatcat:regdhb4offdbxkww3bh2w3zkhy
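
The abstract reports a data balance training strategy for low-resource languages without detailing it; one standard recipe is temperature-based sampling, where languages are drawn with probability proportional to corpus size raised to an exponent below one. The sketch below is that generic recipe, an assumption rather than the paper's exact strategy.

```python
# Temperature-based language sampling: raising corpus sizes to alpha < 1
# flattens the distribution and upweights low-resource languages.
# (Generic recipe; not necessarily the paper's exact strategy.)
import random

corpus_sizes = {"en": 100_000, "de": 20_000, "sw": 1_000}  # utterances
alpha = 0.5
weights = {lang: n ** alpha for lang, n in corpus_sizes.items()}
total = sum(weights.values())
probs = {lang: w / total for lang, w in weights.items()}

langs, p = zip(*probs.items())
print(probs)                                  # sw rises from ~0.8% to ~6.5%
print(random.choices(langs, weights=p, k=8))  # languages for one batch
```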

A Review on Methods and Applications in Multimodal Deep Learning [article]

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
2022 arXiv   pre-print
The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. ... e.g., image, video, text, audio, body gestures, facial expressions, and physiological signals. ... As for DLTTS applications, the use of speech synthesis for other real-world applications like voice conversion or translation, cross-lingual speech conversion, audio-video speech synthesis, etc. ...
arXiv:2202.09195v1 fatcat:wwxrmrwmerfabbenleylwmmj7y
Showing results 1–15 of 52.