A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL. The file type is application/pdf.
Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
[article]
2019
arXiv
pre-print
We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. ...
We propose using an extended model architecture of Tacotron, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. ...
This research was carried out while the first author was at NII, Japan, in 2018 under the NII International Internship Program. ...
arXiv:1903.12389v2
fatcat:k43cqpkwfvaebhp64tydqgynui
Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet
2019
Interspeech 2019
We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. ...
We propose using an extended model architecture of Tacotron, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. ...
This research was carried out while the first author was at NII, Japan, in 2018 under the NII International Internship Program. ...
doi:10.21437/interspeech.2019-1357
dblp:conf/interspeech/00030F0Y19
fatcat:urmwhweuzjakhisxfsgb67ijdq
Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
[article]
2020
arXiv
pre-print
The training of multi-speaker voice conversion systems requires a large number of resources, both in training and corpus size. ...
Using mid-size public datasets, our method outperforms the baseline in the VCC 2018 SPOKE non-parallel voice conversion task and achieves competitive results compared to multi-speaker networks trained ...
INTRODUCTION The purpose of voice conversion (VC) is to convert the speech of a source speaker into that of a desired target speaker. ...
arXiv:1904.03522v4
fatcat:efumvvpw6jbb7ehp2qfdatgxzy
A Survey on Neural Speech Synthesis
[article]
2021
arXiv
pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad ...
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive ...
[52] or jointly trained with the voice conversion task [440]. ...
arXiv:2106.15561v3
fatcat:pbrbs6xay5e4fhf4ewlp7qvybi
A Review of Deep Learning Based Speech Synthesis
2019
Applied Sciences
For speech synthesis, deep learning based techniques can leverage a large scale of <text, speech> pairs to learn effective feature representations to bridge the gap between text and speech, thus better ...
Speech synthesis, also known as text-to-speech (TTS), has attracted increasingly more attention. ...
There are also some works that combine Tacotron and WaveNet for speech synthesis, such as Deep Voice 2 [72] . ...
doi:10.3390/app9194050
fatcat:gfhpemdjxvgatlocsyrud255r4
Review of end-to-end speech synthesis technology based on deep learning
[article]
2021
arXiv
pre-print
Moreover, this paper also summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and ...
Due to the limitations of high complexity and low efficiency of traditional speech synthesis technology, the current research focus is the deep learning-based end-to-end speech synthesis technology, which ...
[81] used a voice conversion (VC) model to convert the voice data of other speakers into the voice of the target speaker for data augmentation, then trained the TTS model with the expanded speech data ...
arXiv:2104.09995v1
fatcat:q5lx74ycx5hobjox4ktl3amfta
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
[article]
2020
arXiv
pre-print
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. ...
For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. ...
We use WaveNet [18] to synthesize waveforms from the mel-spectrogram. WaveNet is trained with the same mel-spectrogram used for training Japanese Tacotron. ...
arXiv:2005.10390v2
fatcat:bnny4jsnqvf4nl4gpeias5lbjm
An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
2020
IEEE/ACM Transactions on Audio Speech and Language Processing
Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. ...
We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research. ...
Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence ...
doi:10.1109/taslp.2020.3038524
fatcat:duw2edjapzc3pb24hcr5vgjb5y
An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
[article]
2020
arXiv
pre-print
Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. ...
We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research. ...
Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence ...
arXiv:2008.03648v2
fatcat:nehs6o22pzdirffvedqtby4sd4
Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations
[article]
2019
arXiv
pre-print
This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. ...
The model parameters are estimated by two-stage training, including a pretraining stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. ...
Voice cloning Voice cloning is a task that learns the voice of unseen speakers from a few speech samples for text-to-speech synthesis [41]–[43]. ...
arXiv:1906.10508v3
fatcat:osntc3a7kfezjf3aym525vtm7a
Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning
2020
Interspeech 2020
For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed. ...
This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. ...
The WaveNet vocoder [18] is adopted for recovering the waveforms of converted voice. For training model parameters, an external multi-speaker dataset is first adopted for pre-training. ...
doi:10.21437/interspeech.2020-0036
dblp:conf/interspeech/ZhangL020
fatcat:5qwtspoamfdjlnt3ghcorhagba
Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams
2019
Interspeech 2019
We present a framework for FAC that eliminates the need for conventional vocoders (e.g., STRAIGHT, World) and therefore the need to use the native speaker's excitation. ...
Our approach uses an acoustic model trained on a native speech corpus to extract speaker-independent phonetic posteriorgrams (PPGs), and then train a speech synthesizer to map PPGs from the non-native ...
Acknowledgements This work was supported by NSF awards 1619212 and 1623750. ...
doi:10.21437/interspeech.2019-1778
dblp:conf/interspeech/ZhaoDG19
fatcat:avdfb5brwnhslmd4ertixipnle
Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning
[article]
2020
arXiv
pre-print
For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed. ...
This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. ...
The WaveNet vocoder [18] is adopted for recovering the waveforms of converted voice. For training model parameters, an external multi-speaker dataset is first adopted for pre-training. ...
arXiv:2008.02371v1
fatcat:iaive77nsfepvetgfztdbvojea
Towards Universal Text-to-Speech
2020
Interspeech 2020
This paper studies a multilingual sequence-to-sequence text-to-speech framework towards universal modeling, that is able to synthesize speech for any speaker in any language using a single model. ...
A data balance training strategy is successfully applied and effectively improves the voice quality of the low-resource languages. ...
Introduction The conventional text-to-speech (TTS) system employs different models to generate voices in different languages. ...
doi:10.21437/interspeech.2020-1590
dblp:conf/interspeech/YangH20
fatcat:regdhb4offdbxkww3bh2w3zkhy
A Review on Methods and Applications in Multimodal Deep Learning
[article]
2022
arXiv
pre-print
The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. ...
., image, video, text, audio, body gestures, facial expressions, and physiological signals. ...
• As for DLTTS applications, speech synthesis can serve other real-world tasks such as voice conversion or translation, cross-lingual speech conversion, audio-visual speech synthesis, etc. ...
arXiv:2202.09195v1
fatcat:wwxrmrwmerfabbenleylwmmj7y
Showing results 1 — 15 out of 52 results