14,516 Hits in 4.0 sec

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis [article]

Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi
2020 arXiv   pre-print
We explore pretraining strategies, including the choice of base corpus, with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis.  ...  We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve the naturalness of synthetic speech and its similarity to unseen target speakers.  ...  However, end-to-end multi-speaker TTS is very data-hungry, and hence model pretraining is very important in practice.  ...
arXiv:2011.04839v1 fatcat:v3wl7qyj3rfhxokedqtvxkscdy
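The "simple quality threshold" for found audiobook data suggests a lightweight filtering pass before fine-tuning. A minimal sketch, assuming a crude energy-based SNR proxy; the function names and the 20 dB cutoff are illustrative, not from the paper:

```python
import numpy as np

def estimate_snr_db(wav: np.ndarray, frame: int = 1024) -> float:
    """Crude SNR proxy: loud-frame energy vs. quiet-frame energy, in dB."""
    frames = [wav[i:i + frame] for i in range(0, len(wav) - frame + 1, frame)]
    energies = np.array([float(np.mean(f.astype(float) ** 2)) + 1e-10 for f in frames])
    loud = np.percentile(energies, 90)    # proxy for speech-frame energy
    quiet = np.percentile(energies, 10)   # proxy for noise-floor energy
    return 10.0 * np.log10(loud / quiet)

def passes_quality_threshold(wav: np.ndarray, min_snr_db: float = 20.0) -> bool:
    return estimate_snr_db(wav) >= min_snr_db

# corpus = [w for w in found_audiobook_wavs if passes_quality_threshold(w)]
```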

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [article]

Dongyang Dai, Li Chen, Yuping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang
2020 arXiv   pre-print
In this paper, the proposed end-to-end speech synthesis model uses both a speaker embedding and a noise representation as conditional inputs to model speaker and noise information respectively.  ...  With the rise of deep neural networks, speech synthesis has achieved significant improvements based on the end-to-end encoder-decoder framework in recent years.  ...  on the end-to-end speech synthesis model, and pre-train the TTS model on multi-speaker enhancement data for noise-robust personalized TTS.  ...
arXiv:2005.12531v2 fatcat:unc2rvyw6rawve5qqapu2qhu4q
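A minimal sketch of the conditioning described above, assuming a PyTorch encoder-decoder TTS model: the speaker embedding and noise representation are broadcast over time and concatenated with the encoder output. All dimensions are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class ConditionedDecoderInput(nn.Module):
    """Fuse encoder states with speaker and noise conditioning vectors."""
    def __init__(self, enc_dim=256, spk_dim=64, noise_dim=32, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(enc_dim + spk_dim + noise_dim, out_dim)

    def forward(self, enc_out, spk_emb, noise_emb):
        # enc_out: (B, T, enc_dim); spk_emb: (B, spk_dim); noise_emb: (B, noise_dim)
        T = enc_out.size(1)
        cond = torch.cat([spk_emb, noise_emb], dim=-1)        # (B, spk_dim + noise_dim)
        cond = cond.unsqueeze(1).expand(-1, T, -1)            # broadcast over time
        return self.proj(torch.cat([enc_out, cond], dim=-1))  # (B, T, out_dim)
```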

Semi-Supervised Learning for Multi-Speaker Text-to-Speech Synthesis Using Discrete Speech Representation

Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-yi Lee
2020 Interspeech 2020  
Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success in settings where large amounts of high-quality speech and corresponding transcriptions are available.  ...  Index Terms: multi-speaker speech synthesis, semi-supervised learning, discrete speech representation In this section, we briefly overview the SeqRQ-AE, which is trained on a large amount of unpaired  ...  Introduction Recent advances in neural end-to-end text-to-speech (TTS) systems have closed the gap between human speech and synthesized speech in terms of both speech quality and speech  ...
doi:10.21437/interspeech.2020-1824 dblp:conf/interspeech/TuCLL20 fatcat:fg4atpd63bgnrb5jxovg5dkvfq
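The discrete speech representation rests on a quantization step. Below is a minimal sketch of nearest-neighbor codebook lookup with a straight-through gradient, which is one standard way to realize such a representation; it is not the SeqRQ-AE code itself:

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # features: (T, D) frame-level encoder outputs; codebook: (K, D) learned codes
    dists = torch.cdist(features, codebook)   # (T, K) pairwise distances
    codes = dists.argmin(dim=-1)               # nearest code index per frame
    quantized = codebook[codes]                 # (T, D) discrete representation
    # straight-through estimator: forward pass uses the codes,
    # backward pass copies gradients to the continuous features
    return features + (quantized - features).detach()
```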

Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

Mengnan Chen, Minchuan Chen, Shuang Liang, Jun Ma, Lei Chen, Shaojun Wang, Jing Xiao
2019 Interspeech 2019  
In this paper, we present a cross-lingual, multi-speaker neural end-to-end TTS framework which can model speaker characteristics and synthesize speech in different languages.  ...  Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years.  ...  In our work, we extract the speakers' voice characteristics across languages and enable an end-to-end speech synthesis system to support multiple languages.  ...
doi:10.21437/interspeech.2019-1632 dblp:conf/interspeech/ChenCLMCWX19 fatcat:72s4uq45ivdp5gc7g4x7qtudya

The MSXF TTS System for ICASSP 2022 ADD Challenge [article]

Chunyong Yang, Pengfei Liu, Yanli Chen, Hongbin Wang, Min Liu
2022 arXiv   pre-print
We use an end-to-end text-to-speech system and add a constraint loss to the system during training. The end-to-end TTS system is VITS, and the pre-trained self-supervised model is wav2vec 2.0.  ...  We also explore the influence of speech speed and volume on spoofing: faster speech means less silence in the audio, which makes it easier to fool the detector.  ...  Finally, we choose to build a multi-speaker text-to-speech system. There are many successful multi-speaker text-to-speech systems, e.g. [3] [4] [5] .  ...
arXiv:2201.11400v1 fatcat:luagwlz4fretfg6dfngmfx3ree
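One way such a training-time constraint loss could look: penalize the distance between self-supervised features of real and synthesized audio. In this sketch, `ssl_model` stands in for a frozen pre-trained wav2vec 2.0 encoder mapping waveforms to frame-level features; the L1 distance and the weighting are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def constraint_loss(ssl_model, real_wav, synth_wav):
    with torch.no_grad():
        real_feat = ssl_model(real_wav)    # frozen target features, no gradient
    synth_feat = ssl_model(synth_wav)      # gradients flow back through synthesis
    return F.l1_loss(synth_feat, real_feat)

# total_loss = vits_losses + lambda_c * constraint_loss(ssl_model, real, synth)
```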

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [article]

Tao Tu, Yuan-Jui Chen, Alexander H. Liu, Hung-yi Lee
2020 arXiv   pre-print
Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success in settings where large amounts of high-quality speech and corresponding transcriptions are available.  ...  A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with a discrete speech representation.  ...  Introduction Recent advances in neural end-to-end text-to-speech (TTS) systems have closed the gap between human speech and synthesized speech in terms of both speech quality and speech  ...
arXiv:2005.08024v2 fatcat:cra4qc5tjfezdlzjsroskmeqom

Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS? [article]

Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi
2020 arXiv   pre-print
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity.  ...  A large-scale listening test is conducted, and a distance metric is adopted to evaluate synthesis of dialects.  ...  Introduction Recent advances in end-to-end text-to-speech (TTS) synthesis enable the production of synthetic speech of high quality and good speaker similarity [1, 2, 3, 4] .  ... 
arXiv:2005.01245v2 fatcat:kquxt33kgndivhgm267qc6yv3q
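The snippet does not specify the adopted distance metric. A common choice for comparing synthesized and reference speech is the cosine distance between utterance-level speaker embeddings, sketched below; `embed` is an assumed extractor, not part of the paper:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embedding vectors (0 = identical direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# d = cosine_distance(embed(synth_wav), embed(reference_wav))
```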

fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit [article]

Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino
2021 arXiv   pre-print
This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants.  ...  To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically.  ...  Experiments We evaluate our models in three settings: single-speaker synthesis, multi-speaker synthesis and multi-speaker synthesis using noisy data.  ...
arXiv:2109.06912v1 fatcat:jnedcf7hd5b3dpuzqpi3slwet4

CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge [article]

Daxin Tan, Hingpang Huang, Guangyan Zhang, Tan Lee
2021 arXiv   pre-print
Our system comprises three stages: a multi-speaker training stage, a target speaker adaptation stage and a target speaker synthesis stage. Our team is identified as T17.  ...  An end-to-end voice cloning system is developed to accomplish the task, which includes: 1. a text and speech front-end module with the help of forced alignment, 2. an acoustic model combining Tacotron2  ...  Our system consists of three stages: a multi-speaker training stage, a target speaker adaptation stage and a target speaker synthesis stage.  ...
arXiv:2103.04699v5 fatcat:iuzklb7p5bh4ba5e7ef3zjy2de

Speech Recognition with Augmented Synthesized Speech [article]

Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu
2019 arXiv   pre-print
The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations thereby allowing for manipulation  ...  Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive,  ...  There are instances when end-to-end speech synthesis fails to faithfully synthesize the input utterance.  ... 
arXiv:1909.11699v1 fatcat:nkte7qcubvdstiz33gm3cefttq
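A hedged sketch of the augmentation idea behind this entry: pool real transcribed utterances with TTS-synthesized ones (paired with the texts they were generated from) and train the ASR model on the mixture. The ratio and function names are illustrative assumptions:

```python
import random

def build_training_pool(real_pairs, synth_pairs, synth_ratio=0.5):
    """real_pairs / synth_pairs: lists of (audio, transcript) tuples.

    Adds up to synth_ratio * len(real_pairs) synthetic utterances to the
    real data and shuffles, so batches mix real and synthesized speech.
    """
    n_synth = int(len(real_pairs) * synth_ratio)
    pool = list(real_pairs) + random.sample(synth_pairs, min(n_synth, len(synth_pairs)))
    random.shuffle(pool)
    return pool
```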

Introduction to the Special Issue "Speaker and Language Characterization and Recognition: Voice Modeling, Conversion, Synthesis and Ethical Aspects"

Jean-François Bonastre, Tomi Kinnunen, Anthony Larcher, Junichi Yamagishi
2019 Computer Speech and Language  
in text-to-speech (TTS) synthesis.  ...  The article entitled Vocoder-Free Text-to-Speech Synthesis Incorporating Generative Adversarial Networks Using Low-/Multi-Frequency STFT Amplitude Spectra by Saito et al. addresses quality degradation  ...
doi:10.1016/j.csl.2019.101021 fatcat:mpw674uefrbuxmrfmbvvcyphwi

Dynamic Soft Windowing and Language Dependent Style Token for Code-Switching End-to-End Speech Synthesis

Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Chunyu Qiang, Tao Wang
2020 Interspeech 2020  
Most current end-to-end speech synthesis assumes the input text is in a single language.  ...  In this paper, both a windowing technique and style token modeling are designed for code-switching end-to-end speech synthesis.  ...  In this paper, we look into Mandarin-English code-switching end-to-end speech synthesis based on a multi-speaker bilingual speech database.  ...
doi:10.21437/interspeech.2020-1754 dblp:conf/interspeech/FuTWYQW20 fatcat:axfpfvlqe5e6fmfmscxzaso274
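One plausible reading of "soft windowing" is re-weighting attention by a smooth window that moves along the encoder timeline, so the decoder attends locally. The Gaussian form and parameters below are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def soft_window(attn: torch.Tensor, center: float, width: float = 5.0) -> torch.Tensor:
    # attn: (T_enc,) attention weights for one decoder step
    t = torch.arange(attn.size(0), dtype=attn.dtype)
    window = torch.exp(-0.5 * ((t - center) / width) ** 2)  # soft window around `center`
    w = attn * window                                        # suppress far-away timesteps
    return w / w.sum()                                       # renormalize to a distribution
```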

Bi-Level Speaker Supervision for One-Shot Speech Synthesis

Tao Wang, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Chunyu Qiang
2020 Interspeech 2020  
Speaker feature extraction and speaker identity reconstruction are integrated into an end-to-end speech synthesis network, with the one at the speaker feature level closing the gap in speaker characteristics and  ...  The gap between the speaker characteristics of reference speech and synthesized speech remains a challenging problem in one-shot speech synthesis.  ...  Introduction With the development of deep learning, end-to-end speech synthesis models, such as Tacotron [1] and its variants [2] [3] [4] , have been proposed to simplify the traditional TTS pipeline [5] [6  ...
doi:10.21437/interspeech.2020-1737 dblp:conf/interspeech/WangTFYWQ20 fatcat:3cfl5yfiybgbdmij5llqplqazy
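A rough sketch of the bi-level supervision the abstract describes: one loss pulls the synthesized speech's speaker embedding toward the reference embedding, while the other asks a classifier to recover the speaker identity from that embedding. The two modules and the equal loss weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def bilevel_speaker_loss(spk_encoder, spk_classifier, synth_mel, ref_emb, spk_id):
    emb = spk_encoder(synth_mel)                               # (B, D) speaker features
    feat_loss = 1.0 - F.cosine_similarity(emb, ref_emb).mean()  # feature-level supervision
    id_loss = F.cross_entropy(spk_classifier(emb), spk_id)      # identity reconstruction
    return feat_loss + id_loss
```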

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis

Ruibo Fu, Jianhua Tao, Zhengqi Wen, Jiangyan Yi, Tao Wang, Chunyu Qiang
2020 Interspeech 2020  
End-to-end speech synthesis can reach high quality and naturalness with low-resource adaptation data.  ...  Limited adaptation data leads to unacceptable errors and low similarity in the synthetic speech.  ...  Introduction End-to-end speech synthesis, such as Tacotron, can achieve state-of-the-art performance, even close to human recordings, when trained on a large corpus [1] [2] [3] [4] [5] .  ...
doi:10.21437/interspeech.2020-1623 dblp:conf/interspeech/FuTWYWQ20 fatcat:roh4ihznnvcwbcez6nnrdm3lbi

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis [article]

Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan
2018 arXiv   pre-print
GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.  ...  Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.  ...  When synthesizing with the expressive audiobook voice, however, the multi-speaker TP-GST model yields more expressive speech than a multi-speaker Tacotron conditioned on the same data.  ... 
arXiv:1808.01410v1 fatcat:jgac2iugrngqfoxo7qv6ub54iu
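The global-style-token mechanism this entry builds on can be sketched compactly: a style embedding is a softmax-weighted sum over a bank of learned tokens, and in the text-predicted (TP-GST) variant the weights come from a text representation rather than reference audio. Dimensions and the linear predictor here are illustrative:

```python
import torch
import torch.nn as nn

class TextPredictedGST(nn.Module):
    """Predict style-token weights from a text summary vector."""
    def __init__(self, num_tokens=10, token_dim=256, text_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))  # learned token bank
        self.predictor = nn.Linear(text_dim, num_tokens)

    def forward(self, text_summary):                          # (B, text_dim)
        weights = torch.softmax(self.predictor(text_summary), dim=-1)
        return weights @ torch.tanh(self.tokens)              # (B, token_dim) style embedding
```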
Showing results 1 — 15 out of 14,516 results