
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [article]

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
2022 arXiv   pre-print
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing  ...  , including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer  ...  However, controlling and transferring styles to out-of-domain (OOD) target voice in speech synthesis remains elusive.  ...
arXiv:2205.07211v1 fatcat:xbd5tfkxdnfnbhg2zoj6w63ycq
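
The GenerSpeech snippet above mentions a generalizable content adaptor built around a Mix-Style layer. As a rough point of reference, the sketch below shows generic MixStyle-style statistic mixing in PyTorch: per-channel mean and standard deviation are treated as "style", normalized away, and re-injected as a random mixture across the batch. The function name and tensor shapes are my own assumptions for illustration; the actual Mix-Style Layer Normalization in the paper may differ in detail.

```python
import torch

def mixstyle(x, alpha=0.1, eps=1e-6):
    """Mix per-channel feature statistics across a batch (MixStyle-like sketch).

    x: hidden features of shape (batch, channels, time).
    Returns features whose instance statistics are a random convex mix of
    those from two samples, which discourages the content encoder from
    memorizing speaker/style statistics.
    """
    b = x.size(0)
    mu = x.mean(dim=2, keepdim=True)                   # per-sample channel means
    sig = (x.var(dim=2, keepdim=True) + eps).sqrt()    # per-sample channel stds
    x_norm = (x - mu) / sig                            # strip "style" statistics

    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1))
    perm = torch.randperm(b)                           # pair each sample with another
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix                   # re-inject mixed statistics
```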

A Survey on Neural Speech Synthesis [article]

Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
2021 arXiv   pre-print
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad  ...  We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive  ...  How to design better methods for expressive/controllable/transferrable speech synthesis is also appealing. • More human-like speech synthesis.  ... 
arXiv:2106.15561v3 fatcat:pbrbs6xay5e4fhf4ewlp7qvybi

DurIAN: Duration Informed Attention Network For Multimodal Synthesis [article]

Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
2019 arXiv   pre-print
Finally, a simple yet effective approach for fine-grained control of expressiveness of speech and facial expression is introduced.  ...  This is different from the end-to-end attention mechanism used, and accounts for various unavoidable artifacts, in existing end-to-end speech synthesis systems such as Tacotron.  ...  Linchao Bao, Haozhi Huang and other members in the Tencent AI Lab computer vision team for providing facial modeling features and multimodal experiment environment.  ... 
arXiv:1909.01700v2 fatcat:zh36kga3czak5fey5r3sje4p6e

Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis [article]

Yi Lei, Shan Yang, Lei Xie
2020 arXiv   pre-print
This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis.  ...  For emotional speech synthesis with arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from text, which does not require any reference audio or manual  ...  Undoubtedly, human speech contains subtle expressions at various granularities. For instance, phoneme-level prosody variations are important for expressive speech synthesis [15].  ...
arXiv:2011.08477v1 fatcat:pldzha6vczf6hmt3dyttex7giy

Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture

Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws
2021 Conference of the International Speech Communication Association  
bottleneck when we are interested in generating speech in a variety of expressive styles.  ...  The architecture is furthermore controllable, allowing the user to select an operating point that conveys a desired level of expressiveness.  ...  Note that the global offsets can be applied to both sentence-level and word-level components. We applied the fine-tuning to derive distinctive moderate and high levels of expression for each of the non-neutral  ...
doi:10.21437/interspeech.2021-1446 dblp:conf/interspeech/ShechtmanFSH21 fatcat:6thwek3unve2bjks4egjh7udgq

PPSpeech: Phrase based Parallel End-to-End TTS System [article]

Yahuan Cong, Ran Zhang, Jian Luan
2020 arXiv   pre-print
Experiments show that the synthesis speed of PPSpeech is much faster than sentence-level autoregressive Tacotron 2 when a sentence has more than 5 phrases.  ...  On the other hand, the style of synthetic speech becomes unstable and may change noticeably across sentences.  ...  It performs speech synthesis in parallel at the phrase level, which greatly shortens synthesis time compared to sentence-level autoregressive speech synthesis.  ...
arXiv:2008.02490v1 fatcat:mmwl6xe43rh6znvx6wqdqkeo2m

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis [article]

Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
2021 arXiv   pre-print
In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling fine control over the prosody and speaking style of synthesized speech.  ...  Our proposed framework also provides controllability of the speaking style across an entire utterance.  ...  Experimental results indicate that the proposed model is effective for expressive and controllable speech synthesis.  ...
arXiv:2009.08474v2 fatcat:cvdnfbhvwvb2tlp5nmzjpgxd4y

Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
2020 Interspeech 2020  
In recent years, fine-grained latent variables have been introduced into text-to-speech synthesis, enabling fine control over the prosody and speaking style of synthesized speech.  ...  Our proposed framework also provides controllability of the speaking style across an entire utterance.  ...  Experimental results indicate that the proposed model is effective for expressive and controllable speech synthesis.  ...
doi:10.21437/interspeech.2020-2477 dblp:conf/interspeech/HonoTSHONT20 fatcat:gwmaqc6fmrfg7krx2azwwr4qiq

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [article]

Keon Lee, Kyumin Park, Daeyoung Kim
2021 arXiv   pre-print
Previous work on neural text-to-speech (TTS) has addressed limited speed in training and inference, robustness under difficult synthesis conditions, expressiveness, and controllability.  ...  Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading-style non-autoregressive  ...  In this paper, we propose STYLER, a fast and robust style modeling TTS framework for expressive and controllable speech synthesis.  ...
arXiv:2103.09474v4 fatcat:slqogt6e3vbcvnohmnustfklzu

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis [article]

Yi Lei, Shan Yang, Xinsheng Wang, Lei Xie
2022 arXiv   pre-print
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years.  ...  Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style  ...  Instead of predicting global styles, some recent work has tried to predict styles at a fine-grained level, such as the phoneme level [37] or word level [44].  ...
arXiv:2201.06460v1 fatcat:jzjhbd6f5req3d2bk4zk24lf5a
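
The MsEmoTTS snippet above describes modeling emotion at several scales at once (a global emotion category plus utterance-level and local strength). The toy module below merely illustrates how global, utterance-level, and phoneme-level conditioning signals could be merged into one sequence for an acoustic model; the class and layer choices are hypothetical and are not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiScaleEmotion(nn.Module):
    """Toy illustration of combining emotion information at several scales."""

    def __init__(self, n_emotions=5, dim=256):
        super().__init__()
        self.emotion_table = nn.Embedding(n_emotions, dim)  # global emotion class
        self.strength_proj = nn.Linear(1, dim)              # scalar strength -> vector

    def forward(self, emotion_id, utterance_strength, phoneme_strengths):
        # emotion_id: (batch,), utterance_strength: (batch, 1),
        # phoneme_strengths: (batch, n_phonemes, 1)
        g = self.emotion_table(emotion_id).unsqueeze(1)          # global scale
        u = self.strength_proj(utterance_strength).unsqueeze(1)  # utterance scale
        l = self.strength_proj(phoneme_strengths)                # local (phoneme) scale
        return g + u + l  # broadcasts to (batch, n_phonemes, dim)
```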

Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis [article]

Raul Fernandez, David Haws, Guy Lorberbom, Slava Shechtman, Alexander Sorin
2022 arXiv   pre-print
Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate  ...  Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer  ...  We also ensured that the proposed HPC-controlled NAT2 system is responsive to HPC offsets [6] and suitable for local word focus realization, as in [11], and for "simpler" speaking style transfer, as  ...
arXiv:2207.12262v1 fatcat:bsmh23buejfmzcyl7l4bpapiu4

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis [article]

Antti Suni, Sofoklis Kakouros, Martti Vainio, Juraj Šimko
2020 arXiv   pre-print
This type of prosodic variation often reflects long-distance semantic relationships that are not accessible for end-to-end systems with a single sentence as their synthesis domain.  ...  this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure.  ...  speech style and single sentence level prosody.  ... 
arXiv:2006.15967v1 fatcat:xei2h7ok5relrbqrkhcuaanlha

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Moreover, this paper also summarizes open-source speech corpora for English, Chinese, and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  and more attention.  ...  How to achieve fine-grained style control of speech at the word and phrase level will also be a focus of future TTS research.  ...
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [article]

Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri
2022 arXiv   pre-print
Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input.  ...  Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech.  ...  styles and control emphasis at word level.  ... 
arXiv:2110.02952v2 fatcat:xhp2ekbcpvehzikhxnshhmsxli
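
The snippet above refers to controlling emphasis at the word level in a non-autoregressive TTS model. As a purely illustrative sketch (not the paper's hierarchical conditioning scheme), the function below biases per-phone prosody values for the phones of a single word, which is one simple way word-level emphasis control is often pictured; all names and data layouts here are my own assumptions.

```python
import numpy as np

def emphasize_word(prosody, word_to_phones, word_idx, scale=1.3):
    """Bias per-phone prosody features for one word to suggest emphasis.

    prosody: dict of 1-D numpy arrays keyed by feature name
             (e.g. "pitch", "energy", "duration"), one value per phone.
    word_to_phones: list mapping each word index to its phone indices.
    """
    phones = word_to_phones[word_idx]
    out = {}
    for name, values in prosody.items():
        scaled = values.copy()
        scaled[phones] = scaled[phones] * scale  # raise pitch/energy/duration
        out[name] = scaled
    return out
```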

A review of paralinguistic information processing for natural speech communication

Yoichi Yamashita
2013 Acoustical Science and Technology  
This paper reviews recognition and synthesis techniques for speech communication, focusing on emotion and emphasis as well as corpora that are indispensable to the development of current speech technologies.  ...  on, and is called para- or non-linguistic information.  ...  Yamagishi et al. describe MLLR adaptation from reading style to emotional speech for HMM-based speech synthesis [55].  ...
doi:10.1250/ast.34.73 fatcat:amqsvafo35cntnyln2xue273wu
Showing results 1 — 15 out of 23,606 results