343 Hits in 6.4 sec

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis [article]

Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu
2020 arXiv   pre-print
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.  ...  Interpretations of prosody attributes are provided together with a comparison between word-level and phone-level prosody representations.  ...  Table 4 (average variance ratio): baseline fine-grained VAE 5.8 ± 0.8; fully-hierarchical VAE 8.0 ± 2.9.  ... 
arXiv:2002.03785v1 fatcat:6hn4ajomdbeetaq2mfj7dfl6mu

Fine-grained Noise Control for Multispeaker Speech Synthesis [article]

Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
2022 arXiv   pre-print
To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results in more expressive speech synthesis.  ...  unsupervised, interpretable and fine-grained noise and prosody modeling.  ...  In order to learn speech attributes other than content and speaker, we perform fine-grained noise and prosody modeling.  ... 
arXiv:2204.05070v1 fatcat:cmy3dsmyjfhm5plzuvezehsp6i

Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [article]

Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
2021 arXiv   pre-print
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.  ...  By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.  ...  Furthermore, a hierarchical, multi-level, fine-grained VAE structure is proposed in [9] , modeling word-level and phoneme-level prosody features, while a similar VAE structure with the addition of a quantization  ... 
arXiv:2111.10177v1 fatcat:7wa5o5yqsbfale6juzhkok3m24
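The note-and-octave control mentioned in this snippet rests on the standard equal-temperament mapping between F0 in Hz and MIDI note numbers. A minimal sketch of that conversion (the function name and note spelling are illustrative, not taken from the paper):

```python
import math

A4_HZ = 440.0  # equal-temperament reference pitch (A4)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def f0_to_note(f0_hz: float) -> tuple[str, int]:
    """Map an F0 value in Hz to the nearest note name and octave."""
    midi = round(69 + 12 * math.log2(f0_hz / A4_HZ))  # MIDI note number (A4 = 69)
    return NOTE_NAMES[midi % 12], midi // 12 - 1      # octave convention where A4 = 440 Hz

print(f0_to_note(440.0))  # A4
```

Quantizing each phoneme's F0 to such note centroids is what turns a continuous pitch contour into a discrete, musically meaningful control space.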

DurIAN: Duration Informed Attention Network For Multimodal Synthesis [article]

Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
2019 arXiv   pre-print
Finally, a simple yet effective approach for fine-grained control of expressiveness of speech and facial expression is introduced.  ...  This differs from the end-to-end attention mechanism used in existing end-to-end speech synthesis systems such as Tacotron, which is responsible for various unavoidable artifacts in those systems.  ...  Linchao Bao, Haozhi Huang and other members in the Tencent AI Lab computer vision team for providing facial modeling features and multimodal experiment environment.  ... 
arXiv:1909.01700v2 fatcat:zh36kga3czak5fey5r3sje4p6e

Emphasis control for parallel neural TTS [article]

Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li
2022 arXiv   pre-print
Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance.  ...  However, these systems often lack control over the output prosody, thus restricting the semantic information that can be conveyed for a given text.  ...  The phoneme encodings are then fed to phoneme-wise feature predictors that are organized in a hierarchical manner, allowing for interpretability and fine-grained control of the individual features  ... 
arXiv:2110.03012v2 fatcat:kbmmbocuzfeh5mkq62a26zvdlm

Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis [article]

Slava Shechtman, Raul Fernandez, David Haws
2021 arXiv   pre-print
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited  ...  In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters.  ...  Hierarchical Prosodic-Control Model Following the motivation for a perceptually-interpretable, lowdimensional control mechanism for prosody discussed in Sec. 1, we propose a hierarchical set of four prosodic  ... 
arXiv:2101.09940v1 fatcat:h7kavilufjhqjctsmdwhoyp23m

UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control [article]

Minsu Kang, Sungjae Kim, Injung Kim
2022 arXiv   pre-print
We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes while avoiding interference.  ...  UniTTS also synthesizes high-fidelity speech signals while controlling multiple style attributes.  ...  Additionally, UniTTS includes a fine-grained prosody model that learns unlabeled prosody at the phoneme level. Prior work has shown that fine-grained prosody modeling improves speech quality [23] .  ... 
arXiv:2106.11171v3 fatcat:wp47ixwb7bhsxnv2pvkziet6ay

Expressive TTS Training with Frame and Style Reconstruction Loss

Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li
2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
We propose a novel training strategy for Tacotron-based text-to-speech (TTS) systems that improves speech styling at the utterance level.  ...  [34] , [35] further study a hierarchical, fine-grained and interpretable latent variable model for prosody rendering.  ... 
doi:10.1109/taslp.2021.3076369 fatcat:akcftzhvzfh5dbk6nolwd36moa

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis [article]

Yi Lei, Shan Yang, Xinsheng Wang, Lei Xie
2022 arXiv   pre-print
and thus ignores the multi-scale nature of speech prosody.  ...  In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model the emotion from different levels.  ...  CONCLUSION Inspired by the hierarchical nature of prosody, a multi-scale model for emotional speech synthesis, called MsEmoTTS, is proposed in this paper.  ... 
arXiv:2201.06460v1 fatcat:jzjhbd6f5req3d2bk4zk24lf5a

Expressive TTS Training with Frame and Style Reconstruction Loss [article]

Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li
2021 arXiv   pre-print
We propose a novel training strategy for Tacotron-based text-to-speech (TTS) systems to improve the expressiveness of speech.  ...  One of the key challenges in prosody modeling is the lack of a reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations in the training data.  ...  [31] , [32] further study a hierarchical, fine-grained and interpretable latent variable model for prosody rendering.  ... 
arXiv:2008.01490v2 fatcat:lpv25uehjvgzdhzwspa757wgkm

Emotion Intensity and its Control for Emotional Voice Conversion [article]

Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li
2022 arXiv   pre-print
As desired, the proposed network controls the fine-grained emotion intensity in the output speech.  ...  We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity.  ...  The authors would like to thank the anonymous reviewers for their insightful comments, Dr Bin Wang for valuable discussions and Dr Rui Liu for sharing part of the codes.  ... 
arXiv:2201.03967v2 fatcat:22h7iuofrnd33cf23xzrjun37m

Hierarchical Generative Modeling for Controllable Speech Synthesis [article]

Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
2018 arXiv   pre-print
The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables.  ...  The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained  ...  Saurous, William Chan, RJ Skerry-Ryan, Eric Battenberg, and the Google Brain, Perception and TTS teams for their helpful feedback and discussions.  ... 
arXiv:1810.07217v2 fatcat:6xyu5omwfzdedplwuwpghlf6hq
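The two-level latent hierarchy described in this abstract, a discrete first-level variable selecting a mixture component and a conditional Gaussian at the second level, can be sketched generically with ancestral sampling. All names, dimensions, and the untrained parameters below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 10, 16                      # mixture components, latent dimensionality (illustrative)
means = rng.normal(size=(K, D))    # per-component means (learned in practice)
log_stds = np.zeros((K, D))        # per-component log standard deviations

def sample_latent():
    """Ancestral sampling: draw the discrete category, then a Gaussian given it."""
    y = rng.integers(K)                                       # level 1: category (e.g. clean vs noisy)
    z = means[y] + np.exp(log_stds[y]) * rng.normal(size=D)   # level 2: attribute configuration
    return y, z

y, z = sample_latent()
```

Because the second-level Gaussian is conditioned on the first-level category, individual dimensions of `z` can be traversed to vary one attribute (e.g. speaking rate) while the category fixes the broad style, which is the disentangled fine-grained control the abstract describes.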

Review of end-to-end speech synthesis technology based on deep learning [article]

Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
2021 arXiv   pre-print
Moreover, this paper summarizes the open-source speech corpora in English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and  ...  Due to the limitations of high complexity and low efficiency in traditional speech synthesis technology, the current research focus is deep learning-based end-to-end speech synthesis technology, which  ...  [76] for modeling global features of speech such as the different prosodic patterns of different speakers, and the other, similar to Sun et al. [200] , for modeling phoneme-level fine-grained features.  ... 
arXiv:2104.09995v1 fatcat:q5lx74ycx5hobjox4ktl3amfta

Explaining the PENTA model: a reply to Arvaniti and Ladd

Yi Xu, Albert Lee, Santitham Prom-on, Fang Liu
2015 Phonology  
PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech.  ...  This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009).  ...  Ladd and four anonymous reviewers for their comments on earlier drafts of this paper.  ... 
doi:10.1017/s0952675715000299 fatcat:lokmgwpt7jb45hs24a6uy73ojq

Specifying and animating facial signals for discourse in embodied conversational agents

Doug DeCarlo, Matthew Stone, Corey Revilla, Jennifer J. Venditti
2004 Computer Animation and Virtual Worlds  
RUTH adopts an open, layered architecture in which fine-grained features of the animation can be derived by rule from inferred linguistic structure, allowing us to use RUTH, in conjunction with annotation  ...  People highlight the intended interpretation of their utterances within a larger discourse by a diverse set of nonverbal signals.  ...  This process is managed by a fully-customizable flow-of-control in interpreted Scheme.  ... 
doi:10.1002/cav.5 fatcat:hwhch5fi55dcdoadh6syx45t2u