
Towards Improved Zero-shot Voice Conversion with Conditional DSVAE [article]

Jiachen Lian and Chunlei Zhang and Gopala Krishna Anumanchipalli and Dong Yu
2022 arXiv   pre-print
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC).  ...  We have demonstrated that simultaneously disentangling the content embedding and speaker embedding from one utterance is feasible for zero-shot VC.  ...  In this study, we continue this direction by further improving disentangled representation learning in the DSVAE framework.  ... 
arXiv:2205.05227v1 fatcat:w3cr4274grd3ne7xtoa3qldn2u
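
As a rough illustration of the split this abstract describes, the sketch below derives a per-frame content embedding and a pooled, time-invariant speaker embedding from the same utterance. All module choices and dimensions are our own assumptions, not the paper's DSVAE architecture.

```python
import torch
import torch.nn as nn

class DisentanglingEncoder(nn.Module):
    """Illustrative DSVAE-style encoder: one utterance yields a
    time-varying content embedding and a single utterance-level
    speaker embedding (sizes are arbitrary assumptions)."""
    def __init__(self, n_mels=80, content_dim=64, speaker_dim=128):
        super().__init__()
        self.shared = nn.GRU(n_mels, 256, batch_first=True)
        self.content_head = nn.Linear(256, content_dim)   # varies per frame
        self.speaker_head = nn.Linear(256, speaker_dim)   # pooled over time

    def forward(self, mels):                 # mels: (batch, frames, n_mels)
        h, _ = self.shared(mels)             # (batch, frames, 256)
        content = self.content_head(h)       # (batch, frames, content_dim)
        speaker = self.speaker_head(h.mean(dim=1))  # (batch, speaker_dim)
        return content, speaker

enc = DisentanglingEncoder()
content, speaker = enc(torch.randn(2, 100, 80))
print(content.shape, speaker.shape)  # (2, 100, 64) and (2, 128)
```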

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [article]

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao
2022 arXiv   pre-print
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.  ...  Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.  ...  One approach to achieving better generalization is to decompose a model into domain-agnostic and domain-specific parts via disentangled representation learning.  ... 
arXiv:2205.07211v1 fatcat:xbd5tfkxdnfnbhg2zoj6w63ycq
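
The decomposition mentioned in the last snippet can be pictured as two parallel pathways whose outputs are combined. The toy block below is only an illustration of that idea; layer types and sizes are assumed, not GenerSpeech's actual modules.

```python
import torch
import torch.nn as nn

class DecomposedTTSBlock(nn.Module):
    """Toy decomposition: a domain-agnostic path models generalizable
    (linguistic) structure, a domain-specific path models style; their
    outputs are summed."""
    def __init__(self, dim=256):
        super().__init__()
        self.agnostic = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.specific = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_hidden, style_embedding):
        # text_hidden: (batch, tokens, dim); style_embedding: (batch, dim)
        shared = self.agnostic(text_hidden)
        style = self.specific(style_embedding).unsqueeze(1)  # broadcast over tokens
        return shared + style

block = DecomposedTTSBlock()
out = block(torch.randn(2, 20, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 20, 256])
```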

Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning [article]

Songxiang Liu, Dan Su, Dong Yu
2021 arXiv   pre-print
In this paper, we approach the hard task of fast few-shot style transfer for voice cloning using meta learning.  ...  The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring speaking styles of an arbitrary source speaker to a target speaker's voice using very limited  ...  The task of fast few-shot style transfer is very challenging in the sense that the learning algorithm needs to deal with not only a few-shot voice cloning problem (i.e., cloning a new voice using few samples  ... 
arXiv:2111.07218v1 fatcat:ibb34g7huzbeppe4vkhgha6hx4
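
A minimal sketch of meta learning for fast few-shot adaptation, in the spirit of this abstract: a generic Reptile-style first-order loop, not necessarily Meta-Voice's algorithm. `sample_speaker_task` is a hypothetical helper, and the linear layer stands in for a TTS/VC network.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 80)            # stand-in for a TTS/VC network
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

def sample_speaker_task():
    """Hypothetical helper: a few (input, target) pairs for one speaker;
    here just random tensors."""
    return torch.randn(4, 80), torch.randn(4, 80)

for _ in range(10):                  # meta-iterations
    x, y = sample_speaker_task()
    fast = nn.Linear(80, 80)
    fast.load_state_dict(model.state_dict())
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(inner_steps):     # few-shot adaptation to this speaker
        opt.zero_grad()
        nn.functional.mse_loss(fast(x), y).backward()
        opt.step()
    with torch.no_grad():            # nudge meta-weights toward adapted weights
        for p, q in zip(model.parameters(), fast.parameters()):
            p += meta_lr * (q - p)
```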

Unsupervised Learning of Disentangled Speech Content and Style Representation [article]

Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita
2021 arXiv   pre-print
We present an approach for unsupervised learning of speech representations that disentangle content and style.  ...  (1) the local latent variables encode speech content, as reconstructed speech can be recognized by ASR with low word error rates (WER), even with a different global encoding; (2) the global latent variables encode speaker style  ...  Learning disentangled latent representations from speech has a wide set of applications in generative tasks, including speech synthesis, data augmentation, voice transfer, and speech compression.  ... 
arXiv:2010.12973v2 fatcat:cz2bnyh3sngpfffjzttomugtie
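
Omitting the variational machinery, the local/global factorization this abstract reports can be sketched as follows: swapping only the global latent transfers speaker style while keeping content. All layers and sizes are illustrative assumptions, and the KL terms of a real VAE are left out.

```python
import torch
import torch.nn as nn

class LocalGlobalAutoencoder(nn.Module):
    """Per-frame ("local") latents carry content; one utterance-level
    ("global") latent carries speaker style. Deterministic sketch."""
    def __init__(self, n_mels=80, local_dim=32, global_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.local_head = nn.Linear(256, local_dim)
        self.global_head = nn.Linear(256, global_dim)
        self.decoder = nn.GRU(local_dim + global_dim, n_mels, batch_first=True)

    def encode(self, mels):
        h, _ = self.rnn(mels)
        return self.local_head(h), self.global_head(h.mean(dim=1))

    def forward(self, mels, swap_global=None):
        local, global_ = self.encode(mels)
        if swap_global is not None:       # style transfer: replace only
            global_ = swap_global         # the global latent
        g = global_.unsqueeze(1).expand(-1, local.size(1), -1)
        recon, _ = self.decoder(torch.cat([local, g], dim=-1))
        return recon

ae = LocalGlobalAutoencoder()
a, b = torch.randn(1, 50, 80), torch.randn(1, 60, 80)
_, style_b = ae.encode(b)
converted = ae(a, swap_global=style_b)    # content of a, style of b
print(converted.shape)  # torch.Size([1, 50, 80])
```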

Self-Supervised VQ-VAE For One-Shot Music Style Transfer [article]

Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gaël Richard
2021 arXiv   pre-print
On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling.  ...  While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms.  ...  The model operates via mutually disentangled pitch and timbre representations, learned in a self-supervised manner without the need for annotations. • We train and test our model on a dataset where each  ... 
arXiv:2102.05749v1 fatcat:6x3dn2kl3rgfxbsrslelfioqgm
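
The discrete bottleneck at the heart of a VQ-VAE can be sketched in a few lines: quantizing per-frame codes is one way to force a representation (e.g. pitch/content) to discard continuous timbre detail, which a separate style encoder would then re-supply. Codebook size and dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Minimal vector-quantization bottleneck: each frame snaps to its
    nearest codebook entry; a straight-through estimator keeps the
    encoder trainable."""
    def __init__(self, num_codes=64, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, frames, dim)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)                      # nearest code per frame
        q = self.codebook(idx)
        return z + (q - z).detach(), idx                # straight-through

vq = VQBottleneck()
quantized, codes = vq(torch.randn(2, 100, 32))
print(quantized.shape, codes.shape)  # (2, 100, 32) and (2, 100)
```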

Self-Supervised VQ-VAE for One-Shot Music Style Transfer

Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gaël Richard
2021 ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling.  ...  While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms.  ...  The model operates via mutually disentangled pitch and timbre representations, learned in a self-supervised manner without the need for annotations. • We train and test our model on a dataset where each  ... 
doi:10.1109/icassp39728.2021.9414235 fatcat:c6fpwcyse5dgdp2awxitx6c6tu

End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions [article]

Wonjune Kang, Deb Roy
2022 arXiv   pre-print
Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises the ability to transform speech to match the voice style of any speaker.  ...  LVC-VC utilizes carefully designed input features that have disentangled content and speaker style information, and the vocoder-like architecture learns to combine them to simultaneously perform voice  ...  Rather than disentangling speaker and content information like a standard zero-shot VC model, LVC-VC utilizes a set of input features that already have disentangled content and speaker style information  ... 
arXiv:2205.09784v1 fatcat:mzzp2k7w7ja7lolcy2n2wmlrw4
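
The recipe described here, already-disentangled content features plus a speaker embedding fed jointly into a vocoder-like network, can be sketched as below. Plain convolutions stand in for the paper's location-variable convolutions, and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class VocoderLikeVC(nn.Module):
    """Illustrative sketch: content features and a target speaker
    embedding are combined and upsampled to a waveform."""
    def __init__(self, content_dim=64, speaker_dim=128, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(content_dim + speaker_dim, 256, 3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(256, 1, hop * 2, stride=hop, padding=hop // 2),
        )

    def forward(self, content, speaker):   # content: (B, frames, content_dim)
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, spk], dim=-1).transpose(1, 2)  # (B, C, frames)
        return self.net(x)                 # (B, 1, frames * hop) waveform

vc = VocoderLikeVC()
wav = vc(torch.randn(1, 100, 64), torch.randn(1, 128))
print(wav.shape)  # torch.Size([1, 1, 25600])
```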

Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [article]

Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall
2020 arXiv   pre-print
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.  ...  Style is best captured by the prosody of a signal.  ...  While some of these approaches consider a zero-shot approach for multi-speaker speech synthesis, none of them consider few-shot explicit prosody transfer.  ... 
arXiv:2012.07252v1 fatcat:yn5npt4xu5dapjnenrlwdj7ei4
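
Adaptive normalization of the kind the title suggests is commonly implemented AdaIN-style: normalize hidden features, then scale and shift them with affine parameters predicted from a reference utterance. A hedged sketch follows, with assumed dimensions rather than FSM-SS's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveNorm(nn.Module):
    """Reference-conditioned normalization: prosody/style statistics
    predicted from a reference embedding modulate normalized features
    of the synthesis network."""
    def __init__(self, dim=256, ref_dim=128):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale = nn.Linear(ref_dim, dim)
        self.to_shift = nn.Linear(ref_dim, dim)

    def forward(self, x, ref):             # x: (B, T, dim), ref: (B, ref_dim)
        scale = self.to_scale(ref).unsqueeze(1)
        shift = self.to_shift(ref).unsqueeze(1)
        return self.norm(x) * (1 + scale) + shift

layer = AdaptiveNorm()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 50, 256])
```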

Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [article]

Kangle Deng and Aayush Bansal and Deva Ramanan
2021 arXiv   pre-print
To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech.  ...  We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech.  ... 
arXiv:2001.04463v3 fatcat:ef7dbok5bjhn3or4bj5d45rtre
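
The exemplar-autoencoder idea can be sketched as an ordinary autoencoder with a narrow bottleneck trained on a single target speaker: the decoder can then only produce that voice, so any input is re-synthesized in the exemplar's style. A toy version, with architecture and sizes assumed:

```python
import torch
import torch.nn as nn

# Narrow bottleneck keeps mostly content; the decoder memorizes the
# exemplar's voice because it only ever sees one speaker in training.
bottleneck_ae = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 16), nn.ReLU(),   # 16-dim bottleneck (assumed)
    nn.Linear(16, 256), nn.ReLU(),
    nn.Linear(256, 80),
)
opt = torch.optim.Adam(bottleneck_ae.parameters(), lr=1e-3)
target_speaker_mels = torch.randn(64, 80)   # frames from the one exemplar
for _ in range(3):                          # reconstruction training
    opt.zero_grad()
    loss = nn.functional.mse_loss(bottleneck_ae(target_speaker_mels),
                                  target_speaker_mels)
    loss.backward()
    opt.step()
converted = bottleneck_ae(torch.randn(10, 80))  # any input -> exemplar voice
```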

Global Rhythm Style Transfer Without Text Transcriptions [article]

Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson
2021 arXiv   pre-print
AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning.  ...  Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information.  ...  Voice Style Transfer: Many style transfer approaches have been proposed for voice conversion.  ... 
arXiv:2106.08519v1 fatcat:iqxhwopd6javvg4m7e32mbmdua
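
The intuition behind rhythm removal can be sketched crudely: collapse runs of near-identical adjacent frames so that duration information is discarded. AutoPST's actual module is guided by self-expressive representation learning; the snippet below conveys only the intuition, with an assumed similarity threshold.

```python
import torch

def collapse_similar_frames(frames, threshold=0.95):
    """Rough rhythm removal: runs of near-identical adjacent frames
    (e.g. a held vowel) collapse to one frame, so duration is lost."""
    kept = [frames[0]]
    for f in frames[1:]:
        sim = torch.cosine_similarity(f, kept[-1], dim=0)
        if sim < threshold:            # keep a frame only when it changes
            kept.append(f)
    return torch.stack(kept)

# 5 distinct frames, each held for a different duration:
x = torch.randn(5, 16).repeat_interleave(torch.tensor([1, 4, 2, 6, 1]), dim=0)
print(x.shape, collapse_similar_frames(x).shape)  # (14, 16) -> (5, 16)
```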

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data [article]

Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li
2021 arXiv   pre-print
This paper presents a novel framework for building a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system; the framework is called TTS-VC transfer learning.  ...  Zero-shot Run-time Inference: Once the TTS-VC transfer learning is completed, the voice conversion pipeline is able to perform voice conversion independently without involving the attention mechanism of  ... 
arXiv:2009.14399v2 fatcat:ta32qp23rbayfhj4iwbhvrr7km
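
One way to picture TTS-VC transfer learning is weight transfer: the VC model reuses the decoder of a trained TTS system, so the VC encoder only needs to map source speech into the representation space that decoder already understands. A schematic sketch, with all module shapes as illustrative assumptions:

```python
import torch
import torch.nn as nn

tts_decoder = nn.GRU(256, 80, batch_first=True)       # from the TTS stage
vc_encoder = nn.GRU(80, 256, batch_first=True)        # trained in the VC stage
vc_decoder = nn.GRU(256, 80, batch_first=True)
vc_decoder.load_state_dict(tts_decoder.state_dict())  # transfer the weights

# VC training would align vc_encoder outputs with the TTS text encoder's
# representation space while keeping (or fine-tuning) the copied decoder.
mels = torch.randn(1, 100, 80)                        # source speech features
hidden, _ = vc_encoder(mels)
converted, _ = vc_decoder(hidden)
print(converted.shape)  # torch.Size([1, 100, 80])
```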

Improving Self-Supervised Speech Representations by Disentangling Speakers [article]

Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang
2022 arXiv   pre-print
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.  ...  Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations.  ...  Zero-shot Content Probe: The first set of experiments we would like to evaluate is the set of zero-shot probing tasks proposed in the Zero-Resource Speech Challenges (Dunbar et al., 2021), because they  ... 
arXiv:2204.09224v1 fatcat:p53jmuv5kvh5lgq25yye6gi2aa
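
A common disentangling mechanism in this line of work is a speaker-invariance penalty: representations of an utterance and of a speaker-perturbed copy (e.g. pitch-shifted or voice-converted) are pushed together, squeezing speaker information out. The sketch below uses assumed shapes and a stand-in perturbation, not the paper's exact losses.

```python
import torch
import torch.nn as nn

def speaker_invariance_loss(encoder, wave, perturb):
    """Push together the representations of an utterance and of a
    speaker-perturbed copy with the same content."""
    z_a = encoder(wave)
    z_b = encoder(perturb(wave))     # same content, different "speaker"
    return nn.functional.mse_loss(z_a, z_b)

encoder = nn.Sequential(nn.Linear(160, 256), nn.ReLU(), nn.Linear(256, 64))
fake_perturb = lambda w: w + 0.1 * torch.randn_like(w)   # stand-in transform
loss = speaker_invariance_loss(encoder, torch.randn(8, 160), fake_perturb)
print(loss.item())
```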

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li
2021 IEEE/ACM Transactions on Audio Speech and Language Processing  
We present a novel voice conversion (VC) framework learned from a text-to-speech (TTS) synthesis system, called TTS-VC transfer learning, or TTL-VC for short.  ...  Zero-shot Run-time Inference: Once the TTS-VC transfer learning is completed, the voice conversion pipeline is able to perform voice conversion independently without involving the attention mechanism of  ... 
doi:10.1109/taslp.2021.3066047 fatcat:b5dnzfkxgnfgnj3dipclfmela4

StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [article]

Yiwei Lyu, Paul Pu Liang, Hai Pham, Eduard Hovy, Barnabás Póczos, Ruslan Salakhutdinov, Louis-Philippe Morency
2021 arXiv   pre-print
As a result, StylePTB brings novel challenges that we hope will encourage future research in controllable text style transfer, compositional models, and learning disentangled representations.  ...  Text style transfer aims to controllably generate text with targeted stylistic changes while keeping the core meaning of the source sentence constant.  ...  Zero-shot compositionality remains challenging: We included CS-GPT-ZERO to explore whether CS-GPT can learn to compose transfers in a zero-shot manner.  ... 
arXiv:2104.05196v1 fatcat:2wiad5uainbblbyzhb4wz5rhrq

Emotion Intensity and its Control for Emotional Voice Conversion [article]

Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li
2022 arXiv   pre-print
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.  ...  Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.  ...  The research by Berrak Sisman is supported by the Ministry of Education, Singapore, under its MOE Tier 2 funding programme, award no. MOE-T2EP50220-0021, SUTD Startup Grant Artificial Intelligence for Human Voice  ... 
arXiv:2201.03967v2 fatcat:22h7iuofrnd33cf23xzrjun37m
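
If the emotion prototype lives in a continuous embedding space, intensity control can be as simple as interpolating between a neutral embedding and the emotion prototype. The sketch below shows that generic mechanism, not necessarily the paper's exact formulation; the prototype embeddings are assumed.

```python
import torch

def emotion_embedding(neutral, emotional, intensity):
    """Move from a neutral style embedding toward an emotion prototype
    by a scalar intensity in [0, 1]."""
    return neutral + intensity * (emotional - neutral)

neutral = torch.randn(128)      # prototype style embeddings (assumed)
angry = torch.randn(128)
mild = emotion_embedding(neutral, angry, 0.3)
strong = emotion_embedding(neutral, angry, 1.0)
print(torch.norm(mild - neutral) < torch.norm(strong - neutral))  # tensor(True)
```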
Showing results 1–15 of 704.