Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
[article]
2022
arXiv
pre-print
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). ...
We have demonstrated that simultaneously disentangling the content embedding and the speaker embedding from one utterance is feasible for zero-shot VC. ...
In this study, we continue this direction by further improving the disentangled representation learning in the DSVAE framework. ...
arXiv:2205.05227v1
fatcat:w3cr4274grd3ne7xtoa3qldn2u
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis
[article]
2022
arXiv
pre-print
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. ...
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. ...
... to achieve better generalization is to decompose a model into the domain-agnostic and domain-specific parts via disentangled representation learning. ...
arXiv:2205.07211v1
fatcat:xbd5tfkxdnfnbhg2zoj6w63ycq
Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning
[article]
2021
arXiv
pre-print
In this paper, we approach the challenging task of fast few-shot style transfer for voice cloning using meta learning. ...
The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring speaking styles of an arbitrary source speaker to a target speaker's voice using very limited ...
The task of fast few-shot style transfer is very challenging in the sense that the learning algorithm needs to deal with not only a few-shot voice cloning problem (i.e., cloning a new voice using few samples ...
arXiv:2111.07218v1
fatcat:ibb34g7huzbeppe4vkhgha6hx4
Unsupervised Learning of Disentangled Speech Content and Style Representation
[article]
2021
arXiv
pre-print
We present an approach for unsupervised learning of speech representations that disentangle content and style. ...
... variables encode speech contents, as reconstructed speech can be recognized by ASR with low word error rates (WER), even with a different global encoding; (2) the global latent variables encode speaker style ...
Learning disentangled latent representations from speech has a wide set of applications in generative tasks, including speech synthesis, data augmentation, voice transfer, and speech compression. ...
arXiv:2010.12973v2
fatcat:cz2bnyh3sngpfffjzttomugtie
Self-Supervised VQ-VAE For One-Shot Music Style Transfer
[article]
2021
arXiv
pre-print
On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. ...
While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. ...
The model operates via mutually disentangled pitch and timbre representations, learned in a self-supervised manner without the need for annotations. We train and test our model on a dataset where each ...
arXiv:2102.05749v1
fatcat:6x3dn2kl3rgfxbsrslelfioqgm
Self-Supervised VQ-VAE for One-Shot Music Style Transfer
2021
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. ...
While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. ...
The model operates via mutually disentangled pitch and timbre representations, learned in a self-supervised manner without the need for annotations. We train and test our model on a dataset where each ...
doi:10.1109/icassp39728.2021.9414235
fatcat:c6fpwcyse5dgdp2awxitx6c6tu
End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions
[article]
2022
arXiv
pre-print
Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises the ability to transform speech to match the voice style of any speaker. ...
LVC-VC utilizes carefully designed input features that have disentangled content and speaker style information, and the vocoder-like architecture learns to combine them to simultaneously perform voice ...
Rather than disentangling speaker and content information like a standard zero-shot VC model, LVC-VC utilizes a set of input features that already have disentangled content and speaker style information ...
arXiv:2205.09784v1
fatcat:mzzp2k7w7ja7lolcy2n2wmlrw4
Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
[article]
2020
arXiv
pre-print
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner. ...
Style is best captured by the prosody of a signal. ...
While some of these approaches consider a zero-shot approach for multispeaker speech synthesis, none of them consider few-shot explicit prosody transfer. ...
arXiv:2012.07252v1
fatcat:yn5npt4xu5dapjnenrlwdj7ei4
Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
[article]
2021
arXiv
pre-print
To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. ...
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. ...
One-shot voice conversion with disentangled representations by leveraging phonetic posteriorgrams. In Interspeech, 2019. Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. ...
arXiv:2001.04463v3
fatcat:ef7dbok5bjhn3or4bj5d45rtre
Global Rhythm Style Transfer Without Text Transcriptions
[article]
2021
arXiv
pre-print
AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning. ...
Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. ...
Voice Style Transfer: Many style transfer approaches have been proposed for voice conversion. ...
arXiv:2106.08519v1
fatcat:iqxhwopd6javvg4m7e32mbmdua
Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data
[article]
2021
arXiv
pre-print
This paper presents a novel framework to build a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system, which is called TTS-VC transfer learning. ...
... voice conversion system. ...
Zero-shot Run-time Inference: Once the TTS-VC transfer learning is completed, the voice conversion pipeline is able to perform voice conversion independently, without involving the attention mechanism of ...
arXiv:2009.14399v2
fatcat:ta32qp23rbayfhj4iwbhvrr7km
Improving Self-Supervised Speech Representations by Disentangling Speakers
[article]
2022
arXiv
pre-print
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. ...
Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. ...
Zero-shot Content Probe: The first set of experiments we would like to evaluate is the set of zero-shot probing tasks proposed in the Zero-Resource Speech Challenges (Dunbar et al., 2021), because they ...
arXiv:2204.09224v1
fatcat:p53jmuv5kvh5lgq25yye6gi2aa
Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data
2021
IEEE/ACM Transactions on Audio Speech and Language Processing
We present a novel voice conversion (VC) framework by learning from a text-to-speech (TTS) synthesis system, which is called TTS-VC transfer learning, or TTL-VC for short. ...
... voice conversion system. ...
Zero-shot Run-time Inference: Once the TTS-VC transfer learning is completed, the voice conversion pipeline is able to perform voice conversion independently, without involving the attention mechanism of ...
doi:10.1109/taslp.2021.3066047
fatcat:b5dnzfkxgnfgnj3dipclfmela4
StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer
[article]
2021
arXiv
pre-print
As a result, StylePTB brings novel challenges that we hope will encourage future research in controllable text style transfer, compositional models, and learning disentangled representations. ...
Text style transfer aims to controllably generate text with targeted stylistic changes while keeping the core meaning of the source sentence constant. ...
Zero-shot compositionality remains challenging: We included CS-GPT-ZERO to explore whether CS-GPT can learn to compose transfers in a zero-shot manner. ...
arXiv:2104.05196v1
fatcat:2wiad5uainbblbyzhb4wz5rhrq
Emotion Intensity and its Control for Emotional Voice Conversion
[article]
2022
arXiv
pre-print
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. ...
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. ...
... research by Berrak Sisman is supported by the Ministry of Education, Singapore, under its MOE Tier 2 funding programme, award no. MOE-T2EP50220-0021, SUTD Startup Grant Artificial Intelligence for Human Voice ...
arXiv:2201.03967v2
fatcat:22h7iuofrnd33cf23xzrjun37m
Showing results 1 — 15 out of 704 results