10,211 Hits in 6.5 sec

Deep Voice: Real-time Neural Text-to-Speech [article]

Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi
2017 arXiv   pre-print
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis.  ...  By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive  ...  Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features: first we convert text to phoneme  ... 
arXiv:1702.07825v2 fatcat:6atipioagfcxdhnbxaje6xvyzm

Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders

Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
2019 Interspeech 2019  
This paper investigates real-time high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoders, WaveGlow is introduced and single Gaussian (SG) WaveRNN is proposed.  ...  The results of a subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping can synthesize high-quality speech waveforms and real-time high-fidelity  ...  Introduction Real-time text-to-speech (TTS) techniques are among the most important speech communication technologies.  ... 
doi:10.21437/interspeech.2019-1288 dblp:conf/interspeech/OkamotoTSK19 fatcat:z7cmh74fcrh5leen4aoaoncemu

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning [article]

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller
2018 arXiv   pre-print
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system.  ...  Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.  ...  We demonstrate that we can generate monotonic attention behavior, avoiding error modes commonly affecting sequence-to-sequence models.  ... 
arXiv:1710.07654v3 fatcat:x2kmwi4hwzcglhfu4yy3f7llwi

Detecting Deepfake Voice Using Explainable Deep Learning Techniques

Suk-Young Lim, Dong-Kyu Chae, Sang-Chul Lee
2022 Applied Sciences  
Fake media, generated by methods such as deepfakes, have become indistinguishable from real media, but their detection has not improved at the same pace.  ...  Deepfake voices are generally divided into two categories: text-to-speech (TTS) generation and voice conversion.  ...  (a) Deep Taylor deepfake voice. (b) Integrated gradients deepfake voice. (c) LRP deepfake voice. (d) Deep Taylor real voice. (e) Integrated gradients real voice. (f) LRP real voice.  ... 
doi:10.3390/app12083926 fatcat:vjuwm2j4jbhtrgxgb3v6sufwya

LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices [article]

Marvin Coto-Jiménez, John Goddard-Close
2016 arXiv   pre-print
In this paper we present the application of Long Short-Term Memory deep neural networks as a postfiltering step of HMM-based speech synthesis, in order to obtain spectral characteristics closer to those  ...  Recent developments in speech synthesis have produced systems capable of producing intelligible speech, but now researchers strive to create models that more accurately mimic human voices.  ...  INTRODUCTION Text-to-speech (TTS) synthesis is the technique of generating intelligible speech from a given text.  ... 
arXiv:1602.02656v1 fatcat:nshkdywklfhbpcrssyqz6qyuq4

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning [article]

Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li
2020 arXiv   pre-print
In this paper, we provide a comprehensive overview of state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and  ...  Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged.  ...  WaveNet [66] is a deep neural network that learns to generate high-quality time-domain waveforms.  ... 
arXiv:2008.03648v2 fatcat:nehs6o22pzdirffvedqtby4sd4

Fingerprinting Encrypted Voice Traffic on Smart Speakers with Deep Learning [article]

Chenggang Wang, Sean Kennedy, Haipeng Li, King Hudson, Gowtham Atluri, Xuetao Wei, Wenhai Sun, Boyang Wang
2020 arXiv   pre-print
This is because the AI-based voice services running on the server side respond to commands in the same voice and in a deterministic or predictable manner in text, which leaves a distinguishable pattern over  ...  This paper investigates the privacy leakage of smart speakers under an encrypted traffic analysis attack, referred to as voice command fingerprinting.  ...  popular smart speakers, Amazon Echo and Google Home, using 5 automated voices rendered by public text-to-speech APIs.  ... 
arXiv:2005.09800v1 fatcat:5broa65upjgfrck3slpk5zn77m

A Survey on Recent Deep Learning-driven Singing Voice Synthesis Systems [article]

Yin-Ping Cho, Fu-Rong Yang, Yung-Chuan Chang, Ching-Ting Cheng, Xiao-Han Wang, Yi-Wen Liu
2021 arXiv   pre-print
Singing voice synthesis (SVS) is a task that aims to generate audio signals according to musical scores and lyrics.  ...  This paper aims to review some of the state-of-the-art deep learning-driven SVS systems.  ...  Sinsy: DNN + Neural Vocoder Sinsy [14] is designed to synthesize singing voices with appropriate timing from a musical score.  ... 
arXiv:2110.02511v1 fatcat:4ou5xepnjbg2todhfu3vrn7p44

Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview

Xiaojiao Chen, Sheng Li, Hao Huang
2021 Applied Sciences  
Unfortunately, recent research has shown that systems based on deep neural networks are vulnerable to adversarial examples, which has attracted significant attention to VPS security.  ...  Voice Processing Systems (VPSes), now widely deployed, have become deeply involved in people's daily lives, helping to drive cars, unlock smartphones, make online purchases, etc.  ...  Benefiting from the rapid development of deep neural networks, speech recognition has also made good progress.  ... 
doi:10.3390/app11188450 fatcat:zjige7gepbdvnpk2i3qwyqv2oe

Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion [article]

Samuel J. Broughton, Md Asif Jalal, Roger K. Moore
2021 arXiv   pre-print
The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data.  ...  Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information.  ...  such as real-time language translation and device-assisted language learning [11].  ... 
arXiv:2102.11420v1 fatcat:cuaj6ct7rzezdlrbadhhniolrq

Continuous vocoder applied in deep neural network based voice conversion

Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
2019 Multimedia tools and applications  
In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using a deep neural network, where multiple features from the speech of two speakers (source and target) are  ...  frequency, and spectral features. (2) We show that the feed-forward deep neural network (FF-DNN) using our vocoder yields high-quality conversion. (3) We apply a geometric approach to spectral subtraction  ... 
doi:10.1007/s11042-019-08198-5 fatcat:xmaanx6ofvdhxgl3ldse3r6bge

Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)- Based Voice-Activity Detector

Yoo Rhee Oh, Kiyoung Park, Jeon Gyu Park
2020 Applied Sciences  
Moreover, the average user-perceived latency is reduced from 11.71 s to 3.09–5.41 s by using the proposed deep neural network-based voice activity detector.  ...  The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency, while the user utters long sentences.  ...  DNN-based voice-activity detector An online ASR system needs to send the recognized text to a user as soon as possible even before the end of sentence is detected to reduce response time.  ... 
doi:10.3390/app10124091 fatcat:vu7x3kvkzzfi3lb62hsjlbzu7i

Deep Neural Networks with Voice Entry Estimation Heuristics for Voice Separation in Symbolic Music Representations

Reinier De Valk, Tillman Weyde
2018 Zenodo  
In this study we explore the use of deep feedforward neural networks for voice separation in symbolic music representations.  ...  Using more layers does not lead to a significant performance improvement.  ...  Over the past decade, deep neural networks (DNNs) have been successfully applied to various computer vision, speech recognition, and natural language processing tasks, and, increasingly, to MIR tasks  ... 
doi:10.5281/zenodo.1492402 fatcat:5fca7poutvezlfiywy6kmoskau

Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification [article]

Youngmoon Jung, Yeunju Choi, Hoirin Kim
2019 arXiv   pre-print
In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system.  ...  Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications including speaker verification.  ...  To deal with this problem, several deep neural network (DNN)-based VADs [9, 10, 11, 12] have been proposed and shown to give better results in low SNRs.  ... 
arXiv:1909.11886v1 fatcat:xe7olwgssrhttdc7dlbdbblu3u

Multimodal Voice Conversion Under Adverse Environment Using a Deep Convolutional Neural Network

Jian Zhou, Yuting Hu, Hailun Lian, Huabin Wang, Liang Tao, Hon Keung Kwan
2019 IEEE Access  
To solve this problem, we propose a multimodal voice conversion model based on a deep convolutional neural network (MDCNN), built by combining two convolutional neural networks (CNNs) and a deep neural network  ...  The two CNNs are designed to extract acoustic and visual features, and the DNN is designed to capture the nonlinear mapping relation between source speech and target speech.  ...  with voice disorders, disguising speaker identity in communication, dubbing films, translation into different languages, and synthesis of text-to-speech (TTS), where a voice conversion system is used to  ... 
doi:10.1109/access.2019.2955982 fatcat:ubs54prtffajlp2qrz5xu3hase
Showing results 1 — 15 out of 10,211 results