Deep Voice: Real-time Neural Text-to-Speech
[article]
2017
arXiv
pre-print
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. ...
By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive ...
Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features: first we convert text to phoneme ...
arXiv:1702.07825v2
fatcat:6atipioagfcxdhnbxaje6xvyzm
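The abstract above describes Deep Voice's modular pipeline, in which each stage of a traditional TTS system (grapheme-to-phoneme conversion, duration and F0 prediction, waveform synthesis) is replaced by a neural network. Below is a minimal Python sketch of that modular structure only; the lexicon lookup, fixed prosody values, and sine-tone vocoder are toy stand-ins, not the paper's models.

import numpy as np

# Toy lexicon standing in for a neural grapheme-to-phoneme model (hypothetical).
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_phonemes(text):
    # Grapheme-to-phoneme step.
    return [p for word in text.lower().split() for p in LEXICON.get(word, [])]

def predict_prosody(phonemes):
    # Duration (seconds) and F0 (Hz) per phoneme; a neural model would predict these.
    return [(0.08, 180.0) for _ in phonemes]

def vocode(phonemes, prosody, sr=16000):
    # Waveform synthesis; a sine tone per phoneme stands in for a neural vocoder.
    audio = []
    for _, (dur, f0) in zip(phonemes, prosody):
        t = np.arange(int(dur * sr)) / sr
        audio.append(0.1 * np.sin(2 * np.pi * f0 * t))
    return np.concatenate(audio) if audio else np.zeros(0)

phonemes = text_to_phonemes("hello world")
audio = vocode(phonemes, predict_prosody(phonemes))
print(len(phonemes), audio.shape)   # 8 phonemes, 8 * 0.08 s of audio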
Real-Time Neural Text-to-Speech with Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders
2019
Interspeech 2019
This paper investigates real-time high-fidelity neural text-to-speech (TTS) systems. For real-time neural vocoders, WaveGlow is introduced and single Gaussian (SG) WaveRNN is proposed. ...
The results of the subjective experiment using a Japanese female corpus indicate that the proposed SG-WaveRNN vocoder with noise shaping can synthesize high-quality speech waveforms and real-time high-fidelity ...
Real-time text-to-speech (TTS) techniques are among the most important speech communication technologies. ...
doi:10.21437/interspeech.2019-1288
dblp:conf/interspeech/OkamotoTSK19
fatcat:z7cmh74fcrh5leen4aoaoncemu
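The single Gaussian (SG) WaveRNN mentioned above replaces the usual categorical output over quantised sample values with a continuous Gaussian: at each step the network predicts a mean and a log standard deviation, and the next waveform sample is drawn from that Gaussian. A minimal numpy sketch of that sampling loop follows; the toy linear predictor is a hypothetical stand-in for the recurrent network and its acoustic-feature conditioning.

import numpy as np

rng = np.random.default_rng(0)

def sample_single_gaussian(mean, log_std):
    # Draw one waveform sample from the predicted Gaussian.
    return mean + np.exp(log_std) * rng.standard_normal()

def toy_predictor(prev_sample):
    # Stand-in for the WaveRNN cell: returns (mean, log_std) for the next sample.
    return 0.9 * prev_sample, -3.0

x, waveform = 0.0, []
for _ in range(1000):          # autoregressive generation, one sample at a time
    mean, log_std = toy_predictor(x)
    x = sample_single_gaussian(mean, log_std)
    waveform.append(x)
print(len(waveform))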
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
[article]
2018
arXiv
pre-print
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. ...
Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. ...
We demonstrate that we can generate monotonic attention behavior, avoiding error modes commonly affecting sequence-to-sequence models. ...
arXiv:1710.07654v3
fatcat:x2kmwi4hwzcglhfu4yy3f7llwi
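The Deep Voice 3 abstract above mentions enforcing monotonic attention behavior to avoid the skipping and repetition errors common in sequence-to-sequence models. One widely used inference-time heuristic for this is to mask attention to a forward-moving window over the encoder states; the sketch below illustrates that general idea and is not the paper's exact constraint.

import numpy as np

def monotonic_attention_step(logits, prev_pos, window=3):
    # Keep only a forward-moving window of encoder positions, then softmax.
    masked = np.full_like(logits, -np.inf)
    lo, hi = prev_pos, min(prev_pos + window, len(logits))
    masked[lo:hi] = logits[lo:hi]
    weights = np.exp(masked - masked.max())
    weights /= weights.sum()
    return weights, int(weights.argmax())

rng = np.random.default_rng(1)
pos = 0
for step in range(5):
    weights, pos = monotonic_attention_step(rng.normal(size=10), prev_pos=pos)
    print(step, pos)   # the attended encoder position can only move forward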
Detecting Deepfake Voice Using Explainable Deep Learning Techniques
2022
Applied Sciences
Fake media, generated by methods such as deepfakes, have become indistinguishable from real media, but their detection has not improved at the same pace. ...
Deepfake voices are generally divided into two categories: text-to-speech (TTS) generation and voice conversion. ...
Figure panels: Deep Taylor, Integrated Gradients, and LRP attribution maps for deepfake and real voices. ...
doi:10.3390/app12083926
fatcat:vjuwm2j4jbhtrgxgb3v6sufwya
LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices
[article]
2016
arXiv
pre-print
In this paper we present the application of Long Short-Term Memory deep neural networks as a postfiltering step of HMM-based speech synthesis, in order to obtain spectral characteristics closer to those ...
Recent developments in speech synthesis have produced systems capable of producing intelligible speech, but now researchers strive to create models that more accurately mimic human voices. ...
Text-to-speech (TTS) synthesis is the technique of generating intelligible speech from a given text. ...
arXiv:1602.02656v1
fatcat:nshkdywklfhbpcrssyqz6qyuq4
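The entry above uses an LSTM network as a postfilter that maps the over-smoothed spectral frames produced by HMM-based synthesis toward more natural spectra. The sketch below is a minimal, untrained LSTM cell applied frame by frame to a mel-cepstral sequence; it shows only the shape of such an architecture, with random weights rather than anything learned from the paper.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTMPostfilter:
    # Minimal LSTM mapping synthetic spectral frames to enhanced frames.
    def __init__(self, feat_dim, hidden=32):
        k = feat_dim + hidden
        self.W = rng.normal(0, 0.1, size=(4 * hidden, k))   # gates: i, f, o, g
        self.b = np.zeros(4 * hidden)
        self.Wo = rng.normal(0, 0.1, size=(feat_dim, hidden))
        self.hidden = hidden

    def __call__(self, frames):
        h, c, out = np.zeros(self.hidden), np.zeros(self.hidden), []
        for x in frames:                        # one spectral frame per time step
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            out.append(self.Wo @ h)             # enhanced frame
        return np.stack(out)

synthetic_mcep = rng.normal(size=(100, 25))     # 100 frames, 25 mel-cepstral coeffs
print(TinyLSTMPostfilter(25)(synthetic_mcep).shape)   # (100, 25)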
An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
[article]
2020
arXiv
pre-print
In this paper, we provide a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and ...
Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. ...
WaveNet [66] is a deep neural network that learns to generate high-quality time-domain waveforms. ...
arXiv:2008.03648v2
fatcat:nehs6o22pzdirffvedqtby4sd4
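The survey snippet above cites WaveNet as a network that generates time-domain waveforms directly, sample by sample, typically over 8-bit mu-law quantised values. The sketch below shows the mu-law companding and the autoregressive sampling loop; the uniform distribution is a hypothetical stand-in for the network's conditional softmax.

import numpy as np

rng = np.random.default_rng(0)

def mu_law_encode(x, mu=255):
    # Compress float samples in [-1, 1] to integers in [0, mu].
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((y + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(q, mu=255):
    # Invert the companding back to float samples in [-1, 1].
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# Autoregressive loop: sample one quantised value at a time. A uniform
# distribution stands in for the network's conditional softmax (hypothetical).
samples = np.array([rng.choice(256, p=np.full(256, 1 / 256)) for _ in range(1000)])
waveform = mu_law_decode(samples)
# The round trip over the quantisation grid is lossless.
print(waveform.shape, np.array_equal(mu_law_encode(waveform), samples))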
Fingerprinting Encrypted Voice Traffic on Smart Speakers with Deep Learning
[article]
2020
arXiv
pre-print
This is because the AI-based voice services running on the server side respond to commands in the same voice and in a deterministic or predictable manner in text, which leaves a distinguishable pattern over ...
This paper investigates the privacy leakage of smart speakers under an encrypted traffic analysis attack, referred to as voice command fingerprinting. ...
popular smart speakers, Amazon Echo and Google Home, using 5 automated voices rendered by public text-to-speech APIs. ...
arXiv:2005.09800v1
fatcat:5broa65upjgfrck3slpk5zn77m
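The attack described above infers which voice command was issued from the encrypted traffic alone, typically from packet sizes, directions, and timing. The sketch below turns a trace of packet lengths into a simple fixed-length feature vector of the kind a deep classifier could consume; the histogram featurisation is an illustrative assumption, not the paper's pipeline.

import numpy as np

def length_histogram(packet_lengths, bins=16, max_len=1500):
    # Normalised histogram of packet sizes observed for one voice command.
    hist, _ = np.histogram(packet_lengths, bins=bins, range=(0, max_len))
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
trace = rng.integers(60, 1500, size=200)      # packet sizes of one encrypted command
features = length_histogram(trace)            # input row for a neural classifier
print(features.shape, round(float(features.sum()), 3))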
A Survey on Recent Deep Learning-driven Singing Voice Synthesis Systems
[article]
2021
arXiv
pre-print
Singing voice synthesis (SVS) is a task that aims to generate audio signals according to musical scores and lyrics. ...
This paper aims to review some of the state-of-the-art deep learning-driven SVS systems. ...
Sinsy (DNN + neural vocoder): Sinsy [14] is designed to synthesize singing voices with appropriate timing from a musical score. ...
arXiv:2110.02511v1
fatcat:4ou5xepnjbg2todhfu3vrn7p44
Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview
2021
Applied Sciences
Unfortunately, recent research has shown that those systems based on deep neural networks are vulnerable to adversarial examples, which attract significant attention to VPS security. ...
Voice Processing Systems (VPSes), now widely deployed, have become deeply involved in people's daily lives, helping drive the car, unlock the smartphone, make online purchases, etc. ...
Benefiting from the rapid development of deep neural networks, speech recognition has also made good progress. ...
doi:10.3390/app11188450
fatcat:zjige7gepbdvnpk2i3qwyqv2oe
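The overview above concerns adversarial examples against DNN-based voice processing systems: tiny, targeted perturbations of the input audio that change a model's decision. The Fast Gradient Sign Method is one standard way such perturbations are crafted; the sketch below applies it to a toy linear scorer whose gradient is known analytically, whereas a real attack would backpropagate through the full network.

import numpy as np

def fgsm_perturb(x, grad_wrt_x, eps=0.002):
    # Fast Gradient Sign Method: step in the sign of the loss gradient,
    # keeping samples inside the valid audio range.
    return np.clip(x + eps * np.sign(grad_wrt_x), -1.0, 1.0)

rng = np.random.default_rng(0)
audio = rng.uniform(-0.5, 0.5, 16000)    # 1 s of toy audio at 16 kHz
w = rng.normal(size=16000)               # toy linear model: score = w @ audio
adv = fgsm_perturb(audio, grad_wrt_x=w)  # d(score)/d(audio) = w
print(float(np.abs(adv - audio).max()))  # perturbation bounded by eps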
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion
[article]
2021
arXiv
pre-print
The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data. ...
Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. ...
such as real-time language translation and device-assisted language learning [11]. ...
arXiv:2102.11420v1
fatcat:cuaj6ct7rzezdlrbadhhniolrq
Continuous vocoder applied in deep neural network based voice conversion
2019
Multimedia tools and applications
In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using deep neural network, where multiple features from the speech of two speakers (source and target) are ...
frequency, and spectral features. (2) We show that the feed-forward deep neural network (FF-DNN) using our vocoder yields high-quality conversion. (3) We apply a geometric approach to spectral subtraction ...
doi:10.1007/s11042-019-08198-5
fatcat:xmaanx6ofvdhxgl3ldse3r6bge
Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector
2020
Applied Sciences
Moreover, the average user-perceived latency is reduced from 11.71 s to 3.09–5.41 s by using the proposed deep neural network-based voice activity detector. ...
The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency, while the user utters long sentences. ...
An online ASR system needs to send the recognized text to a user as soon as possible, even before the end of the sentence is detected, to reduce response time. ...
doi:10.3390/app10124091
fatcat:vu7x3kvkzzfi3lb62hsjlbzu7i
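The entry above reduces perceived latency by having a DNN-based voice activity detector spot short pauses within long utterances, so that partial recognition results can be emitted early. The sketch below shows that endpointing logic on a stream of per-frame speech posteriors; the threshold and pause length are illustrative values, not the paper's.

import numpy as np

def find_pause_endpoints(speech_probs, thresh=0.5, min_pause_frames=30):
    # Report the frame index at which a run of non-speech frames becomes long
    # enough to count as a pause, so a partial result can be sent to the user.
    endpoints, run = [], 0
    for i, p in enumerate(speech_probs):
        run = run + 1 if p < thresh else 0
        if run == min_pause_frames:           # 30 frames ~ 300 ms at a 10 ms hop
            endpoints.append(i)
    return endpoints

rng = np.random.default_rng(0)
probs = np.concatenate([rng.uniform(0.6, 1.0, 200),   # speech
                        rng.uniform(0.0, 0.4, 50),    # short pause
                        rng.uniform(0.6, 1.0, 200)])  # speech resumes
print(find_pause_endpoints(probs))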
Deep Neural Networks with Voice Entry Estimation Heuristics for Voice Separation in Symbolic Music Representations
2018
Zenodo
In this study we explore the use of deep feedforward neural networks for voice separation in symbolic music representations. ...
Using more layers does not lead to a significant performance improvement. ...
Over the past decade, deep neural networks (DNNs) have been successfully applied to various computer vision, speech recognition, and natural language processing tasks, and, increasingly, to MIR tasks ...
doi:10.5281/zenodo.1492402
fatcat:5fca7poutvezlfiywy6kmoskau
Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
[article]
2019
arXiv
pre-print
In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. ...
Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications including speaker verification. ...
To deal with this problem, several deep neural network (DNN)-based VADs [9, 10, 11, 12] have been proposed and shown to give better results in low SNRs. ...
arXiv:1909.11886v1
fatcat:xe7olwgssrhttdc7dlbdbblu3u
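The self-adaptive soft VAD above integrates a DNN-based VAD into the speaker embedding system by weighting frames with speech posteriors rather than hard-dropping non-speech frames. The sketch below shows such soft, probability-weighted pooling on random features; it illustrates the general idea only, not the paper's exact architecture.

import numpy as np

def soft_vad_pooling(frame_feats, speech_probs):
    # Frames contribute to the utterance embedding in proportion to their
    # speech posterior, instead of being discarded by a hard decision.
    w = speech_probs / (speech_probs.sum() + 1e-8)
    return (w[:, None] * frame_feats).sum(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 64))            # 300 frames of 64-dim features
probs = rng.uniform(size=300)                 # DNN-VAD speech posteriors
print(soft_vad_pooling(feats, probs).shape)   # (64,)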
Multimodal Voice Conversion Under Adverse Environment Using a Deep Convolutional Neural Network
2019
IEEE Access
To solve this problem, we propose a multimodal voice conversion model based on a deep convolutional neural network (MDCNN) built by combining two convolutional neural networks (CNN) and a deep neural network ...
The two CNNs are designed to extract acoustic and visual features, and the DNN is designed to capture the nonlinear mapping relation of source speech and target speech. ...
with voice disorders, disguising speaker identity in communication, dubbing films, translation into different languages, and synthesis of text-to-speech (TTS) where a voice conversion system is used to ...
doi:10.1109/access.2019.2955982
fatcat:ubs54prtffajlp2qrz5xu3hase
Showing results 1 — 15 out of 10,211 results