6,489 Hits in 4.0 sec

Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network [article]

Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu
2019 arXiv   pre-print
With the Duration Informed Attention Network (DurIAN), this paper makes use of musical notes instead of pitch contours for expressive opera singing synthesis.  ...  Synthesis  In synthesising the Peking opera singing voice, the musical notes retrieved from the score are used instead of the note transcription results used in training, to validate whether the proposed system can  ... 
arXiv:1912.12010v1 fatcat:hhchac35yvdzzaaysky3hokx4a

Corpus-Based Unit Selection TTS for Hungarian [chapter]

Márk Fék, Péter Pesti, Géza Németh, Csaba Zainkó, Gábor Olaszy
2006 Lecture Notes in Computer Science  
The earlier unit concatenation TTS system scored 2.63, the formant synthesizer scored 1.24, and natural speech scored 4.86.  ...  The unit selection follows a top-down hierarchical scheme using words and speech sounds as units. A simple prosody model is used, based on the relative position of words within a prosodic phrase.  ...  To mark the sound boundaries in the speech waveform, a Hungarian speech recognizer was used in forced alignment mode [5].  ... 
doi:10.1007/11846406_46 fatcat:oql75xm75nfdrekeijnwzrxkiu
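The forced-alignment step mentioned in the entry above (a recognizer marking sound boundaries against a known transcription) can be sketched as a Viterbi pass over a left-to-right phone sequence. This is a toy illustration, not the cited Hungarian recognizer: the per-frame log-likelihood scores are made up, and a real aligner would use trained acoustic models.

```python
# Toy forced alignment: given per-frame log-likelihoods for each phone of a
# known transcription, find the monotonic frame-to-phone segmentation that
# maximises total score via a left-to-right Viterbi pass.
import math

def force_align(frame_scores):
    """frame_scores[t][p] = log-likelihood of frame t under phone p.
    Returns the best phone index per frame; indices are non-decreasing,
    so each phone occupies one contiguous run of frames."""
    T, P = len(frame_scores), len(frame_scores[0])
    NEG = -math.inf
    best = [[NEG] * P for _ in range(T)]  # best score ending at (t, p)
    back = [[0] * P for _ in range(T)]    # predecessor phone at t - 1
    best[0][0] = frame_scores[0][0]       # must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else NEG
            if stay >= advance:
                best[t][p], back[t][p] = stay, p
            else:
                best[t][p], back[t][p] = advance, p - 1
            best[t][p] += frame_scores[t][p]
    # Backtrace from the last phone at the last frame.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Frames 0-1 match phone 0, frames 2-4 match phone 1.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
print(force_align(scores))  # → [0, 0, 1, 1, 1]
```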

Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation [article]

Alexandre Berard and Olivier Pietquin and Christophe Servan and Laurent Besacier
2016 arXiv   pre-print
This paper proposes a first attempt to build an end-to-end speech-to-text translation system, which does not use source language transcription during learning or decoding.  ...  For instance, in the former DARPA TRANSTAC project (speech translation from spoken Arabic dialects), a large effort was devoted to the collection of speech transcripts (and a prerequisite to obtain transcripts  ...  It is important to note that this is corpus-based concatenative speech synthesis (Schwarz, 2007) and not parametric synthesis.  ... 
arXiv:1612.01744v1 fatcat:deal7hswkrbcjfycvate4ednfy

Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD

Alexandru Moldovan, Adriana Stan, Mircea Giurgiu
2016 2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP)  
In this paper we build upon ALISA, a previously developed tool that aligns speech with imperfect transcripts using only 10 minutes of manually labelled data, in any alphabetic language  ...  To overcome this problem, we propose two methods: one based on utterance concatenation, and one based on voice activity detection (VAD).  ...  The main approaches either restrict the language model to match the available transcription [6]-[10], or use acoustic cues to align the speech and text [11].  ... 
doi:10.1109/iccp.2016.7737141 dblp:conf/iccp2/MoldovanSG16 fatcat:gf3iaakepvep5p4xjrkqoh26ni
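The VAD-based method above relies on detecting speech activity from acoustic cues alone. A minimal energy-based sketch; the frame length and threshold are illustrative choices, not the paper's settings:

```python
# Minimal energy-based voice activity detection (VAD): frames whose
# short-time energy exceeds a threshold relative to the quietest frame
# are marked as speech.
def simple_vad(samples, frame_len=160, threshold_ratio=4.0):
    """Return one boolean per frame: True if the frame likely has speech."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    floor = min(energies) + 1e-10  # crude noise-floor estimate
    return [e / floor > threshold_ratio for e in energies]

# Near-silence, then a louder "speech" burst, then near-silence again.
signal = [0.01] * 160 + [0.5, -0.5] * 80 + [0.01] * 160
print(simple_vad(signal))  # → [False, True, False]
```

Real systems smooth these per-frame decisions (hangover, minimum-duration rules) before cutting utterance boundaries.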

Developing a unit selection voice given audio without corresponding text

Tejas Godambe, Sai Krishna Rallabandi, Suryakanth V. Gangashetty, Ashraf Alkhairy, Afshan Jafri
2016 EURASIP Journal on Audio, Speech, and Music Processing  
But a few problems are associated with readily using this data, such as: (1) these audio files are generally long, and audio-transcription alignment is memory intensive; (2) precise corresponding transcriptions  ...  Earlier works on long audio alignment addressing the first and second issues generally preferred reasonable transcripts and mainly focused on (1) less manual intervention, (2) mispronunciation detection  ...  Even in the case of Olive and lecture, respectively, we used hypotheses of the ASR system instead of force-aligned reference transcriptions from Project Gutenberg and Coursera, because the reference transcriptions  ... 
doi:10.1186/s13636-016-0084-y fatcat:yh35ilqlxvbmfmk2chyjmwa6kq


Zeyu Jin, Gautham J. Mysore, Stephen Diverdi, Jingwan Lu, Adam Finkelstein
2017 ACM Transactions on Graphics  
the target speaker's speech samples to the transcript using a forced alignment algorithm [Sjölander 2003].  ...  They also show "suitability" scores as colors in the transcript to indicate suitable points to place cuts.  ... 
doi:10.1145/3072959.3073702 fatcat:fofvc42v55hhlol4fqpmuyo3oi

A set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese

Fu-Chiang Chou, Chiu-Yu Tseng, Lin-Shan Lee
2002 IEEE Transactions on Speech and Audio Processing  
This paper presents a set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese.  ...  A large speech corpus produced by a single speaker is used, and the speech output is synthesized from waveform units of variable lengths, with desired linguistic properties, retrieved from this corpus.  ...  This is because this is basically an alignment problem and it is reasonable to consider all the given phonetic transcriptions to be correct.  ... 
doi:10.1109/tsa.2002.803437 fatcat:7x6s3g4fvbeprbx4ajtbzyieme

Development and Evaluation of Speech Synthesis Corpora for Latvian

Roberts Dargis, Peteris Paikens, Normunds Gruzitis, Ilze Auzina, Agate Akmane
2020 International Conference on Language Resources and Evaluation  
This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and  ...  Recent advances in neural speech synthesis have enabled the development of such systems with a data-driven approach that does not require significant development of language-specific tools.  ...  Table 2: Mean opinion scores (MOS). Concatenative: 1.9; Parametric: 3.2; Tacotron 2: 3.7.  ... 
dblp:conf/lrec/DargisPGAA20 fatcat:aqx2smzvcjatlcn5p2lfgntkxq
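The mean opinion scores in Table 2 above are averages of 1-5 listener ratings per system. A small sketch with made-up ratings, also reporting a 95% confidence half-width, which MOS tables often quote alongside the mean:

```python
# Mean opinion score (MOS): average of 1-5 listener ratings, plus a
# normal-approximation 95% confidence half-width. Ratings are illustrative.
import math
import statistics

def mos(ratings):
    """Return (mean, 95% confidence half-width) for a list of 1-5 ratings."""
    mean = statistics.mean(ratings)
    half = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return round(mean, 2), round(half, 2)

print(mos([4, 3, 4, 5, 4, 3, 4, 5]))
```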

Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features

Satoshi Tamura, Kento Horio, Hajime Endo, Satoru Hayamizu, Tomoki Toda
2018 Interspeech 2018  
This paper proposes Audio-Visual Voice Conversion (AVVC) methods using Deep BottleNeck Features (DBNF) and Deep Canonical Correlation Analysis (DCCA).  ...  For subjective evaluation, the Mean Opinion Score (MOS) was used.  ...  Evaluation  In this paper, the Mel-Cepstrum Distortion (MCD) score was used for objective evaluation. A smaller MCD score indicates higher quality of the generated speech.  ... 
doi:10.21437/interspeech.2018-2286 dblp:conf/interspeech/TamuraHEHT18 fatcat:qd4nkjqx4vdnpb2bdg7mbobbqm
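The Mel-Cepstrum Distortion used for objective evaluation above has a standard closed form: a decibel-scaled Euclidean distance between aligned mel-cepstral vectors, averaged over frames. A minimal sketch; the frames below are toy values, and real use typically excludes the 0th (energy) coefficient and time-aligns the sequences first:

```python
# Mel-cepstral distortion (MCD) in dB between two aligned sequences of
# mel-cepstral vectors: per frame, (10 / ln 10) * sqrt(2 * sum (a_d - b_d)^2).
import math

def mcd(frames_a, frames_b):
    """Average MCD (dB) over aligned frame pairs of equal dimension."""
    k = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for a, b in zip(frames_a, frames_b):
        total += k * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(frames_a)

# Identical frames give 0 dB distortion.
print(mcd([[1.0, 2.0]], [[1.0, 2.0]]))  # → 0.0
```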

Score-Informed Transcription For Automatic Piano Tutoring

Emmanouil Benetos, Simon Dixon, Anssi Klapuri
2012 Zenodo  
Using the manually-aligned score, F_w = 92.79%, while using the automatically-aligned score, F_w = 89.04%.  ...  Results with the 'strict' and 'relaxed' piano-rolls: the score-informed transcription results using manually-aligned scores reach F_w = 94.92% and using automatically-aligned scores  ... 
doi:10.5281/zenodo.52547 fatcat:rszrhol6gvg3fnzrakd7qkok2i
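The F_w figures above are weighted frame-level F-measures. The plain frame-level F-measure they build on compares active (pitch, frame) pairs between a reference and an estimated transcription; a sketch with toy data (the paper's weighting scheme is not reproduced here):

```python
# Frame-level F-measure for transcription evaluation: precision and recall
# over active (pitch, frame) pairs, combined as their harmonic mean.
def frame_f_measure(reference, estimate):
    """reference, estimate: sets of (pitch, frame) pairs considered active."""
    tp = len(reference & estimate)  # correctly detected pairs
    if tp == 0:
        return 0.0
    precision = tp / len(estimate)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# One missed pair and one spurious pair out of four reference pairs.
ref = {(60, 0), (60, 1), (64, 0), (64, 1)}
est = {(60, 0), (60, 1), (64, 1), (67, 1)}
print(frame_f_measure(ref, est))  # → 0.75
```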


Yen-Min Jasmina Khaw, Tien-Ping Tan
2015 Jurnal Teknologi  
This study proposes a quick approach for aligning and building a good dialectal speech synthesis acoustic model by using a different source acoustic model.  ...  of speech synthesis system.  ...  There are many different approaches to speech synthesis: articulatory synthesis, formant synthesis, concatenative synthesis and hidden Markov model (HMM) synthesis [1], [2], [3], [4].  ... 
doi:10.11113/jt.v77.6514 fatcat:264q4cc6ujbdzchid5nsyxsqwy

Implementation and verification of speech database for unit selection speech synthesis

Krzysztof Szklanny, Sebastian Koszuta
2017 Proceedings of the 2017 Federated Conference on Computer Science and Information Systems  
The main aim of this study was to prepare a new speech database for the purpose of unit selection speech synthesis.  ...  The quality of the synthetic speech was compared to that of synthetic speech obtained in other Polish unit selection speech synthesis systems.  ...  INTRODUCTION  Unit selection speech synthesis remains an effective and popular method of concatenative synthesis, yielding speech which is closest to natural-sounding human speech.  ... 
doi:10.15439/2017f395 dblp:conf/fedcsis/SzklannyK17 fatcat:nubqwqd6rnbnxabwtcutzr3gay
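Unit selection, as described in the entry above, searches a lattice of candidate units from the database for the sequence minimising target cost (fit to the desired unit) plus concatenation cost (smoothness of each join). A toy Viterbi sketch; the costs are illustrative numbers, not a real system's spectral and prosodic features:

```python
# Unit-selection search: one candidate per target position, minimising
# sum of target costs plus pairwise concatenation costs (Viterbi pass).
def select_units(target_costs, concat_cost):
    """target_costs[i][j]: cost of candidate j at position i.
    concat_cost(prev, cur): cost of joining adjacent candidate indices.
    Returns the list of chosen candidate indices, one per position."""
    n = len(target_costs)
    best = list(target_costs[0])                      # best cost ending at (0, j)
    back = [[0] * len(tc) for tc in target_costs]     # predecessor indices
    for i in range(1, n):
        new_best = []
        for j, tc in enumerate(target_costs[i]):
            costs = [best[k] + concat_cost(k, j) for k in range(len(best))]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + tc)
            back[i][j] = k_min
        best = new_best
    # Backtrace the cheapest path.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Equal target costs, but joining mismatched candidates is penalised,
# so the search keeps a consistent pair.
path = select_units([[1.0, 1.0], [1.0, 1.0]],
                    lambda a, b: 0.0 if a == b else 5.0)
print(path)  # → [0, 0]
```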

A hidden Markov-model-based trainable speech synthesizer

R.E. Donovan, P.C. Woodland
1999 Computer Speech and Language  
used in a concatenation synthesizer.  ...  During synthesis, the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer  ...  The dummy model was used in preference to simply removing the closures from the transcriptions, to avoid problems in synthesis caused by alignment errors with neighbouring phones when closures were erroneously  ... 
doi:10.1006/csla.1999.0123 fatcat:pbjk4n3phfettk4fvijatc42zi
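The TD-PSOLA waveform concatenation mentioned above joins segments with windowed overlap-add at pitch-synchronous points. This toy sketch shows only the crossfade ingredient of such a join; window placement, pitch marks, and prosody modification are omitted, and the sample values are illustrative:

```python
# Linear crossfade join between the tail of one sample list and the head
# of the next, over `overlap` samples.
def crossfade_join(a, b, overlap):
    """Join two sample lists with a linear crossfade over `overlap` samples."""
    out = list(a[:len(a) - overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade-in weight for segment b
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

# Two 3-sample segments joined with a 2-sample crossfade
# yield 3 + 3 - 2 = 4 output samples.
joined = crossfade_join([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], overlap=2)
print(len(joined))  # → 4
```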

EdiTTS: Score-based Editing for Controllable Text-to-Speech [article]

Jaesung Tae, Hyeongju Kim, Taesu Kim
2022 arXiv   pre-print
We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis.  ...  Through listening tests and speech-to-text back transcription, we show that EdiTTS outperforms existing baselines and produces robust samples that satisfy user-imposed requirements.  ...  Previous works have used unit selection [12] or context-aware prosody correction [13], but their reliance on traditional algorithms such as concatenation or TD-PSOLA [14] complicates the synthesis  ... 
arXiv:2110.02584v3 fatcat:qwh2l52zwbbnxgrbcsnr764ski

An approach to building language-independent text-to-speech synthesis for Indian languages

Anusha Prakash, M Ramasubba Reddy, T Nagarajan, Hema A Murthy
2014 2014 Twentieth National Conference on Communications (NCC)  
in the quality of synthesised speech.  ...  A popular approach is HMM-based speech synthesis. Given the phone set and question set for a language, HMM-based synthesis systems are built.  ...  During the synthesis phase, the waveform units corresponding to the test sentence are chosen from the database based on concatenation criteria and synthesised.  ... 
doi:10.1109/ncc.2014.6811356 dblp:conf/ncc/PrakashRNM14 fatcat:4alyokhptzbw3bakla3vzue2m4