6,489 Hits in 4.0 sec

Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network [article]

Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu
2019 arXiv   pre-print
With the Duration Informed Attention Network (DurIAN), this paper makes use of musical notes instead of pitch contours for expressive opera singing synthesis.  ...  Synthesis  In synthesising the Peking opera singing voice, the musical notes retrieved from the score are used instead of the note transcription results used in training, to validate whether the proposed system can  ... 
arXiv:1912.12010v1 fatcat:hhchac35yvdzzaaysky3hokx4a

Corpus-Based Unit Selection TTS for Hungarian [chapter]

Márk Fék, Péter Pesti, Géza Németh, Csaba Zainkó, Gábor Olaszy
2006 Lecture Notes in Computer Science  
The earlier unit concatenation TTS system scored 2.63, the formant synthesizer scored 1.24, and natural speech scored 4.86.  ...  The unit selection follows a top-down hierarchical scheme using words and speech sounds as units. A simple prosody model is used, based on the relative position of words within a prosodic phrase.  ...  To mark the sound boundaries in the speech waveform, a Hungarian speech recognizer was used in forced alignment mode [5].  ... 
doi:10.1007/11846406_46 fatcat:oql75xm75nfdrekeijnwzrxkiu
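The forced-alignment step mentioned in the entry above (a recognizer marking sound boundaries against a known transcription) can be sketched as a Viterbi pass over a left-to-right phone sequence. This is a toy illustration, not the cited Hungarian recognizer: the per-frame log-likelihood scores are made up, and a real aligner would use trained acoustic models.

```python
# Toy forced alignment: given per-frame log-likelihoods for each phone of a
# known transcription, find the monotonic frame-to-phone segmentation that
# maximises total score via a left-to-right Viterbi pass.
import math

def force_align(frame_scores):
    """frame_scores[t][p] = log-likelihood of frame t under phone p.
    Returns the best phone index per frame; indices are non-decreasing,
    so each phone occupies one contiguous run of frames."""
    T, P = len(frame_scores), len(frame_scores[0])
    NEG = -math.inf
    best = [[NEG] * P for _ in range(T)]  # best score ending at (t, p)
    back = [[0] * P for _ in range(T)]    # predecessor phone at t - 1
    best[0][0] = frame_scores[0][0]       # must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else NEG
            if stay >= advance:
                best[t][p], back[t][p] = stay, p
            else:
                best[t][p], back[t][p] = advance, p - 1
            best[t][p] += frame_scores[t][p]
    # Backtrace from the last phone at the last frame.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Frames 0-1 match phone 0, frames 2-4 match phone 1.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
print(force_align(scores))  # → [0, 0, 1, 1, 1]
```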

Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation [article]

Alexandre Berard and Olivier Pietquin and Christophe Servan and Laurent Besacier
2016 arXiv   pre-print
This paper proposes a first attempt to build an end-to-end speech-to-text translation system, which does not use source language transcription during learning or decoding.  ...  For instance, in the former DARPA TRANSTAC project (speech translation from spoken Arabic dialects), a large effort was devoted to the collection of speech transcripts (and a prerequisite to obtain transcripts  ...  It is important to note that this is corpus-based concatenative speech synthesis (Schwarz, 2007) and not parametric synthesis.  ... 
arXiv:1612.01744v1 fatcat:deal7hswkrbcjfycvate4ednfy

Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD

Alexandru Moldovan, Adriana Stan, Mircea Giurgiu
2016 2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP)  
In this paper we build upon ALISA, a previously developed tool that aligns speech with imperfect transcripts using only 10 minutes of manually labelled data, in any alphabetic language  ...  To overcome this problem, we propose two methods: one based on utterance concatenation, and one based on voice activity detection (VAD).  ...  The main approaches either restrict the language model to match the available transcription [6]-[10], or use acoustic cues to align the speech and text [11].  ... 
doi:10.1109/iccp.2016.7737141 dblp:conf/iccp2/MoldovanSG16 fatcat:gf3iaakepvep5p4xjrkqoh26ni
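The VAD-based method above relies on detecting speech activity from acoustic cues alone. A minimal energy-based sketch; the frame length and threshold are illustrative choices, not the paper's settings:

```python
# Minimal energy-based voice activity detection (VAD): frames whose
# short-time energy exceeds a threshold relative to the quietest frame
# are marked as speech.
def simple_vad(samples, frame_len=160, threshold_ratio=4.0):
    """Return one boolean per frame: True if the frame likely has speech."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    floor = min(energies) + 1e-10  # crude noise-floor estimate
    return [e / floor > threshold_ratio for e in energies]

# Near-silence, then a louder "speech" burst, then near-silence again.
signal = [0.01] * 160 + [0.5, -0.5] * 80 + [0.01] * 160
print(simple_vad(signal))  # → [False, True, False]
```

Real systems smooth these per-frame decisions (hangover, minimum-duration rules) before cutting utterance boundaries.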

Developing a unit selection voice given audio without corresponding text

Tejas Godambe, Sai Krishna Rallabandi, Suryakanth V. Gangashetty, Ashraf Alkhairy, Afshan Jafri
2016 EURASIP Journal on Audio, Speech, and Music Processing  
But a few problems are associated with readily using this data, such as: (1) these audio files are generally long, and audio-transcription alignment is memory intensive; (2) precise corresponding transcriptions  ...  Earlier works on long audio alignment addressing the first and second issues generally preferred reasonable transcripts and mainly focused on (1) less manual intervention, (2) mispronunciation detection  ...  Even in the case of Olive and lecture, respectively, we used hypotheses of the ASR system instead of force-aligned reference transcriptions from Project Gutenberg and Coursera, because the reference transcriptions  ... 
doi:10.1186/s13636-016-0084-y fatcat:yh35ilqlxvbmfmk2chyjmwa6kq


Zeyu Jin, Gautham J. Mysore, Stephen Diverdi, Jingwan Lu, Adam Finkelstein
2017 ACM Transactions on Graphics  
the target speaker's speech samples to the transcript using a forced alignment algorithm [Sjölander 2003].  ...  They also show "suitability" scores as colors in the transcript to indicate suitable points to place cuts.  ... 
doi:10.1145/3072959.3073702 fatcat:fofvc42v55hhlol4fqpmuyo3oi

A set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese

Fu-Chiang Chou, Chiu-Yu Tseng, Lin-Shan Lee
2002 IEEE Transactions on Speech and Audio Processing  
This paper presents a set of corpus-based text-to-speech synthesis technologies for Mandarin Chinese.  ...  A large speech corpus produced by a single speaker is used, and the speech output is synthesized from waveform units of variable lengths, with desired linguistic properties, retrieved from this corpus.  ...  This is because this is basically an alignment problem and it is reasonable to consider all the given phonetic transcriptions to be correct.  ... 
doi:10.1109/tsa.2002.803437 fatcat:7x6s3g4fvbeprbx4ajtbzyieme

Development and Evaluation of Speech Synthesis Corpora for Latvian

Roberts Dargis, Peteris Paikens, Normunds Gruzitis, Ilze Auzina, Agate Akmane
2020 International Conference on Language Resources and Evaluation  
This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and  ...  Recent advances in neural speech synthesis have enabled the development of such systems with a data-driven approach that does not require significant development of language-specific tools.  ...  Table 2: Mean opinion scores (MOS). Concatenative: 1.9; Parametric: 3.2; Tacotron 2: 3.7.  ... 
dblp:conf/lrec/DargisPGAA20 fatcat:aqx2smzvcjatlcn5p2lfgntkxq
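The mean opinion scores in Table 2 above are averages of 1-5 listener ratings per system. A small sketch with made-up ratings, also reporting a 95% confidence half-width, which MOS tables often quote alongside the mean:

```python
# Mean opinion score (MOS): average of 1-5 listener ratings, plus a
# normal-approximation 95% confidence half-width. Ratings are illustrative.
import math
import statistics

def mos(ratings):
    """Return (mean, 95% confidence half-width) for a list of 1-5 ratings."""
    mean = statistics.mean(ratings)
    half = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return round(mean, 2), round(half, 2)

print(mos([4, 3, 4, 5, 4, 3, 4, 5]))
```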

Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features

Satoshi Tamura, Kento Horio, Hajime Endo, Satoru Hayamizu, Tomoki Toda
2018 Interspeech 2018  
This paper proposes Audio-Visual Voice Conversion (AVVC) methods using Deep BottleNeck Features (DBNF) and Deep Canonical Correlation Analysis (DCCA).  ...  For subjective evaluation, the Mean Opinion Score (MOS) was used.  ...  Evaluation  In this paper, the Mel-Cepstrum Distortion (MCD) score was used for objective evaluation. A smaller MCD score indicates higher quality of the generated speech.  ... 
doi:10.21437/interspeech.2018-2286 dblp:conf/interspeech/TamuraHEHT18 fatcat:qd4nkjqx4vdnpb2bdg7mbobbqm
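The Mel-Cepstrum Distortion used for objective evaluation above has a standard closed form: a decibel-scaled Euclidean distance between aligned mel-cepstral vectors, averaged over frames. A minimal sketch; the frames below are toy values, and real use typically excludes the 0th (energy) coefficient and time-aligns the sequences first:

```python
# Mel-cepstral distortion (MCD) in dB between two aligned sequences of
# mel-cepstral vectors: per frame, (10 / ln 10) * sqrt(2 * sum (a_d - b_d)^2).
import math

def mcd(frames_a, frames_b):
    """Average MCD (dB) over aligned frame pairs of equal dimension."""
    k = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for a, b in zip(frames_a, frames_b):
        total += k * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total / len(frames_a)

# Identical frames give 0 dB distortion.
print(mcd([[1.0, 2.0]], [[1.0, 2.0]]))  # → 0.0
```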

Score-Informed Transcription For Automatic Piano Tutoring

Emmanouil Benetos, Simon Dixon, Anssi Klapuri
2012 Zenodo  
Using the manually-aligned score, F_w = 92.79%, while using the automatically-aligned score, F_w = 89.04%.  ...  Results with the 'strict' and 'relaxed' piano-rolls: the score-informed transcription results using manually-aligned scores reach F_w = 94.92% and using automatically-aligned scores  ... 
doi:10.5281/zenodo.52547 fatcat:rszrhol6gvg3fnzrakd7qkok2i
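The F_w figures above are weighted frame-level F-measures. The plain frame-level F-measure they build on compares active (pitch, frame) pairs between a reference and an estimated transcription; a sketch with toy data (the paper's weighting scheme is not reproduced here):

```python
# Frame-level F-measure for transcription evaluation: precision and recall
# over active (pitch, frame) pairs, combined as their harmonic mean.
def frame_f_measure(reference, estimate):
    """reference, estimate: sets of (pitch, frame) pairs considered active."""
    tp = len(reference & estimate)  # correctly detected pairs
    if tp == 0:
        return 0.0
    precision = tp / len(estimate)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# One missed pair and one spurious pair out of four reference pairs.
ref = {(60, 0), (60, 1), (64, 0), (64, 1)}
est = {(60, 0), (60, 1), (64, 1), (67, 1)}
print(frame_f_measure(ref, est))  # → 0.75
```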


Yen-Min Jasmina Khaw, Tien-Ping Tan
2015 Jurnal Teknologi  
This study proposes a quick approach for aligning and building a good dialectal speech synthesis acoustic model by using a different source acoustic model.  ...  of speech synthesis system.  ...  There are many different approaches to speech synthesis: articulatory synthesis, formant synthesis, concatenative synthesis and hidden Markov model (HMM) synthesis [1], [2], [3], [4].  ... 
doi:10.11113/jt.v77.6514 fatcat:264q4cc6ujbdzchid5nsyxsqwy

Implementation and verification of speech database for unit selection speech synthesis

Krzysztof Szklanny, Sebastian Koszuta
2017 Proceedings of the 2017 Federated Conference on Computer Science and Information Systems  
The main aim of this study was to prepare a new speech database for the purpose of unit selection speech synthesis.  ...  The quality of the synthetic speech was compared to that of synthetic speech obtained in other Polish unit selection speech synthesis systems.  ...  INTRODUCTION  Unit selection speech synthesis remains an effective and popular method of concatenative synthesis, yielding speech which is closest to natural-sounding human speech.  ... 
doi:10.15439/2017f395 dblp:conf/fedcsis/SzklannyK17 fatcat:nubqwqd6rnbnxabwtcutzr3gay
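Unit selection, as described in the entry above, searches a lattice of candidate units from the database for the sequence minimising target cost (fit to the desired unit) plus concatenation cost (smoothness of each join). A toy Viterbi sketch; the costs are illustrative numbers, not a real system's spectral and prosodic features:

```python
# Unit-selection search: one candidate per target position, minimising
# sum of target costs plus pairwise concatenation costs (Viterbi pass).
def select_units(target_costs, concat_cost):
    """target_costs[i][j]: cost of candidate j at position i.
    concat_cost(prev, cur): cost of joining adjacent candidate indices.
    Returns the list of chosen candidate indices, one per position."""
    n = len(target_costs)
    best = list(target_costs[0])                      # best cost ending at (0, j)
    back = [[0] * len(tc) for tc in target_costs]     # predecessor indices
    for i in range(1, n):
        new_best = []
        for j, tc in enumerate(target_costs[i]):
            costs = [best[k] + concat_cost(k, j) for k in range(len(best))]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + tc)
            back[i][j] = k_min
        best = new_best
    # Backtrace the cheapest path.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Equal target costs, but joining mismatched candidates is penalised,
# so the search keeps a consistent pair.
path = select_units([[1.0, 1.0], [1.0, 1.0]],
                    lambda a, b: 0.0 if a == b else 5.0)
print(path)  # → [0, 0]
```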

A hidden Markov-model-based trainable speech synthesizer

R.E. Donovan, P.C. Woodland
1999 Computer Speech and Language  
used in a concatenation synthesizer.  ...  During synthesis, the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer  ...  The dummy model was used in preference to simply removing the closures from the transcriptions, to avoid problems in synthesis caused by alignment errors with neighbouring phones when closures were erroneously  ... 
doi:10.1006/csla.1999.0123 fatcat:pbjk4n3phfettk4fvijatc42zi
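The TD-PSOLA waveform concatenation mentioned above joins segments with windowed overlap-add at pitch-synchronous points. This toy sketch shows only the crossfade ingredient of such a join; window placement, pitch marks, and prosody modification are omitted, and the sample values are illustrative:

```python
# Linear crossfade join between the tail of one sample list and the head
# of the next, over `overlap` samples.
def crossfade_join(a, b, overlap):
    """Join two sample lists with a linear crossfade over `overlap` samples."""
    out = list(a[:len(a) - overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade-in weight for segment b
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

# Two 3-sample segments joined with a 2-sample crossfade
# yield 3 + 3 - 2 = 4 output samples.
joined = crossfade_join([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], overlap=2)
print(len(joined))  # → 4
```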

EdiTTS: Score-based Editing for Controllable Text-to-Speech [article]

Jaesung Tae, Hyeongju Kim, Taesu Kim
2022 arXiv   pre-print
We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis.  ...  Through listening tests and speech-to-text back transcription, we show that EdiTTS outperforms existing baselines and produces robust samples that satisfy user-imposed requirements.  ...  Previous works have used unit selection [12] or context-aware prosody correction [13], but their reliance on traditional algorithms such as concatenation or TD-PSOLA [14] complicates the synthesis  ... 
arXiv:2110.02584v3 fatcat:qwh2l52zwbbnxgrbcsnr764ski

An approach to building language-independent text-to-speech synthesis for Indian languages

Anusha Prakash, M Ramasubba Reddy, T Nagarajan, Hema A Murthy
2014 2014 Twentieth National Conference on Communications (NCC)  
in the quality of synthesised speech.  ...  A popular approach is HMM-based speech synthesis. Given the phone set and question set for a language, HMM-based synthesis systems are built.  ...  During the synthesis phase, the waveform units corresponding to the test sentence are chosen from the database based on concatenation criteria and synthesised.  ... 
doi:10.1109/ncc.2014.6811356 dblp:conf/ncc/PrakashRNM14 fatcat:4alyokhptzbw3bakla3vzue2m4