Lightly supervised GMM VAD to use audiobook for speech synthesiser

Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A.J. Clark, Simon King, Adriana Stan
2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audiobooks have attracted attention as promising data for training text-to-speech (TTS) systems. However, they usually lack a correspondence between the audio and the text, and are typically divided only into chapter-level units. In practice, the audio and text must be aligned before they can be used to build TTS synthesisers, but aligning them is time-consuming manual work that also requires expertise in speech processing.
In earlier work, we proposed using graphemes to align speech and text data automatically. This paper further integrates a lightly supervised voice activity detection (VAD) technique that detects sentence boundaries as a pre-processing step before the grapheme-based approach. The lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining the two, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Through subjective evaluations, we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks.

Index Terms — voice activity detection, lightly supervised, audiobook, HMM-based speech synthesis
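To make the "lightly supervised" idea concrete, the sketch below shows one plausible reading of the approach: fit separate Gaussian mixture models to speech and silence frames taken from a small labelled bootstrap (the paper uses time stamps for the first fifty sentences only), then classify the remaining audio frame by frame. The feature (frame log-energy), function names, and all parameter values here are our own illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of lightly supervised GMM-based VAD (assumed details,
# not the paper's exact features or model sizes).
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_log_energy(signal, frame_len=400, hop=160):
    """Log energy per frame; a simple stand-in for real acoustic features."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]

def train_vad(speech_feats, silence_feats, n_mix=2):
    """Fit one GMM per class on the small labelled bootstrap."""
    g_speech = GaussianMixture(n_mix, random_state=0).fit(speech_feats)
    g_silence = GaussianMixture(n_mix, random_state=0).fit(silence_feats)
    return g_speech, g_silence

def classify(feats, g_speech, g_silence):
    """True where the speech GMM scores a frame higher than the silence GMM."""
    return g_speech.score_samples(feats) > g_silence.score_samples(feats)

# Toy demo on synthetic audio: loud "speech" followed by quiet "silence".
rng = np.random.default_rng(0)
speech = rng.normal(0.0, 1.0, 16000)
silence = rng.normal(0.0, 0.01, 16000)
g_speech, g_silence = train_vad(frame_log_energy(speech),
                                frame_log_energy(silence))
decisions = classify(frame_log_energy(np.concatenate([speech, silence])),
                     g_speech, g_silence)
```

Long runs of silence-labelled frames in `decisions` would then be taken as candidate sentence boundaries, which is the role the VAD plays ahead of the grapheme-based aligner in the paper.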
doi:10.1109/icassp.2013.6639220 dblp:conf/icassp/MamiyaYWCKS13