Incremental TTS for Japanese Language

Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura
2018 Interspeech 2018  
Simultaneous lecture translation requires speech to be translated in real time, before the speaker has finished an entire sentence, since a long delay makes it difficult for listeners to follow the lecture. The challenge is to construct a full-fledged system with speech recognition, machine translation, and text-to-speech synthesis (TTS) components that can produce high-quality speech translations on the fly. For TTS specifically, this poses problems because a conventional framework commonly requires the language-dependent contextual linguistics of a full sentence to produce a natural-sounding speech waveform. Several studies have proposed approaches to incremental TTS (ITTS), which estimates the target prosody from only partial knowledge of the sentence. However, most investigations have been done only in French, English, and German. French is a syllable-timed language and the others are stress-timed languages. The Japanese language, which is a mora-timed language, has not been investigated so far. In this paper, we evaluate the quality of Japanese synthesized speech based on various linguistic and temporal incremental units. Experimental results reveal that an accent phrase incremental unit (a group of moras) is essential for a Japanese ITTS as a trade-off between quality and synthesis units.
Index Terms: Incremental speech synthesis, linguistic and temporal locality features, HMM-based speech synthesis
doi:10.21437/interspeech.2018-1561 dblp:conf/interspeech/YanagitaS018 fatcat:w3umgk77sncqdchrdjveok6hui