Voice source model for continuous control of pitch period

Paul H. Milenkovic
1993 Journal of the Acoustical Society of America  
The voiced speech waveform may be synthesized by exciting an LPC vocal tract filter with a pulse waveform patterned after naturally occurring glottal airflow pulses. Such a pulse waveform may be generated by computing samples of a piecewise polynomial curve at equally spaced time intervals. In this type of synthesis, the pitch period is commonly restricted to an integer multiple of the sample interval. A method is presented for removing this restriction, permitting both pulse duration and pitch period to be varied over continuous time. Aliasing distortion is prevented by computing the sample values of pulses that have been low-pass filtered in continuous time prior to sampling. Applications of this technique include modeling glottal pulses by least-squares fit to inverse filter waveforms, the synthesis of calibration waveforms for evaluating measures of speech waveform jitter, the perceptual evaluation of low levels of waveform jitter, and the synthesis of the singing voice.

PACS numbers: 43.72.Ar, 43.72.Ja

INTRODUCTION

Voiced speech may be synthesized by applying excitation pulses to a digital filter according to the source-filter model of speech production. The natural glottal pulse contributes zeros to the speech spectrum (Mathews et al., 1961) that are understood to have perceptual relevance. Patterning the shape of the excitation pulses after the glottal pulses observed by inverse filtering was reported by Rosenberg (1971) and by Holmes (1973) to improve the naturalness of synthetic speech. We have reported (Milenkovic, 1986) on the use of pulses described by piecewise polynomial curves to model the voice source. The voice source signal in this instance refers to the first derivative of glottal airflow as obtained by inverse filtering the speech waveform measured with a pressure microphone; the model pulses were patterned after the shape of the first derivative of the glottal airflow pulse. The parameters controlling the pulse shape were adjusted to fit the model to the inverse filtered speech waveform in a least-squares sense.

Fully natural sounding speech synthesis may need to account for the effects of source-tract interaction. The vocal tract driving point impedance has a transform zero at dc related to the inertia of the air column, and this impedance zero accounts for the skewing of the glottal airflow pulse relative to a more symmetric glottal opening area pulse (Rothenberg, 1983).
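
As a rough illustration of the pulse sampling idea summarized in the abstract, the following sketch samples a skewed polynomial pulse train whose pitch period is not an integer number of samples. The cubic pulse shape, the function names `pulse_shape` and `pulse_train`, the `open_quotient` parameter, and the numeric values are assumptions made for this example; they are not the pulse model of the paper. The sketch also omits the continuous-time low-pass filtering that the paper applies before sampling, so its output is not protected against aliasing.

```python
import numpy as np

def pulse_shape(x):
    """Hypothetical skewed cubic pulse on 0 <= x <= 1 (peak value 1 at x = 2/3), zero elsewhere."""
    x = np.asarray(x, dtype=float)
    inside = (x >= 0.0) & (x <= 1.0)
    return np.where(inside, 6.75 * x**2 * (1.0 - x), 0.0)

def pulse_train(n_samples, fs, period, open_quotient=0.6):
    """Sample a periodic pulse train.

    `period` is the pitch period in seconds and need not be a multiple of 1/fs.
    `open_quotient` (an assumed parameter) is the fraction of the period occupied by the pulse.
    No anti-aliasing filter is applied here.
    """
    t = np.arange(n_samples) / fs
    phase = np.mod(t, period)  # time elapsed since the start of the current pitch period
    return pulse_shape(phase / (open_quotient * period))

fs = 10000.0       # sample rate in Hz (assumed)
period = 0.00803   # 80.3 sample intervals, deliberately not an integer multiple
x = pulse_train(2000, fs, period)
```

Because the period is 80.3 sample intervals, the sampling instants fall at a different position within each successive pulse, which is the situation an integer-period synthesizer cannot represent.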
The vocal tract impedance also has peaks at the formant frequencies; these peaks result in formant ripple being added to the glottal airflow pulse (Ananthapadmanabha and Fant, 1982). Recent examples of source interactive synthesizers are reported by Sondhi and Schroeter (1988) and by Pinto et al. (1989). The Sondhi-Schroeter synthesizer is based on an articulatory model that produces a vocal tract area function from which both the vocal tract transfer function and the driving point impedance may be computed. The synthesizer described by Pinto et al. is based on a formant model where the formant frequencies are used to specify an equivalent circuit model of the driving point impedance.

Source interactive speech synthesis is restricted in its application by the need for articulatory information. The formant frequencies do not specify a unique vocal tract area function (Atal et al., 1978), and different area functions with the same formant frequencies produce similar transfer functions but widely different driving point impedances (Milenkovic, 1984). While the impedance peak frequencies are coincident with the formant frequencies, the impedance zero frequencies are not set by the formant frequencies, and the location of these zeros has a profound effect on the amplitudes of the impedance peaks affecting the formant ripple.

The effect of source-tract interaction on the acoustic waveform can, to a degree, be reproduced by a source-filter synthesizer. The skewing of the glottal airflow pulse is easily incorporated into a mathematical function describing a pulse shape. Formant ripple, however, is related to the dissipation of the formant oscillations into losses in the glottal constriction and subglottal acoustic system (Flanagan, 1972). Klatt and Klatt (1990) suggest using a pitch synchronous adjustment of bandwidths in a formant synthesizer to approximate this effect.
On the other hand, the LPC multipulse synthesis method (Atal and Remde, 1982) is able to match the acoustic speech waveform in a least-squares sense using an LPC vocal tract filter that is fixed over one or more pitch period cycles. A model combining an LPC derived filter with a small number of source parameters may have sufficient degrees of freedom to approximate acoustic waveforms influenced by source-tract interaction. Efforts to improve the quality of source-filter synthesized speech remain a worthwhile endeavor. In light of work in source-interactive synthesizers, the source-filter model is
doi:10.1121/1.405557 pmid:8445119 fatcat:ynmx3bemzngfrpsdzgorq56diq