Perceptual enhancement of low rate speech coders [thesis]

Dipanjan Sen
Models of the human auditory system have been used with great success in many areas of speech communication, including speech recognition systems and broadband audio coders. In low rate speech coding systems, where it is desired to reduce the transmission rate of speech signals to below 64 kbps, there is only scant use of such auditory models. Instead, most modern speech coders are based on an underlying model of the human speech production mechanism. While speech coding algorithms such as FS-1016, IMBE and LPC-10 have been used to transmit speech at rates as low as 2.4 kbps, the ensuing speech quality is far below the transparent quality achieved by broadband audio coders or the toll quality required by the ITU to recommend their use in public networks. Also, research into possible improvements of speech coders has always been hampered by the absence of a reliable objective measure of speech quality. This work is motivated by the hypothesis that the degradation in speech quality in low rate speech coders is in part due to the fact that current speech coders do not properly take into account the properties of the final receiver: the human hearing mechanism. It is therefore postulated that the use of an explicit auditory model to match speech coders based on voice production models to the human ear will result in a significant enhancement of speech quality. An auditory model is first evaluated in terms of its potential to increase the coding gain and/or enhance the speech quality of low rate speech coders. It is confirmed that there is significant potential improvement to be made by the use of auditory models. A speech coding algorithm termed Perceptually Enhanced Random Codebook Excited Linear Prediction (PERCELP) operating at 4.8 kbps is then developed, where the stochastic component of the excitation signal in the voice production model is optimised using auditory masking analysis. A method of searching for the excitation parameters in the perceptual domain, by optimising a perceptual criterion called Noise Above Masking (NAM), is developed. Informal listening tests indicate that the ensuing synthetic speech quality is much smoother and displays none of the "buzziness" that was present before the incorporation of the auditory analysis. It is therefore concluded that the use of the explicit auditory model in PERCELP has resulted in significant speech quality improvement.
Furthermore, the same perceptual criterion used to optimise the coding algorithm (NAM) may also be used to provide an objective measure of speech quality. Preliminary results indicate that the measure is able to evaluate coder performance better than the conventional Segmental Signal-to-Noise Ratio measure.
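The two kinds of measure contrasted above can be illustrated with a minimal Python sketch. The first function computes the conventional segmental SNR as the mean of clamped per-frame SNRs. The frame length of 160 samples and the clamping range of -10 dB to 35 dB are commonly seen choices but are assumptions here, not values taken from the thesis. The second function, `noise_above_threshold_db`, is a hypothetical toy that only conveys the general idea of counting coding noise that exceeds a masking threshold. It assumes per-band noise levels and masking thresholds are already available from some auditory model, and it should not be read as the thesis's definition of NAM.

```python
import math


def segmental_snr_db(reference, decoded, frame_len=160,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [floor_db, ceil_db].

    frame_len, floor_db and ceil_db are assumed defaults, not thesis values.
    """
    if len(reference) != len(decoded):
        raise ValueError("signals must have equal length")
    frame_snrs = []
    # Only full frames are evaluated; a trailing partial frame is ignored.
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, dec))
        if signal_energy == 0.0:
            continue  # skip silent frames (an assumption; conventions vary)
        if noise_energy == 0.0:
            snr = ceil_db  # perfect reconstruction: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no non-silent full frames to evaluate")
    return sum(frame_snrs) / len(frame_snrs)


def noise_above_threshold_db(noise_db, threshold_db):
    """Hypothetical toy measure: total dB by which noise exceeds the
    masking threshold, summed over bands. Noise below the threshold is
    treated as inaudible and contributes nothing."""
    if len(noise_db) != len(threshold_db):
        raise ValueError("band lists must have equal length")
    return sum(max(0.0, n - t) for n, t in zip(noise_db, threshold_db))


if __name__ == "__main__":
    # 200 Hz tone sampled at 8 kHz; the "decoded" signal is a scaled copy,
    # so the error is 10% of the reference and each frame SNR is about 20 dB.
    ref = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(320)]
    dec = [0.9 * x for x in ref]
    print(segmental_snr_db(ref, dec))

    # Three bands: noise exceeds the threshold by 5 dB in bands 1 and 3 only.
    print(noise_above_threshold_db([30.0, 20.0, 10.0], [25.0, 25.0, 5.0]))
```

The point of the contrast is that segmental SNR penalises all waveform error equally, while a masking-based measure ignores error that falls below the threshold in a given band.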
doi:10.26190/unsworks/10695 fatcat:ild64xmkirgz3fexkk6hwpr3ru