Estimating nonnegative matrix model activations with deep neural networks to increase perceptual speech quality

Donald S. Williamson, Yuxuan Wang, DeLiang Wang
2015 Journal of the Acoustical Society of America  
As a means of speech separation, time-frequency masking applies a gain function to the timefrequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates
more » ... eech from noise with reasonable sound quality. A deep neural network then approximates clean speech by estimating activation weights from the ratio-masked speech, where the weights linearly combine elements from a NMF speech model. Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods. In addition, a listening test was performed and its results show that the output of the proposed algorithm is preferred over the comparison systems in terms of speech quality.
doi:10.1121/1.4928612 pmid:26428778 pmcid:PMC5392055 fatcat:2ydwgowujfbh7a4f5jaziqtmva